Circonus Analytics Query Language (CAQL) - Tutorial

 

Circonus has been offering a variety of powerful analytics tools as graph overlays for a while. With CAQL, we go one step further and give our users the flexibility to build customized analytics tailored to their needs. CAQL allows the user to formulate queries for complex transformations of collected data that can be used for graphing (and alerting, coming soon). Moreover, it exposes previously hidden parameters to allow more fine-tuning than ever before.


Behind the scenes, CAQL expressions are compiled into a tree of processing units. Each unit maintains its own state and processes a stream of data in a push-based computation model.
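To make the push-based model more concrete, here is a minimal Python sketch of a chain of processing units. It is only an illustration of the idea, not how CAQL is implemented: the class names (Collector, Scale, RunningTotal) are hypothetical and do not correspond to CAQL functions.

```python
class Collector:
    """End of the chain: stores every value that is pushed into it."""

    def __init__(self):
        self.values = []

    def push(self, value):
        self.values.append(value)


class Scale:
    """Stateless unit: multiplies each value and pushes it downstream."""

    def __init__(self, factor, downstream):
        self.factor = factor
        self.downstream = downstream

    def push(self, value):
        self.downstream.push(value * self.factor)


class RunningTotal:
    """Stateful unit: remembers the sum of everything it has seen so far."""

    def __init__(self, downstream):
        self.total = 0
        self.downstream = downstream

    def push(self, value):
        self.total += value
        self.downstream.push(self.total)


# Roughly analogous to "source | scale | running total" in a piped query.
sink = Collector()
pipeline = Scale(2, RunningTotal(sink))
for sample in [1, 2, 3]:
    pipeline.push(sample)  # each new sample is pushed through the whole chain

print(sink.values)  # [2, 6, 12]
```

Each incoming sample travels through every unit as soon as it arrives, and a unit such as RunningTotal only needs its own small piece of state to produce the next output.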


This video provides a helpful introduction to using CAQL that you can follow along with. You can also get started with the written examples provided below.




Getting Started


To create your first CAQL statement, create a new graph and select "CAQL Statement" from the "Add Datapoint" menu. Expand the legend bar to see the CAQL input field. Clicking in the input field will bring up a tooltip that shows some of the most used CAQL functions.



The first thing we have to do is add a metric data source. This is accomplished using the "metric" statement:


    metric:<kind>(<checkuuid>, <metricname>)


This requires you to input the metric kind (e.g. "average"), the uuid of the check, and the metric name. You can find this information on the check details page. An easier way to enter this information is using the tooltip:

  1. In the metric f(x) row, click the kind of metric you want to add.
  2. Double click on <checkuuid> to get a search box for the available checks.
  3. Type a key word, press enter, and select a corresponding check. The corresponding check uuid will be filled in.
  4. Double click on <metricname> to get a dropdown menu with the available metrics for this check.



Congratulations, you have just entered your first complete CAQL statement!


When the focus leaves the input field, the values for the selected metric should show up in the graph.

If something went wrong, the UI will show an error message hinting at the source of the problem.


Now, let’s look at some common scenarios and how we can use the features of CAQL in those situations.


Example 1: Peak latency over the last hour


Let's say we are measuring the round trip time (rtt) of a DNS server, and we are interested in the peak rtt we've seen in the last hour. You can compute this value using CAQL as follows:

       
metric:average(<checkuuid>, "rtt") | rolling:max(1h)


The screenshot below shows a graph with the original values and the computed maximum. Click the screenshot to view an interactive example:



Let's break that down:


metric:average(<checkuuid>, "rtt") -- This creates a stream of values of DNS rtts


| -- The pipe character takes the output of metric and passes it as input to further processing units.


rolling:max(1h) -- This function creates a processing unit that computes the maximum value over a rolling window of 1h. You can supply arbitrary durations (e.g. `2h 3M`) as arguments to this function.


The rolling:max(<duration>) function has a cousin, window:max(<duration>), that computes the maximum over the last fully completed window. A full list of all available functions in CAQL can be found in the user manual.
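The difference between a rolling window and the last fully completed window is easiest to see on a small series. The following Python sketch is an approximation under stated assumptions: windows are expressed as a number of samples instead of a duration, and the windowed value is assumed to appear at the sample that completes the window. CAQL's exact handling of partial windows may differ.

```python
def rolling_max(values, size):
    """Maximum over the most recent `size` samples, updated at every sample."""
    out = []
    for i in range(len(values)):
        start = max(0, i - size + 1)
        out.append(max(values[start:i + 1]))
    return out


def window_max(values, size):
    """Maximum of the last fully completed, non-overlapping window.

    Emits None until the first window of `size` samples is complete
    (an assumption made for this sketch).
    """
    out = []
    last = None
    for i in range(len(values)):
        if (i + 1) % size == 0:  # this sample completes a window
            last = max(values[i + 1 - size:i + 1])
        out.append(last)
    return out


rtt = [3, 9, 4, 1, 2, 7]
print(rolling_max(rtt, 3))  # [3, 9, 9, 9, 4, 7]
print(window_max(rtt, 3))   # [None, None, 9, 9, 9, 7]
```

The rolling variant changes with every new sample, while the windowed variant only changes when a whole window has been completed.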


Example 2: Working with Histograms


With CAQL, it is possible to display properties of histograms that were not available in the UI before.

Use this syntax:


      metric:histogram(<checkuuid>, <metricname>) 


This creates a stream of histogram values. (The histogram itself will not show in the UI.) The screenshot below shows a histogram of request latencies with the following transforms applied to it:


  • histogram:mean() -- the mean value of the histogram (shown below in green)
  • histogram:IQR() -- the interquartile range (shown below in red)
  • histogram:rate() -- the number of samples per second (shown below in blue)


The last of these, the sample rate, has additionally been smoothed by applying a rolling mean.
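For readers who want to see what these three properties mean numerically, here is a small Python sketch that computes them for a single histogram. It assumes the histogram is a plain mapping from bin value to sample count and uses a simple nearest-rank quantile. The helper names are hypothetical, and CAQL may bin and interpolate differently.

```python
def hist_mean(hist):
    """Count-weighted mean of the bin values."""
    total = sum(hist.values())
    return sum(value * count for value, count in hist.items()) / total


def hist_quantile(hist, q):
    """Smallest bin value whose cumulative count reaches fraction q."""
    total = sum(hist.values())
    running = 0
    for value in sorted(hist):
        running += hist[value]
        if running >= q * total:
            return value
    return max(hist)


def hist_iqr(hist):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return hist_quantile(hist, 0.75) - hist_quantile(hist, 0.25)


def hist_rate(hist, period_seconds):
    """Number of samples per second over the collection period."""
    return sum(hist.values()) / period_seconds


# Request latencies in ms -> number of requests, collected over 60 seconds.
latencies = {10: 2, 20: 4, 40: 2}
print(hist_mean(latencies))      # 22.5
print(hist_iqr(latencies))       # 10
print(hist_rate(latencies, 60))  # 8 samples / 60 s
```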




Example 3: Finding anomalies in low-frequency data

 

Performing anomaly detection on low-frequency data is challenging. The graph below shows the web request rate of a web server, which is mostly idle.


The normal state of this metric is a flat 0, so any activity would be an anomaly for this metric.



 

The solution is to use windowed smoothing to identify a sensible “normal” state. The graph below shows the same data as above in green, with a 12h rolling mean added in blue and the anomaly score of the smoothed data in red. The result is a more practical interpretation of where our anomalies are to be found.

 


 

This graph was produced with this query:

 

metric:counter('586d9436-27fa-4567-a6b9-cecaf49228e7','mobile`FRONTEND`req_tot')

| rolling:mean(12h)

| anomaly_detection(60)

 

Let’s break that down:

 

rolling:mean(12h) -- This is a processing unit that takes in data over the preceding 12 hours and calculates the average.


anomaly_detection(60) -- This processing unit detects anomalies with the specified sensitivity. The sensitivity value can be set between 0 and 100.


Example 4: Monitoring uptime

 

In this example, let’s say we are monitoring uptime over the past day in order to verify an availability SLA. This will depend on our ability to account for missing data points, so we can use the following CAQL statement:

 

metric:<> | is_missing() | rolling:mean(1d)

 

Filling in the appropriate check uuid and metric name, this statement produces the following graph:



Let’s break down this formula. We’re already familiar with the source metric:<> and the rolling:mean.

 

is_missing() -- This processing unit returns 1 if there is a missing datapoint in the preceding data, and returns a 0 otherwise.

 

So this CAQL statement computes the fraction of data points that were missing over the past day. In this example, we simply add "100 - (" to the front of the CAQL statement and ") * 100" to the back, and we have our uptime report: a graph of availability, in percent, over the past day.
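The arithmetic behind the uptime report can be sketched in a few lines of Python. This is a simplified model, not the CAQL implementation: missing data points are represented as None, each sample is checked on its own, and the rolling window is given as a number of samples (4 here, standing in for one day).

```python
def is_missing(values):
    """1 where a sample is missing (None), 0 otherwise."""
    return [1 if v is None else 0 for v in values]


def rolling_mean(values, size):
    """Mean over the most recent `size` samples, updated at every sample."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - size + 1):i + 1]
        out.append(sum(window) / len(window))
    return out


samples = [12, None, 15, 11, None, None, 14, 13]

# Fraction of missing samples in the rolling window.
missing_fraction = rolling_mean(is_missing(samples), 4)

# Same arithmetic as wrapping the statement in "100 - (" ... ") * 100".
uptime_percent = [100 - (x) * 100 for x in missing_fraction]

print(uptime_percent[3])   # 75.0: one of the first four samples is missing
print(uptime_percent[-1])  # 50.0: two of the last four samples are missing
```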

 

Example 5: Monitoring the ratio between two data-points with a delay

 

Monitoring the ratio between two data-points when one is delayed is accomplished with a relatively simple query of the form:

 

metric:<A> / (metric:<B> | delay(1d))

Let’s plug in some metrics and look at an example graph:





Let’s break this down.

 

delay -- This processing unit delays the stream by the specified number of samples.

 

The delay function supports duration literals, which are translated internally to the number of samples within that duration. In addition to d for days, you can use h, M, and s to specify hours, minutes, and seconds. For example, “1d” = 1 day, which is translated to the corresponding number of samples (with one sample per minute, “1d” is equivalent to 1440 samples).

 

Warning: Regarding the duration syntax, the units have to directly follow the number, without spaces. For example, “5 M” is not supported, but “5M” is supported.
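The following Python sketch illustrates how a duration literal could be turned into a number of samples and used to delay a stream before taking a ratio. It rests on several assumptions: one sample per minute, M denoting minutes (as in the `2h 3M` example earlier), and None standing in for values that are not yet available. The function names are hypothetical and this is not the CAQL parser.

```python
import re

# Seconds per unit; "M" for minutes is an assumption of this sketch.
UNIT_SECONDS = {"s": 1, "M": 60, "h": 3600, "d": 86400}


def duration_to_samples(literal, period_seconds=60):
    """Convert a literal such as '1d' or '2h 3M' to a number of samples."""
    total = 0
    for part in literal.split():
        match = re.fullmatch(r"(\d+)([sMhd])", part)
        if match is None:  # e.g. the "5" and "M" left over from "5 M"
            raise ValueError(f"unsupported duration literal: {part!r}")
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
    return total // period_seconds


def delay(values, samples):
    """Shift a stream back by `samples`; earlier positions become None."""
    padded = [None] * samples + list(values)
    return padded[:len(values)]


def ratio(a, b):
    """Element-wise a / b, or None where either side is unavailable."""
    return [None if x is None or y is None else x / y for x, y in zip(a, b)]


print(duration_to_samples("1d"))     # 1440
print(duration_to_samples("2h 3M"))  # 123

a = [10, 20, 30, 40]
b = [5, 10, 15, 20]
print(ratio(a, delay(b, 2)))  # [None, None, 6.0, 4.0]
```

Passing "5 M" to duration_to_samples raises a ValueError in this sketch, which mirrors the rule that the unit must directly follow the number.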


Package CAQL - Functions

 

A table listing CAQL functions is maintained in the Circonus user manual.