(Formerly titled "Exposing the Sum of NNT Lag")

Currently, the Circonus Data Storage system does not expose the number of records by which it is behind its peers. However, even with only the average peer lag exposed, you can use CAQL to compute the expected catch-up time of a Data Storage (Snowth) node.

Here is an example, in which we will expose the sum of NNT Lag:


In the graph above, the black line is the average peer lag. The green line is the output of the CAQL statement, which computes the expected time, in hours, until the peer lag reduces to 0.

Note that the metric took around 4 hours to catch up (16:00 - 20:00), and the green line shows a value of 4-5 on the right axis at the beginning of the catch-up phase.

The formula used is a discrete variant of Newton's method for finding roots (see http://www.wikiwand.com/en/Newton's_method). This is the variant formula:

dt = d / (y[t-d] / y[t] - 1)

where:
y[t] - is the value of the metric at time t
t - is the current time
d - is the look-back interval between the two samples being compared
dt - is the time until the 0 value is reached: y[t + dt]  ~ 0
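
One way to see where this formula comes from (a sketch, assuming the lag decreases roughly linearly over the look-back interval d) is to draw a straight line through the two samples y[t-d] and y[t] and solve for the point where that line crosses 0:

```latex
\begin{align*}
y_{t+dt} &\approx y_t + dt \cdot \frac{y_t - y_{t-d}}{d} = 0 \\
dt &= \frac{d \, y_t}{y_{t-d} - y_t} \\
dt &= \frac{d}{\dfrac{y_{t-d}}{y_t} - 1}
\end{align*}
```

The estimate is only meaningful while the metric is decreasing (y[t-d] > y[t]); otherwise the denominator is zero or negative.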

In the example, we take d to be 10M (10 minutes) to keep the estimate from being too sensitive to occasional hiccups. This parameter can be tuned: a larger d gives a smoother but slower-reacting estimate.
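
To make the arithmetic concrete, here is a minimal Python sketch of the formula outside of CAQL. The function name time_to_zero and the sample values are hypothetical and chosen only for illustration; this is not the CAQL statement used for the graph.

```python
def time_to_zero(y_now, y_prev, d):
    """Estimate the time until a metric reaches 0 by linear extrapolation.

    y_now  -- metric value at the current time t
    y_prev -- metric value at time t - d
    d      -- look-back interval; the result is in the same unit as d
    Returns None when the metric is not decreasing toward 0.
    """
    if y_now == 0:
        return 0.0  # already caught up
    ratio = y_prev / y_now
    if ratio <= 1:
        return None  # lag is flat or growing: no finite estimate
    return d / (ratio - 1)


# Hypothetical peer-lag samples taken 10 minutes apart (d expressed in hours).
d_hours = 10 / 60
eta = time_to_zero(y_now=960.0, y_prev=1000.0, d=d_hours)
print(f"estimated catch-up time: {eta:.1f} h")
```

With these made-up samples the lag drops by 40 every 10 minutes, so the remaining 960 takes about 4 hours to clear. Returning None for a flat or growing lag is a design choice of this sketch; it avoids dividing by zero and reporting negative times.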

This is not at all obvious, so we are working on simplifying this process and offering more robust means to estimate and alert on velocities of metrics. In the future, you should be able to use a CAQL statement like this to get the number of seconds until the metric reaches 0:
metric:<> | forecast:time_to_zero()