Tanzu Observability alerts enable users to monitor their environment around the clock without needing to be at a computer. Tanzu Observability understands that customer reliance on accurate alerting is crucial, so it stands to reason that an end user might have follow-up questions when an alert behaves differently from their expectation. In many cases, reviewing the data in your alerts and charts after the fact may even appear to support that expectation.
This scenario has led many a customer to contact Tanzu Observability support to ask, "Why did my alert fire (or not fire)?" In this article, we'll highlight what typically leads to a difference between expectation and reality with Tanzu Observability alerting, and what you can investigate to gain clarity.
As a quick note, please keep in mind that each alerting use case is unique and therefore the following explanations and steps to take during investigation may not entirely answer your question. If you have additional questions after reviewing this article, then please contact Support for assistance!
What causes a difference in alerting expectation vs reality?
In the vast majority of cases, a difference between expectation and reality can be attributed to two key points:
- A misunderstanding in how Tanzu Observability evaluates alerts
- Underlying data shape is not addressed properly in the query construct
A misunderstanding in how Tanzu Observability evaluates alerts
The alert evaluation process in Tanzu Observability is the same for all customers. You can review written documentation for Alert States and Lifecycles, or watch our Alerting Fundamentals video for an overview. Some commonly misunderstood concepts include:
- Alert checking frequency: Alerts are checked approximately once per minute
- Alert time window being reviewed: The default is 5 minutes, but it can be (and often is) adjusted via the Alert Fire and Alert Resolve properties
- Minutely summarized values are being evaluated: If your conditional query returns more than 1 value per minute, then Tanzu Observability will perform minutely aggregation (avg) before evaluating results.
- Required number of "true" values needed to trigger an alert: A "true" value is any non-zero value returned by your alert query, i.e. any value that meets the specified condition; a "false" value is any zero value. Within the reviewed time window, an alert triggers if there is at least 1 "true" value and 0 "false" values. An absence of a value is considered neither "true" nor "false", so a true value is not required at each reporting interval for an alert to trigger.
- Alerts evaluate on real-time data: Data associated with a triggered alert may look different after the fact than it did when the alert was evaluated in real time. This can typically be attributed to delayed reporting of data or to the query construct. Often, reviewing the alert query in a Live 10-minute chart can shed light on this behavior.
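The evaluation rules above can be sketched in a few lines of Python. This is an illustrative model only, not the actual Tanzu Observability implementation; it assumes the conditional query has already produced per-point results (non-zero for "true", zero for "false", and None where no value was reported):

```python
# Illustrative model of the alert check described above -- not the actual
# Tanzu Observability implementation.

def summarize_minutely(points_per_minute):
    """Average each minute's raw points (the minutely avg summarization),
    skipping minutes with no reported data."""
    summarized = []
    for points in points_per_minute:
        reported = [p for p in points if p is not None]
        summarized.append(sum(reported) / len(reported) if reported else None)
    return summarized

def alert_fires(points_per_minute):
    """Fire if the window contains at least one "true" (non-zero) minutely
    value and no "false" (zero) values; missing minutes are ignored."""
    values = [v for v in summarize_minutely(points_per_minute) if v is not None]
    return any(v != 0 for v in values) and all(v != 0 for v in values)

print(alert_fires([[1], [None], [1]]))   # True: gaps are neither true nor false
print(alert_fires([[1], [0], [1]]))      # False: one "false" value blocks firing
```

Note how a window of all-missing minutes never fires, and a single zero anywhere in the window prevents firing even when "true" values are present.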
Underlying data shape is not addressed properly in the query construct
Having a strong understanding of how Tanzu Observability evaluates alerts can go a long way in making sure your expectation matches reality. Equally as important though is understanding the data shape associated with your alert query, and properly addressing that shape through the query construct.
In this context, data shape simply refers to the reporting behavior of the associated alert data. Behavior such as how often data is reported, the intervals associated with reporting, and a lag in real-time data can all contribute to a difference in expectation vs reality.
For example, imagine you want to trigger an alert when the total number of errors reported across 10 VMs exceeds expectation. While this is a common use case that customers like to monitor, the associated data shape can impact how you should construct your conditional alert query. For this example in particular, any of the following data shape behaviors can cause a difference in expectation vs reality:
- Reporting behavior when no errors occur: Some customers will send a value of 0 if no errors occurred at a reporting interval, while others may simply omit a reported value. The latter may require the default() missing data function in order to correctly handle the omitted value.
- Reporting intervals associated with VMs: Even though each VM likely has the same reporting frequency, the reporting intervals may be staggered. Since an aggregation function is likely needed to calculate the "total number of errors reported across 10 VMs", this staggered reporting could introduce interpolated values depending on the selected aggregation function. This means you would need to consider whether a raw or non-raw aggregation function should be utilized.
- Lag in real-time data: If the associated error data is reported to Tanzu Observability with a 5-minute lag, then you'd need to consider that when setting the Alert Firing time window or constructing your query. If the alert is set to evaluate a 3-minute time window of real-time data, then there would be no reported values to evaluate during the check. Looking at the data 20 minutes after the fact may give the impression that the error threshold was exceeded, but that wouldn't have been the case when evaluating in real time. You'd want to increase the Alert Firing time window or utilize the lag() function.
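To make the lag scenario concrete, here is a small Python sketch. The timestamps, the 5-minute lag, and the `points_visible` helper are all invented for illustration (this is not a Tanzu Observability API); it shows how a 3-minute window can be completely empty while a wider window still sees data:

```python
# Hypothetical illustration: data arrives `lag` minutes after its timestamp.

def points_visible(reported, now, window, lag):
    """Return the values an alert check at `now` can see: points whose
    timestamp falls inside the last `window` minutes AND whose data has
    already arrived (timestamp + lag <= now)."""
    return [v for t, v in reported
            if now - window < t <= now and t + lag <= now]

errors = [(t, 1) for t in range(0, 60)]   # one error reported every minute

# A 3-minute window sees nothing: everything in it is still in transit.
print(points_visible(errors, now=30, window=3, lag=5))   # []
# An 8-minute window still sees the three oldest values.
print(points_visible(errors, now=30, window=8, lag=5))   # [1, 1, 1]
```

This is why a lagging feed paired with a short Alert Firing window can leave a check with literally nothing to evaluate.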
Considering how alerts are evaluated in Tanzu Observability and the underlying data shape associated with your use case can go a long way in aligning expectation and reality.
What should I investigate when determining why my alert did or did not fire?
In Tanzu Observability, our customers typically review reported data via alert backtesting when determining why an alert did or did not fire. If expectation does not match reality, it usually means the data results observed during backtesting differ from what the alert evaluated at the time. For example, the total number of errors exceeded your specified limit in the backtesting view, but the alert did not trigger during that window of time.
While backtesting provides significant value to end users, it's important to note that it shouldn't be taken as 100% truth for what the data looked like at the time of evaluation. Whether you're reviewing data after the fact via backtesting or directly on a chart, if there is a difference between expectation and reality, then one of the following scenarios is likely the source of that difference:
- Lag in reported data: If a post-review of data shows data present but the associated alert did not trigger, then review the alert query on a live data chart. If a lag in reported data is present, then the alert may have an incomplete picture of data to evaluate. This could mean an alert doesn't trigger when it should, simply because the evaluated time window was too small to account for the reporting lag and contained no data values to evaluate.
Additionally, a reporting lag can reduce the set of data values being evaluated. For example, if an alert is evaluating a 5-minute time window but there is a 4-minute lag in reporting, then the alert is only evaluating 1 value during each check. Even if a "true" value is going to eventually be followed by 4 "false" values, the alert can still trigger if the "false" values haven't been reported into the system yet because alerts trigger when there is at least 1 "true" value and no "false" values.
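The shortened-window effect can be sketched the same way. Under the stated assumptions (a 5-minute window, a 4-minute reporting lag, and the one-true-no-false trigger rule; the helper functions and minute numbers are invented for illustration), a single early "true" value fires the alert before its trailing "false" values arrive:

```python
# Hypothetical sketch: condition results keyed by minute, arriving 4
# minutes late, evaluated over a 5-minute window.

def visible_window(results, now, window=5, lag=4):
    """Condition results (minute -> 0/1) that fall inside the window and
    have already arrived by `now`."""
    return [v for t, v in results.items()
            if now - window < t <= now and t + lag <= now]

def fires(values):
    """At least one "true" value and no "false" values."""
    return len(values) > 0 and all(v != 0 for v in values)

# Minute 10 is "true"; minutes 11-14 will be "false" once they arrive.
results = {10: 1, 11: 0, 12: 0, 13: 0, 14: 0}

# Checked at minute 14: only minute 10's value has arrived, so it fires.
print(fires(visible_window(results, now=14)))   # True
# Checked at minute 15: the first "false" (minute 11) is now visible.
print(fires(visible_window(results, now=15)))   # False
```

The lag effectively shrinks the 5-minute window down to a single value, so the alert triggers on evidence that a complete window would have contradicted.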
- Utilizing functions that can introduce interpolation: if(), arithmetic operators like + or -, and non-raw aggregation functions are some common examples of query functions that can introduce interpolation into your results. In Tanzu Observability, interpolation is the process of generating a synthetic data value for one or more time series at timestamps where none was reported, and it can only occur between two truly reported values within a series.
When using non-raw aggregation functions in your query, the process of interpolation can increase a displayed value in the past by including more made-up values in the calculation once a newly reported value comes into the system.
For example, imagine you are using sum() to aggregate 3 time series together. Each time series reports a value every 5 minutes, but the reporting interval is staggered. In this case, app-1 reports on the :00 and :05 minute boundaries, app-2 reports on the :01 and :06 minute boundaries, and app-3 reports on the :02 and :07 minute boundaries. If we were reviewing this data in real-time at 12:02p, then the aggregated value at 12:00p would represent the sum of 3 values. This occurs because Tanzu Observability could generate interpolated values at 12:00p for app-2 and app-3. However, the value displayed at 12:02p would only represent the sum of 1 value. This is because Tanzu Observability can't generate interpolated values for app-1 or app-2 at that boundary because the next reported values have not come in yet for either.
In this scenario, your most recent aggregated values at the time of alert evaluation will typically be less than the value you'd expect to see if all 3 time series were accounted for. These temporarily lower values can fall below a specified limit or condition and cause an alert not to fire, while reviewing the data after the fact may show them exceeding that limit due to interpolation. Using missing data functions or raw aggregation functions can help in these cases.
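As a rough model of this behavior, the toy linear interpolation below reproduces the staggered sum() scenario. Everything here is invented for illustration (the series names, the minute numbers relative to 12:00, and the constant value 10); the one rule carried over from the text is that interpolation is only possible between two truly reported values, which is why the newest time boundary under-counts:

```python
# Toy model: each series is a dict of {minute: value}, holding only the
# reports that have arrived so far.

def value_at(series, t):
    """Reported value at t, a linearly interpolated value if t falls
    between two reports, or None if no later report exists yet."""
    if t in series:
        return series[t]
    times = sorted(series)
    before = [x for x in times if x < t]
    after = [x for x in times if x > t]
    if not before or not after:
        return None           # cannot interpolate past the latest report
    t0, t1 = before[-1], after[0]
    frac = (t - t0) / (t1 - t0)
    return series[t0] + frac * (series[t1] - series[t0])

def interpolated_sum(all_series, t):
    """Non-raw-style sum: include real and interpolated values."""
    vals = [value_at(s, t) for s in all_series]
    return sum(v for v in vals if v is not None)

# Minutes relative to 12:00, as seen at 12:02; every report has value 10.
app1 = {-5: 10, 0: 10}    # reports on :00 / :05 boundaries
app2 = {-4: 10, 1: 10}    # reports on :01 / :06 boundaries
app3 = {-3: 10, 2: 10}    # reports on :02 / :07 boundaries

# At 12:00, app2 and app3 can both be interpolated (each has a later
# report), so the sum reflects all three series.
print(interpolated_sum([app1, app2, app3], 0))   # 30
# At 12:02, only app3 has a real value; app1 and app2 have no later
# report yet, so nothing can be interpolated for them.
print(interpolated_sum([app1, app2, app3], 2))   # 10
```

This is the temporary dip: once app1's 12:05 and app2's 12:06 reports arrive, the 12:02 sum retroactively rises back toward 30.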
- Alert evaluation on minutely summarized values when data is reported more often: Is the associated alert query returning data values more often than once per minute? If so, then you may want to consider your use case and how you'd like the minutely summarized value to be returned.
Imagine you have an alert query that reports a latency value every 15 seconds, and you want to return a "true" value if that latency value exceeds 200ms for 3 minutes. Within a single minute, you may have reported values of 150ms (f), 190ms (f), 210ms (t), and 125ms (f). If a condition of > 200 is used, then these values would return as 0, 0, 1, and 0.
In this case, you may expect this alert not to trigger because false values were present. However, Tanzu Observability evaluates minutely summarized (avg) values. Here, the average of those four values would be (0 + 0 + 1 + 0) / 4 = 0.25. Since 0.25 is a non-zero value returned by the alert query, it would be evaluated as "true".
If your alert query is reporting values more often than once per minute, then you may want to consider applying the align() function to the entire alert query:
Ex. align(1m, min, ts("requests.latency") > 200)
In the example above, we're explicitly stating that we'd like the minutely summarized value to return the lowest (min) value within that 1-minute bucket. Since zero is the lowest returned value in our example above, this means that the minutely summarized value being evaluated would be zero and would be considered "false" for alerting purposes.
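The arithmetic behind both outcomes can be checked directly. The sample latencies below are the hypothetical values from the example above:

```python
# Four 15-second latency samples (ms) within a single minute, and the
# "> 200" condition from the example.
samples = [150, 190, 210, 125]
condition = [1 if s > 200 else 0 for s in samples]   # [0, 0, 1, 0]

avg_summary = sum(condition) / len(condition)   # default minutely avg
min_summary = min(condition)                    # align(1m, min, ...)

print(avg_summary)   # 0.25 -> non-zero, so evaluated as "true"
print(min_summary)   # 0    -> zero, so evaluated as "false"
```

Swapping the summarization from avg to min is exactly what flips this minute from "true" to "false" for alerting purposes.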
Summary & Additional Resources
In conclusion, alerts are a great way to monitor your environment around the clock. However, understanding the Tanzu Observability alert evaluation process along with your data shape and query construct can go a long way in making sure your expectation matches reality. A lag in reporting data, introduction of interpolation, and handling of sub-minute data are just a few of the scenarios to review when trying to understand why your alert did or did not fire.
If the Tanzu Observability documentation and/or the information in this article did not answer your questions, then please reach out to Support for help. Additionally, the following docs may be of assistance: