How to Account for Known Downtimes or Events in Uptime Queries

This article applies to:

Alerting

Product edition: All

Feature Category: Alerts, Maintenance Windows

 

Overview:

 

It is common to track service uptime. However, some downtime is known and expected, such as maintenance or testing windows. What if you want to exclude these known downtimes from uptime calculations? This article covers one way to approach this.

 

 

Accounting for Known Downtime:

At a high level, the approach here is to utilize Maintenance Windows and exclude periods of time when the Maintenance Window is active in the uptime calculation.

 

Use Maintenance Windows

Maintenance Windows are a handy feature in Wavefront that lets users designate known time windows for maintenance or any other scheduled work so that alerts do not fire unnecessarily. Maintenance Windows can be set before a known event or retroactively. Therefore, if you didn't set up a Maintenance Window for a scheduled period of maintenance or testing, you can still create one for a time period that has already passed. Once you have created a Maintenance Window for each time period that should not be counted in the uptime calculation, we're ready to move to the next step.

 

Exclude Active Maintenance Windows in Queries 

The Wavefront Query Language includes functions for querying for events and for converting events to time series data. In this approach, we will be making use of the events() and ongoing() functions.

 

1. Query for the applicable Maintenance Window(s)

We can query for our Maintenance Window(s) using the events() function, filtering by Maintenance Window name. For example, if the Maintenance Window is named "OS Upgrade", the query would look like:

events(name="OS Upgrade")

 

2. Determine time periods when the Maintenance Windows are inactive

When calculating uptime, we only care about time periods when there aren't active Maintenance Windows. Therefore, we will make use of the ongoing() function. This function will return a 1 whenever the underlying Maintenance Window is active and 0 otherwise. So, if we want to determine when the Maintenance Window is inactive, we can simply check for when the resulting value is 0:

ongoing(events(name="OS Upgrade")) = 0
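To make the logic concrete, here is a minimal Python sketch that mimics what the query above is described as doing. It is not Wavefront code. The (start, end) representation of a window, the helper names, and the choice to treat the end time as exclusive are all assumptions made for illustration.

```python
# Hypothetical model: each Maintenance Window is a (start, end) pair in epoch seconds.
# Treating `end` as exclusive is an assumption, not documented Wavefront behavior.

def ongoing(windows, t):
    """Return 1 if any window covers second t, else 0."""
    return 1 if any(start <= t < end for start, end in windows) else 0

def window_inactive(windows, t):
    """Mimic `ongoing(...) = 0`: 1 when no window is active, 0 otherwise."""
    return 1 if ongoing(windows, t) == 0 else 0

windows = [(100, 160)]  # one made-up window lasting 60 seconds
print([window_inactive(windows, t) for t in (99, 100, 159, 160)])  # [1, 0, 0, 1]
```

The comparison simply inverts the indicator: 1 now means "no Maintenance Window is active", which is the condition we want to keep in the uptime calculation.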

 

3. (Optional) Match granularity of uptime calculation

The ongoing() function returns a continuous time series, meaning it returns a data point every second. In order to use this data in uptime calculations, we will need to match the granularity of those calculations.

For example, if you are calculating uptime in terms of minutes, then we'll need the data telling us when Maintenance Windows are inactive to also be in terms of minutes. We'll use the align() function to accomplish this:

align(1m, min, ongoing(events(name="OS Upgrade")) = 0)

In this example, we assume that the uptime calculation is in terms of minutes. We are also assuming that if the Maintenance Window is active during any portion of a minute, we want to exclude that entire minute from uptime calculations. This is why we specified a summarization strategy of minimum. If the Maintenance Window is active at any second within a minute, the ongoing query returns a 1 for that second, so comparing it with 0 yields 0. By summarizing by minimum, the align() function then returns 0 for that entire minute.
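The effect of summarizing by minimum can be sketched in plain Python. This is only a simulation of the behavior described above, with made-up data and a hypothetical helper name (align_min); it is not how Wavefront implements align().

```python
def align_min(per_second, bucket=60):
    """Group per-second values into fixed-size buckets and keep each bucket's minimum."""
    return [min(per_second[i:i + bucket]) for i in range(0, len(per_second), bucket)]

# Three minutes of per-second "window inactive" values (1 = inactive, 0 = active).
# The made-up window is active only for seconds 90..99, inside the second minute.
inactive = [0 if 90 <= s < 100 else 1 for s in range(180)]

per_minute = align_min(inactive)
print(per_minute)  # [1, 0, 1]
```

Even though the window is active for only 10 seconds, the whole second minute is marked 0. Summarizing by maximum instead would keep that minute as 1, which is the opposite of what we want here.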

 

4. Calculating uptime

This particular step will vary depending on how you are calculating uptime. For our example, we have a set of canary data that reports at 1-minute intervals whenever our service is up. In this step, we will just demonstrate how we account for periods of active Maintenance Windows.

Maintenance Window inactive = align(1m, min, ongoing(events(name="OS Upgrade")) = 0)
Service available = ts(service.available)

Service actually available = align(1m, ${Maintenance Window inactive} AND ${Service available})

By using a boolean AND, we only account for time periods when the Maintenance Window inactive query returns a 1. The resulting data specifies a 1 when there are no Maintenance Windows and the service is up. Utilizing this data, we can then calculate uptime by comparing the number of minutes the service was truly available with the time period of interest.
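As an illustration, the combination can be modeled in Python on a few made-up per-minute values. The helper name is hypothetical, and treating a missing canary report as 0 is a simplifying assumption of this sketch rather than a statement about how the query language handles gaps.

```python
def logical_and(a, b):
    """Return 1 when both inputs are nonzero, else 0. A missing value (None) counts as 0 here."""
    return 1 if (a or 0) != 0 and (b or 0) != 0 else 0

# Made-up per-minute values for five minutes.
window_inactive = [1, 1, 0, 0, 1]           # 0 = Maintenance Window active
service_available = [1, None, 1, None, 1]   # None = canary did not report

actually_available = [
    logical_and(w, s) for w, s in zip(window_inactive, service_available)
]
print(actually_available)  # [1, 0, 0, 0, 1]
```

Only the first and last minutes count as available: in the second minute the canary is missing, and in the third and fourth minutes a Maintenance Window is active.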

For example, to calculate uptime percentage over the last 24 hours, we would have something like this:

(msum(24h, ${Service actually available}) / (24 * 60)) * 100

msum() is used to determine how many minutes over a 24-hour time window the service was truly available. There are 24 * 60 minutes over a 24-hour time window. Finally, we multiply by 100 to get a percentage rather than a decimal.
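The arithmetic can be checked with a short Python sketch on made-up numbers. The first figure follows the formula above. The second is a variant that is not part of the query above: it assumes you also want to remove the maintenance minutes from the denominator, so that planned maintenance does not lower the percentage.

```python
def uptime_percent(actually_available, total_minutes):
    """Minutes with value 1, as a percentage of total_minutes."""
    return sum(actually_available) / total_minutes * 100

MINUTES_PER_DAY = 24 * 60  # 1440

# Made-up day: 30 minutes of maintenance and 12 minutes of unplanned downtime.
maintenance_minutes = 30
unplanned_down_minutes = 12
zero_minutes = maintenance_minutes + unplanned_down_minutes
available = [1] * (MINUTES_PER_DAY - zero_minutes) + [0] * zero_minutes

# Formula as written above: divide by the full 24-hour window.
print(round(uptime_percent(available, MINUTES_PER_DAY), 2))  # 97.08

# Variant (assumption): divide only by minutes with no active Maintenance Window.
print(round(uptime_percent(available, MINUTES_PER_DAY - maintenance_minutes), 2))  # 99.15
```

Which denominator is appropriate depends on how your uptime target is defined, so treat the variant as one possible reading rather than the required calculation.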

 

 
