Charts and dashboards are vital components that help users visualize metrics, histograms, counters, and traces in Tanzu Observability. One of the key strengths of Tanzu Observability is its sub-second latency when querying an ingested metric. However, if users stack multiple advanced functions on top of raw metrics and place several such charts on one dashboard, the dashboard can become slow.
As a general note, if a query does not complete within 5 minutes, the backend cancels it. This ensures that a long-running query does not impact overall cluster performance. The list below covers the factors that impact query and dashboard performance, with a quick summary of how to improve each one.
Factors that impact query/dashboard performance:
- Missing data functions utilized without appropriate time parameters
Missing data functions such as default(), last(), and next() handle gaps and delays in metrics. However, many users do not fully use their time parameters. With default(), for example, users often write default(0, <tsExpression>), although the function also accepts optional time parameters: default([<timeWindow>,] [<delayTime>,] <defaultValue>, <tsExpression>).
When the time window and delay time are not specified, Tanzu Observability is forced to apply the default value for every second and for gaps of up to 28 days. This impacts the performance of the query and of the overall dashboard. Additionally, a user may apply aggregation functions on top of missing data functions. Aggregation occurs at each point in time where one or more data values are present. Because missing data functions fill in gaps with data values, an aggregation query on top of them has many more points to process and needs more resources to display results. As a best practice, always specify a time window that matches the expected gap or delay, and a delay time, for missing data functions.
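To put rough numbers on the difference, the following Python sketch estimates how many one-second default values are synthesized for a single gap. It is a simplified model, not the backend's actual algorithm: the function name filled_points, the one-week gap, and the 10-minute window with a 2-minute delay are all illustrative assumptions.

```python
SECONDS_PER_DAY = 86_400
MAX_FILL_SECONDS = 28 * SECONDS_PER_DAY  # upper bound on gap filling described above


def filled_points(gap_seconds, time_window=None, delay=0):
    """Estimate how many 1-second default values are synthesized for one gap.

    Simplified model (an assumption): filling starts `delay` seconds into the
    gap and stops at the end of the gap, after `time_window` seconds, or after
    28 days, whichever comes first.
    """
    if time_window is None:
        limit = MAX_FILL_SECONDS
    else:
        limit = min(time_window, MAX_FILL_SECONDS)
    fill_until = min(gap_seconds, limit)
    return max(0, fill_until - delay)


one_week_gap = 7 * SECONDS_PER_DAY

# Roughly models default(0, <tsExpression>): no window, no delay.
unbounded = filled_points(one_week_gap)

# Roughly models default(10m, 2m, 0, <tsExpression>).
bounded = filled_points(one_week_gap, time_window=600, delay=120)

print(unbounded, bounded)
```

Under this model, a series that stops reporting for a week gets hundreds of thousands of synthetic points without a time window and only a few hundred with one, and that difference is multiplied by every series the query returns.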
- Be specific in queries and avoid wildcards where possible
Dashboards and charts communicate more effectively, and perform better, when they show a specific set of data. As a best practice, users should make their queries as specific as they can. Filtering metrics by source name makes queries return faster because it tells Tanzu Observability exactly where the metrics need to be fetched from.
Specific queries perform better because Tanzu Observability has all the information it needs to fetch the underlying metrics, for example ts(user.metric, source="db-1" and env="prod"). On the other hand, a query such as ts(user.metric and not env="dev") is more expensive because of the 'and not': Tanzu Observability now has to search through everything matching 'user.metric' to find the series that do not have an env="dev" tag. Additionally, filtering in the base query is better than applying advanced filtering functions on top of a larger result set. For example, sum(ts(user.metric, source="app-1")) is better than retainSeries(sum(ts(user.metric)), source="app-1").
Users should question the use of wildcards in dashboard queries, especially when there are thousands of time series. Displaying a high number of time series on a chart may not be the most effective way to communicate values to an end user, and it makes the queries more expensive as well. The Tanzu Observability team recommends avoiding certain wildcard usage patterns. It is not advisable to use a wildcard at the beginning of a query, such as ts("*abc.xyz"). It is also preferable to place delimiters around wildcards: ts("abc.*.xyz") is preferred over ts("abc*xyz").
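Apart from any backend cost, the two wildcard forms above do not select the same names. The Python sketch below uses the standard library's glob-style matcher as a stand-in; treating the Tanzu Observability wildcard as equivalent to a glob "*" is an assumption, and the metric names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical metric names, used only for illustration.
metric_names = [
    "abc.def.xyz",
    "abc.xyz",
    "abcdef.wxyz",
    "abc.def.ghi.xyz",
]


def matching(pattern):
    """Return the metric names selected by a glob-style pattern."""
    return [name for name in metric_names if fnmatchcase(name, pattern)]


delimited = matching("abc.*.xyz")  # wildcard bounded by '.' delimiters
undelimited = matching("abc*xyz")  # wildcard may span partial name segments

print(delimited)
print(undelimited)
```

In this sketch the undelimited pattern also picks up "abc.xyz" and "abcdef.wxyz", so a chart built on it can pull in series that the author never intended to display.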
- When aggregating or working with different time series, aligning your data matters
When aggregation functions are used, it matters whether the underlying time series report at the same times. As an example, suppose 2 time series are being aggregated: one reports 5 seconds past the minute and the other reports 20 seconds past the minute. Tanzu Observability interpolates between the 2 nearest reported values of each time series to determine what its value would have been at the times when only the other series reported. When aggregation is applied to thousands of time series, this interpolation makes queries expensive and dashboards may take a long time to load.
Users can reduce the impact of interpolation by first aligning the metrics to exactly the same times and then applying aggregation functions on top of the aligned time series. To do so, first wrap the expression in the align() function, as in align(1m, <tsExpression>), and then apply the aggregation function on top of it, as in sum(align(1m, <tsExpression>)).
The other option to avoid interpolation is to use raw aggregation functions such as rawsum() instead of sum() or rawavg() instead of avg(). Raw aggregation functions do not interpolate values.
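The three behaviors described above can be compared with a small Python model. This is a sketch under stated assumptions, not the backend implementation: the helper names are hypothetical, interpolation is assumed to be linear, edge points with no neighbor on one side are simply left out of the sum, and alignment is assumed to average the points in each one-minute bucket.

```python
def interpolate(series, t):
    """Linearly interpolate a {timestamp: value} series at time t.

    Returns None when t lies outside the series' reported range.
    """
    if t in series:
        return series[t]
    earlier = [ts for ts in series if ts < t]
    later = [ts for ts in series if ts > t]
    if not earlier or not later:
        return None
    t0, t1 = max(earlier), min(later)
    v0, v1 = series[t0], series[t1]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def interpolated_sum(all_series):
    """sum()-like: at every reported timestamp, interpolate series that did not report."""
    timestamps = sorted({t for s in all_series for t in s})
    result = {}
    for t in timestamps:
        values = [interpolate(s, t) for s in all_series]
        result[t] = sum(v for v in values if v is not None)
    return result


def raw_sum(all_series):
    """rawsum()-like: add only the values actually reported at each timestamp."""
    result = {}
    for s in all_series:
        for t, v in s.items():
            result[t] = result.get(t, 0) + v
    return result


def align_to_minute(series):
    """align(1m, ...)-like: average the points that fall into each one-minute bucket."""
    buckets = {}
    for t, v in series.items():
        buckets.setdefault(t - t % 60, []).append(v)
    return {start: sum(vs) / len(vs) for start, vs in buckets.items()}


# Series a reports at :05 and series b at :20 (seconds from an arbitrary start).
a = {5: 10.0, 65: 20.0, 125: 30.0}
b = {20: 1.0, 80: 2.0, 140: 3.0}

print(interpolated_sum([a, b]))                          # six timestamps, interpolation needed
print(raw_sum([a, b]))                                   # six timestamps, no interpolation
print(raw_sum([align_to_minute(a), align_to_minute(b)])) # three shared timestamps
```

With only two series the extra work is negligible, but the interpolated variant performs one lookup per series at every distinct timestamp, which is the cost that grows when thousands of misaligned series are aggregated.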
- Do dynamic dashboard variables load quickly enough?
Dynamic dashboard variables display a list of possible values that is produced by a Tanzu Observability query. Users can extract metric attributes such as sources and point tags to populate the list dynamically. However, the query behind a dynamic dashboard variable can itself be expensive, which slows down every chart and dashboard that uses it. Until the dashboard variable has loaded, any chart that references it cannot return results.
Avoid using expensive queries for dynamic dashboard variables wherever possible. Consider whether derived metrics can work in this scenario. The idea is to create a derived metric that reports less frequently and carries the information the user wants to extract from the query.
- Adjust the default time window of a dashboard appropriately
By default, Tanzu Observability uses a 2-hour time window for dashboards. Teams may need a different time window, but keep in mind that if the dashboard contains charts with several queries and advanced functions, rendering the results becomes increasingly expensive as the time window grows, for example to 4 weeks.
Larger time windows require additional metrics to be fetched from the backend and that can lead to a performance impact. If users are dealing with larger time windows, our recommendation is to filter via queries as much as possible.
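A back-of-the-envelope calculation shows how quickly the amount of data grows with the time window. The Python sketch below assumes one point per series per reporting interval; the 100 series and the 60-second interval are illustrative values, and the function name is hypothetical.

```python
def points_to_fetch(window_seconds, series_count, reporting_interval_seconds):
    """Rough count of raw points behind a chart: one point per series per interval."""
    return (window_seconds // reporting_interval_seconds) * series_count


TWO_HOURS = 2 * 60 * 60
FOUR_WEEKS = 4 * 7 * 24 * 60 * 60

default_window = points_to_fetch(TWO_HOURS, series_count=100, reporting_interval_seconds=60)
large_window = points_to_fetch(FOUR_WEEKS, series_count=100, reporting_interval_seconds=60)

print(default_window, large_window, large_window // default_window)
```

Under these assumptions, moving from the default window to 4 weeks multiplies the points behind each query by a factor of a few hundred, which is why narrowing the set of series through filters matters most on long time windows.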
- Do events need to be displayed on charts as well?
Metrics and events are separate 'atoms' within Tanzu Observability, and they are queried separately. Chart load times are often impacted because the events query, not the actual Tanzu Observability metrics query, becomes the bottleneck for rendering a chart. If a chart displays 'noisy' sources, it is often good practice to disable 'Display Source Events' in the format section of the chart. Events can also be adjusted at the dashboard level with 'Show Events' in the top right section.
Ultimately, a lot depends on the underlying data shape as well. See this document for metrics naming best practices.