This article applies to:
- Data Ingestion, Querying
- Product edition: N/A
- Feature Category: N/A
When query performance is poor, the typical first area of investigation is the query itself and how it can be improved: Am I using these functions properly? How can I write my query differently? Query performance, however, begins with the data ingested. That is, the "shape" of the data itself has a huge impact on query performance. Shape refers to how each component of a time series is designed and formatted. Data shape affects cardinality, and while Tanzu Observability is designed to gracefully handle high cardinality, we can still ensure optimal query performance by designing our data shapes with care.
Shaping Your Data Effectively
Tanzu Observability by Wavefront has several indexes for retrieving data. One of the main indexes uses the combination of metric name and source name; another allows retrieval of data based on the combination of point tag keys and values. If we take advantage of both of these indexes in our data shaping, we can improve query performance. Additionally, making sure that we don't introduce more time series than necessary also improves query performance.
Is the data actually a time series?
Tanzu Observability is an observability platform that supports analytics of time series data. It is called time series because we are tracking some behavior over time. Each data point is a measurement at a particular point in time. We can connect data points together because we know that they are measuring the same behavior at different moments in time. Tanzu Observability identifies which data points measure the same behavior through the components of each data point: the metric name, source name, and point tags. The unique combination of these components describes what we are measuring, and the collection of data points measuring a particular behavior is a time series.
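As a concrete sketch of this idea (plain Python, not an official Wavefront client; the metric and source names are hypothetical), a series' identity is the combination of metric name, source name, and point tags, and points sharing that identity form one time series:

```python
from collections import defaultdict

# A series is identified by (metric name, source, point tags).
# Points sharing that identity belong to the same time series.
series = defaultdict(list)

def ingest(metric, source, tags, timestamp, value):
    key = (metric, source, frozenset(tags.items()))
    series[key].append((timestamp, value))

# Two points with identical components join the same series...
ingest("request.count", "checkout", {"env": "prod"}, 1700000000, 12)
ingest("request.count", "checkout", {"env": "prod"}, 1700000060, 15)
# ...while any differing component starts a new series.
ingest("request.count", "billing", {"env": "prod"}, 1700000000, 7)

print(len(series))  # 2 distinct time series
```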
If your data, however, tracks behavior so unique that no two data points (or very few of them) belong to the same time series, consider whether using Distributed Tracing makes more sense, or whether the data can be tracked differently to produce a true time series. One way to detect that our data is of this nature is to query the raw data on a line chart: if it returns scattered dots rather than connected lines, each data point belongs to a different series, and we do not actually have true time series data.
This is extremely important for query performance because data is stored as time series. That is, data points that belong to the same series are stored together. Increasing the number of time series unnecessarily slows down data retrieval because more time series must be retrieved. It also has an impact because more time series may need to be scanned through to find the ones that need to be retrieved.
How will the data be queried?
Understanding how the data is to be ingested and how it will be queried is key to the process. This will help with determining what should make up each component of the data points. We know that these components include the metric name, source name, and point tags.
In this section, we'll focus on the first two components since one of the main indexes uses the metric name and source name combination. At a high level, the key consideration is to optimize for the most frequent/common situation.
Let's start with the source name. It's important to think about how the metric will be queried and what the main (or most frequently used) dimension for filtering or grouping the data will be. This dimension is often the best candidate to use as the source name. Doing this ensures that we make good use of the metric name/source name index, thereby improving the retrieval speed of the data we care about and, in the process, query performance.
For example, if I had an application with multiple services and I wanted a time series metric tracking request count, I should consider how I would typically use this data. Suppose that most of the time I would want to see request counts by service, i.e., the request count for each of my services. It would then make sense to set the service name as the source name of each data point. In my queries, I could then easily see data for each of my services by using the service name as the source filter.
Suppose, instead, that I set the source name to the hostname of the server that is collecting this request count data. I would then need to add a point tag to specify the service. When I run my queries to find request counts for a particular service, I would need to filter by the point tag. In most situations, this is the kind of query I would run, and it would render the information stored as the source name effectively useless. The metric name/source name index would also not be sufficient to retrieve the data I'm interested in. While I would still obtain the results I'm looking for by filtering on a point tag, it would be better to design my data shape to make the best use of all that is available.
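To make the two shapes concrete, here is a hedged sketch in plain Python that formats points in the standard Wavefront data format (`<metric> <value> [<timestamp>] source=<source> [pointTags]`). The metric, service, and host names are hypothetical:

```python
def wf_line(metric, value, timestamp, source, **tags):
    """Format a point in the Wavefront data format:
    <metric> <value> <timestamp> source=<source> [key="value" ...]"""
    tag_str = " ".join(f'{k}="{v}"' for k, v in tags.items())
    return f"{metric} {value} {timestamp} source={source} {tag_str}".rstrip()

# Preferred shape: the service (the dimension we filter by most often)
# is the source, so the metric/source index answers the query directly.
good = wf_line("request.count", 42, 1700000000, "checkout-service")

# Less effective shape: the host is the source, so every query for a
# service must additionally filter on the "service" point tag.
poor = wf_line("request.count", 42, 1700000000, "host-17",
               service="checkout-service")

print(good)
print(poor)
```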
Carefully determining the metric name is also important. As described in the best practices, it's helpful if the metric name reflects data that is comparable across various sources. What is also important is that the metric name is not too broad and shards the data appropriately.
Suppose I have a metric tracking response code counts. This would be comparable across various sources: I can compare how many responses each service had. However, let's say that in most situations I would actually be comparing specific sets of response codes. For instance, I may want to compare just the response codes that correspond to errors. If my metric name were simply response_code, I would have to use a point tag to track the response code itself (e.g., 202, 400, 401, 500). Since I actually need to compare counts of a specific response code across services, most of my queries would require filtering by the metric name, the source name, and a point tag. This is certainly doable. What is lost, though, is optimal use of the metric name and source name index.
Because the metric name itself is too broad, this index doesn't allow us to directly retrieve the data we're interested in. Additional filtering is needed. Suppose, instead, that my metric name was response_code.<code>. This is still comparable across various sources but it also allows us to more effectively use the metric name/source name index. We can now directly retrieve the data of interest without needing further filtering.
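As an illustrative sketch (hypothetical metric names, plain Python), embedding the response code in the metric name shards the data so that a single code is directly addressable through the metric name/source name index, with no point tag filter needed:

```python
codes = ("200", "401", "500")

# Broad shape: one metric for everything, with the code carried as a
# point tag. A query for errors must filter on the tag after the
# metric/source index lookup.
broad = [("response_code", {"code": c}) for c in codes]

# Sharded shape: the code is part of the metric name, so, e.g.,
# response_code.500 can be retrieved directly via the index.
sharded = [(f"response_code.{c}", {}) for c in codes]

print([metric for metric, _ in sharded])
# ['response_code.200', 'response_code.401', 'response_code.500']
```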
It is expected, of course, to have to do additional filtering in most cases. However, the goal is to optimize the use of all the indexes.
Is each data point highly unique?
Often, the first instinct is to store every attribute we need for a particular set of data as a separate point tag key and value. However, it's also important to consider how frequently the values of the point tags change. A value that changes with each data point makes a very poor point tag because it results in a very large number of time series, each with just a single data point. As described in the section above, this slows down data retrieval and, therefore, query performance.
The same concept applies to metric names and source names. None of the components of a data point that describe what is being measured should be so unique that each data point effectively is its own time series. Therefore, things like timestamps or unique IDs should not be used in metric names, source names, or point tags.
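The cardinality cost of a per-point unique value can be sketched directly (hypothetical Python; the metric, source, and tag names are made up): a tag whose value changes with every point yields one single-point series per point, while a stable tag keeps the series count flat:

```python
import uuid

def count_series(points):
    """Count distinct time series: one per unique
    (metric, source, tags) combination."""
    return len({(m, s, frozenset(t.items())) for m, s, t in points})

# Stable tag value: 1000 points, all in one series.
stable = [("job.duration", "worker-1", {"queue": "default"})
          for _ in range(1000)]

# Unique-per-point tag value (e.g. a request ID): 1000 points,
# 1000 single-point series.
unique = [("job.duration", "worker-1", {"request_id": str(uuid.uuid4())})
          for _ in range(1000)]

print(count_series(stable), count_series(unique))  # 1 1000
```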
Again, there are valid situations where we need to capture ephemeral information with point tags. The goal is to optimize where we can.