This article applies to:
- Troubleshooting/Data Ingestion
- Feature Category: Telegraf Agent (Third Party Integration)
When sending metrics from a Telegraf agent to Tanzu Observability, either directly or through a proxy, fine-tuning may be necessary to remove bottlenecks and ensure that all metrics are delivered in a timely manner. This article outlines how to monitor and fine-tune the performance of the Telegraf agent in common scenarios.
The Telegraf default settings are tunable and can be adjusted as necessary. Typically, the absence of a particular metric, or intermittent delivery of a metric, is what indicates a need to alter the default Telegraf settings. Certain input plugins can overwhelm the default agent settings; messages in the Telegraf logs can help identify which settings need adjustment to handle the number of metrics being processed and sent.
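For reference, the settings most often tuned live in the [agent] section of telegraf.conf. The values below match Telegraf's documented defaults at the time of writing; verify them against your installed version:

```toml
[agent]
  ## How often inputs gather metrics.
  interval = "10s"
  ## How often outputs flush buffered metrics.
  flush_interval = "10s"
  ## Maximum number of metrics sent to an output per flush.
  metric_batch_size = 1000
  ## Maximum number of unsent metrics buffered per output;
  ## metrics beyond this limit are dropped.
  metric_buffer_limit = 10000
```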
Tanzu Observability can also be used to gain insight into Telegraf performance. To collect metrics about the Telegraf agent itself, create a telegraf.conf file in /etc/telegraf/telegraf.d, add the following snippet, then restart the Telegraf service:
# Collect internal Telegraf statistics
[[inputs.internal]]
  ## If true, collect Telegraf memory stats.
  collect_memstats = true
  name_prefix = "telegraf."
Once the internal Telegraf statistics have been enabled, install the Telegraf dashboard from the integration section to view the relevant details.
After installing the Telegraf dashboard, its charts can be used to troubleshoot issues with gathering and writing metrics.
Metrics gathered per cycle by individual plugins can also be monitored. For example, the rate of metrics collected by the vsphere input plugin would be:
rate(ts("telegraf.internal.gather.metrics.gathered", input="vsphere")) * 60
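Assuming the internal plugin's write-side statistics are prefixed and flattened the same way as the gather statistics above (the exact metric name is an assumption, not taken from this article), metrics dropped at the output per minute could be charted with a query such as:

```
rate(ts("telegraf.internal.write.metrics.dropped")) * 60
```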
The Telegraf log messages can also highlight issues that indicate some fine-tuning of the Telegraf agent is necessary.
For example, the message below indicates that metrics have been dropped because the buffer size is not sufficient to handle the number of metrics collected and flushed to the output plugin. In this scenario, metric_buffer_limit should be increased.
May 17 19:17:00 test-server1 telegraf: 2021-05-17T15:27:00Z W! [outputs.wavefront] Metric buffer overflow; 1956 metrics have been dropped
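One way to address this overflow, sketched below, is to raise metric_buffer_limit in the [agent] section of telegraf.conf and restart Telegraf. The value shown is only an illustration and should be sized to your actual metric volume:

```toml
[agent]
  ## Increased from the default of 10000 to absorb bursts
  ## without dropping metrics (illustrative value).
  metric_buffer_limit = 50000
```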
The message below indicates that the metric_batch_size is too large to be flushed within the configured flush_interval and should be adjusted.
May 17 11:02:56 test-server2 telegraf: 2021-05-17T11:02:56Z W! [agent] ["outputs.wavefront"] did not complete within its flush interval
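A hedged sketch of the corresponding adjustment: reduce metric_batch_size so that each flush completes within flush_interval, or lengthen flush_interval itself. Both values below are illustrative, not recommendations:

```toml
[agent]
  ## Smaller batches flush faster (illustrative value).
  metric_batch_size = 500
  ## Alternatively, allow more time per flush.
  flush_interval = "20s"
```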
metric_buffer_limit : If the metrics collected by the inputs exceed this buffer limit, all of the overflow data will simply be dropped/discarded. The buffer needs to be sized large enough for the given interval so that no data is lost.
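As a rough sizing rule (an assumption, not an official formula): the buffer should hold at least the metrics gathered during the longest output stall the agent must ride out. For example:

```toml
[agent]
  ## Example sizing: ~2000 metrics gathered per 10s interval,
  ## tolerating a 60s output stall:
  ##   2000 metrics / 10s * 60s = 12000 metrics minimum.
  ## Round up for headroom (illustrative value).
  metric_buffer_limit = 15000
```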
See also: