Tanzu Observability Telegraf agent tuning

This article applies to:

  • Troubleshooting/Data Ingestion
  • Feature Category: Telegraf Agent (Third Party Integration)

Overview

When sending metrics from a Telegraf agent to Tanzu Observability, either directly or through a proxy, fine-tuning may be necessary to remove bottlenecks and ensure that all metrics are delivered in a timely manner. This article outlines how to monitor and fine-tune the performance of the Telegraf agent in common scenarios.

Process

The Telegraf default settings are tunable and can be adjusted as necessary. Typically, a missing metric or the intermittent delivery of a metric indicates that the default Telegraf settings need to be changed. Certain input plugins can overwhelm the default Telegraf agent settings. The messages in the Telegraf logs help identify which default settings need adjustment to handle the number of metrics being processed and sent.

Tanzu Observability can also be used to gain insight into Telegraf performance. To collect metrics on the performance of the Telegraf agent, create a telegraf.conf file in /etc/telegraf/telegraf.d, add the following snippet, and then restart the Telegraf service:

# Collect internal Telegraf statistics
[[inputs.internal]]
  ## If true, collect Telegraf memory stats.
  collect_memstats = true
  ## Prefix added to the names of the metrics collected by this plugin.
  name_prefix = "telegraf."

Once the internal Telegraf statistics have been enabled, install the Telegraf dashboard from the integration section to view the relevant details.


After installing the Telegraf dashboard, use its charts to troubleshoot issues with gathering and writing metrics.

The metrics gathered by an individual plugin can also be monitored. For example, the following query returns the per-minute rate of metrics collected by the vsphere plugin:

rate(ts("telegraf.internal.gather.metrics.gathered", input="vsphere")) * 60

 

Additionally, Telegraf log messages can indicate that the agent needs fine-tuning.

For example, the message below indicates that metrics are being dropped because the buffer is too small to hold the number of metrics collected and flushed to the output plugin. In this scenario, increase metric_buffer_limit.

May 17 19:17:00 test-server1 telegraf[1562]: 2021-05-17T15:27:00Z W! [outputs.wavefront] Metric buffer overflow; 1956 metrics have been dropped
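
As an illustration, raising the limit in the [agent] section of telegraf.conf might look like the sketch below. The value 50000 is a hypothetical example rather than a recommendation, and the stated default of 10000 is an assumption based on a stock Telegraf configuration. Check the installed telegraf.conf for the actual value.

```toml
[agent]
  ## Maximum number of unwritten metrics held for each output.
  ## 50000 is an illustrative value only (assumed stock default: 10000).
  metric_buffer_limit = 50000
```

A larger buffer increases the memory that Telegraf can consume, so raise the limit in steps and restart the Telegraf service after each change.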

The message below indicates that a write to the output plugin did not finish within the configured flush_interval, typically because metric_batch_size is too large to be flushed in that time. In this scenario, adjust metric_batch_size.

May 17 11:02:56 test-server2 telegraf[1468]: 2021-05-17T11:02:56Z W! [agent] ["outputs.wavefront"] did not complete within its flush interval
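
One possible adjustment for this warning is sketched below, again in the [agent] section of telegraf.conf. Both values are hypothetical examples, and the defaults named in the comments are assumptions based on a stock Telegraf configuration. Which change helps depends on the environment, so treat this as a starting point for testing.

```toml
[agent]
  ## Write smaller batches so that each write finishes sooner.
  ## 500 is an illustrative value only (assumed stock default: 1000).
  metric_batch_size = 500

  ## Alternatively, allow more time for each flush to complete.
  ## "30s" is an illustrative value only (assumed stock default: "10s").
  flush_interval = "30s"
```

A longer flush_interval means more metrics accumulate between writes, so metric_buffer_limit may need to be reviewed at the same time.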

The default settings for the input and output plugins reside in the telegraf.conf file (Program Files\Telegraf\telegraf.conf on Windows installations and /etc/telegraf/telegraf.conf on Linux installations). The main parameters that may need adjustment are:

  • interval: The default data collection interval for all inputs used by Telegraf. This setting specifies how frequently metric data is collected.
  • flush_interval: How frequently the collected metric data is flushed to the output plugin.
  • metric_batch_size: The maximum number of metrics written to the output plugin in a single batch.
  • metric_buffer_limit: The maximum number of collected metrics held while waiting to be written. If the metrics collected exceed this limit, the overflow is dropped. Set the value large enough for the given interval so that no data is lost.
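
For orientation, the four parameters sit together in the [agent] section of telegraf.conf. The sketch below shows them with the values assumed to be the defaults in a stock Telegraf configuration. The installed file may differ, so verify it before changing anything.

```toml
[agent]
  ## How often each input collects metrics.
  interval = "10s"
  ## How often collected metrics are flushed to the outputs.
  flush_interval = "10s"
  ## Maximum number of metrics sent to an output in one write.
  metric_batch_size = 1000
  ## Maximum number of unwritten metrics held for each output.
  metric_buffer_limit = 10000
```

Change one parameter at a time, restart the Telegraf service, and confirm the effect in the Telegraf logs or on the Telegraf dashboard before adjusting the next one.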

See also:

https://docs.wavefront.com/telegraf.html

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/internal/README.md

https://github.com/influxdata/telegraf
