Telegraf Agent Troubleshooting

This article applies to:

  • Troubleshooting/Data Ingestion
  • Feature Category: Telegraf Agent (Third Party Integration)

Overview

This article describes troubleshooting steps for cases where a Telegraf agent fails to collect metrics or fails to send them to the VMware Aria Operations for Applications proxy. A typical symptom is that metrics sent from the Telegraf agent through a proxy do not appear when queried in the VMware Aria Operations for Applications user interface.

Procedure

Ensure the Telegraf agent is collecting metrics

  • Check that the intended input plugins are enabled and that Telegraf can collect their metrics by running the "telegraf --test" command, as shown in the following example. Investigate any startup errors, such as misconfigured plugins or file permission problems, based on the information in the error messages.
    >telegraf --test
    2021-11-08T15:58:24Z I! Starting Telegraf 1.19.0
    2021-11-08T15:58:24Z I! Using config file: /etc/telegraf/telegraf.conf
    > net,host=centos-2021,interface=virbr0 bytes_recv=0i,bytes_sent=0i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=0i,packets_sent=0i 1636387105000000000
    > net,host=centos-2021,interface=ens33 bytes_recv=235865896i,bytes_sent=13420041i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=169362i,packets_sent=23833i 1636387105000000000
    > net,host=centos-2021,interface=all icmp_inaddrmaskreps=0i,icmp_inaddrmasks=0i,icmp_incsumerrors=0i,icmp_indestunreachs=92i,icmp_inechoreps=0i,icmp_inechos=1i,icmp_inerrors=34i,icmp_inmsgs=93i,icmp_inparmprobs=0i,icmp_inredirects=0i,icmp_insrcquenchs=0i,icmp_intimeexcds=0i,icmp_intimestampreps=0i,icmp_intimestamps=0i,icmp_outaddrmaskreps=0i,icmp_outaddrmasks=0i,icmp_outdestunreachs=95i,icmp_outechoreps=1i,icmp_outechos=0i,icmp_outerrors=0i,icmp_outmsgs=96i,icmp_outparmprobs=0i,icmp_outredirects=0i,icmp_outsrcquenchs=0i,icmp_outtimeexcds=0i,icmp_outtimestampreps=0i,icmp_outtimestamps=0i,icmpmsg_intype3=92i,icmpmsg_intype8=1i,icmpmsg_outtype0=1i,icmpmsg_outtype3=95i,ip_defaultttl=64i,ip_forwarding=1i,ip_forwdatagrams=0i,ip_fragcreates=0i,ip_fragfails=0i,ip_fragoks=0i,ip_inaddrerrors=0i,ip_indelivers=23399i,ip_indiscards=0i,ip_inhdrerrors=0i,ip_inreceives=23419i,ip_inunknownprotos=0i,ip_outdiscards=0i,ip_outnoroutes=65i,ip_outrequests=21702i,ip_reasmfails=0i,ip_reasmoks=0i,ip_reasmreqds=0i,ip_reasmtimeout=0i,tcp_activeopens=511i,tcp_attemptfails=243i,tcp_currestab=6i,tcp_estabresets=10i,tcp_incsumerrors=0i,tcp_inerrs=0i,tcp_insegs=21710i,tcp_maxconn=-1i,tcp_outrsts=230i,tcp_outsegs=24581i,tcp_passiveopens=5i,tcp_retranssegs=37i,tcp_rtoalgorithm=1i,tcp_rtomax=120000i,tcp_rtomin=200i,udp_incsumerrors=0i,udp_indatagrams=926i,udp_inerrors=0i,udp_noports=27i,udp_outdatagrams=455i,udp_rcvbuferrors=0i,udp_sndbuferrors=0i,udplite_incsumerrors=0i,udplite_indatagrams=0i,udplite_inerrors=0i,udplite_noports=0i,udplite_outdatagrams=0i,udplite_rcvbuferrors=0i,udplite_sndbuferrors=0i 1636387105000000000
  • Validate that the Telegraf input plugins are configured correctly. The log output may provide more detailed guidance.
    For documentation and examples of input and output plugins, refer to the Telegraf GitHub repository: https://github.com/influxdata/telegraf/tree/master/docs.
  • If a small number of metrics from a particular plugin are missing, check that the plugin configuration does not exclude them. For example, in the vsphere input plugin configuration shown below, taken from telegraf.conf (typically located at /etc/telegraf/telegraf.conf), several virtual machine and host metrics are excluded on purpose:
    ...
    # Read metrics from one or many vCenters

    [[inputs.vsphere]]
    vm_metric_exclude = ["cpu.idle.summation","cpu.readiness.average","cpu.ready.summation","cpu.run.summation"]
    host_metric_exclude = ["cpu.idle.summation","cpu.readiness.average","cpu.ready.summation","cpu.wait.summation"]

  • By default, the Linux agent does not log to /var/log/telegraf.log but to syslog (/var/log/syslog, or the equivalent for your OS distribution). To send Telegraf log output to its own file, uncomment the logfile line in telegraf.conf, as shown below. You can also enable debug mode (debug = true) to increase verbosity.
    ...
    ## Name of the file to be logged to when using the "file" log target. If set to
    ## the empty string then logs are written to stderr.
    logfile = "/var/log/telegraf/telegraf.log"

    ## Run telegraf in debug mode
    debug = false
    ## Run telegraf in quiet mode
    quiet = false

  • If the plugins appear to load as expected but the metrics are still not visible, capture all points that Telegraf sends to a local file so you can verify that the data is collected and is in the expected format. Add the following to telegraf.conf:
    [[outputs.file]]
    ## Files to write to, "stdout" is a specially handled file.
      files = ["stdout", "/tmp/metrics.out"]
    In this example, the file "/tmp/metrics.out" captures all the Telegraf data being sent to the proxy. Use it to confirm that the input plugins are gathering the metric data with the appropriate tags before it is sent.

  • Not every error in the logs is actionable, and not every plugin is required. For example, the error messages below are not related to the proxy. If you do not want these errors and do not need the outputs.influxdb plugin, disable it by commenting out (#) the corresponding outputs.influxdb lines in telegraf.conf and restarting the Telegraf agent.
    Jan 27 15:14:22 hostname telegraf[22421]: 2021-01-27T22:14:22Z E! [outputs.influxdb] When writing to [http://localhost:8086]: Post "http://localhost:8086/write?db=telegraf": dial tcp 127.0.0.1:8086: connect: connection refused

    Jan 27 15:14:22 hostname telegraf[22421]: 2021-01-27T22:14:22Z E! [agent] Error writing to outputs.influxdb: could not write any address
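
To review the output of "telegraf --test" or the /tmp/metrics.out capture described above, a short script can list which measurements and fields were actually collected, which makes missing or excluded metrics easier to spot. The following Python sketch is illustrative only: the function name summarize_line_protocol is hypothetical, and the parser assumes the simple line-protocol form shown in this article, without escaped spaces or commas in names or values.

```python
from collections import defaultdict


def summarize_line_protocol(text):
    """Map each measurement name to the sorted field keys seen for it.

    Assumes lines of the form
    "measurement,tag=value field=1i,field2=2i timestamp"
    with no escaped spaces or commas (a simplification of the real format).
    Log lines and blank lines are skipped.
    """
    fields_by_measurement = defaultdict(set)
    for raw in text.splitlines():
        line = raw.strip()
        # "telegraf --test" prefixes each metric with ">"; outputs.file does not.
        if line.startswith(">"):
            line = line[1:].strip()
        parts = line.split(" ")
        # A metric line has a field set ("key=value,...") as its second token.
        if len(parts) < 2 or "=" not in parts[1]:
            continue
        measurement = parts[0].split(",", 1)[0]
        for pair in parts[1].split(","):
            key, _, _value = pair.partition("=")
            fields_by_measurement[measurement].add(key)
    return {name: sorted(keys) for name, keys in fields_by_measurement.items()}


# Sample abbreviated from the "telegraf --test" output shown earlier.
sample = (
    "2021-11-08T15:58:24Z I! Starting Telegraf 1.19.0\n"
    "> net,host=centos-2021,interface=ens33 "
    "bytes_recv=235865896i,bytes_sent=13420041i 1636387105000000000\n"
)
print(summarize_line_protocol(sample))
```

One possible use is to pass the contents of /tmp/metrics.out to the function and compare the listed fields with the metrics you expect from each input plugin.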

Verify connectivity of Telegraf agent to VMware Aria Operations for Applications Proxy

  • The Telegraf agent typically fails to send metrics to the proxy for network-related reasons: DNS resolution failures, an invalid proxy hostname or IP address in the Telegraf configuration files, or firewalls blocking communication to the configured port on the proxy. The errors logged in these scenarios look like the following:
    Error writing to outputs.wavefront: Wavefront sending error: unable to connect to Wavefront proxy 
    Jan 27 15:14:52 hostname telegraf[22421]: 2021-01-27T22:14:52Z E! [agent] Error writing to outputs.wavefront: Wavefront sending error: unable to connect to Wavefront proxy at address: 10.10.100.100:2878, err: "dial tcp 10.10.100.100:2878: i/o timeout"

    Jan 27 15:14:52 hostname telegraf[22421]: 2021-01-27T22:14:52Z E! [outputs.influxdb] When writing to [http://localhost:8086]: Post "http://localhost:8086/write?db=telegraf": dial tcp 127.0.0.1:8086: connect: connection refused
  • Confirm that the proxy hostname or IP address configured in telegraf.conf is correct, and then troubleshoot connectivity between the Telegraf agent and the Wavefront proxy (for example with nslookup, ping, telnet, or a packet capture).

  • Intermittent broken pipe messages logged by Telegraf are typically caused by firewalls closing a socket connection, as in the following example:

    telegraf[xxx]: 2021-03-03T09:25:19Z E! [agent] Error writing to outputs.wavefront: Wavefront sending error: write tcp 192.168.xxx.xxx:36948->192.168.xxx.xx:2878: write: broken pipe
    telegraf[xxx]: 2021-03-03T09:26:10Z I! connected to Wavefront proxy at address: 192.168.xxx.xx:2878

    In these cases, it may be necessary to configure the wavefront output plugin to use the HTTP protocol instead of the socket connection. For more details, see "Telegraf connection errors when using load balancer with VMware Aria Operations for Applications proxy".

  • If no errors appear in the Telegraf logs, ensure that the data captured by the Telegraf plugin is in a format the proxy accepts. Recent versions of the proxy log data quality problems with the metrics they receive and indicate issues with their format. Check /var/log/wavefront/wavefront.log on the proxy machine for errors. For example, an error message like the following may be recorded:
    2021-11-08 15:32:24,337 INFO  [AbstractReportableEntityHandler:reject] [2878] blocked input: [WF-300 Cannot parse metric: ""Update" source="rules-service"
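
When the proxy log is large, it can help to extract only the rejected points. The following Python sketch shows one way to do that. The function name find_blocked_inputs is hypothetical, and the pattern is based only on the single sample line shown above, so other proxy versions may format rejections differently.

```python
import re

# Pattern derived from the sample log line in this article (an assumption):
# "... [<port>] blocked input: [<reason and offending point>"
BLOCKED = re.compile(r"\[(?P<port>\d+)\] blocked input: \[(?P<reason>.*)")


def find_blocked_inputs(log_text):
    """Return (port, reason) tuples for each "blocked input" line in log_text."""
    results = []
    for line in log_text.splitlines():
        match = BLOCKED.search(line)
        if match:
            results.append((match.group("port"), match.group("reason").strip()))
    return results
```

One possible use is to read /var/log/wavefront/wavefront.log on the proxy machine, pass its contents to the function, and group the results by reason to see which kinds of malformed points are being rejected.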

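The connectivity checks described in this section (name resolution followed by a TCP connection test, as with nslookup and telnet) can also be scripted and run from the Telegraf host. The following Python sketch is a minimal illustration. The function name check_proxy_connectivity is hypothetical, and the default port 2878 is taken from the log samples in this article, so substitute the port configured for your proxy.

```python
import socket


def check_proxy_connectivity(host, port=2878, timeout=5.0):
    """Return (ok, detail) for a DNS lookup plus a TCP connect to host:port."""
    try:
        # Step 1: name resolution, similar to what nslookup verifies.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, f"DNS resolution failed for {host!r}: {exc}"
    first_address = infos[0][4][0]
    try:
        # Step 2: TCP connect, similar to what telnet verifies.
        with socket.create_connection((host, port), timeout=timeout):
            return True, (
                f"TCP connect to {host}:{port} succeeded "
                f"(first resolved address {first_address})"
            )
    except OSError as exc:
        # Covers timeouts, refused connections, and unreachable networks.
        return False, (
            f"resolved {host!r} to {first_address} "
            f"but connect failed: {exc}"
        )
```

For example, calling check_proxy_connectivity("10.10.100.100") from the Telegraf host would indicate whether the proxy address from the sample error above is reachable on port 2878. A timeout at this step is consistent with a firewall dropping traffic, while a DNS failure points to the hostname configured in telegraf.conf.
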