Introduction
VMware Aria Operations for Applications provides an integration for monitoring the health and performance of your Kubernetes environments (including Tanzu Kubernetes Grid (TKG), OpenShift, and others).
Download and Installation
- Go to your service instance (https://CLUSTER.wavefront.com/) and click Integrations in the top menu bar.
- Click on the Kubernetes tile (you can find it in the Featured section)
- Click on the Setup tab and click on "Add Integration"
- Select the option that is most applicable to you and click Next.
- Install in Tanzu Cluster - select this if you are using TKG
- Install in OpenShift Cluster - select this if you are using OpenShift
- Install in Kubernetes Cluster - select this for any other flavor of Kubernetes
- Let's walk through the steps for Install in Kubernetes Cluster. First, ensure that you have Helm installed. You can verify that Helm is installed by running this command on the command line:
helm version
You should see an output similar to the following:
version.BuildInfo{Version:"v3.8.2", GitCommit:"6e3701edea09e5d55a8ca2aae03a68917630e91b", GitTreeState:"clean", GoVersion:"go1.18.1"}
- Run the following command to ensure that the Wavefront repository is configured within Helm:
helm repo add wavefront https://wavefronthq.github.io/helm/ && helm repo update
You should see an output similar to the following:
"wavefront" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "wavefront" chart repository
Update Complete. ⎈Happy Helming!⎈
- Enter your cluster name into the specified field. If you are installing on a test cluster that does not have authentication configured, click on "Additional Settings" and check "Use Kubelet's read-only port". Then, copy the corresponding kubectl command that is generated.
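The exact command is generated for you in the setup UI, but as a rough sketch it is a Helm install against the chart repository added above. The values below are placeholders, not the literal command the UI produces:

```shell
# Hypothetical sketch of the generated install command; YOUR_CLUSTER_NAME
# and YOUR_API_TOKEN are placeholders supplied by the setup UI.
helm install wavefront wavefront/wavefront \
  --set clusterName=YOUR_CLUSTER_NAME \
  --set wavefront.url=https://CLUSTER.wavefront.com \
  --set wavefront.token=YOUR_API_TOKEN \
  --namespace wavefront --create-namespace
```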
- Paste the command into your command line and run it. You should see an output similar to the following:
NAME: wavefront
LAST DEPLOYED: Tue May 10 19:29:32 2022
NAMESPACE: wavefront
STATUS: deployed
REVISION: 1
NOTES:
Wavefront is setup and configured to collect metrics from your Kubernetes cluster. You
should see metrics flowing within a few minutes.
You can visit this dashboard in Wavefront to see your Kubernetes metrics:
https://CLUSTER.wavefront.com/dashboard/integration-kubernetes-summary
- Installation is now complete. After a couple of minutes, you should see that the Kubernetes integration tile is marked with a green checkmark to indicate that data is flowing successfully.
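If you want to confirm from the command line as well (this assumes kubectl access to the cluster), you can check that the pods the chart created in the wavefront namespace reach the Running state:

```shell
# Optional check: list the pods the Helm release created
# (pod names vary by chart version).
kubectl get pods -n wavefront
```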
- See the Out-of-the-Box Dashboards section to learn more about exploring the available out-of-the-box dashboards.
Out-of-the-Box Dashboards
Once the integration is complete, you are ready to monitor your Kubernetes environment using the out-of-the-box dashboards.
You can find these dashboards in the Kubernetes integration tile:
- After logging in click on Integrations in the top menu bar.
- Search for the Kubernetes integration and click it.
- Click the Dashboards tab.
Here is a summary of the available dashboards:
Dashboard | Description
---|---
Kubernetes Summary | Health summary of all Kubernetes clusters and workloads
Kubernetes Clusters | Detailed health overview of cluster-level components
Kubernetes Nodes | Detailed health of nodes
Kubernetes Pods | Detailed health of your pods, broken down by node and namespace
Kubernetes Containers | Detailed health of your containers, broken down by namespace, node, and pod
Kubernetes Namespaces | Details of your pods or containers, broken down by namespace
Wavefront Collector for Kubernetes Metrics | Internal stats of the Wavefront Collector for Kubernetes
Kubernetes Control Plane | 
Out-of-the-Box Alerts
To access the out-of-the-box alerts:
- After logging in click on Integrations in the top menu bar.
- Search for the Kubernetes integration and click it.
- Click the Alerts tab and then click on the green Install All button.
- Once the alerts are installed, there will be an "edit" link next to each alert.
Note: There will also be a message in the top right corner indicating that the alerts were installed without targets.
- Click edit on the alert of interest, scroll to the "Recipients" field, and add alert target(s) to specify where notifications should go.
- Click save in the upper right corner.
- Once an alert fires, the notification will include a link to the Alert Viewer page, which allows you to investigate the alert.
Example: Investigating an Alert
One key out-of-the-box alert ("K8s too many pods crashing") tracks when too many pods start to crash. In this section, we'll walk through investigating this alert as an example of how you can use the out-of-the-box dashboards and alerts to monitor your Kubernetes environments.
This alert triggers when the following condition has been met:
count(ts(kubernetes.pod.status.phase, phase="Running" or phase="Succeeded"), cluster) / count(ts(kubernetes.pod.status.phase), cluster) < 0.8
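In other words, the condition is a ratio check: the number of pods reported as Running or Succeeded, divided by the total pod count, computed per cluster; the alert fires when that ratio drops below 0.8. Here is a minimal shell sketch of the same arithmetic, using made-up counts (7 healthy pods out of 10):

```shell
# Hypothetical counts: 7 of 10 pods are in the Running or Succeeded phase.
healthy=7
total=10
# Compute the healthy ratio, mirroring the alert condition (< 0.8 fires).
ratio=$(awk -v h="$healthy" -v t="$total" 'BEGIN { printf "%.2f", h / t }')
echo "healthy ratio: $ratio"
if awk -v r="$ratio" 'BEGIN { exit !(r < 0.8) }'; then
  echo "ALERT: fewer than 80% of pods are healthy"
fi
```

With 7 of 10 pods healthy the ratio is 0.70, which is below the 0.8 threshold, so the alert fires.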
- This is an example of the notification email sent when the alert triggers. This will serve as the starting point of our investigation.
- From the notification, we know that the affected cluster is "mk" (this is specified in the Sources/Labels Affected field). Using the "Kubernetes Pods" out-of-the-box dashboard, we can filter for the "mk" cluster and quickly examine all the pod states in this cluster. This allows us to identify which pods are not in the Running state. Locate the "Pods Pending" chart on the dashboard.
In this example, the alert was triggered because these pods are in the Pending state.
- From the previous chart, we notice that the pods are pending because 0/1 nodes are available and 1 node is unschedulable. However, why didn't the pods get scheduled on other nodes? Using the "Kubernetes Summary" out-of-the-box dashboard and filtering for the "mk" cluster, we find that this is because there is only one node in the cluster.
- With the out-of-the-box alerts and dashboards, we've been able to catch the fact that many pods were crashing. Very quickly, we were able to identify the affected cluster, the affected pods, and the affected node. We now know that the pods were crashing because the cluster only has one node and that node is unschedulable. Armed with this information, you can focus your further investigation on why that node is unschedulable and determine the appropriate remediation:
- Is there an upgrade underway?
- Is there hardware replacement in-process?
- Why does this cluster have only a single node?