Monitoring is a core observability practice that encompasses alerting, logging, and tracing, as well as time-series events. Monitoring attempts to answer two questions: “What is broken?” and “Why is it broken?” The former identifies the problematic components in a system by diagnosing the symptoms observed. The latter explains why those components are behaving abnormally.
Google's Site Reliability Engineering book treats the “what” versus the “why” as an important distinction in building good monitoring solutions. Ultimately, monitoring gives insight into the behavior of your services once they are deployed in production. But to apply monitoring practices effectively, you need to set targets for key application and system metrics using service level objectives (SLOs).
Understanding Service Level Objectives
Any test of performance and stability needs clearly established goals against which to evaluate progress. In reliability engineering, monitoring is usually paired with service level objectives (SLOs), which organizations use to place availability, reliability, and performance targets on the components and systems they monitor. These targets tell engineers whether new deployments improve or degrade service availability, reliability, and performance.
Monitoring metrics are tied to service level objectives, which set the expectation for the particular service item being monitored. For instance, an SLO might state that 99th percentile latency is 300 ms, meaning that 99 out of 100 requests complete within 300 ms, while at most 1 in 100 may take longer. New updates to your existing infrastructure should either maintain this SLO or improve on it.
Monitoring applies across many engineering domains, but in this article, we’ll focus on monitoring in a Kubernetes environment.
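To make the percentile target concrete, here is a minimal sketch that checks a batch of latency samples against a 99th percentile SLO of 300 ms. It assumes the nearest-rank definition of a percentile and an integer percentile value; the function names and the sample data are hypothetical, and real monitoring systems often estimate percentiles from histograms instead.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it.
    Assumes pct is an integer between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, giving a 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(latencies_ms, pct=99, target_ms=300):
    """True when the pct-th percentile latency is within the target."""
    return percentile(latencies_ms, pct) <= target_ms


# Hypothetical data: one slow outlier in 100 requests is tolerated,
# two slow outliers in 100 requests breach the SLO.
one_outlier = [120] * 99 + [900]
two_outliers = [120] * 98 + [900, 900]

print(meets_latency_slo(one_outlier))   # True
print(meets_latency_slo(two_outliers))  # False
```

The first batch passes because 99 of its 100 requests finish under 300 ms, which is exactly what the SLO allows.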
Monitoring in a Kubernetes Environment
In the past few years, Kubernetes has taken the DevOps world by storm with its standardized approach to container orchestration. Today, more and more applications are deployed as microservices to take advantage of the orchestration and scheduling facilities provided by the Kubernetes platform, which extends beyond on-premises environments to the cloud. With this increase of distributed services being deployed and the fact that the Kubernetes platform itself contains several moving parts, it is essential to understand how your application and its environment are behaving to ensure maximal availability and quality of service for your users.
But before we jump into monitoring implementations, let us highlight the different components and levels in a Kubernetes system that should be monitored.
What to Monitor in Kubernetes
Monitoring is best applied at both the system level and the application level. Application-level monitoring is often given higher priority because of its direct link to customer satisfaction, but here, we’ll delve into both.
The kube-system namespace contains objects created by the Kubernetes system and is critical to monitor, as the health and state of your cluster directly affect your application. Key areas to monitor at the system level are the network and the nodes, pods, and containers that run your workloads.
The network is one of the more popular components to monitor. Latency is a core metric that measures the time between an initial request and its response, and reducing it typically has one of the greatest impacts on customer satisfaction. The Kubernetes ecosystem has also adopted service mesh tools such as Linkerd and Istio to provide more granular monitoring capabilities, such as HTTP latencies and error codes.
Nodes, pods, and containers are where your application is run. Metrics such as the number of nodes, pods, and containers in your cluster may be of use, as well as resource and network utilization of these components.
Application monitoring places emphasis on the metrics observed in your application. One key way to monitor applications is through the use of application performance management (APM). This relies on the detection and diagnosis of application-level performance problems to achieve or maintain desired performance SLOs. The application level, particularly areas that have direct customer impact, is arguably the most critical area on which to perform monitoring. We should note here that Brendan Burns, a co-creator of Kubernetes, emphasizes monitoring attributes that directly impact user experience.
Key metrics you can observe using APM are requests per second (RPS), error rates, and other application-specific metrics. You can also import monitoring libraries, such as the Prometheus client libraries, directly into your code to add custom instrumentation and create fine-grained observability metrics that help assess the performance and reliability of your application.
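As a rough illustration of these two metrics, the sketch below computes requests per second and error rate from a window of request records. The `RequestRecord` type, the choice to count HTTP 5xx responses as errors, and the sample data are all assumptions made for the example; an APM tool or a metrics library would normally collect and aggregate this for you.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    timestamp: float  # seconds since the start of the window (hypothetical)
    status: int       # HTTP status code of the response


def requests_per_second(records, window_seconds):
    """Average request rate over the observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return len(records) / window_seconds


def error_rate(records):
    """Fraction of requests that failed; 5xx counts as an error here."""
    if not records:
        return 0.0
    errors = sum(1 for r in records if r.status >= 500)
    return errors / len(records)


# Hypothetical 60-second window: 600 requests, 6 of them server errors.
records = [
    RequestRecord(timestamp=i * 0.1, status=500 if i % 100 == 0 else 200)
    for i in range(600)
]

print(requests_per_second(records, window_seconds=60))  # 10.0
print(error_rate(records))                              # 0.01
```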
Kubernetes Monitoring Implementations
Method 1: Vanilla Kubernetes Monitoring
A vanilla deployment of Kubernetes offers limited monitoring solutions that are primarily focused on resource utilization information instead of providing the overall system and application metrics. Three examples of tooling provided in a vanilla distribution of Kubernetes are Kubernetes Metrics Server, Node Problem Detector and cAdvisor.
The Metrics Server is a scalable container resource metrics tool that collects resource usage, primarily CPU and memory, from the kubelet on each node and exposes the results through the Kubernetes API Server for querying, as shown in Figure 1. Node Problem Detector runs as a DaemonSet that detects node health problems on each node and reports them to the Kubernetes API Server for querying.
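To show what querying these results can look like, the following sketch summarizes CPU and memory usage per pod from a response shaped like the `metrics.k8s.io/v1beta1` pod metrics list. The payload here is a hypothetical, hand-written sample; in a real cluster you might obtain a similar document with `kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods`. The quantity parsing handles only the few suffixes used in the sample and is not a complete implementation of the Kubernetes quantity format.

```python
import json

# Hypothetical sample in the shape of a PodMetricsList response.
SAMPLE = json.loads("""
{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "items": [
    {
      "metadata": {"name": "web-0", "namespace": "default"},
      "containers": [
        {"name": "app", "usage": {"cpu": "250m", "memory": "64Mi"}},
        {"name": "sidecar", "usage": {"cpu": "12500000n", "memory": "20480Ki"}}
      ]
    }
  ]
}
""")

_CPU_NANOCORES = {"n": 1, "u": 1_000, "m": 1_000_000}
_MEM_BYTES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '250m' or '12500000n' to millicores."""
    suffix = quantity[-1]
    if suffix in _CPU_NANOCORES:
        return int(quantity[:-1]) * _CPU_NANOCORES[suffix] / 1_000_000
    return float(quantity) * 1000  # plain value means whole cores


def memory_mib(quantity):
    """Convert a memory quantity such as '64Mi' or '20480Ki' to MiB."""
    suffix = quantity[-2:]
    if suffix in _MEM_BYTES:
        return int(quantity[:-2]) * _MEM_BYTES[suffix] / 1024**2
    return int(quantity) / 1024**2  # plain value means bytes


def pod_usage(pod_metrics_list):
    """Map each pod name to (total millicores, total MiB) across containers."""
    usage = {}
    for item in pod_metrics_list["items"]:
        cpu = sum(cpu_millicores(c["usage"]["cpu"]) for c in item["containers"])
        mem = sum(memory_mib(c["usage"]["memory"]) for c in item["containers"])
        usage[item["metadata"]["name"]] = (cpu, mem)
    return usage


print(pod_usage(SAMPLE))  # {'web-0': (262.5, 84.0)}
```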
Container Advisor, or cAdvisor, is a tool originally developed by Google and used to expose container resource and performance metrics. cAdvisor runs as a daemon and collects metrics for an individual container that can be viewed in a simplified dashboard, as seen in Figure 2 below.
cAdvisor was not built with a Kubernetes environment in mind, so it lacks Kubernetes context and the metrics it collects may be insufficient in scope. This lack of context was originally alleviated by an additional tool called Heapster. Heapster provided monitoring and performance analysis of containers at a cluster level but has since been deprecated by the Kubernetes team in favor of the Kubernetes Metrics Server and Prometheus.
All the tools mentioned above are limited to resource utilization and are not recommended as a stand-alone monitoring solution, particularly in large distributed clusters, as they do not provide thorough cluster-wide, pod-level, or application-level monitoring.
Method 2: Custom Monitoring Solutions
The cloud native ecosystem tends to prefer third-party monitoring solutions. Prometheus and other open-source projects found in the CNCF have become de facto monitoring solutions, alongside tools such as Epsagon.
Here are additional resources on how to set up monitoring for Kubernetes.
Best Practices for Monitoring Kubernetes
Monitoring Visualization and Alerting
The output of most monitoring solutions is structured text, which must be queried or parsed to extract useful information. Open-source tools like Grafana bring this data to life through intuitive visualizations of the metrics being monitored, allowing engineers to categorize and prioritize views as they see fit. Monitoring is also often paired with alerting tools that notify engineers when metrics exceed or fall below the limits specified by the SLO. An open-source tool like Prometheus Alertmanager can be used to notify engineers of SLO breaches, based on alerts raised from the metrics that Prometheus collects.
It is tempting to build systems that monitor every aspect of an application in order to gain insightful telemetry about your system. In practice, many of these metrics are rarely used, and they add significant storage overhead. Brendan Burns, a co-creator of Kubernetes, has stated that monitoring fatigue can also lead to significant increases in noise. Too much noise decreases sensitivity to actual problems, leading engineers to ignore crucial alerts. It also makes it difficult for engineers to respond adequately when a real problem occurs.
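The idea of alerting on an SLO limit can be sketched in a few lines. The example below fires only when a metric has stayed above its limit for several consecutive samples, which helps avoid paging on a single spike. This is a simplified illustration of threshold alerting in general, not how Prometheus or Alertmanager is implemented; the function name, parameters, and sample values are hypothetical.

```python
def should_alert(samples, threshold, for_samples):
    """True when the most recent `for_samples` values all exceed the threshold.

    Requiring a sustained breach, rather than a single bad sample,
    reduces alerts caused by short-lived spikes.
    """
    if for_samples <= 0:
        raise ValueError("for_samples must be positive")
    if len(samples) < for_samples:
        return False  # not enough data to confirm a sustained breach
    return all(value > threshold for value in samples[-for_samples:])


# Hypothetical p99 latency readings in milliseconds, one per scrape interval,
# checked against a 300 ms SLO limit.
brief_spike = [280, 290, 450, 285, 290]
sustained_breach = [280, 290, 310, 320, 350]

print(should_alert(brief_spike, threshold=300, for_samples=3))       # False
print(should_alert(sustained_breach, threshold=300, for_samples=3))  # True
```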
Monitoring is an essential part of building reliable distributed systems, as it answers the “what” and the “why” questions about anomalies in your system. When adopting monitoring solutions for your Kubernetes environment, make sure your team has set SLOs that are both attainable and able to serve as a benchmark for the availability and reliability of your systems.
If you decide to go with a native Kubernetes solution, give Kubernetes Metrics Server, Node Problem Detector, and cAdvisor a look. However, the industry is also showing great interest in dedicated tooling for custom monitoring solutions, such as Prometheus, which provides several added features, including visualizations (with help from Grafana) and alerting, to enhance your monitoring experience.