In this article, we’ll help you understand the key metrics and components to monitor in a Kubernetes environment as well as explain two open-source solutions to effectively monitor your Kubernetes cluster: Prometheus and Grafana.
Kubernetes is an open-source container management platform that helps you run containers at scale. Kubernetes simplifies the management of your containerized cloud-native ecosystem, but it also introduces complexity with a large number of components that need to be constantly monitored.
Leveraging metrics to get insights into your Kubernetes cluster is critical for production workloads. You also need to have the right monitoring strategy to ensure that all the key metrics are collected from various sources, aggregated, and visualized.
Sources of Metrics in Kubernetes
In Kubernetes, you can fetch system-level metrics from various out-of-the-box sources like cAdvisor, Metrics Server, and Kubernetes API Server. You can also fetch application-level metrics from integrations like kube-state-metrics and Prometheus Node Exporter.
Let’s explore all of these a bit more in detail.
cAdvisor
Container Advisor (cAdvisor) provides container-level metrics, exposing resource usage and performance data from running containers. It offers quick insight into the CPU usage, memory usage, and network receive/transmit of running containers. Because cAdvisor is embedded in the kubelet, you can scrape the kubelet to get container metrics, store the data in a persistent time-series store like Prometheus or InfluxDB, and then visualize it via Grafana.
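If you run your own Prometheus, a scrape job along these lines pulls cAdvisor metrics through the kubelet. This is only a sketch: the metrics path and authentication details vary by cluster setup.

```yaml
# Illustrative Prometheus scrape job for cAdvisor metrics via the kubelet.
scrape_configs:
  - job_name: "kubernetes-cadvisor"
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # The kubelet serves cAdvisor metrics under /metrics/cadvisor.
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
```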
Metrics Server
Metrics Server is a cluster-wide aggregator of resource usage data; it collects basic metrics like CPU and memory usage for Kubernetes nodes, pods, and containers. It's used by the Horizontal Pod Autoscaler and the Kubernetes dashboard itself, and users can access these metrics directly via the kubectl top command. Metrics Server replaced Heapster, which has been deprecated in newer versions of Kubernetes, as the primary metrics aggregator in the cluster.
Kubernetes API Server
The frontend of the Kubernetes control plane is the API server exposing all the capabilities that Kubernetes provides. It’s like a gateway and communication hub for the entire Kubernetes cluster. All the user requests pass through the API server, where it performs the client request validation and interacts with etcd for persistence of cluster state. All the cluster components communicate with each other via the API Server.
The API server is responsible for processing the API operations and storing the API objects in a persistent storage backend. It also provides a number of metrics that are critical from a cluster-operation perspective–mainly the Request Rate, Error Rate, and Duration (RED metrics)–for Kubernetes resources.
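Assuming recent Kubernetes metric names (older releases exposed apiserver_request_count and apiserver_request_latencies instead), the API server's RED metrics can be expressed in PromQL roughly as:

```promql
# Request Rate: API server requests per second, broken down by verb
sum(rate(apiserver_request_total[5m])) by (verb)

# Error Rate: share of requests returning a 5xx status code
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))

# Duration: 99th-percentile request latency, by verb
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))
```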
Prometheus Node Exporter
Node Exporter is the Prometheus exporter for hardware and operating system metrics. It lets you monitor node-level metrics such as CPU, memory, filesystem space, and network traffic, which Prometheus scrapes from a running Node Exporter instance. You can then visualize these metrics in Grafana.
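With Node Exporter metrics in Prometheus, node-level utilization becomes a handful of PromQL expressions (metric names are those exposed by current Node Exporter versions):

```promql
# CPU utilization per node: share of time not spent idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Available memory as a fraction of total memory
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Free filesystem space on the root mount
node_filesystem_avail_bytes{mountpoint="/"}
```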
kube-state-metrics
Kube-state-metrics is an add-on agent that listens to the Kubernetes API server. It generates metrics about the state of Kubernetes objects inside the cluster, like deployments, replica sets, nodes, and pods.
Metrics generated by kube-state-metrics differ from resource utilization metrics, which are geared primarily towards CPU, memory, and network usage. Kube-state-metrics exposes critical metrics about the condition of your Kubernetes cluster:
- Resource requests and limits
- Number of objects–nodes, pods, namespaces, services, deployments
- Number of pods in a running/terminated/failed state
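Assuming kube-state-metrics is installed and scraped by Prometheus, cluster-state questions like those above become simple PromQL queries (metric names can vary slightly across kube-state-metrics versions):

```promql
# Pods by phase (Running, Pending, Failed, ...)
sum(kube_pod_status_phase) by (phase)

# Deployments whose available replicas lag the desired count
kube_deployment_status_replicas_available < kube_deployment_spec_replicas

# CPU requests committed per namespace
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)
```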
Top Kubernetes Metrics to Monitor
Now that you know where to get your metrics, let’s take a look at which metrics you should be monitoring. Below are our top five.
Kubernetes Cluster Metrics
From a monitoring perspective, it’s important to have complete visibility into the state of your Kubernetes cluster. Having an overview of the number of running nodes, pods, and containers can help discover at what capacity the resources are running and give a clear representation of the deployed workload. Some of the other critical cluster metrics to look into are CPU usage, memory usage, network I/O pressure, and disk usage, all of which will indicate whether resources in the cluster are being utilized efficiently.
Kubernetes Internal Metrics
The master nodes run the Kubernetes control plane, which is responsible for monitoring the cluster, making scheduling decisions, and ensuring that the cluster runs in its desired state. Hence it’s critical to collect key metrics from the control plane components, like the API Server, Scheduler, Controller Manager, and etcd, and visualize them in one place–preferably in a Grafana dashboard via a Prometheus integration. These metrics provide a detailed view of cluster performance and also assist in troubleshooting issues.
Kubernetes Node Metrics
Each Kubernetes Node has a finite capacity of CPU and memory that can be leveraged by the running pods, so these two need to be monitored carefully. Other important metrics to monitor are disk-space usage and node-network traffic (receive and transmit). There are a number of node “conditions” defined that describe the status of the running nodes like Ready, MemoryPressure, DiskPressure, NetworkUnavailable, OutOfDisk, etc.
Kubernetes Pod/Container Metrics
From a pod-monitoring perspective, resource allocation is key. It is important to be cognizant of the pods that are either under-provisioned or over-provisioned from a CPU/Memory perspective since it can directly impact your application performance. Having metrics available to track containers’ restart activity and throttled containers is helpful while troubleshooting issues.
Kubernetes Application Metrics
You should leverage RED metrics (Request Rate, Error Rate, and Duration) to instrument the services running in Kubernetes and build dashboards for real-time monitoring. You should also monitor a few other application metrics, such as JVM memory, heap usage, and thread counts, to ensure that the services are running correctly.
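The three RED metrics can be sketched with a toy computation. This is a minimal, self-contained illustration; the request records below are made up, and in practice these numbers would come from instrumentation such as a Prometheus client library:

```python
from statistics import median

# Hypothetical request records: (http_status, latency_ms).
requests = [
    (200, 35.0), (200, 42.0), (500, 120.0),
    (200, 38.0), (404, 15.0), (200, 51.0),
]
window_seconds = 3.0  # length of the observation window

# Request Rate: requests per second over the window.
request_rate = len(requests) / window_seconds

# Error Rate: fraction of responses with a 5xx status.
error_rate = sum(1 for status, _ in requests if status >= 500) / len(requests)

# Duration: latency distribution summarized as p50 and p95
# (the p95 index here is a crude nearest-rank approximation).
latencies = sorted(lat for _, lat in requests)
p50 = median(latencies)
p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]

print(request_rate, error_rate, p50, p95)
```

In a real deployment you would compute these with PromQL over counter and histogram metrics rather than over raw records.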
Kubernetes Monitoring with Prometheus
In a dynamic cloud environment, where you have a large number of microservices and container workloads running, applications generate a huge amount of data–a challenge from a monitoring viewpoint. But Prometheus can help and is the industry-leading solution for monitoring containerized cloud-native environments.
Prometheus is an open-source monitoring system that features a functional query language called PromQL (Prometheus Query Language). This lets a user select time-series data to aggregate and then view the results as tabular data or graphs in the Prometheus expression browser; results can also be consumed by external systems via an API. Prometheus is primarily focused on the metrics space and is more suited for operational monitoring.
Prometheus is a pull-based monitoring system where you can expose the metrics as an HTTP endpoint. Its server can scrape metrics from your services running in Kubernetes via service discovery and collects and stores the metrics in a local time-series database. They can then be made available via an API and directly queried using PromQL or viewed in Grafana dashboards.
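The plain-text format Prometheus scrapes is simple enough to sketch by hand. The snippet below is a stdlib-only illustration of the text exposition format, with hypothetical metric names; real services should use the official Prometheus client library, which also serves the HTTP /metrics endpoint for you:

```python
def render_prometheus(metrics):
    """Render {(metric_name, labels): value} in the Prometheus text
    exposition format, e.g.: http_requests_total{code="200"} 1027"""
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        if labels:
            label_str = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical counters a service might track (names are examples).
metrics = {
    ("http_requests_total", (("code", "200"),)): 1027,
    ("http_requests_total", (("code", "500"),)): 3,
}
print(render_prometheus(metrics))
```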
Alerting is an important component in Kubernetes monitoring. When there are issues in your application stack, you need to alert the relevant teams about the problem as soon as possible. This enables them to act on the problem quickly and fix the root cause to minimize application downtime.
You can define the alert rules in the Prometheus configuration. If the alerting conditions are met, Prometheus sends alerts to AlertManager, which manages the alerts by performing operations such as deduplication, silencing, grouping, and rate-limiting. It then sends the right notifications via email, Opsgenie, PagerDuty, and other notification systems.
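A rule along these lines illustrates the idea; the metric comes from kube-state-metrics, and the threshold, labels, and group name are example values:

```yaml
# Illustrative Prometheus alerting rule.
groups:
  - name: kubernetes-alerts
    rules:
      - alert: KubeNodeNotReady
        # Fires when a node's Ready condition has been false for 5 minutes.
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
```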
The expression browser in Prometheus is beneficial for running ad hoc queries, writing PromQL expressions, debugging, and viewing data stored inside Prometheus in a graphical representation.
Kubernetes Data Visualization with Grafana
Grafana is an open-source data visualization and analytics tool that can monitor your time-series data. It allows you to query several datastores, visualize, send alerts, and understand the metrics. Grafana has native Prometheus support and also supports a large number of databases, including InfluxDB, Elasticsearch, AWS CloudWatch, Graphite, etc. Plus, it comes with many built-in reusable dashboards to bring your data together and share it.
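Grafana can also be connected to Prometheus declaratively. This is an illustrative data source provisioning file; the URL is an assumed in-cluster Prometheus address, not a value from this article:

```yaml
# Example Grafana data source provisioning file, e.g. placed under
# /etc/grafana/provisioning/datasources/.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc:9090
    isDefault: true
```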
Creating Dashboards in Grafana
Grafana dashboards provide you with deeper insights into the health and performance of your Kubernetes cluster and the applications running in it. The dashboards in Grafana come with a variety of panels that can fetch data from underlying data sources. You can combine different Kubernetes metrics and display them in a consolidated dashboard, which you can use for real-time monitoring and alerting when thresholds are met. This is very helpful for troubleshooting any production issues, understanding the metrics, and diagnosing the root cause.
Grafana offers several official and community-built dashboards, which you can leverage without having to create dashboards from scratch by yourself. These out-of-the-box dashboards are pre-configured and allow you to view metrics from many data sources. While designing your dashboard, you can choose from a number of visualization options, or Panels, like Graph, Singlestat, Gauge, Table, Text, Heatmap, Alert List, etc. And there is a variety of plugins available to enhance the visualization of your data as well.
Self-Hosted vs. Managed Kubernetes
Enterprises are eager to leverage the benefits of Kubernetes to scale and grow their application landscape. You can either manage the Kubernetes environment yourself or go with a managed solution.
Having a self-managed Kubernetes cluster gives you a solid understanding of the core components and experience in handling the infrastructure. However, it does introduce the complexity of building, managing, and operating your own Kubernetes environment. Right off the bat, Kubernetes is hard to deploy and also difficult to operate at scale. Upgrading Kubernetes versions, applying the latest patches, and managing the entire node lifecycle can get overwhelming. Additionally, it is a challenge to find IT infrastructure personnel with the right domain expertise who can help you build and maintain a Kubernetes environment that is highly available, resilient, performant, observable, and secure.
So, what to do?
What Is Managed Kubernetes?
Managed Kubernetes is when third-party providers set up and manage the operation of your Kubernetes workload, including processes like high availability, scalability, version upgrades, security, and more. Luckily, almost all of the major cloud providers offer managed Kubernetes hosting, so you don’t need to be an expert in Kubernetes infrastructure maintenance.
The main idea behind leveraging a managed Kubernetes platform is to let your development teams focus on business capabilities instead of investing time in creating, monitoring, and maintaining the cluster. This is also a good risk-reduction measure because you are handing over cluster maintenance to trusted providers who will ensure that your environment is highly scalable, secure, resilient, and up-to-date. In turn, this helps you accelerate the development of your cloud-native applications, easily manage your infrastructure, and, as a result, reduce your time to market.
In the next section, we’ll discuss how to eliminate the heavy lifting required to run a Kubernetes cluster using an AWS managed offering called Amazon Elastic Kubernetes Service.
Using AWS Managed Services for Hosting a Kubernetes Workload
Amazon EKS provides a platform for enterprises to run production-grade Kubernetes workloads. Out of the box, it gives you a Multi-AZ and highly available architecture that meets a 99.9% service uptime requirement for every cluster you run. EKS gives you a managed Kubernetes control plane with highly available master and etcd components, while worker nodes will need to run as EC2 instances in your own account. Another big consideration for organizations is their need to meet security and compliance requirements. They can do this by hosting their sensitive and regulated workloads on EKS since it is already HIPAA, ISO, and PCI compliant.
Amazon EKS makes it easy to run your production services at cloud-scale. If you have other workloads running in AWS, you will get seamless integration with AWS services, such as compute, logging, monitoring, auditing, security, storage, routing, and more. This means you will have the flexibility and convenience of using the capabilities of the cloud to build your application architecture. It also provides zero-downtime version upgrades, patching, security hardening, and better support for faster resolution of issues.
Amazon EKS allows developers to focus on business capabilities rather than an infrastructure’s setup. And when there is a spike in application traffic, workloads are autoscaled, resulting in better cost management and performance. Amazon EKS is geared towards leveraging the best practices to optimize the cost of running a Kubernetes workload in the cloud. AWS also actively contributes bug fixes, tooling improvements, and security patches to the open-source Kubernetes community to maximize functionality for its users.
You can provision an EKS Cluster by using:
- Amazon EC2 – Deploy Worker Nodes
- Amazon Fargate – Deploy Serverless Containers
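With the eksctl CLI, the two options above look roughly like this; the cluster name, region, and node count are placeholder values:

```shell
# EC2 worker nodes:
eksctl create cluster --name demo-cluster --region us-east-1 \
  --nodegroup-name workers --nodes 3

# Fargate (serverless) pods instead of self-managed EC2 nodes:
eksctl create cluster --name demo-cluster --region us-east-1 --fargate
```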
Fargate is a serverless compute platform for containers in AWS. There is no need to provision and manage servers, and you pay only for the resources you use. So, you can run Kubernetes in a cost-optimized and highly available cluster without in-depth knowledge of Kubernetes operations. A big advantage of using Fargate is that you don’t have to worry about scaling, patching, or securing the EC2 instances running your Kubernetes workloads.
Kubernetes has revolutionized the container ecosystem by simplifying the deployment of containerized applications at scale. However, it does introduce operational complexity and observability challenges. When you are creating the monitoring strategy for Kubernetes-based production workloads, it’s important to keep in mind the top metrics to monitor along with the various monitoring tools we discussed in this article.