Kubernetes is an open-source platform used to orchestrate containerized applications. It supports various container runtimes, but the most popular one is Docker. Most applications are now developed using Docker, as it provides the flexibility to build your application once and run it anywhere without any modifications. However, managing Docker containers at runtime involves several challenges, such as communication between the containers, load balancing, service discovery, self-healing, and more. Kubernetes addresses these concerns and has thus become the de facto platform for running Docker containers.
Several managed Kubernetes products, including EKS, AKS, and GKE, have emerged in the market. But EKS has carved out a special niche for itself with a fully managed Kubernetes-as-a-service offering. In this article, we'll discuss logging, monitoring, and metrics for EKS. But first, let's get to know what EKS is all about.
High-Level Amazon EKS Features
- Kubernetes has two major components: the Control Plane and Worker Nodes (Cluster). EKS manages the control plane by itself, taking away the mammoth task of configuring and managing it from your DevOps team.
- EKS provides an option to run Kubernetes on a serverless architecture using Fargate, where the team doesn’t even have to worry about the worker node setup.
- EKS is very well integrated with other AWS Services, like CloudWatch, IAM, VPC, Auto Scaling Group, and ELB, providing a seamless experience for high availability, load balancing, monitoring, and security.
- A September 2019 enhancement to EKS added support for assigning IAM permissions to service accounts, giving you fine-grained, pod-level access control when running multiple pods on an EC2 instance. Before that, you could only assign IAM permissions at the EC2 instance level, and all the pods on that instance would have the same role access.
- EKS integrates with AWS App Mesh, a service mesh built on the Envoy proxy, providing a Kubernetes-native experience of observability, traffic controls, and security features.
- EKS is certified Kubernetes-conformant, so you get all the benefits of the open-source tools built by the Kubernetes community.
Having all of these features makes EKS a unique product for Kubernetes-as-a-service.
Monitoring in Amazon EKS
As described earlier, EKS has two major components: the Control Plane and Worker Nodes. Logging and monitoring for both components work differently, so let’s explore them one by one.
Control Plane Logging and Monitoring
The control plane is managed by AWS itself, so it doesn’t provide granular control to the DevOps team. However, like any other AWS service, EKS also has an integration with CloudWatch for logging and monitoring of the control plane, where the EKS control plane sends audit and diagnostic logs to CloudWatch Logs.
EKS control-plane logging offers several log types, each of which can be enabled or disabled per cluster using the AWS Management Console or CLI. Each of the log types listed below corresponds to a component of the Kubernetes control plane:
- API server component logs: The API server is the Kubernetes component that exposes the K8s APIs. It emits logs with the API request details made to the EKS cluster.
- Audit logs: These logs provide data on every individual user, admin, or system component that interacts with the EKS cluster via the API.
- Authenticator logs: These logs are unique to EKS. They record authentication requests made to the cluster using Kubernetes Role-Based Access Control (RBAC) with IAM credentials.
- Controller manager logs: The controller manager monitors the state of the cluster and makes changes to move it to the desired state. Controller manager logs offer data on these actions.
- Scheduler logs: These record the activity of the scheduler component that takes care of running the pods in the cluster. It records when and where pods are running.
Note: By default, control plane logs are not sent to CloudWatch Logs. You can use Log Group and Log Stream in CloudWatch Logs to view these logs; however, you must enable each type individually.
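As a sketch of how you might turn these log types on, the AWS CLI accepts a logging configuration in the `update-cluster-config` call; the cluster name and region below are placeholders:

```shell
# Enable all five control-plane log types for a cluster. The cluster name
# and region are placeholders; substitute your own values.
CLUSTER_NAME="my-eks-cluster"
LOGGING='{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

# Requires AWS credentials with eks:UpdateClusterConfig permission:
# aws eks update-cluster-config --region us-east-1 \
#   --name "$CLUSTER_NAME" --logging "$LOGGING"
echo "$LOGGING"
```

You can pass `"enabled":false` for a subset of types in the same call to disable them individually.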
Monitoring Amazon EKS Metrics
Performance monitoring is an essential part of any platform's operation. In EKS, the control plane is managed by AWS, but operators will still want to know how it's performing. Similarly, operators need to keep a watchful eye on the performance of cluster nodes, pods, and other components to ensure they are not having any negative impact on applications. All in all, there are several metrics you have to monitor. Below, we'll talk about some of the most important ones.
Control Plane Metrics
Operators don't have direct access to most control plane components, like the API server, scheduler, and controller manager, so visibility into their performance is limited. But there are a few metrics exposed through the API server that can give you an idea as to how things are going:
- apiserver_request_latencies_sum gives you visibility into how much time the API server takes to process requests.
- rest_client_request_latency_microseconds_sum tells you how much latency is observed by the controller manager.
- etcd_request_latencies_summary_sum shows latency-related data observed by etcd.
All of these metrics help you understand how control plane connectivity is working in the EKS cluster. Any surge or spike indicates that there are issues in need of attention.
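These metrics are served in the Prometheus text exposition format from the API server's /metrics endpoint. The sketch below filters sample lines standing in for that output; the metric label values are illustrative, not real cluster data:

```shell
# With cluster access, the raw metrics come from:
#   kubectl get --raw /metrics
# The sample below stands in for that output so the filter can be shown.
SAMPLE='apiserver_request_latencies_sum{resource="pods",verb="LIST"} 1953266
rest_client_request_latency_microseconds_sum{url="https://127.0.0.1/api"} 88310
etcd_request_latencies_summary_sum{operation="get"} 45121'

# Pull out the three latency metrics discussed above.
echo "$SAMPLE" | grep -E '^(apiserver_request|rest_client_request|etcd_request)'
```

Tracking these sums over time (e.g., their rate of change) is what reveals a surge or spike.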
Cluster State Metrics
Cluster state metrics provide information on the state of various objects in Kubernetes. The most important objects to monitor for the performance of clusters are pods and nodes, as they give an almost complete picture of a production environment’s performance.
The most popular tool to get these metrics is kube-state-metrics, an open-source project from the Kubernetes community that generates data on objects by listening to the API server.
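One common way to deploy kube-state-metrics is via the community Helm chart; the repository URL and chart name below assume the prometheus-community Helm repository, and the commands require a kubeconfig pointed at your EKS cluster:

```shell
# Deploy kube-state-metrics from the community Helm chart (assumes the
# prometheus-community repository; requires cluster access to apply):
REPO="https://prometheus-community.github.io/helm-charts"
# helm repo add prometheus-community "$REPO"
# helm repo update
# helm install kube-state-metrics prometheus-community/kube-state-metrics

# Once running, the metrics are served on the service's port 8080:
# kubectl port-forward svc/kube-state-metrics 8080:8080
# curl -s localhost:8080/metrics | head
echo "$REPO"
```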
The table below shows some of the important metrics to watch:
| Object | Metric | Description |
|---|---|---|
| Node | kube_node_status_condition | The status of several node conditions; value would be true/false/unknown |
| Node | kube_node_spec_unschedulable | Whether a node is able to schedule new pods or not |
| Deployment | kube_deployment_status_replicas | How many pods are running in the deployment |
| Deployment | kube_deployment_spec_replicas | How many pods are configured as desired for a deployment |
| Deployment | kube_deployment_status_replicas_available | How many pods are available for a deployment |
| Deployment | kube_deployment_status_replicas_unavailable | How many pods are down and cannot be used by a deployment |
| Deployment | kube_deployment_spec_strategy_rollingupdate_max_unavailable | Maximum number of unavailable pods during a rolling update of a deployment |
| Pod | kube_pod_status_ready | Whether a pod is ready to serve client requests |
| Pod | kube_pod_status_phase | Current phase of the pod; value would be pending/running/succeeded/failed/unknown |
| Pod | kube_pod_container_status_waiting_reason | Reason a container is in a waiting state |
| Pod | kube_pod_container_status_terminated | Whether the container is currently in a terminated state or not |

Table 1 – kube-state-metrics to be monitored for EKS cluster performance
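As a minimal sketch of acting on these cluster-state metrics, the check below compares a deployment's desired replica count against its available count; the two numbers are hypothetical stand-ins for what kube_deployment_spec_replicas and kube_deployment_status_replicas_available would report:

```shell
# Hypothetical values as kube-state-metrics would report them for one
# deployment (kube_deployment_spec_replicas and
# kube_deployment_status_replicas_available).
DESIRED=3
AVAILABLE=2

# A deployment is degraded when fewer replicas are available than desired.
if [ "$AVAILABLE" -lt "$DESIRED" ]; then
  echo "deployment degraded: $AVAILABLE/$DESIRED replicas available"
fi
```

This is exactly the kind of comparison monitoring systems alert on when a deployment loses replicas.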
Resource Metrics
Resource metrics are all about CPU, memory, and other resource utilization. When you compare the available resources with the requested resources, you get an idea as to whether or not the cluster has the capacity to accept new workloads and can run the current state without failures. Resource utilization metrics are emitted by individual containers and can be retrieved at the pod level as the sum of all the containers' resources.
| Object | Metric | Description |
|---|---|---|
| Pod | kube_pod_container_resource_requests | Resources requested by a container; e.g., the sum of memory resources requested by a namespace and a pod in a particular node |
| Pod | kube_pod_container_resource_limits | Limit requested for each resource of a container |
| Node | kube_node_status_capacity | Capacity of each resource on the node with the unit quantity, e.g., the number of pods it can hold |
| Node | kube_node_status_allocatable | Amount of each node resource that is available for scheduling |
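To illustrate the capacity comparison described above, the sketch below checks whether a new pod's CPU request fits within a node's allocatable capacity; all figures are hypothetical, in millicores:

```shell
# Hypothetical CPU figures, in millicores, as kube_node_status_allocatable
# and kube_pod_container_resource_requests would report them.
ALLOCATABLE_CPU_M=1930   # CPU the node can allocate to pods
REQUESTED_CPU_M=1600     # sum of CPU requests already placed on the node
NEW_POD_REQUEST_M=500    # CPU request of a pod awaiting scheduling

# The pod fits only if its request plus the existing requests stay within
# the node's allocatable capacity.
if [ $((REQUESTED_CPU_M + NEW_POD_REQUEST_M)) -le "$ALLOCATABLE_CPU_M" ]; then
  echo "pod fits on this node"
else
  echo "pod would be unschedulable on this node"   # 2100 > 1930
fi
```

The Kubernetes scheduler performs this same kind of fit check (across CPU, memory, and other resources) when placing pods.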
EKS uses EBS volumes for storage. When a root volume runs low on disk space, the kubelet on that node detects it, and the scheduler will no longer place new pods on the node. To monitor disk utilization, you need to watch metrics like nodefs.available and imagefs.available, which report the disk space still available on each node.
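The disk thresholds that trigger this behavior are the kubelet's hard-eviction thresholds; the values shown below mirror the kubelet's documented defaults for these two signals and can be tuned via the kubelet configuration or its --eviction-hard flag:

```shell
# The kubelet's default hard-eviction thresholds include these two disk
# signals; tune them via the kubelet configuration file or the
# --eviction-hard flag at node bootstrap.
EVICTION_HARD="nodefs.available<10%,imagefs.available<15%"

# e.g.: kubelet ... --eviction-hard="$EVICTION_HARD"
echo "$EVICTION_HARD"
```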
CloudWatch Container Insights Monitoring
So far, we’ve discussed all important metrics available using kube-state-metrics, which is Kubernetes’ native way of monitoring. However, AWS launched its Container Insights service last year in CloudWatch to add detailed monitoring for EKS. The key features of this service include:
- Metrics like CPU, memory, disk, and network utilization
- Diagnostic information to identify and resolve issues faster, e.g., logs for container restart failures
- Alarms on metrics collected
- Dashboard visualizations
- Support for log and metric encryption using AWS KMS customer master keys (CMKs)
- Log and metric collection using the containerized CloudWatch Agent and FluentD
To show the Container Insights metrics on the CloudWatch dashboard, you need to make sure you have the CloudWatch Agent and FluentD set up.
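One way to set both agents up is the quick-start manifest AWS publishes for Container Insights; the manifest URL and its placeholders below follow the Container Insights setup guide and should be verified against the current AWS documentation, and the cluster name and region are placeholders:

```shell
# Cluster name and region are placeholders; the manifest URL follows the
# Container Insights quick-start guide (verify against current AWS docs).
CLUSTER_NAME="my-eks-cluster"
REGION="us-east-1"

# curl -s https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml \
#   | sed "s/{{cluster_name}}/$CLUSTER_NAME/;s/{{region_name}}/$REGION/" \
#   | kubectl apply -f -
echo "Container Insights agents target: $CLUSTER_NAME ($REGION)"
```

This deploys the CloudWatch Agent and FluentD as DaemonSets so every node ships metrics and logs.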
You can view Container Insight metrics by going to the “Performance Monitoring” tab in CloudWatch and selecting the “resource type” to view. You can then use CloudWatch Logs Insights to query the metrics data and generate dashboards.
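As a sketch of querying that data, the Logs Insights query below averages node CPU utilization per node; the log group name pattern and the node_cpu_utilization field follow the Container Insights documentation, and the cluster name is a placeholder:

```shell
# Log group name pattern and field names follow Container Insights
# conventions; the cluster name is a placeholder.
LOG_GROUP="/aws/containerinsights/my-eks-cluster/performance"
QUERY='STATS avg(node_cpu_utilization) AS avg_cpu BY NodeName | SORT avg_cpu DESC'

# Run it for the last hour via the CLI (requires AWS credentials):
# aws logs start-query --log-group-name "$LOG_GROUP" \
#   --start-time "$(date -d '-1 hour' +%s)" --end-time "$(date +%s)" \
#   --query-string "$QUERY"
echo "$QUERY"
```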
Prometheus Metrics Monitoring for Amazon EKS
Prometheus is a well-known monitoring tool for metrics that you can use in Amazon EKS to monitor control plane metrics. The Kubernetes API server exposes several metrics through a metrics endpoint (/metrics). This endpoint is exposed over the EKS control plane.
To enable Prometheus metrics in EKS, you need to deploy Prometheus on EKS using Helm Charts. Once installed, you can view the Prometheus console on your browser.
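A minimal install sketch, assuming the prometheus-community Helm repository and a kubeconfig pointed at the cluster:

```shell
# Install Prometheus from the community chart (assumes the
# prometheus-community repository; requires cluster access):
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# helm install prometheus prometheus-community/prometheus \
#   --namespace prometheus --create-namespace

# The chart's server service listens on port 80; forward it locally and
# browse to http://localhost:9090 to reach the console:
PORT_FORWARD="kubectl -n prometheus port-forward svc/prometheus-server 9090:80"
# $PORT_FORWARD
echo "$PORT_FORWARD"
```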
Prometheus is commonly paired with Grafana to display metrics on polished dashboards. However, AWS has now launched the Container Insights Prometheus Metrics Monitoring feature (in beta), which automates the discovery of Prometheus metrics from containerized workloads. It also supports ingesting custom metrics into CloudWatch. Once this feature is generally available, it will reduce the number of tools required to monitor the cluster.
Observability within Amazon EKS
Kubernetes clusters, deployments, pods, and containers, as you know, can be hard to configure and maintain. Epsagon leverages the power of Prometheus (without the need to manage or provision it) to provide best-in-class monitoring and alerting with a rich user experience and interface.
With its integration into your cloud environment, Epsagon provides ease of management with its sleek dashboard and ability to see everything in production, automated monitoring, a wide variety of correlated performance metrics, alerting integrated with your communication channels, and fast troubleshooting—in seconds.
Epsagon's applied observability platform provides automated correlation of traces, metrics, and logs within a single dashboard. The correlation between application-level tracing and infrastructure-level metrics enables developers to look at a cluster or a pod and go right to the relevant traces, without tedious and time-consuming searches through hundreds, even thousands, of logs to identify issues and their sources.