Monitoring container orchestrators, such as Amazon ECS, can be difficult due to the number of components involved. ECS has three distinct layers: cluster, workload, and applications, each requiring its own monitoring strategy. This article takes a deep dive into each ECS layer, describes best practices for monitoring ECS, and reviews common monitoring tools you can leverage.
ECS is a container orchestrator created by Amazon. The purpose of ECS is to schedule and execute containers. End-users interact with ECS using a declarative API, which allows them to specify a Docker image, how many instances of that image should be running, and the resources that are available to each instance. ECS takes this declaration and brings the cluster to the specified state. For example, if three containers are specified, ECS will start three instances. If one instance stops, ECS will replace it so that the cluster state is always converging on the declaration.
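This converge-on-the-declaration behavior can be sketched as a toy control loop (a simplified model for illustration only, not ECS's actual implementation):

```python
def reconcile(desired: int, running: int) -> dict:
    """Converge the running task count toward the declared desired count --
    a toy model of the reconciliation loop an orchestrator like ECS runs."""
    if running < desired:
        return {"action": "start", "count": desired - running}
    if running > desired:
        return {"action": "stop", "count": running - desired}
    return {"action": "none", "count": 0}

# Three tasks declared, one has stopped: the orchestrator starts a replacement.
reconcile(desired=3, running=2)  # {"action": "start", "count": 1}
```

The key design point is that the API is declarative: users state the end state, and the orchestrator repeatedly compares it against reality and acts on the difference.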
A Note on Reservation & Utilization
Two particularly important concepts for monitoring ECS are reservation and utilization. Tasks specify their desired resources, and if the cluster has those resources available, it “reserves” them for the task. Reservation is calculated as the total resources reserved by tasks divided by the total resources available in the cluster. Consider a cluster with 10 GB of memory and 10 CPUs: if a task requests 5 GB of memory and 5 CPUs, the cluster will reserve that amount, and those resources will not be available to other tasks. In this case, the cluster is 50% reserved. Reservation helps to determine if a cluster is underprovisioned.
Utilization is a similar concept, calculated as the total resources actually in use divided by the total resources available. Utilization applies both to clusters as a whole and to individual services, and helps to determine if a service or cluster is overprovisioned.
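In code, the two ratios are straightforward (a minimal sketch; the numbers mirror the 10 GB cluster example above):

```python
def reservation_pct(reserved: float, total: float) -> float:
    """Fraction of cluster resources reserved by tasks, as a percentage."""
    return 100.0 * reserved / total

def utilization_pct(used: float, total: float) -> float:
    """Fraction of cluster resources actually in use, as a percentage."""
    return 100.0 * used / total

# The example above: a 10 GB cluster with one task reserving 5 GB,
# of which only 2 GB is actually being used.
reservation_pct(reserved=5, total=10)   # 50.0
utilization_pct(used=2, total=10)       # 20.0
```

A large gap between reservation and utilization (50% reserved but only 20% used here) is the signal that a service is overprovisioned.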
There are a number of tools available to monitor ECS, and they fall into two major classes: first-party tools, which come from Amazon itself, and third-party tools, which offer richer features, a better user experience, and/or improved UIs compared with Amazon's offerings.
CloudWatch is Amazon’s built-in metrics solution. All Amazon services either emit metrics to CloudWatch by default or can be configured to do so. CloudWatch provides metrics visualization, dashboarding, and alerting capabilities. Amazon also maintains documentation for the CloudWatch metrics of each service, including the complete list of ECS metrics.
CloudWatch metrics are pretty sparse out of the box, only providing GPU, CPU, and memory stats per cluster and per service. CloudWatch also offers a logging solution, enabled by default, which provides a location for container application logs and is important for monitoring at the application level.
Epsagon integrates directly with your Amazon account to receive ECS metrics. It provides an enhanced user experience over CloudWatch, curated views into ECS, and deeper integrations for a full-picture view into ECS and the applications that it runs. Epsagon also tracks container resource usage and distributed traces to help users understand how ECS affects their application performance.
Container Insights is an Amazon offering that provides a uniform view into cluster and task metrics. It enhances the default ECS metrics to include visibility into tasks and service states. Prior to Container Insights, it was difficult to obtain a full view of ECS cluster health. Figure 3 shows how Container Insights can provide a single view into cluster, workload, and application dimensions of ECS.
Monitoring Amazon ECS
ECS is composed of three logical layers, each requiring its own metrics and monitoring strategy:
- Cluster: The underlying container instances, i.e., EC2, which contain the resources required to execute services/tasks.
- Workload: The tasks and services that the ECS cluster is responsible for executing.
- “Userspace”/Applications: The end-user Docker applications executing in containers, which comprise the ECS workload.
Since the cluster layer is composed of EC2 container instances, monitoring here focuses on the instances themselves and their system resource usage. There are also derived metrics, like utilization, that are important. Common questions encountered when monitoring ECS clusters are:
- How many container instances are there?
- What are the container instance sizes?
- What is the cluster utilization in total and per container instance?
- What is the cluster reservation in total and per container instance?
- Which OS resources is a container instance using?
The metrics required to answer these questions are:
- Instance Resource Usage (CPU, memory, disk, network)
- Cluster utilization (total and per host)
- Cluster reservation (total and per host)
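Given per-instance resource stats, the derived cluster metrics above reduce to simple aggregation. A hedged sketch (the record shape and field names below are illustrative, not an actual AWS API response; CPU is in ECS-style CPU units):

```python
# Hypothetical per-instance stats; in practice these would come from
# CloudWatch or the ECS container agent.
instances = [
    {"id": "i-aaa", "cpu_total": 4096, "cpu_reserved": 3072, "cpu_used": 1024},
    {"id": "i-bbb", "cpu_total": 4096, "cpu_reserved": 1024, "cpu_used": 512},
]

def per_host(stats):
    """Reservation and utilization percentages per container instance."""
    return {
        s["id"]: {
            "reservation": 100.0 * s["cpu_reserved"] / s["cpu_total"],
            "utilization": 100.0 * s["cpu_used"] / s["cpu_total"],
        }
        for s in stats
    }

def cluster_totals(stats):
    """Reservation and utilization percentages for the cluster as a whole."""
    total = sum(s["cpu_total"] for s in stats)
    return {
        "reservation": 100.0 * sum(s["cpu_reserved"] for s in stats) / total,
        "utilization": 100.0 * sum(s["cpu_used"] for s in stats) / total,
    }
```

Tracking both views matters: a cluster can look healthy in aggregate while a single hot instance is nearly fully reserved.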
Auto Scaling Events
Understanding metrics is essential for autoscaling. Autoscaling is triggered by CloudWatch alarms, which set conditions on important metrics, such as CPU utilization. When those conditions are triggered, autoscaling can happen at the service level or at the cluster level. Service-level autoscaling will add or remove service instances, and cluster scaling will add or remove the capacity to run more services by modifying the EC2 container instance count.
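The alarm conditions that drive autoscaling follow a simple rule: a metric must breach its threshold for a number of consecutive evaluation periods before the alarm fires. A simplified sketch of that evaluation logic (CloudWatch's real behavior has more options, such as "M out of N" datapoints):

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return 'ALARM' if the last `evaluation_periods` datapoints all
    exceed `threshold` -- the consecutive-breach rule in its simplest form."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) == evaluation_periods and all(d > threshold for d in recent):
        return "ALARM"
    return "OK"

# CPU utilization samples (percent), threshold 75% over 3 periods:
alarm_state([60, 80, 82, 91], threshold=75, evaluation_periods=3)  # "ALARM"
alarm_state([80, 60, 91], threshold=75, evaluation_periods=3)      # "OK"
```

Requiring several consecutive breaches is what keeps a single noisy datapoint from triggering a scaling action.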
ECS Auto Scaling sends events to CloudWatch whenever the instance count changes. As your services grow and multiply, so do your containers, and each new container instance launches with an ECS container agent that registers it with the cluster. It’s useful to track instance launches and terminations as your ecosystem changes; monitoring these events offers a glimpse into cluster group behavior over time.
These scaling events are automated, as they reflect the resource demands of your systems. You need more instances as you deploy more applications, and instances containing deleted applications should terminate automatically.
The workload layer consists of the tasks and services that ECS is executing. Common issues and questions that arise when monitoring ECS workloads are:
- How many tasks are running on the cluster? By service? By container instance?
- What are the states of the tasks running?
- How long have tasks been running?
- What are the resource requirements of tasks?
- Is a service overprovisioned?
The metrics required to answer these questions are:
- Number of tasks by service/name
- Utilization by service
- Task state
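Several of these workload counts fall out of simple aggregation once you have a task list. A hedged sketch (the task records and field names below are illustrative, not the actual shape of an ECS `ListTasks`/`DescribeTasks` response):

```python
from collections import Counter

# Illustrative task records; in practice these come from the ECS API.
tasks = [
    {"service": "web",    "state": "RUNNING", "instance": "i-aaa"},
    {"service": "web",    "state": "RUNNING", "instance": "i-bbb"},
    {"service": "worker", "state": "PENDING", "instance": "i-aaa"},
]

tasks_by_service  = Counter(t["service"]  for t in tasks)  # web: 2, worker: 1
tasks_by_state    = Counter(t["state"]    for t in tasks)  # RUNNING: 2, PENDING: 1
tasks_by_instance = Counter(t["instance"] for t in tasks)  # i-aaa: 2, i-bbb: 1
```

A task stuck in `PENDING` is often the first visible symptom of an underprovisioned cluster: there is no instance with enough unreserved resources to place it.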
The final dimension of ECS metrics is the application layer, which consists of the Docker containers running on ECS. Common questions that arise when monitoring the application layer are:
- Which business events is a service generating (i.e., user logins, transactions, or any other custom application-level event)?
- How many instances of an application are running?
- How much CPU or system-level resources (such as file descriptors) is an individual task using?
And the important metrics to answer these questions are:
- Resource usage from the Docker perspective (memory, CPU, network, disk, system)
- Application-level business metrics
- Transactions (tracing)
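One common way to get application-level business events into a metrics system is CloudWatch's `PutMetricData` API. A sketch of building such a payload (the namespace, metric name, and dimension values are illustrative assumptions, not a prescribed convention):

```python
def business_metric(namespace, name, value, **dimensions):
    """Build a CloudWatch PutMetricData-style payload for a custom
    application metric (e.g., user logins). Names here are illustrative."""
    return {
        "Namespace": namespace,
        "MetricData": [{
            "MetricName": name,
            "Value": value,
            "Unit": "Count",
            "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        }],
    }

payload = business_metric("MyApp", "UserLogins", 1, Service="web")
# A real call would look like:
#   boto3.client("cloudwatch").put_metric_data(**payload)
```

Tagging the metric with a `Service` dimension is what lets you later slice business events by the ECS service that produced them.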
Alerting provides a way to automate monitoring. Instead of manually looking at metrics, alerting sets up conditions on metrics; when those conditions are encountered, an event is emitted that can be used to trigger notifications (commonly via email, incident management service, or chat) or autoscaling events. Amazon supports alerts through CloudWatch alarms.
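Concretely, a CloudWatch alarm is defined by a metric, a threshold, and an evaluation window. A hedged sketch of the keyword arguments one might pass to the `PutMetricAlarm` API to alert on high cluster CPU reservation (the alarm name, cluster name, SNS ARN, and thresholds are all illustrative placeholders):

```python
# Illustrative alarm definition for the AWS/ECS CPUReservation metric.
alarm = {
    "AlarmName": "ecs-cluster-cpu-reservation-high",
    "Namespace": "AWS/ECS",
    "MetricName": "CPUReservation",
    "Dimensions": [{"Name": "ClusterName", "Value": "my-cluster"}],
    "Statistic": "Average",
    "Period": 60,              # seconds per datapoint
    "EvaluationPeriods": 3,    # consecutive breaches before the alarm fires
    "Threshold": 75.0,         # percent
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}
# A real call would look like:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

The `AlarmActions` list is where alerting connects to the rest of the system: the target can be a notification topic for humans or an autoscaling policy that acts automatically.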
Successfully monitoring ECS requires defining the target metrics to alert on and the actions required to resolve those alerts. Some alerts may require human intervention to investigate and resolve, while others may trigger autoscaling events.
Successful ECS monitoring requires an understanding of the logical layers of ECS (cluster, workload, and application) and the important metrics for each. It’s important to define a monitoring strategy, establishing which tools will capture each layer’s metrics and how those metrics will be dashboarded and presented to minimize confusion during incidents. The final consideration is alerting: establish which metrics should generate alerts and how those alerts should be resolved.