Amazon’s Elastic Container Service (ECS) is a container orchestrator that provides a low-friction way to run Docker images within the AWS ecosystem; ECS is responsible for scheduling and executing Docker containers, called tasks. While containers can deliver a number of organizational benefits, including increased velocity, cost savings, and team/service autonomy, they also carry challenges in terms of observability and monitoring.
ECS executes tasks on a “cluster,” and each cluster runs on a “launch type” that you choose when you create the cluster. Before Fargate arrived in 2017, Amazon offered only a single launch type, EC2, which requires managing EC2 servers (starting, stopping, sizing, and monitoring them). Fargate, on the other hand, runs containers on Amazon-managed (serverless) resources: ECS schedules tasks onto hosts that Amazon controls, removing the need to manage EC2 instances yourself.
To understand the key metrics of each launch type, it’s critical to understand how Fargate and EC2 differ; the methods for monitoring them also differ significantly. This post briefly reviews EC2 and Fargate, in terms of ECS, and then describes the key metrics necessary for monitoring ECS, Fargate, and EC2.
Monitoring: ECS, EC2 & Fargate
Below is a diagram of the basic components of the ECS API, to clarify the framework within which monitoring takes place.
Successful monitoring is about answering specific questions in order to narrow down the scope of issues. Common questions that arise when operating ECS are:
- Why is a task crash-looping?
- Why did a task get killed, rescheduled, etc.?
- How much memory is a task using?
- Are there enough container instances for the current workload?
- Why is a task stuck in a pending state?
- How many underlying instances will be necessary to scale a task by X?
With all these questions, it’s helpful to start at the level closest to the end-user (the ECS API) and then walk through each deeper level of abstraction. Running an application on ECS requires interaction with multiple functional levels:
- Task – ECS
- Reserved Resources (cgroups)
- Active Resources (Docker)
- Host (VM)
Since this article focuses on ECS, we’ll ignore application-level metrics and focus on the task and its underlying metrics.
ECS: The Common Denominator
Since ECS is the interface to Fargate and EC2 launch types and Docker, it provides a good foundation for metrics. The whole purpose of ECS is to manage tasks, which makes monitoring them a priority. But first, you need to understand the current state of ECS, that is, “What’s happening right now in the cluster?” To answer this question, you need to understand:
- Traffic: How many services/tasks is the cluster running?
- Latency: For how long are tasks running?
- Results (Errors): Are the tasks healthy? What are their states?
- Saturation: How many outstanding tasks are there?
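As a sketch, this cluster-level snapshot can be derived from the task data the ECS API returns (e.g., boto3’s `describe_tasks`). The helper below is a hypothetical example that summarizes task dicts, whose field names mirror the ECS API, into the four signals:

```python
from collections import Counter
from datetime import datetime, timezone

def cluster_snapshot(tasks, now=None):
    """Summarize ECS task dicts (shaped like describe_tasks output)
    into traffic / latency / errors / saturation figures."""
    now = now or datetime.now(timezone.utc)
    states = Counter(t["lastStatus"] for t in tasks)
    running = [t for t in tasks if t["lastStatus"] == "RUNNING"]
    uptimes = [(now - t["startedAt"]).total_seconds()
               for t in running if "startedAt" in t]
    return {
        "traffic": len(running),                  # tasks serving right now
        "latency_avg_s": sum(uptimes) / len(uptimes) if uptimes else 0.0,
        "errors": states.get("STOPPED", 0),       # stopped tasks to investigate
        "saturation": states.get("PENDING", 0),   # tasks waiting for capacity
    }
```

In practice you would feed this the `tasks` list from `describe_tasks` and alert when, say, the pending count stays above zero.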
These four categories correspond to what Google calls “the four golden signals.” At any given time, it’s critical to know where each of these metrics stands, as together they answer: “What’s actually happening, right now, in the cluster?” Most monitoring comes down to hunting down issues with tasks on an ECS cluster, so you first need to know the state of a given task and what the cluster is doing.
Epsagon makes viewing this information clear and simple via its Container Services page. This gives a bird’s eye view of all ECS services and their tasks, task states, and how long they’ve been running.
A Level Deeper: Tasks
The next critical metrics for ECS surround individual tasks. These include:
- Resource limits requested by the task (Task Size)
- Resources in use by the task (Container Resource Usage)
For Docker-based orchestrators like ECS, containers can specify soft and hard resource limits. These limits help ECS place tasks across the cluster in a balanced way and also provide hard, enforceable constraints in the case of Docker. Epsagon shows the task limits on the Container Services overview page:
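In an ECS container definition, `memoryReservation` is the soft limit and `memory` the hard limit (both in MiB): the soft limit guides task placement, while Docker enforces the hard limit. A hypothetical container definition fragment (image name and values are illustrative) might look like:

```python
# Hypothetical ECS container definition fragment.
# memoryReservation (soft limit) guides placement across the cluster;
# memory (hard limit) is enforced by Docker at runtime.
container_definition = {
    "name": "web",
    "image": "example/web:latest",   # hypothetical image
    "cpu": 256,                      # CPU units (1024 = 1 vCPU)
    "memoryReservation": 512,        # soft limit, MiB
    "memory": 1024,                  # hard limit, MiB
}

# Sanity check: the soft limit must not exceed the hard limit.
assert container_definition["memoryReservation"] <= container_definition["memory"]
```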
Epsagon also shows task-specific resource usage as a time series on the task metrics page:
Remember, when monitoring ECS, it’s helpful to start with high-level abstractions of ECS primitives and then slowly work down through the progressive levels of abstraction.
Now, let’s look at monitoring for both launch types previously discussed: EC2 and Fargate.
When ECS first launched, it supported only a single launch type: EC2. This launch type requires companies to provision and manage EC2 servers (called container instances), which together form an ECS cluster. The key metrics for monitoring container instances are very similar to those of any other VM or instance, and AWS surfaces them through CloudWatch (note that some, such as memory and disk usage, require installing the CloudWatch agent):
- Memory: Memory being used, buffered, cached; total and percentages
- CPU: Active, idle, system, user, iowait, etc.
- Network: Rates of bytes sent/received, packets dropped/errored, TCP connections and their states, etc.
- Disk: Total, free, in-use, percentages, IOPS, read/write rate, byte count, etc.
In addition, it’s essential to understand how loaded each node is in ECS terms, that is, how many tasks are running per container instance. These metrics are only relevant to the EC2 launch type, since Amazon manages the underlying EC2 instances when using Fargate.
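As a sketch, per-instance load can be computed from data shaped like the ECS `describe_container_instances` response, which reports each instance’s running task count plus registered and remaining resources:

```python
def instance_load(instances):
    """Compute per-instance task counts and memory reservation from
    dicts shaped like ECS describe_container_instances output."""
    report = []
    for inst in instances:
        registered = {r["name"]: r["integerValue"] for r in inst["registeredResources"]}
        remaining = {r["name"]: r["integerValue"] for r in inst["remainingResources"]}
        mem_total = registered["MEMORY"]
        mem_reserved = mem_total - remaining["MEMORY"]  # MiB claimed by tasks
        report.append({
            "instanceId": inst["ec2InstanceId"],
            "tasks": inst["runningTasksCount"],
            "memoryReservedPct": round(100 * mem_reserved / mem_total, 1),
        })
    return report
```

A high `memoryReservedPct` across all instances is an early warning that new tasks will get stuck in a pending state.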
To determine on which EC2 host tasks should be placed, ECS uses a series of task placement algorithms. During incidents, it’s helpful to understand how many tasks a given host has and the task limits. A view of this can be seen in Epsagon’s container services page as well:
The EC2 launch type places the burden of scaling on the end-user, who is responsible for making sure there is enough capacity to run the desired workload. For example, say three tasks each require 4 GB of memory, but the cluster has only two container instances with 6 GB of memory apiece. The cluster cannot run all three tasks: a container instance must have enough free memory to hold an entire task, so one 4 GB task lands on each of the two 6 GB instances, leaving the third task with no place to go.
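The scenario above can be sketched as a simple first-fit placement check. This is a toy, memory-only model, not ECS’s actual binpack/spread placement strategies:

```python
def first_fit(task_mem_mib, instance_free_mib):
    """Place tasks onto instances first-fit by memory.
    Returns (placements, unplaced); a simplification of ECS placement."""
    free = list(instance_free_mib)
    placements, unplaced = [], []
    for mem in task_mem_mib:
        for i, avail in enumerate(free):
            if avail >= mem:       # the whole task must fit on one instance
                free[i] -= mem
                placements.append((mem, i))
                break
        else:
            unplaced.append(mem)   # no instance has room for this task
    return placements, unplaced

# Three 4 GiB tasks, two 6 GiB instances: one task cannot be placed.
placed, unplaced = first_fit([4096, 4096, 4096], [6144, 6144])
```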
When you use Fargate, you need to specify one of the supported Fargate task sizes:
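A subset of the commonly documented CPU/memory combinations can be encoded as a lookup table for validating a requested size. The values below reflect the published combinations at the time of writing; check the current AWS documentation before relying on them:

```python
# Fargate task sizes: CPU units -> allowed memory values (MiB).
# Values as commonly documented at the time of writing; verify
# against the current AWS documentation.
FARGATE_SIZES = {
    256:  [512, 1024, 2048],
    512:  [1024 * g for g in range(1, 5)],    # 1-4 GB
    1024: [1024 * g for g in range(2, 9)],    # 2-8 GB
    2048: [1024 * g for g in range(4, 17)],   # 4-16 GB
    4096: [1024 * g for g in range(8, 31)],   # 8-30 GB
}

def is_valid_fargate_size(cpu, memory_mib):
    """Return True if the CPU/memory pair is an allowed Fargate task size."""
    return memory_mib in FARGATE_SIZES.get(cpu, [])
```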
What’s amazing about Fargate is that Amazon runs the underlying container instances! That means there is no need to provision EC2 servers, monitor their resources, or worry about autoscaling, provisioning, updating, fixing security issues, etc. The tradeoff is that Fargate is more expensive than equivalent EC2 capacity, making cost a key metric to monitor for Fargate.
Since Fargate pricing is based on the CPU and memory a task reserves, it’s important to make sure tasks aren’t overprovisioned. Check current and historical task-resource usage at the task level and reserve only the minimum necessary. If a task reserves 4 GB of memory but never exceeds 2 GB under a normal workload, less than half of the reservation is being used; reducing it to 2 GB cuts the memory portion of that task’s Fargate cost in half.
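The savings from right-sizing can be estimated with a back-of-the-envelope calculation. The per-vCPU-hour and per-GB-hour rates below are placeholders, not current prices; substitute your region’s actual Fargate pricing:

```python
# Placeholder Fargate rates -- illustrative only, not current prices.
# Substitute your region's per-vCPU-hour and per-GB-hour pricing.
PRICE_PER_VCPU_HOUR = 0.04048
PRICE_PER_GB_HOUR = 0.004445

def monthly_task_cost(vcpu, mem_gb, hours=730):
    """Estimated monthly cost of one always-on Fargate task."""
    return hours * (vcpu * PRICE_PER_VCPU_HOUR + mem_gb * PRICE_PER_GB_HOUR)

# Right-sizing a 1 vCPU task from 4 GB to 2 GB halves the memory
# portion of its cost; the CPU portion stays fixed.
before = monthly_task_cost(1, 4)
after = monthly_task_cost(1, 2)
```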
Special Consideration: Fargate Daemon Mode
Many monitoring strategies require installing a daemon to collect and submit metrics. When using container orchestrators like ECS, there may be reasons to run a single daemon instance per VM host, shared by all containers running on that host. This practice is so common that ECS and Kubernetes have special names for it: the daemon scheduling strategy and the DaemonSet, respectively.
Daemon mode is not available on Fargate because AWS controls the underlying hardware; the only thing an end-user can do is provision tasks and services. On Fargate, it’s therefore common to run the monitoring agent as a sidecar (an additional container definition within the task definition). Be careful with this approach, though: some monitoring providers charge based on the number of agent processes running!
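A sidecar setup can be sketched as a Fargate task definition with two container definitions, one for the application and one for the agent. The image names below are hypothetical:

```python
# Hypothetical Fargate task definition running a monitoring agent
# as a sidecar alongside the application container.
task_definition = {
    "family": "web-with-agent",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "512",       # task-level size, required on Fargate
    "memory": "1024",
    "containerDefinitions": [
        {"name": "web", "image": "example/web:latest",  # hypothetical image
         "essential": True},
        {"name": "agent", "image": "example/monitoring-agent:latest",  # hypothetical image
         "essential": False},  # agent failure should not kill the task
    ],
}
```

Because the sidecar shares the task’s CPU/memory reservation, remember to size the task to cover both containers.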
This post reviewed EC2 and Fargate, in terms of ECS, and described the key metrics necessary for monitoring ECS, Fargate, and EC2. Successfully monitoring ECS requires knowing what’s going on inside of the cluster. In the case of the EC2 launch type, it also requires understanding what’s going on in each individual node inside the cluster. Fargate, on the other hand, removes a lot of the monitoring burden because Amazon takes responsibility for managing the underlying ECS nodes. Still, due to Fargate’s pricing, it’s more important than with EC2 to keep a pulse on costs by monitoring container usage and size.
As with any monitoring approach, it’s essential to start with the four golden signals:
- How much?
- How long?
- What are the results?
- What work is outstanding?