Amazon Elastic Container Service (ECS) is a fully managed and highly scalable container orchestration service that you can use to run, manage, and deploy your mission-critical applications. You can launch and scale your containers seamlessly using ECS, abstracting away the complexity of infrastructure management. With Amazon ECS, you have visibility into your cluster state and can monitor your containers via Amazon CloudWatch. It also gives you the flexibility to integrate with a large variety of services in the AWS ecosystem.
In this article, we’ll cover the challenges of monitoring Amazon ECS, gain an understanding of key ECS metrics you need to monitor, and explore Epsagon’s microservice-based observability platform to monitor your Amazon ECS environment seamlessly.
Monitoring Amazon ECS: Challenges
While troubleshooting issues with your ECS cluster, it’s helpful to have historical performance data under various load scenarios. This will help you identify anomalies and assist you in addressing issues impacting your services.
With a lot of moving components, monitoring your ECS cluster can become challenging. That is why you need to be well aware of the resources running in your cluster, know how to monitor your cluster, know when things go wrong, and receive timely alerts to quickly address cluster issues.
Below are a few of the most common error messages you’ll find in your ECS service event logs:
- Service is unhealthy in the target-group due to (reason request timed out).
- Service is unhealthy in the target-group due to (reason health checks failed).
- Service task failed container health checks.
- Service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance has insufficient CPU units available.
- Service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance has insufficient memory available.
- Service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance doesn’t have the agent connected.
- Service is unable to consistently start tasks successfully.
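To make these messages easier to act on, you can group them by likely cause. The following is a minimal sketch, assuming you have already collected the service event messages as plain strings. The category names and the `classify_event` and `summarize_events` helpers are hypothetical and are not part of any AWS or Epsagon API; the matched substrings are taken from the messages listed above.

```python
from collections import Counter

# Substring -> hypothetical category name. The substrings come from the
# ECS service event messages listed above; the categories are illustrative.
EVENT_PATTERNS = [
    ("insufficient cpu units", "capacity_cpu"),
    ("insufficient memory", "capacity_memory"),
    ("have the agent connected", "agent_disconnected"),
    ("request timed out", "health_check_timeout"),
    ("health checks failed", "health_check"),
    ("failed container health checks", "health_check"),
    ("unable to consistently start tasks", "task_start_failure"),
]


def classify_event(message: str) -> str:
    """Return the first matching category for an event message, or 'other'."""
    lowered = message.lower()
    for needle, category in EVENT_PATTERNS:
        if needle in lowered:
            return category
    return "other"


def summarize_events(messages):
    """Count how many event messages fall into each category."""
    return Counter(classify_event(m) for m in messages)
```

A summary like this can show at a glance whether a cluster is mostly short on capacity (CPU or memory) or mostly failing health checks, which point to different fixes.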
The default monitoring capabilities that come with Amazon ECS are elementary and do not provide deep enough insights into your cluster metrics. Hence, it’s challenging to address issues within your ECS clusters.
Cloud-native applications have a lot of moving parts that increase the complexity of troubleshooting and monitoring issues, so having the ability to identify and fix problems quickly is critical.
Key Amazon ECS Metrics to Monitor
There are two sets of metrics that you need to capture while monitoring your Amazon ECS cluster to help you identify issues early:
- CPU and memory reservation metrics
- CPU and memory utilization metrics
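The difference between the two sets is easiest to see with numbers. The sketch below assumes the cluster-level definitions in which reservation is the CPU units reserved by running tasks divided by the CPU units registered by container instances, and utilization is the CPU units actually used divided by the same registered total. The dictionary keys and the `cluster_cpu_metrics` helper are hypothetical names for illustration; the same arithmetic applies to memory.

```python
def percent(part: float, total: float) -> float:
    """Return part/total as a percentage, or 0.0 when total is not positive."""
    if total <= 0:
        return 0.0
    return 100.0 * part / total


def cluster_cpu_metrics(instances):
    """Compute cluster-level CPU reservation and utilization percentages.

    Each instance is a dict with hypothetical keys:
      registered_cpu: CPU units the container instance registered
      reserved_cpu:   CPU units reserved by tasks placed on it
      used_cpu:       CPU units those tasks are actually consuming
    """
    registered = sum(i["registered_cpu"] for i in instances)
    reserved = sum(i["reserved_cpu"] for i in instances)
    used = sum(i["used_cpu"] for i in instances)
    return {
        "reservation_pct": percent(reserved, registered),
        "utilization_pct": percent(used, registered),
    }


# Two instances with 2048 CPU units each (illustrative numbers).
example = [
    {"registered_cpu": 2048, "reserved_cpu": 1024, "used_cpu": 512},
    {"registered_cpu": 2048, "reserved_cpu": 1536, "used_cpu": 512},
]
metrics = cluster_cpu_metrics(example)
```

In this example, reservation is 62.5% while utilization is only 25%. A high reservation with low utilization suggests that task definitions reserve more than they use, so new tasks may fail to place even though the instances are mostly idle.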
Getting complete visibility into your ECS cluster is necessary to determine if the ECS infrastructure is configured correctly to handle your workload. To do this, some other metrics to monitor are:
- Number of services/tasks/containers running
- Auto-scaling policies for your services
- Resource metrics at the container level
- Network traffic in and out of your cluster
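As an illustration of the first item, the sketch below derives a coarse status for a service from its task counts. It assumes each service is represented as a dictionary with `desiredCount`, `runningCount`, and `pendingCount` fields, the names used in ECS service descriptions. The status labels and the `service_status` helper are hypothetical.

```python
def service_status(service: dict) -> str:
    """Classify a service by comparing running and pending tasks to the desired count."""
    desired = service.get("desiredCount", 0)
    running = service.get("runningCount", 0)
    pending = service.get("pendingCount", 0)

    if desired == 0:
        return "scaled_to_zero"
    if running >= desired:
        return "steady"
    if running + pending >= desired:
        # Enough tasks exist, but some have not reached the running state yet.
        return "starting"
    # Fewer tasks than desired, even counting pending ones.
    return "degraded"


def degraded_services(services):
    """Return the names of services that have fewer tasks than desired."""
    return [s.get("serviceName", "?") for s in services if service_status(s) == "degraded"]
```

A service that stays in the "starting" or "degraded" state for a long time is a good candidate for an alert, since it usually lines up with the placement and health-check errors described earlier.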
Monitoring ECS with Epsagon
Epsagon’s automated approach for cloud monitoring provides you with complete visibility into your application and infrastructure performance. Epsagon helps you address the key challenges in monitoring ECS environments by ingesting Amazon ECS metrics into the Epsagon platform and transforming them into centralized dashboards and other visualizations, letting you make better data-driven decisions and solve issues quicker.
Below, we’ll cover several top features that Epsagon provides for monitoring Amazon ECS clusters.
First off, Epsagon gives you an automated approach to monitoring by unifying all necessary telemetry (metrics, logs, traces, payloads) into a single platform. It also comes with a responsive user interface to view and explore your application metrics, allowing you to effectively display critical metrics and highlight the ones that require attention.
Built-in visualization features to monitor the health and performance of your ECS clusters in real-time are an added bonus, along with auto-discovery and monitoring of every running container inside your ECS cluster. Epsagon even comes with a Service Map that displays a real-time view of your overall architecture, showcasing the interaction between various components and letting you trace every request flowing through your system.
Out-of-the-box dashboards for monitoring your cluster infrastructure simplify troubleshooting cluster issues, and you can trace requests across the containers and services running in your ECS cluster for root-cause analysis.
Epsagon’s trace-based metrics and trace-based alerts let you monitor issues and send alerts when error or latency thresholds are exceeded. The trace data is stored in Elastic and provides an excellent search experience. Meanwhile, using the trace detail view, you can track latency issues across your microservices stack and identify performance issues. You can check out this post to learn the basics of distributed tracing.
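To make the idea of a trace-based alert concrete, here is a minimal sketch of how error-rate and latency thresholds could be evaluated over a batch of traces. This is not Epsagon’s implementation or API. The trace fields (`error`, `duration_ms`), the default thresholds, and the `evaluate_alerts` helper are all assumptions made for illustration.

```python
def p95(values):
    """Return the 95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evaluate_alerts(traces, max_error_rate=0.05, max_p95_ms=500.0):
    """Return the names of thresholds that the given traces exceed.

    Each trace is a dict with hypothetical keys:
      error:       True if the request ended in an error
      duration_ms: end-to-end request duration in milliseconds
    """
    if not traces:
        return []

    alerts = []
    error_rate = sum(1 for t in traces if t["error"]) / len(traces)
    if error_rate > max_error_rate:
        alerts.append("error_rate")

    if p95([t["duration_ms"] for t in traces]) > max_p95_ms:
        alerts.append("latency_p95")

    return alerts
```

Because the thresholds are computed from traces rather than host metrics, an alert of this kind points directly at the requests that failed or ran slowly, which is the starting point for drilling into a trace detail view.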
Epsagon enables you to search your trace data based on a large number of filters as well, such as AWS resource name/ID, operation, labels, error code, and application. Once you’re able to drill down to your event, you can look at the metrics associated with that specific request.
Finally, Epsagon displays all open issues with your applications in the “Issues Manager” screen, giving you a consolidated view of current issues and the ability to set up alerts.
Achieving Complete Visibility of Your ECS Clusters with Epsagon
You can also integrate your AWS account directly with Epsagon and ingest ECS metrics to take advantage of the enhanced monitoring capabilities of Epsagon’s platform. Let’s walk through the Epsagon console and note some best practices for monitoring ECS clusters.
You can get an overview of the ECS clusters and their respective status, services, task count, and CPU/memory usage.
You can visualize the details of the service, including auto-scaling policies and a count of running/pending/desired tasks.
You can also get an insight into the resource utilization of containers, network and I/O utilization, and the status of the ECS container agent.
Plus, you have access to details on the number of tasks running per service (including corresponding traces and logs), task status, and resource limits.
With the growth in containers, cloud platforms, and microservice architectures, getting insights into your system’s performance and availability has become critical. Epsagon provides you with an integrated monitoring solution for your services deployed in Amazon ECS, with all the necessary ECS monitoring metrics available in a single platform. Teams leveraging such automated monitoring solutions can experience higher productivity, a lower rate of error, faster time to market, and a reduced MTTR (mean time to resolution) during incident management.