Amazon Elastic Container Service (ECS) is a fully managed, highly scalable container orchestration service that you can use to run, manage, and deploy your mission-critical applications. ECS lets you launch and scale your containers while abstracting away the complexity of infrastructure management. With Amazon ECS, you have visibility into your cluster state and can monitor your containers via Amazon CloudWatch. It also integrates with a wide variety of services in the AWS ecosystem.

In this article, we’ll cover the challenges of monitoring Amazon ECS, gain an understanding of key ECS metrics you need to monitor, and explore Epsagon’s microservice-based observability platform to monitor your Amazon ECS environment seamlessly.

Monitoring Amazon ECS: Challenges

While troubleshooting issues with your ECS cluster, it’s helpful to have historical performance data under various load scenarios. This will help you identify anomalies and assist you in addressing issues impacting your services.

With a lot of moving components, monitoring your ECS cluster can become challenging. That is why you need to be well aware of the resources running in your cluster, know how to monitor your cluster, know when things go wrong, and receive timely alerts to quickly address cluster issues.

Below are a few of the most common error messages you’ll find in your ECS service event logs: 

  • Service is unhealthy in the target-group due to (reason request timed out).
  • Service is unhealthy in the target-group due to (reason health checks failed).
  • Service task failed container health checks.
  • Service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance has insufficient CPU units available. 
  • Service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance has insufficient memory available. 
  • Service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance doesn’t have the agent connected. 
  • Service is unable to consistently start tasks successfully. 
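To make event messages like these easier to act on, you can group them by likely cause. The following is a minimal sketch, not an official tool: the category labels and the function names `classify_event` and `summarize_events` are hypothetical, and the matching relies only on phrases from the messages listed above. In practice, the message strings could come from the `events` list that the ECS `DescribeServices` API returns for each service.

```python
from collections import Counter

# Hypothetical category labels; the phrases come from the messages listed above.
# Patterns are checked in order and the first match wins.
EVENT_PATTERNS = [
    ("request timed out", "target_group_timeout"),
    ("failed container health checks", "container_health_check"),
    ("health checks failed", "target_group_health_check"),
    ("insufficient cpu", "placement_insufficient_cpu"),
    ("insufficient memory", "placement_insufficient_memory"),
    ("agent connected", "placement_agent_disconnected"),
    ("unable to consistently start tasks", "task_start_failure"),
]


def classify_event(message: str) -> str:
    """Map one service event message to a coarse category."""
    lowered = message.lower()
    for phrase, category in EVENT_PATTERNS:
        if phrase in lowered:
            return category
    return "other"


def summarize_events(messages):
    """Count how many messages fall into each category."""
    return Counter(classify_event(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "Service is unhealthy in the target-group due to (reason request timed out).",
        "Service task failed container health checks.",
        "Service was unable to place a task because no container instance met all "
        "of its requirements. The closest matching container-instance has "
        "insufficient CPU units available.",
        "Service is unable to consistently start tasks successfully.",
    ]
    for category, count in summarize_events(sample).most_common():
        print(f"{category}: {count}")
```

A summary like this makes it easier to tell whether a cluster is mostly hitting capacity limits (placement failures) or application-level problems (health check failures).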

The default monitoring capabilities that come with Amazon ECS are basic and do not provide deep enough insight into your cluster metrics, which makes it harder to diagnose and address issues within your ECS clusters.

Cloud-native applications have a lot of moving parts that increase the complexity of troubleshooting and monitoring issues, so having the ability to identify and fix problems quickly is critical.

Key Amazon ECS Metrics to Monitor 

There are two sets of metrics that you need to capture while monitoring your Amazon ECS cluster to help you identify issues early:

  • CPU and memory reservation metrics 
  • CPU and memory utilization metrics 
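The difference between the two sets is easiest to see with numbers. The sketch below assumes the usual cluster-level definitions for clusters backed by container instances: reservation is what running tasks have reserved divided by what the container instances registered, and utilization is what tasks actually use divided by that same registered total. The `ContainerInstance` class and its field names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ContainerInstance:
    # Hypothetical record; CPU is in CPU units (1024 units = 1 vCPU), memory in MiB.
    registered_cpu: int
    registered_memory: int
    reserved_cpu: int
    reserved_memory: int
    used_cpu: float
    used_memory: float


def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 when nothing is registered."""
    return 0.0 if whole == 0 else 100.0 * part / whole


def cluster_metrics(instances):
    """Aggregate reservation and utilization percentages across the cluster."""
    total_cpu = sum(i.registered_cpu for i in instances)
    total_memory = sum(i.registered_memory for i in instances)
    return {
        "CPUReservation": percent(sum(i.reserved_cpu for i in instances), total_cpu),
        "MemoryReservation": percent(sum(i.reserved_memory for i in instances), total_memory),
        "CPUUtilization": percent(sum(i.used_cpu for i in instances), total_cpu),
        "MemoryUtilization": percent(sum(i.used_memory for i in instances), total_memory),
    }


if __name__ == "__main__":
    cluster = [
        ContainerInstance(2048, 4096, 1024, 2048, 512, 1024),
        ContainerInstance(2048, 4096, 512, 1024, 256, 1024),
    ]
    for name, value in cluster_metrics(cluster).items():
        print(f"{name}: {value:.2f}%")
```

High reservation with low utilization suggests that task definitions reserve more than they use, while high utilization with little reservation headroom suggests the cluster needs more capacity before new tasks can be placed.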

Getting complete visibility into your ECS cluster is necessary to determine if the ECS infrastructure is configured correctly to handle your workload. To do this, some other metrics to monitor are:

  • Number of services/tasks/containers running
  • Auto-scaling policies for your services
  • Resource metrics at the container level
  • Network traffic in and out of your cluster
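As a small illustration of the first item, the sketch below flags services that are running fewer tasks than desired. The dictionary keys (`serviceName`, `desiredCount`, `runningCount`, `pendingCount`) mirror fields returned by the ECS `DescribeServices` API, but the sample data and the function name `find_task_gaps` are made up for this example.

```python
def find_task_gaps(services):
    """Return (service name, missing task count) for services below their desired count."""
    gaps = []
    for svc in services:
        missing = svc["desiredCount"] - svc["runningCount"]
        if missing > 0:
            gaps.append((svc["serviceName"], missing))
    return gaps


if __name__ == "__main__":
    # Made-up sample data shaped like DescribeServices output.
    sample_services = [
        {"serviceName": "web", "desiredCount": 4, "runningCount": 4, "pendingCount": 0},
        {"serviceName": "worker", "desiredCount": 3, "runningCount": 1, "pendingCount": 1},
    ]
    for name, missing in find_task_gaps(sample_services):
        print(f"{name} is short by {missing} task(s)")
```

A gap that persists over several checks, rather than during a single deployment, is usually the signal worth alerting on.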

Figure 1: AWS Console ECS overview

Monitoring ECS with Epsagon

Epsagon’s automated approach for cloud monitoring provides you with complete visibility into your application and infrastructure performance. Epsagon helps you address the key challenges in monitoring ECS environments by ingesting Amazon ECS metrics into the Epsagon platform and transforming them into centralized dashboards and other visualizations, letting you make better data-driven decisions and solve issues quicker.

Below, we’ll cover several of the top features that Epsagon provides for monitoring Amazon ECS clusters.

First off, Epsagon gives you an automated approach to monitoring by unifying all necessary data (metrics, logs, traces, payloads) into a single platform. It also comes with a responsive user interface for exploring your application metrics, so you can display critical metrics and highlight the ones that require attention.

Built-in visualization features to monitor the health and performance of your ECS clusters in real-time are an added bonus, along with auto-discovery and monitoring of every running container inside your ECS cluster. Epsagon even comes with a Service Map that displays a real-time view of your overall architecture, showcasing the interaction between various components and letting you trace every request flowing through your system.

Figure 2: Service Map showing the trace flow

Out-of-the-box dashboards for monitoring your cluster infrastructure simplify troubleshooting cluster issues. You can also trace requests across the containers and services running in your ECS cluster for troubleshooting and root-cause analysis.

Figure 3: Distributed tracing with Epsagon

Epsagon’s trace-based metrics and trace-based alerts let you monitor issues and send alerts when error or latency thresholds are exceeded. The trace data is stored in Elastic and provides an excellent search experience. Meanwhile, using the trace detail view, you can track latency issues across your microservices stack and identify performance issues. You can check out this post here to learn the basics of distributed tracing.
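To make the idea of a trace-based alert concrete, here is a generic sketch of the underlying logic. It is not Epsagon’s implementation or API: the trace record shape (`duration_ms`, `error`), the function names, and the thresholds are all assumptions for illustration. It derives an error rate and a nearest-rank 95th-percentile latency from a batch of traces and reports which thresholds were breached.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_traces(traces, max_error_rate, max_p95_ms):
    """Compute trace-based metrics and list which thresholds they exceed."""
    if not traces:
        raise ValueError("evaluate_traces() needs at least one trace")
    error_rate = sum(1 for t in traces if t["error"]) / len(traces)
    p95_ms = percentile([t["duration_ms"] for t in traces], 95)
    breaches = []
    if error_rate > max_error_rate:
        breaches.append("error_rate")
    if p95_ms > max_p95_ms:
        breaches.append("p95_latency")
    return {"error_rate": error_rate, "p95_ms": p95_ms, "breaches": breaches}


if __name__ == "__main__":
    # Made-up sample: ten traces, two of which ended in an error.
    durations = [120, 80, 95, 2300, 110, 105, 90, 100, 85, 130]
    traces = [
        {"duration_ms": d, "error": i in (3, 9)} for i, d in enumerate(durations)
    ]
    result = evaluate_traces(traces, max_error_rate=0.05, max_p95_ms=1000)
    print(result)
```

Deriving alerts from traces rather than from host metrics ties each alert to the requests that triggered it, which is what makes drilling down into an individual trace useful.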

Figure 4: Epsagon trace view

Epsagon enables you to search your trace data based on a large number of filters as well, such as AWS resource name/ID, operation, labels, error code, and application. Once you’re able to drill down to your event, you can look at the metrics associated with that specific request.

Figure 5: Analyze traces using the Trace Search page

Finally, Epsagon displays all open issues with your applications in the “Issues Manager” screen, giving you a consolidated view of current issues and the ability to set up alerts.

Figure 6: Issues Manager shows open issues and enables alerting functionality

Achieving Complete Visibility of Your ECS Clusters with Epsagon 

You can also integrate your AWS account directly with Epsagon and ingest ECS metrics to get enhanced monitoring through Epsagon’s platform. Let’s walk through the Epsagon console and note some best practices for monitoring ECS clusters.

You can get an overview of the ECS clusters and their respective status, services, task count, and CPU/memory usage:

Figure 7: Epsagon ECS overview page

You can visualize the details of the service including auto-scaling policies and a count of running/pending/desired tasks:

Figure 8: Epsagon’s ECS Services

You can also get insight into the resource utilization of containers, network and I/O utilization, and the status of the ECS container agent:

Figure 9: Epsagon container metrics

Plus, you have access to details on the number of tasks running per service (including corresponding traces and logs), task status, and resource limits:

Figure 10: Epsagon metrics for ECS tasks

Conclusion

With the growth in containers, cloud platforms, and microservice architectures, getting insight into your system’s performance and availability has become a key requirement. Epsagon provides you with an integrated monitoring solution for your services deployed in Amazon ECS, with all the necessary ECS monitoring metrics available in a single platform. Teams leveraging such automated monitoring solutions can experience higher productivity, a lower error rate, faster time to market, and a reduced MTTR (mean time to resolution) during incident management.

To take the next step, start your Epsagon 14-day free trial here. To learn more about monitoring your ECS clusters with Epsagon, check out the onboarding documentation here.
