In the third and last article of our detailed series we’ll focus on AWS CloudWatch metrics and dashboards. Part 1 (Logs and Insights) and Part 2 (Alarms and Alerts) are also available in case you want to catch up with the series.
So what are metrics anyway?
Metrics are time-series data sampled at regular intervals and identified by a name and, optionally, one or more dimensions. A metric name could be “CPUUtilization” or “disk_used_percent.” A dimension could be, for example, “InstanceId”, which separates the metric data of two different EC2 instances. Additionally, in AWS, metric data is stored inside a “namespace,” which acts like a folder. Examples of namespaces are:
- AWS/EC2 (shown as “EC2” in the console): Where EC2-related metrics are stored.
- AWS/RDS (shown as “RDS” in the console): Where RDS-related metrics are stored.
- CWAgent: Where metrics reported by the Amazon CloudWatch Agent are stored by default.
CloudWatch also offers dashboards, which allow you to have a quick overview of how certain parts of your AWS workloads are performing. Dashboards offer a graphical view of how selected metrics are evolving over various periods of time and can help you to quickly identify anomalies and issues.
How AWS CloudWatch Handles Metrics
Metrics are stored in namespaces, which are akin to folders. Metrics reported by various AWS services (such as EC2, RDS, and Lambda) are stored in their respective namespaces, which are usually named after the AWS service that sends the metrics (more on that below). You can also create your own custom metrics. Although you can choose the namespace they are stored in, it is advisable to create a dedicated namespace for them rather than reuse an existing one.
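As a rough sketch of what publishing a custom metric into its own namespace could look like with the AWS SDK for Python (boto3), consider the snippet below. The namespace “MyApp”, the metric name “QueueDepth”, and the “Environment” dimension are hypothetical names chosen for illustration. The request is built as a plain dictionary so it can be inspected without AWS credentials, and boto3 is only imported when the request is actually sent.

```python
from datetime import datetime, timezone


def build_put_metric_request(namespace, name, value, dimensions=None, unit="None"):
    """Build the keyword arguments for CloudWatch's PutMetricData call."""
    if namespace.startswith("AWS/"):
        # AWS documentation advises against custom namespaces starting with
        # "AWS/"; this sketch simply chooses to reject them.
        raise ValueError("pick a custom namespace that does not start with 'AWS/'")
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in dimensions.items()
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


def publish(request):
    """Send the request to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_data(**request)


# Hypothetical example values.
request = build_put_metric_request(
    "MyApp", "QueueDepth", 42, {"Environment": "staging"}, unit="Count"
)
print(request["Namespace"], request["MetricData"][0]["MetricName"])
```

Calling `publish(request)` would create the namespace on first use; there is no separate step to create a namespace.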
Names and Dimensions
Metrics are uniquely identified through their names and dimensions. Dimensions are optional, but usually, at least one dimension will be attached to the metric to differentiate it from other metrics with the same name.
For example, the screenshot below shows a metric in the “EC2” namespace, with the “CPUUtilization” name and the “InstanceId” dimension (please note the “Instance Name” is automatically looked up by CloudWatch for convenience and is not an actual dimension). As you can expect, the “InstanceId” dimension is necessary to avoid blending the CPU metrics reported for two different instances.
The combination of namespace, metric name, and dimensions uniquely identifies a metric. When specifying a metric programmatically or using the AWS command-line interface, all of these must be specified. It is not possible to “filter” metric data by specifying, for example, only a subset of the metric's dimensions.
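To make the identity rule more concrete, here is a small toy model in Python. It is not an AWS API: the `MetricId` class, the in-memory `store`, and the instance IDs are all made up for illustration. It only shows why a lookup with a partial set of dimensions does not match anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricId:
    """Toy model: a metric is identified by namespace + name + dimensions."""

    namespace: str
    name: str
    dimensions: frozenset = frozenset()

    @classmethod
    def of(cls, namespace, name, **dimensions):
        return cls(namespace, name, frozenset(dimensions.items()))


# Two metrics with the same name, told apart only by their dimension.
store = {
    MetricId.of("AWS/EC2", "CPUUtilization", InstanceId="i-0aaa"): [12.5, 14.0],
    MetricId.of("AWS/EC2", "CPUUtilization", InstanceId="i-0bbb"): [71.0, 69.5],
}

exact = MetricId.of("AWS/EC2", "CPUUtilization", InstanceId="i-0aaa")
no_dimensions = MetricId.of("AWS/EC2", "CPUUtilization")

print(store.get(exact))          # [12.5, 14.0]
print(store.get(no_dimensions))  # None: a different identity, not a wildcard
```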
Popular Use Case: EC2 Metrics (Basic vs. Detailed Monitoring)
By default, when you create an EC2 instance, it will have EC2 metrics reported every five minutes and this option is free. You can enable detailed monitoring for your EC2 instance as well, which will report the metrics every sixty seconds. This will cost you extra, though. Detailed monitoring is usually unnecessary, but if your workload is quite sensitive and business-critical, it might be better to enable it.
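The difference between the two modes shows up as the period you can meaningfully query. The sketch below, which assumes boto3 for the parts that talk to AWS, builds a GetMetricStatistics request with a 300-second or 60-second period and computes how many data points to expect. The instance ID is a placeholder.

```python
from datetime import datetime, timedelta, timezone

BASIC_PERIOD = 300    # seconds between data points with basic monitoring
DETAILED_PERIOD = 60  # seconds between data points with detailed monitoring


def expected_datapoints(hours, period_seconds):
    """Number of data points a fully reported metric has over `hours`."""
    return hours * 3600 // period_seconds


def build_cpu_query(instance_id, hours, detailed=False, now=None):
    """Build keyword arguments for CloudWatch's GetMetricStatistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": DETAILED_PERIOD if detailed else BASIC_PERIOD,
        "Statistics": ["Average"],
    }


def enable_detailed_monitoring(instance_id):
    """Turn on detailed monitoring (requires boto3 and AWS credentials)."""
    import boto3

    boto3.client("ec2").monitor_instances(InstanceIds=[instance_id])


def fetch(query):
    """Run the query (requires boto3 and AWS credentials)."""
    import boto3

    return boto3.client("cloudwatch").get_metric_statistics(**query)["Datapoints"]


print(expected_datapoints(1, BASIC_PERIOD))     # 12
print(expected_datapoints(1, DETAILED_PERIOD))  # 60
```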
The Amazon CloudWatch Agent
The Amazon CloudWatch Agent can be installed on EC2 instances to report additional and useful metrics. Indeed, some very important metrics, such as RAM and disk utilization, are otherwise inaccessible to the AWS infrastructure.
The Amazon CloudWatch Agent can be configured to report metrics such as CPU, RAM and disk utilization, swap usage, disk I/O, etc. Additionally, it can forward logs to CloudWatch Logs, as detailed in a previous article, and it can be installed on a variety of operating systems, including the standard Linux distributions (Amazon Linux, Ubuntu, CentOS, etc.) and Microsoft Windows.
Finally, the Amazon CloudWatch Agent is able to collect local custom metrics and forward them to CloudWatch. This is done either through collectd or by having the Amazon CloudWatch Agent act as a StatsD service. This feature is seldom used, but it is good to know it exists.
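The agent is driven by a JSON configuration file. As a minimal sketch, the Python snippet below generates a metrics section that collects RAM, swap, and disk usage on a Linux host. The collection interval and the list of disk paths are assumptions picked for the example, and a Windows host would use different counter names.

```python
import json


def build_agent_config(interval_seconds=60, disk_paths=("/",)):
    """Return a minimal CloudWatch Agent configuration (Linux metric names)."""
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            # "CWAgent" is the agent's default namespace; stated here for clarity.
            "namespace": "CWAgent",
            # Attach the instance ID as a dimension to every metric.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "swap": {"measurement": ["swap_used_percent"]},
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": list(disk_paths),
                },
            },
        },
    }


config = build_agent_config()
print(json.dumps(config, indent=2))
```

With this configuration, the disk measurement is reported under the “disk_used_percent” metric name mentioned earlier.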
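For the StatsD route, an application only has to send small UDP datagrams in the StatsD line format. The sketch below assumes the agent's configuration enables its StatsD listener and keeps the usual UDP port 8125 on the local host. The metric names are hypothetical.

```python
import socket

# Metric types in the StatsD line protocol: counter, gauge, timer, histogram, set.
STATSD_TYPES = {"c", "g", "ms", "h", "s"}


def statsd_line(name, value, metric_type="c"):
    """Format one StatsD datagram: <name>:<value>|<type>."""
    if metric_type not in STATSD_TYPES:
        raise ValueError(f"unknown StatsD type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send one datagram to a StatsD listener (fire and forget over UDP)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


print(statsd_line("checkout.orders", 1))     # checkout.orders:1|c
print(statsd_line("queue.depth", 17, "g"))   # queue.depth:17|g
# send_statsd(statsd_line("checkout.orders", 1))  # needs a listening agent
```

Because UDP is connectionless, the application is not slowed down or broken if the agent happens to be stopped. The data points are simply lost.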
Many AWS services publish metrics. Each AWS service will typically publish metrics in a namespace named after it and prefixed with “AWS/.” For example, EC2 metrics are stored in the “AWS/EC2” namespace, RDS metrics are stored in the “AWS/RDS” namespace, and so forth.
Searching for AWS CloudWatch Metrics
Browsing metrics on the CloudWatch console is very easy. You need to first log in to the AWS console and navigate to the CloudWatch service. Then, in the left pane, click on “Metrics.” You will then be able to see the various metric namespaces.
To browse for metrics, click on the relevant namespace. The CloudWatch console shows custom namespaces (such as “CWAgent”) and AWS namespaces a little bit differently. Because it knows about the names and dimensions of the AWS metrics, it is able to show them in a more user-friendly way. For example, clicking on the “EC2” namespace (which is really “AWS/EC2”, but appears as “EC2” in the console) will show you some user-friendly options:
In contrast, clicking the “CWAgent” namespace will simply show you the various dimensions:
In any case, you can enter some keywords in the search bar at any point to further filter the metrics.
If you click on the “EC2” namespace, and then on the “Per-Instance Metrics” link, you will see a pretty long list of reported metrics for your current and past EC2 instances. You can scroll the list to find the metrics you want to display, but it is usually useful to enter some search keywords, such as the instance ID. You can then select one or more metrics to display:
The time options circled in pink above allow you to select various durations of time over which to inspect the chosen metric(s).
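The same kind of search can be scripted. As a sketch assuming boto3, the snippet below builds a ListMetrics request that narrows the results by namespace, metric name, and dimension, and pages through the results when run against a real account. The instance ID is a placeholder.

```python
def build_list_metrics_request(namespace, metric_name=None, **dimensions):
    """Build keyword arguments for CloudWatch's ListMetrics call."""
    request = {"Namespace": namespace}
    if metric_name:
        request["MetricName"] = metric_name
    if dimensions:
        request["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in dimensions.items()
        ]
    return request


def list_metrics(request):
    """Yield matching metrics (requires boto3 and AWS credentials)."""
    import boto3

    paginator = boto3.client("cloudwatch").get_paginator("list_metrics")
    for page in paginator.paginate(**request):
        yield from page["Metrics"]


request = build_list_metrics_request(
    "AWS/EC2", "CPUUtilization", InstanceId="i-0123456789abcdef0"
)
print(request)
```

Note that the API expects the full “AWS/EC2” namespace, not the shortened “EC2” label shown in the console.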
CloudWatch also offers dashboards, which display smaller versions of the metric graphs and allow you to quickly view how your workloads are performing. Here is what a dashboard looks like:
How to Create a Dashboard
To create a dashboard, navigate to the CloudWatch console, then click on “Dashboards” in the left pane, and then on the “Create Dashboard” button. You can then add widgets such as graphs, numbers, free text, and even CloudWatch Logs Insights query results (CloudWatch Logs Insights has been covered in a previous article).
When adding a widget that shows the underlying metric(s), you will then need to select which metric(s) to show in the widget. The screen shown is the same as when selecting individual metrics to display outside a dashboard. You can show multiple metrics in one widget, as lines, stacked areas, or just numbers (in which case only the latest value will be shown).
Please note a little gotcha about dashboards: Dashboards will be displayed no matter which region you select, but you can add metrics only for the current region. So if you want to add a widget for a certain metric you have in mind and this metric doesn’t show up in the selection screen, make sure you are in the correct region for that metric.
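Dashboards can also be defined as a JSON document, in which each metric widget carries its own region. Here is a minimal sketch, assuming boto3 for the upload step, that builds a dashboard body with one text widget and one CPU graph. The dashboard name, region, and instance ID are placeholders.

```python
import json


def text_widget(markdown, x=0, y=0, width=24, height=2):
    """A free-text widget spanning the full 24-column dashboard grid by default."""
    return {
        "type": "text",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {"markdown": markdown},
    }


def metric_widget(title, region, metrics, x=0, y=0, width=12, height=6,
                  stacked=False):
    """A line (or stacked area) graph for one or more metrics."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "region": region,    # the region the metric lives in
            "metrics": metrics,  # [namespace, name, dim name, dim value, ...]
            "view": "timeSeries",
            "stacked": stacked,
            "stat": "Average",
            "period": 300,
        },
    }


body = {
    "widgets": [
        text_widget("# Web tier overview"),
        metric_widget(
            "CPU utilization",
            "us-east-1",
            [["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]],
            y=2,
        ),
    ]
}
dashboard_body = json.dumps(body)


def put_dashboard(name, body_json):
    """Create or replace the dashboard (requires boto3 and AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_dashboard(
        DashboardName=name, DashboardBody=body_json
    )


print(len(body["widgets"]), "widgets,", len(dashboard_body), "bytes of JSON")
```

Keeping the body in version control is a convenient way to recreate the same dashboard in another account.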
Advanced Metric Usage
CloudWatch offers additional advanced features regarding the metrics it handles. First of all, CloudWatch allows you to perform math on the metrics you want to display. To add a math expression (whether part of a dashboard or not), select the “Graphed metrics” tab, and click on the “Math expression” drop-down menu:
You can then select the math function you want to use or start from scratch and write the math expression yourself. Please refer to the AWS documentation for further details.
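Math expressions are also available outside the console. As a sketch assuming boto3 and the GetMetricData API, the snippet below fetches the CPU utilization of several instances as hidden inputs and returns only their average, computed by CloudWatch. The instance IDs and the `avg_cpu` identifier are placeholders.

```python
from datetime import datetime, timedelta, timezone


def cpu_query(query_id, instance_id, period=300):
    """One input series; ReturnData=False keeps it out of the response."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            },
            "Period": period,
            "Stat": "Average",
        },
        "ReturnData": False,
    }


def build_average_cpu_queries(instance_ids, period=300):
    """Build MetricDataQueries: m1..mN as inputs plus one math expression."""
    if not instance_ids:
        raise ValueError("at least one instance ID is required")
    queries = [
        cpu_query(f"m{n}", instance_id, period)
        for n, instance_id in enumerate(instance_ids, start=1)
    ]
    total = " + ".join(query["Id"] for query in queries)
    count = len(queries)
    queries.append({
        "Id": "avg_cpu",
        "Expression": f"({total}) / {count}",
        "Label": "Average CPU",
        "ReturnData": True,
    })
    return queries


def fetch(queries, hours=3):
    """Run the queries (requires boto3 and AWS credentials)."""
    import boto3

    end = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=queries,
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return response["MetricDataResults"]


queries = build_average_cpu_queries(["i-0aaa", "i-0bbb"])
print(queries[-1]["Expression"])  # (m1 + m2) / 2
```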
Another advanced feature is to use search expressions in graphs. Search expressions are a type of math expression that allows you to quickly select related metrics. This is quite an advanced topic and outside the scope of this article.
Finally, CloudWatch is able to generate metrics from logs. Using metric filters on a log group, CloudWatch can generate metrics that count, for example, the number of log events, occurrences of a term in the logs, or HTTP 5xx errors. Again, please refer to the AWS documentation for more details.
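As an illustration, here is a sketch assuming boto3 and the CloudWatch Logs PutMetricFilter API. It defines two filters: one counting log events that contain the term “ERROR”, and one counting HTTP 5xx responses. The second pattern assumes a space-delimited access log with exactly the seven fields named in it. The log group names, filter names, and the “MyApp” namespace are placeholders.

```python
def build_metric_filter(log_group, filter_name, pattern, metric_name, namespace):
    """Build keyword arguments for CloudWatch Logs' PutMetricFilter call."""
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": "1",   # add 1 to the metric per matching log event
            "defaultValue": 0.0,  # report 0 when nothing matched
        }],
    }


def create(request):
    """Create the filter (requires boto3 and AWS credentials)."""
    import boto3

    boto3.client("logs").put_metric_filter(**request)


# Counts every log event containing the term ERROR.
error_filter = build_metric_filter(
    "/myapp/application", "error-count", "ERROR", "ErrorCount", "MyApp"
)

# Counts 5xx responses; assumes a space-delimited access log with these fields.
http_5xx_filter = build_metric_filter(
    "/myapp/access",
    "http-5xx-count",
    "[ip, id, user, timestamp, request, status_code=5*, size]",
    "Http5xxCount",
    "MyApp",
)

print(error_filter["filterPattern"])
print(http_5xx_filter["filterPattern"])
```

The resulting metrics behave like any other custom metric, so they can be graphed on a dashboard or used in an alarm.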
Grafana is a visualization tool often used in conjunction with Prometheus to visualize the metrics collected by Prometheus. Grafana can essentially be compared to CloudWatch dashboards. Overall, Grafana’s capabilities are only slightly better than those of CloudWatch dashboards, and few people will actually require the additional features provided by Grafana.
Prometheus is open-source software used to collect, store, and retrieve metric data. It is a very popular solution for handling metrics and is renowned for its very efficient time-series database, which allows a single Prometheus machine to handle very sizable amounts of data. This makes Prometheus a viable alternative for all but very large workloads.
The advantage of CloudWatch over Prometheus is the tight integration with all AWS services. Prometheus requires either HTTP endpoints or agents running on servers to report the metrics, which requires additional work and integration. In addition, if Prometheus is unable to reach the HTTP endpoint, or if the agent is unable to report the metrics to Prometheus, Prometheus will simply not have the data at hand. This can be contrasted with EC2 metrics (such as CPU usage and disk I/O), which are always available whether the operating system running on the instance is healthy or not. Finally, Prometheus is inherently hard to scale out. CloudWatch, on the other hand, works perfectly at scale and has been designed from the ground up to handle huge workloads.
AWS ServiceLens is a brand new service from AWS that allows you to have more visibility into the inner workings of your AWS workloads. It combines data collected from CloudWatch and AWS X-Ray to present you with graphs and data that are actually understandable by a human being.
AWS ServiceLens provides you with an end-to-end view of your workloads and thus allows you to locate bottlenecks. It also shows you the different services you are using and the latency between each hop in the logical data flow of your users’ requests. You can easily inspect each hop via the corresponding metrics, logs, and X-Ray traces.
AWS ServiceLens mostly depends on AWS X-Ray and improves on it by integrating CloudWatch metrics and logs, rendering it more user-friendly in terms of showing relevant information about how your workloads are performing.
Epsagon is the only fully-managed provider that integrates metrics, logs, traces, and alerts in the same dashboard. Epsagon takes on the heavy lifting of shipping logs, creating dashboards, defining alerts, and generating distributed traces, and brings them together in a unified dashboard that allows DevOps teams and engineers to operate faster. It does so by integrating with CloudWatch, other AWS services, and Kubernetes clusters, which removes the need for configuration or maintenance. Epsagon shows every end-to-end trace in your application, allowing you to easily pinpoint and troubleshoot issues in the most complex applications, as seen here:
CloudWatch’s handling of metrics is pretty good and should satisfy most workloads. It has some limitations, such as the need to specify every dimension when referring to a given metric, even when some of those dimensions add nothing useful.
Compared to other solutions, CloudWatch provides much tighter integration with other AWS services and is thus easier to use. For most workloads running on AWS, CloudWatch should be complete enough and thus should be considered first because of the following advantages:
- Low cost compared to the competition.
- Simplicity and ease of use.
- Scalability without any extra effort or cost.