This is the second of three articles investigating what AWS CloudWatch can offer, this time focusing on alarms and alerts.
AWS CloudWatch can monitor metrics and raise alarms when they cross a given threshold. CloudWatch can also take various actions when this occurs, such as alerting humans through email, SMS, or Slack messages. It typically does so by posting a message to an SNS (Amazon Simple Notification Service) topic, which then dispatches the message through a variety of channels, such as email, SMS, and Lambda functions. The other actions CloudWatch can take when an alarm goes off are typically auto-scaling events, such as scaling workloads out or back in.
Let’s dive into more details.
The most important tip of this article is that CloudWatch can alert you when your bill is likely to become too high. You should never run any workload on AWS without this alarm, as it is quite easy to forget resources spun up for testing purposes. To create such a billing alarm, you need to log in to the AWS console using either the root user or an IAM user who has permission to access the billing section of the console.
The first step is to enable billing alerts. Click on your user name in the top-right corner of the screen, then click on “My Billing Dashboard.” In the left pane, select “Billing preferences,” then enable “Receive Billing Alerts,” and finally click on “Save preferences.”
The second step is to actually create a billing alarm. Head over to the CloudWatch service and make sure the US East (N. Virginia) region is selected, since billing metric data is stored there. In the left pane, click on “Alarms,” and then click on the “Create Alarm” button. Choose the “Billing, Total Estimated Charge” metric, and configure the alarm to go off when this metric is greater than the monthly amount at which you would like to be alerted; click “Next.” Then select “Create a New SNS Topic,” and enter a name for the SNS topic and your email address; click “Create topic.” Click “Next,” enter a name for the alarm, click “Next” again, and, finally, “Create Alarm.” That’s it!
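The same alarm can also be created programmatically. The following is a minimal sketch, assuming the boto3 SDK is installed and AWS credentials are configured; the alarm name, the 50 USD threshold, and the SNS topic ARN are hypothetical placeholders. It relies on the `EstimatedCharges` metric in the `AWS/Billing` namespace, which is published in the us-east-1 region.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn, alarm_name="monthly-billing-alarm"):
    """Build the keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,  # hypothetical name, pick your own
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; billing data is only updated a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that emails you
    }


def create_billing_alarm(params):
    """Send the alarm definition to CloudWatch (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    # Billing metrics live in us-east-1, whatever region your workload runs in.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    cloudwatch.put_metric_alarm(**params)


# Hypothetical account ID and topic name, for illustration only.
params = billing_alarm_params(50, "arn:aws:sns:us-east-1:123456789012:billing-alerts")
print(params["MetricName"], params["Threshold"])
```

Calling `create_billing_alarm(params)` would then register the alarm; the SNS topic and its email subscription must already exist.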
How Are AWS CloudWatch Alarms Evaluated?
Evaluation Periods and DatapointsToAlarm
There are three settings that control when an alarm goes off:
- Period is the length of time over which the underlying metric is evaluated. The alarm period does not need to match the underlying metric period, but it needs to be at least as long as the metric period. CloudWatch will essentially generate one alarm data point per alarm period, based on the value(s) of the underlying metric during that period.
- Evaluation period is the number of alarm periods (or alarm data points) to take into account when determining whether the alarm is triggered or not.
- DatapointsToAlarm is the number of alarm data points that must breach the threshold during the evaluation period for the alarm to go off.
For example, let’s assume that the alarm period is the same as the underlying metric period, the evaluation period is five, and the DatapointsToAlarm is three. For any five consecutive alarm data points, the alarm will go into the ALARM state if at least three out of five data points breach the alarm threshold. If, for a given set of five consecutive alarm data points, two or fewer breach the threshold, the alarm will not be in the ALARM state.
These settings are quite useful for filtering out normal spikes, for instance in CPU utilization. It’s normal for the CPU to run very high for short periods of time because the OS is, for example, performing some maintenance tasks, and you probably wouldn’t want to be alerted about such things.
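The "three out of five" rule above can be sketched in a few lines of plain Python. This is a simplified model for illustration only, not CloudWatch's actual implementation: it ignores missing data and state transitions, and the function name and sample CPU values are made up.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """For each full window of data points, report whether it would be in ALARM."""
    states = []
    for end in range(evaluation_periods, len(datapoints) + 1):
        # The most recent `evaluation_periods` alarm data points.
        window = datapoints[end - evaluation_periods:end]
        breaches = sum(1 for value in window if value > threshold)
        states.append(breaches >= datapoints_to_alarm)
    return states


# Made-up CPU utilization values (%), one per alarm period.
cpu = [40, 95, 50, 92, 45, 91, 97, 30, 20, 10]
states = alarm_states(cpu, threshold=90, evaluation_periods=5, datapoints_to_alarm=3)
print(states)  # [False, True, True, True, False, False]
```

The first window contains only two breaching points (95 and 92), so it stays out of ALARM; the next three windows each contain three breaching points.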
What Happens When There Is Missing Data?
There are various reasons why the underlying metric might have missing data points during an evaluation period, including:
- The service or EC2 instance is just starting and hasn’t yet reported enough metrics.
- The EC2 instance is rebooting and can’t report the metric during the reboot time.
- Some networking parameters (such as a security group or network access control list) have been modified, preventing the EC2 instance from connecting to CloudWatch.
- There are some network glitches (typically if the metric is reported from outside AWS).
You can configure each alarm to consider missing data points as:
- Breaching: The missing data point is assumed to breach the alarm threshold.
- Not breaching: The missing data point is assumed to be within the alarm threshold.
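- Ignore: The missing data point is disregarded, and the current alarm state is maintained.
- Missing: The data point is treated as missing. If all data points in the evaluation range are missing, the alarm goes into the INSUFFICIENT_DATA state.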
Which option to choose depends on how the underlying metric is reported. For example, if your alarm is for CPU, RAM, or disk utilization, it’s probably safer to consider missing data as breaching, as it is likely that your EC2 instance is going through a rough time. Some metrics are reported only when errors occur (such as ThrottledRequests in Amazon DynamoDB), in which case missing data should be treated as not breaching.
Importantly, CloudWatch tries to use missing data points as little as possible, no matter how you configure the alarm to treat them. It does so by retrieving more data points than the evaluation period requires (this larger set is called the evaluation range) and using as many valid data points as possible within that range.
Standard alarms have a period that is a multiple of sixty seconds. If the underlying metric is a high-resolution one, you can use a high-resolution alarm with a period of ten or thirty seconds. Although these come at a higher cost, they can be useful if you need very swift action.
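In the API, this choice is expressed through the `TreatMissingData` setting of an alarm, which accepts `breaching`, `notBreaching`, `ignore`, or `missing`. The sketch below only builds parameter dictionaries for the two cases just discussed; the helper name and alarm names are hypothetical, and the remaining alarm settings (threshold, period, and so on) are left out for brevity.

```python
VALID_TREATMENTS = {"breaching", "notBreaching", "ignore", "missing"}


def with_missing_data_policy(alarm_params, treatment):
    """Return a copy of the alarm parameters with TreatMissingData set."""
    if treatment not in VALID_TREATMENTS:
        raise ValueError(f"unknown missing-data treatment: {treatment!r}")
    updated = dict(alarm_params)
    updated["TreatMissingData"] = treatment
    return updated


# A silent instance is suspicious, so missing CPU data counts as breaching.
cpu_alarm = with_missing_data_policy(
    {"AlarmName": "high-cpu", "Namespace": "AWS/EC2", "MetricName": "CPUUtilization"},
    "breaching",
)

# ThrottledRequests is only reported when throttling happens, so silence is good news.
throttle_alarm = with_missing_data_policy(
    {"AlarmName": "dynamodb-throttling", "Namespace": "AWS/DynamoDB",
     "MetricName": "ThrottledRequests"},
    "notBreaching",
)

print(cpu_alarm["TreatMissingData"], throttle_alarm["TreatMissingData"])
```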
Advanced AWS CloudWatch Alarms Evaluation
CloudWatch can combine metrics using math. This will be explored in a subsequent article, but what is relevant here is that you can create alarms on the output of the math expression.
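As a rough illustration, an alarm on a math expression is defined by passing a list of metric queries instead of a single metric. The sketch below assumes a Lambda function whose error rate (errors divided by invocations) should be watched; the function name, threshold, and topic ARN are hypothetical, and the dictionary is meant to be passed to boto3's `put_metric_alarm`.

```python
def error_rate_alarm_params(function_name, threshold_percent, sns_topic_arn):
    """Build put_metric_alarm arguments for an alarm on a math expression."""

    def lambda_metric(query_id, metric_name):
        # An input metric: fetched for the expression, but not alarmed on directly.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",
        "Metrics": [
            lambda_metric("errors", "Errors"),
            lambda_metric("invocations", "Invocations"),
            {
                # The expression refers to the Ids above; the alarm watches its output.
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


math_alarm = error_rate_alarm_params(
    "checkout-handler", 5, "arn:aws:sns:eu-west-1:123456789012:Alarms"
)
print([query["Id"] for query in math_alarm["Metrics"]])
```

Exactly one query in the list returns data, and that is the value compared against the threshold.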
CloudWatch also offers anomaly detection, a feature that generates alarms based on anomalous patterns in the underlying metric. To use it, you simply create an alarm as usual and select “Anomaly detection” under “Conditions.” Alternatively, when you display a metric in CloudWatch, select the “Graphed metrics” tab and click on the wave icon next to the metric name (see screenshot below).
Using anomaly detection, you can benefit from the years of experience that AWS has accumulated in monitoring a variety of workloads. It allows you to leverage machine learning in a very intuitive and user-friendly way. Anomaly detection uses machine learning and statistical analysis to determine the valid or “usual” range for a metric. It will then alert you whenever the metric goes out of that range and is thus suspicious.
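Outside the console, an anomaly detection alarm is expressed as a regular alarm whose threshold is a band computed by the `ANOMALY_DETECTION_BAND` function rather than a fixed number. The following sketch builds such a definition for boto3's `put_metric_alarm`; the instance ID, alarm name, and band width of 2 are assumptions chosen for illustration.

```python
def anomaly_alarm_params(instance_id, band_width=2):
    """Build put_metric_alarm arguments for an anomaly detection alarm on CPU."""
    return {
        "AlarmName": f"{instance_id}-cpu-anomaly",  # hypothetical name
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
                "ReturnData": True,
            },
            {
                # The band of "usual" values; a larger width means a more tolerant alarm.
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "CPUUtilization (expected)",
                "ReturnData": True,
            },
        ],
        # The threshold is the band itself, not a fixed number.
        "ThresholdMetricId": "band",
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "TreatMissingData": "missing",
    }


anomaly_alarm = anomaly_alarm_params("i-0123456789abcdef0")
print(anomaly_alarm["Metrics"][1]["Expression"])
```

Using `GreaterThanUpperThreshold` or `LessThanLowerThreshold` instead would restrict the alarm to one side of the band.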
How to Alert a Human Being
CloudWatch can take a variety of actions when an alarm goes off, such as triggering an auto-scaling event or sending a message through a medium likely to attract the attention of a human. Typically, you would need to create an SNS topic and add subscriptions to that SNS topic. Each subscription represents a channel to which the alarm message will be forwarded. In the screenshot below, there are a number of subscriptions to the “Alarms” SNS topic. Any message sent to that SNS topic will be forwarded to all subscribers, whether by email, SMS, Lambda invocation, etc.
The next step is to link the alarm with the SNS topic. So when you create or edit the alarm, you just need to select the correct SNS topic. After that, every time the alarm goes off, CloudWatch will send a message to SNS, which will forward it to you via email, SMS, etc. You can even create a Lambda function that will post the message on a Slack channel.
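A Lambda function for the Slack case could look roughly like the sketch below. It assumes the function is subscribed to the SNS topic and that a Slack incoming webhook URL is provided through a `SLACK_WEBHOOK_URL` environment variable (a name chosen here for illustration). The alarm details arrive as a JSON string in the `Message` field of each SNS record.

```python
import json
import os
import urllib.request


def slack_payload(event):
    """Turn an SNS-delivered CloudWatch alarm event into a Slack message body."""
    lines = []
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])  # the alarm is JSON inside SNS
        lines.append(
            f"{alarm['NewStateValue']}: {alarm['AlarmName']} - {alarm['NewStateReason']}"
        )
    return {"text": "\n".join(lines)}


def lambda_handler(event, context):
    """Lambda entry point: post the alarm summary to a Slack incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed to be configured
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(slack_payload(event)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return {"statusCode": response.status}


# A trimmed-down sample event, for local testing of the formatting only.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "high-cpu",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed",
                    }
                )
            }
        }
    ]
}
print(slack_payload(sample_event)["text"])
```

Keeping the formatting in its own function makes it easy to test locally without calling Slack.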
AWS CloudWatch Alarms Cost Considerations
Don’t Create a Loop!
The advice here is fairly obvious: avoid creating a loop. For example, you might have an alarm on a Lambda function that sends a message to Slack, with that same function being invoked whenever the alarm goes off. This could turn into an infinite loop in which the Lambda function triggers the alarm, which triggers the Lambda function, and so on. This is a silly example, and such situations are unlikely to happen in real life, but it is still something to keep in mind, because such a loop is likely to cost you a lot of money!
Avoid Sending Too Many Text Messages
Pricing information for SNS can be found here. The free tier allows you to send 100 SMS messages (to US phone numbers) and 1,000 emails per month. After that, you will have to pay per SMS, which is reasonably cheap for US numbers but can be significantly more expensive for other countries.
If you’re using Slack in your organization, a good way to minimize these costs is to post a message to a Slack channel instead.
The pricing page for CloudWatch can be found here.
High-resolution alarms are about three times more expensive than regular alarms, so use them only if necessary. The costs for alarms and alerts are usually quite low, so you wouldn’t need to worry about those unless you have a large number of them.
Another point to consider with CloudWatch is that everything needs to be set up manually, so there is a risk of missing or misconfiguring an alert.
CloudWatch vs. Competition
There are many other monitoring tools out there, both from the open-source community and from commercial vendors.
The open-source community has a number of very capable software solutions. Prometheus is well known and able to ingest and process metric data; its time-series database is very efficient. Prometheus works well in most workloads but doesn’t scale very well for very large workloads. In contrast, CloudWatch has gigantic scaling capabilities built-in, so scale should never be a worry. In addition, Prometheus can only deal with metrics that are actually reported; if the agent running on the EC2 instance has some problems and doesn’t report metrics to the Prometheus server, you’re out of luck. You will also need a separate piece of software for visualization, such as Grafana.
Other software such as Zabbix and Nagios focuses more on system monitoring, including probing and periodic testing of whether services are up or not. The main difficulty with both is that they are quite complex to set up and maintain.
Third-party solutions are likewise numerous and include Sumo Logic and Epsagon. Most of these are very good, and going into detail on them would be beyond the scope of this article. Generally speaking, these solutions offer turn-key, easy-to-use features to analyze your metrics.
The way CloudWatch treats missing data can be annoying, especially if you want missing data points to always trigger an alarm. Because CloudWatch may fall back on non-missing data points from the evaluation range, the alarm might not go off. Nevertheless, CloudWatch Alarms is a very powerful and capable solution. It certainly holds its ground against dedicated solutions, such as Epsagon, especially if your needs are simple, which is the case for most workloads.
Additionally, AWS CloudWatch integrates very well with other AWS services, both for input (generating alarms) and output (taking action when alarms go off). In conclusion, you should probably evaluate CloudWatch in light of your requirements as the first port of call for monitoring your workload, especially if it is run on AWS.