For many companies around the world, websites have become a core part of their business model. If the servers hosting these websites crash, pages break, or load times balloon, the result is frustrated customers and lost revenue. That’s why the field of “Site Reliability Engineering,” or SRE for short, was created.
SRE is meant to keep a website running smoothly. SRE teams monitor the servers and check if performance is acceptable, systems are stable, and all requests can be handled reliably.
SRE is also about reducing costs. If you monitor your whole infrastructure, you get a better sense of how much of that infrastructure you actually need. It could very well be that you spent money on too many servers, or on servers more powerful than your workload requires. Overestimating capacity is a common problem in the software-as-a-service business and can get expensive quickly.
In this blog, we’ll explain what SRE teams should be monitoring, following the “Four Golden Signals.” Then, we’ll deep dive into how Epsagon, a microservice-native APM platform, enables SRE and engineering teams to instantly understand, monitor, and troubleshoot these complex systems using the “Four Golden Signals”.
Prefer a video? Feel free to check out our session “Monitoring the Golden Signals for SRE in Microservices” presented at EuropeClouds Summit.
The Four Golden Signals
For SRE, how you measure a system’s status can differ depending on the services your company offers, whether that’s a simple website or a fancy video streaming service. You need to identify which metrics matter to your customers and focus on those, but finding these metrics can be a difficult task. Luckily, SRE practitioners have identified four metrics that are crucial for any web company to monitor.
The “Four Golden Signals” are not the only metrics worth measuring, but if money is tight, you should focus on at least these four: latency, traffic, errors, and saturation.
Latency

Latency refers to how long a process takes to complete. In terms of a UI, it could be the delay from when someone clicks a “load data” button to when the data is finally displayed.
Many things can have an impact on latency:
- What type of data is it?
  - Does it have to be calculated on demand? Is it inherently big, like video?
- Where is the data stored?
  - Is it on the client’s device already? Is it somewhere across the globe? Is it in a SQL database or a key/value store?
- How fast are the data sources in delivering data?
  - Are we talking two-digit milliseconds or multiple seconds? Maybe even minutes?
- How should the data be displayed?
  - Will it be rendered as is? Will the client generate some graphics for it? Will it be a simple 2D chart or a complex 3D animation?
A click on a button isn’t really just a click; it’s an event that travels from the tip of a user’s finger all the way around the world to a server and then back to that user’s eyes.
The job of an SRE team is to measure how high latency is and improve it when it gets too high. Depending on the SRE scope, this can include network latency, database latency, or even the time it takes to render the data on a client.
Finding the bottleneck in a specific action isn’t a trivial problem, and guessing can lead to wrong assumptions that cost a lot of money.
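To make the latency signal concrete, here is a minimal sketch in plain Python (illustrative names and sample data; a real system would collect durations through a metrics library or an APM, not by hand) of turning raw request durations into the percentile cuts SRE dashboards usually track:

```python
import statistics

def latency_percentiles(durations_ms):
    """Summarize request latencies into the percentile cuts dashboards usually show."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    q = statistics.quantiles(durations_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# A few hypothetical request durations in milliseconds, including two slow outliers
samples = [12, 15, 14, 200, 18, 16, 13, 17, 950, 14, 15, 16]
print(latency_percentiles(samples))
```

Percentiles matter here because averages hide outliers: a handful of multi-second requests barely moves the mean but shows up clearly in p95 and p99, which is where frustrated users live.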
Traffic

Traffic is the demand on your system. The exact metric depends on your service, but common examples include HTTP requests per second sent to your website or API, network I/O, and the number of parallel active sessions.
This metric gives a rough overview of “how much” is happening for a given period of time. Are more users active in the morning or the evening? The weekend or workdays? Summer or winter?
Sudden changes in traffic can indicate good or bad things.
- If your traffic goes to zero after you deploy a new version, the chances are good that you messed up your monitoring.
- If your traffic suddenly spikes, it could be that a marketing campaign worked well, and you got new customers.
- If you get unusually high traffic on protected parts of your website, you could have security problems.
Overall, traffic gives you a sense of which parts of your service are in high demand and which parts are of little interest to your users.
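As a rough illustration of the traffic signal (standard-library Python only; the helper names are hypothetical), the sketch below buckets request timestamps into per-minute counts and flags minutes with no traffic at all, the zero-traffic symptom described above:

```python
from collections import Counter
from datetime import datetime

def requests_per_minute(timestamps):
    """Bucket request timestamps into per-minute counts, a basic traffic signal."""
    buckets = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(buckets.items()))

def silent_minutes(counts, expected_minutes):
    """Minutes with no traffic at all, which often means a broken release or broken monitoring."""
    return [m for m in expected_minutes if counts.get(m, 0) == 0]

ts = [
    datetime(2021, 5, 1, 10, 0, 5),
    datetime(2021, 5, 1, 10, 0, 40),
    datetime(2021, 5, 1, 10, 2, 1),
]
counts = requests_per_minute(ts)
window = [datetime(2021, 5, 1, 10, m) for m in range(3)]
print(silent_minutes(counts, window))  # the 10:01 bucket is empty
```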
Errors

Monitoring errors over a given period of time helps to identify how well your system works and how well it is being used.
There are different types of errors:
- One of your services is down: a bug crashes a server, and requests fail until it is restarted.
- One of your services is overloaded: a service can only handle so many requests, and everything above that threshold is counted as an error.
- Some clients misuse the service: if clients send invalid requests, this could be an issue with documentation or versioning. In this case, your services are okay, but your customers still won’t get what they want.
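The classification above can be sketched in a few lines of Python. The bucket names are illustrative, and in practice this breakdown would come from your APM rather than hand-rolled code, but it shows why splitting errors by class matters: server-side failures and client misuse call for very different responses.

```python
from collections import Counter

def error_breakdown(status_codes):
    """Classify HTTP responses and return the share of each class."""
    buckets = Counter()
    for code in status_codes:
        if code >= 500:
            buckets["server_error"] += 1  # our service failed or is overloaded
        elif code >= 400:
            buckets["client_error"] += 1  # the caller misused the service
        else:
            buckets["success"] += 1
    total = len(status_codes)
    return {k: v / total for k, v in buckets.items()}

print(error_breakdown([200, 200, 200, 404, 500]))
```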
Saturation

Saturation measures the usage of your system relative to its maximum capacity, be it memory, I/O, or CPU time.
This metric can help determine the optimal provisioning of infrastructure. You don’t want to pay for resources you aren’t using, but you also don’t want to have a system that is at maximum saturation all the time, as the next traffic spike could lead to an overload.
In well-designed systems, overloads are handled gracefully. But many systems are not well designed, which can lead to a cascade of problems: if one server goes down due to an overload, the remaining servers suddenly get flooded and can crash under the additional traffic.
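A minimal sketch of the saturation signal follows. The 0.8 and 0.3 thresholds are purely illustrative defaults, not recommendations; sensible values depend on your workload and how spiky your traffic is.

```python
def saturation(used, capacity):
    """Usage relative to maximum capacity, as a 0..1 ratio."""
    return used / capacity

def capacity_advice(used, capacity, high=0.8, low=0.3):
    """Flag both over-saturation (spike risk) and chronic under-use (wasted spend)."""
    s = saturation(used, capacity)
    if s >= high:
        return "scale up"   # the next traffic spike could overload the system
    if s <= low:
        return "scale down"  # paying for capacity that sits idle
    return "ok"

# e.g. 27 GB of memory used on a 32 GB machine
print(capacity_advice(27, 32))
```

This captures the trade-off described above in both directions: running hot risks cascading failure, while running cold is the capacity overestimation that makes SaaS infrastructure expensive.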
Monitoring the Golden Signals with Epsagon
Observability is a crucial part of Site Reliability Engineering as systems become more distributed and complex. Epsagon is a microservice-native APM platform with out-of-the-box visualizations that provide context between traces, metrics, and logs all in one unified view. Let’s see how monitoring the golden signals takes place when using Epsagon.
In highly complex distributed environments, we also want to pinpoint where in our services time is being spent. This is critical both during cloud transformations and for continued service excellence: given that some 40% of users will leave a site or application whose latency reaches 3 seconds or more, latency monitoring becomes crucial for protecting revenue and ensuring a satisfactory end-user experience.
Service maps can be a one-stop shop for pinpointing where time is most often being spent within these large-scale applications, helping you decide where to focus your efforts to fix potential inefficiencies.
A tracing solution should also pinpoint where, in the code, time is being spent. Is a significant portion of time being spent waiting for an external API to respond, or perhaps we have an inefficient database call being made that needs refactoring? These waterfall visualizations make it easy. At Epsagon, our waterfall visualization can also provide you with the rich context of the header metadata to begin doing real-time troubleshooting.
Similarly, when it comes to traffic, we need to not only look at the traffic of our application holistically (which is, of course, still very important), but we also have to look at it in the context of each service and component. If we have a payments service that handles all of our processing, and the traffic there drops to zero, it’s a huge problem — even if traffic in every other service of our application is fine.
And so we need the flexibility to slice and dice this data in any number of ways. We may want to look at traffic at the application level, the service level, or even the resource-type level.
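As a toy illustration of per-service slicing (hand-rolled Python for clarity, not Epsagon’s API), the sketch below aggregates traffic per service so that a single silent service, like the dead payments service above, stands out even when overall traffic looks healthy:

```python
from collections import defaultdict

def traffic_by_service(requests):
    """Count requests per service so that one silent service stands out."""
    counts = defaultdict(int)
    for req in requests:
        counts[req["service"]] += 1
    return dict(counts)

def silent_services(requests, known_services):
    """Services that received no traffic at all in this window."""
    seen = traffic_by_service(requests)
    return [s for s in known_services if seen.get(s, 0) == 0]

reqs = [{"service": "checkout"}, {"service": "search"}, {"service": "checkout"}]
print(silent_services(reqs, ["checkout", "search", "payments"]))
```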
It’s a similar idea when it comes to errors. Not all errors are the same: an error that is critical in service A might be safely ignored in service B. So, to really understand the error state of your application, you need to be able to look at it from several different viewpoints. Maybe that’s breaking errors out by status code (a user being rate limited is important information, but it’s not as critical as seeing a flood of 500s).
Further, we may want to look at errors not based on the error code but based on the type of exception that is being generated.
There are several ways you may want to visualize, track, and understand errors throughout your system, and it’s the job of a good observability tool to allow you to do just that.
When it comes to saturation, the picture gets muddled. In many cases, the infrastructure is abstracted away and handled entirely by a cloud provider (Fargate, or serverless), or it’s handled by Kubernetes, which is in charge of keeping enough resources spun up based on predefined rule sets. And so, a good observability tool will provide you with deep insights into what is happening at the resource level for something like your Kubernetes implementation.
You’ll want to be able to see information on clusters, nodes, pods, and containers, as well as deployments. You’ll want to understand resource usage at each level: things like how much memory or CPU your cluster is using, or how much network traffic and how many disk reads/writes are hitting each node. And you’ll want full context into the associated traces (those tied to a given cluster or a given pod) and container logs as well. A good observability tool will provide this out of the box.
It’s going to be a very similar story when we talk about containers hosted outside of Kubernetes, whether that’s on something like ECS (leveraging Fargate, EC2) or otherwise. You’ll again want to have that full insight into what is going on at the resource level for your clusters, services, instances, and tasks. You’ll want the ability to dive into the traces associated with a given cluster or service, and you’ll want to be able to dig into container-level logs to understand what’s happening there.
As with Kubernetes, a good observability tool will provide this level of insight and the correlation required to realize maximum value and efficiency.
Site Reliability Engineering is crucial for any company that provides online services, since websites are a core part of their business model, be it an API for developers, software as a service, or simply a store selling physical products over the Internet. The “Four Golden Signals” are paramount for Site Reliability Engineering and should be the focus points when you start measuring metrics in this area.
In a world where applications are becoming increasingly complex, it’s also important to have the tools needed to evolve in that direction. When it comes to monitoring the “Four Golden Signals” and understanding them on a service-by-service basis, true observability will only continue to grow in importance.