Business applications have to be up and running, and issues need to be fixed quickly — this well-known fact also applies, of course, to companies running distributed and high-scale applications in production today.
These requirements lead to the following operational practices:
- Monitoring: making sure that the business application is working;
- Troubleshooting: fixing production performance issues quickly (when done in the dev/staging environments, it is sometimes called debugging).
These two requirements are critical: they directly affect the revenue generated by the company’s software, as well as customer satisfaction.
Tracing/Logging/Instrumentation – Who Cares?
Traditionally, different people in the organization were responsible for each of these two requirements: monitoring was the responsibility of operations, while troubleshooting eventually escalated to engineering.
To accommodate these responsibilities, two types of tools are being used:
- Application Performance Monitoring (APM)
- Log aggregation
The two types of tools are very different. APM tools use techniques such as code instrumentation to extract events from the application as it is running, monitor application-level metrics, and alert the user when something is wrong.
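To make code instrumentation concrete, here is a minimal sketch in Python. It is not how any particular APM product works: the `instrument` decorator, the in-memory `EVENTS` list, and the `charge_card` function are all hypothetical names invented for this illustration. The idea is simply that a wrapper records an event for every call, and a metric (here, the error rate) is computed from those events.

```python
import functools
import time

# Hypothetical in-memory event sink; a real APM agent would ship events to a backend.
EVENTS = []


def instrument(func):
    """Wrap func so each call emits one event with its name, duration, and outcome."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Remember the failure type, then let the exception propagate unchanged.
            error = type(exc).__name__
            raise
        finally:
            EVENTS.append({
                "operation": func.__name__,
                "duration_ms": (time.perf_counter() - start) * 1000.0,
                "error": error,
            })
    return wrapper


def error_rate(events):
    """Application-level metric: fraction of recorded calls that raised."""
    if not events:
        return 0.0
    return sum(1 for e in events if e["error"] is not None) / len(events)


@instrument
def charge_card(amount):
    # Hypothetical business function used only to exercise the decorator.
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {"charged": amount}
```

An alert would then be a threshold check on `error_rate(EVENTS)` or on the recorded durations.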
Log-aggregation tools digest huge amounts of data, store it and allow quick searching and tagging. The logs, of course, can also be used for monitoring, but the quality of the monitoring will depend on the quality of the data which is found in these logs.
The Shift to Microservices
In recent years, companies have begun to design, develop and operate microservices applications. They offer several benefits, including developer velocity and software scalability. While this is an exciting transition, the basic requirements of monitoring and troubleshooting haven’t changed substantially.
Let’s see how the shift to microservices affects this, both from an operational perspective and from a cultural perspective.
An architectural shift
Surprisingly, the tools haven’t changed. Monitoring is still dominated by APM, and troubleshooting by log aggregation tools.
But the circumstances have changed! As applications become more distributed, two things happen:
- Monitoring every service individually is no longer enough. A business transaction involves more than one service — it sometimes involves dozens of them. Therefore, to monitor the health of the business application, it is necessary to monitor the distributed flow from end-to-end;
- Logs are written individually by each service. As a consequence, they don’t tell the story of a microservices architecture. When pushing all the logs to a central location, the result is, usually, a big mess. It doesn’t allow engineers to resolve issues quickly, hurts their productivity and impacts the customer experience.
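The first point can be illustrated with a small sketch. The budgets, service names, and span format below are assumptions made up for this example, and summing durations assumes the services are called one after another. Every service stays within its own budget, yet the business transaction as a whole does not.

```python
# Hypothetical latency budgets, in milliseconds.
SERVICE_BUDGET_MS = 200
TRANSACTION_BUDGET_MS = 500


def service_alerts(spans):
    """Per-service view: list the services whose own latency exceeds the budget."""
    return [s["service"] for s in spans if s["duration_ms"] > SERVICE_BUDGET_MS]


def transaction_alert(spans):
    """End-to-end view: True when the summed latency exceeds the transaction budget."""
    # Summing assumes the services are invoked sequentially within one transaction.
    return sum(s["duration_ms"] for s in spans) > TRANSACTION_BUDGET_MS


# One business transaction that crosses three (hypothetical) services.
checkout = [
    {"service": "api", "duration_ms": 180},
    {"service": "orders", "duration_ms": 190},
    {"service": "payments", "duration_ms": 170},
]
```

Here `service_alerts(checkout)` is empty while `transaction_alert(checkout)` is true, which is exactly the situation that per-service monitoring misses.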
A cultural shift
Beyond the fact that monitoring metrics and aggregating logs is not an effective way to monitor and troubleshoot distributed applications, there is another interesting shift: organizations that adopt microservices require modern technologies, which very often run in the cloud. Such technologies include containers and function-as-a-service (e.g. AWS Lambda).
As they become more prominent, the result is that developers are becoming more powerful in the company. Serverless is the perfect example. A serverless application is a composition of code and configuration deployed directly to production. When and how are Ops involved? As a matter of fact, sometimes they are not. And the result? The developers become accountable for the software they are writing and deploying, which means — yes — monitoring it.
What it means for developers
But wait — developers are used to working with logs. So they find workarounds. They use their log aggregation tools for monitoring distributed business flows. And guess what? It doesn’t work. And when they realize it, they start to spend time implementing distributed tracing techniques in their logs — if you’ve heard about writing Correlation IDs in the logs and then searching for them, you’re not the only one. I’ve seen it over and over again during the last year. The time being spent is enormous.
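For readers who haven’t seen the Correlation ID workaround, here is a rough sketch of what it usually looks like. The service functions, the `CENTRAL_LOGS` list (standing in for the log aggregation tool), and the log format are all hypothetical. Every service has to accept the ID, write it into each log line, and forward it to the next hop by hand.

```python
import json
import uuid

# Stand-in for the central log store; in practice this is the log aggregation tool.
CENTRAL_LOGS = []


def log(service, message, correlation_id):
    """Emit one structured log line that carries the correlation ID."""
    CENTRAL_LOGS.append(json.dumps({
        "service": service,
        "correlation_id": correlation_id,
        "message": message,
    }))


def payment_service(order, correlation_id):
    log("payment", f"charging order {order['id']}", correlation_id)


def order_service(order, correlation_id=None):
    # The first service in the flow creates the ID; every later hop must forward it.
    correlation_id = correlation_id or str(uuid.uuid4())
    log("order", f"received order {order['id']}", correlation_id)
    payment_service(order, correlation_id)
    return correlation_id


def find_flow(correlation_id):
    """Reassemble one business flow by scanning every log line for the ID."""
    records = (json.loads(line) for line in CENTRAL_LOGS)
    return [r for r in records if r["correlation_id"] == correlation_id]
```

The sketch works, but note what it costs: every function signature and every log call carries the extra parameter, and a single service that forgets to forward the ID breaks the chain.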
Unified Tracing and Logging
When we founded Epsagon, we decided to take a different approach to tracing, monitoring and logging. Everything is a trace, and a log is just another type of trace. Using this methodology, we were able to design and implement a unified product for monitoring and troubleshooting.
Distributed tracing, distributed logging
Distributed tracing is the main approach to troubleshooting distributed applications
Using automated distributed tracing, every request that comes into the system is traced by Epsagon from end-to-end. Every payload is tracked and can be examined to get insights about the issue that we’re trying to solve.
However, it’s not just payloads. Next to each trace, there is a button named “See logs,” which will then show the corresponding log entry (in case this is an AWS Lambda function, this is the CloudWatch log), which will complete the story.
Logs are coupled with traces
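One way to picture "a log is just another type of trace" is a single record type shared by traced operations and log lines. The sketch below is only an illustration of that idea, not Epsagon’s actual data model; the `Event` class, its field names, and the `"span"`/`"log"` labels are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One record in a trace: either a traced operation or a log line."""
    trace_id: str
    service: str
    kind: str          # "span" or "log" (labels chosen for this sketch)
    timestamp: float
    payload: dict = field(default_factory=dict)


def group_by_trace(events):
    """Group events by trace ID and order each trace by timestamp."""
    traces = {}
    for event in events:
        traces.setdefault(event.trace_id, []).append(event)
    for trace in traces.values():
        trace.sort(key=lambda e: e.timestamp)
    return traces


def logs_for(trace):
    """A 'see logs' style view: only the log records of a single trace."""
    return [e for e in trace if e.kind == "log"]
```

Because both kinds of record share a `trace_id`, moving from a traced request to its log entries is a filter rather than a separate search in a different tool.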
Searching the data
Drilling down into a particular request is a great way to root-cause complex problems. However, sometimes you don’t know which request to look at. If you have any kind of information about the issue you are looking for (an HTTP body field, an SNS payload, or anything else), you can search for it. Built on top of Elasticsearch, Epsagon automatically indexes any payload or log so you can search for it. Then, you can jump directly to the corresponding transaction.
Elasticsearch as the main database to search across traces and logs data
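The search flow can be sketched without Elasticsearch at all. The code below is a deliberately naive in-memory stand-in (a linear scan, not an index), and the record format and function names are hypothetical. It shows the shape of the operation: look for a value anywhere in any payload or log, then return the trace IDs to jump to.

```python
def flatten(value):
    """Yield every scalar value found anywhere inside a nested payload, as a string."""
    if isinstance(value, dict):
        for item in value.values():
            yield from flatten(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten(item)
    else:
        yield str(value)


def search(records, term):
    """Return trace IDs whose payload values contain term (case-insensitive)."""
    term = term.lower()
    hits = []
    for record in records:
        if any(term in text.lower() for text in flatten(record["payload"])):
            # Keep first-seen order and avoid listing the same trace twice.
            if record["trace_id"] not in hits:
                hits.append(record["trace_id"])
    return hits
```

A real search backend replaces the linear scan with an index, but the input (a fragment of a payload or log) and the output (the transactions that contain it) stay the same.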
It’s all about developer velocity. Companies are betting on AWS Lambda to optimize their most expensive resource — engineers (no, not their cloud bill). Don’t manage infrastructure, just write software and deploy it. It makes total sense.
However, these companies are not taking into consideration the fact that these highly effective developers are now spending more and more of their time doing Ops.
Takeaway: Unified Future
We should target a unified dev+ops culture. A culture where Dev, Ops, and DevOps are working together to solve common problems. In this world, where teams are working together, better integration of tools makes a lot of sense.
Ultimately, the important things are time to market and a high-quality customer experience. To achieve this, modern cloud technologies can be utilized, but the mindset of monitoring vs. logging has to change. In this world of distributed cloud applications and mixed teams, it’s time to think of distributed tracing as an enabler for both operations and developers.