Microservice architecture introduces operational complexity when it comes to monitoring service-to-service communication and diagnosing performance issues. From an observability perspective, you need in-depth visibility into your systems so that you can debug problems easily and recover from failures faster.

In this comparison of distributed tracing vs. logging, we discuss techniques to improve the observability of services in a distributed world. As we transition from monoliths to microservices, it is important to understand the difference between distributed tracing and logging, the implementation challenges, and how we can build a consolidated approach using logs and traces to debug distributed systems effectively.

Why are Tracing and Logging Important?

Observability has evolved in the journey from monoliths to microservices. In monolithic systems, a transaction happens on a single machine, and traditional logging generally provides the full execution stack trace, which can assist in troubleshooting any service error. However, as the industry adopts microservice architectures, logging alone can no longer troubleshoot issues effectively. You now have to handle multiple services communicating with each other and keep track of how a request traverses various services and functions.

What is a Log? 

A log can be defined as a specific timestamped event that happened to your system at a particular time. When there is an application issue, logs are your best friends and help to identify errors and understand what exactly went wrong. Troubleshooting issues is difficult without access to application logs. 

But one problem with logging is the sheer amount of data that is logged and the difficulty of searching through it all efficiently. Storing and parsing log data is expensive, so it’s crucial to log only information that can help you identify issues and to keep the volume manageable. Logging levels allow you to categorize log messages into priority buckets. It’s critical to assign log messages to levels such as Error, Warn, Info, Debug, and Trace, as this helps developers understand the data better and set up the necessary monitoring alerts. With most logging frameworks, the active level can be changed on the fly through configuration, without a change to the application source code.
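As a rough illustration of level filtering and of changing the level at runtime, here is a minimal sketch using Python's standard logging module. The logger name "orders" and the messages are hypothetical, and note that Python's standard library has no built-in Trace level, so Debug is the lowest level shown.

```python
import io
import logging

# Hypothetical logger name and in-memory stream, used only for this illustration.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger
logger.setLevel(logging.WARNING)

logger.debug("cache miss for order 42")    # dropped: below WARNING
logger.error("payment gateway timed out")  # kept

# Raise verbosity at runtime, for example in response to a config reload.
logger.setLevel(logging.DEBUG)
logger.debug("cache miss for order 42")    # now kept

print(stream.getvalue())
```

In a real service, the call to setLevel would typically be triggered by a configuration change or an admin endpoint rather than hard-coded.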

Logging also consumes disk space, so you need to strike a balance between the level of detail you capture and the noise it creates.

Structured Logging 

Traditional logs are unstructured text, which makes them challenging to query. Structured logging solves this problem by storing the records in a format that can be easily parsed. JSON is the most common format for structured logging, and you can use a standard logging library, such as log4j, log4net, or slf4j, to produce structured records and send them to a central log management system. Below is an example of a structured log record as it might be sent to the log management system:

{
  "timestamp": "1590104690016",
  "container_id": "b114fd8746783e9971598a5f594f",
  "container_name": "/ecs-customer",
  "source": "stdout",
  "level": "ERROR",
  "message": "Setup of JMS message listener invoker failed for destination",
  "logger": "org.springframework.jms.listener.DefaultMessageListenerContainer",
  "thread": "DefaultMessageListenerContainer-421721",
  "class": "org.springframework.jms.listener.DefaultMessageListenerContainer"
}

Structured logging allows you to easily use your system for monitoring, troubleshooting, and business analytics. Having a standardized way of logging goes a long way in achieving consistency and provides better insight into your system.
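To make the idea concrete, the sketch below emits a JSON record with a custom formatter built on Python's standard logging module. It is only one possible approach, not the method used by the libraries named above. The JsonFormatter class and the "customer-service" logger name are assumptions made for this example, and the field names simply mirror a subset of the sample record.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": str(int(record.created * 1000)),  # epoch milliseconds
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "thread": record.threadName,
        }
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("customer-service")  # hypothetical service name
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.INFO)

logger.error("Setup of JMS message listener invoker failed for destination")

# Because each line is valid JSON, a log pipeline can parse it without regexes.
entry = json.loads(stream.getvalue())
print(entry["level"], entry["message"])
```

Once every service emits records in the same shape, queries such as "all ERROR entries from one logger" become simple field filters rather than text searches.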

What is a Trace?

A distributed trace is defined as a collection of spans. A span is the smallest unit in a trace and represents a piece of the workflow in a distributed landscape. It can be an HTTP request, a call to a database, or the processing of a message from a queue.

A trace provides visibility into how a request is processed across multiple services in a microservices environment. Every trace needs to have a unique identifier associated with it. 

From the context of an external request, a trace ID is generated when the first request is made, whereas a span ID is created as the request reaches each microservice. The trace below shows a request that took 6.99 ms and traversed across four services with a total span count of seven.

Figure 1: Tracing an external request using the Jaeger UI
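The relationship between a trace and its spans can be sketched as a small data model. This is a simplified illustration, not how Jaeger or any particular tracer is implemented. The Span and Trace classes, the service names, and the ID lengths are all assumptions chosen for the example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.perf_counter()


class Trace:
    """Collects spans that share one trace ID."""

    def __init__(self) -> None:
        self.trace_id = secrets.token_hex(16)  # generated once, at the first request
        self.spans: List[Span] = []

    def start_span(self, service: str, operation: str,
                   parent: Optional[Span] = None) -> Span:
        span = Span(
            trace_id=self.trace_id,
            span_id=secrets.token_hex(8),  # new ID for every unit of work
            parent_id=parent.span_id if parent else None,
            service=service,
            operation=operation,
        )
        self.spans.append(span)
        return span


trace = Trace()
root = trace.start_span("gateway", "GET /orders")
child = trace.start_span("orders", "SELECT orders", parent=root)
child.finish()
root.finish()

for s in trace.spans:
    print(s.service, s.operation, s.span_id, "parent:", s.parent_id)
```

Every span carries the same trace ID, while the parent ID links spans into the tree that a tracing UI renders as a timeline.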

What’s Distributed Tracing? 

Distributed tracing is a critical component of observability in connected systems and focuses on performance monitoring and troubleshooting. You can use it to find out how long a request took to process and to identify a slow service in a microservice environment. You can also track latency issues and gain valuable insights by following a call through its dependent components across the entire application stack.
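As a toy example of identifying a slow service, the snippet below sums span durations per service and picks the largest. The span records are invented for illustration, and the calculation is deliberately simplified: in real traces, a parent span's duration includes its children, so a production analysis would work with self-time instead.

```python
from collections import defaultdict

# Hypothetical span records: (service, duration in milliseconds).
spans = [
    ("gateway", 1.2),
    ("orders", 0.9),
    ("inventory", 3.4),
    ("orders", 0.6),
    ("payments", 0.89),
]

# Accumulate the time spent in each service across all of its spans.
time_per_service = defaultdict(float)
for service, duration_ms in spans:
    time_per_service[service] += duration_ms

slowest = max(time_per_service, key=time_per_service.get)
print(slowest, time_per_service[slowest])
```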

The Role of Tracing in Microservice Architecture

With the adoption of microservice architecture, distributed tracing is gaining popularity and slowly becoming an essential observability tool to troubleshoot and identify performance issues. As the number of microservices in your organization increases, they introduce additional complexity from a system-monitoring perspective. Metrics and logs by themselves fail to provide in-depth visibility across all the services, and this is where distributed tracing comes to the rescue.

Challenges with Distributed Tracing 

Adding instrumentation to your application code across your entire stack comes with challenges. You will need to add the code to each of the service endpoints, and if your applications are polyglot, the instrumentation may differ slightly from language to language and thus be prone to error.

Distributed tracing in a microservices architecture is beneficial only when you implement it in most of your services. Depending on the number of services you have, the effort to do this can be sizable. In a service mesh architecture, you can run Envoy as a sidecar alongside your service to take care of functionality like tracing with little or no application code change, although services generally still need to forward the trace context headers on their outgoing calls.
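Much of the instrumentation effort comes down to passing trace context from one service to the next. The sketch below shows the general idea using the W3C Trace Context "traceparent" header layout (version, trace ID, parent span ID, flags). The function names are hypothetical, and the validation is simplified compared with the full specification.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# Simplified pattern: version 00, 32-hex trace ID, 16-hex span ID, 2-hex flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id) from incoming headers, if valid."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    return match.group(1), match.group(2)


def inject(trace_id: str, span_id: str) -> Dict[str, str]:
    """Build the header to attach to an outgoing request."""
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}  # 01 = sampled


def handle_request(headers: Dict[str, str]) -> Dict[str, str]:
    """Continue an existing trace or start a new one; return outgoing headers."""
    context = extract(headers)
    trace_id = context[0] if context else secrets.token_hex(16)
    span_id = secrets.token_hex(8)  # this service's own span
    return inject(trace_id, span_id)


first_hop = handle_request({})          # no incoming context: new trace
second_hop = handle_request(first_hop)  # downstream service reuses the trace ID
print(first_hop["traceparent"])
print(second_hop["traceparent"])
```

If any service in the chain drops the header, the trace is split in two, which is why partial adoption delivers limited value.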

Distributed Tracing vs. Logging

Metrics and logging provide context from a single application, whereas distributed tracing helps track a request as it traverses through many inter-dependent applications. Both logs and traces help in debugging and diagnosing issues. Logging and tracing allow you to not only monitor systems in real-time but also go back in time and investigate service issues.

Logs capture the state of the application and are the most basic form of monitoring. Tracing is beneficial when you have a request that spans multiple systems. A trace tells you how long a request took, which components it interacted with, and the latency introduced during each step. It gives you insight into an application’s health end to end. However, traces don’t explain the root cause of a service error or latency. For this, you need to investigate the application logs.

To summarize, tracing helps you pinpoint where the issue is, and logging provides additional details about the service issue.
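One common way to connect the two is to stamp every log line with the active trace ID, so that you can jump from a slow span to the logs written while it ran. The following is a minimal sketch using Python's standard logging and contextvars modules. The TraceIdFilter class, the "inventory" logger name, and the trace ID value are assumptions made for this example.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled ("-" if none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("inventory")  # hypothetical service name
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.INFO)

# In a real service this would be set from the incoming request's trace context.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.error("stock lookup failed: connection refused")

print(stream.getvalue())
```

With the trace ID present in both systems, the trace points to the failing service and a search on that ID in the log store returns the details.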

You’ll need to instrument your application code to enable both logging and tracing. Compared to logging, tracing adds more complexity to the application and is thus more expensive. Based on your application landscape, you can determine if tracing provides added value from a monitoring perspective. If you have a microservices architecture, enabling tracing makes more sense than in a monolithic application. In a distributed system, your development teams will require a combination of logs, traces, and metrics to debug errors and diagnose production issues. 

Scalability and Automation

When adopting cloud technologies, most organizations have two things in mind: infrastructure cost and operational speed. In terms of operational speed, the goal is for the organization to build, deploy, and operate its software faster. The approaches that are popular in the cloud today, such as microservices, APIs, managed services, and serverless, exist to increase this speed, which is often referred to as developer velocity.

DIY distributed tracing is a step in the right direction, but getting valuable information depends on putting in the time, effort, and resources to build something truly useful. Without tool automation, teams end up searching logs by hand for what needs fixing, which is slow. Even open tracing frameworks require extensive training, manual implementation, and maintenance. According to the results of an Epsagon survey of companies using modern cloud technologies, engineers spend 30% to 50% of their building time implementing observability tools. That’s a huge drain on productivity and resources that is often overlooked.

Epsagon

The good news is that there is a better approach that gives you the ultimate solution. Instead of trying to repurpose your existing tools or methods or building your own, you can use a cloud-based service such as Epsagon. Epsagon provides everything you need to perform automated distributed tracing through major cloud providers without having to write a single line of code. This solution can also handle synchronous events, asynchronous events, and message queues.

Epsagon Kubernetes Trace View

By choosing Epsagon, you can automatically monitor any request generated by your software and track it across multiple systems. Any data recorded by the distributed system can also be viewed, analyzed, and presented in a number of visual formats and charts. Using modern, standard approaches to cloud software development can both improve your building speed and reduce the setup and maintenance of observability, as it will be automated by corresponding modern tools.

Conclusion 

In this comparison of distributed tracing vs. logging, we discussed the differences between a log, a structured log, and a trace. We looked at the importance of logging and distributed tracing, their use cases, and the challenges associated with implementing them in a distributed system. With the growth of microservices and containers, monitoring requirements have grown more complex. Metrics, logs, and traces together form the “Three Pillars of Observability” and help to build better production-grade systems.
