Introduction to Distributed Tracing

In “2001: A Space Odyssey,” the course of human history is managed by strange and forbidding black monoliths. Monolithic applications can often seem as strange and enigmatic as Kubrick’s stone slabs. Both types of monoliths often have no clear purpose, and it is impossible to figure out how they work. One of the film’s central characters is HAL 9000, a large sentient mainframe that goes haywire in spectacular fashion.

Since the film’s release over fifty years ago, computer hardware has been miniaturized, and software has been virtualized and distributed across multiple locations. In the process, what were once large, fragile, and hard-to-maintain monolithic applications have been replaced by smaller, more robust microservices. While this has solved many problems with the old ways of building and maintaining software, it has also caused a large number of newer and perhaps more intractable problems that can’t be fixed using traditional methods.

In this article, we look at distributed tracing, a new way of dealing with the issues of modern software. We describe the problem, explain the solution, and present key concepts and available tools to help you get started.

Why Request Tracing Is Broken

Troubleshooting used to be far easier when applications were not distributed. That is no longer the case. Here we look at how times have changed.

The Monolithic Approach

Monolithic software is built upon a large and sprawling legacy codebase that is often so tightly coupled that a change in one small section can break one or several features that depend on it. As a result, these apps are fragile. When they do break, we have a number of tried and tested methods for troubleshooting and debugging them.

One of the key methods for finding a problem is tracing. A trace follows the course of a request or system event from its source to its ultimate destination. Each trace is a narrative that tells the request’s story as it travels through your system. For example, many applications generate a stack trace following a runtime error. The trace is stored in an application log, displayed in a console/terminal window, and then analyzed and inspected via development tools such as Microsoft’s Visual Studio and Apple’s Xcode.
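To make the stack trace example concrete, here is a minimal Python sketch of a single-process application that logs a stack trace after a runtime error. The function names (`load_order`, `handle_request`, `run`) and the logger name are hypothetical, and the log is captured in memory only to keep the example self-contained.

```python
import io
import logging

# Capture log output in memory so the example is self-contained;
# a real application would typically log to a file or the console.
log_buffer = io.StringIO()
logger = logging.getLogger("monolith_demo")
logger.setLevel(logging.ERROR)
logger.addHandler(logging.StreamHandler(log_buffer))
logger.propagate = False


def load_order(order_id):
    # Hypothetical lookup that fails for every order, to force an error.
    raise KeyError(f"order {order_id} not found")


def handle_request(order_id):
    return load_order(order_id)


def run(order_id):
    try:
        return handle_request(order_id)
    except KeyError:
        # logger.exception records the message plus the full stack trace.
        logger.exception("request failed")
        return None


run(42)
stack_trace = log_buffer.getvalue()
print(stack_trace)
```

The logged trace lists every function the request passed through on its way to the error. This works because the whole call path lives inside one process, which is exactly the assumption that stops holding once the application is distributed.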

Despite their numerous shortcomings, monolithic apps have a number of advantages over microservices and other types of distributed applications. The former tend to reside and stay in a single location, run on specific devices, and have defined relationships to external systems. As a result, when things go wrong, you can find the origin of the request and trace its path through the system with relative ease. In our brave new world of cloud-based microservices, however, things are no longer so simple.

Modern Apps and Tracing Today

Today, software can run on almost any device, including wearables, mobile devices, desktops, and servers. Thanks to virtualization technologies such as hypervisors and containers, a single host can run multiple virtual machines and applications. Meanwhile, serverless computing has thrown a new level of abstraction into the mix. But knowing where your software runs is only part of the problem. Modern apps use multiple web services and APIs, some of which originate within your organization, while many may be purchased from third-party providers. To complicate matters further, as the price of cloud computing continues to drop and the quality of service improves, many companies will migrate their existing on-premises software to the cloud.

Despite the advantages of this new environment, the rapid pace of change means that many developers have found it hard to move existing monolithic applications to the cloud as they are. Instead, they break their legacy software into microservices, which are not only better suited to distributed computing but are also more reliable. However, it’s on those rare occasions when microservices stop working that their downside becomes apparent.

First, it’s hard to know whether your problem is an internal defect or the result of external factors, such as unpredictable behavior that results from automatic scaling. Even if the problem is internal, you then have to find where the affected service runs and which version of it the problem occurred on. Often, multiple versions of a service can be running across different servers and locations. All of these factors, combined with numerous others, make it almost impossible to use existing tracing techniques to debug microservices and similar technologies. If you don’t know where your app runs, which services it’s using, and the path taken by a request, you simply can’t trace the event. You could try to use existing application monitoring solutions to shed light on a given problem. But while this approach can be helpful, it often generates more noise than signal, and it can be extremely difficult to locate a specific issue in the massive amounts of recorded data.

Fixing the Problem With Distributed Tracing

Distributed tracing is a new and improved form of tracing that you can use to profile and monitor microservice-based apps/architectures, locate failures, and improve performance. Instead of tracking the path within a single application domain, distributed tracing follows the progress of a single request from its point of origin to its final destination. As the request is followed across multiple systems and domains, distributed tracing takes into account the processes, APIs, and services it interacts with.
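One way to picture how a single request can be followed across systems is to attach an identifier to it at its point of origin and forward that identifier on every downstream call. The sketch below simulates three services as plain Python functions. The header name `x-trace-id`, the service names, and the in-memory `collected` list are all assumptions made for illustration; real tracing tools define their own headers and collectors.

```python
import uuid

# Hypothetical header name, chosen for this sketch only.
TRACE_HEADER = "x-trace-id"

# Stand-in for a shared collector; each entry is (trace_id, service, event).
collected = []


def record(trace_id, service, event):
    collected.append((trace_id, service, event))


def inventory_service(headers):
    # Downstream service: reuse the ID that arrived with the request.
    trace_id = headers[TRACE_HEADER]
    record(trace_id, "inventory", "checked stock")
    return {"in_stock": True}


def checkout_service(headers):
    trace_id = headers[TRACE_HEADER]
    record(trace_id, "checkout", "received order")
    # Forward the same ID on the outgoing call so the path stays linked.
    result = inventory_service({TRACE_HEADER: trace_id})
    record(trace_id, "checkout", "order confirmed")
    return result


def gateway(request_headers=None):
    # Point of origin: mint an ID only if the request does not carry one.
    headers = dict(request_headers or {})
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    record(headers[TRACE_HEADER], "gateway", "request accepted")
    checkout_service(headers)
    return headers[TRACE_HEADER]


trace_id = gateway()
path = [service for tid, service, _ in collected if tid == trace_id]
print(path)  # prints ['gateway', 'checkout', 'inventory', 'checkout']
```

Because every service records events under the same ID, filtering the collected data by that ID reconstructs the request's path, no matter which service or host produced each event.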

To understand distributed tracing, it helps to know a few key concepts. It all starts with a single request, the entity or event being traced. As the request makes its journey, it generates a trace: a complete record of the processing operations performed on it by entities within a distributed system or network infrastructure. Each trace is assigned its own unique ID and is made up of spans (segments), each of which indicates a given activity that a host system performs on the request. Every span represents a single step within the request’s path and has a name, unique ID, and timestamp. A span can also carry additional metadata.
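These concepts can be made concrete with a small data model. The sketch below is illustrative only and is not the schema of any particular tracing tool: the `Span` class, its field names, and the `child_of` and `render` helpers are assumptions. It also assumes the common convention that each span stores the ID of its parent span, which lets the steps of a request be rebuilt as a tree.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str                      # shared by every span in one trace
    name: str                          # the step this span represents
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)  # start timestamp
    end: Optional[float] = None
    attributes: Dict[str, str] = field(default_factory=dict)  # metadata

    def finish(self) -> None:
        self.end = time.time()


def child_of(parent: Span, name: str, **attributes: str) -> Span:
    """Create a span in the same trace, linked to its parent."""
    return Span(trace_id=parent.trace_id, name=name,
                parent_id=parent.span_id, attributes=dict(attributes))


def render(spans: List[Span], parent_id: Optional[str] = None,
           depth: int = 0) -> List[str]:
    """Return the spans as an indented tree, one line per span."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append("  " * depth + span.name)
            lines.extend(render(spans, span.span_id, depth + 1))
    return lines


root = Span(trace_id=uuid.uuid4().hex, name="GET /checkout")
auth = child_of(root, "auth.verify", user="demo")
auth.finish()
db = child_of(root, "db.query", table="orders")
db.finish()
root.finish()

tree = render([root, auth, db])
print("\n".join(tree))
```

All three spans share one trace ID, while each has its own span ID, start time, and optional metadata. The printed tree shows the root span with the two steps it triggered nested beneath it.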

When it comes to implementing distributed tracing, there are many available options. First, there are a number of distributed tracing engines, many of which are free and open-source. The engine collects the request, trace, and span data and helps you present, analyze, and visualize this data. Popular tools include Jaeger and Zipkin. In addition to these tools, there are frameworks and libraries that you can use to extend your existing monitoring and tracing tools or to build your own solutions. In this category, OpenTracing and OpenCensus are popular choices (the two projects have since merged into OpenTelemetry). Alternatively, you could use an automated, cloud-based solution, such as Epsagon.

Conclusion: Distributed Tracing for Distributed Apps

At the start of this article, we saw that microservices are a much better solution for building cloud-based services than monolithic applications. However, those benefits come at a cost: distributed software is harder to debug and fix. Since modern software is no longer in one place, but distributed across a wide network, using traditional tracing methods only tells you part of the story. You could try to fall back on existing methods, such as logging and monitoring. But the amount of data collected this way can be overwhelming, potentially leading you to multiple issues, but not the specific one that you are looking for. The solution to this problem is distributed tracing.

Over the course of this article, we looked at the transition from monolithic to distributed application architectures, which has given rise to new solutions such as microservices. Next, we looked at why traditional tracing techniques are not a good fit for the challenges of mobile, cloud-based, and distributed environments. Having understood the problem, we then looked at how and why distributed tracing solves these issues. Specifically, we saw how distributed tracing follows the path of a single request through a system and records each step as a span within a trace. Lastly, we presented some of the tools, frameworks, and services you can use to implement distributed tracing and create software better suited to serving your customers’ and your organization’s needs.

Stay tuned for our next posts, where we will demonstrate how to use popular distributed tracing frameworks and examine the pros and cons of doing so.