With everything in computing these days becoming distributed, it was only a matter of time before somebody created a distributed version of software tracing and called it distributed tracing. To understand why this new form of tracing was created, let’s start by defining what we mean by tracing.

Software tracing is an approach to monitoring and logging that uses instrumentation and monitoring systems to show what your software is doing and how it is doing it. These tools track transactions and record data from running applications so you can find problems, monitor execution, and measure system performance. You can analyze this data immediately to locate, troubleshoot, and debug problems, or you can store it and use it to detect longer-term trends and issues.

Most traditional tracing systems were designed around monolithic applications. But with the recent rise of cloud-based distributed computing and serverless computing, along with the transition to microservices, the utility of these systems has been increasingly questioned. In this article, we go back to the origins of tracing and explain how it became distributed. We also examine the benefits of distributed tracing, how you can get started, and what tools exist to help you.

Tracing: Old vs. New

There are a number of problems with traditional software tracing. First, there is no clear definition of what tracing means or how it differs from other forms of logging. Does tracing refer to the stack traces generated when an application crashes, to general system logs, or to performance data? A bigger problem is that most forms of tracing were designed neither for the cloud and distributed computing environments that exist today nor to track large numbers of transactions across multiple contexts.

Today, monolithic applications are being decomposed into microservices and/or serverless functions. While you can still use traditional tracing to test and debug individual services, it was not designed to handle many services working together. In particular, it struggles to follow the path of a single request as it passes through multiple services, and the information it does record lacks the contextual data needed to find and fix problems or measure performance.

Distributed tracing is better suited to microservices and distributed applications. It provides a way to track requests and transactions in this new environment, so you can follow the progress of a single request from origin to destination. Each request refers to a single entity or event being traced, and the path of that request through a system is called a “trace.” Each trace is made up of multiple segments called spans, where a span is a single step in the processing of the request.
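
To make the terminology concrete, here is a minimal sketch of the trace/span data model, assuming a simple design in which every span carries the ID of the trace it belongs to and the ID of its parent span. The `Span` class, its field names, and the service names are hypothetical; OpenCensus and OpenTracing each define their own span types with more fields.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One step in the processing of a request (hypothetical, simplified model)."""
    trace_id: str                    # shared by every span in the same trace
    name: str                        # the operation or service handling this step
    parent_id: Optional[str] = None  # None marks the root span of the trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def child(self, name: str) -> "Span":
        """Start a new span in the same trace, parented to this one."""
        return Span(trace_id=self.trace_id, name=name, parent_id=self.span_id)

    def finish(self) -> None:
        self.end = time.time()

# One request = one trace: a root span plus a child span for each downstream step
root = Span(trace_id=secrets.token_hex(16), name="api-gateway")
auth = root.child("auth-service")
auth.finish()
orders = root.child("order-service")
db = orders.child("orders-db-query")
db.finish()
orders.finish()
root.finish()

trace = [root, auth, orders, db]
for span in trace:
    print(span.name, "parent:", span.parent_id)
```

Because every span records its parent, a tracing backend can rebuild the full tree of a request from a flat list of spans that arrive from different services.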

As a request passes through each span, you can record contextual data from every participating service and process. You can then use this data to profile and monitor microservice-based applications and architectures, locate failures and problems, and diagnose performance issues.
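
Recording data from every participant depends on passing the trace context from one service to the next, typically in request headers. The sketch below assumes the W3C Trace Context `traceparent` header format, which this article does not discuss; it is one common propagation format, not the only one. The `inject` and `extract` helpers are hypothetical names, and the parser accepts only version `00` headers while skipping the specification’s other validation rules.

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current span's context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

def extract(headers: dict):
    """Return (trace_id, parent_span_id, sampled) from incoming headers, or None."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    return trace_id, parent_id, (int(flags, 16) & 1) == 1

# Service A starts a trace and calls service B (the HTTP headers are simulated by a dict)
trace_id, span_a = secrets.token_hex(16), secrets.token_hex(8)
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B reads the context and creates its own span as a child of A's span
ctx = extract(outgoing)
span_b = secrets.token_hex(8)
print("B joins trace", ctx[0], "as child of", ctx[1])
```

Without this hand-off, each service would start an unrelated trace and the request’s path could not be reconstructed.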

Distributed Tracing Frameworks: OpenCensus vs. OpenTracing

The easiest way to get started with distributed tracing is to use a tracing framework. Each framework provides resources that let you implement a distributed tracing solution. A framework gives you what you need to instrument your software components and integrate them with your existing software. It also gathers application metrics and distributed traces and sends them to a backend for processing and analysis. At the core of each framework are libraries that provide application programming interfaces (APIs) for various languages, platforms, and environments.
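
As a rough sketch of how a framework’s pieces fit together, the code below separates the instrumentation API (a `span` context manager that application code calls) from an exporter that ships finished spans to a backend. `Tracer` and `ListExporter` are hypothetical names, not the OpenCensus or OpenTracing classes, and the exporter simply keeps spans in a list where a real one would send them over the network.

```python
import time
from contextlib import contextmanager

class ListExporter:
    """Stand-in for a backend: keeps finished spans in memory instead of sending them."""
    def __init__(self):
        self.spans = []

    def export(self, span: dict) -> None:
        self.spans.append(span)

class Tracer:
    """Hypothetical minimal tracer: times spans, tracks nesting, passes them to an exporter."""
    def __init__(self, exporter):
        self.exporter = exporter
        self._open = []  # names of currently open spans (single-threaded sketch)

    @contextmanager
    def span(self, name: str, **attributes):
        parent = self._open[-1] if self._open else None
        self._open.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            self._open.pop()
            # Export on finish, whether the wrapped code succeeded or raised
            self.exporter.export({
                "name": name,
                "parent": parent,
                "duration_s": time.perf_counter() - start,
                "attributes": attributes,
            })

exporter = ListExporter()
tracer = Tracer(exporter)
with tracer.span("handle_request", route="/orders"):
    with tracer.span("load_from_db"):
        pass  # application work would happen here

for s in exporter.spans:
    print(s["name"], "parent:", s["parent"])
```

Keeping the exporter separate from the instrumentation is what lets the same application code report to different backends. A production tracer would also need thread-safe or async-aware context tracking rather than a plain list.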

Currently, there are two major frameworks: OpenCensus and OpenTracing. Both provide APIs for implementing distributed tracing, and while they have much in common, they also differ in approach and implementation.

Note that in May 2019, OpenCensus and OpenTracing announced their intention to merge into a single project called OpenTelemetry, which aims to provide a unified platform for distributed tracing and other forms of application monitoring. While this initiative should simplify choosing a framework and getting started, the project is still at an early stage and is beyond the scope of this article.

Overview

OpenCensus is based on Google’s internal Census platform for recording traces and metrics from its services. Google open-sourced this project and made the source code available on GitHub. Although OpenCensus is best known as a distributed tracing framework, its website describes it as a telemetry platform that can perform a range of monitoring tasks including logging and measuring performance.

OpenCensus provides a standard set of libraries for most major programming languages. Apart from Google, OpenCensus is supported by a number of major software companies, including Microsoft and VMware.

OpenTracing is an open project supported by the Cloud Native Computing Foundation (CNCF). Like OpenCensus, OpenTracing grew out of Google’s work in this space: it takes its inspiration from a white paper describing Google’s Dapper project. At its core, OpenTracing aims to provide a standardized approach to instrumentation and distributed tracing. However, while it encourages standardization, OpenTracing explicitly states that it is not a standard.

Unlike OpenCensus, it does not provide an official tracing implementation or client applications. Instead, it is a specification, backed by thin per-language API libraries, that shows how other libraries and frameworks can implement the OpenTracing API. The stated aim of this approach is to avoid vendor lock-in.
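
The sketch below illustrates the idea behind this design under a simplified, assumed API: application code is written only against an abstract interface, a no-op tracer is installed by default, and a concrete implementation can be swapped in at startup. All class names here are hypothetical; the real OpenTracing API defines its own `Tracer` and `Span` types with many more methods.

```python
from abc import ABC, abstractmethod

class SpanAPI(ABC):
    @abstractmethod
    def finish(self) -> None: ...

class TracerAPI(ABC):
    """Vendor-neutral interface that application code is written against."""
    @abstractmethod
    def start_span(self, name: str) -> SpanAPI: ...

class NoopSpan(SpanAPI):
    def finish(self) -> None:
        pass

class NoopTracer(TracerAPI):
    """Default when no vendor tracer is installed: instrumentation does nothing."""
    def start_span(self, name: str) -> SpanAPI:
        return NoopSpan()

class RecordingSpan(SpanAPI):
    def __init__(self, name: str, sink: list):
        self.name, self.sink = name, sink

    def finish(self) -> None:
        self.sink.append(self.name)

class RecordingTracer(TracerAPI):
    """Stand-in for a vendor implementation that would report spans to its backend."""
    def __init__(self):
        self.finished = []

    def start_span(self, name: str) -> SpanAPI:
        return RecordingSpan(name, self.finished)

global_tracer: TracerAPI = NoopTracer()

def checkout():
    # Application code depends only on the API, never on a specific vendor
    span = global_tracer.start_span("checkout")
    span.finish()

checkout()                         # no-op tracer: nothing is recorded
global_tracer = RecordingTracer()  # swap in an implementation at startup
checkout()
print(global_tracer.finished)
```

Switching vendors then means replacing one line of startup code rather than re-instrumenting every service.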

Getting Started

When it comes to getting started, both platforms provide a comprehensive range of documentation, guides, and related resources. 

OpenCensus provides quickstart guides for a range of popular languages, including C#, C++, Java, Node, and Python. For example, the Python quickstart explains how to download, install, and configure the necessary tools, dependencies, and client applications.

Once you have everything in place, the tutorial provides a wide range of sample code snippets. For example, this Python snippet shows how to create a span that traces the publishing of a message to a RabbitMQ queue with the pika client.

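import logging
import sys
import pika
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.tracer import Tracer
# Record every span (always-on sampling)
tracer = Tracer(sampler=AlwaysOnSampler())
# Assumes a RabbitMQ broker running on localhost
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)

# Publish a persistent message (delivery_mode=2) inside a span named "trace"
with tracer.span(name="trace"):
    message = ' '.join(sys.argv[1:]) or 'Hello World!'
    channel.basic_publish(exchange='', routing_key='task_queue', body=message, properties=pika.BasicProperties(delivery_mode=2))
    logging.info("Sent %s", message)
connection.close()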

The OpenTracing website provides a more concise approach with a couple of short examples of getting started with either Java or Go. In line with OpenTracing’s decentralized approach, the site provides links to guides for other major languages such as C#, Python, and Node.

Here is a short code sample that shows how to create a basic trace and span with OpenTracing:

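import opentracing

# opentracing.tracer is the global tracer: a no-op unless a concrete tracer (e.g., Jaeger's) is installed
tracer = opentracing.tracer
span = tracer.start_span('test')
name = 'world'
test_msg = 'Test, %s!' % name
span.log_kv({'event': 'message-built', 'message': test_msg})  # attach a structured log to the span
print(test_msg)
span.finish()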

Data Management & Analysis

Once you have decided on a framework, you will need a way to view and manage your data. This is where tools such as Zipkin and Jaeger can help. These tools are community-supported, free, and open-source platforms that let you view and analyze your distributed tracing data. Both Zipkin and Jaeger provide basic components that manage the distributed tracing process. First, they provide a collector that retrieves data from client applications and services. Next, they provide a storage component that persists the collected data to a data store for future analysis. Finally, both Zipkin and Jaeger give you a query service that lets you locate information for a specific trace and span. 
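
To show how the three components relate, here is a toy, in-process sketch of a collector, a storage component, and a query service. It is not how Zipkin or Jaeger are implemented (they run these roles as separate components backed by real data stores); the class names, span fields, and the `find_traces` query are hypothetical.

```python
from collections import defaultdict

class InMemoryTraceStore:
    """Storage: keeps spans grouped by trace ID (a real backend uses a database)."""
    def __init__(self):
        self._by_trace = defaultdict(list)

    def save(self, span: dict) -> None:
        self._by_trace[span["trace_id"]].append(span)

    def get_trace(self, trace_id: str) -> list:
        return list(self._by_trace.get(trace_id, []))

    def traces(self):
        return self._by_trace.items()

class Collector:
    """Collector: accepts span batches reported by services and drops malformed spans."""
    REQUIRED = {"trace_id", "span_id", "name", "duration_ms"}

    def __init__(self, store: InMemoryTraceStore):
        self.store = store
        self.rejected = 0

    def receive(self, spans: list) -> None:
        for span in spans:
            if self.REQUIRED.issubset(span):
                self.store.save(span)
            else:
                self.rejected += 1

def find_traces(store, name, min_duration_ms):
    """Query: IDs of traces containing a span called `name` at least this slow."""
    return sorted(
        trace_id
        for trace_id, spans in store.traces()
        if any(s["name"] == name and s["duration_ms"] >= min_duration_ms for s in spans)
    )

store = InMemoryTraceStore()
collector = Collector(store)
collector.receive([
    {"trace_id": "t1", "span_id": "a", "name": "api-gateway", "duration_ms": 120},
    {"trace_id": "t1", "span_id": "b", "name": "order-service", "duration_ms": 95},
    {"trace_id": "t2", "span_id": "c", "name": "api-gateway", "duration_ms": 40},
    {"trace_id": "t3", "name": "malformed"},
])
print(find_traces(store, "api-gateway", min_duration_ms=100))
```

A typical workflow is to search for slow traces first and then fetch every span of one trace (here with `get_trace`) to see where the time went.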

Conclusion: Taking the Next Steps 

Tracing lets you record data generated by running applications and then use it when you need to locate a problem or improve your software’s performance. Software tracing has been around for a long time and can still prove useful in many situations. But as more organizations embrace distributed software, the limitations of this methodology come into focus. Specifically, traditional tracing cannot track large numbers of transactions across multiple contexts.

Distributed tracing is designed to handle the transition from monolithic applications to cloud-based distributed computing as an increasing number of applications are decomposed into microservices and/or serverless functions. Distributed tracing lets you track the path of a single request through multiple services. This not only gives you the logging data you need but also provides sufficient contextual information to find and fix problems or measure performance.
