So what is distributed tracing anyway? In the olden days, to debug problems, developers would typically log into the server running the software and inspect the logs, and maybe some real-time metrics, to investigate the issue. Nowadays, the distributed nature of modern architectures makes this impractical (and it wasn’t the best way to begin with). Indeed, when your software runs in parallel on multiple instances or containers, and when deployments are automated and happen without human intervention, you need to devise new methods of investigation.
Today, we need to centralize logs and metrics to allow developers and testers to investigate issues. Distributed tracing goes one step further in the sense that it helps you understand the logical flow of data among many services, making it an ideal method for analyzing problems and performance issues in microservice-based architectures.
The aim of this article is to describe the steps involved in integrating distributed tracing into an application and what DevOps engineers should keep in mind when doing so.
Choosing a Solution
Distributed tracing involves instrumenting the code (either manually or automatically), identifying all microservice calls related to a single request, and sending trace data to a central location. It is therefore made up of a chain of components: code instrumentation, collection of trace data, and, finally, analysis and visualization, all of which we’ll discuss below.
But first, you need to pick a solution. Ideally, all stakeholders (especially developers and DevOps engineers) should consider their options and needs carefully and come to an agreement as to the best solution for everyone involved.
In a nutshell, virtually all distributed tracing solutions use an agent that runs on each instance and sends trace data to a global collector. There are a number of options to choose from, although there aren’t that many open-source projects in this field: Zipkin and Jaeger are probably the two best known. There are quite a lot of commercial offerings as well, including Epsagon.
Unless your requirements are very basic, distributed tracing should be augmented with logs and metrics in order to provide a more complete picture. If the chosen solution is able to link traces with server logs and metric data related to the request being traced, this will usually be extremely helpful to whoever is investigating a given issue.
Code instrumentation is an essential part of distributed tracing. Although developers implement it, the whole team needs to define the strategy beforehand. Make sure to include the DevOps engineers in these discussions, as developers and DevOps engineers need to work together for this endeavor to be successful.
It’s better to stick to widely used industry standards, such as OpenTelemetry, so you can replace elements of the chain in the future if required. Avoiding vendor lock-in is usually an important parameter to keep in mind, although this should always be weighed against the ease and speed of implementation offered by the vendor. It’s also important to keep in mind that you don’t necessarily get what you pay for: some commercial solutions can be quite expensive and yet difficult to get up and running and not easy to use at all.
Code instrumentation can be done either manually, automatically, or via a combination of both. In fact, a combination of both gives you the best of both worlds: the automated instrumentation giving you a baseline that can be augmented with manual traces whenever the need arises.
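To give an idea of what a manual trace looks like, here is a minimal sketch in plain Python. The `span` helper, the `FINISHED_SPANS` list, and the `apply_discount` function are all hypothetical names made up for illustration; in a real project you would use the span API of your chosen tracing library rather than rolling your own.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink; a real setup would hand spans to a tracing SDK.
FINISHED_SPANS = []


@contextmanager
def span(name, **attributes):
    """Record a manually defined span around a block of code."""
    record = {
        "span_id": uuid.uuid4().hex[:16],
        "name": name,
        "attributes": dict(attributes),
        "start": time.monotonic(),
    }
    try:
        yield record
    except Exception as exc:
        # Keep the failure on the span so it shows up during analysis.
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.monotonic() - record["start"]
        FINISHED_SPANS.append(record)


def apply_discount(order_total, customer_tier):
    # A business-specific step that automatic instrumentation would not see.
    with span("apply_discount", customer_tier=customer_tier) as current:
        rate = 0.1 if customer_tier == "gold" else 0.0
        current["attributes"]["discount_rate"] = rate
        return order_total * (1 - rate)
```

Automatic instrumentation would typically cover the HTTP and database calls around this function, while the manual span adds the business context (here, the customer tier and discount rate) that only the developers know is worth recording.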
Each incoming request will be assigned a unique identifier, which will be passed to all calls related to that request. All the traces will then be centralized into a global location, which is described in the next section.
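The identifier mechanism can be sketched as follows. This is only an illustration under a few assumptions: it borrows the `traceparent` header layout from the W3C Trace Context format (version, 32-hex-character trace ID, 16-hex-character span ID, flags), and the function names are hypothetical. A tracing library would normally handle this propagation for you.

```python
import secrets

TRACEPARENT = "traceparent"  # header name from the W3C Trace Context format


def extract_trace_id(headers):
    """Return the caller's trace ID, or None if the header is missing or malformed."""
    parts = headers.get(TRACEPARENT, "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        return parts[1]
    return None


def handle_request(headers):
    """Continue the caller's trace, or start a new one at the edge."""
    trace_id = extract_trace_id(headers) or secrets.token_hex(16)
    span_id = secrets.token_hex(8)  # identifies this service's own unit of work
    return trace_id, span_id


def headers_for_downstream_call(trace_id, span_id):
    """Build the header that carries the trace ID to the next service."""
    return {TRACEPARENT: f"00-{trace_id}-{span_id}-01"}
```

The first service to receive a request finds no header and generates a fresh trace ID. Every service after it reuses that same ID and only generates a new span ID, which is what allows the backend to stitch the individual calls back into a single request.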
Code instrumentation is probably the most tedious and time-consuming step, but you should not neglect it. “Garbage in, garbage out,” as they say! The usefulness of what you get in the end will mostly be based on the quality of the work done and the effort put in by the team during this stage of the project.
Collection of Trace Data
The trace data is usually sent to a local agent (i.e., one running on the same machine) first in order to avoid too much overhead in the application. The agent may perform some filtering and discard unwanted traces. For example, you may want to get the traces related only to a certain type of request that you want to debug, or maybe you’re only concerned with the performance of the system and need to collect data on just 1% of the requests, randomly selected.
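Both filtering strategies can be sketched in a few lines of Python. The function names and the `debug_types` parameter are hypothetical, and the sketch assumes trace IDs are random hexadecimal strings. Deriving the decision from the trace ID, rather than rolling a die in each service, is one common way to make every instance keep or drop the same trace.

```python
def should_sample(trace_id, rate=0.01):
    """Decide from the trace ID alone whether to keep a trace.

    The last 8 hex digits are mapped to a number in [0, 1), so every agent
    reaches the same decision for the same trace.
    """
    bucket = int(trace_id[-8:], 16) / 0x100000000
    return bucket < rate


def keep_trace(trace_id, request_type, debug_types=frozenset(), rate=0.01):
    """Always keep request types under investigation; sample everything else."""
    if request_type in debug_types:
        return True
    return should_sample(trace_id, rate)
```

With `rate=0.01`, roughly 1% of traces are kept, while any request type listed in `debug_types` is always kept regardless of the sampling rate.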
As a DevOps engineer, you want to ensure that the configuration of the local agent is automated and synchronized among all of your instances. The agent will then efficiently package trace data and send it in bulk to a global collector, a centralized service that gathers the traces from all agents.
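The "send it in bulk" part might look roughly like this. It is a simplified sketch, not how any particular agent is implemented: the `SpanBatcher` class is hypothetical, and `send` stands in for whatever call ships a batch to the collector.

```python
class SpanBatcher:
    """Buffer spans locally and hand them to `send` in batches."""

    def __init__(self, send, max_batch=100):
        self._send = send            # callable that ships one batch to the collector
        self._max_batch = max_batch  # flush as soon as this many spans are buffered
        self._buffer = []

    def add(self, span):
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self):
        """Send whatever is buffered; does nothing if the buffer is empty."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)
```

A production agent would most likely also flush on a timer and on shutdown, so that a quiet service does not hold on to its last few spans indefinitely.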
It’s usually a good idea to use the same product end to end to minimize interoperability problems. That being said, for the storage/visualization side of things, many products offer a range of choices and good interoperability. Beyond these choices, your job as a DevOps engineer is to ensure the agent is always installed and configured properly. This should ideally be done in an automated manner: either at provisioning time, when using Infrastructure-as-Code tools such as CloudFormation or Terraform, or via configuration management, when using tools such as Ansible, Chef, or Puppet.
Note that if you use a serverless architecture (such as Lambda functions on AWS or Cloud Functions on GCP), you can’t have an agent running on the same host your code is running on, so an agentless solution would be required.
Storage, Analysis & Visualization
In an ideal world, trace data should be combined with metrics and logs to provide some context and additional useful information. Such specific cross-analysis and presentation can only be offered by specialist software. Generic solutions such as Elasticsearch will struggle to present a coherent platform for detailed and efficient analysis.
In any case, the DevOps team working on this will have to ensure that the backend is highly available and able to scale with the increased workload. An alternative to scaling would be to have, like Jaeger, a feedback mechanism that reduces the sampling rate (i.e., sends less trace data when the workload increases). In addition, details such as data retention and lifecycle will need to be worked out in collaboration with the development team.
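To make the feedback idea concrete, here is a deliberately simple sketch. It is not Jaeger’s actual algorithm; the function name, its parameters, and the proportional adjustment rule are all assumptions chosen for illustration. The backend measures how many traces per second it is receiving and scales the sampling rate so that the volume moves back toward a target.

```python
def adjust_sampling_rate(current_rate, observed_tps, target_tps,
                         min_rate=0.001, max_rate=1.0):
    """Scale the sampling rate so the trace volume moves toward the target.

    observed_tps: traces per second currently reaching the backend.
    target_tps:   traces per second the backend is sized to handle.
    """
    if observed_tps <= 0:
        # Nothing is arriving, so there is no load to protect against.
        return max_rate
    proposed = current_rate * target_tps / observed_tps
    # Clamp so we never sample more than everything or less than the floor.
    return max(min_rate, min(max_rate, proposed))
```

For example, if the backend is sized for 100 traces per second but is receiving 200 at a 10% sampling rate, the function proposes a rate of about 5%. The new rate would then be pushed back to the agents, which is one more reason to keep their configuration automated.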
It should be acknowledged from the start that implementing a robust distributed tracing solution requires quite a lot of work, especially when you try to combine tracing data with logs and metrics. There is definitely an argument to be made that commercial solutions can make things far easier for you. Epsagon can help by providing a turn-key solution suitable for most workloads, even if part or all of your architecture is serverless.
Importantly, keep in mind that implementing distributed tracing that is useful and pertinent to your workload will require tight coordination between your developers and DevOps engineers.