Recently, distributed tracing has emerged as an approach to application monitoring that lets you track the path of a single request across multiple systems. To understand the benefits of distributed tracing, this post will look at how the following methods can be used to troubleshoot an e-commerce website, how they differ, and how they compare.

Monitoring and troubleshooting traditional, monolithic applications is hard enough. But monitoring and troubleshooting a distributed, microservices-based application can seem almost impossible. Troubleshooting distributed applications is like watching a Marvel movie or Game of Thrones when all you care about is a single character.

Traditionally, there have been two approaches to application logging and troubleshooting. The first approach is to collect and analyze logging data from individual microservices. The second is to aggregate logs from multiple sources, process them using a tool or service, and then analyze the results. While both of these approaches work well for monolithic applications, they do not scale for microservice-based distributed systems.

Use Case: E-Commerce

The site we’re working with lets a customer check for an item’s availability. If that item is not available, the service will notify them when it is in stock. To implement this scenario, we’ll create two microservices. The first microservice will let customers subscribe to item updates.

The following examples are in Python:

import pika, sys, logging
from python_logging_rabbitmq import RabbitMQHandler

# Ship this service's log records to RabbitMQ (AMQP listens on port 5672 by default).
rabbit = RabbitMQHandler(host='localhost', port=5672)
logger = logging.getLogger('send_message')
logger.setLevel(logging.INFO)
logger.addHandler(rabbit)

# Publish the subscription message to a durable task queue.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)

message = ''.join(sys.argv[1:])
channel.basic_publish(exchange='',
                      routing_key='task_queue',
                      body=message,
                      properties=pika.BasicProperties(delivery_mode=2))
logger.info("Sent %s", message)
connection.close()

The second microservice sends out notifications when this condition is met:

import pika, sys, logging
from python_logging_rabbitmq import RabbitMQHandler

# Ship this service's log records to RabbitMQ (AMQP listens on port 5672 by default).
rabbit = RabbitMQHandler(host='localhost', port=5672)
logger = logging.getLogger('receive_message')
logger.setLevel(logging.INFO)
logger.addHandler(rabbit)

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)
print('Service launched')

def callback(ch, method, properties, body):
    # Log the notification request and acknowledge the message once handled.
    logger.info("Received %r", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue='task_queue', on_message_callback=callback)
channel.start_consuming()

These messages will be handled by the RabbitMQ message queue. To monitor and debug the code, we will use individual logging and monitoring, log aggregation using the Kibana tool, and a “homebrew” distributed tracing system that builds on the first two methods. Each is discussed below.

Method 1: Individual Logging and Monitoring

This method investigates issues by isolating individual components and/or services. You analyze the output of each service on its own by reviewing the monitored data that has been collected. This lets you investigate a specific problem in its original context.

At the application level, you can use the data generated when the application is compiled or run. Depending on your environment, this data can be accessed as compiler errors (Java, C#, etc.) or, if you are using dynamic languages (Python, Ruby, JavaScript), as runtime traces.

At the host level, this data is available as system resource logs or can be instrumented using third-party agents/services. At the system level, you can gather data from network and infrastructure resources and logs from external systems, databases, and APIs, such as RabbitMQ.
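To make this concrete, here is a minimal sketch of configuring an isolated, per-service log file with Python's standard library. The file naming and log format are illustrative assumptions, not part of the services above:

def get_service_logger(service_name):
    """Configure an isolated, per-service log file (illustrative sketch)."""
    import logging
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(f"{service_name}.log")  # e.g. send_message.log
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

logger = get_service_logger("send_message")
logger.info("Sent item-1234 subscription")

Each service then produces its own self-contained log file, which is exactly the kind of per-service data this method asks you to review.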

For example, if your application crashed, you could check the RabbitMQ crash log:

2019-04-03 14:06:10 =SUPERVISOR REPORT====
     Supervisor: {<0.68.0>,user_sup}
     Context:    child_terminated
     Reason:     epipe
     Offender:   [{pid,<0.69.0>},{mod,user_sup}]
2019-04-03 14:06:10 =ERROR REPORT====
** Generic server <0.68.0> terminating
** Last message in was {'EXIT',<0.69.0>,epipe}
** When Server state == {state,user_sup,undefined,<0.69.0>,{<0.68.0>,user_sup}}
** Reason for termination ==
** epipe

The benefit of this approach is that it can utilize a wealth of readily available data. The disadvantage is that it doesn’t scale. When using individual logs, you can find low-level, isolated issues, but you neglect the wider context in which these problems occur.

This approach doesn’t take dependencies into account, i.e. the services/components that the individual service relies on. Moreover, a single trivial issue in one component might be amplified by other components/services. As a result, the interdependent nature of distributed systems becomes masked, which can lead to cascading problems.

Method 2: Log Aggregation

Like Method 1, this approach collects data from multiple sources, such as application trace files, host and system logs, agents, and external systems. Instead of looking at this information in isolation, this method uses third-party tools to stream logs to a central service and aggregate the collected data.

Log aggregation provides tools that let you search the data, match patterns, and present that data using a variety of analysis/visualization tools. This cross-system approach shows you the impact of system-wide problems and gives you a better understanding of dependencies and interdependencies.

In addition, most cloud-based services can provide almost unlimited log storage that lets you take a much longer-term perspective. To illustrate this point, we are using Kibana, a popular open-source log aggregation tool. Kibana provides a number of methods for viewing aggregated log data.
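As a rough sketch of how a service might emit aggregation-friendly output, the snippet below writes one JSON object per log line, which a shipper such as Filebeat or Logstash could then forward into Elasticsearch for viewing in Kibana. The field names are illustrative assumptions:

import json, logging, sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so an aggregator can parse the fields."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "service": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("send_message")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info("Sent item-1234 subscription")

Structured output like this makes the search and pattern-matching features of an aggregation tool far more useful than free-form text lines.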

For example, here is a screenshot of Kibana’s log aggregation screen:

Kibana log aggregation screen

While log aggregation is better suited to handling distributed system software and microservices than the previous method, it still suffers from a number of scaling issues.

First, it requires the installation and configuration of additional tools and agents, an onboarding process that can be long and complicated. These systems are also usually designed to monitor and troubleshoot at the host and system levels, not for distributed systems and microservices.

Method 3: DIY Distributed Tracing

In this method, we will build on the first two approaches to create our own distributed tracing system. This requires combining existing systems and integration code with open distributed tracing frameworks, such as OpenCensus or OpenTracing.

This distributed system will enable us to uniquely identify requests and payloads and monitor them as they pass through defined spans. This approach shows the impact of the request on any host system and can be used for both server-based and serverless workloads.

Below, our first example has been modified to work with the OpenCensus framework and export its traces to Zipkin:

import pika, sys, logging
from python_logging_rabbitmq import RabbitMQHandler

from opencensus.trace.tracer import Tracer
from opencensus.ext.zipkin.trace_exporter import ZipkinExporter
from opencensus.trace.samplers import always_on

# Export every trace to a local Zipkin instance.
ze = ZipkinExporter(service_name="dr-test", host_name='localhost', port=9411, endpoint='/api/v2/spans')
tracer = Tracer(exporter=ze, sampler=always_on.AlwaysOnSampler())

def main():
    # Ship this service's log records to RabbitMQ.
    rabbit = RabbitMQHandler(host='localhost', port=5672)
    logger = logging.getLogger('send_message')
    logger.setLevel(logging.INFO)
    logger.addHandler(rabbit)

    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='task_queue', durable=True)

    # Wrap the publish in a span so the request can be traced end to end.
    with tracer.span(name="main"):
        message = ''.join(sys.argv[1:])
        channel.basic_publish(exchange='',
                              routing_key='task_queue',
                              body=message,
                              properties=pika.BasicProperties(delivery_mode=2))
        logger.info("Sent %s", message)
    connection.close()

if __name__ == "__main__":
    main()
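For completeness, here is a minimal sketch of how the second (notification) microservice could be instrumented the same way, creating a span for each message it consumes. Note that this sketch starts a new trace per message rather than propagating the sender's trace context through the message headers, which a full implementation would also need to handle; the service name "dr-test-consumer" is an illustrative assumption:

import pika, logging
from opencensus.trace.tracer import Tracer
from opencensus.ext.zipkin.trace_exporter import ZipkinExporter
from opencensus.trace.samplers import always_on

ze = ZipkinExporter(service_name="dr-test-consumer", host_name='localhost',
                    port=9411, endpoint='/api/v2/spans')
tracer = Tracer(exporter=ze, sampler=always_on.AlwaysOnSampler())

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('receive_message')

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)

def callback(ch, method, properties, body):
    # Record the handling of each message as its own span.
    with tracer.span(name="handle_message"):
        logger.info("Received %r", body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue='task_queue', on_message_callback=callback)
channel.start_consuming()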

To view the results of the traces, you can use a trace monitoring system, such as Zipkin. The following shows all the traces detected by Zipkin:

Traces detected by Zipkin

This approach is a definite improvement over the previous two. Using this method, you can start integrating distributed tracing into your workflow. However, like the previous two approaches, it also suffers from scaling issues. It can also end up being less than the sum of its parts, because it is not always possible to repurpose your existing systems and code for new uses.

In addition, building your own distributed tracing solution requires further coding and integration work for which your organization might not have the necessary resources or experience available.

Conclusion: The Ultimate Distributed Tracing Solution

In this article, we looked at three approaches to debugging distributed systems. All of these approaches have much to recommend them, but none of them provides a truly scalable solution. Using individual logs lets you locate low-level, isolated issues, but it’s not much use when trying to understand the wider context in which these problems occur.

Log aggregation lets you look at multiple sources of data, but instead of focusing on the path of a specific request, you end up wading through an excessive amount of irrelevant data. DIY distributed tracing is a step in the right direction, but getting useful information out of it depends on putting in the time, effort, and resources to build something truly useful.

The good news is that there is a fourth and better approach that gives you the ultimate solution. Instead of trying to repurpose your existing tools or methods or building your own, you can use a cloud-based service such as Epsagon. Epsagon provides everything you need to perform automated distributed tracing through major cloud providers without having to write a single line of code. This solution can also handle synchronous events, asynchronous events, and message queues, such as RabbitMQ.

By choosing Epsagon, you can automatically monitor any requests generated by your software and track them across multiple systems. Any data recorded by the distributed system can also be viewed, analyzed, and presented in a number of visual formats and charts. This makes it a natural fit not just for the examples in this article, but for server and serverless environments as well, including AWS Lambda, message queues, and AWS Fargate, to name just a few of the many possibilities.

To start a free trial with Epsagon, click here.