A key component of operational excellence is tracing, or more accurately, distributed tracing. As our applications become more loosely coupled and are composed of more services, resources, and APIs, distributed traces help us understand how those parts communicate. To pinpoint traces efficiently, we can use tagging. Done right, tags are extremely helpful for isolating a single event among tons of information, or for aggregating data.

In this post, we will look at some examples of tagging in traces and demonstrate how to do it using Epsagon.

Application health using distributed traces

Following the new Well-Architected guidance for serverless, let’s examine the OPS 1 question: How do you evaluate your serverless application’s health? The answer starts with tracing.

Let’s take the following retail application:

Serverless Retail Application on AWS

This application is responsible for all of our retail needs: stock management, payment, catalog, and more.

Good tracing can help us turn such a diagram into a real service map with metrics. Service maps allow us to understand the connections in a distributed application and to detect performance issues and bottlenecks.

For troubleshooting, it is also easier to look at one end-to-end trace, including the payload and the relevant logs, than to correlate logs between different services. An example from Epsagon shows how a visual trace simplifies root cause analysis:

Tracing Visualization by Epsagon

Tagging traces

Tracing, especially when automated, is a powerful tool, but sometimes we need to pinpoint a specific event in our application or detect trends along a unique business dimension.

Tagging adds more context to an existing trace in the form of key=value pairs. For example, we can add the tag `userId=123`, which lets us filter all traces that match a specific user in our application. Here are some useful kinds of tags:

  • Identifiers – can help us to pinpoint an event based on our application’s unique identifiers. For example, user ID, customer ID, item ID, etc.
  • Flow control – can help us understand what happened in the code. For example, event type, item category, etc.
  • Business metrics – can help us to understand some unique business KPIs. For example, the quantity of items in a purchase, views of an item, etc.
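
To make the three categories concrete, here is a minimal sketch of how tags for a purchase event might be assembled as plain key=value pairs. The event fields (`customer_id`, `event_type`, `items`) and the `build_tags` helper are hypothetical, chosen only for illustration; they are not part of any tracing library.

```python
def build_tags(purchase):
    """Collect key=value tags for a purchase event (hypothetical field names)."""
    return {
        # Identifiers: pinpoint the event by who or what it concerns
        "customerId": str(purchase["customer_id"]),
        # Flow control: describe which path the code took
        "eventType": purchase["event_type"],
        # Business metrics: values we may want to aggregate later
        "itemCount": sum(item["quantity"] for item in purchase["items"]),
    }


purchase = {
    "customer_id": 123,
    "event_type": "checkout",
    "items": [{"sku": "A1", "quantity": 2}, {"sku": "B7", "quantity": 1}],
}
tags = build_tags(purchase)
print(", ".join(f"{key}={value}" for key, value in tags.items()))
```

Keeping tag construction in one small function like this makes it easier to keep key names consistent across services, which matters once you start filtering on them.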

Tags help us in the following scenarios:

  • Correlate incidents and customers – Looking for a problem that happened to a specific user/customer in our application.
  • Insights into the customer experience – Understanding the performance metrics of a specific event in our system.
  • Business trends – Looking at trends for business KPIs.

Tagging traces with Epsagon

Let’s apply the previous scenarios to our blog site application. With Epsagon, adding tags to the current trace in a Lambda function is straightforward:

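import epsagon


def handler(event, context):
    # Tag the current trace with the calling user's ID, read from the request headers
    epsagon.label('userId', event['headers']['user'])
    # Tag the current trace with the type of event being handled
    epsagon.label('eventType', event['body']['event'])
    ...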
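
In practice, the incoming event may not always have the shape this snippet assumes. With an API Gateway proxy integration, for instance, `body` arrives as a JSON string rather than a dictionary. The following is a sketch of a more defensive variant under that assumption. The `tag_event` helper and its `label` parameter are hypothetical; in a Lambda function you would pass `epsagon.label`, while here a stand-in recorder is used so the example runs on its own.

```python
import json


def tag_event(event, label):
    """Label the current trace with userId and eventType when present.

    `label` is any callable taking (key, value); pass epsagon.label in a
    real handler. Missing or malformed fields are skipped, not raised.
    """
    headers = event.get("headers") or {}
    body = event.get("body") or {}
    if isinstance(body, str):
        # Assumption: a string body is JSON, as with API Gateway proxy events
        try:
            body = json.loads(body)
        except ValueError:
            body = {}
    if not isinstance(body, dict):
        body = {}

    if "user" in headers:
        label("userId", headers["user"])
    if "event" in body:
        label("eventType", body["event"])


# Stand-in for epsagon.label so the sketch runs without the library
recorded = {}
tag_event(
    {"headers": {"user": "123"}, "body": '{"event": "analyze"}'},
    label=recorded.__setitem__,
)
print(recorded)
```

Skipping a missing field instead of raising keeps a tagging problem from failing the request itself; whether that trade-off is right depends on how much you rely on the tags.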

With these two labels in place, we record the user ID and event type of every request that our main Lambda function handles. Now let’s use the trace search capability to list all errors that happened to our beloved user “123”:

Trace Search
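
Conceptually, this kind of search is a filter over tagged traces. The sketch below models that idea with a hypothetical in-memory `Trace` record. It does not use Epsagon's search API, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: str
    is_error: bool
    tags: dict = field(default_factory=dict)


def errors_for_user(traces, user_id):
    """Return the failed traces tagged with the given userId."""
    return [t for t in traces if t.is_error and t.tags.get("userId") == user_id]


traces = [
    Trace("t1", True, {"userId": "123", "eventType": "analyze"}),
    Trace("t2", False, {"userId": "123", "eventType": "view"}),
    Trace("t3", True, {"userId": "456", "eventType": "analyze"}),
]
print([t.trace_id for t in errors_for_user(traces, "123")])  # ['t1']
```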

Now we can easily drill down to the end-to-end trace of every error that happened to this user. Let’s also see how many “analyze” events we had, and which was the slowest one:

trace analyze
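
The same question (how many traces carry a given tag value, and which one was slowest) can be sketched as a small aggregation. The trace dictionaries, the `duration_ms` field, and the `summarize_event_type` helper are hypothetical and serve only to illustrate the idea; a tracing product would run this kind of query for you.

```python
def summarize_event_type(traces, event_type):
    """Count traces with the given eventType tag and find the slowest one."""
    matching = [t for t in traces if t["tags"].get("eventType") == event_type]
    # default=None covers the case where no trace carries this tag value
    slowest = max(matching, key=lambda t: t["duration_ms"], default=None)
    return len(matching), slowest


traces = [
    {"id": "t1", "duration_ms": 120, "tags": {"eventType": "analyze"}},
    {"id": "t2", "duration_ms": 950, "tags": {"eventType": "analyze"}},
    {"id": "t3", "duration_ms": 40, "tags": {"eventType": "view"}},
]
count, slowest = summarize_event_type(traces, "analyze")
print(count, slowest["id"])  # 2 t2
```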

Conclusion

Operational excellence with distributed applications – both serverless and microservices – is not an easy task to accomplish.

The new Well-Architected framework is a good place to learn and gain more experience. Tracing and tagging are an important part of it, and they can help you search, pinpoint, and analyze events across the most complex applications.