As the adoption of microservices continues to rise (Gartner research suggests 85% of enterprises will be running containerized applications, up from 35% in 2019), the complexity of applications and their architectures rises in tandem. The era of monolithic applications is coming to a close, and those easy-to-depict architectures are evolving into chaotic rats’ nests of interconnected services, databases, message queues, and external APIs.
With this new architecture, we need new tools in order to understand, visualize, and troubleshoot these highly distributed applications – we need distributed tracing. But what does tracing look like when our applications are made up of not just our own service code but also managed cloud services our code communicates with, like DynamoDB, Kafka, and S3? What about the reverse, when those managed cloud services can themselves trigger one of our microservices to run?
In order to provide true end-to-end observability, a good tracing tool must be able to trace not only our own code but also any of the managed services within our environment. For the sake of this post, we’ll call the services running your code “compute” and the managed cloud services “non-compute”.
In this blog, we’ll differentiate between our “compute” and “non-compute” resources and also discuss the important considerations to keep in mind when it comes to deciding on the right distributed tracing implementation for your applications and architecture.
Compute and Non-Compute Resources
Before we get ahead of ourselves, let’s gain some clarity on what is meant by “non-compute” resources, and let’s do that by first defining a “compute” resource. A compute resource is nothing more than a traditional service, written by your engineers in any of a number of programming languages (Node, Python, etc.). You have full access to the code and can therefore implement tooling like distributed tracing directly, either with OpenTelemetry or a managed APM solution like Epsagon.
Non-compute resources, then, are any of the middleware components that your application’s compute resources rely on. This includes both platform-as-a-service (PaaS) technologies such as DynamoDB and infrastructure-as-a-service (IaaS) offerings like S3.
To truly achieve full end-to-end observability within our microservices environment, a tracing tool needs to provide us with not only the ability to trace up to our non-compute resources but also the ability to trace through them. Beyond that, it also needs to provide us with full payload visibility within these traces.
And, ideally, it will allow us to accomplish this automatically, without the need for any manual instrumentation or code changes whatsoever.
Let’s look at both of these requirements and their importance in turn.
Tracing *to* is not enough
Let’s take a look at the example service map above. There are a number of non-compute resources being leveraged within this application, all of which have been auto-instrumented (no code changes) to collect trace data. Now let’s take a look specifically at the S3 bucket here, which is interesting because it is not only being sent data from our Java service, but it is also acting as a trigger for a Lambda further downstream.
This type of implementation is becoming more and more common, where a non-compute resource will trigger code to run, whether that trigger is a Cognito event, a DynamoDB write to a table, or any number of other possibilities. This makes it very important when evaluating a tracing tool as part of your observability strategy to ensure that it can handle this use case; that it can properly trace not only up to but also through these non-compute resources. If not, you’ll be left with broken traces and a frustratingly incomplete picture.
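To make "tracing through" concrete, here is a minimal sketch of the underlying idea: the producer attaches a trace context to the object it writes, and the triggered code reads that context back and continues the same trace. Everything here is an assumption for illustration: the in-memory `bucket`, the function names, and the choice of a W3C-style `traceparent` string are hypothetical, and a real tracing tool may propagate context quite differently.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace using a W3C-style 'version-traceid-spanid-flags' string."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the parent's trace id and mint a fresh span id for the next hop."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Hypothetical in-memory stand-in for a bucket: key -> (body, metadata)
bucket: dict[str, tuple[bytes, dict[str, str]]] = {}


def producer_put(key: str, body: bytes, traceparent: str) -> None:
    # "Compute" side: write the object and attach the trace context as metadata.
    bucket[key] = (body, {"traceparent": child_traceparent(traceparent)})


def triggered_handler(key: str) -> str:
    # Runs when the bucket "fires": read the context back and continue the trace.
    _body, metadata = bucket[key]
    return child_traceparent(metadata["traceparent"])


root = new_traceparent()
producer_put("orders/1.json", b"{}", root)
downstream = triggered_handler("orders/1.json")

# Both hops share one trace id, so the trace is not broken at the bucket.
same_trace = root.split("-")[1] == downstream.split("-")[1]
print(same_trace)  # True
```

If the handler had no way to recover the context, it would have to call `new_traceparent()` instead, and the request would show up as two unrelated traces.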
Observability = Payload ? True : False
When we talk about observability and achieving a true understanding of your microservice-based environments, it means more than just being able to trace through each service in a request chain while collecting only basic information (latency, error rate, etc.) for each. That information is clearly helpful, and necessary, but not sufficient on its own. You also need the full payload data.
This information is essential for a deep understanding of the interactions between your microservices, and it becomes even more critical when something inevitably breaks and impacts your users. One of the key value drivers for customers in adopting distributed tracing is to reduce their MTTR (mean time to resolution) when troubleshooting these highly complex environments. Without that full payload available, you are already at a disadvantage before you even begin troubleshooting.
The value of this payload information also extends to analysis and optimization. In the image above you can see the DynamoDB call not only in the context of the larger trace but also alongside its payload, i.e. the item being created. This type of information helps to more efficiently diagnose issues, like a long-running, inefficient query into a database.
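As a rough illustration of how a payload can end up on a span without touching the calling code, the sketch below wraps a client method once at startup so that every call records its arguments, duration, and any error. The `FakeTable` class, the `SPANS` list, and the `traced` helper are hypothetical stand-ins, not the API of any real SDK or tracing product, and only keyword arguments are captured to keep the example short.

```python
import functools
import time

SPANS: list[dict] = []  # hypothetical in-memory span store


def traced(operation: str):
    """Wrap a client call so its keyword arguments and duration become a span."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                # Recorded whether the call succeeded or raised.
                SPANS.append({
                    "operation": operation,
                    "payload": dict(kwargs),
                    "duration_ms": (time.perf_counter() - start) * 1000,
                    "error": error,
                })
        return wrapper
    return decorator


class FakeTable:
    """Hypothetical stand-in for a DynamoDB-style table client."""

    def __init__(self):
        self.items = []

    def put_item(self, Item):
        self.items.append(Item)


# Patch the method once at startup; the calling code below stays unchanged.
FakeTable.put_item = traced("dynamodb.put_item")(FakeTable.put_item)

table = FakeTable()
table.put_item(Item={"order_id": "o-1", "total_order_value": 42})
print(SPANS[0]["operation"], SPANS[0]["payload"])
```

The design choice worth noting is that the wrapper sits on the client library rather than in the business logic, which is what lets the item being created appear in the trace with no changes to the service code.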
However, even beyond that, we can leverage all of this captured payload information to create robust filters, visualizations, and alerts. We can pull up all of the traces where the payload contained a user_token field equal to some value. We can use that same user_token field to set up alerts for key accounts to ensure that we meet our SLA commitments to our customers. And we can create visualizations from all of this data, allowing us to track key business metrics in addition to all of our technical dashboards. For example, we could aggregate the total_order_value field from the payload and break it down by country to show our sales by geography.
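The filtering and aggregation described here can be sketched in a few lines. The trace records below are made-up sample data, and the `country` field, the record shape, and both function names are assumptions for illustration; only `user_token` and `total_order_value` come from the text above.

```python
from collections import defaultdict

# Hypothetical captured traces, each carrying the payload of one request.
traces = [
    {"trace_id": "t1", "payload": {"user_token": "acme", "total_order_value": 120.0, "country": "US"}},
    {"trace_id": "t2", "payload": {"user_token": "globex", "total_order_value": 80.0, "country": "DE"}},
    {"trace_id": "t3", "payload": {"user_token": "acme", "total_order_value": 50.0, "country": "US"}},
    {"trace_id": "t4", "payload": {"user_token": "initech"}},  # no order in this request
]


def traces_for_user(traces, user_token):
    """Filter: every trace whose payload carries the given user_token."""
    return [t for t in traces if t["payload"].get("user_token") == user_token]


def order_value_by_country(traces):
    """Aggregate: sum total_order_value per country, skipping traces without it."""
    totals = defaultdict(float)
    for t in traces:
        payload = t["payload"]
        if "total_order_value" in payload:
            totals[payload.get("country", "unknown")] += payload["total_order_value"]
    return dict(totals)


acme = [t["trace_id"] for t in traces_for_user(traces, "acme")]
sales = order_value_by_country(traces)
print(acme)   # ['t1', 't3']
print(sales)  # {'US': 170.0, 'DE': 80.0}
```

An alert for a key account would be a small step from here: run the same `user_token` filter over recent traces and notify when any of them carries an error or exceeds a latency threshold.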
Conclusion and Best Practices
Non-compute resources, or managed cloud services to be exact, are an important and growing part of modern microservices architectures. As such, it is equally important when considering an observability strategy to ensure that these resources are properly monitored and accounted for.
This includes the ability to trace through these components, capturing a true end-to-end view into your application that also includes any of these modern trigger-based workflows that you may be using. And of course, these traces should include the detailed payload data to allow for true observability across your entire environment.
Lastly, and perhaps most importantly, this should all be done automatically and out of the box, without any need to manually instrument your code or make any code changes whatsoever. One of the key benefits of microservices is the increase in developer velocity that they allow. At Epsagon, we strongly believe that implementing a quality observability strategy should not diminish these gains in velocity and release cadence. At the end of the day, a good observability strategy should empower developers throughout the team to focus on what’s important: new business features and functionality.