In this blog post, we’ll look at how instrumentation enables developers and operators to monitor running serverless applications more effectively and troubleshoot them faster.

Monitoring is crucial to ensure that you notice failures as soon as they happen. But knowing something is wrong is only the beginning. You need further information to troubleshoot the issue properly and fix it successfully.

We’ve all been there. A new service or feature is ready. It’s fast and doesn’t fail. It’s perfect – until it ships. All of a sudden it fails for some users and is slow at times for others. Weeks go by with the developers completely unaware of the problem. Some users complain, while many others simply leave in frustration.

This is the harsh reality of software development: Things fail and perform in unpredictable ways.

What Is Instrumentation?

Fundamentally, instrumentation is the act of taking measurements from an application. It is the foundation on which we monitor, troubleshoot, debug, and profile serverless applications, and on which we come to understand how they function and why they function a certain way.

In practice, instrumentation can be as simple as recording the time at which the execution of a function takes place and logging it to help narrow down where a performance bottleneck exists.
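As a minimal sketch of this idea in Python, the decorator below records how long a function takes and logs the result. The names `timed` and `resize_image` are hypothetical and exist only for illustration; a real application would likely send the measurement to a metrics backend rather than only to a log.

```python
import functools
import logging
import time

logger = logging.getLogger("instrumentation")


def timed(func):
    """Log how long each call to `func` takes, even when it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.2f ms", func.__name__, elapsed_ms)
    return wrapper


@timed
def resize_image(width, height):
    """Hypothetical workload standing in for real application code."""
    time.sleep(0.01)
    return width * height
```

Calling `resize_image(2, 3)` behaves exactly as before, but each call now leaves a timing entry in the log that can help narrow down a bottleneck.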

If an application is consuming more memory than expected, taking samples of memory usage over time (or samples of memory allocations and garbage collection metrics) can also provide valuable information to track down a memory leak. Recording inputs and outputs of either functions or entire systems (e.g., request and response payloads of an HTTP service) can help debug programs with unexpected inputs or incorrect logic. These are just some use cases that show how instrumentation is essential to help pinpoint and solve problems using monitoring and troubleshooting.
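To illustrate the memory-sampling case, here is a rough sketch that uses Python's standard `tracemalloc` module to sample traced memory after each run of a workload. The `sample_memory` helper and the deliberately leaky `leaky_workload` function are assumptions made up for this example.

```python
import tracemalloc


def sample_memory(workload, steps):
    """Run `workload` `steps` times and sample traced memory after each run."""
    tracemalloc.start()
    samples = []
    try:
        for _ in range(steps):
            workload()
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


_cache = []  # hypothetical leak: grows on every call and is never cleared


def leaky_workload():
    _cache.append(bytearray(100_000))


samples = sample_memory(leaky_workload, steps=5)
```

A series of samples that keeps climbing from one run to the next, as it does here, is a hint that something is holding on to memory it should have released.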

It is not uncommon for a developer to instrument code while developing it – sometimes without even realizing it – either by logging some values or printing the execution time of a particular block of code when looking into its performance impact. However, such instrumentation may be short-lived, kept in the form of an unstructured log or enabled only in development environments.

To monitor a service in production, having the instrumented data in logs alone is not enough. One solution is log aggregation, analysis, parsing, and extraction of metrics by way of tools such as Fluentd or Logstash. This data is then sent to a monitoring service.
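Extracting metrics from logs tends to be easier when the logs are structured. The sketch below, which uses only Python's standard `logging` module, writes each record as a JSON line with the measurements as separate fields. The `JsonFormatter` class, the `metrics` key, and the `orders` logger name are assumptions for illustration, and the in-memory stream stands in for standard output or a log file.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Fields passed via `extra={"metrics": {...}}` become top-level keys.
        payload.update(getattr(record, "metrics", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

log.info("order processed", extra={"metrics": {"duration_ms": 41.7, "items": 3}})
```

A log aggregator can then read `duration_ms` directly as a field instead of parsing it out of free-form text.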

Another typical solution is to ship data directly from where it’s instrumented to a monitoring and troubleshooting service such as Epsagon, which is more straightforward to operate because fewer components are involved. This combination of gathering metrics and storing them is typically called telemetry.

Impact of Instrumentation

Now that we know how vital instrumentation is, should you perform it everywhere and gather all the possible metrics you can? No. There can be a high cost associated with instrumentation. After all, no code is faster than no code, and, well, instrumentation is additional code to be run.

Luckily, not all instrumentation is expensive; the cost depends on what you measure. For example, instrumenting every single memory allocation or function/system call isn’t feasible on a live system. This kind of low-level instrumentation should instead be done locally on a development machine.

In a production environment, it’s much better to instrument at a higher level: database operations, remote procedure calls, queue sizes/latency, or anything else at an API level. Let’s look at an event-driven serverless application such as one deployed on AWS Lambda. At a bare minimum, you’ll want to have the execution time, non-sensitive input data, and whether or not the invocation completed successfully.

Such information allows you to identify performance issues, track the success/failure rate of invocations, and capture the input data so that you can reproduce an issue locally and fix it faster. For events with output data such as an HTTP response, it can also be valuable to collect the generated output.
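The bare minimum described above could look something like the following sketch of a wrapper around a Lambda-style handler. It assumes the event is a dictionary. The `instrumented` and `redact` helpers, the `SENSITIVE_KEYS` list, and the record layout are all assumptions for illustration, not part of any AWS or Epsagon API.

```python
import functools
import json
import time

SENSITIVE_KEYS = {"password", "token", "authorization"}  # assumed list


def redact(event):
    """Copy the event, masking sensitive top-level keys."""
    return {k: ("***" if k.lower() in SENSITIVE_KEYS else v)
            for k, v in event.items()}


def instrumented(handler, emit=print):
    """Wrap a handler to record duration, redacted input, and success."""
    @functools.wraps(handler)
    def wrapper(event, context):
        start = time.perf_counter()
        record = {"function": handler.__name__,
                  "input": redact(event),
                  "success": False}
        try:
            result = handler(event, context)
            record["success"] = True
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_ms"] = round(
                (time.perf_counter() - start) * 1000, 3)
            emit(json.dumps(record, default=str))
    return wrapper


def handler(event, context):
    """Hypothetical handler used only to demonstrate the wrapper."""
    return {"statusCode": 200, "body": f"hello {event['user']}"}


lambda_handler = instrumented(handler)
```

Each invocation emits one JSON record, whether it succeeds or raises, so the success/failure rate and the inputs behind a failure are both available afterward.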

Remember, instrument enough but not too much. Fast code paths are usually bad candidates, but any place where the I/O cost is much higher than the cost of instrumentation can be a good one. For low-latency, high-throughput serverless applications, you may consider instrumenting only a fraction of the calls, often referred to as sampling.
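Sampling can be sketched with a small decorator that only measures a configurable fraction of calls. The `sampled` name, its parameters, and the `record` callback are hypothetical; real tracing libraries usually offer their own sampling settings.

```python
import random
import time


def sampled(rate, record, rng=random.random):
    """Instrument roughly `rate` (0.0 to 1.0) of the calls to a function."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            if rng() >= rate:
                # Not sampled: call through with no measurement overhead.
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                record(func.__name__, time.perf_counter() - start)
        return wrapper
    return decorator
```

With `rate=0.01`, about one call in a hundred pays the cost of instrumentation, which is often enough to spot trends on a high-throughput path.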

There are no strict rules here, only guidelines. As with everything, use your judgment, and evaluate the costs based on your needs.

Manual vs. Automatic Instrumentation

There are two ways to go about adding instrumentation to your serverless applications: manually or automatically. And just as instrumentation affects the performance of running code, how your team implements instrumentation can affect performance as well. 

Manual Instrumentation

One way to instrument code is to manually write additional code to perform the measurements along with the rest of the application code. However, instrumenting snippets of code manually can be a tedious task. And as projects and teams grow in size, it becomes harder to standardize what to instrument and how to do so. Additionally, anything manual and repetitive is prone to error: either forgetting it in some places or doing it incorrectly.

Fortunately, not all instrumentation needs to be done by hand, as a lot of it can usually be automated.

Automatic Instrumentation

Most monitoring and troubleshooting services provide specialized, automated ways to instrument code for specific languages, runtimes, and frameworks. At Epsagon, we specialize in monitoring and helping to troubleshoot cloud applications running on Amazon ECS and AWS Lambda. We also support a variety of languages and frameworks out of the box.

Automatic instrumentation is commonly implemented by adding middleware that wraps certain significant pieces of code with instrumentation logic. A typical example is a middleware around an HTTP request that measures the time spent to produce a response as well as the information on both the request and response, such as status code and payloads.
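To make the middleware idea concrete, here is a minimal sketch of a WSGI middleware in Python that records the method, path, status, and time spent for each request. `TimingMiddleware`, the `record` callback, and the tiny `app` are assumptions for illustration. For simplicity, the timing covers only the call into the application, not the streaming of the response body.

```python
import time


class TimingMiddleware:
    """Wrap a WSGI app and record basic request/response measurements."""

    def __init__(self, app, record):
        self.app = app
        self.record = record

    def __call__(self, environ, start_response):
        start = time.perf_counter()
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status  # e.g. "200 OK"
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            self.record({
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "status": captured.get("status"),
                "duration_s": time.perf_counter() - start,
            })


def app(environ, start_response):
    """Hypothetical application being instrumented."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


measurements = []
instrumented_app = TimingMiddleware(app, measurements.append)
```

The application code itself is untouched; the measurement logic lives entirely in the wrapper, which is what makes this approach easy to apply automatically.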

Some frameworks and libraries may also provide hooks that can allow for automatic instrumentation, such as hooking into the database operations of an ORM library to measure those operations and later find out which queries are slow.

Support for automatic instrumentation in different programming languages can vary. Dynamic, interpreted languages often leverage techniques such as monkey patching to add automatic instrumentation, while bytecode-compiled languages like Java allow for the runtime modification of bytecode to achieve the same effect.
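As a rough illustration of monkey patching in Python, the sketch below replaces a function on a module with a timed wrapper at runtime and returns a way to undo the patch. `patch_with_timing` is a hypothetical helper, and `json.dumps` is used only as a convenient standard-library target.

```python
import functools
import json
import time


def patch_with_timing(module, name, record):
    """Replace `module.<name>` with a timed wrapper; return an undo function."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            record(name, time.perf_counter() - start)

    setattr(module, name, wrapper)
    return lambda: setattr(module, name, original)


timings = []
unpatch = patch_with_timing(json, "dumps", lambda n, d: timings.append((n, d)))
json.dumps({"a": 1})  # unchanged call site, now measured
unpatch()             # restore the original function
```

One caveat of this technique is that code that imported the function directly (for example with `from json import dumps`) before the patch was applied keeps its reference to the original and is not measured.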

More recently, with the introduction of service meshes such as Istio and Linkerd, every call between microservices, as well as calls to external services, can produce telemetry automatically out of the box.

It is recommended to use automatic instrumentation whenever possible. It often provides much better coverage with sane defaults than doing it yourself and requires little to no time to implement and maintain. Leave manual instrumentation for the rare cases – if any – not covered by automatic instrumentation.

Instrumenting Microservices

Automatic instrumentation provides excellent data for monitoring and troubleshooting even at the highest possible level. And due to their smaller size and lower complexity, microservices shine here compared to monolithic applications.

Where it perhaps gets a bit more complicated is when you are trying to piece several instrumented metrics together. In monolithic applications, a stack trace usually provides a lot of the information required to understand how an application ended up where it did.

But by simply instrumenting microservices in isolation, it can be hard to know how an entire system works together. For that we have distributed tracing, which we will discuss below. It’s also crucial to enforce consistency when instrumenting each service in terms of what metrics are captured.

Distributed Tracing

Traditionally, tracing involved recording the program’s execution path to aid debugging. Tools like DTrace have existed for a while and focus on tracing individual processes and low-level interaction with the operating system. They also trace resources such as filesystems and the network.

With microservices, however, a new concept emerged: distributed tracing. Distributed tracing allows for the execution path of a request to be followed throughout all involved services. When combined with a visualization tool, it enables monitoring, profiling, and debugging of an entire microservice architecture.
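The core mechanism can be sketched in a few lines of Python: each service opens a span, and a shared trace ID travels with the request so that spans from different services can be stitched together later. Everything here is a simplified assumption for illustration, including the `X-Trace-Id` and `X-Span-Id` header names, the span fields, and the two toy services, which call each other directly instead of over the network.

```python
import time
import uuid

SPANS = []  # stand-in for a tracing backend


def start_span(name, headers):
    """Open a span, continuing the trace found in `headers` if present."""
    return {
        "name": name,
        "trace_id": headers.get("X-Trace-Id") or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": headers.get("X-Span-Id"),
        "start": time.time(),
    }


def finish_span(span):
    span["duration_s"] = time.time() - span["start"]
    SPANS.append(span)


def outgoing_headers(span):
    """Headers to attach to a downstream call made within `span`."""
    return {"X-Trace-Id": span["trace_id"], "X-Span-Id": span["span_id"]}


def inventory_service(headers):
    span = start_span("inventory.check", headers)
    finish_span(span)
    return {"in_stock": True}


def checkout_service(headers):
    span = start_span("checkout", headers)
    # Stands in for an HTTP call to another microservice.
    result = inventory_service(outgoing_headers(span))
    finish_span(span)
    return result


checkout_service({})
```

Both recorded spans share one trace ID, and the inventory span points at the checkout span as its parent, which is the information a visualization tool needs to draw the request's path.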

Instrumentation, telemetry, and distributed tracing go hand in hand in today’s world, which is why several frameworks have been created. The two most prominent open-source projects in this space are OpenTracing and OpenCensus. While developed in isolation, these are now merging into a new Cloud Native Computing Foundation (CNCF) project named OpenTelemetry.

If you wish to learn more about distributed tracing, be sure to check out our Introduction to Distributed Tracing blog post.

Summary

In this blog post, we looked at why instrumentation is so essential and learned about some of its basics: what it is, how to do it, and related concepts.

We examined the costs and impact of instrumentation in terms of system performance, and also looked at how manual vs. automatic instrumentation affects development time.

Lastly, we had a short introduction to distributed tracing and the most common frameworks that exist around it. You should now have a more precise understanding of instrumentation and how crucial it is. If you want to automatically instrument and monitor your applications with distributed tracing in just five minutes, we hope you give Epsagon a try for free.