In this post, I’m covering five approaches for gaining observability in serverless (FaaS based) systems. I hope that it can help developers, system architects, and DevOps engineers make the serverless journey successful, less frustrating, and of course – enjoyable.
Serverless and Function-as-a-Service (FaaS) have been around for about three years now. It started with AWS Lambda and quickly expanded into similar services provided by the leading cloud vendors – Microsoft, Google, IBM, and others. However, as people started using it, serverless became much more than just FaaS. Serverless systems today involve functions, containers, managed services provided by the cloud vendor (e.g., message queues, DBs, storage), and a vast variety of SaaS APIs (e.g., Auth0, Twilio, Stripe), all interacting with one another.
Because these functions have limited running time, low memory, and no state, developers are encouraged to rely more and more on these managed services. As the number of functions and APIs increases, so does the complexity of serverless applications. Modern applications are highly distributed (“nano-services”) and event-driven. On top of that, the lack of access to any server and the stateless nature of everything don’t make things easier. Observability is a central challenge when trying to resolve issues quickly and prevent system downtime, as well as when trying to understand performance and cost implications.
To overcome these challenges, developers and DevOps engineers have several options today for gaining observability into their systems. Each approach has pros and cons, and naturally – its sweet spots. Since AWS is the most popular platform for serverless today, many of the examples presented here are from AWS Lambda, but they can be applied to other cloud vendors.
1. “This is easy” – using the default cloud vendor console
The cloud vendor’s console is the first and most direct approach to solving issues in your serverless system. AWS equips developers with CloudWatch, “a monitoring service for AWS cloud resources and the applications you run on AWS.” Anyone using AWS Lambda is well familiar with the CloudWatch console. There, you can find a log group for each of your functions, and each group contains all the logs of that function. The logs refresh asynchronously, and it can take anywhere from seconds to minutes for them to update.
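Since everything a Lambda function writes to standard output ends up in its log group, the shape of each log line matters. Here is a minimal sketch of structured logging in Python, assuming a hypothetical order-processing function; the helper name `log_event` and the field names are illustrative, not part of any AWS API. Only `context.aws_request_id` comes from the Lambda runtime.

```python
import json
import time


def log_event(message, request_id, **fields):
    """Print one JSON log line and return it (hypothetical helper)."""
    record = {
        "timestamp_ms": int(time.time() * 1000),
        "request_id": request_id,
        "message": message,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    # In Lambda, anything printed to stdout is delivered to the function's log group.
    print(line)
    return line


def handler(event, context):
    # aws_request_id is set by the Lambda runtime; fall back when running locally.
    request_id = getattr(context, "aws_request_id", "unknown")
    log_event("order received", request_id, order_id=event.get("order_id"))
    return {"status": "ok", "request_id": request_id}
```

With one JSON object per line, every entry carries the request ID, so the lines of a single invocation can be filtered out of a busy log group instead of being read top to bottom.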
Since most developers feel comfortable writing logs, it’s pretty straightforward to use. Just run your function, and take a look at the log. However, when things become a bit more complicated, i.e., in a distributed system with multiple functions, queues, triggers, and more, it can be quite challenging to understand the whole picture and connect all the different log items. Imagine a chain of 5 Lambda functions with triggers in between – not to mention the times you don’t remember to log everything. Azure and Google have similar solutions for logging from their FaaS services. The default console is great to start with, and indeed cannot be ignored. However, when crossing the 5 or 10 functions range, most teams quickly discover challenges in understanding their systems, and some even refer to it as “log hell.” Let’s explore some alternatives.
2. “Let’s go somewhere else!” – log streaming to an external service
When your serverless architecture evolves, and you’re going through logs, more logs, and some more logs, something suddenly doesn’t feel right. It shouldn’t be this way – should it? You’re spending all your time going through endless lists of lines, written from functions which run millions of times every day. It’s not practical, and it just doesn’t work anymore. Even if you found the correct error in the log, going backward and doing root-cause-analysis is nearly impossible.
What has been going on in all the message queues? Perhaps an external API is slowing your system down and causing your cloud bill to spike? There must be a better solution – right?
You’ve probably heard about log aggregation platforms, such as Splunk, Loggly, and others. Some of them are great – they even automatically detect anomalies for you! In that case, why not stream all of your logs into such a service, and then quickly and easily search and filter everything there? No more manual log scrubbing? Hurray!
Yan Cui describes some of these methods in one of his many excellent blog posts. It’s relatively easy to stream logs into any existing log aggregation service. So, now, you can quickly do all the queries and searches that you’ve always wanted to do there. When a function fails, you can find the corresponding log, and search for the logs of other functions as well.
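One common way to do this on AWS is to subscribe a "shipper" Lambda function to a log group. CloudWatch Logs then invokes it with a base64-encoded, gzip-compressed JSON payload. The sketch below only decodes that payload and flattens it into records; the actual forwarding step depends on the aggregation service you picked, so it is left as a comment with a hypothetical `ship_to_aggregator` name.

```python
import base64
import gzip
import json


def decode_cloudwatch_logs(event):
    """Decode the payload CloudWatch Logs delivers to a subscribed Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_records(payload):
    """Flatten the payload into one dict per log line."""
    return [
        {
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        }
        for log_event in payload["logEvents"]
    ]


def handler(event, context):
    payload = decode_cloudwatch_logs(event)
    # Control messages only verify the subscription; they carry no log lines to ship.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    records = to_records(payload)
    # ship_to_aggregator(records)  # hypothetical: send to your log service here
    return records
```

Keeping the log group and stream on every record means you can still tell which function a line came from once everything is mixed together in the external service.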
However – is this really what you wanted? You still have to log everything manually, and you still don’t get to the bottom of the asynchronous nature of the system. Where are all the triggers and events? How are they connected? Also, these log aggregation tools do not come cheap.
It’s time to think about a more appropriate solution – a distributed tracing solution.
3. “I want more” – advanced observability layer provided by the cloud vendor
Luckily, AWS and the other cloud vendors do not come empty-handed (I’m going to focus on AWS in this part). AWS proposes X-Ray, which “makes it easy for developers to analyze the behavior of their production, distributed applications with end-to-end tracing capabilities.” AWS X-Ray is a great tool which allows you to trace and instrument your code to gain extra visibility into what’s going on. It also enables you to profile different parts of your code and identify slow spots, or slow AWS APIs, such as DynamoDB, SQS, and others.
We’ve given X-Ray a run in several distributed serverless applications and found that it can indeed provide much value when it comes to understanding what slows down your Lambda. The automatic integration and the instrumentation of the Lambda function can help in spotting issues, triggers, and slow APIs.
However, X-Ray does not connect asynchronous events yet, such as a Lambda publishing a message into SNS, which triggers another Lambda. Therefore, it can be challenging to troubleshoot complex issues with X-Ray. Also, much of X-Ray’s cool stuff doesn’t happen automatically but requires the developer to insert traces manually. Finally, serverless architectures are dynamic, always evolving, and connect with multiple external 3rd party services. This is a perfect opportunity for companies to leverage advanced AI to anticipate any issues that could occur in such a dynamic environment.
If built-in tools like X-Ray are not enough, you will probably consider the next step.
4. “Leave me alone” – implement a tracing solution yourself
Sometimes, the existing tools are just not enough. You have a complex, distributed system, with tens or hundreds of functions. At the end of the day, you want to be able to answer questions such as: “is everything working properly?”, “what slows down my system?”, and also, when something goes wrong – “why did it break?” and “what’s the fastest way to fix it?”.
Since you’re a talented developer, and you didn’t manage to find a satisfying solution, you decide to implement one yourself. This fascinating blog post by Yan Cui suggests ways of capturing and forwarding correlation IDs between functions, which can be an inspiration for such tracing technology.
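To make the idea concrete, here is a minimal sketch of correlation ID propagation between two functions. It is not the implementation from that blog post, just an illustration under simple assumptions: the header name `x-correlation-id`, the message envelope, and both handler names are hypothetical, and the queue or topic between the functions is represented by a plain string.

```python
import json
import uuid

CORRELATION_HEADER = "x-correlation-id"  # hypothetical header name


def extract_correlation_id(event):
    """Reuse the caller's correlation ID, or start a new trace."""
    headers = event.get("headers") or {}
    return headers.get(CORRELATION_HEADER) or str(uuid.uuid4())


def build_message(body, correlation_id):
    """Wrap the payload so the ID travels with it through a queue or topic."""
    return json.dumps({"correlation_id": correlation_id, "body": body})


def first_handler(event, context=None):
    correlation_id = extract_correlation_id(event)
    print(json.dumps({"function": "first", "correlation_id": correlation_id}))
    # In a real system this message would be published to SNS, SQS, etc.
    return build_message({"order_id": event.get("order_id")}, correlation_id)


def second_handler(message, context=None):
    envelope = json.loads(message)
    correlation_id = envelope["correlation_id"]
    # Logging the same ID lets you join both functions' log lines later.
    print(json.dumps({"function": "second", "correlation_id": correlation_id}))
    return correlation_id
```

Because both functions log the same ID, searching the logs for that one value returns the whole chain, even though the second function was triggered asynchronously.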
Using similar methods you might be able to trace an asynchronous event through your system! Beautiful, isn’t it? Now you need to take this nice tracing tech, expand it a little to all the different cloud vendor services and the external APIs that you’re using, implement a back-end to analyze all the events, have a nice UI to show it, add some alerts, of course, so you’ll know when something breaks…
If you’ve reached this far, think twice. You are about to enter a fascinating world of distributed tracing, and unless this is the business of your company (for some of us, this is the case), you are going to spend A LOT of time on this – and it’s probably never going to be enough. The one thing you should think about is – how do you want to spend your time? Utilizing serverless to its fullest and expanding your business, or holding all the Lambda functions very tight all day long to make sure nothing collapses?
There must be something out there!
5. “Help me, Google!” – search for a dedicated solution
You’ve come this far. You’re asking for help. Ultimately, you don’t want to develop an observability solution for serverless yourself, just like you didn’t want to implement AWS Lambda yourself. Utilize managed services – this is the mantra that you chose to follow.
Today, there are several products which can help you understand what’s going on in your serverless system. Most of them focus solely on AWS, as it still holds a significant share of the market, although Microsoft, Google, and IBM are quickly progressing toward it. The core question that you need to ask yourself when choosing such a solution is, “will this product solve the most difficult challenges in my distributed system?”
Again – since modern, microservices-based systems are inherently distributed, it can be tough to solve issues quickly when just observing every service, or every function, separately. The variety of triggers, events, services, and APIs that you’re using can make a small problem huge. These asynchronous events are sneaky!
When choosing an observability solution for your serverless system, I would recommend taking the following into consideration:
- Is this solution going to help me pinpoint the issues in my system quickly?
- Can I get an overview of everything that’s going on, not just my code? (APIs, managed services, even things like AWS Step Functions)
- Is it enough for me, or will I have to go back to the cloud vendor’s console to get more information?
At Epsagon, we love distributed tracing. This is why we’ve researched and developed visualization and end-to-end observability technology, dedicated to modern, serverless systems.
There are many great solutions out there, so make sure to give them a shot before going back to the old “log and search” approach. Serverless is a new software paradigm, and it requires new approaches for observability and monitoring. The world is moving fast – don’t let it get away!