In this webinar you will learn:
- What are the key observability considerations when operating a Serverless application?
- How can distributed tracing and visualization help you troubleshoot complex issues?
- What are the cost risks and how can you address them?
- Why is automation critical when working in Serverless?
- What is Epsagon and how can it help?
The webinar is hosted by Ran, co-founder and CTO at Epsagon, and Gal, a senior engineer at Epsagon.
Below is the transcription of the webinar. Enjoy!
Ran: Hi, everyone, and welcome to the Epsagon webinar: Serverless Monitoring in Practice. In the following webinar, we’re going to discuss best practices for monitoring and troubleshooting Serverless applications. I’m going to guide you through it. I am Ran, co-founder and CTO at Epsagon, and together with me is Gal.
Gal: Hi, I am Gal. I am a senior engineer at Epsagon and I’ve been with the team for the past year.
Ran: Perfect, let’s get started. Let’s start with the meaning of Serverless. What is Serverless? We are talking about Serverless in the broader sense. It starts with the compute part, function-as-a-service such as AWS Lambda, but it includes other resources as well. We want to focus on the fact that Serverless is a much broader term that also covers our cloud resources: message queues, storage, databases, and also the REST APIs that we’re utilizing as part of our application.
The Challenges in Serverless
We need to make sure that when we go Serverless, the main thing we want to accomplish is to not manage our infrastructure and to focus more on our business logic: focus on what we do best. In function-as-a-service applications, new limitations are introduced. First, the runtime: we need to understand how long our functions will run. The same goes for memory: we need to understand how much memory our functions will consume.
The next thing is statelessness: our functions are ephemeral, which means that we need to interact with more resources to save our state. There are also performance concerns and cold starts. Applications are no longer just a single compute unit; they are composed of lots and lots of resources and APIs, such as S3, DynamoDB, and API Gateway, plus external resources, for example MongoDB, Twilio, Stripe, and so on. All of these are part of our application, which turns it into a highly distributed, event-driven system. So, it’s really hard to understand what’s going on in such a complex system.
An Example of HSBC (re:Invent 2018)
Let’s take the example of HSBC, from the last re:Invent in 2018. You can see an application that is composed very neatly of all its services. Each service includes its inbound and outbound communication and its own database, and all of the services are talking to each other. That makes it much more complex to troubleshoot and monitor: “Is everything okay?”; “Is everything talking properly to everything else?”. In general, in Serverless, the monitoring challenge is that you’ve got no access to the underlying infrastructure, so you can’t log in to the database or the compute unit anymore and start debugging in production to see what’s going on, because it’s ephemeral, as we know.
We also need to troubleshoot asynchronous transactions, which is much more complex, because one thing happens at one point in time and another thing happens at a different point in time, in a different environment, and correlating between them is very hard. We also get unpredictable costs, which we will dive into afterward; cost becomes much more of a problem than before. And, as we saw, we have new compute limitations that we didn’t have before.
So, let’s go one by one, starting with tracking system health. What does it mean to have a healthy system in Serverless?
Serverless Troubleshooting – One By One
Gal: The first question we have to ask ourselves is: what is our system? In order to track the health of our system, we have to understand what it is. The first answer that comes to mind is that it’s just the list of our functions: if we monitor all the functions in our AWS account, we will be able to understand the behavior of our system. But let’s have a look at this example. In it, we can see that the application is not made up of our functions alone. Our functions are an integral part of our system, but we have a lot of other resources: a DynamoDB table, an S3 bucket, an SNS topic. We have to monitor all of them too if we want our system to be resilient.
Ran: You know, on top of that, we sometimes see external resources here such as Auth0 and MongoDB. The crucial thing is to monitor them as well, because they are an integral part of the application.
Gal: Yeah, I think the thing is here is not just about monitoring your AWS infrastructure, it’s about monitoring your application.
Gal: So, maybe some of it is in Azure. You still want to be able to monitor it.
Another thing is that you can’t monitor just a specific function or a specific execution, because events in your system are made up of a bunch of executions put together. So, you want to be able to track an event that entered your system through all of the components it went through: follow the different AWS Lambdas it involves, follow the tables and S3 buckets, follow each and every one of them.
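Following an event through every component depends on propagating a shared correlation ID across asynchronous hops. Here is a minimal sketch in Python of how that propagation might look; the envelope format and the names `wrap_sns_message` and `unwrap_sns_message` are illustrative assumptions, not a real SDK:

```python
import json
import uuid

def new_trace_context():
    """Create a correlation context at the system's entry point."""
    return {"trace_id": str(uuid.uuid4())}

def wrap_sns_message(payload, trace_ctx):
    """Attach the trace context to an outgoing message (e.g. an SNS
    publish) so the downstream Lambda can report its spans under the
    same transaction."""
    return json.dumps({"trace": trace_ctx, "body": payload})

def unwrap_sns_message(raw):
    """Downstream consumer: recover both the payload and the context."""
    envelope = json.loads(raw)
    return envelope["body"], envelope["trace"]

# Producer side: one Lambda publishes a message.
ctx = new_trace_context()
msg = wrap_sns_message({"user": "alice", "action": "sign-up"}, ctx)

# Consumer side: the same trace_id ties both executions together.
body, downstream_ctx = unwrap_sns_message(msg)
assert downstream_ctx["trace_id"] == ctx["trace_id"]
```

A tracing backend can then stitch every span carrying the same `trace_id` into one transaction, regardless of which compute unit emitted it.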
Ran: So, you’re fundamentally saying that looking at each and every resource individually won’t give me the bigger picture, or won’t be able to trace through them, right?
Gal: Exactly. Now, it’s not a new problem, right? Distributed tracing has been around for as long as distributed systems exist, and the concept has always been the same. We have a distributed compute unit, whether it’s Lambda or just regular service, and you want to take traces from each and every one of them and piece them together into a bigger transaction.
Ran: Yeah, transaction equals the story, ultimately.
Gal: Yeah, the story of the event through your system: which actions it took. You can see such a transaction, for example, in Jaeger, an open-source distributed tracing engine. When you have a visualization of these transactions and can see all the different events and operations that took place on a timeline, you have a much better understanding of your system.
Ran: And performance. You can see how long each and every operation took across your distributed architecture.
Implementing Distributed Tracing
Gal: Exactly. Now, when we talk about implementing distributed tracing, you can always do it yourself. There are a lot of standards: there’s OpenTracing, there’s OpenCensus. You can just hop into your code and, before every API call you make, type “I am making this API call” and send it to a distributed tracing backend like Jaeger, and the problem is solved, right? You have distributed tracing, you have a good understanding of your system. But it’s a bit problematic: there is a lot of code to write.
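To make the do-it-yourself burden concrete, here is a minimal hand-rolled tracing sketch in Python. The in-memory `SPANS` list stands in for shipping spans to a backend like Jaeger; note that every single call site needs its own wrapper:

```python
import time
from contextlib import contextmanager

SPANS = []  # in a real setup these would be shipped to a backend like Jaeger

@contextmanager
def traced(operation):
    """Manually record one span around a single API call."""
    start = time.time()
    error = None
    try:
        yield
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        SPANS.append({
            "operation": operation,
            "duration_ms": (time.time() - start) * 1000,
            "error": error,
        })

# Every call site needs its own wrapper; forget one, and that API
# call silently disappears from your traces.
with traced("dynamodb.put_item"):
    pass  # table.put_item(...) would go here
with traced("sns.publish"):
    pass  # sns.publish(...) would go here

print([s["operation"] for s in SPANS])  # ['dynamodb.put_item', 'sns.publish']
```

Multiply this by every AWS call, external API, and function in the system, and the maintenance cost of manual instrumentation becomes clear.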
Ran: Yeah, a lot of maintenance and there is a high potential for errors.
Gal: Yeah, someone forgets to trace an API call that failed.
Ran: Exactly, and honestly, in Serverless, we’re talking about the development velocity and we don’t want to keep maintaining things all over the place. We want something automated.
Gal: Yeah, that’s the point of Serverless, right? Just focus on your business logic.
Cost Challenges in Serverless
Ran: Exactly. Now, the second thing that we would like to talk about is cost. We said that in Serverless, predicting cost is very hard, and it makes sense: we are paying by usage, for example the number of requests that we are making or the amount of compute time that we are consuming. To some extent, pay-per-use makes sense. But in real life it’s very different, because you don’t pay per use for your gym, or your shoes, or your meals; you pay upfront for what you are going to take, which is similar to VMs. In Serverless, you pay only for the amount that you are using, so it’s very hard to calculate. Let’s look at my favorite example: API Gateway, AWS Lambda functions, and DynamoDB.
You start with Lambda, which is priced in gigabyte-seconds. Okay, I can calculate that: how much memory I’m consuming, multiplied by the amount of time that I’m using it. API Gateway is priced by requests and data: the number of requests that I’m getting and the amount of data in the payload of each and every request. But then comes DynamoDB, which is priced by how much data I store.
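The Lambda part of the calculation can at least be written down. A rough estimator in Python, using illustrative rates (roughly the published $0.0000166667 per GB-second and $0.20 per million requests at the time of the webinar; check the current AWS pricing page before relying on these numbers):

```python
def lambda_cost(invocations, avg_duration_ms, memory_mb,
                price_per_gb_second=0.0000166667,
                price_per_million_requests=0.20):
    """Estimate a Lambda bill: memory (GB) x time (s) x rate,
    plus a flat per-request charge. Prices are illustrative."""
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    compute = gb_seconds * price_per_gb_second
    requests = (invocations / 1_000_000) * price_per_million_requests
    return compute + requests

# 10M invocations a month, 200 ms average duration, 512 MB of memory:
print(round(lambda_cost(10_000_000, 200, 512), 2))  # 18.67
```

Notice that duration appears directly in the formula, which is exactly why, as discussed next, performance equals money; the DynamoDB and API Gateway parts depend on traffic you cannot know in advance.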
Gal: Yeah, so how can you know how much data you’re going to get?
Ran: And the read capacity units and write capacity units, so the calculation is practically impossible. You can’t predict, or even calculate when you know exactly what’s going on, how much you’ll pay by the end of the month. Now, in Serverless, it’s important to understand that performance equals time, which equals money. Time equals money is not new to any of you, I guess, but in Serverless, performance equals time, and that means performance equals money: the cost that we are about to pay.
In Serverless specifically, we need to understand how much we will pay and where we spend our time. Where does our computation go? Is it more in our own code or in external API calls? Think about it: in Serverless, you have to make API calls because everything is ephemeral. You need to store something in the database, and so on.
Gal: That’s the philosophy, right? Anything that can be managed should be managed.
What is Epsagon?
Ran: So, let’s talk a bit about Epsagon. So, what does Epsagon do, Gal?
Gal: Basically, Epsagon is a monitoring company. Our entire goal is to help you understand and monitor your Serverless application. The first thing we do is automate the distributed tracing process, end-to-end, so you can visualize your entire application and understand every single transaction in it just by using Epsagon.
Ran: I think the main thing I would like to focus on is that Epsagon gives you the distributed tracing technique that we saw before, but in an automated way. So if, for example, a Lambda pushes a message to an SNS topic, which triggers another Lambda, I want to see it all in the same place, even though they are two different traces.
Ran: As we saw in OpenTracing, you don’t want to manually instrument each and every event and every invocation. You want it working out-of-the-box.
In Serverless, more than ever, you want to focus on the application level, the business level, because the application is your business. You want to monitor KPIs: how long it takes for a new user to sign up to the website, for a user to purchase something, and so on. These are the KPIs that I want to monitor, and Epsagon also provides visual debugging and cost tracking as part of its performance monitoring.
Partnership with AWS
Let’s talk a bit about our AWS partnership, because we are working closely with AWS. We are an Advanced Technology Partner, we have the DevOps Competency, and we’re also listed on the Marketplace and in the Solution Space. We were also partners in the recent AWS Lambda Layers announcement that took place at re:Invent. So, how do we approach the world of monitoring this whole composition?
Epsagon’s Approach to Serverless Monitoring
Gal: So basically, we see monitoring as a pyramid. Unlike the orthodox approach, where the pyramid was upside down, we think that monitoring the system as a whole is the most important thing, so the pyramid is flipped. Monitoring the functions is important.
Ran: Yeah, but it’s just the first step.
Gal: It won’t help you understand everything; it will only help you understand the most basic things about your application. The next thing you want to monitor is the transactions: those events that fly through the system end-to-end. Once you understand them, you can perform root-cause analysis, because if this Lambda has an error, I know every Lambda and every resource that is connected to this specific execution.
Ran: Also, on top of that, sometimes you want to monitor an aggregation of transactions, which are your business flows. As I said before, when a user registers to your website, signs up, or buys something, these are data flows: how data flows through your system. On top of that, we can provide KPIs. And the last layer is the architecture, because sometimes you just want to see at a glance: what does my architecture look like? What are the errors? What is the performance of each and every resource? Let’s dive into each one of them. Functions, as we said, are the first part of monitoring: each and every individual resource. Functions are great because, on each one of them, you can detect errors, and you can detect timeouts and out-of-memory errors, which are the issues specific to Serverless. You can also monitor the cost of each and every function, because if its performance goes bad, you will pay more for it.
Gal: As we said, transactions are preferred when looking for root causes, because understanding errors in a specific function is fine, but when a customer complains that something isn’t working, you want to look at the error and understand immediately what the root cause is. And the root cause doesn’t have to be in the same execution; this is where transactions come to the rescue.
Ran: Yeah, and on top of that, data flows, as we said, are great for monitoring business KPIs and SLAs: how long it takes for a user to sign up for something or to buy something in our system.
Gal: Not just monitoring but optimization because you want to understand what are your pain points and what are your bottlenecks in order to improve them.
Ran: Exactly, and for each and every data flow, it’s very interesting to understand: where is the bottleneck? Where can I improve the overall performance of my application? On top of that, as we said, architecture views are great because you want to track everything in one place and see all of your resources. And architectures are dynamic: what you sketched on your whiteboard six months ago doesn’t look the same six months later, or resources have changed, and you want something on top of that to make sure you stay on track.
Gal: I think the fun thing about the way we look at things is that you can either drill down: look at the architecture, see what is okay and what is wrong, look at a specific data flow, then look at the transactions in it, then look at the functions; or you can go bottom-up: I have a function error, let’s do a root-cause analysis, how many users does it affect, let’s look at the data flow. That’s the thing that I think is really cool about this whole approach.
Gal: I think there are several key signals that will give you a hint that you should look for a better monitoring tool. If you’re digging through endless logs trying to find your way through the log stream, saying, “All right, this function failed, what triggered it?”, and going to the next log stream and the next log stream, that’s a good sign that you need a tracing system.
Ran: Yeah, because you’ve got no context and you just can’t troubleshoot things out of the blue.
Gal: Also, the more moving pieces you have: the more resources you use, the more Lambdas you have, the more DynamoDB tables, the more S3 buckets. Basically, the more moving parts in your system, the more you’re going to need a solution like Epsagon.
Ran: This is actually where distributed tracing comes into play, to stitch all of this together in a single place.
Gal: Also, if there is no way for you to be alerted about issues that are specific to the Serverless environment, like a function timing out or running out of memory, and you are not sure whether there are timeouts or out-of-memory errors in your system today that you might not be aware of, this would be a good time to try out Epsagon.
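One common way such timeout alerts can be implemented is to flush a trace shortly before the function’s deadline, using the remaining-time API that the AWS Lambda context object exposes. This is an illustrative sketch, not any vendor’s actual implementation, and `FakeContext` is a stand-in so the example runs outside Lambda:

```python
import threading

def watch_for_timeout(context, flush, safety_margin_ms=500):
    """Schedule `flush` shortly before the function would time out,
    so the trace of the most interesting (failing) invocation still
    gets sent. `context` is the AWS Lambda context object, which
    exposes get_remaining_time_in_millis()."""
    delay_s = max(context.get_remaining_time_in_millis() - safety_margin_ms, 0) / 1000.0
    timer = threading.Timer(delay_s, flush)
    timer.daemon = True
    timer.start()
    return timer  # cancel it if the handler finishes normally

# Demo with a stand-in for the real Lambda context:
class FakeContext:
    def get_remaining_time_in_millis(self):
        return 600  # pretend 600 ms remain before the timeout

fired = []
t = watch_for_timeout(FakeContext(), lambda: fired.append("flushed"))
t.join()  # wait here only for the demo; a real handler would cancel it
print(fired)  # ['flushed']: the trace went out before the deadline
```

In a real handler you would call `watch_for_timeout` at the start of the invocation and cancel the timer on a normal return, so only near-timeout invocations trigger the early flush.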
Gal: The same goes if you want to track your cost. If you are at the start of a billing cycle and you are not sure what the cost is going to be at the end of the month, that is a good sign that a tool like Epsagon can help you figure it out.
Gal: Thank you, guys.
Hi, everyone, it’s Ran here. Thank you so much for joining the webinar. Let me address some of the questions that you raised during the webinar. I’ll start with the first one, regarding the performance impact on your functions. It’s true that sending traces from your function takes some time, and it obviously matters to us as well; we don’t want to affect your performance, definitely not in production. To achieve the most minimal impact, we managed to reduce it to between 5 and 10 milliseconds of additional time, depending on your function configuration. We send the traces at the end of the Lambda invocation, so we are not constantly sending traces during your function’s run time; we do it only at the end, and we try to make that as quick as possible, so about five milliseconds of additional time will be added.
Now, regarding the architecture: it’s true that it’s static now, but starting next week, you will be able to see more of a performance overview for each and every resource. You’ll be able to see the number of errors, the number of invocations or triggers, and so on, and the average performance for each and every resource, which is important because then you can understand your architecture in production and see where you have bottlenecks or issues that you need to solve.
I’m trying to see any other questions over here. Another great question, regarding X-Ray. X-Ray is definitely a good tool: you’ve got CloudWatch for starting out and X-Ray for more advanced things, but first of all, you should know that X-Ray is only for performance monitoring. It only tells you about the performance of one resource against other resources. It also has some limitations. First, it doesn’t stitch distributed traces into one transaction, so you won’t find a Lambda publishing to an SNS topic which triggers another Lambda under the same transaction. Second, it’s mostly used as a debugging tool for production: you turn it on when you need to understand a performance issue, so it’s not a performance monitoring product that makes sure your production is live and working properly, with all the benefits of cost monitoring and so on. Finally, it focuses only on AWS; if you’re using external resources which do not run on AWS, like your own Redis, your MongoDB, Auth0, Twilio, Stripe, and all the rest that you might use, you want to see them there too. That, you’ll only see in Epsagon.
Let me see if there are any other questions. Does Epsagon support sampling to help control tracing cost? At the moment, we don’t support sampling. The main problem with sampling is that, when you are doing distributed tracing and you sample, let’s say one part of a transaction had an error and we sampled out the other part of that transaction: you won’t be able to go back and see it, because we dropped part of your transaction, so it’s not reversible; I can’t get it back afterward. We don’t believe in sampling. We believe in getting all the data and then letting you filter it, and even letting you label it so you will be able to filter specific events or specific metrics afterward. So, at the moment, there is no sampling; if there is a specific use case where you need it, let us know, and we will work our way toward it in the future.
Looking at the next question over here: what happens if the Lambda runs out of memory or times out? Can you still send the trace to your API? Actually, we are just about to add that. If your function experiences a timeout or an out-of-memory error, we will have some triggers before that happens, and we will send the transaction for you. These are actually the most interesting transactions to catch, because a big issue happened there. In such cases, we will be able to detect them in advance: for example, just before the time runs out, or before your memory is about to exceed its limit, we will send the transaction. You will be able to see the invocation and the transaction in Epsagon and understand, for example, if it’s a timeout, where you spent your time in this specific transaction, so you will know what you need to improve. That’s a great question; starting next week, you will be able to see such issues as well.
Do you have any more questions, guys? Okay, another question I see is: how long does it take to set up? We see setup as two parts. One is the CloudFormation stack, which includes the read-only permissions and takes between 10 seconds and 1 minute, depending on how quick you are; it’s really easy. Regarding instrumenting the functions, it really depends; it can take a few seconds per function. If you are using a deployment tool, it can be even easier. For example, we’ve got a plugin for the Serverless Framework, so you can instrument multiple functions at the same time in a matter of seconds. In general, I would say it takes no more than 5 minutes to get up and running with Epsagon and start seeing real value: detecting timeouts, understanding cost, seeing transactions, the architecture, data flows, and so on.
I am going to Stany’s question now, regarding whether, if a Lambda updates DynamoDB and the stream triggers another Lambda, Epsagon will detect that. If you want, I can show you afterward; there is actually a demo transaction for it at demo.epsagon.com. The thing is that we want to trace everything in any asynchronous, event-driven application. Even when the Lambda puts an item into a DynamoDB table, which streams to another Lambda, we definitely track it down and you will be able to see it, so it’s really great.
Getting back to Shelley: why do we need both setups? It’s not mandatory to have both for a specific use case, but together they give you the complete picture. I’ll tell you the main benefit of each of them. With the CloudFormation template, we can easily gather data from all the functions in your AWS account, even if you’ve got multiple AWS accounts, and you can see a nice list of triggers, invocations, errors, timeouts, and so on, which is great for starting out. It takes 10 seconds to set up, and you immediately get the value. Now, if you want to troubleshoot more deeply, there is tracing for that, and you don’t have to set the Epsagon library on every function: it’s not really interesting to trace a function that all it does is ping something, but it’s really interesting to put the Epsagon library on a function that contacts DynamoDB and triggers another Lambda, so it’s really interesting to trace that kind of Lambda function. We do recommend having both, because it gives you the complete picture and it’s pretty easy to get up and running with. If you have a specific issue and can’t do one of them, you can contact us and we will customize the setup for you, but in general, we recommend both.
Another question regarding VPCs: half of your functions are behind a VPC and half are not; any thoughts on implementing a NAT Gateway, or are there any upcoming changes to enhance Serverless in the VPC? Another great benefit of having the CloudFormation template is that you get value even if your functions are behind a VPC without a NAT Gateway, because we get the data from the CloudWatch logs. You can get started with Epsagon just by deploying the CloudFormation template, and you’ll see all your functions regardless of whether they are in a VPC or not. If you want to troubleshoot them with the Epsagon library and get the more insightful side of Epsagon, you will need to whitelist the few IPs that Epsagon traces are sent to, which means you will need a NAT Gateway for it. At the moment, we don’t have an alternative, but we can look at your specific requirements and maybe solve it in a different way.
I’m going to the last question over here: if I understand correctly, the CloudWatch log monitoring is for errors/duration/memory and the wrapper is just for tracing? That’s correct. The CloudWatch monitoring gives you the functions screen, including errors, timeouts, duration, performance, memory, and triggers across multiple functions, including cost. The wrapper gives you the tracing, the architecture, and the data flow analysis; it gives you a broader picture and helps you understand what actually happens in your Lambda, because sometimes just seeing a function with an invocation of a certain duration doesn’t tell you a lot, unless you really log everything and track and monitor that. Seeing things in context, rather than wading through a lot of logs, makes more sense when you are troubleshooting and analyzing performance.
I’m checking if there are any new messages. Shelly, thank you very much for joining. I’m scrolling back; if there is any question that I haven’t answered, please bump it again, as I might have missed it.
By the way, just so you know, we are planning to host many more webinars. If there are any specific issues you would like us to cover, or a specific use case, for example monitoring DynamoDB with streams, or a specific language, and so on, please let us know. We are really open to your feedback and to focusing on the things that matter to you. Feel free to suggest topics here, on our website, or by email; we are definitely open to that.
Okay, everyone, thank you very much for joining. It’s been a pleasure. We really love observability, Serverless, and monitoring. Looking forward to hosting you at our next webinar. Thank you all. Bye bye!