Breaking into a new technology is always hard, especially for a paradigm shift as drastic as serverless. Even as a longtime user of AWS Lambda (relatively speaking, since Lambda was only released at re:Invent 2014), I still find myself learning something new about it all the time.
As a beginner to serverless, you must have all sorts of questions about how to adapt existing practices to this new paradigm. You may be wondering if these practices are even relevant anymore. In this post, we will take a whirlwind tour through many of these considerations and help you get off on the right foot as you start this exciting journey into the world of serverless!
Picking the Programming Language
As discussed in this post, when developing API endpoints, you should avoid .NET Core (C#, F#) and JVM languages (Java, Scala, Kotlin, …). This is because there are significant cold start penalties with these runtimes and you have to run them on higher memory settings to compensate, which makes the cold start performance even worse when running inside a VPC.
Selecting the Memory/CPU
You need to give your function enough memory to do its job. Otherwise, you will experience out-of-memory exceptions at runtime. JVM and .NET Core applications have higher memory footprints and should, therefore, run on higher memory settings than Node.js, Python or Go functions.
Similarly, a function that ingests large amounts of data in memory for processing (such as Machine Learning algorithms) would also need higher memory allocations.
Beyond the amount of memory your function needs to operate, you might decide to allocate even more memory to the function to boost its performance. This is because CPU and network resources are allocated accordingly. More memory equals more CPU and higher network bandwidth, but it will also cost you more to run this function per 100ms.
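To make the trade-off concrete, here is a minimal sketch of the cost arithmetic. The price per GB-second is an assumption on my part (check the current Lambda pricing page), the measurements in the example are hypothetical, and the per-request fee is left out.

```python
import math

# Assumed price per GB-second; verify against current AWS Lambda pricing.
PRICE_PER_GB_SECOND = 0.00001667
# Billing granularity as described in this article.
BILLING_INCREMENT_MS = 100


def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Estimate the compute cost of one invocation (per-request fee excluded)."""
    # Duration is rounded up to the next billing increment.
    billed_ms = math.ceil(duration_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND


if __name__ == "__main__":
    # Hypothetical measurements of the same function at three memory settings.
    for memory_mb, duration_ms in [(512, 400), (1024, 200), (2048, 130)]:
        cost = invocation_cost(memory_mb, duration_ms)
        print(f"{memory_mb} MB for {duration_ms} ms costs ${cost:.10f}")
```

If doubling the memory halves the duration, the cost per invocation stays the same and you get the faster response for free. Once the speed-up flattens out, as in the last hypothetical measurement, extra memory is just extra cost.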
Choosing the Timeout
As discussed in this post, you should use short timeouts of around 3-6 seconds for API functions. If your function needs to perform long-running tasks, then consider adopting the decoupled invocation pattern instead of keeping the caller waiting for a slow response.
For functions that process events in batches, such as those working with Kinesis or SQS, you should adjust the timeout according to the configured batch size.
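One rough way to derive a timeout from the batch size is sketched below. The helper name, the safety factor and the per-record estimate are all hypothetical, and the 900-second ceiling is my assumption about the current maximum Lambda timeout, so treat this as a starting point rather than a rule.

```python
import math

# Assumed maximum Lambda timeout in seconds; verify against the current limits.
LAMBDA_MAX_TIMEOUT_SECONDS = 900


def batch_timeout_seconds(batch_size: int, per_record_ms: float,
                          safety_factor: float = 2.0) -> int:
    """Suggest a timeout that covers a full batch with some headroom."""
    if batch_size <= 0 or per_record_ms <= 0:
        raise ValueError("batch_size and per_record_ms must be positive")
    estimated_ms = batch_size * per_record_ms * safety_factor
    # Round up to whole seconds, stay within 1 second and the assumed maximum.
    return min(LAMBDA_MAX_TIMEOUT_SECONDS, max(1, math.ceil(estimated_ms / 1000)))


if __name__ == "__main__":
    # Hypothetical: 100 records per batch, about 50 ms of work per record.
    print(batch_timeout_seconds(100, 50))
```

If the suggested value hits the ceiling, that is usually a hint to reduce the batch size rather than to push the timeout further.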
Finally, if you need a long timeout value because of complicated retry scenarios, then consider moving the orchestration into Step Functions instead.
As is symptomatic of any new and hyped technology, there seems to be a new deployment framework for serverless every week. The two most popular deployment frameworks at the moment are the Serverless framework and the Serverless Application Model (SAM) framework by AWS.
Both are based around CloudFormation, but personally, I prefer the Serverless framework because of its extensibility. Its plugin system allows me to alter the framework behavior and to extend it to support new and interesting use cases. Not to mention the fact that there is already a very active community that has created many useful plugins you can use right away.
Check out Nitzan Shapira’s excellent roundup of the most popular deployment frameworks out there.
To test your function locally, use a test runner such as Mocha or Jest to invoke the function handler with a stubbed invocation event and context.
To test success paths, I recommend using the real downstream systems—DynamoDB or any other services the function depends on—instead of mocks. Not using mocks lets you test the interaction with external services that are as close to the real thing as possible, and helps you catch bugs that are often hidden by mocks, e.g., syntax errors in DynamoDB queries.
Use mocks or stubs for testing failure cases only, where you need more control to simulate error scenarios.
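The same idea carries over to any runtime. Below is a minimal sketch in Python, using only the standard library, of invoking a handler with a stubbed event and context and using a mock for a failure case. The handler, the event shape and the extra `table` parameter are hypothetical; in a real function a thin wrapper would bind the real DynamoDB table, and the success-path tests would pass that real table in.

```python
import types
from unittest import mock


def handler(event, context, table):
    """Hypothetical API handler: returns the user named by the path parameter."""
    user_id = event["pathParameters"]["id"]
    try:
        item = table.get_item(Key={"id": user_id}).get("Item")
    except Exception:
        # Log the failure and hide downstream details from the caller.
        print(f"lookup failed, request_id={context.aws_request_id}")
        return {"statusCode": 500, "body": "internal error"}
    if item is None:
        return {"statusCode": 404, "body": "not found"}
    return {"statusCode": 200, "body": item["name"]}


def make_event(user_id):
    """Stubbed invocation event carrying only the fields the handler reads."""
    return {"pathParameters": {"id": user_id}}


def make_context():
    """Stubbed context object exposing the request ID the handler logs."""
    return types.SimpleNamespace(aws_request_id="test-request-id")


def test_returns_500_when_table_lookup_fails():
    # Failure case: a mock lets us force an error that is hard to trigger for real.
    table = mock.Mock()
    table.get_item.side_effect = RuntimeError("simulated outage")
    response = handler(make_event("42"), make_context(), table)
    assert response["statusCode"] == 500
    table.get_item.assert_called_once_with(Key={"id": "42"})


if __name__ == "__main__":
    test_returns_500_when_table_lookup_fails()
    print("failure-path test passed")
```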
Both SAM and the Serverless framework offer the ability to invoke functions locally. You can attach your debugger to the corresponding CLI command to debug functions locally as well.
SAM goes a step further: SAM local can host a local version of your API endpoints. This makes it possible for you to hit the endpoints in a browser, which is very useful when you’re doing server-side rendering.
You can continue to use CI tools such as Jenkins, CircleCI or Drone to continuously test and deploy your Lambda functions, while allowing deployment frameworks such as Serverless or SAM to do the hard work for you when it comes to packaging and deploying your functions. Both frameworks use CloudFormation to do the actual deployment and let you declare additional resources that your functions need, such as DynamoDB tables or S3 buckets.
Whenever you write to stdout from a Lambda function, the message is captured by the Lambda service asynchronously and sent to CloudWatch Logs. However, you can’t easily search across the logs for multiple functions in CloudWatch Logs. This is why, in practice, most people forward their logs from CloudWatch Logs to a log aggregation service such as Elasticsearch, Logz.io or Loggly.
Also, don’t forget to write your logs in a structured format in JSON! This allows you to record additional context information with each log message, such as correlation IDs, and make it easier for you to search and find relevant logs when you really need them.
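As a minimal sketch of what structured logging can look like, the helper below writes one JSON object per line to stdout. The function name and the field names (`correlation_id`, `order_id`) are illustrative assumptions, not a prescribed schema.

```python
import json
import sys
import time


def log(level: str, message: str, **context) -> str:
    """Write one JSON log line to stdout and return it."""
    entry = {"timestamp": time.time(), "level": level, "message": message, **context}
    # default=str keeps the logger from failing on values JSON cannot encode.
    line = json.dumps(entry, default=str)
    sys.stdout.write(line + "\n")
    return line


if __name__ == "__main__":
    # Hypothetical context fields attached to a single log message.
    log("INFO", "order created", correlation_id="abc-123", order_id=42)
```

Because every field is a key in the JSON document, a log aggregation service can filter on `correlation_id` directly instead of relying on free-text search.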
Out of the box, you get some basic telemetry about your function’s health: invocation count, invocation duration, error count, throttle count and so on.
If you’re building APIs, then you should enable detailed metrics for the deployed API Gateway stage so that you receive latency and error count metrics for each endpoint. You should keep an eye on API Gateway’s latency and integration latency metrics, as these can help you explain spikes in end-to-end latency due to problems at the API Gateway layer.
Don’t forget to record custom latency and error metrics around your integration points, i.e., whenever you have to interact with another service from your function. If you’re concerned about adding latency to your function’s execution time, then check out this trick on how to track custom metrics asynchronously.
Finally, for latency metrics, use p95 or p99 (i.e. 95th or 99th percentile) values instead of the average. Average is easily skewed and does not give you a realistic picture of the latency most of your users are experiencing.
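A small worked example shows how far the two can drift apart. The latency samples are made up, and the percentile uses the simple nearest-rank method, which may differ slightly from how a monitoring service computes it.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 90 fast requests and 10 slow ones.
latencies = [50] * 90 + [2000] * 10

if __name__ == "__main__":
    print("average:", mean(latencies))
    print("p50:", percentile(latencies, 50))
    print("p95:", percentile(latencies, 95))
    print("p99:", percentile(latencies, 99))
```

In this sample the average lands on a latency that no request actually had, while p95 and p99 show that one in ten users waited two seconds.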
Don’t forget to create dashboards for your services and set up alarms in CloudWatch Alarms. Here are some common alarms to consider:
- Total regional concurrency. You should set up an alarm when it reaches 80% of your current regional concurrency limit as a signal to request a limit raise.
- Throttled count for each function
- Error count for each function
- p95/p99 value of API Gateway’s latency metrics
- 5xx error count for API Gateway
Being able to monitor your system and be alerted when performance metrics start to deteriorate is great, but debugging complex interactions in a microservices architecture is tricky. Alerts and metrics often show you the symptom—not the cause—and you need visibility on those inter-service interactions to understand what’s going on. Distributed tracing solutions are the performance profilers for your microservices.
Within AWS you can use X-Ray, which supports both Lambda as well as API Gateway. However, I find Epsagon to be a much more powerful tracing solution at the moment. It supports asynchronous event sources such as SNS, Kinesis and S3, and is able to trace execution flows beyond the confines of AWS.
When you are starting out and only have a handful of functions to manage, environment variables are a convenient way to manage configurations. However, as your serverless architecture becomes more expansive and includes more and more functions, this approach quickly reaches its limits.
You can’t easily share configuration across functions, and you can’t update configurations on the fly without redeploying the functions. Instead, you should consider using SSM Parameter Store, where you load and cache configurations during a cold start and periodically refresh the cache. For sensitive data such as credentials and API keys, you should apply the same approach and fetch them from AWS Secrets Manager during the cold start.
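The load, cache and refresh cycle can be sketched as follows. The `CachedConfig` class and its `loader` argument are hypothetical; in a real function the loader would call SSM Parameter Store or Secrets Manager, which I have left out so the sketch runs without AWS credentials.

```python
import time
from typing import Callable, Dict


class CachedConfig:
    """Loads configuration once at cold start and refreshes it after a TTL."""

    def __init__(self, loader: Callable[[], Dict[str, str]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        # Created at module level, this first load happens during the cold start.
        self._values = loader()
        self._loaded_at = clock()

    def get(self, key: str) -> str:
        # Refresh lazily on the first read after the cache has expired.
        if self._clock() - self._loaded_at >= self._ttl:
            self._values = self._loader()
            self._loaded_at = self._clock()
        return self._values[key]


def load_from_store() -> Dict[str, str]:
    """Placeholder loader; a real one would fetch from the parameter store."""
    return {"table_name": "users"}


# Module scope: runs once per container, not once per invocation.
config = CachedConfig(load_from_store, ttl_seconds=300)


def handler(event, context):
    return {"table": config.get("table_name")}
```

Warm invocations read from memory, so the extra latency of fetching configuration is only paid during a cold start and once per refresh interval.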
Fortunately, if you’re using the middleware engine Middy, it comes with middlewares to fetch configurations and secrets from SSM Parameter Store and Secrets Manager respectively.
Serverless security is a lot better than its serverful counterpart, for the simple reason that you no longer have to worry about a whole class of vulnerabilities around the server and OS! However, application security is still your responsibility and you need to concern yourself with common attack vectors such as SQL injection or cross-site scripting (XSS) attacks. Vendors such as Puresec are doing a great job of creating tools that can automatically identify and protect you against these common attacks.
Even if you don’t sign up for Puresec, you should consider adopting their free library FunctionShield. In addition, you should create a tailored IAM role for each function to apply the principle of least privilege. If you’re using the Serverless framework, then you can use the serverless-iam-roles-per-function plugin to do this for you. With these two simple things, you address the most pressing security concerns around your serverless architecture.
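To illustrate what a tailored role means in practice, here is a minimal sketch that builds a policy document for a function that only needs to read and write one DynamoDB table. The helper name, account ID, region and table name are placeholders I made up; the plugin mentioned above generates the equivalent from your framework configuration.

```python
import json


def table_policy(region, account_id, table_name, actions):
    """Build an IAM policy document scoped to specific actions on a single table."""
    for action in actions:
        if "*" in action:
            # Wildcards defeat the purpose of least privilege.
            raise ValueError(f"wildcard action not allowed: {action}")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": sorted(actions),
            "Resource": f"arn:aws:dynamodb:{region}:{account_id}:table/{table_name}",
        }],
    }


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    policy = table_policy("us-east-1", "123456789012", "users",
                          ["dynamodb:GetItem", "dynamodb:PutItem"])
    print(json.dumps(policy, indent=2))
```

A function with this role can touch nothing but the two listed operations on that one table, so a compromised function has very little room to do damage.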
So there you have it, a whirlwind tour through a pretty long list of things you should consider as a serverless beginner. Unfortunately, we are not able to dig deep into each of these topics in a single article, but I have included many links to help you learn more about them.
What I want you to take away from this post is that, while getting started with serverless platforms such as AWS Lambda is easy, there are a lot of nuances to making your system production-ready. None of these challenges are difficult. In fact, there are often very simple solutions. However, you still need to consider them when you’re developing your serverless application.
There is no getting away from the fact that application developers these days have a lot of non-functional requirements they need to concern themselves with. Scalability, resilience, performance, observability and security are just a few that pop up in my mind right away! Check out Nitzan Shapira’s post on 5 ways to gain observability into your serverless application.
Lambda has made addressing these concerns much easier, but there are still many gaps in the toolchain. I hope this article has offered you many insights into where these gaps are and how you should approach solving them.
As always, please let us know via the comments section below if you have any feedback regarding this article. In the meantime, check out some of these helpful open source libraries from Epsagon:
- list-lambdas: list Lambda functions across all regions with useful metadata
- lambda-cost-calculator: see the usage of Lambda functions and estimate their cost
Until next time!
By Yan Cui