Error Handling in AWS Lambda With Wrappers

We have been using web frameworks to develop web applications since long before serverless came around, and middlewares are a staple of these web frameworks. Express.js, for instance, lets you create middlewares at several stages of the request handling pipeline, and even ships with a few common middlewares out of the box.

As our code moves into Lambda functions and we move away from these web frameworks, are middlewares still relevant? If so, how might they look in this new world of serverless?

In this post, we’ll revisit the idea of middlewares, their role in application development with AWS Lambda, and how we can use middlewares to enforce consistent error handling across all of our Lambda functions.

What are middlewares?

Within the context of web frameworks, middlewares are typically functions that have access to both the request and response objects and can end the request-response lifecycle prematurely.

Middlewares are often used to implement cross-cutting concerns. For example, if every endpoint needs to have the same authentication logic, then why not implement that logic as middleware and apply it to every single endpoint? Other popular use cases include performance tracing, request validation, JSON serialization, and error handling.
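
For illustration, here is a minimal sketch of the idea in Express.js; the route and the error response shape are made up for the example:

const express = require('express');
const app = express();

// A regular middleware: runs before the route handlers for every request.
app.use((req, res, next) => {
  req.receivedAt = Date.now();
  next();
});

app.get('/hello', (req, res) => {
  res.json({ message: 'hello world' });
});

// An error-handling middleware: Express recognizes it by its four arguments,
// and it runs whenever a route handler passes or throws an error.
app.use((err, req, res, next) => {
  console.error(err);
  res.status(500).json({ errorMessage: 'something went wrong' });
});

app.listen(3000);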

Since middlewares are simple functions, they can be encapsulated into shared libraries and reused across multiple projects. This lets you apply the same consistent handling to these common concerns. When adopted by everyone in the team, this consistency makes life so much easier during development and when you have to debug production issues.

It’s easy to underestimate the importance of handling errors consistently. When you are under time pressure to resolve problems that arise in production, it’s a godsend to be able to count on every API to:

  • always record detailed information about the error
  • use the same attribute names to make them easy to search for in your logs
  • always record the same latency and error metrics
  • always follow the same naming convention for the metrics

This consistency has a massive impact on your mean time to recovery (MTTR) from an incident.

One drawback of middlewares is that their implementation tends to be web framework-specific. A middleware written for Express.js is unlikely to work with another web framework, and vice versa. While this is not a showstopper, it does mean extra overhead if you need to support multiple web frameworks, or that you have to mandate that the team always uses the same framework.

Are middlewares still relevant to Lambda?

With Lambda we don’t have to use web frameworks anymore, but are middlewares still useful on their own? If you’re building APIs with API Gateway and Lambda, then many of the aforementioned cross-cutting concerns can and should be addressed outside of the function.

Take authentication, for instance. API Gateway offers a wealth of options for managing access to your APIs. While you can still implement authentication in your own code, there are good reasons why you shouldn't:

  • It’s undifferentiated heavy lifting!
  • API Gateway does not charge for requests that fail its authentication, but if invalid requests flow all the way to the function, you would incur charges for API Gateway as well as Lambda invocations (plus associated costs for CloudWatch Logs, etc.).
  • Allowing invalid requests to flow to the function means greater concurrency for Lambda. Higher concurrency means more cold starts, some of which would affect valid requests.
  • You have to implement caching for authentication results on your own.

Similarly, performance tracing (CloudWatch + X-Ray), JSON serialization, and basic error handling are all implemented by the service outside of our code. So do we still need middlewares?

I believe so. For starters, the default error handling behavior you get with API Gateway and Lambda is really naive:

  • Log the error message and stack trace.
  • Increment the Invocation errors metric.
  • Return the 500 error code to the caller.

In most cases, you would want to extend the default behavior. For example:

  • Log the error message and stack trace with correlation IDs (request ID, user ID, etc.) so that you can correlate logs from multiple functions for the same user action.
  • Classify the error by type, and signal to the caller whether the error can be retried and how long they should wait before retrying. In the JSON response body, you might include retryable and delayBeforeRetry fields for this:
{
  "errorMessage": "DynamoDB ReadThroughput was exceeded",
  "retryable": true,
  "delayBeforeRetry": 100
}
  • Record count metrics for each type of error.
  • In cases where a default response can suffice (sacrifice precision for availability), return a default response instead of erroring.
  • Return a request ID and internal error code instead of surfacing an error message that doesn’t make any sense to the user. The UI can translate the error code to a localized, user-friendly message. When the user contacts your customer support team, they also have information that can make diagnosis easier.
{
  "errorCode": 14001,
  "requestId": "58246c0c-5fbe-4ef3-8cc9-bc1e69c625dd"
}

Also, API Gateway is just one of many event sources for Lambda. Your error handling strategy needs to span across all of these different event sources and take into account the built-in retry behavior for each event source.

Take Kinesis Data Streams as an example. The retry-until-success retry behavior is useful when the stream is used as a task queue and you need to guarantee that every message is processed in sequence. However, this behavior renders you vulnerable to poison messages and is unsuitable for a real-time system. In these systems, it’s more important to process the events in real-time, and it’s okay to lose a few messages that cannot be processed.

There's no one-size-fits-all solution here. Depending on what other tools you use, you might need to introduce other custom actions, such as reporting trace segments to a distributed tracing solution.

Middlewares are a great way to encapsulate these custom error handling behaviors and make them reusable.

Middy, the Middleware Engine

For Node.js Lambda functions, the Middy middleware engine makes it simple to write your custom middlewares. It also comes with a number of handy middlewares out-of-the-box. Middy employs an onion-like architecture where middlewares are applied in order before your handler code is run, and then in reverse order after your handler code has finished.

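To make the execution order concrete, here is a minimal sketch of a Middy middleware, assuming a recent version of Middy (the @middy/core package) where each hook is an async function that receives a request object; the middleware itself and its log output are purely illustrative:

const middy = require('@middy/core');

// A middleware is just an object with optional before/after/onError hooks.
const timingMiddleware = () => {
  let startedAt;
  return {
    before: async (request) => {
      // Runs before the handler; request.event and request.context are available here.
      startedAt = Date.now();
    },
    after: async (request) => {
      // Runs after the handler; request.response holds the handler's result.
      console.log(`handler finished in ${Date.now() - startedAt}ms`);
    },
    onError: async (request) => {
      // Runs if the handler (or another middleware) throws; request.error holds the error.
      console.error('invocation failed', request.error);
    }
  };
};

const baseHandler = async (event, context) => {
  return { statusCode: 200, body: JSON.stringify({ ok: true }) };
};

module.exports.handler = middy(baseHandler).use(timingMiddleware());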

At DAZN, we implemented a number of middlewares to standardize the way we sample debug logs in production and how we capture and forward correlation IDs from one function to another.

To sample debug logs, our middleware enables debug logging on our custom logger (based on a configured sample rate) before the handler code runs, then rolls the logger back to its previous log level afterward.

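A rough sketch of such a middleware might look like the following; the log module and its enableDebug function (assumed to return a function that restores the previous level) are hypothetical stand-ins for your own logger:

// Hypothetical logger module with an adjustable log level.
const log = require('./log');

// Enable debug logging for a configurable percentage of invocations,
// and restore the previous log level once the invocation finishes.
const sampleLogging = ({ sampleRate = 0.01 } = {}) => {
  let rollback;

  return {
    before: async () => {
      if (Math.random() <= sampleRate) {
        rollback = log.enableDebug();
      }
    },
    after: async () => {
      if (rollback) { rollback(); rollback = undefined; }
    },
    onError: async () => {
      if (rollback) { rollback(); rollback = undefined; }
    }
  };
};

module.exports = sampleLogging;

You would then attach it to a handler with middy(handler).use(sampleLogging({ sampleRate: 0.05 })).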

Similarly, you can write middlewares that would encapsulate your error handling strategy and distribute them as NPM packages.

Even if you’re not using Node.js, you can still apply the same concept and create your own middleware engine. I find this to be a much more scalable solution than creating bespoke middlewares each time.

Considerations for an Error Handling Middleware

As discussed earlier, you should modify the response payload for API Gateway when an error occurs. Your goal should be to create a consistent error handling strategy that gives your users the best experience under these unfortunate circumstances.

Event source-specific considerations like this aside, there are several things to think about when designing your error handling middleware. The goal is to create a middleware that leaves enough clues about what happened so that you have an easier time debugging the problem.

What Information to Log

The single most important thing you should do when an error occurs is to log enough information to make it easy for you to debug this error.

At the very least, you should record:

  • A marker message, something simple and easily searchable. This makes it easy to find all error messages captured by this middleware.
  • The invocation event captured as JSON. This makes it easy to replay the failed invocation to help you better understand the problem.
  • The AWS request ID, which is passed into the function as part of the context object. This makes it easy to find other related log messages for the invocation.
  • Any captured correlation IDs. They make it easy to find other related log messages for the entire call chain.
  • The error message and stack trace.

In addition to the above, you might also consider including other properties of the context object, or perhaps even the whole object itself! It contains a great deal of useful information about the invocation, as shown in the example below.

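Here is a rough sketch of an onError middleware that records this information; the correlation-ids helper and the exact attribute names are illustrative assumptions rather than a specific library:

// Hypothetical helper that captured correlation IDs earlier in the invocation.
const correlationIds = require('./correlation-ids');

const logError = () => ({
  onError: async (request) => {
    const { event, context, error } = request;

    // One structured log entry: a fixed marker message plus searchable attributes.
    console.error(JSON.stringify({
      message: 'invocation failed',           // marker message, easy to search for
      awsRequestId: context.awsRequestId,     // from the context object
      functionName: context.functionName,
      functionVersion: context.functionVersion,
      remainingTimeMs: context.getRemainingTimeInMillis(),
      correlationIds: correlationIds.get(),   // IDs for the whole call chain
      invocationEvent: event,                 // lets you replay the failed invocation
      errorMessage: error.message,
      stackTrace: error.stack
    }));

    // We don't set request.response here, so the error keeps propagating
    // and the invocation is still reported as failed.
  }
});

module.exports = logError;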

Integrating with External Vendors

Before you go to production with Lambda, you should ensure that you set up centralized logging for your functions. This upfront effort will prove invaluable as you create more and more functions. If you’re using other tools to help monitor your production environments such as Sentry or Epsagon, then you should also integrate with them as part of your middleware.

One common integration is to track errors with custom metrics (as discussed earlier) and then report those metrics to a monitoring service such as CloudWatch Metrics.
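
For instance, a sketch of publishing an error-count metric with the AWS SDK might look like this; the namespace, metric name, and dimension are made up for illustration:

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

// Record one data point for a given error type against the function's name.
const recordErrorMetric = (errorType, functionName) =>
  cloudwatch.putMetricData({
    Namespace: 'my-app',                    // hypothetical namespace
    MetricData: [{
      MetricName: `error.${errorType}`,     // e.g. error.ThrottlingException
      Unit: 'Count',
      Value: 1,
      Dimensions: [{ Name: 'FunctionName', Value: functionName }]
    }]
  }).promise();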

Limitations

One particular limitation to keep in mind, especially when integrating with third parties, is the timeout for your function.

If you’re integrating with third parties through API calls, then the added latency from these calls can cause the invocation to timeout. Where possible, you should consider making these API calls fire-and-forget—i.e., don’t wait for the callback or Promise to complete.
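
As a sketch, the recordErrorMetric call from the earlier example could be made fire-and-forget inside the handler by not awaiting it and only attaching a catch handler so a failure doesn't become an unhandled rejection:

// Kick off the request but don't hold up the invocation waiting for it.
// Caveat: Lambda may freeze the execution environment as soon as the handler
// returns, so the request is not guaranteed to complete; losing the occasional
// data point is the trade-off for not adding the call's latency.
recordErrorMetric('ThrottlingException', context.functionName)
  .catch(err => console.error('failed to publish error metric', err));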

Alternatively, many vendors specializing in serverless also offer the option to ingest data through CloudWatch Logs (which are collected asynchronously by Lambda without adding to your function’s invocation time). Using this approach, you might record additional information such as custom metrics by writing special messages to stdout and processing them later.
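
For example, a sketch of this approach might write a structured line to stdout and leave it to a log-processing pipeline (for example, a subscription filter on the log group) to turn it into a metric; the MONITORING prefix is just a made-up convention:

// Record the metric as a specially formatted log line instead of making an API call.
// A downstream log processor would parse lines with this prefix and publish the metric.
const recordErrorMetricAsync = (errorType, functionName) => {
  console.log(`MONITORING|1|count|error.${errorType}|${functionName}`);
};

recordErrorMetricAsync('ThrottlingException', 'my-function');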

The downside of this approach is that it increases your cost for CloudWatch Logs, which at $0.50 per GB ingested is a relatively expensive service. In practice, it’s common to spend more on CloudWatch Logs in production than on Lambda invocations. Writing large amounts of information to CloudWatch Logs is only going to increase that cost exposure further.

There are also differences between the programming languages supported by AWS Lambda; Middy, for example, is specific to Node.js, so the tooling available for this pattern varies from one runtime to another.

Nesting Middlewares

Finally, the ordering of middlewares is also important. Generally speaking, error handling middlewares should be executed last at the end of the invocation. Given the way Middy’s onion-like architecture works, the error handling middleware should be the first in the chain. This way, any unintended errors from other middlewares won’t escape the attention of our error handling middleware. 
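
In practice, that means registering the error handling middleware before the others, as in this sketch; logError, sampleLogging, and timingMiddleware are the illustrative middlewares from the earlier examples:

// Registered first, so its onError hook runs last and sees errors thrown
// by the handler and by any of the other middlewares.
module.exports.handler = middy(baseHandler)
  .use(logError())                            // error handling: first in the chain
  .use(sampleLogging({ sampleRate: 0.01 }))
  .use(timingMiddleware());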

Summary

In this post, we discussed the role of middlewares in traditional web development as a way to create reusable components to address common cross-cutting concerns. In the world of Lambda, where we are no longer reliant on web frameworks, middlewares still stand as a useful abstraction layer on their own. While many cross-cutting concerns can now be implemented at the service layer, there is always room for customization and other concerns that can’t be easily addressed otherwise.

When it comes to writing middlewares, it’s better to use a lightweight engine such as Middy than to create them as standalone wrappers. These middleware engines make it easy to create middlewares that can be composed together in a coherent way and make the middlewares easier to reuse.

Finally, we discussed several important considerations when designing an error handling middleware:

  • What information to log
  • Integration with external vendors
  • Consideration for timeout limits and the cost of writing to CloudWatch Logs
  • Error handling middleware should be the last to execute after the handler code

In more complex scenarios where you need more control over retry behavior, you should consider using Step Functions instead. This introduces additional cost for Step Functions state transitions, but it gives you a lot more flexibility in return:

  • The ability to control the number of retries and the delay between retries by error type (see the sketch after this list)
  • No need to worry about the invocation timing out whilst retrying
  • The auditing and visualization capabilities that Step Functions offers
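
As an illustration, retry behavior in a Step Functions state machine is declared per state in its definition; the state name, function ARN, error name, and numbers below are placeholders:

"GetUserProfile": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:get-user-profile",
  "Retry": [
    {
      "ErrorEquals": ["ThrottlingException"],
      "IntervalSeconds": 1,
      "MaxAttempts": 5,
      "BackoffRate": 2.0
    },
    {
      "ErrorEquals": ["States.ALL"],
      "IntervalSeconds": 2,
      "MaxAttempts": 2
    }
  ],
  "End": true
}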

That’s it for another guest post from me. I hope you learned something new! Let us know your thoughts.