How to Handle AWS Lambda Errors Like a Pro

AWS Lambda error handling is a challenge for every new Lambda user. The Lambda retry mechanism sometimes makes it difficult to follow what’s going on in your serverless application.

In this post you will learn:

  1. How AWS Lambda errors and Lambda retries work, and the idea behind them.
  2. What consequences they have for your code.
  3. How to build your system using AWS Step Functions to control AWS Lambda error handling. You will also get a useful resource for doing that.

Anyone familiar with serverless knows that it means more than running your monolithic code on a Lambda function. It is a different architecture for your whole system, one composed of distributed nodes that are activated by asynchronous events. Each node must be designed as an independent component with its own API (a “black box”), even when that API is not exposed to the outside world.

So how can we know how to define these nodes accurately? It turns out that it has a lot to do with correct Lambda error handling. And, of course, dealing correctly with the AWS Lambda retry behavior.

Lambda Retry Behavior

Lambda functions can fail in three cases:

  1. Unhandled exception: an exception is raised and never caught, whether because we received invalid input, an external API failed, or a programming bug occurred.
  2. Timeout: a Lambda that runs longer than its configured timeout is forcibly terminated with a ‘Task timed out after … seconds’ message. The AWS default is 3 seconds (the Serverless Framework uses 6 seconds unless configured otherwise), and the maximum is 15 minutes.
  3. Out of memory: in this case, the Lambda usually terminates with ‘Process exited before completing request’, and the log shows a ‘Max Memory Used’ equal to the configured ‘Memory Size’.

Example of CloudWatch out of memory log

When that happens (and be sure that it will), you will probably see your Lambda retry according to the following behavior:

1. Synchronous events

With event sources such as API Gateway, or a synchronous invocation using the SDK, the invoking application is responsible for making retries according to the response it gets from the Lambda. This is the least interesting case because it resembles error handling in a regular monolithic application.
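The caller-side retry loop for the synchronous case can be sketched roughly as follows. This is only an illustration: `invoke` is a hypothetical callable standing in for whatever performs the synchronous call and raises when the Lambda reports a failure, and the attempt count and delays are arbitrary assumptions.

```python
import time


def invoke_with_retries(invoke, payload, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call invoke(payload), retrying with exponential backoff on any exception.

    `invoke` is a hypothetical callable (not an AWS API) that performs the
    synchronous call and raises when the Lambda reports a failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke(payload)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable only so the loop can be exercised without actually waiting.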

2. Asynchronous events

For most event sources, the Lambda is invoked asynchronously. This means that no application is waiting to respond to the failure, so AWS takes care of it by itself: it triggers the Lambda again with the same event, usually twice within the following ~3 minutes (though in rare cases it may take up to six hours, and a different number of retries may occur). If all retries fail, the event often needs to be recorded rather than simply thrown away. For that purpose, the important Dead Letter Queue (DLQ) feature lets you configure an Amazon SQS queue (or an Amazon SNS topic) that receives such events.
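Attaching a DLQ to an existing function can be sketched with boto3’s `update_function_configuration` call. The helper name, the function name, and the queue ARN below are hypothetical, and the function’s execution role is assumed to already have permission to send messages to the target.

```python
def attach_dead_letter_queue(lambda_client, function_name, target_arn):
    """Send events that failed all asynchronous retries to the given SQS queue or SNS topic.

    `lambda_client` is expected to be boto3.client("lambda"); it is passed in
    so the helper stays easy to test.
    """
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": target_arn},
    )


# Hypothetical usage:
# import boto3
# attach_dead_letter_queue(
#     boto3.client("lambda"),
#     "my-function",
#     "arn:aws:sqs:eu-west-1:123456789012:failed-events",
# )
```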

3. Stream-based events

Currently, the only event sources of this type are Amazon Kinesis Data Streams and DynamoDB Streams. AWS triggers the failing Lambda again and again until the data expires or is processed successfully. Unlike with asynchronous events, AWS blocks the event source until that point.

In this post, I’ll focus on the most common and problematic case, asynchronous events, though some of the advice is relevant to the other cases as well. For a detailed explanation of retry behavior, check the AWS docs.

AWS Lambda Retry Behavior Consequences

Lambda retry - default behaviour

So each Lambda might be executed several times with the same input, without the “caller” intending it or even knowing about it. To execute the same operation safely multiple times, the Lambda must be what’s called idempotent, meaning that no additional effect takes place when it runs more than once with the same input.

The term is not unique to serverless functions. A classic example is a network API: when a request does not get a response, the same request is sent again.

In serverless architectures a similar case may happen when, for example, a Lambda times out before receiving such a response. Even if that is highly unlikely, in some cases incorrect retry handling may cause severe problems, such as a DB structure violation.

Idempotency

“Idempotence is the property of certain operations in mathematics and computer science that they can be applied multiple times without changing the result beyond the initial application” (Wikipedia).

But wait, what if we need to execute the same operation twice when it’s not a retry? For example, let’s say that the Lambda receives a user operation log as input and is responsible for recording it in a database. In that case, we need to differentiate between a retry and a case where the trigger input of the Lambda is simply the same because the user did the same operation again.

A good solution is to treat the Lambda’s request ID as if it were part of the input itself. Only a Lambda retry gets the same ID. To extract it, use context.awsRequestId in Node.js (or the corresponding field in other languages, such as context.aws_request_id in Python). This is actually the general approach for detecting retry executions.

Using the request ID to be genuinely idempotent is not always convenient. In the previous example, the ID would have to be saved in the DB as well, so that subsequent invocations could check whether to add a new record. Another solution is to use an in-memory data store (such as Redis), but again, that adds quite a significant overhead.
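To make the request ID approach concrete, here is a minimal sketch of the user operation log example. The `RequestLog` class is a hypothetical in-memory stand-in for a database with a uniqueness constraint on the request ID. A real Lambda would need a persistent store (for example, a conditional write in DynamoDB), since memory is not shared between containers.

```python
class RequestLog:
    """In-memory stand-in (an assumption) for a table keyed by request ID."""

    def __init__(self):
        self._records = {}

    def put_if_absent(self, request_id, record):
        """Store the record unless this request ID was already seen; return True if stored."""
        if request_id in self._records:
            return False
        self._records[request_id] = record
        return True


def make_handler(log):
    """Build a Lambda-style handler that records each operation at most once per request ID."""

    def handler(event, context):
        # A retry carries the same aws_request_id, so it is skipped.
        # A genuinely repeated user operation arrives with a new ID and is stored.
        stored = log.put_if_absent(context.aws_request_id, event)
        return {"stored": stored}

    return handler
```

With this shape, two invocations with the same event but different request IDs both produce a record, while a retry of either one does not.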

Step Functions to the Rescue

Error handling in AWS Lambda can be achieved in various ways, such as using wrappers. However, AWS Step Functions turns out to be a beneficial, even crucial, feature when building a serverless application that deals with errors and retries properly. The Hitchhiker’s Guide to Step Functions provides a good tutorial overview.

Motivation

Let’s say that in response to an event, the application has to perform several operations. If you combine all of them in the same Lambda, the code usually has to check, for each operation, whether it should be redone so that the whole Lambda remains idempotent. That can be a real pain. It is important to understand the difference between our example and monolithic applications: in a monolith, the application itself can be responsible for making retries since it can wait between them, and that’s not possible in serverless.

On the other hand, with Step Functions, we can run each operation on a different Lambda. We can also define the transitions between them as suitable for the specific case. Moreover, we can control the retry behavior, both the number of retries and the delay between them, and make it the most suitable for our use case. We can even disable retries when that’s the right thing to do. From my experience, creating a state machine even for a single Lambda is the easiest workaround to disable unwanted retry behavior.
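As a rough sketch of that control, the snippet below builds a Task state with a "Retry" field from the Amazon States Language and computes the resulting waits. The helper names and the placeholder ARN are my own assumptions. The field names (ErrorEquals, IntervalSeconds, MaxAttempts, BackoffRate) come from the States Language itself.

```python
import json


def retry_policy(max_attempts, interval_seconds=1, backoff_rate=2.0, errors=("States.ALL",)):
    """Build one retrier for a Task state's "Retry" field."""
    return {
        "ErrorEquals": list(errors),
        "IntervalSeconds": interval_seconds,
        "MaxAttempts": max_attempts,  # 0 means the matched errors are never retried
        "BackoffRate": backoff_rate,
    }


def retry_delays(policy):
    """Seconds waited before each retry: interval, interval * rate, interval * rate^2, ..."""
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** attempt
        for attempt in range(policy["MaxAttempts"])
    ]


# Hypothetical Task state: retry any error up to 3 times, waiting 2s, 4s, then 8s.
first_step = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:first_step",  # placeholder ARN
    "Retry": [retry_policy(max_attempts=3, interval_seconds=2)],
    "End": True,
}

if __name__ == "__main__":
    print(json.dumps(first_step, indent=2))
    print(retry_delays(first_step["Retry"][0]))
```

Leaving out the "Retry" field, or setting MaxAttempts to 0, gives a Task that is never retried.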

Lambda error handling using Step Functions

Going step by step

Implementation

You may know that, unfortunately, the available triggers for AWS Step Functions are rather limited, mainly API Gateway and a manual execution using the SDK. Because of that, we have created a template for a Python Lambda that you can use as glue code to execute a state machine asynchronously in response to any event. In short, it is just:

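import os
import json
import boto3

# Created once per container and reused across invocations.
client = boto3.client('stepfunctions')

def run(event, context):
    # Start the state machine with the triggering event as its input.
    # Naming the execution after the request ID keeps a Lambda retry
    # from starting a second execution.
    client.start_execution(
        stateMachineArn=os.environ['CF_MyStateMachine'],
        name=str(context.aws_request_id),
        input=json.dumps(event)
    )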

A complete ready-to-use template is available on a public repository.

To deploy this Lambda you should use the Serverless framework, with the awesome serverless-resources-env plugin in order to pass the state machine ARN easily. Make sure also to use serverless-step-functions and serverless-pseudo-parameters to define the state machine easily as in the following example:

service: state-machine-invoking-example

provider:
  name: aws
  region: eu-west-1
  runtime: python3.6
  # Specific Role for the Lambda and machine is better
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "states:StartExecution"
      Resource:
        - "*"

functions:
  first_step:
    handler: simple_lambda.run
  second_step:
    handler: another_lambda.run
    timeout: 5
  machine_invoker:
    handler: state_machine_invoker.run
    events:
      - sns: 'arn:aws:sns:eu-west-1:xxxxxxxx:sns_name'
    custom:
      env-resources:
        - MyStateMachine

stepFunctions:
  stateMachines:
    exampleMachine:
      name: myStateMachine
      definition:
        StartAt: firstStep
        States:
          firstStep:
            Type: Task
            Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-first_step
            TimeoutSeconds: 6
            Next: secondStep
          secondStep:
            Type: Task
            Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-second_step
            TimeoutSeconds: 5
            End: true

plugins:
  - serverless-step-functions
  - serverless-pseudo-parameters
  - serverless-resources-env

We arbitrarily chose an SNS event to trigger the state machine. The event is accessible to the first step Lambda as its input. Because we named the state machine execution after the invoker Lambda’s request ID, everything becomes idempotent: if the invoker Lambda is retried, AWS gives it the same request ID, and AWS won’t execute the state machine again since the execution name is the same. Theoretically speaking, the execution name of the state machine is also a part of its input. While this solution is useful in many situations, keep in mind that it also adds some complexity overhead, which affects the debugging and overall observability of the system.
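One detail worth handling in the invoker: as far as I understand the StartExecution API, reusing the name of an execution that has already finished is rejected with an ExecutionAlreadyExists error rather than silently ignored. If the invoker lets that error escape, the retry itself fails. A hedged sketch of treating it as success follows; the `start_once` helper is hypothetical, and `sfn_client` is assumed to be `boto3.client('stepfunctions')`.

```python
import json


def start_once(sfn_client, state_machine_arn, event, request_id):
    """Start the state machine, treating a duplicate execution name as already handled.

    Returns True if the start call was accepted and False if Step Functions
    reported that an execution with this name already exists.
    """
    try:
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            name=request_id,
            input=json.dumps(event),
        )
        return True
    except sfn_client.exceptions.ExecutionAlreadyExists:
        # A previous attempt with this request ID already ran the machine,
        # so the retry has nothing left to do.
        return False
```

Inside the invoker Lambda this would be called as `start_once(client, os.environ['CF_MyStateMachine'], event, context.aws_request_id)`.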

Things to Notice

It’s important to understand the error handling mechanism of Step Functions, which is different from Lambda’s. For every Task state, a timeout duration can be set, so that if the Task does not finish in time a States.Timeout error is generated. This timeout is basically unlimited. However, the typical case of a Task executing a Lambda is different: the Lambda’s actual timeout is determined only by its own configured value, so the Task timeout cannot extend it. Therefore, make sure to configure the Task timeout to be equal to the Lambda’s timeout. Unlike with Lambda, the retry behavior of a Task is disabled by default and can be configured explicitly.

AWS Lambda error handling with Step Functions

Conclusion

Error handling in AWS serverless architecture is quite confusing, and understanding how it affects your system is not always easy. I think that it would be better if you could manage the retry behavior of AWS Lambda. The same goes for AWS Step Functions. A retry counter field in the context parameter is obviously a missing feature.

There are different techniques for Lambda error handling besides the ones mentioned in this article. Using wrappers is a common example.

Nevertheless, I believe that the proposed architecture with Step Functions is useful in many cases. AWS Lambda error handling is just one of them. Besides helping to control Lambda retries correctly, it encourages separation of elements, which is a good practice in the serverless world.