How To Handle AWS Lambda Errors Like A Pro

In this post you will learn:

  1. How AWS Lambda errors and retries work, and the idea behind them.
  2. The consequences this behavior has for your code.
  3. How to build your system with AWS Step Functions to control error handling, together with a useful resource for doing so.

Anyone familiar with Serverless knows that it does not just mean executing your monolithic code on Lambda functions; it is a different architecture for your whole system. In this architecture, the system is composed of distributed nodes activated by asynchronous events. Each node must be designed as an independent component with its own API (a 'black box'), even when that API is not exposed to the outside world. So how can we define these nodes accurately? It turns out that this has a lot to do with correct error handling and correct treatment of AWS retry behavior.

Lambda Retry Behavior

Lambda functions can fail in three cases:

  1. An unhandled exception is raised, whether because of invalid input, a failed external API call, or simply a programming bug.
  2. Timeout: a Lambda running longer than its configured timeout duration is forcibly terminated with a 'Task timed out after … seconds' message. The default timeout is 3 seconds when creating a function directly in AWS (6 seconds with the Serverless Framework), and the maximum configurable value is 15 minutes.
  3. Out of memory: in this case, the Lambda usually terminates with 'Process exited before completing request', and the reported 'Memory Size' equals 'Max Memory Used'.

Example of CloudWatch out of memory log
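For the timeout case, a handler can watch its remaining time budget through the context object's get_remaining_time_in_millis() method and stop gracefully instead of being killed mid-operation. Here is a minimal sketch; the FakeContext class and the record-processing logic are illustrative stand-ins for what a real deployment would have:

```python
# Sketch: stop work before Lambda kills the process. In a real handler the
# 'context' argument is supplied by Lambda and exposes
# get_remaining_time_in_millis(); FakeContext below simulates it for local runs.

SAFETY_MARGIN_MS = 500  # stop early rather than being terminated mid-write

def process_records(records, context):
    """Process as many records as the remaining time budget allows."""
    done = []
    for record in records:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # return partial progress instead of timing out
        done.append(record.upper())  # placeholder for real work
    return done

class FakeContext:
    """Simulated Lambda context that burns ~400 ms per check."""
    def __init__(self, budget_ms):
        self.budget_ms = budget_ms
    def get_remaining_time_in_millis(self):
        self.budget_ms -= 400
        return self.budget_ms
```

Returning partial progress like this only helps, of course, if the caller (or the next retry) can pick up where the handler left off.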

When that happens (and be sure that it will), your Lambda will probably be retried according to the following behavior:

  1. Synchronous events: for event sources such as API Gateway, or synchronous invocation using the SDK, the invoking application is responsible for retrying according to the response it gets from the Lambda. This is the least interesting case because it is essentially regular monolithic error handling.
  2. Asynchronous events: for most event sources, the Lambda is invoked asynchronously, meaning there is no application waiting to respond to the failure, so AWS takes care of it by itself. It triggers the Lambda again with the same event, usually twice within the following ~3 minutes (though in rare cases the retries may spread over up to six hours, and a different number of retries may occur). If all retries fail, it is often necessary that the event be recorded rather than simply thrown away. For that purpose, the important Dead Letter Queue (DLQ) feature lets you configure an Amazon SQS queue (or SNS topic) to receive such events.
  3. Stream-based events: currently, the only event sources of this type are Amazon Kinesis Data Streams and DynamoDB streams. A failing Lambda is triggered again and again until the data expires or is processed successfully. Unlike asynchronous events, the event source is blocked until that point.
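As an illustration, with the Serverless Framework a DLQ target can be attached to a function through the `onError` property (the framework supports SNS topic ARNs here; the ARN below is a placeholder):

```yaml
functions:
  my_async_handler:
    handler: handler.run
    # After all automatic retries fail, the failed event is delivered here
    # instead of being dropped. Placeholder ARN for illustration only.
    onError: arn:aws:sns:eu-west-1:xxxxxxxx:my-dlq-topic
```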

In this post, I'll refer mostly to the most common and problematic case of asynchronous events, though some of the advice is relevant to the other cases as well. For a detailed explanation of retry behavior, see the AWS docs.

Retry Behavior Consequences

Since each Lambda might be executed several times with the same input, even though the 'caller' did not intend the operation to run more than once, the Lambda must be what's called idempotent: no additional effect takes place when it runs more than once with the same input.

This term is not unique to Serverless functions: a classic example is a network API in which the answer to some request never arrives, so the same request is sent again. In a Serverless architecture a similar case may happen when, for example, a Lambda times out before receiving such a response. Even if that is rare, incorrect retry handling may cause severe problems such as corrupting the structure of your database.

"Idempotence is the property of certain operations in mathematics and computer science that they can be applied multiple times without changing the result beyond the initial application" (Wikipedia).

But wait: what if the same operation has to be executed twice when it's not a retry? For example, say the Lambda receives a user operation log as input and is responsible for recording it in a database. In that case, it needs to differentiate between a retry and the Lambda simply being triggered with the same input because the user performed the same operation again.

A good solution is to treat the Lambda's request ID as if it were part of the input itself, because the same ID is given only when the Lambda is retried. To extract it, use context.awsRequestId in Node.js, context.aws_request_id in Python, or the corresponding field in other languages. This is the general approach to detecting retry executions.

Using the request ID to be genuinely idempotent is not always convenient. In the previous example, the ID would have to be saved in the DB as well, so that subsequent invocations could determine whether to add a new record. Another solution is to use an in-memory data store (such as Redis), but again, that adds quite significant overhead.
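The request-ID pattern can be sketched as a small decorator. This is illustrative only: it uses an in-memory set where a real deployment would need a durable shared store (a DynamoDB conditional write or Redis SETNX), and all names here are made up for the example:

```python
# Sketch only: _seen_request_ids must live in a durable shared store in
# production, because each Lambda invocation may run in a fresh process.
import functools

_seen_request_ids = set()

def idempotent(handler):
    """Skip the handler body when the same request ID arrives again (a retry)."""
    @functools.wraps(handler)
    def wrapper(event, context):
        if context.aws_request_id in _seen_request_ids:
            return None  # retry detected: do nothing the second time
        _seen_request_ids.add(context.aws_request_id)
        return handler(event, context)
    return wrapper

records = []

@idempotent
def record_user_operation(event, context):
    records.append(event["operation"])  # the side effect we must not repeat
    return "recorded"

class FakeContext:
    """Local stand-in for the context object Lambda supplies."""
    def __init__(self, request_id):
        self.aws_request_id = request_id
```

A repeated user operation arrives with a new request ID and is recorded again, while a retry reuses the old ID and is skipped.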

Step Functions to the Rescue

It turns out that AWS Step Functions is a very helpful feature, even a crucial one, when building a Serverless application that deals with errors and retries properly.

Let's say that in response to an event, the application has to perform several operations. If all of them are combined into the same Lambda, the code usually has to check, for each operation, whether it has to be redone so that the whole Lambda remains idempotent, and that can be a real pain. It's important to understand the difference from monolithic applications, where the application itself can be responsible for retries because it can wait between them; that is not possible in Serverless.

On the other hand, with Step Functions we can run each operation in a different Lambda and define the transitions between them as appropriate for the specific case. Moreover, we can control the retry behavior (number of retries and delay duration) to suit each step, and even disable it when that is the right thing to do. From my experience, creating a state machine even for a single Lambda is the easiest workaround to disable unwanted retry behavior.

Going step by step

If you are already familiar with Step Functions, you may know that, unfortunately, the only currently available triggers are API Gateway and manual execution using the SDK. Because of that, we have created a template for a Python Lambda that can be used as glue code to execute a state machine asynchronously in response to any event, which in short is just:

import os
import json
import boto3

client = boto3.client('stepfunctions')

def run(event, context):
    # Naming the execution after the request ID makes this Lambda idempotent:
    # a retry reuses the same ID, and Step Functions rejects a second
    # execution with the same name.
    client.start_execution(
        stateMachineArn=os.environ['CF_MyStateMachine'],
        name=str(context.aws_request_id),
        input=json.dumps(event)
    )

A complete ready-to-use template is available on a public repository.

To deploy this Lambda, use the Serverless Framework with the awesome serverless-resources-env plugin, which passes the state machine ARN to the function easily. Also make sure to use serverless-step-functions and serverless-pseudo-parameters to define the state machine conveniently, as in the following example:

service: state-machine-invoking-example

provider:
  name: aws
  region: eu-west-1
  runtime: python3.6
  # Specific Role for the Lambda and machine is better
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "states:StartExecution"
      Resource:
        - "*"

functions:
  first_step:
    handler: simple_lambda.run
  second_step:
    handler: another_lambda.run
    timeout: 5
  machine_invoker:
    handler: state_machine_invoker.run
    events:
      - sns: 'arn:aws:sns:eu-west-1:xxxxxxxx:sns_name'
    custom:
      env-resources:
        - MyStateMachine

stepFunctions:
  stateMachines:
    exampleMachine:
      name: myStateMachine
      definition:
        StartAt: firstStep
        States:
          firstStep:
            Type: Task
            Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-first_step
            TimeoutSeconds: 6
            Next: secondStep
          secondStep:
            Type: Task
            Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-second_step
            TimeoutSeconds: 5
            End: true

plugins:
  - serverless-step-functions
  - serverless-pseudo-parameters
  - serverless-resources-env

We artificially made a state machine triggered by an SNS event, which the initial step's Lambda receives as input. Because we named the state machine execution after the invoker Lambda's request ID, everything becomes idempotent: if the invoker Lambda is retried, AWS gives it the same request ID, and Step Functions then refuses to start another execution with the same name. Theoretically speaking, the execution name of the state machine thus becomes part of its input. This solution is useful in many situations, but keep in mind that it also adds some complexity overhead that affects debugging and the overall observability of the system.

It's important to understand the error handling mechanism of Step Functions, which differs from Lambda's. For every Task state, a timeout duration can be set, so that if the Task does not finish in time, a States.Timeout error is generated. This timeout is essentially unlimited, but for the typical case of a Task executing a Lambda, the Lambda's actual timeout is still determined by its own configured value (so it cannot be extended this way). Therefore, make sure to configure the Task's timeout to equal the Lambda's timeout. Unlike Lambda's, the retry behavior of a Task is disabled by default and can be configured explicitly.
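For instance, a Task's retry policy can be declared right in the state machine definition using the standard Retry fields of the Amazon States Language; the values below are illustrative, extending the secondStep state from the example above:

```yaml
secondStep:
  Type: Task
  Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-second_step
  TimeoutSeconds: 5           # equal to the Lambda's own timeout
  Retry:
    - ErrorEquals: ["States.Timeout"]
      MaxAttempts: 2          # retry timeouts twice...
      IntervalSeconds: 10     # ...starting 10 seconds apart...
      BackoffRate: 2.0        # ...doubling the delay each time
    - ErrorEquals: ["States.ALL"]
      MaxAttempts: 0          # any other error: fail immediately, no retries
  End: true
```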

Conclusions

Error handling in AWS Serverless architecture is quite confusing, and understanding how it affects your system is not always easy. I think it would be better if the retry behavior of AWS Lambda were configurable (as it is for Step Functions), and a retry counter field in the context parameter is an obvious missing feature.

Nevertheless, I believe the proposed architecture with Step Functions is useful in many cases, and besides helping to handle errors and retries correctly, it encourages separation of components, which is good practice in the Serverless world.