Building a Serverless App Using Athena and AWS Lambda

Amazon Athena, an interactive query service that makes it easy to search data in Amazon S3 using SQL, was launched at reInvent 2016. Athena is a serverless service, meaning that you don’t need to manage any infrastructure or perform any setup, and you only have to pay for as much as you use. You can store structured data in S3, for example as JSON or CSV, and then simply query that data using SQL, just as if your S3 bucket was a database.

In this post, we will cover some details and get you started with Amazon Athena via a simple tutorial that uses Athena as infrastructure as code from a Serverless Framework project.

What are We Going to Build?

As previously mentioned, we will be presenting a use case to help you understand the power of Amazon Athena. We are going to create a simple service that lets you create new products and then search the products by name. To do this, you will have two endpoints: one that will create the product and another that will search the products by name and return some results.

This will be a totally serverless application using AWS and Serverless Framework, and the two endpoints will trigger different AWS Lambdas. The first Lambda will create a new object and store it as JSON in an S3 bucket. The second Lambda will create a new SQL query with the name provided in the query parameters and then query the product list using Athena.

Everything will be executed using infrastructure as code from our Serverless Framework project.

Saving a Product to S3

Before starting this tutorial, you need to have an AWS account and Serverless Framework installed and configured on your computer. To create a new Serverless Framework project, you  need to run the command inside an empty directory:

$ sls create --template aws-nodejs --name products-athena

This creates the boilerplate for our project, creating the serverless.yml and handler.js files.

 

We can go to the serverless.yml file and edit it like this:

service: products-athena

custom:
 BUCKET_NAME: <PICK A BUCKET NAME>

provider:
 name: aws
 runtime: nodejs8.10
 environment: ${self:custom}

 iamRoleStatements:
   - Effect: Allow
     Action:
       - "s3:PutObject"
     Resource:
       - "arn:aws:s3:::${self:custom.BUCKET_NAME}/*"
       - "arn:aws:s3:::${self:custom.BUCKET_NAME}"

functions:
 createProduct:
   handler: handler.createProduct
   events:
     - http:
         method: post
         path: product

resources:
 Resources:
   ProductS3Bucket:
     Type: AWS::S3::Bucket
     Properties:
       BucketName: ${self:custom.BUCKET_NAME}

This will create a new Lambda called createProduct that will be triggered with an HTTP event with the method POST and the path product. It will also create a new S3 bucket and give permissions to the lambda to put objects into the bucket.

You can then go to the handler.js and modify it like this:

'use strict';

const productManager = require('./productManager');

module.exports.createProduct = async (event) => {
 const product = JSON.parse(event.body);

 try {
   await productManager.saveProduct(product);
   
   return {
     statusCode: 200,
     body: 'Product was saved in the storage'
   };
 } catch (error) {
   return {
     statusCode: 400,
     body: error
   };
 }
};

You also need to create the new module named ProductManager, which will be the one saving the product to S3.

'use strict';

const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const uuidv1 = require('uuid/v1');

const FOLDER_NAME = 'raw';
const BUCKET_NAME = process.env.BUCKET_NAME;

module.exports.saveProduct = async product => {
   product.productId = uuidv1();
   const params = {
       Body: JSON.stringify(product),
       Bucket: BUCKET_NAME,
       Key: `${FOLDER_NAME}/${product.productId}`,
       ContentType: "application/json"
   };  
   return s3.putObject(params).promise();
}

The method saveProduct in the ProductManager module will save the product object into the S3 bucket that we created in our serverless.yml. The product object is a JSON object.

You can now deploy the code and test it.

A Short Introduction to Athena

As noted at the beginning of this post, Amazon Athena is an interactive query service, not a database or a database service. It is simply a query service for analyzing data in S3 with SQL.

Getting started with Athena is pretty simple. First, you need to tell Athena where your data is in S3. Then you can define the schema for querying and start performing queries using SQL. One good thing about this approach is that there can be multiple schemas for the same dataset. Athena is an asynchronous service, but most of the results are returned in seconds.

When using Athena against S3 objects, you can store your data in any form: text files, CSV, different kinds of logs, JSON, compressed files, or columnar formats like Parquet or ORC. The data won’t be moved from S3 when doing the queries, and the data won’t be affected by the queries.

Athena also uses Presto, an in-memory distributed query engine for ANSI-SQL. Because of this, you get all the features that Presto has to offer when doing your queries.

Getting Started With Athena

The first step to using Athena is to create a database and table. The database is a collection of metadata, and you need to specify the S3 bucket where the database files and metadata will be stored when creating it.

After creating the database, you need to create a table, which specifies the schema for the queries. You will also need to provide the location of the raw data when creating the table.

If you want to configure Athena in your project, you can create a new function that will only be invoked once. Upon creating the project, this function will execute the entire setup.

This is how the serverless.yml will look when we want to start using Athena.

service: products-athena

custom:
 BUCKET_NAME: <PICK A BUCKET NAME>
 ATHENA_BUCKET_NAME: <PICK A BUCKET NAME>
 DATABASE_NAME: 'products-athena'

provider:
 name: aws
 runtime: nodejs8.10
 environment: ${self:custom}

 iamRoleStatements:
   - Effect: Allow
     Action:
       - "s3:GetBucketLocation"
       - "s3:GetObject"
       - "s3:ListBucket"
       - "s3:ListBucketMultipartUploads"
       - "s3:ListMultipartUploadParts"
       - "s3:AbortMultipartUpload"
       - "s3:CreateBucket"
       - "s3:PutObject"
     Resource:
       - "arn:aws:s3:::${self:custom.BUCKET_NAME}/*"
       - "arn:aws:s3:::${self:custom.BUCKET_NAME}"
- "arn:aws:s3:::${self:custom.ATHENA_BUCKET_NAME}"
       - "arn:aws:s3:::${self:custom.ATHENA_BUCKET_NAME}/*"
   - Effect: "Allow"
     Action:
       - "glue:*"
     Resource:
         - "*"
   - Effect: "Allow"
     Action:
       - "athena:*"
     Resource:
       - "*"

functions:
 createProduct:
   handler: handler.createProduct
   events:
     - http:
         method: post
         path: product
 athenaInit:
   handler: athena.init

resources:
 Resources:
   ProductS3Bucket:
     Type: AWS::S3::Bucket
     Properties:
       BucketName: ${self:custom.BUCKET_NAME}
   AthenaBucket:
     Type: AWS::S3::Bucket
     Properties:
       BucketName: ${self:custom.ATHENA_BUCKET_NAME}

Here we can see the init-method. This method is calling the startQueryExecutionAthena two times, first with the create database query and then with the create table query. We are using the AWS SDK to call Athena from our AWS Lambda. The AWS SDK makes it very easy to use Athena, and you can take a look at the documentation for the AWS SDK here.

To create the database and table, you can now deploy the code and then run it in your terminal:

$ sls invoke -f athenaInit

This command will invoke your function in the cloud, and Athena will get configured for your queries.

Querying the Data

Now that you have created the database and table, you can go ahead and query the data. You want to search for all the products with the same name in the product table. So for that, you need to create a query like this:

SELECT * FROM products-athena WHERE name=<NAME>

You will get the name as a query parameter from the HTTP request. To do this, you need to create a new Lambda that will be triggered by an API Gateway with the method GET and the path products. This will return all the results in the HTTP response, start the previous query in Athena, and then display the results.

So you have to add this new code to the existing serverless.yml under the function property:

searchProductByName:
  handler: handler.searchProductByName
  timeout: 10 # optional, in seconds, default is 6
  events:
    - http:
        method: get
        path: product

This will create a new function triggered with an API Gateway just like you need. One important thing to remember is to change the default timeout for the function from six seconds to 10 seconds. Athena queries usually take more than six seconds to return a result.

In the handler.js, go ahead and add this method:

module.exports.searchProductByName = async (event) => {

 const name =
 event.queryStringParameters && event.queryStringParameters.name;

 const result = await productManager.searchProductByName(name);
 return {
   statusCode: 200,
   body: JSON.stringify(result)
 };
}

And then you need to add some more methods to the Athena module.

module.exports.searchProductByName = async (name) => {
   const searchQuery = "SELECT * FROM products WHERE name='"+ name +"' ";
   const queryExecutionIdSearch = await startQueryExecutionAthena(searchQuery);
   return getQueryResults(queryExecutionIdSearch);
}

async function getQueryResults(queryExecutionId) {
   var executionDone = false;

   while (!executionDone) {
       executionDone = await isExecutionDone(queryExecutionId);
       console.log('waiting...')
       sleep(2000);
   }
  
   const results = await athena.getQueryResults(queryExecutionId).promise();
   return Promise.resolve(formatResults(results));
}

async function isExecutionDone(queryExecutionId) {
   const result = await athena.getQueryExecution(queryExecutionId).promise();

   if (result.QueryExecution.Status.State === 'SUCCEEDED') {
       return Promise.resolve(true);
   } else {
       return Promise.resolve(false);
   }
}

function formatResults(results) {
   var formattedResults = [];
  
   const rows = results.ResultSet.Rows;
   rows.forEach(function(row) {         
     var value = {
           productId: row.Data[0].VarCharValue,
           name: row.Data[1].VarCharValue,
           color: row.Data[2].VarCharValue
     };
     formattedResults.push(value);
   });
  
   return formattedResults;
}

 

The last methods you added are for doing the query. So the searchProductsByName method will start the query execution. But since you want to return the result in the response, and Athena is an async service, you need to add some waiting logic in the method getSearchResults. There you need to wait for a few seconds to check if the query is ready. It takes between three and eight seconds for Athena to return something in a small testing dataset. When you do get the results, then you want to format them in order to display them in the HTTP response.

Deploying and Verifying

After all of this is in place, you can deploy the code and see it in action. Try to store some data, and then start searching in your data.

If you want to add some monitoring to this service, you can use Epsagon by simply adding an external layer to all the functions. To learn more about layers, take a look at this article. As an example, your functions will look something like this:

functions:
  createProduct:
    handler: handler.createProduct
    events:
      - http:
          method: post
          path: product
    layer:
      - arn: arn:aws:lambda:us-east-1:066549572091:layer:epsagon-node-layer:1

Then, right in your Epsagon console, you can instrument this application and start seeing the traces coming in with just one click. You can even get an architecture diagram of your application. Here below, we can see the API Gateway calling the createProduct Lambda that stores the data in the S3 bucket. We can also see the API Gateway calling the other Lambda that is doing the search by name in Athena.

Tracing with Epsagon

Conclusion

Using Athena in your serverless applications is a great way to have a serverless database where you don’t need to worry about infrastructure and can take advantage of the reliability that comes with S3. Athena combined with S3 can be considered an alternative for DynamoDB in some cases.

It’s important to keep in mind that Athena is an asynchronous service and is best used when your application does not need results to be returned immediately.

If you want to learn more about other AWS Services, there is a similar tutorial about SQS for serverless applications that is a great read as well.