Launched at AWS re:Invent 2016 by Andy Jassy, Amazon Athena is an interactive tool for querying data stored in Amazon S3. It leverages standard SQL queries to process data quickly and is cost-effective, as you are charged only for the queries run.

Athena’s serverless capabilities also allow you to scale automatically by executing multiple–complex–parallel queries against large data sets.

This blog post will explore the internals of Amazon Athena, present use cases and alternatives, and walk you through an example of running a query using Athena in a step-by-step guide.

Why Amazon Athena?

Amazon Athena offers numerous benefits:  

  • Serverless service: Managing infrastructure to handle large data sets can be complex. With no infrastructure setup or management requirements, Athena is ideal for organizations.
  • Pay as you go: One of Athena’s key selling points is that you pay only for the queries you run and are charged for the data scanned by each query alone. This truly makes Athena the most cost-effective service available.
  • Scalability: Amazon Athena is powered by AWS core services, which allow you to scale quickly by executing parallel queries against large data sets.
  • Standard SQL queries: The ability to query data with Standard SQL queries makes Amazon Athena easy to use. Anyone with basic SQL knowledge can query large data sets as well as leverage join arrays and window function capabilities of SQL. The queries can be saved and used again as required.
  • Data processing: Amazon Athena uses PrestoDB, which can process unstructured, semi-structured, and structured data sets. The data can be in CSV, JSON, logs, or columnar data formats, such as Apache Parquet or ORC.

These, along with additional features including security, availability, and integration with AWS services, make Amazon Athena a great service within the AWS services portfolio.

Amazon Athena Use Cases 

As previously noted, Amazon Athena’s numerous beneficial features make its adoption easy. For the many organizations that have already adopted Athena, they have quickly started to see its value.

Following are a number of use cases for organizations:

  • Log analysis: Amazon Athena is already used by various organizations for log analysis work against real-time and archival data. A simple example is when you have an instance running in your environment from the year 2017, but you are not able to determine who launched the instance at that time. 

Athena can come to your rescue, and within minutes, via simple SQL queries against your AWS CloudTrail logs, you can determine who launched this instance along with other information. A similar logic can be applied against the records stored in your Amazon S3 bucket from various sources, such as AWS Application Load Balancer/Elastic Load Balancer, AWS Config logs, OS System Logs, Application logs, and logs from other sources.

  • Ad-hoc queries: Amazon Athena can be used to run ad-hoc queries against your data sets without the need to set up a complex query server infrastructure. These ad-hoc queries provide insights into your data and come up with a conclusion quickly. 

For example, a researcher has access to an extensive data set stored in Amazon S3 but needs to determine if the data is useful for him. He can write ad-hoc SQL queries and execute them against the data set in Amazon S3. The results he receives in just seconds or minutes can help determine the usefulness of the data, and all of this is possible without setting up a complex environment.

  • Data visualization: Amazon Athena can be easily integrated with Amazon QuickSight. This helps organizations build a visualization platform leveraging Amazon Quicksight, while Amazon Athena is used in the background to run queries against data stored in Amazon S3.

Amazon Athena Alternatives

Before the launch of Amazon Athena, AWS had multiple services in its data analytics portfolio to support analytics customers. Furthermore, other cloud service providers have similar offerings available as part of their own catalogs, as seen below.

AWS Alternatives

  • Amazon Redshift: This service allows you to run a fast, fully managed petabyte-scale data warehouse and supports data analysis by leveraging existing business intelligence tools. It is best to use Amazon Redshift for environments where data is aggregated from various sources, formatted, organized, and stored in a standard format for a long period. Complex, high-speed queries can be executed against stored data (massive databases) to produce outcomes by involving business intelligence tools.

On the other hand, Athena doesn’t care about the structure, format, and organization of the data. It is used to run queries directly against data stored in Amazon S3, which is cost-effective and delivers quick results.

  • Amazon Redshift Spectrum: While similar to Amazon Athena, this service operates at a different scale. Amazon Redshift Spectrum scales to exabytes of data and quickly fetches results by using complex queries against huge data sets. It eliminates the need to load the data into Amazon Redshift tables. It is also a serverless service with a pay-as-you-consume model. Like Amazon Redshift, it can integrate with various BI tools and allows you to use existing queries.

Google Cloud Platform Alternatives

  • BigQuery: Athena’s biggest rival in the cloud provider space is Google’s BigQuery. As with Athena, the service allows you to take data stored on Google Cloud Storage and query it directly in a serverless fashion using SQL queries. Some performance analyses suggest Google BigQuery is somewhat faster than Athena, especially when it comes to larger data sets.

There are also some differences in features. Google BigQuery offers streaming support and regex support for table schema definition. But unless you have very specific requirements, you are more likely to pick the cloud provider you use for the rest of your application workloads.

In-House Alternatives

  • Traditional DB: The last alternative to consider is a traditionally managed database to store and query data. This may work better for you if you have complex structured data to query. The query results can be retrieved faster, as you benefit from many decades of database software development and optimization.

It can also be significantly cheaper if you operate at scale, as the cost of database and storage management does not scale linearly, while cloud provider costs typically do. Similarly, if you already have the expertise in-house to manage these database systems, these resources can be reused for relatively little cost.

Amazon Athena Pricing

At $5.00 per TB of data scanned, Amazon Athena has a straightforward pricing model. Depending on the data parsed by your SQL queries, you will be charged automatically. For storage, Amazon S3 standard charges will apply.

Amazon Athena Walkthrough Guide

Let’s walk through a simple example of using Athena to run a query against data stored in S3 in this step-by-step guide. 

Step 1: Get Data to Query

First, you need some data to query. We will use a data set from Kaggle. Click “download” on this page to get a zip file (login required). Unzip the file, and open the googleplaystore.csv file in an editor of your choice. Remove the first line (which contains the unneeded column names of the table), and save the updated file.

Step 2: Create an S3 Bucket

Next, create an S3 bucket in AWS by clicking “Create bucket.” For this example, we will use the name aws-athena-walkthrough.

Amazon S3 bucket

Figure 1: Creating an S3 bucket

Click on the newly created bucket, and you will see a screen like this:

Empty S3 Bucket

Figure 2: View of Empty S3 Bucket

Step 3: Upload the File

Click on the “Upload” button, then “Add files,” and choose to upload the file from your computer.

Upload object to Amazon S3 Bucket

Figure 3: Upload object to Amazon S3 Bucket

Select the googleplaystore.csv file and upload it to the bucket, choosing the defaults for the remaining screens. 

Your bucket, when selected, should look like this:

S3 Bucket with the uploaded object

Figure 4: View of S3 Bucket with the uploaded object

Step 4: Create Athena Database and Table

Now that your data is in S3, you are ready to create your database and table.

Go to the Athena product on your AWS console, and you should see a page like this:

Amazon Athena on AWS Console

Figure 5: View of Amazon Athena Service on AWS Console

Click on “Create table.”

Create table on Athena Service

Figure 6: Create table on Athena Service

Choose “Create table from S3 bucket data,” then click on “Create a new database,” and input “googleplay” for your database, “googleplaystore” for your table, and “s3://aws-athena-walkthrough” for your bucket name (replacing “aws-athena-walkthrough” with your bucket name). Finally, click “Next.”

new database and table on Athena

Figure 7: Create a new database and table on Athena

Choose “CSV,” then click “Next” again:

Figure 8: Select Data Format as “CSV”

On the next page, click “Bulk add columns.”

Figure 9: Add Bulk Columns on Database Table

Input the following column definitions into the text field:

app string,
category string,
rating float,
reviews int,
size string,
installs string,
type string,
price string,
content_rating int,
genres string,
last_updated string,
current_ver string,
android_ver string

Submit them, and you will see a screen with the definitions. Scroll down and click “Next”:

Figure 10: View of Columns on Database Table

Click “Create Table”:

Figure 11: Create table on Athena

Now, you should be returned to the Athena product page.

Step 5: Run Query

On the main Athena product page, choose the “googleplay” database from the “Database” dropdown menu on the top left. Click “+” to add a “New Query” on the right. Type the following query to retrieve all “Art and Design” Google Play Store Apps that have a rating of greater than 4.0:

SELECT app
FROM googleplaystore
WHERE category = ‘ART_AND_DESIGN’
AND rating > 4.0;

Now click “Run query.” After a few seconds, you should see some results appear in the bottom right of the screen:

Figure 12: Run Query on Athena

These are some user-friendly results that you can examine, but the more formal output will have been deposited onto your S3 page.

Step 6: Examine Results in S3

If you return to your S3 page, you should see a bucket with a name similar to the one below among your existing buckets:

amazon athena tutorial

Figure 13: View results on S3 bucket for Athena Query Execution

Within this bucket will be several files with the results in a .csv file that you can download and examine at your leisure.

Remember that you will be charged for the storage of these results in theS3 bucket, so do not forget to delete it, especially if the results set is large!

Observability

One of the key items missing in the example above is observability. While using your application in a production environment, it is very difficult to gain end-to-end visibility of your application with performance insights. What you need is a dashboard that connects various components in your environment and has the ability to build a topology view and share performance insights of the application.

Epsagon comes to the rescue here, allowing you to build an architecture view and fetch traces for your environment to highlight performance and cost insights, giving you end-to-end visibility of your environment. It provides you with the ability to view the metrics for your environment, trigger alerts, and fix issues within seconds.

Service Map

Figure 14: View of Espagon Service Map

Conclusion

In this Amazon Athena guide, we explored present use cases and alternatives and walked you through an example of running a query using Athena. Despite the existence of other AWS services, such as Redshift, EMR, and Redshift Spectrum, with so many great features to offer, Amazon Athena is difficult to overlook if it fits the use cases for your given environment. Athena’s easy-to-use SQL queries, serverless capabilities, and cost-effectiveness make it the ideal service for querying data stored in Amazon S3 through standard SQL.

Read More:

The Hitchhiker’s Guide to Serverless

AWS Tools Series: Amazon EventBridge Technical Tutorial

Meet the Family: The “Other” AWS Serverless Services