Launched at AWS re:Invent 2016 by Andy Jassy, Amazon Athena is an interactive tool for querying data stored in Amazon S3. It leverages standard SQL queries to process data quickly and is cost-effective, as you are charged only for the queries run.
Athena’s serverless capabilities also allow you to scale automatically by executing multiple–complex–parallel queries against large data sets.
This blog post will explore the internals of Amazon Athena, present use cases and alternatives, and walk you through an example of running a query using Athena in a step-by-step guide.
Why Amazon Athena?
Amazon Athena offers numerous benefits:
- Serverless service: Managing infrastructure to handle large data sets can be complex. With no infrastructure setup or management requirements, Athena is ideal for organizations.
- Pay as you go: One of Athena’s key selling points is that you pay only for the queries you run and are charged for the data scanned by each query alone. This truly makes Athena the most cost-effective service available.
- Scalability: Amazon Athena is powered by AWS core services, which allow you to scale quickly by executing parallel queries against large data sets.
- Standard SQL queries: The ability to query data with Standard SQL queries makes Amazon Athena easy to use. Anyone with basic SQL knowledge can query large data sets as well as leverage join arrays and window function capabilities of SQL. The queries can be saved and used again as required.
- Data processing: Amazon Athena uses PrestoDB, which can process unstructured, semi-structured, and structured data sets. The data can be in CSV, JSON, logs, or columnar data formats, such as Apache Parquet or ORC.
These, along with additional features including security, availability, and integration with AWS services, make Amazon Athena a great service within the AWS services portfolio.
Amazon Athena Use Cases
As previously noted, Amazon Athena’s numerous beneficial features make its adoption easy. For the many organizations that have already adopted Athena, they have quickly started to see its value.
Following are a number of use cases for organizations:
- Log analysis: Amazon Athena is already used by various organizations for log analysis work against real-time and archival data. A simple example is when you have an instance running in your environment from the year 2017, but you are not able to determine who launched the instance at that time.
Athena can come to your rescue, and within minutes, via simple SQL queries against your AWS CloudTrail logs, you can determine who launched this instance along with other information. A similar logic can be applied against the records stored in your Amazon S3 bucket from various sources, such as AWS Application Load Balancer/Elastic Load Balancer, AWS Config logs, OS System Logs, Application logs, and logs from other sources.
- Ad-hoc queries: Amazon Athena can be used to run ad-hoc queries against your data sets without the need to set up a complex query server infrastructure. These ad-hoc queries provide insights into your data and come up with a conclusion quickly.
For example, a researcher has access to an extensive data set stored in Amazon S3 but needs to determine if the data is useful for him. He can write ad-hoc SQL queries and execute them against the data set in Amazon S3. The results he receives in just seconds or minutes can help determine the usefulness of the data, and all of this is possible without setting up a complex environment.
- Data visualization: Amazon Athena can be easily integrated with Amazon QuickSight. This helps organizations build a visualization platform leveraging Amazon Quicksight, while Amazon Athena is used in the background to run queries against data stored in Amazon S3.
Amazon Athena Alternatives
Before the launch of Amazon Athena, AWS had multiple services in its data analytics portfolio to support analytics customers. Furthermore, other cloud service providers have similar offerings available as part of their own catalogs, as seen below.
- Amazon Redshift: This service allows you to run a fast, fully managed petabyte-scale data warehouse and supports data analysis by leveraging existing business intelligence tools. It is best to use Amazon Redshift for environments where data is aggregated from various sources, formatted, organized, and stored in a standard format for a long period. Complex, high-speed queries can be executed against stored data (massive databases) to produce outcomes by involving business intelligence tools.
On the other hand, Athena doesn’t care about the structure, format, and organization of the data. It is used to run queries directly against data stored in Amazon S3, which is cost-effective and delivers quick results.
- Amazon Redshift Spectrum: While similar to Amazon Athena, this service operates at a different scale. Amazon Redshift Spectrum scales to exabytes of data and quickly fetches results by using complex queries against huge data sets. It eliminates the need to load the data into Amazon Redshift tables. It is also a serverless service with a pay-as-you-consume model. Like Amazon Redshift, it can integrate with various BI tools and allows you to use existing queries.
Google Cloud Platform Alternatives
- BigQuery: Athena’s biggest rival in the cloud provider space is Google’s BigQuery. As with Athena, the service allows you to take data stored on Google Cloud Storage and query it directly in a serverless fashion using SQL queries. Some performance analyses suggest Google BigQuery is somewhat faster than Athena, especially when it comes to larger data sets.
There are also some differences in features. Google BigQuery offers streaming support and regex support for table schema definition. But unless you have very specific requirements, you are more likely to pick the cloud provider you use for the rest of your application workloads.
- Traditional DB: The last alternative to consider is a traditionally managed database to store and query data. This may work better for you if you have complex structured data to query. The query results can be retrieved faster, as you benefit from many decades of database software development and optimization.
It can also be significantly cheaper if you operate at scale, as the cost of database and storage management does not scale linearly, while cloud provider costs typically do. Similarly, if you already have the expertise in-house to manage these database systems, these resources can be reused for relatively little cost.
Amazon Athena Pricing
At $5.00 per TB of data scanned, Amazon Athena has a straightforward pricing model. Depending on the data parsed by your SQL queries, you will be charged automatically. For storage, Amazon S3 standard charges will apply.
Amazon Athena Walkthrough Guide
Let’s walk through a simple example of using Athena to run a query against data stored in S3 in this step-by-step guide.
Step 1: Get Data to Query
First, you need some data to query. We will use a data set from Kaggle. Click “download” on this page to get a zip file (login required). Unzip the file, and open the googleplaystore.csv file in an editor of your choice. Remove the first line (which contains the unneeded column names of the table), and save the updated file.
Step 2: Create an S3 Bucket
Next, create an S3 bucket in AWS by clicking “Create bucket.” For this example, we will use the name aws-athena-walkthrough.
Click on the newly created bucket, and you will see a screen like this:
Step 3: Upload the File
Click on the “Upload” button, then “Add files,” and choose to upload the file from your computer.
Select the googleplaystore.csv file and upload it to the bucket, choosing the defaults for the remaining screens.
Your bucket, when selected, should look like this:
Step 4: Create Athena Database and Table
Now that your data is in S3, you are ready to create your database and table.
Go to the Athena product on your AWS console, and you should see a page like this:
Click on “Create table.”
Choose “Create table from S3 bucket data,” then click on “Create a new database,” and input “googleplay” for your database, “googleplaystore” for your table, and “s3://aws-athena-walkthrough” for your bucket name (replacing “aws-athena-walkthrough” with your bucket name). Finally, click “Next.”
Choose “CSV,” then click “Next” again:
On the next page, click “Bulk add columns.”
Input the following column definitions into the text field:
app string, category string, rating float, reviews int, size string, installs string, type string, price string, content_rating int, genres string, last_updated string, current_ver string, android_ver string
Submit them, and you will see a screen with the definitions. Scroll down and click “Next”:
Click “Create Table”:
Now, you should be returned to the Athena product page.
Step 5: Run Query
On the main Athena product page, choose the “googleplay” database from the “Database” dropdown menu on the top left. Click “+” to add a “New Query” on the right. Type the following query to retrieve all “Art and Design” Google Play Store Apps that have a rating of greater than 4.0:
SELECT app FROM googleplaystore WHERE category = ‘ART_AND_DESIGN’ AND rating > 4.0;
Now click “Run query.” After a few seconds, you should see some results appear in the bottom right of the screen:
These are some user-friendly results that you can examine, but the more formal output will have been deposited onto your S3 page.
Step 6: Examine Results in S3
If you return to your S3 page, you should see a bucket with a name similar to the one below among your existing buckets:
Within this bucket will be several files with the results in a .csv file that you can download and examine at your leisure.
Remember that you will be charged for the storage of these results in theS3 bucket, so do not forget to delete it, especially if the results set is large!
One of the key items missing in the example above is observability. While using your application in a production environment, it is very difficult to gain end-to-end visibility of your application with performance insights. What you need is a dashboard that connects various components in your environment and has the ability to build a topology view and share performance insights of the application.
Epsagon comes to the rescue here, allowing you to build an architecture view and fetch traces for your environment to highlight performance and cost insights, giving you end-to-end visibility of your environment. It provides you with the ability to view the metrics for your environment, trigger alerts, and fix issues within seconds.
In this Amazon Athena guide, we explored present use cases and alternatives and walked you through an example of running a query using Athena. Despite the existence of other AWS services, such as Redshift, EMR, and Redshift Spectrum, with so many great features to offer, Amazon Athena is difficult to overlook if it fits the use cases for your given environment. Athena’s easy-to-use SQL queries, serverless capabilities, and cost-effectiveness make it the ideal service for querying data stored in Amazon S3 through standard SQL.