In this article, I am going to discuss Amazon Athena and how we can analyze data stored in S3 using Athena. As you might know, Amazon’s AWS has a lot of services in the field compute, databases, analytics, machine learning, and robotics, one of the most important and popular services is Amazon Athena. By the official definition, “Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.”
In this article, we are going to store a file in a bucket in Amazon S3 and then query the data using Amazon Athena using the standard SQL syntax. In order to query data using the standard SQL syntax, we need to have some database where the queries can be executed against. Fortunately, we do not have to use any other services since, Athena offers this feature under the hood to support datastores like CSV, JSON, and some other columnar data formats such as Avro and Parquet. Due to this, it is much easier to start analyzing your S3 data much easily with Athena.
Moreover, you can query the data using the Athena console, or by using any programming language of your choice like Python. This provides us great flexibility to not only query data manually, but also define programmatic approaches to query the data stored in S3 buckets.
Storing data in S3 buckets
Now that we have some idea of what Amazon Athena is all about, we can start by uploading a plain CSV file to Amazon S3. For the purpose of this tutorial, I am going to create a new dataset that can be downloaded from here. Download the file from GitHub and upload it to your S3 bucket. I am going to create a new bucket with the name “ad-athena-datasets”, create a new folder with the name “data” inside the bucket, and upload the file inside the folder.
In addition to the above, we would also need a location within our S3 bucket to store the query results. For that, I am going to create a new folder with the name “query-results”.
Setting up the Athena environment
Now that our dataset is ready in S3, let us go ahead and set up the Athena environment to query the data store in the S3 bucket. For that, head over to https://console.aws.amazon.com and search for Athena under Services. Navigate to the Athena console and click on Create Table to create a new table that can represent the data stored in the S3 bucket. Since our data resides within S3, select the option “from S3 bucket data” in the dropdown menu.
On the next page, you need to provide some relevant information regarding Athena and your S3 data. You need to specify the name of the database in the first box. If you do not have already created any database earlier, you can create a new one by providing a valid database name here. If you have already created a database before, you can simply use that database for the purpose of this exercise.
The next is to give a suitable name to the table that you are going to query using Athena. Usually, I like to keep the name of the table similar to that of the name of the file in the S3 bucket. This makes me easier to understand which file is this referring to. Since the name of the file in the S3 in “superstore.csv“, I am going to name the table as “superstore“.
Finally, we need to provide the location of the data file that we are going to query using Athena. This is the S3 URI of the folder under which your files are placed. An important point to note here is that we are not supposed to provide the ARN of the file. Instead, we should provide the location of the next higher level directory in which the file resides. In our example, the file resides in the bucket “ad-athena-datasets“, under the directory “data“. So the location that is to be provided here is “s3://ad-athena-datasets/data/“. Click on Next once done.
On the next page, select the type of file that we are going to work with. Since it is a CSV file, simply select CSV from the options and click on Next.
On the next page, we need to define the name of the columns in the dataset and the datatypes associated with those. Since our dataset is fairly simple, by specifying the name and the datatypes manually. Specifying the appropriate datatypes in this part is extremely important otherwise there might be some errors while executing the queries against the dataset.
There is also an option to bulk add all the columns in case you have a relatively wider dataset that has many columns in it. In that case, you can click on Add Bulk and then provide the name of the column and the datatype in a key-value pair as follows.
On the next page, we need to define partitions. For the sake of simplicity, we are not using any sort of partitions on our dataset since it is quite small. However, when you have humongous datasets, you can partition those by days, months, years and so on. Once you are done with these, click on Create Table. You will notice that a new query window appears which generates a script to create a new external table in Amazon Athena. Click on Run Query to execute the script and create the table.
You can see that the query has run successfully and a new table with the name “superstore” has been created in the Athena console.
Querying the external table in Amazon Athena
Now that we have created our external table in Athena, we can query the table by using standard SQL scripts. Navigate to the query window and write a simple SQL script to fetch all the records from the superstore table. You can see the results in the query window below.
As you can see in the figure above, we have successfully queried the external table in Athena and the results were returned from the CSV file stored in S3. This is a very simple example and does not involve any complexities. However, in a proper business, it is possible that there will be a lot more files in the same folder and with different data qualities. Amazon Athena provides a simple interface to query all such data in S3 by using a native SQL syntax which is very simple and useful.
In this article, we have seen how to use Amazon Athena to run interactive queries to analyze data stored in flat files in Amazon S3. Athena is one of the most popular used tools in the field of data science and analytics as it is very simple and easy to query data stored in those flat files. You can consider Athena loosely based on the open-source software Apache Hive which used similar technology to query files stored in distributed systems using SQL queries.