Lab: Querying Data in Amazon S3 with Amazon Athena
Introduction
Welcome to this hands-on AWS lab, where we'll query data stored in Amazon S3 with SQL queries in Amazon Athena. Let's get started!
Solution
Log in to the AWS Management Console using the credentials provided for the lab. Make sure you're in the us-east-1
(N. Virginia) Region.
Create a Table from S3 Bucket Metadata
Review the S3 Bucket Resources
Navigate to S3 using the Services menu or the unified search bar.
You should see the S3 bucket that was provided for the lab. This bucket has been partitioned by year, month, and day.
Select the bucket name.
Open the
year=2018/
folder, then open themonth=11/
folder.Open one of the
day=X/
folders.You should see a
.parquet
data file representing your column-oriented data structure.
Configure Settings to Send Query Results to the S3 Bucket
Navigate to Amazon Athena using the Services menu or the unified search bar.
Click Launch query editor.
From the menu on the left of the screen, select Workgroups.
Select the primary workgroup.
Click on Edit.
Scroll down and expand Query result configuration (optional).
Click on Browse S3.
Check the radio button next to your S3 bucket name.
Click on Choose.
Scroll down to the bottom of the page and click on the Save changes button.
From the menu on the left, select Query editor.
Select the Settings tab, then click Manage.
On the right, click Browse S3.
You should see the S3 bucket that was provided for the lab.
Select the S3 bucket's radio button, then click Choose.
The S3 bucket is auto-populated in the Location of query result setting.
Enable the Assign bucket owner full control over query results option and click Save.
Select the Editor tab.
In the sidebar menu, select Data sources.
You should see the AWS Glue Data Catalog provided for the lab.
Click Create data source.
Ensure S3 - AWS Glue Data Catalog is selected as the data source, then click Next.
Fill in the AWS Glue Data Catalog section:
Choose an AWS Glue Data Catalog: AWS Glue Data Catalog in this account
Choose a way to create a table: Create a table manually
Click Create table.
Create a Table from the S3 Bucket Data
In the Table name field, enter cf_access_optimized.
Fill in the Database configuration section:
Choose an existing database or create a new database: Create a database
Database name: aws_service_logs
Fill in the Dataset section:
Click Browse S3.
Select the S3 bucket's radio button, then click Choose.
In the Data format section, under Table type make sure Apache Hive is selected.
For File format click the dropdown and select Apache Parquet.
Fill in the Column details section:
Click Bulk add columns.
In the Bulk add columns pop-up, add the following data:
Click Add.
The column names and types are auto-populated.
Fill in the Partition details section:
Click Add column twice so you can create three partitions.
Create a year partition:
Column name: year
Column type: string
Create a month partition:
Column name: month
Column type: string
Create a day partition:
Column name: day
Column type: string
In the Preview table query section, review your query.
Click Create table.
Add Partition Metadata
From the Athena query editor, click the plus icon on the right to open a new query tab.
In the editor, load the partition data with the following query:
Click Run.
Click the plus icon on the right to open another new query tab.
In the editor, verify the number of rows created from your data with the following query:
Click Run.
Note that the
rowcount
equals207535
, which is the expected number of rows in your table.Click the plus icon on the right to open another new query tab.
In the editor, verify the partitions were created with the following query:
Click Run.
Query the Total Bytes Served in a Date Range
From the Athena query editor, click the plus icon on the right to open a new query tab.
In the editor, view the data timestamped between 11-02-2018 and 11-03-2018 with the following query:
Click Run.
Note the
bytes
column. This information is useful, but doesn't give you the total bytes served.Above the query results, click Clear to clear the current query.
In the editor, view the total bytes served with the following query:
Click Run.
Note that the
total_bytes
equals87310409
. You can use this information to figure out which parts of your website can be optimized to lower your CloudFront bill.
Last updated
Was this helpful?