Lab: Querying Data in Amazon S3 with Amazon Athena

Introduction

Welcome to this hands-on AWS lab, where we'll query data stored in Amazon S3 with SQL queries in Amazon Athena. Let's get started!

Solution

Log in to the AWS Management Console using the credentials provided for the lab. Make sure you're in the us-east-1 (N. Virginia) Region.

Create a Table from S3 Bucket Metadata

Review the S3 Bucket Resources

Navigate to S3 using the Services menu or the unified search bar.
You should see the S3 bucket that was provided for the lab. This bucket has been partitioned by year, month, and day.
Select the bucket name.
Open the year=2018/ folder, then open the month=11/ folder.
Open one of the day=X/ folders.
You should see a .parquet data file representing your column-oriented data structure.

Configure Settings to Send Query Results to the S3 Bucket

Navigate to Amazon Athena using the Services menu or the unified search bar.
Click Launch query editor.
From the menu on the left of the screen, select Workgroups.
Select the primary workgroup.
Click on Edit.
Scroll down and expand Query result configuration (optional).
Click on Browse S3.
Check the radio button next to your S3 bucket name.
Click on Choose.
Scroll down to the bottom of the page and click on the Save changes button.
From the menu on the left, select Query editor.
Select the Settings tab, then click Manage.
On the right, click Browse S3.
You should see the S3 bucket that was provided for the lab.
Select the S3 bucket's radio button, then click Choose.
The S3 bucket is auto-populated in the Location of query result setting.
Enable the Assign bucket owner full control over query results option and click Save.
Select the Editor tab.
In the sidebar menu, select Data sources.
You should see the AWS Glue Data Catalog provided for the lab.
Click Create data source.
Ensure S3 - AWS Glue Data Catalog is selected as the data source, then click Next.
Fill in the AWS Glue Data Catalog section:
- Choose an AWS Glue Data Catalog: AWS Glue Data Catalog in this account
- Choose a way to create a table: Create a table manually
Click Create table.

Create a Table from the S3 Bucket Data

In the Table name field, enter cf_access_optimized.
Fill in the Database configuration section:
- Choose an existing database or create a new database: Create a database
- Database name: aws_service_logs
Fill in the Dataset section:
- Click Browse S3.
- Select the S3 bucket's radio button, then click Choose.
In the Data format section, under Table type make sure Apache Hive is selected.
For File format click the dropdown and select Apache Parquet.

Fill in the Column details section:

Click Bulk add columns.

In the Bulk add columns pop-up, add the following data:

time timestamp, location string, bytes bigint, requestip string, method string, host string, uri string, status int, referrer string, useragent string, querystring string, cookie string, resulttype string, requestid string, hostheader string, requestprotocol string, requestbytes bigint, timetaken double, xforwardedfor string, sslprotocol string, sslcipher string, responseresulttype string, httpversion string

Click Add.
The column names and types are auto-populated.

Fill in the Partition details section:
- Click Add column twice so you can create three partitions.
- Create a year partition:
  - Column name: year
  - Column type: string
- Create a month partition:
  - Column name: month
  - Column type: string
- Create a day partition:
  - Column name: day
  - Column type: string
In the Preview table query section, review your query.
Click Create table.

Add Partition Metadata

From the Athena query editor, click the plus icon on the right to open a new query tab.
In the editor, load the partition data with the following query:
```
MSCK REPAIR TABLE aws_service_logs.cf_access_optimized
```
Click Run.
Click the plus icon on the right to open another new query tab.
In the editor, verify the number of rows created from your data with the following query:
```
SELECT count(*) AS rowcount FROM aws_service_logs.cf_access_optimized
```
Click Run.
Note that the rowcount equals 207535, which is the expected number of rows in your table.
Click the plus icon on the right to open another new query tab.
In the editor, verify the partitions were created with the following query:
```
SELECT * FROM aws_service_logs.cf_access_optimized order by time desc LIMIT 10
```
Click Run.

Query the Total Bytes Served in a Date Range

From the Athena query editor, click the plus icon on the right to open a new query tab.

In the editor, view the data timestamped between 11-02-2018 and 11-03-2018 with the following query:

SELECT * FROM aws_service_logs.cf_access_optimized WHERE time BETWEEN TIMESTAMP '2018-11-02' AND TIMESTAMP '2018-11-03'

Click Run.
Note the bytes column. This information is useful, but doesn't give you the total bytes served.
Above the query results, click Clear to clear the current query.

In the editor, view the total bytes served with the following query:

SELECT SUM(bytes) AS total_bytes
FROM aws_service_logs.cf_access_optimized
WHERE time BETWEEN TIMESTAMP '2018-11-02' AND TIMESTAMP '2018-11-03'

Click Run.
Note that the total_bytes equals 87310409. You can use this information to figure out which parts of your website can be optimized to lower your CloudFront bill.

PreviousAmazon Glacier: AWS Archive Storage Service NextLAB: Loading Data into a Redshift Cluster

Last updated 7 months ago

Was this helpful?