LAB: Loading Data into a Redshift Cluster

Introduction

In this lab, you work as a Database Administrator managing your company's Redshift cluster. The Development team has requested a way to import data into the Redshift cluster from either an S3 bucket or a DynamoDB table, and they have provided a small sample of the data to be imported.

Solution

Log in to the live AWS environment using the credentials provided. Use an incognito or private browser window to ensure you're using the lab account rather than your own.

Make sure you're in the N. Virginia (us-east-1) region throughout the lab.

Open a terminal session and log in to the provided EC2 instance via SSH using the credentials listed on the lab page:

ssh cloud_user@<PUBLIC_IP>

Configure AWS CLI

  1. In the AWS console, click the cloud_user username in the upper right corner.

  2. Select My Security Credentials.

  3. Under Access keys for CLI, SDK, & API access, click Create access key.

  4. Save the Access Key ID and Secret Access Key to a scratch pad, or download the CSV file.

  5. In the terminal, configure the AWS CLI client:

    aws configure
  6. Paste the Access Key into the first prompt.

  7. In the second prompt, paste the Secret Access Key.

  8. For Default region name, enter us-east-1.

  9. For Default output format, press Enter to leave it as None. (A quick way to verify the configuration follows this list.)
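
To confirm the credentials work, you can run a standard identity check:

    aws sts get-caller-identity

The output should show the lab account ID and an ARN for the cloud_user identity.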

Prepare the Source Data

  1. Download CSV-formatted data:

    curl -O -L https://raw.githubusercontent.com/linuxacademy/content-aws-database-specialty/master/S06_Additional%20Database%20Services/redshift-data.csv
  2. Download JSON-formatted data:

    curl -O -L https://raw.githubusercontent.com/linuxacademy/content-aws-database-specialty/master/S06_Additional%20Database%20Services/redshift-data.json
  3. Create an S3 bucket with a globally unique name (e.g., append today's date and the current time):

    aws s3 mb s3://redshift-import-<DATE_AND_TIME>
  4. Load the data into the bucket, replacing <BUCKET_NAME> with your bucket name:

    aws s3api put-object --bucket <BUCKET_NAME> --key redshift-data.csv --body redshift-data.csv
  5. Confirm the redshift-data.csv file is listed:

    aws s3 ls s3://<BUCKET_NAME>
  6. Create the DynamoDB table:

    aws dynamodb create-table --table-name redshift-import --attribute-definitions AttributeName=ID,AttributeType=N --key-schema AttributeName=ID,KeyType=HASH --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
  7. Confirm the DynamoDB table was created:

    aws dynamodb list-tables
  8. Import the JSON data into the DynamoDB table (NOTE: "UnprocessedItems": {} in the output indicates success; a sample of the expected file format appears after this list):

    aws dynamodb batch-write-item --request-items file://redshift-data.json
  9. Confirm the data was imported:

    aws dynamodb scan --table-name redshift-import
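
For reference, redshift-data.json should follow the standard batch-write-item request format: a map keyed by table name, with each attribute value wrapped in a DynamoDB type descriptor ("N" for number, "S" for string). A minimal sketch of that shape, using illustrative values rather than the actual lab data:

    {
        "redshift-import": [
            {
                "PutRequest": {
                    "Item": {
                        "ID": { "N": "1" },
                        "Name": { "S": "John Doe" },
                        "Department": { "S": "Marketing" },
                        "ExpenseCode": { "N": "1001" }
                    }
                }
            }
        ]
    }

The attribute names here mirror the users_import table created later in the lab.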

Create IAM Role

  1. In the AWS Management Console, navigate to IAM.

  2. In the side menu, click Roles.

  3. Click Create role.

  4. Scroll down and select EC2 as the service.

  5. Click the first EC2 use case.

  6. Click Next: Permissions.

  7. On the Attach permissions policies page, in the Filter policies box, search for and select each of the following managed policies:

    • AmazonS3ReadOnlyAccess

    • AmazonDynamoDBReadOnlyAccess

  8. Click Next.

  9. For Role name, enter "redshift-import".

  10. Click Create role.

  11. On the IAM > Roles page, select the newly created redshift-import role.

  12. On the redshift-import Summary page, click the Trust relationships tab.

  13. Click Edit trust relationship.

  14. Change the "Service" line to the following, so the role can be assumed by Redshift rather than EC2 (the full trust policy is shown after this list):

    "Service": "redshift.amazonaws.com"
  15. Click Update Trust Policy.
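
After the edit, the trust policy should be the standard assume-role document with Redshift as the trusted service:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "redshift.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

This is what allows the Redshift cluster to assume the role when it runs COPY commands against S3 and DynamoDB.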

Load the Data into the Redshift Cluster

  1. Navigate to Redshift and access the dashboard.

  2. Select the listed cluster.

  3. Click Actions > Manage IAM roles.

  4. In Available IAM roles, open the dropdown menu and select redshift-import.

  5. Click Associate IAM role.

  6. Click Save changes.

  7. In the terminal, list the Redshift clusters:

    aws redshift describe-clusters | head -25
  8. In the "Endpoint" section, copy the listed Address within the quotation marks.

  9. Set the PGHOST environment variable to the endpoint, replacing <Redshift_Endpoint> below (an optional query-based shortcut for this and the role ARN lookup appears after this list):

    export PGHOST=<Redshift_Endpoint>
  10. List the IAM roles:

    aws iam list-roles
  11. Copy the "Arn" value for the redshift-import role and paste it into a text file.

  12. List the S3 buckets:

    aws s3 ls
  13. Copy the bucket name and paste it into a text file, if you haven't already.

  14. Verify PGHOST is set:

    echo $PGHOST
  15. Connect to the import-test database on the cluster (psql reads the host from the PGHOST variable you just set):

    psql -U masteruser -p 5439 import-test
  16. At the prompt, enter the password:

    MasterPasswd2020!
  17. Create the table:

    create table users_import (ID int, Name varchar, Department varchar, ExpenseCode int);
  18. Back in the console, verify the redshift-import IAM role is associated with the cluster (Amazon Redshift > your cluster > Actions > Manage IAM roles). If it isn't, select it, click Associate IAM role, and click Save changes as in steps 3-6.

  19. Import data from the S3 bucket, replacing <BUCKET_NAME> with your bucket name and <IAM_ROLE_ARN> with the ARN you copied:

    copy users_import from 's3://<BUCKET_NAME>/redshift-data.csv' iam_role '<IAM_ROLE_ARN>' delimiter ',';
  20. Query the table to verify rows were inserted:

    select * from users_import;
  21. Clear the users_import table:

    truncate users_import;
  22. Confirm the table is now empty:

    select * from users_import;
  23. Import the data from DynamoDB, replacing <IAM_ROLE_ARN> with the ARN you copied:

    copy users_import from 'dynamodb://redshift-import' iam_role '<IAM_ROLE_ARN>' readratio 50;
  24. Query the table to ensure rows were inserted:

    select * from users_import;
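
As an optional shortcut, the endpoint and role ARN lookups in steps 7 through 13 can be scripted with JMESPath queries instead of copying values by hand. A sketch, assuming the lab account contains a single Redshift cluster:

    export PGHOST=$(aws redshift describe-clusters --query 'Clusters[0].Endpoint.Address' --output text)
    export ROLE_ARN=$(aws iam get-role --role-name redshift-import --query 'Role.Arn' --output text)

Echo $ROLE_ARN and paste the value in place of <IAM_ROLE_ARN> in the COPY commands above, since shell variables are not expanded inside a psql session.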
