AWS Data Preparation Services


Data Ecosystem Components
Data Sources
S3
Data lakes
Redshift
Operational databases (cloud/on-premises)
Analysis Tools
QuickSight (Business Intelligence)
SageMaker (Machine Learning)
ETL (Extract, Transform, Load) Process
Extract
Select relevant subset of data
Focus on data needed for specific insights
Transform
Clean data
Remove duplicates
Correct formats
Combine data sources
Sort data
Load
Transfer to target analysis platform
Prepare for consumption by applications
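Example: the three ETL stages above can be sketched in plain Python. This is a minimal illustration using only the standard library, not an AWS Glue job. The sample CSV, the column names, and the in-memory SQLite target are all assumptions made up for the example, and the "combine data sources" step is omitted for brevity.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: it contains a duplicate row, stray whitespace,
# inconsistent capitalisation, and a column the analysis does not need.
RAW_CSV = """order_id,customer,amount,internal_note
3, carol ,19.50,n/a
1,alice,10.00,n/a
1,alice,10.00,n/a
2,BOB,7.25,n/a
"""


def extract(raw_text):
    """Extract: select only the columns needed for the analysis."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [
        {"order_id": r["order_id"], "customer": r["customer"], "amount": r["amount"]}
        for r in reader
    ]


def transform(rows):
    """Transform: correct formats, remove duplicates, and sort by order_id."""
    cleaned = {}
    for r in rows:
        order_id = int(r["order_id"])
        # Keying on order_id drops duplicates (the last occurrence wins).
        cleaned[order_id] = (
            order_id,
            r["customer"].strip().title(),
            float(r["amount"]),
        )
    return [cleaned[key] for key in sorted(cleaned)]


def load(rows, conn):
    """Load: write the cleaned rows to the target used for analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the analysis platform
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

In a real pipeline the extract step would read from a source such as S3 and the load step would write to a target such as Redshift. The shape of the three functions stays the same.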
AWS Glue
Core Features
Serverless ETL service
Automated scaling for processing and storage
Data quality monitoring and rules
Scheduled or event-driven jobs

Components
Glue Crawlers
Discover data across multiple sources
Collect metadata
Support for S3, Redshift, RDS, and databases running on EC2
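Example: a crawler is defined by a name, an IAM role, a target Data Catalog database, and one or more data locations. The sketch below only assembles the request for boto3's `create_crawler` call. The helper `build_crawler_request` and every name, ARN, and bucket path in it are hypothetical placeholders. The actual AWS calls are left as comments because they need boto3 and configured credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the Glue create_crawler API call.

    This helper is an illustrative assumption, not part of boto3.
    """
    request = {
        "Name": name,
        "Role": role_arn,                # IAM role the crawler assumes
        "DatabaseName": database,        # Data Catalog database for the metadata
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue schedules use cron syntax, e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily.
        request["Schedule"] = schedule
    return request


request = build_crawler_request(
    name="sales-crawler",                                        # placeholder
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",   # placeholder
    database="sales_catalog",                                    # placeholder
    s3_path="s3://example-bucket/sales/",                        # placeholder
    schedule="cron(0 2 * * ? *)",
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**request)
# glue.start_crawler(Name=request["Name"])
```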
Glue Data Catalog
Stores metadata about data locations
Integration with:
EMR
Redshift
Athena
Glue ETL Jobs
Transform data using metadata from the Data Catalog
Load to various destinations:
Lake Formation
Redshift
S3
CloudWatch (receives job logs and metrics rather than transformed data)
Glue Data Quality
Automated quality recommendations
Predefined rule sets
Quality metrics and alerts
Threshold monitoring
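Example: the idea behind quality metrics and threshold monitoring can be shown in plain Python. This is a conceptual sketch only. Glue Data Quality expresses rules in its own rule language (DQDL), and the functions, sample rows, and thresholds below are assumptions chosen for illustration.

```python
def completeness(rows, column):
    """Fraction of rows whose value in `column` is not missing."""
    if not rows:
        return 0.0
    present = sum(1 for r in rows if r.get(column) not in (None, ""))
    return present / len(rows)


def uniqueness(rows, column):
    """Fraction of non-missing values in `column` that are distinct."""
    values = [r[column] for r in rows if r.get(column) not in (None, "")]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def evaluate(rows, rules):
    """Score each rule and compare the score against its threshold."""
    results = {}
    for name, (metric, column, threshold) in rules.items():
        score = metric(rows, column)
        results[name] = (score, score >= threshold)
    return results


# Hypothetical sample data: one missing email and one repeated order_id.
rows = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": ""},
    {"order_id": 3, "email": "c@example.com"},
    {"order_id": 3, "email": "d@example.com"},
]

# Each rule is (metric function, column, minimum acceptable score).
rules = {
    "order_id_complete": (completeness, "order_id", 1.0),
    "order_id_unique": (uniqueness, "order_id", 1.0),
    "email_complete": (completeness, "email", 0.9),
}

results = evaluate(rows, rules)
# Rules that fall below their threshold would trigger an alert.
alerts = sorted(name for name, (_, passed) in results.items() if not passed)
print(alerts)
```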

Amazon Athena
Key Characteristics
Serverless query service
Petabyte-scale analysis capability
Query data where it resides
SQL query support
Apache Spark integration

Features
Direct S3 querying
Standard SQL syntax
Federated queries across 25+ data sources
Integration with:
QuickSight
SageMaker
Lake Formation
Operational databases
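Example: an Athena query is standard SQL plus two pieces of context, the Data Catalog database to query and an S3 location for the results. The sketch below assembles the request for boto3's `start_query_execution` call. The helper `build_athena_query`, the table and database names, and the bucket path are hypothetical. The AWS calls themselves are shown as comments because they need boto3 and configured credentials.

```python
def build_athena_query(sql, database, output_location):
    """Assemble keyword arguments for the Athena start_query_execution API call.

    This helper is an illustrative assumption, not part of boto3.
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


query = build_athena_query(
    sql=(
        "SELECT customer, SUM(amount) AS total "
        "FROM orders GROUP BY customer ORDER BY total DESC LIMIT 10"
    ),
    database="sales_catalog",                               # placeholder
    output_location="s3://example-bucket/athena-results/",  # placeholder
)
print(query["QueryString"])

# With boto3 installed and credentials configured, the query could be run as:
# import boto3
# athena = boto3.client("athena")
# execution_id = athena.start_query_execution(**query)["QueryExecutionId"]
# status = athena.get_query_execution(QueryExecutionId=execution_id)
# print(status["QueryExecution"]["Status"]["State"])
```

The data itself stays in S3. Only the query text and the result location are sent to the service.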
Key Exam Points
Most data ecosystems still require ETL, though AWS is working toward zero-ETL integrations that reduce the need for separate pipelines
Glue provides end-to-end ETL capabilities in a serverless environment
Athena specializes in serverless, petabyte-scale SQL querying of data where it resides, primarily objects in S3
Both services integrate with broader AWS analytics ecosystem