AWS Data Preparation Services

Data Sources
- S3
- Data lakes
- Redshift
- Operational databases (cloud/on-premises)
Analysis Tools
- QuickSight (Business Intelligence)
- SageMaker (Machine Learning)

Extract
- Select relevant subset of data
- Focus on data needed for specific insights
Transform
- Clean data
- Remove duplicates
- Correct formatting
- Combine data sources
- Sort data
Load
- Transfer to target analysis platform
- Prepare for consumption by applications
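
The three stages above can be sketched in plain Python. This is a minimal illustration using only the standard library, not how Glue implements ETL; the sample rows, column names, and helper functions are all hypothetical. Combining multiple sources is left out for brevity.

```python
import csv
import io

# Hypothetical raw rows, standing in for data pulled from S3 or a database.
RAW_ROWS = [
    {"order_id": "1003", "region": "EU", "amount": " 42.50 ", "note": "gift"},
    {"order_id": "1001", "region": "US", "amount": "19.99", "note": ""},
    {"order_id": "1001", "region": "US", "amount": "19.99", "note": ""},  # duplicate
    {"order_id": "1002", "region": "eu", "amount": "n/a", "note": "check"},
]


def extract(rows, columns):
    """Extract: keep only the columns needed for the analysis."""
    return [{c: row[c] for c in columns} for row in rows]


def transform(rows):
    """Transform: clean values, fix formats, drop duplicates, sort."""
    seen = set()
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        record = (row["order_id"], row["region"].upper(), amount)
        if record in seen:
            continue  # drop exact duplicates
        seen.add(record)
        cleaned.append(
            {"order_id": record[0], "region": record[1], "amount": amount}
        )
    return sorted(cleaned, key=lambda r: r["order_id"])


def load(rows):
    """Load: serialize to CSV text, standing in for a write to the target."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["order_id", "region", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


result = transform(extract(RAW_ROWS, ["order_id", "region", "amount"]))
csv_text = load(result)
print(csv_text)
```

Here the unparseable row and the duplicate are dropped, so two cleaned rows reach the load step.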

Glue Crawlers
- Discover data across multiple sources
- Collect metadata
- Support S3, Redshift, RDS, and databases running on EC2
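
As a rough sketch, a crawler over an S3 path could be defined and started through the Glue API. The example below assumes the boto3 Glue client methods `create_crawler` and `start_crawler`; the crawler name, role ARN, database, and bucket path are hypothetical placeholders. The client is passed in as a parameter so the request can be inspected without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, request):
    """Create the crawler, then start one run of it."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)

# With AWS credentials configured, this could be run as:
#   import boto3
#   create_and_start_crawler(boto3.client("glue"), request)
```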
Glue Data Catalog
- Stores metadata about data locations
- Integration with:
  - EMR
  - Redshift
  - Athena
Glue ETL Jobs
- Transform data based on catalogs
- Load to various destinations:
  - Lake Formation
  - Redshift
  - S3
- Send job logs and metrics to CloudWatch (monitoring, not a data destination)
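
An existing ETL job might be triggered and monitored from Python along these lines. This sketch assumes the boto3 Glue client methods `start_job_run` and `get_job_run`; the job name and the `--target_path` argument in the usage comment are hypothetical, and the job script itself is assumed to exist already.

```python
import time

# Job run states after which the run will not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name, arguments=None, poll_seconds=30):
    """Start a Glue job run and poll until it reaches a terminal state."""
    run = glue_client.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = run["JobRunId"]
    while True:
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


# With AWS credentials configured, this could be run as:
#   import boto3
#   state = run_glue_job(
#       boto3.client("glue"),
#       "sales-etl",                                        # hypothetical job name
#       {"--target_path": "s3://example-bucket/clean/"},    # hypothetical argument
#   )
```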
Glue Data Quality
- Automated quality recommendations
- Predefined rule sets
- Quality metrics and alerts
- Threshold monitoring
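
The idea of quality metrics checked against thresholds can be illustrated in plain Python. This is a stand-in for the concept only and does not use the Glue Data Quality API or its rule language; the metric functions, rule names, thresholds, and sample rows are all hypothetical.

```python
def completeness(rows, column):
    """Fraction of rows where the column is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


def distinct_ratio(rows, column):
    """Fraction of non-empty values in the column that are distinct."""
    values = [r[column] for r in rows if r.get(column) not in (None, "")]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def evaluate(rows, rules):
    """Score each (name, metric, column, threshold) rule against the rows."""
    results = []
    for name, metric, column, threshold in rules:
        score = metric(rows, column)
        results.append({"rule": name, "score": score, "passed": score >= threshold})
    return results


# Hypothetical sample data and rule set.
SAMPLE = [
    {"order_id": "1001", "email": "a@example.com"},
    {"order_id": "1002", "email": ""},
    {"order_id": "1002", "email": "b@example.com"},
    {"order_id": "1003", "email": "c@example.com"},
]
RULES = [
    ("order_id_completeness", completeness, "order_id", 1.0),
    ("email_completeness", completeness, "email", 0.9),
    ("order_id_distinct", distinct_ratio, "order_id", 1.0),
]

report = evaluate(SAMPLE, RULES)
# Rules that fall below their threshold would trigger an alert.
alerts = [r["rule"] for r in report if not r["passed"]]
print(alerts)
```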

Amazon Athena
- Direct S3 querying
- Standard SQL syntax
- Federated queries across 25+ data sources
- Integration with:
  - QuickSight
  - SageMaker
  - Lake Formation
  - Operational databases
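
Running a standard SQL query against data in S3 through Athena might look like the sketch below. It assumes the boto3 Athena client methods `start_query_execution`, `get_query_execution`, and `get_query_results`; the database, table, and output bucket in the usage comment are hypothetical. Only the first page of results is read, and for a SELECT the first returned row is typically the header.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, wait for it to finish, and return the first page of rows."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a final state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {status['State']}")

    page = athena_client.get_query_results(QueryExecutionId=query_id)
    rows = page["ResultSet"]["Rows"]
    # Each cell is a dict; missing values have no "VarCharValue" key.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]


# With AWS credentials configured, this could be run as:
#   import boto3
#   rows = run_athena_query(
#       boto3.client("athena"),
#       "SELECT region, SUM(amount) AS total FROM sales GROUP BY region",  # hypothetical table
#       database="sales_db",                                   # hypothetical database
#       output_location="s3://example-bucket/athena-results/", # hypothetical bucket
#   )
```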

Most data ecosystems still require ETL, although AWS is working toward zero-ETL integrations that reduce the need for separate pipelines.
AWS Glue provides end-to-end ETL capabilities in a serverless environment.
Amazon Athena specializes in petabyte-scale extraction and SQL querying of objects stored in S3.
Both services integrate with the broader AWS analytics ecosystem.

