EMR

Elastic MapReduce.

EMR helps with ETL (Extract, Transform, Load) processing of large datasets, web indexing, machine learning training, and large-scale genomics.

EMR uses open-source tools such as Spark, Hive, HBase, Flink, Hudi, and Presto.

Clusters are groups of EC2 instances within Amazon EMR. Each instance is a node.

Master Node (called the primary node in current AWS documentation): Manages the cluster, coordinating the distribution of data and tasks.
Core Nodes: Run tasks and store data in the Hadoop Distributed File System (HDFS).
Task Nodes (Optional): Only run tasks and do not store data. Typically use Spot Instances.

Purchasing options for the EC2 instances in a cluster:

On-Demand. Most reliable, because AWS will not reclaim the instances. Most expensive choice.
Reserved. Minimum commitment of 1 year. Offers significant cost savings. Typically used for primary (master) and core nodes.
Spot. Cheapest option. Can be terminated with little warning. Typically used for task nodes.
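
The pairing of node types and purchasing options above can be sketched as a cluster layout. This is a minimal sketch, assuming the `InstanceGroups` shape accepted by boto3's EMR `run_job_flow` call. The function name, instance type, and counts are hypothetical placeholders, not recommendations. Note that the API only distinguishes `ON_DEMAND` and `SPOT` markets: Reserved pricing is a billing discount applied to matching On-Demand instances, so it does not appear in the request.

```python
# Sketch only: build_instance_groups, the instance type, and the counts are
# hypothetical placeholders chosen for illustration.
def build_instance_groups(core_count=2, task_count=4, instance_type="m5.xlarge"):
    """Return a list shaped like the InstanceGroups field of run_job_flow."""
    groups = [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",  # the API role name for the primary node
            "Market": "ON_DEMAND",     # must not be interrupted
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",     # core nodes hold HDFS data
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        # Task nodes are optional and store no data, so Spot is a common fit.
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",
                "Market": "SPOT",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


for group in build_instance_groups():
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Losing a Spot task node only costs the work it was running, while losing a core node can also cost HDFS data, which is why the sketch keeps core nodes on On-Demand capacity.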

Three storage options:

Hadoop Distributed File System (HDFS). Stores data across the instances (nodes) of the cluster and is used for caching intermediate results during processing.
EMR File System (EMRFS). Extends Hadoop so that the cluster can access S3 directly. S3 is used to store input and output data, not intermediate data.
Local File System. Disks attached to each EC2 instance. These disks are ephemeral.
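
In job configuration, the three storage options are usually told apart by the URI scheme of a path. The following is a minimal sketch of that mapping, assuming the common `hdfs://`, `s3://`, and `file://` schemes. The helper name `storage_layer` and the example paths and bucket name are hypothetical.

```python
from urllib.parse import urlparse

# Scheme-to-storage mapping as described in the notes above.
STORAGE_BY_SCHEME = {
    "hdfs": "HDFS",
    "s3": "EMRFS (S3)",
    "file": "local file system",
}


def storage_layer(uri):
    """Name the storage option a path refers to, based on its URI scheme."""
    scheme = urlparse(uri).scheme
    # A path without a scheme resolves to the cluster's default file system.
    return STORAGE_BY_SCHEME.get(scheme, "cluster default file system")


# Hypothetical paths for illustration only.
for path in [
    "s3://example-bucket/input/",
    "hdfs:///user/hadoop/intermediate/",
    "file:///mnt/scratch/",
]:
    print(path, "->", storage_layer(path))
```

A typical job under this scheme reads input from an `s3://` path, keeps intermediate results on `hdfs://`, and writes final output back to `s3://` so that it survives cluster termination.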

Last updated 9 months ago