What is Amazon EMR? How to create Amazon EMR clusters

What is Amazon EMR?

Amazon EMR (formerly Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyse massive amounts of data. Using these frameworks and related open-source applications, you can process data for analytics and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon S3 and Amazon DynamoDB.

How to construct and operate Amazon EMR clusters

This section explains how Amazon EMR clusters work, including how to submit work to a cluster, how that data is processed, and the states the cluster goes through during processing.

Getting to know nodes and clusters

The cluster is the central component of Amazon EMR. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, which gives each node its role in a distributed application such as Apache Hadoop.

The Amazon EMR node types are as follows:

Primary node: A node that manages the cluster by running software components that coordinate the distribution of data and tasks among the other nodes for processing. The primary node tracks the status of tasks and monitors the health of the cluster. Every cluster has a primary node, and you can create a single-node cluster with only the primary node.

Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
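
The three node types map onto instance groups when a cluster is defined programmatically. The following is a minimal sketch, not a definitive configuration: build_instance_groups is a hypothetical helper, the instance type and counts are placeholder assumptions, and the dictionaries follow the shape that the boto3 run_job_flow call accepts under Instances["InstanceGroups"]. Note that the API still names the primary node role MASTER.

```python
def build_instance_groups(core_count=2, task_count=0, instance_type="m5.xlarge"):
    """Return instance-group definitions for one primary node plus
    optional core and task nodes (hypothetical helper, placeholder values)."""
    # Every cluster has exactly one primary node.
    groups = [{
        "Name": "Primary",
        "InstanceRole": "MASTER",  # API name for the primary node role
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    # Core nodes run tasks and store data in HDFS.
    if core_count > 0:
        groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        })
    # Task nodes only run tasks and are optional.
    if task_count > 0:
        groups.append({
            "Name": "Task",
            "InstanceRole": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=3):
        print(group["InstanceRole"], group["InstanceCount"])
```

Calling build_instance_groups(core_count=0) yields only the primary group, which corresponds to a single-node cluster.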

Work submission to a cluster

You can specify the work that needs to be done in a number of ways when running a cluster on Amazon EMR.

Provide the entire definition of the work to be done in functions that you specify as steps when you create a cluster. This approach is typical for clusters that process a set amount of data and then terminate when processing is complete.

Create a long-running cluster and use the Amazon EMR console, the Amazon EMR API, or the AWS CLI to submit steps, each of which may contain one or more jobs. To learn more, see Submit work to an Amazon EMR cluster.

Create a cluster, connect to the primary node and other nodes as needed using SSH, and use the interfaces that the installed applications provide to perform tasks and submit queries, either scripted or interactively. To learn more, see the Amazon EMR Release Guide.
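
As an illustration of the second option, a step can be described as a plain dictionary before it is handed to the API. This is a sketch under stated assumptions: build_step is a hypothetical helper, the step name and S3 script path are made-up placeholders, and the dictionary follows the shape accepted by the boto3 add_job_flow_steps call, using command-runner.jar to run a command on the cluster.

```python
ALLOWED_ACTIONS = {"TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}


def build_step(name, args, action_on_failure="CANCEL_AND_WAIT"):
    """Return a step definition (hypothetical helper).

    name: label for the step
    args: command and arguments to run on the cluster
    action_on_failure: what the cluster does if this step fails
    """
    if action_on_failure not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": list(args),
        },
    }


if __name__ == "__main__":
    # Placeholder script location; replace with a real path.
    step = build_step(
        "Example Spark step",
        ["spark-submit", "s3://amzn-s3-demo-bucket/scripts/job.py"],
    )
    print(step)
```

With AWS credentials configured, a dictionary like this could be passed as one element of the Steps list in boto3.client("emr").add_job_flow_steps(JobFlowId=..., Steps=[step]), where the cluster ID is yours to supply.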

Processing data

When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. To process data in your Amazon EMR cluster, you can either submit jobs or queries directly to the installed applications or run steps in the cluster.

Submitting jobs directly to applications

You can submit jobs and interact directly with the software installed on your Amazon EMR cluster. To do this, you typically connect to the primary node over a secure connection and use the interfaces and tools available for the software that runs directly on your cluster.

Running steps to process data

One or more ordered steps can be sent to an Amazon EMR cluster. Each step is a unit of work that includes instructions for data manipulation that the cluster’s installed software can use.

The following is an example process with four steps:

Submit an input dataset for processing.
Process the output of the first step using a Pig program.
Process a second input dataset using a Hive program.
Write an output dataset.

In most cases, when you process data in Amazon EMR, the input is data stored as files in your chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step to the next in the processing sequence. The final step writes the output data to a specified location, such as an Amazon S3 bucket.

The steps run in the following sequence:

A request is submitted to begin processing the steps.
The state of all steps is set to PENDING.
When the first step in the sequence starts, its state changes to RUNNING. The other steps remain in the PENDING state.
After the first step completes, its state changes to COMPLETED.
The next step in the sequence starts, and its state changes to RUNNING. When it completes, its state changes to COMPLETED.
This pattern repeats for each step until they all complete and processing ends.

The steps’ sequence and state changes during processing are depicted in the following diagram.

If a step fails during processing, its state changes to FAILED. You can choose what happens next for each step. By default, any remaining steps in the sequence are set to CANCELLED and do not run if a preceding step fails. You can also choose to ignore the failure and allow the remaining steps to proceed, or to terminate the cluster immediately.

The following diagram depicts the step sequence and the default change of state when a step fails during processing.
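
The state changes described above can be sketched as a small local simulation. This does not call Amazon EMR; run_steps is a hypothetical function, each step is represented by an ordinary Python callable with a unique name, and only two of the failure options (cancel the remaining steps, or continue) are modelled.

```python
def run_steps(steps, action_on_failure="CANCEL_AND_WAIT"):
    """Simulate ordered step execution.

    steps: list of (name, callable) pairs; a callable that raises counts
           as a failed step. Names are assumed to be unique.
    Returns (final_states, log), where log lists each state change in order.
    """
    states = {name: "PENDING" for name, _ in steps}
    log = []
    failed = False
    for name, work in steps:
        # Default behaviour: once a step fails, later steps are cancelled.
        if failed and action_on_failure != "CONTINUE":
            states[name] = "CANCELLED"
            log.append((name, "CANCELLED"))
            continue
        states[name] = "RUNNING"
        log.append((name, "RUNNING"))
        try:
            work()
            states[name] = "COMPLETED"
        except Exception:
            states[name] = "FAILED"
            failed = True
        log.append((name, states[name]))
    return states, log


def _ok():
    pass


def _broken():
    raise RuntimeError("simulated failure")


if __name__ == "__main__":
    final, log = run_steps([("load", _ok), ("pig", _broken), ("hive", _ok)])
    for name, state in log:
        print(name, state)
    print(final)
```

In this run the second step fails, so the third is cancelled without running; passing action_on_failure="CONTINUE" lets the third step run anyway.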

Understanding the cluster lifecycle

A successful Amazon EMR cluster follows this process:

Amazon EMR first provisions EC2 instances in the cluster for each instance according to your specifications. For more information, see Configure hardware and networking for Amazon EMR clusters. For all instances, Amazon EMR uses either the default Amazon EMR AMI or a custom Amazon Linux AMI that you specify. For more information, see Using a custom AMI to provide Amazon EMR cluster configuration more flexibility. During this phase, the cluster state is STARTING.
Amazon EMR runs the bootstrap actions that you specify on each instance. You can use bootstrap actions to install custom applications and make the customisations you need. To learn more about installing additional software on an Amazon EMR cluster, see Create bootstrap actions. During this phase, the cluster state is BOOTSTRAPPING.
Amazon EMR installs the native applications that you chose when you created the cluster, such as Hive, Hadoop, and Spark.
After the bootstrap actions complete successfully and the native applications are installed, the cluster state is RUNNING. At this point you can connect to cluster instances, and the cluster runs, in sequence, any steps that you specified when you created it. You can submit additional steps, which run after any previous steps complete. To learn more, see Submit work to an Amazon EMR cluster.
After steps run successfully, the cluster enters the WAITING state. A cluster configured to auto-terminate after the last step completes enters the TERMINATING state and then the TERMINATED state. If the cluster is configured to wait, you must shut it down manually when you no longer need it. After a manual shutdown, the cluster enters the TERMINATING state and then the TERMINATED state.

A failure during the cluster lifecycle causes Amazon EMR to terminate the cluster and all of its instances, unless termination protection is enabled. If a cluster terminates because of a failure, any data stored on the cluster is deleted and the cluster state is set to TERMINATED_WITH_ERRORS. If you enabled termination protection, you can retrieve data from your cluster, then remove termination protection and terminate the cluster. For more information, see Protecting your Amazon EMR clusters from unintentional shutdown with termination protection.
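
The lifecycle described above can be summarised as a table of allowed state changes. The following sketch is a simplified reading of this section rather than an authoritative model of the service: the TRANSITIONS table and is_valid_lifecycle are hypothetical names, and only the states mentioned in the text are included.

```python
# Allowed next states for each cluster state, as described in the text.
# Simplified assumption: a failure can occur in any state before termination.
TRANSITIONS = {
    "STARTING": {"BOOTSTRAPPING", "TERMINATED_WITH_ERRORS"},
    "BOOTSTRAPPING": {"RUNNING", "TERMINATED_WITH_ERRORS"},
    "RUNNING": {"WAITING", "TERMINATING", "TERMINATED_WITH_ERRORS"},
    "WAITING": {"RUNNING", "TERMINATING", "TERMINATED_WITH_ERRORS"},
    "TERMINATING": {"TERMINATED"},
    "TERMINATED": set(),
    "TERMINATED_WITH_ERRORS": set(),
}


def is_valid_lifecycle(states):
    """Return True if the list of states starts at STARTING and every
    consecutive pair is an allowed transition."""
    if not states or states[0] != "STARTING":
        return False
    for current, nxt in zip(states, states[1:]):
        if nxt not in TRANSITIONS.get(current, set()):
            return False
    return True


if __name__ == "__main__":
    normal = ["STARTING", "BOOTSTRAPPING", "RUNNING",
              "WAITING", "TERMINATING", "TERMINATED"]
    print(is_valid_lifecycle(normal))
    print(is_valid_lifecycle(["STARTING", "RUNNING"]))
```

A table like this can be useful when polling a cluster's state in a script, for example to flag an unexpected jump or to stop polling once a terminal state is reached.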

The cluster lifecycle is depicted in the following graphic, along with how each stage corresponds to a certain cluster state.
