Creating a Scalable Amazon EMR Cluster on AWS in Minutes


With Amazon EMR, you can quickly set up a cluster to process and analyze data using big data frameworks such as Spark. This article covers three main phases: Plan and Configure, Manage, and Clean Up.

This is a thorough breakdown of how to set up a cluster:

Configuring the Amazon EMR Cluster

The tutorial walks you through launching a sample cluster with Spark and running a basic PySpark script. Before you start, you must complete the tasks listed under “Before you set up Amazon EMR”.

The sample cluster runs in a live environment, so it incurs small charges at the per-second rate set by Amazon EMR pricing, which varies by region. To avoid further charges, be sure to complete the cleanup tasks in the tutorial’s last step.

There are several steps in the setup process:

Configure Amazon EMR Cluster and Data Resources

Preparing your application and input data, setting up your data storage location, and finally starting the cluster itself are all part of this first stage.

Set Up Storage for Amazon EMR:

Although Amazon EMR supports several file systems, in this tutorial you store data in an S3 bucket using EMRFS. EMRFS is an implementation of the Hadoop file system for reading from and writing to Amazon S3.
For this tutorial, you must create a dedicated S3 bucket. To create one, follow the instructions in the Amazon Simple Storage Service Console User Guide.
Create the bucket in the same AWS region where you plan to launch your Amazon EMR cluster, for example US West (Oregon) us-west-2.
Bucket and folder names used with Amazon EMR are restricted. They can contain only lowercase letters, digits, periods (.), and hyphens (-); they cannot end in a number; and each bucket name must be unique across all AWS accounts.
The output folder in the bucket must be empty.
Small files stored on Amazon S3 may incur minimal charges, but if you are within the AWS Free Tier usage limits, some or all of those charges may be waived.
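
The naming rules above can be checked locally before you try to create the bucket. The following is a minimal sketch, not an official validator: the function name is_valid_emr_name is hypothetical, and it covers only the rules listed in this article. Uniqueness across AWS accounts can only be confirmed by Amazon S3 itself.

```python
import re

# Characters the article allows: lowercase letters, digits, periods, hyphens.
_ALLOWED_CHARS = re.compile(r"[a-z0-9.-]+")


def is_valid_emr_name(name: str) -> bool:
    """Return True if a bucket or folder name satisfies the rules listed above.

    Hypothetical helper: it does not check global uniqueness or any
    additional rules that Amazon S3 may enforce.
    """
    if not _ALLOWED_CHARS.fullmatch(name):
        return False  # empty, or contains a disallowed character
    return not name[-1].isdigit()  # names must not end in a number


if __name__ == "__main__":
    for candidate in ("my-emr-tutorial-bucket", "My_Bucket", "emr-logs-2024"):
        print(candidate, is_valid_emr_name(candidate))
```

A name that passes this check can still be rejected by S3, for example because another account already owns it.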

Prepare an Application with Input Data for Amazon EMR:

The standard way to prepare an application for Amazon EMR is to upload the application and its input data to Amazon S3. When you submit work to the cluster, you specify these S3 locations.
The provided PySpark script analyses King County, Washington food establishment inspection data from 2006 to 2020 and identifies the ten restaurants with the most “Red” type violations. Sample rows from the dataset are shown in the source tutorial.
Copy the sample code from the source into a new file and save it as health_violations.py to get the PySpark script ready. Next, add this file to your newly established S3 bucket. The Getting Started Guide for Amazon Simple Storage Service has advice on how to upload.
The sample input data can be prepared by downloading and unzipping the food_establishment_data.zip file, saving the CSV file to your computer as food_establishment_data.csv, and then uploading the CSV file to the same S3 bucket. Once more, for details on how to upload, consult the Amazon Simple Storage Service Getting Started Guide.
“Prepare input data for processing with Amazon EMR” contains more details about configuring data for EMR.
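
To make the script's goal concrete, here is a minimal sketch of the same counting logic in plain Python rather than PySpark, so it can be tried locally without a cluster. It is not the health_violations.py script from the source. The column names name and violation_type, the value RED, and the sample rows are assumptions made up for illustration.

```python
import csv
import io
from collections import Counter


def top_red_violations(csv_text, limit=10):
    """Count RED violations per establishment and return the `limit` largest.

    Assumes a header row with "name" and "violation_type" columns
    (hypothetical names; check them against the real CSV file).
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["violation_type"] == "RED":
            counts[row["name"]] += 1
    # most_common sorts by count, highest first
    return counts.most_common(limit)


# Tiny made-up sample, not real inspection data.
SAMPLE = """name,violation_type
Cafe A,RED
Cafe A,BLUE
Cafe B,RED
Cafe A,RED
"""

if __name__ == "__main__":
    print(top_red_violations(SAMPLE))
```

On a cluster, Spark performs the equivalent grouping and sorting in parallel across the full dataset stored in S3.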

Start an Amazon EMR Cluster:

Using Apache Spark and the most recent Amazon EMR release version, you may start the sample cluster after setting up storage and your application. Either the AWS Management Console or the AWS CLI can be used for this.
Console Launch:
Sign in to the AWS Management Console and open the Amazon EMR console.
Go to “EMR on EC2” > “Clusters” > “Create cluster” to begin.
The default values for “Release,” “Instance type,” “Number of instances,” and “Permissions” should be noted.
Input a distinct “Cluster name” that doesn’t include <, >, $, |, or `.
Choose “Spark” from the “Applications” menu to begin installing Spark. Note: Applications cannot be added or removed after the cluster has been launched; you must select them beforehand.
To publish cluster-specific logs to Amazon S3, tick the box under “Cluster logs”. The default destination is s3://amzn-s3-demo-bucket/logs. Replace it with your S3 bucket. For log files, this generates a new ‘logs’ subdirectory.
Under “Security configuration and permissions”, choose your EC2 key pair. Choose “EMR_DefaultRole” for the Service role and “EMR_EC2_DefaultRole” for the IAM role for the instance profile.
Select the “Create cluster” option.
The cluster details page appears. Watch the cluster’s “Status” change from “Starting” to “Running” to “Waiting” as Amazon EMR provisions the cluster. You may need to refresh the console view. The status changes to “Waiting” when the cluster is ready to accept work.
CLI Launch:
Use the AWS CLI’s aws emr create-default-roles command to create the default IAM roles.
Create a Spark cluster with aws emr create-cluster. Name the cluster with --name, specify your EC2 key pair with --ec2-attributes, and provide the required settings for --instance-type, --instance-count, and --use-default-roles. The Linux line continuation characters (\) in the example command may need to be changed or removed on Windows.
The ClusterId and ClusterArn will be displayed in the output. You will need your ClusterId later, so make a note of it.
Check your cluster’s status using aws emr describe-cluster --cluster-id <myClusterId>.
The output shows a Status object containing a State. As Amazon EMR provisions the cluster, the State value changes from STARTING to RUNNING to WAITING. The state changes to WAITING when the cluster is up, running, and ready to accept work.
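
The CLI steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tutorial's own command: the helper names are hypothetical, the --release-label and --applications Name=Spark options are included on the assumption that you want to pin a release and install Spark, and the cluster name, release label, key pair name, instance type, and instance count are values you must supply yourself. The sketch only builds the command and interprets the JSON output; it does not call AWS.

```python
import json
import time


def build_create_cluster_command(name, release_label, key_name,
                                 instance_type, instance_count):
    """Return the argument list for `aws emr create-cluster` with Spark."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Spark",
        "--ec2-attributes", f"KeyName={key_name}",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
    ]


def cluster_state(describe_cluster_json):
    """Extract Status.State from `aws emr describe-cluster` JSON output."""
    return json.loads(describe_cluster_json)["Cluster"]["Status"]["State"]


def wait_until_waiting(fetch_state, poll_seconds=30.0, max_polls=60):
    """Poll `fetch_state()` until it returns WAITING.

    Returns False if the cluster starts terminating or the poll limit is
    reached. `fetch_state` is any callable that returns the current state.
    """
    for _ in range(max_polls):
        state = fetch_state()
        if state == "WAITING":
            return True
        if state.startswith("TERMINAT"):  # TERMINATING, TERMINATED, ...
            return False
        time.sleep(poll_seconds)
    return False
```

One way to use this is to pass the list to subprocess.run(..., check=True), and to give wait_until_waiting a function that runs aws emr describe-cluster and feeds its output to cluster_state. Building the command as a list avoids shell quoting and the line continuation differences between Linux and Windows.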

Permitting SSH Connections

You must change your cluster security groups to allow incoming SSH connections before you may use SSH to connect to your running cluster. Security groups on Amazon EC2 function as virtual firewalls. EMR established default security groups when you started the cluster: ElasticMapReduce-slave for core and task nodes and ElasticMapReduce-master for the primary node.

SSH authorisation using the console:

You need permission to manage security groups for the VPC that the cluster is in.
Open the Amazon EMR console after logging into the AWS Management Console.
Choose “Clusters” and select the cluster you want to update. The “Properties” tab is pre-selected.
From the “Properties” tab, select “Networking” and then “EC2 security groups (firewall)”. Choose the link for the security group under “Primary node”.
This opens the EC2 console. Choose the “Inbound rules” tab and then “Edit inbound rules”.
Look for any inbound rule that permits public access (Type: SSH, Port: 22, Source: Custom 0.0.0.0/0) and remove it. Warning: the ElasticMapReduce-master group may have had a pre-configured rule that allowed public access; it is strongly recommended that you remove it and restrict traffic to trusted sources.
Select “Add Rule” by scrolling down.
Choose “SSH” for “Type”; this automatically sets Port Range to 22 and Protocol to TCP.
Choose “My IP” for “Source” to enter your current IP address or provide a range of “Custom” trusted client IP addresses. Keep in mind that dynamic IPs may need to be updated in the future.
Select “Save.”
To grant SSH access to those nodes as well, you can choose “Core and task nodes” from the list when you’re back in the EMR console and follow these instructions again.
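
If you prefer to script the rule instead of using the console, the following minimal sketch builds an inbound rule for SSH from a single trusted range and refuses the public 0.0.0.0/0 source that the warning above tells you to remove. The function name ssh_ingress_rule is hypothetical. The dictionary layout assumes you will pass it in the IpPermissions list of the boto3 EC2 authorize_security_group_ingress call; the sketch itself makes no AWS calls.

```python
import ipaddress


def ssh_ingress_rule(trusted_cidr):
    """Build an inbound rule allowing SSH (TCP port 22) from one trusted range.

    Raises ValueError if the CIDR is malformed or would open SSH to every
    address (0.0.0.0/0 or ::/0).
    """
    network = ipaddress.ip_network(trusted_cidr)
    if network.prefixlen == 0:
        raise ValueError("refusing to open SSH to all addresses")
    if network.version == 4:
        ranges = {"IpRanges": [{"CidrIp": str(network)}]}
    else:
        ranges = {"Ipv6Ranges": [{"CidrIpv6": str(network)}]}
    rule = {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22}
    rule.update(ranges)
    return rule


if __name__ == "__main__":
    # 203.0.113.0/24 is reserved for documentation; use your own address.
    print(ssh_ingress_rule("203.0.113.7/32"))
```

As with the “My IP” option in the console, a rule built from a dynamic IP address will need to be updated when that address changes.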

Using the AWS CLI to connect:

You can use the AWS CLI to establish an SSH connection regardless of your operating system.
Run aws emr ssh --cluster-id <myClusterId> --key-pair-file <~/mykeypair.key>. Replace <myClusterId> with your ClusterId and <~/mykeypair.key> with the full path to your key pair file.
After connecting, go to /mnt/var/log/spark to view Spark logs on the primary node.
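
Because the command is the same on every operating system, it is easy to assemble from a script. This is a small sketch with a hypothetical helper name; the cluster ID and key file path are placeholders you replace with your own values.

```python
import os


def build_emr_ssh_command(cluster_id, key_pair_file):
    """Return the argument list for `aws emr ssh`.

    A leading ~ in the key file path is expanded to the home directory,
    so the path works even when no shell is involved.
    """
    return [
        "aws", "emr", "ssh",
        "--cluster-id", cluster_id,
        "--key-pair-file", os.path.expanduser(key_pair_file),
    ]
```

Passing the resulting list to subprocess.run starts the SSH session in the current terminal.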
Submitting work, which is done in phases, is the next crucial step after cluster setup and access configuration.