What Are The Programmatic Commands For EMR Notebooks?

Programmatic Commands For EMR Notebooks

This article explains how to interact with Amazon EMR Notebooks programmatically.

The execution APIs let you control EMR notebook executions from a script or the command line, without going through the AWS console. With them you can list, describe, stop, and start EMR notebook executions.

The following examples illustrate these capabilities:

AWS CLI: Examples are included for running a notebook from an EMR Studio Workspace on either an Amazon EMR cluster running on Amazon EC2 or an Amazon EMR on EKS cluster. There is also an example that runs a notebook by specifying its Amazon S3 location. Further commands describe a notebook execution, stop an execution in progress, and list executions filtered by start time or by start time and status.

Boto3 SDK (Python): The example Python script (demo.py) uses the boto3 library to call the EMR notebook execution APIs. The script starts a notebook execution, retrieves the execution ID, describes the execution, lists the running notebook executions, and stops the execution after a short pause. The script's output, which shows the execution IDs and status changes, is also included.

Ruby SDK: Sample Ruby code shows how to set up an Amazon EMR connection and call the notebook execution APIs. The examples start a notebook execution and retrieve its execution ID, describe the execution and print its details, and stop the execution. The expected output of describing a notebook execution in Ruby is also shown.

Code Example (Python – Boto3)

import time  # only needed for the optional pause near the end

import boto3

# Create an Amazon EMR client for a specific region
# NOTE: Replace 'us-west-1' and the placeholder IDs/paths with your actual values
emr = boto3.client('emr', region_name='us-west-1')

# Start a notebook execution
# Requires EditorId, RelativePath, ExecutionEngine (Id), and ServiceRole
start_resp = emr.start_notebook_execution(
    EditorId='e-40AC8ZO6EGGCPJ4DLO48KGGGI',    # Replace with your Editor ID
    RelativePath='boto3_demo.ipynb',           # Replace with your notebook path
    ExecutionEngine={'Id': 'j-1HYZS6JQKV11Q'}, # Replace with your cluster/engine ID
    ServiceRole='EMR_Notebooks_DefaultRole'    # Replace with your service role
)

# Get the execution ID from the response
execution_id = start_resp["NotebookExecutionId"]
print(f"Started execution with ID: {execution_id}")

# Describe the notebook execution
describe_response = emr.describe_notebook_execution(NotebookExecutionId=execution_id)
print("\nDescription of the execution:")
print(describe_response)

# List existing notebook executions (optional)
list_response = emr.list_notebook_executions()
print("\nExisting notebook executions:")
for execution in list_response['NotebookExecutions']:
    print(execution)

# Optional: pause briefly, then stop the execution (as in the original script).
# Uncomment the lines below to try it.
# print("\nSleeping for 5 sec...")
# time.sleep(5)
# print(f"Stopping execution {execution_id}")
# emr.stop_notebook_execution(NotebookExecutionId=execution_id)

# Describe the execution again to see the status change
# describe_response = emr.describe_notebook_execution(NotebookExecutionId=execution_id)
# print("\nDescription after attempting stop:")
# print(describe_response)
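
A single describe call only returns a snapshot of the status. To follow an execution until it ends, the describe call can be repeated in a loop. The helper below is a minimal sketch, not part of the original demo script: the function name, the polling interval, and the poll limit are illustrative assumptions, and `emr` is expected to be a boto3 EMR client like the one created above.

```python
import time

# States after which an execution no longer changes
TERMINAL_STATES = {"FINISHED", "FAILED", "STOPPED"}


def wait_for_execution(emr, execution_id, poll_seconds=30, max_polls=120):
    """Poll DescribeNotebookExecution until a terminal state is reached.

    `emr` is a boto3 EMR client, for example
    boto3.client("emr", region_name="us-west-1").
    Returns the terminal status string, or raises TimeoutError.
    """
    for _ in range(max_polls):
        response = emr.describe_notebook_execution(
            NotebookExecutionId=execution_id
        )
        status = response["NotebookExecution"]["Status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"Execution {execution_id} did not finish after {max_polls} polls"
    )
```

Called as `wait_for_execution(emr, execution_id)` right after starting an execution, this blocks until the notebook run has finished, failed, or been stopped.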

Parameters used in these programmatic commands

The key parameters used in these commands are listed below, with the AWS CLI option first and the corresponding boto3 parameter second:

editor-id or EditorId: Identifies the EMR Studio Workspace.
relative-path or RelativePath: Specifies the path of the notebook file relative to the Workspace’s home directory, for example my_folder/python3.ipynb or demo_pyspark.ipynb.
execution-engine or ExecutionEngine: Specifies which engine to use, either an EMR cluster ID (for example, j-1234ABCD123) or an EMR on EKS endpoint ARN together with its type.
service-role or ServiceRole: Specifies the IAM service role, such as EMR_Notebooks_DefaultRole.
notebook-params or NotebookParams: Passes parameter values to a notebook, so that one notebook can be reused instead of keeping multiple copies. Parameters are usually supplied as a JSON string.
notebook-s3-location or NotebookS3Location: Specifies the S3 bucket and key of the input notebook file.
output-notebook-s3-location or OutputNotebookS3Location: Specifies the S3 bucket and key where the output notebook is stored.
notebook-execution-name or NotebookExecutionName: Assigns a name to the execution.
notebook-execution-id or NotebookExecutionId: Identifies a specific execution when describing, stopping, or listing.
--from and --status (From and Status in boto3): Filter the listed executions by start time, status, or both.
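
To show how several of these parameters combine, the sketch below assembles the keyword arguments for a boto3 `start_notebook_execution` call that runs a notebook from an S3 location, and the arguments for a filtered `list_notebook_executions` call. The helper names are hypothetical, and any bucket names, keys, and parameter values you pass in are your own placeholders. The helpers only build dictionaries, so they can be inspected without calling AWS.

```python
import json
from datetime import datetime, timezone


def build_s3_execution_request(cluster_id, service_role,
                               input_bucket, input_key,
                               output_bucket, output_key,
                               params=None, name=None):
    """Build kwargs for emr.start_notebook_execution() using S3 locations."""
    request = {
        "ExecutionEngine": {"Id": cluster_id},
        "ServiceRole": service_role,
        "NotebookS3Location": {"Bucket": input_bucket, "Key": input_key},
        "OutputNotebookS3Location": {"Bucket": output_bucket, "Key": output_key},
    }
    if params is not None:
        # NotebookParams is passed as a JSON string
        request["NotebookParams"] = json.dumps(params)
    if name is not None:
        request["NotebookExecutionName"] = name
    return request


def build_list_filter(started_after=None, status=None):
    """Build kwargs for emr.list_notebook_executions() (From / Status)."""
    filters = {}
    if started_after is not None:
        filters["From"] = started_after  # a datetime object
    if status is not None:
        filters["Status"] = status       # e.g. "RUNNING"
    return filters


# Example usage with a real client (not executed here):
#   emr = boto3.client("emr", region_name="us-west-1")
#   emr.start_notebook_execution(**build_s3_execution_request(...))
#   emr.list_notebook_executions(**build_list_filter(
#       started_after=datetime(2024, 1, 1, tzinfo=timezone.utc),
#       status="RUNNING"))
```

Building the request separately from sending it makes it easy to log or review the exact parameters before an execution starts.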

According to the documentation, EMR Notebooks are available in the console as EMR Studio Workspaces, and users need additional IAM role permissions to access or create Workspaces. Programmatic execution requires IAM policies that allow, among others, elasticmapreduce:StartNotebookExecution, elasticmapreduce:DescribeNotebookExecution, elasticmapreduce:ListNotebookExecutions, and iam:PassRole. When the notebook runs on an Amazon EMR on EKS cluster, additional emr-containers permissions are required.
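
As an illustration of the permissions named above, the sketch below builds a policy document as a Python dictionary and prints it as JSON. It is a starting point under stated assumptions rather than a recommended policy: the function name is hypothetical, the wildcard resource on the EMR actions is a simplification you would likely narrow, the role ARN is a placeholder, and the emr-containers permissions for EMR on EKS are not covered. If callers also need to stop executions, elasticmapreduce:StopNotebookExecution would have to be added.

```python
import json


def build_execution_policy(service_role_arn):
    """Return an IAM policy document covering the permissions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:StartNotebookExecution",
                    "elasticmapreduce:DescribeNotebookExecution",
                    "elasticmapreduce:ListNotebookExecutions",
                ],
                # Assumption: "*" keeps the sketch short; scope this down
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                # Only the notebook service role may be passed
                "Resource": service_role_arn,
            },
        ],
    }


# Placeholder account ID and role name
policy = build_execution_policy(
    "arn:aws:iam::111122223333:role/EMR_Notebooks_DefaultRole"
)
print(json.dumps(policy, indent=2))
```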

Two limits apply: at most 100 concurrent executions are allowed per AWS Region per account, and an execution is terminated if it runs for more than 30 days. Programmatic execution is not supported with interactive Amazon EMR Serverless applications.
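
Because of the limit of 100 concurrent executions, a batch script may want to check how many executions are still active before starting more. The sketch below counts them by paging through `list_notebook_executions` with its `Marker` token. The function names are hypothetical, and the set of statuses treated as "active" is an assumption about what counts toward the limit; `emr` is a boto3 EMR client.

```python
# Assumption: every non-terminal status counts toward the concurrency limit
ACTIVE_STATES = {
    "START_PENDING", "STARTING", "RUNNING",
    "FINISHING", "FAILING", "STOP_PENDING", "STOPPING",
}
MAX_CONCURRENT_EXECUTIONS = 100  # per AWS Region per account


def count_active_executions(emr):
    """Count executions in a non-terminal state, following pagination."""
    count = 0
    kwargs = {}
    while True:
        page = emr.list_notebook_executions(**kwargs)
        for execution in page.get("NotebookExecutions", []):
            if execution.get("Status") in ACTIVE_STATES:
                count += 1
        marker = page.get("Marker")
        if not marker:
            return count
        kwargs["Marker"] = marker  # request the next page


def has_capacity(emr, planned=1):
    """Return True if `planned` more executions fit under the limit."""
    return count_active_executions(emr) + planned <= MAX_CONCURRENT_EXECUTIONS
```

A scheduler could call `has_capacity(emr, planned=len(batch))` and defer the batch when it returns False.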

To schedule or batch EMR notebook executions, you can use AWS Lambda with Amazon CloudWatch Events. To orchestrate them, you can use Apache Airflow or Amazon Managed Workflows for Apache Airflow (MWAA).
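
For the Lambda option, the sketch below shows one possible shape of a handler that starts a notebook execution each time a scheduled event invokes it. The environment variable names (EDITOR_ID, NOTEBOOK_PATH, CLUSTER_ID, SERVICE_ROLE) and the `make_handler` factory are assumptions made for this illustration. The EMR client is passed in so that the handler logic can be exercised without AWS access.

```python
import os


def make_handler(emr):
    """Create a Lambda-style handler bound to the given boto3 EMR client."""

    def handler(event, context):
        # Configuration comes from (hypothetical) environment variables
        response = emr.start_notebook_execution(
            EditorId=os.environ["EDITOR_ID"],
            RelativePath=os.environ["NOTEBOOK_PATH"],
            ExecutionEngine={"Id": os.environ["CLUSTER_ID"]},
            ServiceRole=os.environ["SERVICE_ROLE"],
        )
        execution_id = response["NotebookExecutionId"]
        print(f"Started notebook execution {execution_id}")
        return {"NotebookExecutionId": execution_id}

    return handler


# In the deployed Lambda module you would bind a real client once:
#   import boto3
#   lambda_handler = make_handler(boto3.client("emr"))
```

A scheduled rule can then invoke `lambda_handler` at the desired interval; the handler returns the execution ID so that a later step can describe or stop the execution.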