When we develop a Spark application on EMR, we can use docker for local development or notebooks via EMR Studio (or EMR Notebooks). However the local development option is not viable if the size of data is large. Also I am not a fan of notebooks as it is not possible to utilise the features my editor supports such as syntax highlighting, autocomplete and code formatting. Moreover it is not possible to organise code into modules and to perform unit testing properly with that option. In this post, We will discuss how to set up a remote development environment on an EMR cluster deployed in a private subnet with VPN and the VS Code remote SSH extension. Typical Spark development examples will be illustrated while sharing the cluster with multiple users. Overall it brings another effective way of developing Spark apps on EMR, which improves developer experience significantly.
Architecture
An EMR cluster is deployed in a private subnet and, by default, it is not possible to access it from the developer machine. We can construct a PC-to-PC VPN with SoftEther VPN to establish connection to the master node of the cluster. The VPN server runs in a public subnet and it is managed by an autoscaling group where only a single instance is maintained. An elastic IP address is associated with the instance so that its public IP doesn’t change even if the EC2 instance is recreated. Access from the VPN server to the master node is allowed by an additional security group where the VPN’s security group is granted access to the master node. The infrastructure is built using Terraform and the source can be found in the post’s GitHub repository.
SoftEther VPN provides the server and client manager programs and they can be downloaded from the download centre page. We can create a VPN user using the server manager and the user can establish connection using the client manager. In this way a developer can access an EMR cluster deployed in a private subnet from the developer machine. Check one of my earlier posts titled Simplify Your Development on AWS with Terraform for a step-by-step illustration of creating a user and making a connection. The VS Code Remote – SSH extension is used to open a folder in the master node of an EMR cluster. In this way, developer experience can be improved significantly while making use of the full feature set of VS Code. The architecture of the remote development environment is shown below.
Infrastructure
The infrastructure of this post is an extension that I illustrated in the previous post. The resources covered there (VPC, subnets, auto scaling group for VPN etc) won’t be repeated. The main resource in this post is an EMR cluster and the latest EMR 6.7.0 release is deployed with single master and core node instances. It is set up to use the AWS Glue Data Catalog as the metastore for Hive and Spark SQL by updating the corresponding configuration classification. Additionally a managed scaling policy is created so that up to 5 instances are added to the core node. Note the additional security group of the master and slave by which the VPN server is granted access to the master and core node instances – the details of that security group is shown below.
# infra/emr.tf |
The following security group is created to enable access from the VPN server to the EMR instances. Note that the inbound rule is created only when the local.vpn.to_create variable value is true while the security group is created always – if the value is false, the security group has no inbound rule.
# infra/emr.tf |
Change to Secret Generation
For configuring the VPN server, we need a IPsec pre-shared key and admin password. While those are specified as variables earlier, they are generated internally in this post for simplicity. The Terraform shell resource module generates and concatenates them with double dashes (–). The corresponding values are parsed into the user data of the VPN instance and the string is saved into a file to be used for configuring the VPN server manager.
## create VPN secrets – IPsec Pre-Shared Key and admin password for VPN |
After deploying all the resources, it is good to go to the next section if we’re able to connect to the VPN server as shown below.
Preparation
User Creation
While we are able to use the default hadoop user for development, we can add additional users to share the cluster as well. First let’s access the master node via ssh as shown below. Note the access key is stored in the infra/key-pair folder and the master private DNS name can be obtained from the emr_cluster_master_dns output value.
# access to the master node via ssh jaehyeon@cevo$ export EMR_MASTER_DNS=$(terraform –chdir=./infra output –raw emr_cluster_master_dns) |
A user can be created as shown below. Optionally the user is added to the sudoers file so that the user is allowed to run a command as the root user without specifying the password. Note this is a shortcut only and please check this page for proper usage of editing the sudoers file.
# create a user and add to sudoers |
Also as described in the EMR documentation, we must add the HDFS user directory for the user account and grant ownership of the directory so that the user is allowed to log in to the cluster to run Hadoop jobs.
[root@ip-10-0-113-113 hadoop]# sudo su – hdfs # create user directory # update directory ownership -bash-4.2$ hdfs dfs -ls /user |
Finally we need to add the public key to the .ssh/authorized_keys file in order to set up public key authentication for SSH access.
# add public key |
Clone Repository
As we can open a folder in the master node, the GitHub repository is cloned to each user’s home folder to open later.
[hadoop@ip-10-0-113-113 ~]$ sudo yum install git |
Access to EMR Cluster
Now we have two users that have access to the EMR cluster and their connection details are saved into a SSH configuration file as shown below.
Host emr-hadoop |
Then we can see the connection details in the remote explorer menu of VS Code. Note the remote SSH extension should be installed for it. On right-clicking the mouse on the emr-hadoop connection, we can select the option to connect to the host in a new window.
In a new window, a menu pops up to select the platform of the remote host.
If it’s the first time connecting to the server, it requests to confirm whether you trust and want to continue connecting to the host. We can hit Continue.
Once we are connected, we can open a folder in the server. On selecting File > Open Folder… menu, we can see a list of folders that we can open. Let’s open the repository folder we cloned earlier.
VS Code asks whether we trust the authors of the files in this folder and we can hit Yes.
Now access to the server with the remote SSH extension is complete and we can check it by opening a terminal where it shows the typical EMR shell.
Python Configuration
We can install the Python extension at minimum and it indicates the extension will be installed in the remote server (emr-hadoop).
We’ll use the pyspark and py4j packages that are included in the existing spark distribution. It can be done simply by creating an .env file that adds the relevant paths to the PYTHONPATH variable. In the following screenshot, you see that there is no warning to import SparkSession.
Remote Development
Transform Data
It is a simple Spark application that reads a sample NY taxi trip dataset from a public S3 bucket. Once loaded, it converts the pick-up and drop-off datetime columns from string to timestamp followed by writing the transformed data to a destination S3 bucket. It finishes by creating a Glue table with the transformed data.
# tripdata_write.py from pyspark.sql import SparkSession |
As the spark application should run in a cluster, we need to copy it into HDFS. For simplicity, I copied the current folder into the /user/hadoop/emr-remote-dev directory.
# copy current folder into /user/hadoop/emr-remote-dev [hadoop@ip-10-0-113-113 emr-remote-dev]$ hdfs dfs -put . /user/hadoop/emr-remote-dev # check contents in /user/hadoop/emr-remote-dev |
The app can be submitted by specifying the HDFS locations of the app and source file. It is deployed to the YARN cluster with the client deployment mode. In this way, data processing can be performed by executors in the core node and we are able to check execution details in the same terminal.
[hadoop@ip-10-0-113-113 emr-remote-dev]$ spark-submit \ |
Once the app completes, we can see that a Glue database named tripdata is created and it includes a table named ny_taxi.
Read Data
We can connect to the cluster with the other user account as well. Below shows an example of the PySpark shell that reads data from the table created earlier. It just reads the Glue table and adds a column of trip duration followed by showing the summary statistics of key columns.
Unit Test
The Spark application uses a custom function that converts the data type of one or more columns from string to timestamp – to_timestamp_df(). The source of the function and the testing script can be found below.
# utils.py from typing import List, Union # test_utils.py import pytest |
For unit testing, we need to install the pytest package and export the PYTHONPATH variable that can be found in the .env file. Note, as testing can be run with a local Spark session, the testing package can only be installed in the master node. Below shows an example test run output.
Summary
In this post, we discussed how to set up a remote development environment on an EMR cluster. A cluster is deployed in a private subnet, access from a developer machine is established via PC-to-PC VPN and the VS Code Remote – SSH extension is used to perform remote development. Aside from the default hadoop user, an additional user account is created to show how to share the cluster with multiple users and spark development examples are illustrated with those user accounts. Overall the remote development brings another effective option to develop spark applications on EMR, which improves developer experience significantly.