In an earlier post, I demonstrated how to set up a local development environment for AWS Glue 1.0 and 2.0 using a docker image that is published by the AWS Glue team and the Visual Studio Code Remote – Containers extension. Recently AWS Glue 3.0 was released but a docker image for this version is not published. In this post, I’ll illustrate how to create a development environment for AWS Glue 3.0 (and later versions) by building a custom docker image.
Glue Base Docker Image
The Glue base images are built while referring to the official AWS Glue Python local development documentation. For example, the latest image that targets Glue 3.0 is built on top of the official Python image on the latest stable Debian version (python:3.7.12-bullseye). After installing utilities (zip and AWS CLI V2), Open JDK 8 is installed. Then Maven, Spark and Glue Python libraries (aws-glue-libs) are added to the /opt directory and Glue dependencies are downloaded by sourcing glue-setup.sh. It ends up downloading default Python packages and updating the GLUE_HOME and PYTHONPATH environment variables. The Dockerfile is shown below and it can also be found in the project GitHub repository.
It is published to the glue-base repository of Cevo Australia’s public ECR registry with the following tags. Later versions of Glue base images will be published with relevant tags.
The Glue base image can be used for running a pyspark shell or submitting a spark application as shown below. For the spark application, I assume the project repository is mapped to the container’s /tmp folder. The Glue Python libraries also support pytest and it’ll be discussed later in the post.
Extend Glue Base Image
We can extend the Glue base image using the VSCode remote containers extension. The configuration for the extension can be found in the .devcontainer folder. The folder includes the Dockerfile for the development docker image and remote container configuration file (devcontainer.json). The other contents include the source for the Glue base image and materials for pytest, spark-submit and pytest demonstrations. These will be illustrated below.
Development Docker Image
The Glue base Docker image runs as the root user and it is not convenient to write code with it. Therefore a non-root user is created whose user name corresponds to the logged-in user’s user name – the USERNAME argument will be set accordingly in devcontainer.json. Next the sudo program is installed and the non-root user is added to the Sudo group. Note the Python Glue library’s executables are configured to run with the root user so that the sudo program is necessary to run those executables. Finally it installs additional development Python packages.
The development container will be created by building an image from the Dockerfile illustrated above. The logged-in user’s user name is provided to create a non-root user and the container is set to run as the user as well. And 2 visual studio code extensions are installed – Python and Prettier. Also the current folder is mounted to the container’s workspace folder and 2 additional folders are mounted – they are to share AWS credentials and SSH keys. Note that AWS credentials are mounted to /root/.aws because the Python Glue library’s executables will be run as the root user. Then the port 4040 is set to be forwarded, which is used for the Spark UI. Finally additional editor settings are added at the end.
The development container can be run by executing the following command in the command palette.
Remote-Containers: Open Folder in Container…
Once the development container is ready, the workspace folder will be open within the container.
I’ve created a script (execute.sh) to run the executables easily. The main command indicates which executable to run and possible values are pyspark, spark-submit and pytest. Below shows some example commands.
Using the script above, we can launch PySpark. A screenshot of the PySpark shell can be found below.
Below shows one of the Python samples in the Glue documentation. It pulls 3 data sets from a database called legislators. Then they are joined to create a history data set (l_history) and saved into S3.
./execute.sh spark-submit example.py
When the execution completes, we can see the joined data set is stored as a parquet file in the output S3 bucket.
Note that we can monitor and inspect Spark job executions in the Spark UI on port 4040.
We can test a function that deals with a DynamicFrame. Below shows a test case for a simple function that filters a DynamicFrame based on a column value.
./execute.sh pytest -svv
In this post, I demonstrated how to build local development environments for AWS Glue 3.0 and later using a custom docker image and the Visual Studio Code Remote – Containers extension. Then examples of launching Pyspark shells, submitting an application and running a test are shown. I hope this post is useful to develop and test Glue ETL scripts locally.