AWS Lambda provides serverless computing capabilities and is well suited to validation and light processing or transformation of data. Moreover, with its integration with more than 140 AWS services, it facilitates building complex systems on event-driven architectures. There are many ways to build serverless applications, and one of the most efficient is to use a specialised framework such as the AWS Serverless Application Model (SAM) or the Serverless Framework. In this post, I'll demonstrate how to build a serverless data processing application using SAM.
Architecture
When we create an application or pipeline with AWS Lambda, we'll most likely include its event triggers and destinations. The AWS Serverless Application Model (SAM) facilitates building serverless applications by providing shorthand syntax for a number of custom resource types. The AWS SAM CLI also provides an execution environment that makes it easy to build, test, debug and deploy applications. Furthermore, the CLI can be integrated with fully-fledged IaC tools such as the AWS Cloud Development Kit (CDK) and Terraform – note that integration with the latter is on its roadmap. With this integration, serverless application development becomes much easier thanks to local building and testing. An alternative tool is the Serverless Framework. It supports multiple cloud providers and a broader set of event sources out of the box, but its integration with IaC tools is practically non-existent.
In this post, we'll build a simple data pipeline using SAM where a Lambda function is triggered when an object (a csv file) is created in an S3 bucket. The Lambda function converts the object into parquet and avro files and saves them to a destination S3 bucket. For simplicity, we'll use a single bucket for both the source and the destination.
SAM Application
After installing the SAM CLI, I initialised an app with the Python 3.8 Lambda runtime from the hello world template (sam init --runtime python3.8) and then modified it for the data pipeline app. The application is defined in template.yaml, and the source of the main Lambda function is placed in the transform folder. We need third-party packages for converting source files into the parquet and avro formats – AWS Data Wrangler and fastavro. Instead of packaging them together with the Lambda function, they are made available as Lambda layers. An AWS managed Lambda layer exists for the former, so we only need to build a layer for the fastavro package; it lives in the fastavro folder, as shown in the layout sketch below. The source of the app can be found in the GitHub repository of this post.
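The resulting project layout looks roughly as follows – the root folder name and the requirements.txt inside the transform folder are assumptions carried over from the hello world template.

```
sam-for-data-professionals/
├── template.yaml
├── transform/
│   ├── app.py
│   └── requirements.txt
├── fastavro/
│   └── requirements.txt
└── tests/
    └── unit/
        └── test_handler.py
```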
```
# fastavro/requirements.txt
fastavro
```
In the resources section of the template, the Lambda layer for avro transformation (FastAvro), the main Lambda function (TransformFunction) and the source (and destination) S3 bucket (SourceBucket) are added. The layer is built simply by adding the pip package name to the requirements.txt file, and it is set to be compatible with Python 3.7 to 3.9. The Lambda function's source is configured to be built from the transform folder, and the ARNs of the custom and AWS managed Lambda layers are added to its layers property. An S3 bucket event is also configured so that the function is triggered whenever a new object is created in the bucket. Finally, as the function needs permission to read and write objects in the S3 bucket, the relevant policies are attached to its execution role from ready-made policy templates – S3ReadPolicy and S3WritePolicy.
```yaml
# template.yaml
AWSTemplateFormatVersion: "2010-09-09"
```
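Based on that description, the rest of the template might look roughly like this sketch – the compatible runtimes, the managed AWS Data Wrangler layer ARN (region and version) and the input/ prefix filter are assumptions, while the bucket name is taken from the upload example later in the post.

```yaml
Transform: AWS::Serverless-2016-10-31

Resources:
  FastAvro:
    Type: AWS::Serverless::LayerVersion
    Properties:
      ContentUri: fastavro/
      CompatibleRuntimes:
        - python3.7
        - python3.8
        - python3.9
    Metadata:
      BuildMethod: python3.8
  TransformFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: transform/
      Handler: app.lambda_handler
      Runtime: python3.8
      Layers:
        - !Ref FastAvro
        # AWS managed layer for AWS Data Wrangler – region and version are assumptions
        - arn:aws:lambda:ap-southeast-2:336392948345:layer:AWSDataWrangler-Python38:5
      Events:
        ObjectCreated:
          Type: S3
          Properties:
            Bucket: !Ref SourceBucket
            Events: s3:ObjectCreated:*
            # Restricting the trigger to the input/ prefix is an assumption to avoid
            # re-triggering on the output files written to the same bucket
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: input/
      Policies:
        - S3ReadPolicy:
            BucketName: sam-for-data-professionals-cevo
        - S3WritePolicy:
            BucketName: sam-for-data-professionals-cevo
  SourceBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: sam-for-data-professionals-cevo
```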
Lambda Function
The transform function reads an input file from the S3 bucket and saves the records in the parquet and avro formats. Thanks to the Lambda layers, we can access the necessary third-party packages while keeping the uploaded deployment package small and quick to deploy.
```python
# transform/app.py
```
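Its basic structure is sketched below – the pandas-to-avro type mapping in generate_avro_file and the output/ key prefix are assumptions made for illustration.

```python
import io
import boto3
import pandas as pd
import awswrangler as wr
from fastavro import writer, parse_schema

s3 = boto3.client("s3")

# Assumed mapping from pandas dtypes to avro primitive types
TYPE_MAP = {"int64": "long", "float64": "double", "bool": "boolean"}


def generate_avro_file(df: pd.DataFrame) -> io.BytesIO:
    # Build an avro schema from the dataframe's columns and dtypes
    schema = parse_schema(
        {
            "name": "records",
            "type": "record",
            "fields": [
                {"name": col, "type": TYPE_MAP.get(str(dtype), "string")}
                for col, dtype in df.dtypes.items()
            ],
        }
    )
    buffer = io.BytesIO()
    writer(buffer, schema, df.to_dict("records"))
    buffer.seek(0)
    return buffer


def lambda_handler(event, context):
    # The source bucket and object key come from the S3 event notification
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    file_name = key.split("/")[-1].rsplit(".", 1)[0]

    # Read the csv input and write parquet with AWS Data Wrangler
    df = wr.s3.read_csv(f"s3://{bucket}/{key}")
    wr.s3.to_parquet(df, path=f"s3://{bucket}/output/{file_name}.parquet")

    # Write avro with fastavro via the custom helper
    s3.upload_fileobj(generate_avro_file(df), bucket, f"output/{file_name}.avro")
```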
Unit Testing
We use a custom function (generate_avro_file) to create avro files, while relying on the AWS Data Wrangler package for reading input files and writing parquet files. Therefore, unit testing is performed for the custom function only. It mainly checks whether the generated avro schema matches the fields and data types of the input data.
```python
# tests/unit/test_handler.py
import pytest
```
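A sketch of such a test is shown below – the sample dataframe and the expected avro types are assumptions aligned with the handler sketch above.

```python
import pandas as pd
import fastavro

from transform import app


def test_generate_avro_file_schema_matches_input():
    # A small sample dataframe covering integer, string and float columns
    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "value": [1.5, 2.5]})

    buffer = app.generate_avro_file(df)
    reader = fastavro.reader(buffer)

    # The schema written to the avro file should mirror the dataframe's columns and dtypes
    fields = {f["name"]: f["type"] for f in reader.writer_schema["fields"]}
    assert fields == {"id": "long", "name": "string", "value": "double"}
```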
Build and Deploy
The app has to be built before deployment, which is done with sam build.
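Running it from the project root builds both the function and the fastavro layer.

```bash
sam build
```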
The deployment can be done with or without guided mode. For the latter, we need to specify additional parameters such as the CloudFormation stack name, capabilities (as we create an IAM role for Lambda) and a flag to automatically determine an S3 bucket to store build artifacts.
```bash
sam deploy \
```
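Expanded with those options, the command might look like the following – the stack name is an assumption, while --capabilities and --resolve-s3 cover the IAM role and artifact bucket requirements mentioned above.

```bash
sam deploy \
  --stack-name sam-for-data-professionals \
  --capabilities CAPABILITY_IAM \
  --resolve-s3
```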
Trigger Lambda Function
We can simply trigger the Lambda function by uploading a source file to the S3 bucket. Once it is uploaded, we are able to see that the output parquet and avro files are saved as expected.
```bash
$ aws s3 cp test.csv s3://sam-for-data-professionals-cevo/input/
```
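We can then list the destination prefix to check the outputs – the output/ prefix is an assumption matching the handler sketch above.

```bash
$ aws s3 ls s3://sam-for-data-professionals-cevo/output/
```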
Summary
In this post, I illustrated how to build a serverless data processing application using SAM. A Lambda function was developed that is triggered whenever an object is created in an S3 bucket. It converts input csv files into the parquet and avro formats before saving them to the destination bucket. For the format conversion, it uses third-party packages that are made available through Lambda layers. Finally, the application was built and deployed, and the function was confirmed to run when triggered by an object upload.