This blog offers a comprehensive walkthrough for setting up DBT to run data transformation jobs against Snowflake. We have streamlined the DBT configuration by packaging it in a Docker image stored in a private Amazon ECR repository, and we use an ECS service together with Amazon MWAA (Managed Workflows for Apache Airflow) for scheduling and orchestration. You can access the source code in this GitRepo.
Architecture:
Data Build Tool:
Data build tool (DBT) enables analytics engineers to transform data in their warehouses by simply writing select statements. DBT handles turning these select statements into tables and views.
DBT does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.
Infrastructure:
DBT can be installed in several ways; since our goal is to deploy it as a container on ECS Fargate, the first step is to prepare a Dockerfile.
Dockerfile:
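A minimal Dockerfile along these lines installs DBT with the Snowflake adapter and copies the project and connection profile into the image; the base image, dbt-snowflake version, and paths shown here are assumptions for illustration.

```dockerfile
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# Base image and version pin are assumptions for illustration.
FROM python:3.10-slim

# Install DBT and the Snowflake adapter
RUN pip install --no-cache-dir dbt-snowflake==1.7.0

# Copy the DBT project and the connection profile into the image
WORKDIR /usr/app
COPY dbtlearn/ ./dbtlearn/
COPY Config/profiles.yml /root/.dbt/profiles.yml

WORKDIR /usr/app/dbtlearn

# Default command: run the DBT models when the container starts
ENTRYPOINT ["dbt"]
CMD ["run"]
```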
DBT Project:
A DBT project is a directory of .sql and .yml files, which DBT uses to transform your data. At a minimum, a DBT project must contain:
- A project file: A dbt_project.yml file tells DBT that a particular directory is a DBT project, and also contains configurations for your project.
- Models: A model is a single .sql file. Each model contains a single select statement that either transforms raw data into a dataset that is ready for analytics, or, more often, is an intermediate step in such a transformation.
A project may also contain a number of other resources, such as snapshots, seeds, tests, macros, documentation, and sources.
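For reference, a minimal dbt_project.yml looks like the following; the project name (reusing the dbtlearn profile shown later in this post) and the materialization default are assumptions for illustration.

```yaml
# Minimal dbt_project.yml (names and defaults are assumptions for illustration)
name: 'dbtlearn'
version: '1.0.0'
config-version: 2

# The connection profile defined in profiles.yml
profile: 'dbtlearn'

model-paths: ["models"]
snapshot-paths: ["snapshots"]

models:
  dbtlearn:
    +materialized: view
```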
Source Code:
The source code is housed in a CodeCommit repository; it is also available in this GitRepo.
The directory tree of the source code is shown below.
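A representative layout, reconstructed from the files referenced in this post (the exact directory placement is an assumption for illustration), looks like this:

```
.
├── Config
│   └── profiles.yml
├── dbtlearn
│   ├── dbt_project.yml
│   ├── models
│   │   └── emp_fact.sql
│   ├── snapshots
│   │   └── scd_emp.sql
│   └── source
│       └── sources.yml
├── Dockerfile
└── buildspec.yml
```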
DBT Configuration:
In this instance, we are connecting DBT to Snowflake. Database connection details are read from the profiles.yml file, while sensitive values are pulled dynamically from Secrets Manager or Parameter Store during the build.
These details are passed to CodeBuild as environment variables.
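A profiles.yml along these lines defines the dbtlearn profile for Snowflake. The role, warehouse, database, and schema values are placeholders, and reading the sensitive values through env_var() is one possible way to wire in the variables supplied by CodeBuild; both are assumptions for illustration.

```yaml
dbtlearn:
  target: dev
  outputs:
    dev:
      type: snowflake
      # Sensitive values are injected as environment variables (assumption)
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      # Placeholder Snowflake objects
      role: TRANSFORM
      warehouse: COMPUTE_WH
      database: ANALYTICS
      schema: DEV
      threads: 4
```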
CodeBuild:
The build process dynamically retrieves the confidential database connection details from Secrets Manager or Parameter Store, builds the source code, and pushes the resulting image to the private ECR repository.
buildspec.yml
The YAML file defines a series of commands to be executed during the different build phases. It sets up a Python environment, configures the Snowflake credentials, builds and tags a Docker image, and pushes it to an Amazon ECR repository in the ap-southeast-2 region.
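A buildspec along these lines matches that description: it pins a Python runtime, pulls the Snowflake secret, then builds, tags, and pushes the image to ECR in ap-southeast-2. The account ID, repository name, and secret ID are placeholders.

```yaml
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.10
  pre_build:
    commands:
      # Pull the Snowflake password from Secrets Manager (secret ID is a placeholder)
      - export SNOWFLAKE_PASSWORD=$(aws secretsmanager get-secret-value --secret-id dbt/snowflake/password --query SecretString --output text)
      # Authenticate Docker against the private ECR registry
      - aws ecr get-login-password --region ap-southeast-2 | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.ap-southeast-2.amazonaws.com
  build:
    commands:
      # Build and tag the DBT image (repository name is a placeholder)
      - docker build -t dbt-snowflake:latest .
      - docker tag dbt-snowflake:latest $AWS_ACCOUNT_ID.dkr.ecr.ap-southeast-2.amazonaws.com/dbt-snowflake:latest
  post_build:
    commands:
      # Push the image to the private ECR repository
      - docker push $AWS_ACCOUNT_ID.dkr.ecr.ap-southeast-2.amazonaws.com/dbt-snowflake:latest
```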
CodeBuild Environment Variables:
Up to this point, we have built the code and pushed our Docker image to the ECR repository.
As an illustrative example, we will write DBT code that implements a Slowly Changing Dimension (SCD) type 2 using models and snapshots in the Snowflake environment.
DBT offers a feature called “snapshots,” which allows you to capture and track changes made to a table that can be updated over time. Snapshots are particularly useful for implementing type-2 SCDs on tables that can be modified.
source/sources.yml
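A minimal sources.yml along these lines declares the raw employee table that the model and snapshot below read from; the source, database, schema, and table names are assumptions for illustration.

```yaml
version: 2

sources:
  - name: hr                # placeholder source name
    database: EMPLOYEE_DB   # placeholder Snowflake database
    schema: RAW             # placeholder schema holding the loaded data
    tables:
      - name: emp           # raw employee table tracked by the snapshot
```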
models/emp_fact.sql
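A model along these lines builds the employee fact table from the raw source; the column names and the source reference are assumptions for illustration.

```sql
{{
  config(
    materialized = 'table'
  )
}}

-- Build the employee fact table from the raw source
-- (column names and source reference are assumptions for illustration)
select
    emp_id,
    emp_name,
    department,
    salary,
    updated_at
from {{ source('hr', 'emp') }}
```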
snapshots/scd_emp.sql
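A snapshot along these lines tracks changes to the employee table as a type-2 SCD; DBT adds dbt_valid_from and dbt_valid_to columns to record each row's validity window. The unique key, timestamp column, and target schema are assumptions for illustration.

```sql
{% snapshot scd_emp %}

{{
  config(
    target_schema = 'snapshots',
    unique_key = 'emp_id',
    strategy = 'timestamp',
    updated_at = 'updated_at'
  )
}}

-- Capture the current state of the employee table; DBT compares it with the
-- previous snapshot and closes out changed rows to maintain the type-2 history
select * from {{ source('hr', 'emp') }}

{% endsnapshot %}
```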
Summary:
In this blog, we explored configuring DBT in a private ECR repository via an AWS pipeline. In the next post, we will delve into the configuration of MWAA and demonstrate how to trigger the DBT job from Airflow.