Prioritising Data Quality with dbt-expectations: A Practical Approach to Building Reliable Data Pipelines

The Data Build Tool (dbt) is an open-source framework that enables data teams to transform raw data into analysis-ready formats using SQL. It supports modular, version-controlled data transformation pipelines that improve workflow efficiency and data quality. dbt allows teams to collaborate on, test, document, and deploy data models, incorporating best practices from software engineering, such as version control and monitoring, to ensure reliable and scalable analytics.

Great Expectations is an open-source Python tool for ensuring data quality by validating, profiling, and documenting data. It allows users to set expectations for data, such as column values, types, ranges, and uniqueness, and automatically checks data against these rules. With integrations like Airflow and Slack, Great Expectations supports automation, logging, and alerts, making it a comprehensive solution for monitoring data accuracy, completeness, and integrity across various data sources and formats.

dbt-expectations is a dbt package inspired by Great Expectations, designed to enable seamless data quality checks directly within dbt pipelines, removing the need for an external integration layer. This allows dbt users to define and execute data validation tests like Great Expectations, but with the added benefit of operating natively within the data warehouse environment. Tests in dbt-expectations are written in YAML templates using a combination of SQL, Jinja, and dbt macros, making them more straightforward to set up and easier to write. Additionally, dbt-expectations runs faster than Great Expectations since all validation occurs directly in the database, eliminating the need to extract and move data for testing. This results in a more streamlined and efficient data validation process, fully integrated into dbt’s workflow.

Setting Up dbt in a Docker Container

To deploy dbt and dbt-expectations on AWS ECS, ensure you have an active AWS account with access to the ECS service. We need to create a Docker container with dbt installed; this container will then be deployed to ECS for execution.

Dockerfile for dbt

# Base image with Python 3.9
FROM python:3.9

# Set working directory
WORKDIR /app

# Copy your dbt project into the container
COPY . /app

# Install dbt-core, dbt-redshift and other required dependencies
RUN apt-get update && \
    pip install --upgrade pip && \
    pip install -r requirements.txt

# Set environment variables for dbt
ENV DBT_PROFILES_DIR=/app/.dbt

# Set entry point to run dbt commands
CMD ["bash"]

The requirements.txt looks like this:

boto3==1.35.24
dbt-core~=1.8
dbt-redshift~=1.8

Setting Up ECS Cluster

After building and tagging the Docker image, push it to AWS ECR. Once the image is in ECR, the next step is to deploy it on AWS ECS.

  1. From the AWS Systems Manager console, navigate to Parameter Store under Application Management and create parameters to securely store the Redshift cluster credentials.
  2. From the AWS Management Console, navigate to ECS, create a new cluster, and define a task that uses the newly created dbt image from ECR. For simplicity, we’ll use the Fargate launch type, which abstracts away the underlying infrastructure.
  3. During ECS task creation, inject the Redshift credentials from SSM as environment variables to ensure secure access to the Redshift cluster (see the task definition sketch after this list).
  4. Additionally, configure the necessary networking and security groups so the ECS tasks can reach the Redshift cluster.
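
As a rough sketch, the CloudFormation snippet below shows how such a Fargate task definition could inject the SSM parameters as secrets. The account ID, region, parameter paths, image URI, role ARN and command are illustrative assumptions, not values from this walkthrough.

# Hypothetical sketch of the Fargate task definition (CloudFormation syntax).
# All names, ARNs and the image URI are placeholders to adapt to your account.
Resources:
  DbtTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: dbt-runner
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc
      Cpu: "512"
      Memory: "1024"
      # Execution role needs ssm:GetParameters to resolve the secrets below
      ExecutionRoleArn: arn:aws:iam::123456789012:role/dbtTaskExecutionRole
      ContainerDefinitions:
        - Name: dbt
          Image: 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/dbt-runner:latest
          Command: ["dbt", "build"]   # or dbt run / dbt test, depending on the job
          Environment:                # non-secret settings (add DB and schema similarly)
            - Name: DBT_REDSHIFT_HOST
              Value: my-cluster.xxxxxx.ap-southeast-2.redshift.amazonaws.com
          Secrets:                    # injected from SSM Parameter Store at runtime
            - Name: DBT_ENV_SECRET_DBT_USER
              ValueFrom: arn:aws:ssm:ap-southeast-2:123456789012:parameter/dbt/redshift/user
            - Name: DBT_ENV_SECRET_DBT_PASSWORD
              ValueFrom: arn:aws:ssm:ap-southeast-2:123456789012:parameter/dbt/redshift/password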

Connecting dbt to Redshift

After the ECS task is set up, you’ll need to configure dbt to connect to Redshift. This configuration is straightforward and is managed through the profiles.yml file within the dbt project.

my-redshift-db:
  target: dev
  outputs:
    dev:
      type: redshift
      host: "{{ env_var('DBT_REDSHIFT_HOST') }}"
      user: "{{ env_var('DBT_ENV_SECRET_DBT_USER') }}"
      password: "{{ env_var('DBT_ENV_SECRET_DBT_PASSWORD') }}"
      dbname: "{{ env_var('DBT_REDSHIFT_DB') }}"
      schema: "{{ env_var('DBT_REDSHIFT_SCHEMA') }}"
      port: 5439

      # Optional Redshift configs:
      sslmode: prefer
      role: null
      ra3_node: true
      autocommit: true
      threads: 4
      connect_timeout: null

Adding dbt-expectations

To install dbt-expectations, add the following to the packages.yml file:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

Next, add the dbt_date:time_zone variable to your dbt_project.yml file. This is referenced by some of the tests.

vars:
  'dbt_date:time_zone': 'Australia/Sydney'

Finally, run dbt deps from the dbt CLI to install the package. dbt resolves and installs dbt-expectations together with its dbt_utils dependency, pinning specific versions:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.2.0
  - package: calogica/dbt_expectations
    version: 0.10.3


dbt-expectations offers a wide range of built-in tests designed to address common data quality issues, which can be applied across various elements within a dbt project, including sources, models, columns, and seeds. These tests help ensure that the data conforms to expected patterns, validating aspects like uniqueness, null values, data type consistency, and specific value ranges. By leveraging these tests, teams can automate the detection of data anomalies and enforce data quality standards throughout the entire transformation pipeline, making it easier to maintain trust in the data across the organisation.

Missing Value, Unique Value and Type Validations

Data types, missing values, and data consistency are critical to ensuring that a column contains valid data and matches the expected format. Data types are particularly important because they influence how SQL functions behave when writing dbt models. If a column’s data type does not align with expectations, it can lead to errors and potentially break the model. Mismatched types can prevent SQL functions from executing correctly, making it essential to validate data types before running transformations to maintain model stability and reliability.

For these checks, dbt-expectations provides the following tests:

expect_column_values_to_be_null
expect_column_values_to_not_be_null
expect_column_values_to_be_unique
expect_column_values_to_be_of_type
expect_column_values_to_be_in_type_list
expect_column_values_to_have_consistent_casing

The tests would look like this:

models: # or seeds
  - name: sample_model
    description: model for testing dbt
    columns:
      - name: order_id
        tests:
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: integer

models: # or seeds
  - name: sample_model
    description: model for testing dbt
    columns:
      - name: order_id
        tests:
          - dbt_expectations.expect_column_values_to_be_unique:
              row_condition: "order_id is not null" # (Optional)
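
The other tests in the list above follow the same shape. As a rough sketch, with column names that are illustrative assumptions, a not-null check and a type-list check could be added like this (verify arguments against the dbt-expectations version you install):

models:
  - name: sample_model
    columns:
      - name: customer_email
        tests:
          - dbt_expectations.expect_column_values_to_not_be_null
      - name: order_total
        tests:
          - dbt_expectations.expect_column_values_to_be_in_type_list:
              column_type_list: [decimal, float, integer]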

String Pattern (REGEX) Validations

String assertions in dbt-expectations play a key role in maintaining data quality across various fields by validating that string values meet defined expectations. These checks ensure that string data adheres to specific criteria, such as matching certain patterns, having a required length, or following a defined format (e.g., email addresses, phone numbers). By applying these assertions, organisations can enforce data governance policies, ensuring that string data is consistent, accurate, and fits the expected standards. This, in turn, enhances the overall reliability of the data used in downstream analysis and decision-making processes.

These string validations can be especially useful for enforcing business rules, such as ensuring customer IDs follow a specific pattern or that product codes are formatted correctly. By embedding these checks within the dbt pipeline, teams can automate data governance and prevent erroneous or invalid string data from propagating through the system, ultimately improving data quality across the entire organisation.

Examples of string assertions include expect_column_values_to_match_regex and expect_column_values_to_match_regex_list. The tests would look like this:

models:
  - name: order_details
    columns:
      - name: email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[A-Z0-9+_.-]+@[A-Z0-9.-]+$"

models:
  - name: order_details
    columns:
      - name: order_code
        tests:
          - dbt_expectations.expect_column_values_to_match_regex_list:
              regex_list: ["^AUS\\d{3}-[A-Z]{4}$", "^NZ\\d{3}-[A-Z]{4}$"]
              row_condition: "status = 'confirmed'"
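
Length checks mentioned above follow the same pattern. The sketch below uses expect_column_value_lengths_to_be_between on an assumed postcode column; the column name and bounds are illustrative only.

models:
  - name: order_details
    columns:
      - name: postcode
        tests:
          - dbt_expectations.expect_column_value_lengths_to_be_between:
              min_value: 4
              max_value: 4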

dbt-expectations provides a wide range of data quality tests that can address various issues, such as data freshness, which is a critical factor for business users. Stale data can significantly impact decision-making processes, especially in real-time or near-real-time data pipelines. Ensuring that data remains as up-to-date as expected helps prevent potential disruptions and highlights issues within the pipeline early on.
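
As an illustration, a freshness check on an assumed created_at timestamp column might look like the following; the datepart and interval values are placeholders to tune against your own freshness SLAs, and the arguments should be checked against the dbt-expectations documentation for your installed version.

models:
  - name: order_details
    columns:
      - name: created_at
        tests:
          - dbt_expectations.expect_row_values_to_have_recent_data:
              datepart: day
              interval: 1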

Another powerful feature of dbt-expectations is the ability to detect anomalies in the data. By identifying abnormal patterns, such as unexpected spikes or drops in values, you can proactively catch and address underlying issues in the dataset, helping maintain the integrity and reliability of the data. This anomaly detection can serve as an early warning system, alerting teams to potential problems before they escalate.
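
A hedged sketch of such a check is below, using expect_column_values_to_be_within_n_moving_stdevs on an assumed daily order-count model; the model, column and parameter values are assumptions for illustration, so confirm the exact arguments in the dbt-expectations documentation for the version you install.

models:
  - name: daily_order_counts
    columns:
      - name: order_count
        tests:
          - dbt_expectations.expect_column_values_to_be_within_n_moving_stdevs:
              date_column_name: order_date
              period: day
              lookback_periods: 1
              trend_periods: 7
              test_periods: 14
              sigma_threshold: 3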

For a comprehensive list of all available tests in dbt-expectations, you can refer to the GitHub repository, which includes a variety of assertions tailored to different data quality needs, enabling more robust and reliable data management across the entire pipeline.

Monitoring and Logging

To effectively monitor dbt runs and dbt-expectations assertions, we can leverage Amazon CloudWatch by configuring our ECS task to send logs directly to CloudWatch. This setup allows us to track critical metrics, such as errors, warnings, and performance issues, across our dbt pipelines. By centralising these logs, we can gain better visibility into the health and status of our data pipelines.
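
In practice this typically means adding the awslogs log driver to the container definition sketched earlier. The log group name, region and stream prefix below are placeholders, and the log group must already exist (or be created separately).

# Illustrative addition to the dbt container definition (CloudFormation syntax)
LogConfiguration:
  LogDriver: awslogs
  Options:
    awslogs-group: /ecs/dbt-runner
    awslogs-region: ap-southeast-2
    awslogs-stream-prefix: dbt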

Additionally, CloudWatch alarms can be set up to notify the team whenever data validation failures or other anomalies occur. These alarms enable proactive monitoring by providing real-time alerts when issues arise, ensuring that problems are addressed before they impact business processes. These alerts can also be integrated with communication tools like Slack or Microsoft Teams, enabling the responsible team to receive instant notifications and respond promptly to any issues, minimising downtime and ensuring the smooth operation of the pipeline. This automated alerting and monitoring system improves the overall reliability and resilience of data workflows.
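
One hedged way to wire this up is a CloudWatch Logs metric filter over the dbt logs plus an alarm on the resulting metric. The filter pattern, names and SNS topic below are illustrative assumptions; the topic could then be forwarded to Slack or Microsoft Teams.

# Hypothetical metric filter and alarm for dbt failures (CloudFormation syntax)
Resources:
  DbtFailureMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: /ecs/dbt-runner          # must match the task's log group
      FilterPattern: '?ERROR ?FAIL'          # matches dbt errors and failed tests
      MetricTransformations:
        - MetricName: DbtTestFailures
          MetricNamespace: DataQuality
          MetricValue: "1"
          DefaultValue: 0

  DbtFailureAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: dbt-test-failures
      Namespace: DataQuality
      MetricName: DbtTestFailures
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - arn:aws:sns:ap-southeast-2:123456789012:data-quality-alerts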

Conclusion

dbt-expectations provides the capability to implement a comprehensive suite of data quality tests across your dbt models. While it offers a wide range of possible tests, it’s crucial to prioritise and identify which aspects of your data pipeline should be tested first. A practical approach is to begin by analysing the most common errors or data issues that have occurred during past data processing. This helps you focus on addressing the most critical data quality concerns early on.

Cevo Australia specialises in building comprehensive Data Platforms for enterprises, from cloud-based Data Warehouses to modern Lakehouse solutions. We have successfully supported clients in developing Data-as-a-Service platforms on AWS, delivering secure, scalable, and cost-effective data infrastructure tailored for enterprise needs.
