Data Engineering Archives

Metadata-Driven PII Masking in dbt on AWS Glue

Learn how to implement PII masking in dbt on AWS Glue using a metadata-driven approach powered by Jinja macros and post-hooks. This guide shows how to protect sensitive data in non-production environments without modifying model SQL, using Spark SQL and Iceberg tables for scalable, deterministic masking.

Transforming Data Engineering with DevOps on the Databricks Platform

The role of the Data Engineer is changing rapidly, from writing ETL scripts to engineering production-grade data products. On the Databricks Lakehouse Platform, this shift demands more than technical know-how; it requires a DevOps mindset. By embracing software engineering best practices, automated testing, and CI/CD pipelines, data teams can deliver scalable, reliable, and secure solutions. This post explores how DevOps principles and tools like Git Folders and Databricks Asset Bundles are transforming data engineering into a discipline of continuous innovation and delivery.

Prioritising Data Quality with dbt-expectations: A Practical Approach to Building Reliable Data Pipelines

Discover how dbt-expectations enhances data quality checks within dbt pipelines, ensuring reliable analytics and streamlined workflows.

Accelerating Analytics with Apache Spark and Kubernetes

This blog explores Cevo’s Apache Spark on AWS EKS solution, designed to solve some of the challenges of big data analytics.

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 2 Glue

The data build tool (dbt) is an effective data transformation tool that supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue. Subsets of IMDb data are used as the source, and data models are developed in multiple layers according to dbt best practices.

Serverless Application Model (SAM) for Data Professionals

We’ll discuss how to build a serverless data processing application using the Serverless Application Model (SAM). A Lambda function is developed that is triggered whenever an object is created in an S3 bucket. The third-party packages needed for data processing are made available through Lambda layers.