Metadata-Driven PII Masking in dbt on AWS Glue

TL;DR 
Implementing PII masking in dbt on AWS Glue isn’t straightforward, especially when working with Spark SQL and Iceberg tables. In this blog, I walk through a metadata-driven approach to masking sensitive data using dbt macros and post-hooks, without modifying model SQL. 

 

Introduction 

When building a dbt data pipeline, we must mask Personally Identifiable Information (PII) in non-production environments so that developers can work safely with realistic-looking data without ever touching real customer details. 
 
No mature library exists for dbt + Glue/Spark, so I engineered a clean, metadata-driven solution using native dbt Jinja macros and a single post-hook entry in dbt_project.yml.  
The result: zero changes to individual model SQL files, with masking declared purely in YAML. 


Why PII Masking in dbt is critical for Dev Environments 

Data pipelines built for large-scale migrations inevitably deal with sensitive customer information: names, email addresses, phone numbers, dates of birth.  

In a production environment, access controls and governance policies protect this data. But in development and testing environments, engineers need to run the full pipeline, including building and validating target tables, which means the data has to flow through. 

The naive solution (give dev engineers read access to the same production data) is a compliance and privacy risk. The equally naive opposite (use entirely synthetic data) often breaks model logic that depends on real-world value distributions, formats, and relationships. 

What I needed was a middle path: the pipeline runs on real data, but by the time a developer queries a dev-environment table, every PII column has been replaced with a deterministic, realistic-looking, but non-identifying value. The business logic stays valid, the privacy risk disappears. 

The constraint: the pipeline runs on AWS Glue using the dbt-glue adapter, with data stored as Apache Iceberg tables on S3 and catalogued in the AWS Glue Data Catalog. 

Why not just use an existing library?

Before writing a single line of Jinja, I evaluated the ecosystem. 

 

What exists 

Snowflake, for instance, has native Dynamic Data Masking, a first-class platform feature that attaches masking policies to columns at the warehouse level. On top of that, the community package dbt_snow_mask wraps those policies into dbt-idiomatic YAML configuration. It’s polished, well-documented, and broadly adopted. 

 

What exists for Spark/Glue 

Nothing equivalent. Spark SQL has no native column-level masking policies. The dbt-glue adapter has no masking extension. A search of dbt Hub yields no Spark-specific PII package with meaningful adoption.  

The closest conceptual match is dbt-privacy, a community project that follows the same meta-driven macro philosophy, but it targets the Snowflake and Postgres SQL dialects and is not production-ready for Spark. 

 

The idiomatic approach 

The pattern that does exist across the dbt ecosystem, and that projects like dbt-privacy follow, is: 

  • Annotate columns in the model’s .yml file with a meta.pii_mask tag. 
  • Implement masking as plain Jinja macros that emit adapter-specific SQL expressions. 
  • Invoke those macros via a post-hook that reads the metadata at compile time. 

This is exactly what I built. The difference is that I implemented the SQL expressions in the Spark dialect, targeting our actual runtime. 

The lesson: a library gap is not an excuse to skip the feature; it is an opportunity to engineer the right solution for your stack. 


Designing a Metadata-Driven PII Masking solution in dbt 

The design goals were: 

| Goal | Approach |
| --- | --- |
| Zero changes to model SQL | Post-hook reads YAML metadata; model files unchanged |
| Single place to enable masking per folder | One +post-hook entry in dbt_project.yml |
| Masking only in non-production | Target-name guard inside the macro |
| Support multiple PII types | Pluggable mask_type strategy per column |
| Derived JSON columns inherit masking | Separate view reads from already-masked base table |
| Deterministic output | SHA2-based hashing, same input → same output |

The solution is three components: 

 macros/ 
  mask/ 
    pii_mask.sql            ← low-level masking expressions (SHA2, REGEXP_REPLACE, etc.) 
    apply_pii_masking.sql   ← post-hook orchestrator (reads meta, builds UPDATE statement) 
 
models/ 
  my_folder/ 
    my_model.sql            ← unchanged — no masking logic here 
    my_model.yml            ← pii_mask: email/alpha/phone/digits/date per column 
 
dbt_project.yml             ← +post-hook: "{{ apply_pii_masking() }}" for the folder 

When to use this approach

This metadata-driven PII masking approach is particularly effective in the following scenarios: 

  • You are using dbt with AWS Glue / Spark 

There is no native column-level masking, and no mature dbt package to rely on. 

  • You need realistic data in non-production environments 

Synthetic data breaks business logic, but exposing real PII is not acceptable. 

  • You want zero impact on model SQL 

Masking logic is fully externalised in YAML and macros, keeping transformations clean. 

  • You are working with Iceberg tables 

The solution relies on UPDATE support, which is available in Iceberg but not in Hive-style tables. 

  • You manage multiple models or domains 
    A single post-hook applied at the folder level scales automatically across projects. 

  

This approach may be less suitable if your platform provides native masking capabilities (e.g. Snowflake Dynamic Data Masking), where policy-based solutions are often simpler and more integrated. 

 

dbt Macros for PII Masking in Spark SQL 

Each masking strategy is a small, composable Jinja macro that returns a Spark SQL expression. All strategies are based on SHA-512 hashing (Spark's SHA2 function), ensuring that the masked value is: 

  • Deterministic: the same input always produces the same masked output, which preserves referential integrity across joined tables. 
  • Non-reversible: SHA2 is a one-way function; you cannot recover the original value. 
  • Format-preserving: email addresses keep their domain, phone numbers keep their area code prefix, names retain their original length. 
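These properties are easy to sanity-check outside Spark. The Python sketch below mirrors the SHA2 + REGEXP_REPLACE combination the macros use (hashlib stands in for Spark's SHA2; the function names are illustrative, not part of the dbt project):

```python
import hashlib
import re


def hash_alpha(value: str, seed: str = "") -> str:
    # Mirror of REGEXP_REPLACE(SHA2(value, 512), '[^a-z]', ''):
    # hash, then keep only the lowercase letters of the hex digest.
    digest = hashlib.sha512((value + seed).encode()).hexdigest()
    return re.sub(r"[^a-z]", "", digest)


def mask_email(value: str, seed: str = "") -> str:
    # Keep the domain, replace the local part with the last 20 hash letters.
    domain = value.split("@", 1)[1]
    return hash_alpha(value, seed)[-20:] + "@" + domain


masked = mask_email("john.smith@example.com")
assert masked == mask_email("john.smith@example.com")        # deterministic
assert masked.endswith("@example.com")                       # domain preserved
assert hash_alpha("x", "dev") != hash_alpha("x", "staging")  # seeded masks differ
```

The same input always yields the same mask, so joins on masked email columns still line up across tables.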

 

Low-level helpers

Both hash-based strategies share two helpers in pii_mask.sql. In sketch form (the seed, when supplied, is concatenated with the value before hashing): 

{% macro hash_alpha(column_name, seed=None) %} 
    REGEXP_REPLACE( 
        SHA2({% if seed %}CONCAT({{ column_name }}, '{{ seed }}'){% else %}{{ column_name }}{% endif %}, 512), 
        '[^a-z]', '' 
    ) 
{% endmacro %} 

{% macro hash_digits(column_name, seed=None) %} 
    REGEXP_REPLACE( 
        SHA2({% if seed %}CONCAT({{ column_name }}, '{{ seed }}'){% else %}{{ column_name }}{% endif %}, 512), 
        '[^0-9]', '' 
    ) 
{% endmacro %} 

SHA2(value, 512) is the Spark SQL function for SHA-512. REGEXP_REPLACE with character-class patterns strips everything except lowercase letters (for alpha) or digits (for digits). An optional seed can be injected to produce environment-scoped masks, useful when you want dev and staging to produce different outputs from the same source data. 

 

Masking strategies 

Email: replaces the local part (before the @) with a 20-character hash string while preserving the domain. A developer can see that an email address was stored; they just cannot tell whose it is. 

{% macro mask_email(column_name, seed=None) %} 
    CONCAT( 
        RIGHT({{ hash_alpha(column_name, seed) }}, 20), 
        '@', 
        SUBSTRING({{ column_name }}, LOCATE('@', {{ column_name }}) + 1) 
    ) 
{% endmacro %} 

Example output: john.smith@example.com → xkqvbtnzaefmlwcprhds@example.com 

Phone: preserves the first 4 characters (the Australian area code or mobile prefix) and replaces the remainder with hashed digits of the same count. 

{% macro mask_phone(column_name, seed=None) %} 
    CONCAT( 
        LEFT({{ column_name }}, 4), 
        RIGHT( 
            {{ hash_digits(column_name, seed) }}, 
            GREATEST(0, LENGTH(REGEXP_REPLACE({{ column_name }}, '[^0-9]', '')) - 4) 
        ) 
    ) 
{% endmacro %} 

Example output: 0412 345 678 → 0412836109 
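The same cross-check works for the phone strategy; here is a Python sketch of the equivalent logic (again with hashlib standing in for Spark's SHA2, as an illustration rather than the project's actual code):

```python
import hashlib
import re


def mask_phone(value: str, seed: str = "") -> str:
    # Count the digits in the original number; the first 4 characters stay visible.
    digit_count = len(re.sub(r"[^0-9]", "", value))
    # Mirror of REGEXP_REPLACE(SHA2(value, 512), '[^0-9]', '').
    hashed = re.sub(r"[^0-9]", "", hashlib.sha512((value + seed).encode()).hexdigest())
    tail_len = max(0, digit_count - 4)
    tail = hashed[-tail_len:] if tail_len else ""
    # Mirror of CONCAT(LEFT(value, 4), RIGHT(hash_digits(value), ...)).
    return value[:4] + tail


masked = mask_phone("0412 345 678")
assert masked.startswith("0412")   # prefix preserved
assert len(masked) == 10           # same digit count as the input
assert masked[4:].isdigit()        # remainder is hashed digits
```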

Alpha names: replaces with a hash-derived string of exactly the same length. The column stays non-null and retains its character count, so any downstream LENGTH() checks or UI display constraints remain valid. 

{% macro mask_alpha(column_name, seed=None) %} 
    RIGHT({{ hash_alpha(column_name, seed) }}, LENGTH({{ column_name }})) 
{% endmacro %} 

Example output: Alexandra → nfqbzlxkw 

Digits: same principle as alpha, for purely numeric fields. 

{% macro mask_digits(column_name, seed=None) %} 
    RIGHT({{ hash_digits(column_name, seed) }}, LENGTH({{ column_name }})) 
{% endmacro %} 

Date: replaces any non-null date with a fixed sentinel value (1900-01-01). NULL is preserved so that downstream NOT NULL tests still pass correctly. 

{% macro mask_date(column_name) %} 
    CASE WHEN {{ column_name }} IS NOT NULL 
         THEN DATE('1900-01-01') 
         ELSE NULL 
    END 
{% endmacro %} 

The dispatcher 

pii_mask() routes to the right strategy based on a mask_type string, the same value you’ll declare in YAML: 

{% macro pii_mask(column_name, mask_type='email', seed=None) %} 
{%- if   mask_type == 'email'  -%} {{ mask_email(column_name, seed)  }} 
{%- elif mask_type == 'phone'  -%} {{ mask_phone(column_name, seed)  }} 
{%- elif mask_type == 'alpha'  -%} {{ mask_alpha(column_name, seed)  }} 
{%- elif mask_type == 'digits' -%} {{ mask_digits(column_name, seed) }} 
{%- elif mask_type == 'date'   -%} {{ mask_date(column_name)         }} 
{%- endif -%} 
{% endmacro %} 

Inline convenience macro 

For use directly inside model SQL, handy when masking must happen before a derived column is computed (see the view pattern below): 

{% macro pii_mask_if_dev(column_name, mask_type, seed=None) %} 
{%- if target.name in ['prod', 'pre_prod'] -%} 
    {{ column_name }} 
{%- else -%} 
    {{ pii_mask(column_name, mask_type, seed) }} 
{%- endif -%} 
{% endmacro %} 

Usage in model SQL (use sparingly; the metadata-driven post-hook is the preferred mechanism): 

{{ pii_mask_if_dev('c.email', 'email') }}        AS contact_email, 
{{ pii_mask_if_dev('c.first_name', 'alpha') }}   AS given_name, 
{{ pii_mask_if_dev('c.date_of_birth', 'date') }} AS date_of_birth, 

The post-hook orchestrator

The post-hook macro is what makes the solution metadata-driven. It runs after a model is materialised, inspects the model's column metadata, and, if any column has a pii_mask value defined, emits a single UPDATE statement covering all PII columns at once. 

{% macro apply_pii_masking() %} 
  {% if target.name not in ['prod', 'pre_prod'] %} 

    {% set ns = namespace(has_pii=false, updates=[]) %} 

    {% for col_name, col in model.columns.items() %} 
      {% set mask_type = col.meta.get('pii_mask') %} 
      {% if mask_type %} 
        {% set expr = pii_mask(col_name, mask_type) %} 
        {% do ns.updates.append(col_name ~ ' = ' ~ expr) %} 
        {% set ns.has_pii = true %} 
      {% endif %} 
    {% endfor %} 

    {% if ns.has_pii %} 
      UPDATE {{ this }} 
      SET 
        {{ ns.updates | join(',\n        ') }} 
    {% endif %} 

  {% endif %} 
{% endmacro %} 

What this compiles to at runtime (for a model with contact_email, given_name, and date_of_birth columns): 

UPDATE gold.marts.tgt_my_customer_model 
SET 
  contact_email = CONCAT(RIGHT(REGEXP_REPLACE(SHA2(contact_email, 512), '[^a-z]', ''), 20), '@', SUBSTRING(contact_email, LOCATE('@', contact_email) + 1)), 
  given_name    = RIGHT(REGEXP_REPLACE(SHA2(given_name, 512), '[^a-z]', ''), LENGTH(given_name)), 
  date_of_birth = CASE WHEN date_of_birth IS NOT NULL THEN DATE('1900-01-01') ELSE NULL END 

Key design decisions: 

  • Single UPDATE, not one per column: minimises write amplification on S3-backed Iceberg tables, which is critical because each write operation can trigger expensive file rewrites under the hood. 
  • Target-name guard: the entire block is a no-op on prod and pre_prod. If a model is accidentally run against production, nothing is masked (which is the correct behaviour, production data should remain unmasked). 
  • model.columns: dbt exposes the model's column metadata when the hook's SQL is rendered. The post-hook reads from the same .yml that documents the model, so documentation and masking configuration are always in sync. 
  • No-op on views: if +post-hook is applied to a folder containing both tables and views, the macro generates no SQL for views (they have no pii_mask columns by convention), so there is no error. 
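The orchestrator's control flow is easy to reason about in isolation. A Python sketch of the same logic (the dict shapes are illustrative stand-ins for dbt's model.columns metadata, and the mask expressions are placeholders):

```python
def build_update_sql(table: str, columns: dict) -> str:
    """Mimic apply_pii_masking(): one UPDATE for all PII columns, or nothing."""
    assignments = [
        f"{name} = <{meta['pii_mask']}-mask-expr>"
        for name, meta in columns.items()
        if meta.get("pii_mask")
    ]
    if not assignments:
        return ""  # no annotated columns -> no SQL emitted (safe for views)
    return f"UPDATE {table} SET " + ", ".join(assignments)


columns = {
    "customer_id": {},                     # no pii_mask -> untouched
    "given_name": {"pii_mask": "alpha"},
    "email_address": {"pii_mask": "email"},
}

sql = build_update_sql("gold.marts.customer", columns)
assert sql.startswith("UPDATE gold.marts.customer SET ")
assert "given_name" in sql and "email_address" in sql
assert "customer_id" not in sql                        # unannotated columns skipped
assert build_update_sql("any_view", {"id": {}}) == ""  # the view no-op
```

The empty-result branch is what makes it safe to attach the post-hook to a whole folder: models without pii_mask annotations simply generate no SQL.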

 

Wiring it up: model YAML + dbt_project.yml 

Step 1 – Annotate columns in the model’s .yml 

No changes to the .sql file. Masking intent lives in the companion YAML: 

# models/marts/customer.yml 
version: 2 
 
models: 
  - name: customer 
    description: > 
      One row per customer. PII columns are masked on non-production targets 
      via the apply_pii_masking() post-hook configured in dbt_project.yml. 
 
    columns: 
      - name: customer_id 
        description: Primary key. 
        data_tests: 
          - not_null 
          - unique 
 
      - name: given_name 
        description: Customer first name. 
        meta: 
          pii_mask: alpha       # ← masking type declared here 
 
      - name: family_name 
        description: Customer last name. 
        meta: 
          pii_mask: alpha 
 
      - name: date_of_birth 
        description: Date of birth. Masked to 1900-01-01 on non-prod. 
        meta: 
          pii_mask: date 
 
      - name: email_address 
        description: Primary contact email. 
        meta: 
          pii_mask: email 
 
      - name: mobile_number 
        description: Mobile phone number. 
        meta: 
          pii_mask: phone

Step 2 – Enable the post-hook for the entire folder in dbt_project.yml 

No per-model configuration. One line applies masking to every model in the folder: 

# dbt_project.yml 
models: 
  my_project: 
    marts: 
      +database: gold 
      +schema: marts 
      +materialized: table 
      +file_format: iceberg 
      +post-hook: "{{ apply_pii_masking() }}"   # ← applied to all models in this folder

That’s it. Add a new model to the marts/ folder, annotate its columns in YAML, and masking is automatic on the next dbt build. 

 

Handling derived JSON columns – The view pattern 

A subtle challenge arises when a model includes a column derived from PII columns, most commonly a JSON payload column assembled with TO_JSON() or NAMED_STRUCT(). 

Consider this model: 

-- customer.sql 
SELECT 
    customer_id, 
    given_name, 
    email_address, 
    TO_JSON(NAMED_STRUCT( 
        'givenName', given_name, 
        'email',     email_address 
    )) AS api_payload_json 
FROM source_customers 

If apply_pii_masking() runs as a post-hook and updates given_name and email_address in place, the api_payload_json column is not regenerated; it still holds the pre-mask values. An UPDATE on a computed column is not a re-materialisation. 
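A tiny Python analogue makes the staleness concrete (a plain dict stands in for a table row; the values are illustrative):

```python
import json

# 1. Model materialisation: the payload is computed from unmasked values.
row = {"given_name": "Alexandra", "email_address": "alexandra@example.com"}
row["api_payload_json"] = json.dumps(
    {"givenName": row["given_name"], "email": row["email_address"]}
)

# 2. Post-hook: the UPDATE rewrites only the flat columns in place.
row["given_name"] = "nfqbzlxkw"

# 3. The derived column was never recomputed -> it still leaks the original.
assert row["given_name"] == "nfqbzlxkw"
assert "Alexandra" in row["api_payload_json"]
```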

The fix is to keep masking and data shaping as independent concerns, so that derived columns are only ever computed from already-masked data. This also simplifies maintenance and avoids subtle bugs when derived columns depend on sensitive fields. 

 

Solution: The view pattern 

Split the model into two objects: 

  1. Base table: flat columns only, no JSON. The post-hook masks these. 
  2. Companion view: reads from the (already masked) base table and assembles the JSON. 

-- customer.sql  (base table — flat columns only) 
SELECT 
    customer_id, 
    given_name, 
    email_address, 
    mobile_number, 
    date_of_birth 
FROM source_customers 

-- vw_customer_payload.sql  (view — reads masked base table) 
SELECT 
    customer_id, 
    TO_JSON(NAMED_STRUCT( 
        'givenName',    given_name, 
        'email',        email_address, 
        'dateOfBirth',  CAST(date_of_birth AS STRING) 
    )) AS api_payload_json 
FROM {{ ref('customer') }} 

Because the view reads from the base table via ref('customer'), it sees whatever is currently in that table at query time. After the post-hook has updated given_name and email_address, any query against vw_customer_payload produces fully masked JSON, with no extra masking logic in the view itself. 

The companion view should be configured with +materialized: view in dbt_project.yml (or its own .yml config) so that it never triggers the post-hook: 

 
models: 
  my_project: 
    marts: 
      +post-hook: "{{ apply_pii_masking() }}" 
      vw_customer_payload: 
        +materialized: view    # views are not writable; post-hook is a no-op 

End-to-end flow at a glance

dbt build --select +customer+ 
  │ 
  ├─ 1. Compile customer.sql 
  │       → SELECT customer_id, given_name, email_address, … FROM source 
  │ 
  ├─ 2. Materialise as Iceberg table (gold.marts.customer) 
  │       → rows contain real PII at this point 
  │ 
  ├─ 3. Post-hook: apply_pii_masking() runs 
  │       → reads meta.pii_mask from customer.yml 
  │       → emits UPDATE gold.marts.customer SET … 
  │       → SHA2-based expressions applied to given_name, email_address, etc. 
  │       → (no-op if target.name is 'prod' or 'pre_prod') 
  │ 
  └─ 4. Compile & materialise vw_customer_payload (view) 
          → SELECT … FROM customer 
          → api_payload_json reflects already-masked values automatically

Architecture overview 

dbt model build 
      │ 
      ▼ 
Iceberg table (raw PII data) 
      │ 
      ▼ 
Post-hook: apply_pii_masking() 
      │ 
      ▼ 
Masked Iceberg table 
      │ 
      ▼ 
View layer (JSON / derived columns) 
      │ 
      ▼ 
Safe consumption in dev / test 

Portability considerations

This implementation is intentionally Spark SQL / AWS Glue-specific. The SQL functions used (SHA2, REGEXP_REPLACE, LOCATE, DATE('…')) are Spark dialect and will not compile on other adapters without changes. 

For teams considering a similar approach on other platforms: 

| Platform | Native masking | dbt package | Macro portability |
| --- | --- | --- | --- |
| Snowflake | ✅ Dynamic Data Masking | ✅ dbt_snow_mask | Needs SHA2_HEX(), REGEXP_REPLACE(), CHARINDEX() equivalents |
| BigQuery | ✅ Column-level security | Partial (dbt-privacy) | SHA512(), REGEXP_REPLACE(), STRPOS() |
| Redshift | — | — | FUNC_SHA2(), REGEXP_REPLACE(), POSITION() |
| Athena (Trino) | — | — | sha512(), regexp_replace(), strpos(); UPDATE requires Iceberg |
| AWS Glue (Spark) | — | — | ✅ This implementation |

 

The UPDATE statement in the post-hook also requires a mutable table format. On Glue, this means Apache Iceberg. On standard Hive-style tables, UPDATE is not supported and the post-hook would fail. 


Why human expertise still fills the gap

There’s a tempting belief in the modern data stack that every problem has a package. Browse dbt Hub, find the library, add it to packages.yml, done. That belief is comfortable but increasingly dangerous as teams venture beyond the well-lit corridors of the Snowflake ecosystem. 

AWS Glue + dbt is a legitimate, production-grade combination used by organisations managing petabyte-scale data lakes on AWS. And yet, when it comes to PII masking, the tooling gap is real.  

There is no dbt-glue-masking package. There is no AWS Glue equivalent of Snowflake’s Dynamic Data Masking. The official dbt documentation on masking points at Snowflake examples. 

This is precisely where experienced engineers matter. The solution described in this article required: 

  • Deep knowledge of Spark SQL to write correct, performant expressions for each masking strategy. 
  • Understanding of the dbt macro system: Jinja namespaces, model.columns introspection, adapter.dispatch() patterns, and the post-hook lifecycle. 
  • Iceberg internals awareness: knowing that UPDATE is only possible on Iceberg (not Hive-style) tables, and understanding how S3-backed Iceberg handles row-level updates. 
  • Privacy engineering judgement: deciding what “masked” means for each data type: preserving email domain for realism, preserving phone area code for routing logic validity, using a sentinel date rather than a random one so that NOT NULL tests still pass. 
  • Architecture taste: the decision to separate the flat base table from the JSON payload view, so that masking and serialisation are independent concerns. 

 

No library provides that combination of judgement. A library might provide the skeleton; the engineer provides the rest. 

As data platforms become more complex and automation continues to accelerate, the engineers who understand the why behind their tools, and who can design fit-for-purpose solutions when those tools fall short, become increasingly valuable. 

A missing tool is not a blocker. It is often a signal that thoughtful engineering is required. 
