Metadata-Driven PII Masking in dbt on AWS Glue

TL;DR 
Implementing PII masking in dbt on AWS Glue isn’t straightforward, especially when working with Spark SQL and Iceberg tables. In this blog, I walk through a metadata-driven approach to masking sensitive data using dbt macros and post-hooks, without modifying model SQL. 

 

Introduction 

When building a dbt data pipeline, we must mask Personally Identifiable Information (PII) in non-production environments so that developers can work safely with realistic-looking data without ever touching real customer details. 
 
No mature library exists for dbt + Glue/Spark, so I engineered a clean, metadata-driven solution using native dbt Jinja macros and a single post-hook entry in dbt_project.yml.  
The result: zero changes to individual model SQL files, with masking declared purely in YAML. 


Why PII Masking in dbt is critical for Dev Environments 

Data pipelines built for large-scale migrations inevitably deal with sensitive customer information: names, email addresses, phone numbers, dates of birth.  

In a production environment, access controls and governance policies protect this data. But in development and testing environments, engineers need to run the full pipeline, including building and validating target tables, which means the data has to flow through. 

The naive solution (give dev engineers read access to the same production data) is a compliance and privacy risk. The equally naive opposite (use entirely synthetic data) often breaks model logic that depends on real-world value distributions, formats, and relationships. 

What I needed was a middle path: the pipeline runs on real data, but by the time a developer queries a dev-environment table, every PII column has been replaced with a deterministic, realistic-looking, but non-identifying value. The business logic stays valid, the privacy risk disappears. 

The constraint: the pipeline runs on AWS Glue using the dbt-glue adapter, with data stored as Apache Iceberg tables on S3 and catalogued in the AWS Glue Data Catalog. 

Why not just use an existing library?

Before writing a single line of Jinja, I evaluated the ecosystem. 

 

What exists 

Snowflake, for instance, has native Dynamic Data Masking, a first-class platform feature that attaches masking policies to columns at the warehouse level. On top of that, the community package dbt_snow_mask wraps those policies into dbt-idiomatic YAML configuration. It’s polished, well-documented, and broadly adopted. 

 

What exists for Spark/Glue 

Nothing equivalent. Spark SQL has no native column-level masking policies. The dbt-glue adapter has no masking extension. A search of dbt Hub yields no Spark-specific PII package with meaningful adoption.  

The closest conceptual match is dbt-privacy, a community project that follows the same meta-driven macro philosophy, but it targets the Snowflake and Postgres SQL dialects and is not production-ready for Spark. 

 

The idiomatic approach 

The pattern that does exist across the dbt ecosystem, and that projects like dbt-privacy follow, is: 

  • Annotate columns in the model’s .yml file with a meta.pii_mask tag. 
  • Implement masking as plain Jinja macros that emit adapter-specific SQL expressions. 
  • Invoke those macros via a post-hook that reads the metadata at compile time. 

This is exactly what I built. The difference is that I implemented the SQL expressions in the Spark dialect, targeting our actual runtime. 

The lesson: a library gap is not an excuse to skip the feature; it is an opportunity to engineer the right solution for your stack. 


Designing a Metadata-Driven PII Masking solution in dbt 

The design goals were: 

| Goal | Approach |
| --- | --- |
| Zero changes to model SQL | Post-hook reads YAML metadata; model files unchanged |
| Single place to enable masking per folder | One +post-hook entry in dbt_project.yml |
| Masking only in non-production | Target-name guard inside the macro |
| Support multiple PII types | Pluggable mask_type strategy per column |
| Derived JSON columns inherit masking | Separate view reads from already-masked base table |
| Deterministic output | SHA2-based hashing, same input → same output |

The solution is three components: 

 macros/ 
  mask/ 
    pii_mask.sql            ← low-level masking expressions (SHA2, REGEXP_REPLACE, etc.) 
    apply_pii_masking.sql   ← post-hook orchestrator (reads meta, builds UPDATE statement) 
 
models/ 
  my_folder/ 
    my_model.sql            ← unchanged — no masking logic here 
    my_model.yml            ← pii_mask: email/alpha/phone/digits/date per column 
 
dbt_project.yml             ← +post-hook: "{{ apply_pii_masking() }}" for the folder 

When to use this approach

This metadata-driven PII masking approach is particularly effective in the following scenarios: 

  • You are using dbt with AWS Glue / Spark 

There is no native column-level masking, and no mature dbt package to rely on. 

  • You need realistic data in non-production environments 

Synthetic data breaks business logic, but exposing real PII is not acceptable. 

  • You want zero impact on model SQL 

Masking logic is fully externalised in YAML and macros, keeping transformations clean. 

  • You are working with Iceberg tables 

The solution relies on UPDATE support, which is available in Iceberg but not in Hive-style tables. 

  • You manage multiple models or domains 
    A single post-hook applied at the folder level scales automatically across projects. 

  

This approach may be less suitable if your platform provides native masking capabilities (e.g. Snowflake Dynamic Data Masking), where policy-based solutions are often simpler and more integrated. 

 

dbt Macros for PII Masking in Spark SQL 

Each masking strategy is a small, composable Jinja macro that returns a Spark SQL expression. All strategies are based on SHA-512 hashing (Spark's SHA2 function), ensuring that the masked value is: 

  • Deterministic: the same input always produces the same masked output, which preserves referential integrity across joined tables. 
  • Non-reversible: SHA2 is a one-way function; you cannot recover the original value. 
  • Format-preserving: email addresses keep their domain, phone numbers keep their area code prefix, names retain their original length. 
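These properties are easy to sanity-check outside Spark. The Python sketch below mirrors the SHA2 + REGEXP_REPLACE combination the macros use (hashlib stands in for Spark's SHA2; the function names are illustrative, not part of the dbt project):

```python
import hashlib
import re


def hash_alpha(value: str, seed: str = "") -> str:
    # Mirror of REGEXP_REPLACE(SHA2(value, 512), '[^a-z]', ''):
    # hash, then keep only the lowercase letters of the hex digest.
    digest = hashlib.sha512((value + seed).encode()).hexdigest()
    return re.sub(r"[^a-z]", "", digest)


def mask_email(value: str, seed: str = "") -> str:
    # Keep the domain, replace the local part with the last 20 hash letters.
    domain = value.split("@", 1)[1]
    return hash_alpha(value, seed)[-20:] + "@" + domain


masked = mask_email("john.smith@example.com")
assert masked == mask_email("john.smith@example.com")        # deterministic
assert masked.endswith("@example.com")                       # domain preserved
assert hash_alpha("x", "dev") != hash_alpha("x", "staging")  # seeded masks differ
```

The same input always yields the same mask, so joins on masked email columns still line up across tables.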

 

Low-level helpers

Both hash-based strategies share two helpers in pii_mask.sql. In sketch form (the seed, when supplied, is concatenated with the value before hashing): 

{% macro hash_alpha(column_name, seed=None) %} 
    REGEXP_REPLACE( 
        SHA2({% if seed %}CONCAT({{ column_name }}, '{{ seed }}'){% else %}{{ column_name }}{% endif %}, 512), 
        '[^a-z]', '' 
    ) 
{% endmacro %} 

{% macro hash_digits(column_name, seed=None) %} 
    REGEXP_REPLACE( 
        SHA2({% if seed %}CONCAT({{ column_name }}, '{{ seed }}'){% else %}{{ column_name }}{% endif %}, 512), 
        '[^0-9]', '' 
    ) 
{% endmacro %} 

SHA2(value, 512) is the Spark SQL function for SHA-512. REGEXP_REPLACE with character-class patterns strips everything except lowercase letters (for alpha) or digits (for digits). An optional seed can be injected to produce environment-scoped masks, useful when you want dev and staging to produce different outputs from the same source data. 

 

Masking strategies 

Email: replaces the local part (before the @) with a 20-character hash string while preserving the domain. A developer can see that an email address was stored; they just cannot tell whose it is. 

{% macro mask_email(column_name, seed=None) %} 
    CONCAT( 
        RIGHT({{ hash_alpha(column_name, seed) }}, 20), 
        '@', 
        SUBSTRING({{ column_name }}, LOCATE('@', {{ column_name }}) + 1) 
    ) 
{% endmacro %} 

Example output: john.smith@example.com → xkqvbtnzaefmlwcprhds@example.com 

Phone: preserves the first 4 characters (the Australian area code or mobile prefix) and replaces the remainder with hashed digits of the same count. 

{% macro mask_phone(column_name, seed=None) %} 
    CONCAT( 
        LEFT({{ column_name }}, 4), 
        RIGHT( 
            {{ hash_digits(column_name, seed) }}, 
            GREATEST(0, LENGTH(REGEXP_REPLACE({{ column_name }}, '[^0-9]', '')) - 4) 
        ) 
    ) 
{% endmacro %} 

Example output: 0412 345 678 → 0412836109 
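The same cross-check works for the phone strategy; here is a Python sketch of the equivalent logic (again with hashlib standing in for Spark's SHA2, as an illustration rather than the project's actual code):

```python
import hashlib
import re


def mask_phone(value: str, seed: str = "") -> str:
    # Count the digits in the original number; the first 4 characters stay visible.
    digit_count = len(re.sub(r"[^0-9]", "", value))
    # Mirror of REGEXP_REPLACE(SHA2(value, 512), '[^0-9]', '').
    hashed = re.sub(r"[^0-9]", "", hashlib.sha512((value + seed).encode()).hexdigest())
    tail_len = max(0, digit_count - 4)
    tail = hashed[-tail_len:] if tail_len else ""
    # Mirror of CONCAT(LEFT(value, 4), RIGHT(hash_digits(value), ...)).
    return value[:4] + tail


masked = mask_phone("0412 345 678")
assert masked.startswith("0412")   # prefix preserved
assert len(masked) == 10           # same digit count as the input
assert masked[4:].isdigit()        # remainder is hashed digits
```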

Alpha names: replaces with a hash-derived string of exactly the same length. The column stays non-null and retains its character count, so any downstream LENGTH() checks or UI display constraints remain valid. 

{% macro mask_alpha(column_name, seed=None) %} 
    RIGHT({{ hash_alpha(column_name, seed) }}, LENGTH({{ column_name }})) 
{% endmacro %} 

Example output: Alexandra → nfqbzlxkw 

Digits: same principle as alpha, for purely numeric fields. 

{% macro mask_digits(column_name, seed=None) %} 
    RIGHT({{ hash_digits(column_name, seed) }}, LENGTH({{ column_name }})) 
{% endmacro %} 

Date: replaces any non-null date with a fixed sentinel value (1900-01-01). NULL is preserved so that downstream NOT NULL tests still pass correctly. 

{% macro mask_date(column_name) %} 
    CASE WHEN {{ column_name }} IS NOT NULL 
         THEN DATE('1900-01-01') 
         ELSE NULL 
    END 
{% endmacro %} 

The dispatcher 

pii_mask() routes to the right strategy based on a mask_type string, the same value you’ll declare in YAML: 

{% macro pii_mask(column_name, mask_type='email', seed=None) %} 
{%- if   mask_type == 'email'  -%} {{ mask_email(column_name, seed)  }} 
{%- elif mask_type == 'phone'  -%} {{ mask_phone(column_name, seed)  }} 
{%- elif mask_type == 'alpha'  -%} {{ mask_alpha(column_name, seed)  }} 
{%- elif mask_type == 'digits' -%} {{ mask_digits(column_name, seed) }} 
{%- elif mask_type == 'date'   -%} {{ mask_date(column_name)         }} 
{%- endif -%} 
{% endmacro %} 

Inline convenience macro 

For use directly inside model SQL, handy when masking must happen before a derived column is computed (see the view pattern below): 

{% macro pii_mask_if_dev(column_name, mask_type, seed=None) %} 
{%- if target.name in ['prod', 'pre_prod'] -%} 
    {{ column_name }} 
{%- else -%} 
    {{ pii_mask(column_name, mask_type, seed) }} 
{%- endif -%} 
{% endmacro %} 

Usage in model SQL (use sparingly; the metadata-driven post-hook is the preferred mechanism): 

{{ pii_mask_if_dev('c.email', 'email') }}        AS contact_email, 
{{ pii_mask_if_dev('c.first_name', 'alpha') }}   AS given_name, 
{{ pii_mask_if_dev('c.date_of_birth', 'date') }} AS date_of_birth, 

The post-hook orchestrator

The post-hook macro is what makes the solution metadata-driven. It runs after a model is materialised, inspects the model's column metadata, and, if any column has a pii_mask value defined, emits a single UPDATE statement covering all PII columns at once. 

{% macro apply_pii_masking() %} 
  {% if target.name not in ['prod', 'pre_prod'] %} 

    {% set ns = namespace(has_pii=false, updates=[]) %} 

    {% for col_name, col in model.columns.items() %} 
      {% set mask_type = col.meta.get('pii_mask') %} 
      {% if mask_type %} 
        {% set expr = pii_mask(col_name, mask_type) %} 
        {% do ns.updates.append(col_name ~ ' = ' ~ expr) %} 
        {% set ns.has_pii = true %} 
      {% endif %} 
    {% endfor %} 

    {% if ns.has_pii %} 
      UPDATE {{ this }} 
      SET 
        {{ ns.updates | join(',\n        ') }} 
    {% endif %} 

  {% endif %} 
{% endmacro %} 

What this compiles to at runtime (for a model with contact_email, given_name, and date_of_birth columns): 

UPDATE gold.marts.tgt_my_customer_model 
SET 
  contact_email = CONCAT(RIGHT(REGEXP_REPLACE(SHA2(contact_email, 512), '[^a-z]', ''), 20), '@', SUBSTRING(contact_email, LOCATE('@', contact_email) + 1)), 
  given_name    = RIGHT(REGEXP_REPLACE(SHA2(given_name, 512), '[^a-z]', ''), LENGTH(given_name)), 
  date_of_birth = CASE WHEN date_of_birth IS NOT NULL THEN DATE('1900-01-01') ELSE NULL END 

Key design decisions: 

  • Single UPDATE, not one per column: minimises write amplification on S3-backed Iceberg tables, which is critical because each write operation can trigger expensive file rewrites under the hood. 
  • Target-name guard: the entire block is a no-op on prod and pre_prod. If a model is accidentally run against production, nothing is masked (which is the correct behaviour, production data should remain unmasked). 
  • model.columns: dbt exposes the model's column metadata when the hook's SQL is rendered. The post-hook reads from the same .yml that documents the model, so documentation and masking configuration are always in sync. 
  • No-op on views: if +post-hook is applied to a folder containing both tables and views, the macro generates no SQL for views (they have no pii_mask columns by convention), so there is no error. 
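The orchestrator's control flow is easy to reason about in isolation. A Python sketch of the same logic (the dict shapes are illustrative stand-ins for dbt's model.columns metadata, and the mask expressions are placeholders):

```python
def build_update_sql(table: str, columns: dict) -> str:
    """Mimic apply_pii_masking(): one UPDATE for all PII columns, or nothing."""
    assignments = [
        f"{name} = <{meta['pii_mask']}-mask-expr>"
        for name, meta in columns.items()
        if meta.get("pii_mask")
    ]
    if not assignments:
        return ""  # no annotated columns -> no SQL emitted (safe for views)
    return f"UPDATE {table} SET " + ", ".join(assignments)


columns = {
    "customer_id": {},                     # no pii_mask -> untouched
    "given_name": {"pii_mask": "alpha"},
    "email_address": {"pii_mask": "email"},
}

sql = build_update_sql("gold.marts.customer", columns)
assert sql.startswith("UPDATE gold.marts.customer SET ")
assert "given_name" in sql and "email_address" in sql
assert "customer_id" not in sql                        # unannotated columns skipped
assert build_update_sql("any_view", {"id": {}}) == ""  # the view no-op
```

The empty-result branch is what makes it safe to attach the post-hook to a whole folder: models without pii_mask annotations simply generate no SQL.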

 

Wiring it up: model YAML + dbt_project.yml 

Step 1 – Annotate columns in the model’s .yml 

No changes to the .sql file. Masking intent lives in the companion YAML: 

# models/marts/customer.yml 
version: 2 
 
models: 
  - name: customer 
    description: > 
      One row per customer. PII columns are masked on non-production targets 
      via the apply_pii_masking() post-hook configured in dbt_project.yml. 
 
    columns: 
      - name: customer_id 
        description: Primary key. 
        data_tests: 
          - not_null 
          - unique 
 
      - name: given_name 
        description: Customer first name. 
        meta: 
          pii_mask: alpha       # ← masking type declared here 
 
      - name: family_name 
        description: Customer last name. 
        meta: 
          pii_mask: alpha 
 
      - name: date_of_birth 
        description: Date of birth. Masked to 1900-01-01 on non-prod. 
        meta: 
          pii_mask: date 
 
      - name: email_address 
        description: Primary contact email. 
        meta: 
          pii_mask: email 
 
      - name: mobile_number 
        description: Mobile phone number. 
        meta: 
          pii_mask: phone

Step 2 – Enable the post-hook for the entire folder in dbt_project.yml 

No per-model configuration. One line applies masking to every model in the folder: 

# dbt_project.yml 
models: 
  my_project: 
    marts: 
      +database: gold 
      +schema: marts 
      +materialized: table 
      +file_format: iceberg 
      +post-hook: "{{ apply_pii_masking() }}"   # ← applied to all models in this folder

That’s it. Add a new model to the marts/ folder, annotate its columns in YAML, and masking is automatic on the next dbt build. 

 

Handling derived JSON columns – The view pattern 

A subtle challenge arises when a model includes a column derived from PII columns, most commonly a JSON payload column assembled with TO_JSON() or NAMED_STRUCT(). 

Consider this model: 

-- customer.sql 
SELECT 
    customer_id, 
    given_name, 
    email_address, 
    TO_JSON(NAMED_STRUCT( 
        'givenName', given_name, 
        'email',     email_address 
    )) AS api_payload_json 
FROM source_customers 

If apply_pii_masking() runs as a post-hook and updates given_name and email_address in place, the api_payload_json column is not regenerated; it still holds the pre-mask values. An UPDATE on a computed column is not a re-materialisation. 
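A tiny Python analogue makes the staleness concrete (a plain dict stands in for a table row; the values are illustrative):

```python
import json

# 1. Model materialisation: the payload is computed from unmasked values.
row = {"given_name": "Alexandra", "email_address": "alexandra@example.com"}
row["api_payload_json"] = json.dumps(
    {"givenName": row["given_name"], "email": row["email_address"]}
)

# 2. Post-hook: the UPDATE rewrites only the flat columns in place.
row["given_name"] = "nfqbzlxkw"

# 3. The derived column was never recomputed -> it still leaks the original.
assert row["given_name"] == "nfqbzlxkw"
assert "Alexandra" in row["api_payload_json"]
```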

The fix is to keep masking and data shaping as independent concerns, so that derived columns are only ever computed from already-masked data. This also simplifies maintenance and avoids subtle bugs when derived columns depend on sensitive fields. 

 

Solution: The view pattern 

Split the model into two objects: 

  1. Base table: flat columns only, no JSON. The post-hook masks these. 
  2. Companion view: reads from the (already masked) base table and assembles the JSON. 

-- customer.sql  (base table — flat columns only) 
SELECT 
    customer_id, 
    given_name, 
    email_address, 
    mobile_number, 
    date_of_birth 
FROM source_customers 

-- vw_customer_payload.sql  (view — reads masked base table) 
SELECT 
    customer_id, 
    TO_JSON(NAMED_STRUCT( 
        'givenName',    given_name, 
        'email',        email_address, 
        'dateOfBirth',  CAST(date_of_birth AS STRING) 
    )) AS api_payload_json 
FROM {{ ref('customer') }} 

Because the view reads from the base table via ref('customer'), it sees whatever is currently in that table at query time. After the post-hook has updated given_name and email_address, any query against vw_customer_payload produces fully masked JSON, with no extra masking logic in the view itself. 

The companion view should be configured with +materialized: view in dbt_project.yml (or its own .yml config) so that it never triggers the post-hook: 

 
models: 
  my_project: 
    marts: 
      +post-hook: "{{ apply_pii_masking() }}" 
      vw_customer_payload: 
        +materialized: view    # views are not writable; post-hook is a no-op 

End-to-end flow at a glance

dbt build --select +customer+ 
  │ 
  ├─ 1. Compile customer.sql 
  │       → SELECT customer_id, given_name, email_address, … FROM source 
  │ 
  ├─ 2. Materialise as Iceberg table (gold.marts.customer) 
  │       → rows contain real PII at this point 
  │ 
  ├─ 3. Post-hook: apply_pii_masking() runs 
  │       → reads meta.pii_mask from customer.yml 
  │       → emits UPDATE gold.marts.customer SET … 
  │       → SHA2-based expressions applied to given_name, email_address, etc. 
  │       → (no-op if target.name is 'prod' or 'pre_prod') 
  │ 
  └─ 4. Compile & materialise vw_customer_payload (view) 
          → SELECT … FROM customer 
          → api_payload_json reflects already-masked values automatically

Architecture overview 

dbt model build 
      │ 
      ▼ 
Iceberg table (raw PII data) 
      │ 
      ▼ 
Post-hook: apply_pii_masking() 
      │ 
      ▼ 
Masked Iceberg table 
      │ 
      ▼ 
View layer (JSON / derived columns) 
      │ 
      ▼ 
Safe consumption in dev / test 

Portability considerations

This implementation is intentionally Spark SQL / AWS Glue-specific. The SQL functions used (SHA2, REGEXP_REPLACE, LOCATE, DATE('…')) are Spark dialect and will not compile on other adapters without changes. 

For teams considering a similar approach on other platforms: 

| Platform | Native masking | dbt package | Macro portability |
| --- | --- | --- | --- |
| Snowflake | ✅ Dynamic Data Masking | ✅ dbt_snow_mask | Needs SHA2_HEX(), REGEXP_REPLACE(), CHARINDEX() equivalents |
| BigQuery | ✅ Column-level security | Partial (dbt-privacy) | SHA512(), REGEXP_REPLACE(), STRPOS() |
| Redshift | — | — | FUNC_SHA2(), REGEXP_REPLACE(), POSITION() |
| Athena (Trino) | — | — | sha512(), regexp_replace(), strpos(); UPDATE requires Iceberg |
| AWS Glue (Spark) | — | — | ✅ This implementation |

 

The UPDATE statement in the post-hook also requires a mutable table format. On Glue, this means Apache Iceberg. On standard Hive-style tables, UPDATE is not supported and the post-hook would fail. 


Why human expertise still fills the gap

There’s a tempting belief in the modern data stack that every problem has a package. Browse dbt Hub, find the library, add it to packages.yml, done. That belief is comfortable but increasingly dangerous as teams venture beyond the well-lit corridors of the Snowflake ecosystem. 

AWS Glue + dbt is a legitimate, production-grade combination used by organisations managing petabyte-scale data lakes on AWS. And yet, when it comes to PII masking, the tooling gap is real.  

There is no dbt-glue-masking package. There is no AWS Glue equivalent of Snowflake’s Dynamic Data Masking. The official dbt documentation on masking points at Snowflake examples. 

This is precisely where experienced engineers matter. The solution described in this article required: 

  • Deep knowledge of Spark SQL to write correct, performant expressions for each masking strategy. 
  • Understanding of the dbt macro system: Jinja namespaces, model.columns introspection, adapter.dispatch() patterns, and the post-hook lifecycle. 
  • Iceberg internals awareness: knowing that UPDATE is only possible on Iceberg (not Hive-style) tables, and understanding how S3-backed Iceberg handles row-level updates. 
  • Privacy engineering judgement: deciding what “masked” means for each data type: preserving email domain for realism, preserving phone area code for routing logic validity, using a sentinel date rather than a random one so that NOT NULL tests still pass. 
  • Architecture taste: the decision to separate the flat base table from the JSON payload view, so that masking and serialisation are independent concerns. 

 

No library provides that combination of judgement. A library might provide the skeleton; the engineer provides the rest. 

As data platforms become more complex and automation continues to accelerate, the engineers who understand the why behind their tools, and who can design fit-for-purpose solutions when those tools fall short, become increasingly valuable. 

A missing tool is not a blocker. It is often a signal that thoughtful engineering is required. 
