Engineering May 5, 2025

Using dbt Metadata to Build a GDPR-Compliant Lineage Graph

If your organisation uses dbt for data transformation, you already have a lineage graph. The manifest.json artifact that dbt writes whenever it parses or compiles the project contains a complete directed acyclic graph of every model, its upstream sources, its documented columns, and its downstream dependencies. What it does not contain is any annotation about which nodes in that graph process personal data, and that is the gap that creates GDPR compliance risk.

This article walks through what dbt metadata provides, how to extract field-level lineage from it, and how to bind that lineage to GDPR obligations through a policy layer — including keeping it current as the dbt project evolves.

What dbt metadata gives you

The dbt manifest.json contains, for every model node: the compiled SQL (once the project has been compiled), the columns declared in schema YAML, upstream source references, declared tests, and any tags or descriptions added by the model author. The catalog.json artifact (generated by dbt docs generate) adds warehouse-resolved column names and data types, plus table statistics such as row counts where the adapter reports them. Together, these two artifacts provide the structural metadata needed to reconstruct a field-level lineage graph without querying the underlying data.

For GDPR purposes, this feeds directly into the data-flow documentation behind your Article 30 record of processing activities (ROPA). If you annotate each model node with the personal data categories it processes, you have a machine-readable lineage graph that is regenerated on every dbt build and can answer: which personal data fields originate from which sources, pass through which transformations, and land in which downstream tables or exports.

Extracting field-level lineage from manifest.json

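The manifest.json schema is documented and versioned, but it is not frozen: the schema version recorded in metadata.dbt_schema_version has changed between dbt releases, so an extraction script should check it rather than assume a fixed layout. The key structure for lineage extraction is nodes[node_id].depends_on.nodes, a list of unique_ids for the upstream models and sources of each model. Models are keyed under the top-level nodes object and sources under the top-level sources object. Walking this graph recursively from any downstream model produces the full lineage chain back to raw source tables.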
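The recursive walk described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the project name shop and the three-node manifest fragment are hypothetical, and a real script would load the dictionary from target/manifest.json with json.load.

```python
from typing import Set


def upstream_lineage(manifest: dict, node_id: str) -> Set[str]:
    """Return the unique_ids of every node upstream of node_id."""
    # Models live under "nodes" and sources under "sources"; merge both
    # so that a single lookup resolves any unique_id.
    lookup = {**manifest.get("nodes", {}), **manifest.get("sources", {})}
    seen: Set[str] = set()
    stack = [node_id]
    while stack:
        current = stack.pop()
        # Source entries carry no depends_on block, hence the defaults.
        parents = lookup.get(current, {}).get("depends_on", {}).get("nodes", [])
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical manifest fragment, reduced to the keys the walk needs.
manifest = {
    "nodes": {
        "model.shop.stg_crm__contacts": {
            "depends_on": {"nodes": ["source.shop.crm.contacts"]}
        },
        "model.shop.mart_marketing__campaign_targets": {
            "depends_on": {"nodes": ["model.shop.stg_crm__contacts"]}
        },
    },
    "sources": {"source.shop.crm.contacts": {}},
}

print(sorted(upstream_lineage(
    manifest, "model.shop.mart_marketing__campaign_targets")))
```

An explicit stack is used instead of recursion so that deep projects do not hit Python's recursion limit; the seen set keeps shared ancestors from being visited twice.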

For GDPR field-level tracing, the extraction combines three manifest structures: the node dependency graph from depends_on.nodes; column metadata from nodes[node_id].columns (names, descriptions, and data types where declared); and any meta fields or tags applied at the model or column level. Column-to-column edges are not stored in the manifest; they have to be derived by parsing each node's compiled SQL. The resulting per-column lineage trace answers: where does this email address column originate, what SQL transformations does it pass through, and in which downstream model endpoints does a derived version land?

A practical extraction script walks the dependency graph in topological order, propagating column-level annotations downstream. A column tagged gdpr_category: email in a source node will appear as a propagated annotation in every downstream model that selects it — unless an intermediate transformation drops or irreversibly anonymises it, which the SQL analysis can detect in most cases.
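A minimal sketch of that propagation step follows, assuming the column-to-column edges have already been derived by some SQL analysis. The column_parents mapping, the anonymised set, and the mart_marketing__campaign_stats model are hypothetical inputs invented for illustration; none of them is a structure dbt provides.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def propagate(column_parents, source_tags, anonymised=frozenset()):
    """Push personal data categories downstream in topological order.

    column_parents: {(node, column): [(parent_node, parent_column), ...]}
    source_tags:    {(node, column): {"email", ...}} declared annotations
    anonymised:     columns whose transformation irreversibly anonymises
    """
    # Copy the declared tags so the caller's sets are never mutated.
    tags = {column: set(cats) for column, cats in source_tags.items()}
    # static_order() yields every parent before any of its children.
    for column in TopologicalSorter(column_parents).static_order():
        if column in anonymised:
            tags[column] = set()  # the chain of annotations stops here
            continue
        inherited = tags.setdefault(column, set())
        for parent in column_parents.get(column, []):
            inherited |= tags.get(parent, set())
    return tags


# Hypothetical column edges, as a SQL parser might report them.
column_parents = {
    ("stg_crm__contacts", "email"): [("crm.contacts", "email")],
    ("mart_marketing__campaign_targets", "recipient_email"): [
        ("stg_crm__contacts", "email")
    ],
    ("mart_marketing__campaign_stats", "recipient_count"): [
        ("stg_crm__contacts", "email")
    ],
}
source_tags = {("crm.contacts", "email"): {"email"}}
anonymised = {("mart_marketing__campaign_stats", "recipient_count")}

tags = propagate(column_parents, source_tags, anonymised)
for column, categories in sorted(tags.items()):
    print(column, sorted(categories))
```

Whether a given transformation really anonymises a column is a judgement the sketch takes as input. Treating an aggregate as anonymised here is only an example, not a claim that every aggregate qualifies.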

The dbt manifest does not know which columns are personal data. That knowledge has to come from somewhere else — either a warehouse classification system that resolves against the same column names, or manual annotations in dbt schema YAML files. The annotation layer is the compliance work; the lineage extraction is the automation.

Binding lineage to GDPR obligations

Lineage metadata on its own does not answer the compliance question. The compliance question is: does the processing this lineage represents have a documented legal basis, is it reflected in the ROPA, and does it respect the retention schedule for the data category involved?

Binding lineage to obligations requires a second layer: a policy definition that maps data categories to legal bases and retention rules. This is the privacy officer’s domain. The data engineer can establish that stg_crm__contacts.email flows into mart_marketing__campaign_targets.recipient_email. The privacy officer determines whether marketing email processing on the basis of legitimate interest (Article 6(1)(f)) or contract performance (Article 6(1)(b)) is documented and proportionate, and, where legitimate interest is relied on, defensible under the Article 6(1)(f) balancing test.

The operative compliance state per lineage path is: (a) personal data category; (b) declared legal basis; (c) processing purpose; (d) retention window; (e) cross-border transfer flag and mechanism if applicable. A lineage path with all five fields populated and reviewed can be treated as documented; a lineage path missing any field is an open compliance item. Once source columns are annotated, the manifest extraction propagates (a) automatically; the rest requires human input structured through the policy layer.
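One way to represent that five-field state is a small record type with a function that lists what is still missing. The class and field names below are assumptions made for illustration, not part of dbt or of any GDPR tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LineagePathPolicy:
    """Compliance state for one lineage path (hypothetical structure)."""
    data_category: Optional[str] = None       # (a)
    legal_basis: Optional[str] = None         # (b)
    purpose: Optional[str] = None             # (c)
    retention_days: Optional[int] = None      # (d)
    cross_border: Optional[bool] = None       # (e) transfer flag
    transfer_mechanism: Optional[str] = None  # (e) needed only if flagged
    reviewed: bool = False


REQUIRED = ("data_category", "legal_basis", "purpose",
            "retention_days", "cross_border")


def open_items(policy: LineagePathPolicy) -> List[str]:
    """List the fields that keep this path an open compliance item."""
    missing = [name for name in REQUIRED if getattr(policy, name) is None]
    if policy.cross_border and not policy.transfer_mechanism:
        missing.append("transfer_mechanism")
    if not policy.reviewed:
        missing.append("review")
    return missing


# Only the category is known, as it would be straight after extraction.
extracted = LineagePathPolicy(data_category="email")
print(open_items(extracted))

# The same path after the privacy officer has completed and reviewed it.
completed = LineagePathPolicy(
    data_category="email", legal_basis="legitimate_interest",
    purpose="marketing", retention_days=730,
    cross_border=False, reviewed=True,
)
print(open_items(completed))
```

An empty list only means the record is complete and reviewed; whether the recorded legal basis actually holds remains a human judgement.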

The dbt CI gate: compliance in the pipeline

The highest-value architectural change for GDPR compliance in a dbt workflow is adding a compliance gate to the CI/CD pipeline. The pattern: on every pull request that modifies a dbt model, the CI job compiles the manifest and compares the new model’s column lineage against the personal data annotation registry. If the PR introduces a new column path that carries a personal data annotation from upstream, the CI job fails with a compliance review required status and notifies the privacy officer.
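The comparison at the heart of such a gate could look like the sketch below. It assumes the propagated annotations for the base branch and for the pull request have already been computed as dictionaries keyed by (model, column); the function names and the dim_customers model are hypothetical, and notifying the privacy officer is left to whatever mechanism the CI system provides.

```python
def new_personal_data_columns(base, head):
    """Return columns that gain a personal data category in the PR.

    base, head: {(model, column): set_of_categories}
    """
    findings = {}
    for column, categories in head.items():
        added = set(categories) - set(base.get(column, set()))
        if added:
            findings[column] = added
    return findings


def gate(base, head) -> int:
    """Print the findings and return a process exit code for CI."""
    findings = new_personal_data_columns(base, head)
    for (model, column), categories in sorted(findings.items()):
        print(f"compliance review required: "
              f"{model}.{column} carries {sorted(categories)}")
    return 1 if findings else 0


# Hypothetical annotation sets for the base branch and the pull request.
base = {("stg_crm__contacts", "email"): {"email"}}
head = {
    ("stg_crm__contacts", "email"): {"email"},
    ("mart_marketing__campaign_targets", "recipient_email"): {"email"},
    ("dim_customers", "customer_segment"): set(),
}

exit_code = gate(base, head)
# In a CI job the script would end with: sys.exit(exit_code)
```

Returning an exit code rather than raising keeps the function easy to test; a non-zero code is what most CI systems treat as a failed check.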

This turns GDPR review from a retrospective audit into a proactive gate. The data engineer gets immediate feedback before a new pipeline ships to production. The privacy officer reviews only the changes that introduce new personal data processing, not the entire lineage graph. The compliance record shows that new processing activities were reviewed at creation time, which supports the Article 5(2) accountability requirement prospectively.

The CI gate approach also prevents a common failure mode: analytics engineers creating derived features from personal data columns for model performance reasons, without realising the derived column counts as personal data processing under GDPR because it is still reasonably linkable to an individual. The annotation propagation logic catches this automatically.

Schema migrations and lineage currency

dbt projects in active development change rapidly. Models are refactored, sources are deprecated, new staging layers are added. Every change updates the manifest. If the compliance lineage graph is generated from a single manifest snapshot taken at an audit date, it ages at the pace of dbt development — which in active teams means meaningful divergence within weeks.

Operational lineage currency requires archiving manifest artifacts on every CI build and ingesting the diff into the compliance graph. The diff view — which column paths were added, modified, or removed since the last reviewed manifest — is the unit of compliance review, not the full graph. Most builds produce zero personal data changes; the privacy officer’s attention is required only on builds where the diff touches annotated columns.
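A manifest diff of this kind can be sketched as follows. The example compares only annotations declared in column meta under a gdpr_category key (the key name used in this article), so propagated annotations would need the propagation step to run first; the two manifest fragments are hypothetical.

```python
def annotated_columns(manifest: dict) -> dict:
    """Collect {(unique_id, column): meta} for columns with a GDPR tag."""
    found = {}
    for section in ("nodes", "sources"):
        for unique_id, node in manifest.get(section, {}).items():
            for name, column in node.get("columns", {}).items():
                meta = column.get("meta") or {}
                if "gdpr_category" in meta:
                    found[(unique_id, name)] = meta
    return found


def diff_annotated(previous: dict, current: dict) -> dict:
    """Compare the last reviewed manifest with the one from this build."""
    old, new = annotated_columns(previous), annotated_columns(current)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in new.keys() & old.keys()
                           if new[k] != old[k]),
    }


# Hypothetical fragments of the last reviewed and the current manifest.
reviewed = {"nodes": {"model.shop.stg_crm__contacts": {"columns": {
    "email": {"meta": {"gdpr_category": "email", "retention_days": 730}},
}}}}
current = {"nodes": {"model.shop.stg_crm__contacts": {"columns": {
    "email": {"meta": {"gdpr_category": "email", "retention_days": 365}},
    "phone": {"meta": {"gdpr_category": "phone"}},
}}}}

changes = diff_annotated(reviewed, current)
print(changes)
```

A build whose three lists are all empty needs no privacy review, which matches the observation that most builds produce no personal data changes.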

Source YAML as the annotation interface

For organisations that prefer to keep personal data annotations close to the dbt project rather than in an external classification system, dbt schema YAML files provide a natural annotation interface. Column-level meta fields — gdpr_category: email, retention_days: 730, legal_basis: contract — travel with the model definition, are version-controlled in git, and are readable by the manifest extraction pipeline.

The limitation is coverage: annotations in YAML only cover columns that a model author explicitly documents. A column added to a source table at the warehouse level — without a corresponding YAML update — will be invisible to the annotation system until the YAML is updated. Complementing YAML annotations with warehouse-level classification that runs independently provides defence-in-depth coverage against this gap.
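A cheap first check for this gap is to compare the columns the warehouse reports in catalog.json with the columns documented in manifest.json. The sketch below assumes both artifacts key their entries by unique_id under nodes and sources, each with a columns mapping keyed by column name; the fragments themselves are hypothetical. It finds undocumented columns but does not classify them, so it complements warehouse-level classification rather than replacing it.

```python
def undocumented_columns(manifest: dict, catalog: dict) -> dict:
    """Return {unique_id: [columns in the warehouse but not in YAML]}."""
    gaps = {}
    for section in ("nodes", "sources"):
        for unique_id, entry in catalog.get(section, {}).items():
            declared = manifest.get(section, {}).get(unique_id, {})
            # Compare case-insensitively: some warehouses upper-case names.
            documented = {name.lower()
                          for name in declared.get("columns", {})}
            missing = sorted(name for name in entry.get("columns", {})
                             if name.lower() not in documented)
            if missing:
                gaps[unique_id] = missing
    return gaps


# Hypothetical fragments: the YAML documents one column, the warehouse has two.
manifest = {"sources": {"source.shop.crm.contacts": {"columns": {
    "email": {"meta": {"gdpr_category": "email"}},
}}}}
catalog = {"sources": {"source.shop.crm.contacts": {"columns": {
    "EMAIL": {"type": "TEXT"},
    "PHONE_NUMBER": {"type": "TEXT"},
}}}}

gaps = undocumented_columns(manifest, catalog)
print(gaps)
```

Each reported column is a candidate for a YAML update or an explicit "not personal data" decision, so that the annotation layer is not silently incomplete.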

Conclusion

dbt metadata is one of the richest lineage sources available to data engineering teams operating under GDPR, and it is already being generated on every build. The compliance gap is not the lineage — it is the annotation layer on top of it, and the policy binding that connects lineage paths to legal obligations. Adding a lightweight CI gate and an annotation registry to an existing dbt workflow can move an organisation from retrospective ROPA audits to continuous compliance without replacing the data infrastructure.

Source notes

GDPR Articles 5(2) and 30 — accountability principle and ROPA content requirements for data flows and processing activities
dbt Labs, manifest.json artifact schema reference (dbt Core v1.6+) — nodes, depends_on, columns, meta, and tags structure
dbt Labs, catalog.json artifact documentation — warehouse-resolved column type and row statistics
EDPB, Guidelines 07/2020 on the concepts of controller and processor — processing activity scope for automated transformation pipelines
ENISA, Pseudonymisation techniques and best practices (2019) — anonymisation and pseudonymisation thresholds relevant to derived column propagation