flowchart LR
subgraph Staging ["Staging"]
ST1["stg_nyc_taxi_trips"]
ST2["stg_taxi_zone_lookup"]
end
subgraph Silver ["Silver"]
SI1["silver_nyc_taxi_cleaned"]
SI2["silver_nyc_taxi_enriched"]
end
subgraph Gold ["Gold KPIs"]
G1["kpi_trips_by_hour"]
G2["kpi_revenue_by_hour"]
G3["kpi_top_pickup_zones"]
G4["kpi_data_quality_metrics"]
end
ST1 --> SI1
SI1 --> SI2
ST2 --> SI2
SI2 --> G1
SI2 --> G2
SI2 --> G3
SI2 --> G4
classDef staging fill:#475569,color:#fff,stroke:#334155
classDef silver fill:#0369a1,color:#fff,stroke:#075985
classDef gold fill:#01065c,color:#fff,stroke:#000940
class ST1,ST2 staging
class SI1,SI2 silver
class G1,G2,G3,G4 gold
Module 4: dbt Pipeline
Transform-as-code on Snowflake and Databricks
Duration: 75 min — Animation (3) · Think & Discuss (7) · Theory (15) · Quiz (3) · Practice (47)
1. Animation
2. Think & Discuss
Situation: Elena adds dbt on top of Snowflake for tests, lineage, and SQL-first transforms. Priya wants revenue pages and a data quality scorecard.
Prompts:
- Is dbt a replacement for Snowflake? State dbt’s job in one sentence.
- How would you prove which model feeds a Power BI revenue tile?
- Name two tests that stop bad data from breaking a KPI.
3. Theory
Elena: We keep Snowflake as the engine. dbt owns transform SQL, tests, and lineage — what Marcus’s board asked for.
Snowflake Horizon Catalog is the governance umbrella (tags, masking, lineage) — parallel to Unity Catalog in the Databricks track.
3.1 Theory: dbt Essentials
What is dbt?
Before diving into dbt, consider the two user personas on Databricks: Spark/Python engineers who build pipelines in notebooks and SQL analysts who write queries in SQL Warehouse or external BI tools. dbt bridges this gap — it lets both personas collaborate on the same codebase using SQL, while the underlying engine (Spark or Snowflake) handles execution.
dbt (data build tool) is a transformation framework that:
- Turns SQL
SELECTstatements into managed tables/views - Adds testing, documentation, and lineage to SQL
- Runs on top of existing compute (Databricks, Snowflake, BigQuery, etc.)
- Does not handle ingestion — it assumes Bronze tables already exist
Problems dbt solves
Before dbt, teams like Marcus’s at YellowLine NYC typically face:
| Pain Point | Without dbt | With dbt |
|---|---|---|
| Complex SQL | Hundreds of lines of CREATE TABLE / INSERT boilerplate |
One SELECT per model — dbt handles DDL/DML |
| Dependency management | Manually ordered scripts; break when upstream changes | ref() builds a DAG — dbt runs models in correct order |
| Testing | No automated data quality checks; errors found in dashboards | Built-in tests (not_null, unique, accepted_values) |
| Documentation | Stale wikis or no docs at all | Auto-generated docs + interactive lineage graph (dbt docs) |
| Code reuse | Copy-pasted SQL across notebooks and scripts | Macros and packages — DRY principle for SQL |
| Version control | Notebooks in folders; no code review | Git-native workflow with PRs and CI/CD |
Software engineering for SQL
dbt’s core insight is applying software engineering best practices — version control, testing, modularity, CI/CD, documentation — to SQL transformations. This is exactly what Elena needed when she proposed dbt to address the board’s governance requirements.
Core Concepts
| Concept | What It Does | Example |
|---|---|---|
| Model | SQL SELECT → table/view | silver_nyc_taxi_cleaned.sql |
| ref() | Reference another model | { ref('silver_nyc_taxi_enriched') } |
| source() | Reference a raw table | { source('bronze', 'nyc_taxi_trips') } |
| Test | Validate data expectations | not_null, accepted_values |
| Macro | Reusable SQL function | { time_of_day('pickup_hour') } |
| Seed | CSV → table | taxi_zone_lookup.csv |
How ref() builds the DAG — each ref() call creates a dependency edge. dbt reads all model files, resolves the dependency graph, and runs models in topological order:
The dbt Fusion engine, generally available in 2026, significantly speeds up compilation for large projects by transpiling SQL models into Rust — reducing parse and compile times from minutes to seconds. For teams with hundreds of models, this is a major productivity improvement. Additionally, the dbt VS Code extension provides real-time model validation, inline DAG visualization, and one-click run/test, while dbt Canvas offers a visual, low-code editor for building and editing dbt models — a recommended alternative to the CLI-only workflow for analysts new to dbt.
Cross-Platform SQL
dbt uses target.type to write platform-specific SQL:
{%- if target.type == 'snowflake' -%}
dayname(pickup_datetime)
{%- else -%}
date_format(pickup_datetime, 'EEEE')
{%- endif -%}This conditional compilation is powered by Jinja — a templating engine embedded in dbt SQL files. Beyond target.type, Jinja provides ref() and source() for dependency management, macros for reusable SQL snippets (similar to functions in Python), and configuration blocks ({ config(...) }) that control how each model is materialized. The result: SQL files that are both portable across platforms and DRY (Don’t Repeat Yourself).
Materialization Strategies
Every dbt model must choose a materialization — the strategy for how the SELECT statement becomes a database object:
| Materialization | What it creates | When to use | Tradeoff |
|---|---|---|---|
| view | A database view (query runs on access) | Simple transforms, always-fresh data | Slow queries on large tables |
| table | A physical table (rebuilt on each run) | Gold KPI aggregations, stable lookups | Storage cost; stale between runs |
| incremental | Appends/merges only new rows | Large Silver tables with append-only sources | Complexity: requires is_incremental() logic |
| ephemeral | No database object (inlined as CTE) | Intermediate logic shared via ref() |
Not queryable outside dbt; no testing |
| materialized_view | Database-managed refresh | Streaming or near-real-time (Snowflake only) | Adapter-specific; higher compute cost |
-- Example: incremental materialization for a large Silver table
{{ config(
materialized = 'incremental',
unique_key = 'trip_id'
) }}
SELECT *
FROM {{ source('bronze', 'nyc_taxi_trips') }}
WHERE fare_amount > 0
{% if is_incremental() %}
AND tpep_pickup_datetime > (SELECT MAX(tpep_pickup_datetime) FROM {{ this }})
{% endif %}For this workshop, most models use table materialization — simplicity first. Incremental patterns appear in production (Module 5) and streaming (Module 8).
Testing Framework
The slide above walks through the full dbt project workflow: Init (dbt init to scaffold the project and configure profiles.yml), Bronze (registering raw source tables via source() declarations), Silver (cleaning and transforming with ref() chains), Gold (building KPI aggregations for dashboards), and finally Testing & Docs (dbt build to run + test, then dbt docs generate to produce lineage graphs and data dictionaries). Each phase maps to a step in the medallion architecture you have already built in Modules 2 and 3.
dbt’s testing framework is what sets it apart from ad-hoc SQL scripts. Tests are first-class citizens — they run alongside your models via dbt build (which combines run + test in one command).
Built-in generic tests cover the most common data quality checks:
not_null— column has no nulls (essential for primary keys)unique— no duplicate values (critical for IDs)accepted_values— column only contains expected values (e.g.,payment_type IN ('Credit Card', 'Cash'))relationships— foreign key integrity between models
# models/silver/schema.yml
models:
- name: silver_nyc_taxi_enriched
columns:
- name: trip_id
tests: [unique, not_null]
- name: payment_type_desc
tests:
- accepted_values:
values: ['Credit Card', 'Cash', 'No Charge', 'Dispute']
- name: fare_amount
tests:
- not_nullBeyond built-in tests, you can write custom SQL tests (tests/assert_revenue_positive.sql) that validate business rules — for example, “Gold revenue KPIs must never be negative.” Every test result is logged and surfaced in dbt docs, creating an auditable data quality record.
Always use dbt build
In production, always run dbt build (which executes run + test together) instead of separate dbt run followed by dbt test. If a model’s tests fail, dbt build stops downstream models from running — preventing bad data from propagating to Gold tables and dashboards. This single habit prevents the most common production data quality incidents.
dbt is not an ingestion tool
A common mistake is trying to use dbt for Bronze ingestion — reading files from cloud storage and loading them into tables. dbt cannot do this. It assumes Bronze tables already exist and only transforms data via SELECT statements. Ingestion is the job of Databricks (spark.read.parquet()) or Snowflake (COPY INTO). If you find yourself wanting to write CREATE TABLE ... AS SELECT FROM EXTERNAL LOCATION, that’s an ingestion concern — not a dbt model.
3.2 Demo: Trainer Walkthrough
# Install packages
dbt deps
# Run against Databricks backend
dbt run --target databricks
dbt test --target databricks
# Run SAME models against Snowflake backend
dbt run --target snowflake
dbt test --target snowflake
# Compare Gold KPI outputs — they should match!
# Generate and view documentation
dbt docs generate
dbt docs serve3.3 Key Takeaways
- dbt adds testing, docs, and lineage to SQL transformations — it is the governance layer on top of compute engines
- Same models, different backends —
target.typeand Jinja templating enable cross-platform SQL without code duplication - dbt does not replace Databricks or Snowflake — it sits on top as a transformation-only layer (no ingestion)
- refs create a dependency graph (DAG) — dbt knows the correct execution order and builds models in the right sequence
- Materialization choice (view, table, incremental) controls the tradeoff between freshness, cost, and complexity
dbt build(run + test together) is the production standard — failing tests block downstream models- In production, use dbt Cloud or GitHub Actions CI (see Module 5)
4. Quiz
Quiz: Module 4 — dbt Pipeline Quiz
Before moving on, make sure you can answer:
- What is the difference between
ref()andsource()— when would you use each? - Name two materialization types and explain when you’d choose one over the other.
- What does
dbt builddo thatdbt runalone does not, and why does it matter in production? - Why can’t dbt load raw Parquet files from cloud storage into Bronze tables?
5. Practice
Hands-on lab
Project: dbt_project/ targeting Snowflake.
When all 12 Gold kpi_* tables exist, Priya can connect Power BI Desktop (free) to the same schema:
- Self-paced lab (~45–60 min): Exercise: Power BI Dashboard — five pages, all 12 KPIs
- Install & connect: Power BI setup
- DAX & theme reference: Power BI dashboard guide
Priya / Power BI: Revenue & Payments pages plus kpi_data_quality_metrics scorecard.
Next module
Module 5: Production Patterns — Elena asks: What executes every night when we are not in the room?