Exercise: dbt Pipeline

YellowLine NYC story · full hands-on lab

title: “Exercise: dbt Pipeline” subtitle: “YellowLine NYC story · full hands-on lab” —

Estimated time: 30–35 min (Base: 15 min · Stretch goals: 15–20 min)

YellowLine NYC context (Module 4)

The board wants lineage and tests. Add dbt on top of Snowflake.

TipWorking Environment — GitHub Codespaces (Recommended)

Open your fork on GitHub → Code → Codespaces → Create codespace on main. dbt Core, Python, and all dependencies are pre-installed. Skip straight to Step 1.

Local alternative: Install dbt Core 1.8+ manually:

pip install dbt-core dbt-databricks dbt-snowflake

Prerequisites

  • Codespaces: dbt is pre-installed ✅
  • Local: dbt Core 1.8+ installed (dbt --version)
  • profiles.yml configured with at least one target (Databricks or Snowflake)
  • Bronze tables exist (from Databricks or Snowflake exercise)

Base Exercise

Step 1: Setup

cd dbt_project/

# Copy and configure profiles
cp profiles.yml.example profiles.yml
# Edit profiles.yml with your credentials

# Verify connection
dbt debug

# Install packages
dbt deps

# Load seed data
dbt seed

Step 2: Run the Pipeline

# Run all models
dbt run --target snowflake   # or --target databricks

# You should see:
# staging: 2 models (views)
# silver: 2 models (tables)
# gold: 12 models (tables)

Step 3: Run Tests

dbt test

# Check results — all tests should pass
# Tests check: not_null, accepted_values, data quality constraints

Step 4: Generate Documentation

dbt docs generate
dbt docs serve

# Opens browser with interactive documentation
# Explore the lineage graph — click any model to see SQL and tests

Step 5: Verify Gold KPIs

Connect to your target database and query the Gold tables:

SELECT * FROM kpi_trips_by_hour ORDER BY pickup_hour;
SELECT * FROM kpi_top_pickup_zones WHERE trip_rank <= 5;
SELECT * FROM kpi_data_quality_metrics;

Stretch Goals

A: Switch Target

# Run against the OTHER backend
dbt run --target databricks   # if you ran snowflake first

# Compare outputs — KPI values should match across backends

B: Write a Custom Test

Create dbt_project/tests/generic/test_no_empty_gold_tables.sql:

-- This test fails if any Gold KPI table has zero rows
{% test has_rows(model) %}
    select count(*) as row_count
    from {{ model }}
    having count(*) = 0
{% endtest %}

C: Add a New KPI Model

Create dbt_project/models/gold/kpi_avg_tip_by_borough.sql:

select
    pickup_borough,
    count(*) as total_trips,
    round(avg(tip_amount), 2) as avg_tip,
    round(avg(tip_percentage), 2) as avg_tip_pct
from {{ ref('silver_nyc_taxi_enriched') }}
where pickup_borough is not null
group by pickup_borough
order by avg_tip desc

Then run:

dbt run --select kpi_avg_tip_by_borough
dbt test --select kpi_avg_tip_by_borough

D: Explore Macros

  1. Open macros/time_of_day.sql
  2. Run dbt compile to see how the macro expands in a model
  3. Check target/compiled/ for the expanded SQL

E: Use dbt build

# Run + test in dependency order (most efficient for CI)
dbt build --target snowflake
NoteReady to compare all three tools?

Head to the Batch Pipeline Comparison page for a full side-by-side breakdown of Databricks, Snowflake, and dbt — verified against official documentation — including storage formats, compute models, cost implications, and key gotchas.


Cleanup

dbt models are stored as tables/views in your target database. To clean up:

# Drop all dbt-created models from the target
dbt run --target snowflake --full-refresh  # resets incremental state
# Then manually drop schemas if needed:
# DROP SCHEMA IF EXISTS {attendee_id}_GOLD CASCADE;

If using the Snowflake backend, also suspend the warehouse:

ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND;

Return to module

Source: merged from frozen workshop-2026-v1/exercises/ex-dbt.qmd — do not edit workshop-2026-v1/.