Exercise: dbt Pipeline

YellowLine NYC story · full hands-on lab

Estimated time: 30–35 min (Base: 15 min · Stretch goals: 15–20 min)

YellowLine NYC context (Module 4)

The board wants lineage and tests. Add dbt on top of Snowflake.

Working Environment — GitHub Codespaces (Recommended)

Open your fork on GitHub → Code → Codespaces → Create codespace on main. dbt Core, Python, and all dependencies are pre-installed from pyproject.toml. Complete Configure .env below, then Step 1.

Local alternative — from your fork root:

uv pip install --system ".[streaming,ml,notebook]"
dbt --version

GitHub blocked? (emergency only)

The normal path is fork + Codespace (Prerequisites § Step 2). Use Lab source files only if your facilitator approved it — e.g. you cannot create or use GitHub before class.

Prerequisites

Prerequisites — fork, Codespace (Modules 2–3 do not need .env)
Bronze tables exist (from Databricks or Snowflake exercise)
Snowflake target: role DE_WORKSHOP_ROLE and warehouse DE_WORKSHOP_WH in the Started state (Exercise: Snowflake § role)
Databricks target (optional): SQL warehouse de-workshop-wh via DATABRICKS_HTTP_PATH — not your PySpark cluster (Exercise: Databricks § cluster vs warehouse)

Configure `.env` (Codespaces)

Once, before Step 1. Not needed for Modules 2–3 (Databricks notebooks + Snowsight use browser login).

Where: repo root in Codespace (cp .env.template .env → fill → sync-env → verify-env).

Variable	Where to find it
`ATTENDEE_ID`	My Workshop
`DATABRICKS_HOST`	Workspace URL without `https://`
`DATABRICKS_TOKEN`	Databricks → Settings → Developer → Access tokens
`DATABRICKS_HTTP_PATH`	SQL Warehouses → `de-workshop-wh` → Connection details
`SNOWFLAKE_ACCOUNT`	Snowsight → account identifier e.g. `el30551.west-europe.azure` (§ Snowflake values)
`SNOWFLAKE_USER` / `SNOWFLAKE_PASSWORD`	Trial signup

Leave defaults: DE_WORKSHOP_WH, DE_MASTERCLASS, PUBLIC, DE_WORKSHOP_ROLE.

cp .env.template .env

Edit .env — in Codespaces: click .env in the Explorer sidebar (left) to open it in the built-in editor, then fill in the values from the table above.

sync-env     # regenerate ~/.snowflake/config.toml + dbt profiles after editing .env
             # (equivalent to: bash .devcontainer/setup-environment.sh)
export ATTENDEE_ID=de_XX_yourname   # same value as in .env — required for dbt (shell env var)
verify-env   # one-shot health check — all CLIs + live Snowflake / Databricks / dbt connectivity

Verify your environment

verify-env is the fastest way to confirm your Codespace is wired correctly — it checks that snow, databricks, and dbt are installed and that your .env credentials actually connect. A healthy run ends with 🎉 Environment verified — you're ready for the labs. If any line shows ❌, fix the matching values in .env, run sync-env, then verify-env again. Re-run it any time.
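Before a full verify-env run, a quick offline check can catch blank values. The sketch below is not what verify-env does internally. It is a hypothetical helper that assumes .env holds plain KEY=value lines and only confirms that the variables from the table above are filled in. Drop the DATABRICKS_* names from REQUIRED if you only target Snowflake.

```python
# Hypothetical offline check; verify-env remains the authoritative health check.
REQUIRED = (
    "ATTENDEE_ID",
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "DATABRICKS_HTTP_PATH",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_variables(text: str) -> list[str]:
    """Return the required variable names that are absent or empty."""
    values = parse_env(text)
    return [name for name in REQUIRED if not values.get(name)]
```

A possible use from the repo root is `missing_variables(open(".env").read())`; an empty list means every listed variable has a value, not that the credentials actually connect.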

Snowflake values

SNOWFLAKE_ACCOUNT — account identifier from your Snowsight URL.

Look at the browser address bar — it looks like:

https://app.snowflake.com/west-europe.azure/el30551/#/homepage

Take the two path segments after /app.snowflake.com/ and swap their order: west-europe.azure/el30551 → el30551.west-europe.azure
Set in .env: SNOWFLAKE_ACCOUNT=el30551.west-europe.azure

Common mistakes

❌ https://el30551.west-europe.azure — no https://
❌ el30551.west-europe.azure.snowflakecomputing.com — no .snowflakecomputing.com
❌ pasting the full URL — only the two-part identifier
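The swap rule and the common mistakes above can be sketched in a few lines of Python. Both function names are hypothetical helpers, not part of the workshop tooling, and the check only covers the three mistakes listed here.

```python
from urllib.parse import urlparse


def account_from_snowsight_url(url: str) -> str:
    """Turn a Snowsight URL into the '<locator>.<region>.<cloud>' identifier."""
    parsed = urlparse(url)
    if parsed.netloc != "app.snowflake.com":
        raise ValueError("expected an app.snowflake.com URL")
    # Path looks like /west-europe.azure/el30551/ : region first, locator second.
    segments = [segment for segment in parsed.path.split("/") if segment]
    if len(segments) < 2:
        raise ValueError("URL path needs a region segment and a locator segment")
    region, locator = segments[0], segments[1]
    return f"{locator}.{region}"


def looks_like_account_identifier(value: str) -> bool:
    """Reject the common mistakes: scheme, full URL, or snowflakecomputing.com suffix."""
    if "://" in value or "/" in value:
        return False
    if value.endswith(".snowflakecomputing.com"):
        return False
    return "." in value


url = "https://app.snowflake.com/west-europe.azure/el30551/#/homepage"
print(account_from_snowsight_url(url))  # el30551.west-europe.azure
```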

Alternative path: Snowsight → click your name (bottom-left corner) → View account details → copy Account identifier (the short locator.region.cloud form, not the organization-based orgname-accountname form).

SNOWFLAKE_USER — your Snowflake login username (not your email address).

You chose this username during trial signup.
If you forgot it: click your name or avatar (bottom-left corner) in Snowsight; the username is shown in the pop-up header.

SNOWFLAKE_PASSWORD — the password you set when activating your Snowflake trial.

Same password you use to log into app.snowflake.com.
If you forgot it: click Forgot password on the Snowflake sign-in page.

Databricks values

DATABRICKS_HOST — your workspace hostname (no https://).

Open your Databricks workspace in the browser
Copy the hostname from the address bar: the part after https:// and before the first / or #:
```
adb-1234567890123456.7.azuredatabricks.net
```
Set in .env: DATABRICKS_HOST=adb-1234567890123456.7.azuredatabricks.net

Warning

Do not include https://. The host ends in .azuredatabricks.net.
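If you prefer to paste the whole address and trim it programmatically, the following sketch extracts the hostname with the Python standard library. The function name is a hypothetical helper, and the query string and fragment in the example are made up for illustration.

```python
from urllib.parse import urlparse


def databricks_host(address: str) -> str:
    """Return only the hostname from a pasted workspace address."""
    candidate = address.strip()
    # urlparse only recognises the host when a scheme is present.
    if "://" not in candidate:
        candidate = "https://" + candidate
    host = urlparse(candidate).hostname
    if not host:
        raise ValueError("could not find a hostname in the address")
    return host


pasted = "https://adb-1234567890123456.7.azuredatabricks.net/?o=1#workspace"
print(databricks_host(pasted))  # adb-1234567890123456.7.azuredatabricks.net
```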

DATABRICKS_TOKEN — personal access token (PAT).

In Databricks, click your username (top-right corner) → Settings
Left sidebar → Developer → Access tokens
Click Generate new token
Name it dbt-workshop, set expiry to 30 days → click Generate
Copy the token immediately — it starts with dapi and is shown only once
Set in .env: DATABRICKS_TOKEN=dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Warning

If you close the dialog without copying, you must generate a new token — existing tokens cannot be retrieved.

DATABRICKS_HTTP_PATH — SQL Warehouse path (not your cluster path).

In Databricks, click SQL Warehouses in the left sidebar (if not visible: open the left-sidebar grid icon → search “SQL Warehouses”)
Click de-workshop-wh → select the Connection details tab
Copy the HTTP Path value — it looks like:
```
/sql/1.0/warehouses/abc1234def56789a
```
Set in .env: DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/abc1234def56789a

Important

This is the SQL Warehouse path, not your All-Purpose Cluster path. Cluster paths look like /sql/protocolv1/o/… and will cause dbt connection errors (see Exercise: Databricks § cluster vs warehouse).
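A minimal sketch of that distinction, assuming every SQL Warehouse path starts with the /sql/1.0/warehouses/ prefix shown in the example above. The helper name and the cluster-style suffix in the usage lines are hypothetical.

```python
WAREHOUSE_PREFIX = "/sql/1.0/warehouses/"


def is_sql_warehouse_path(http_path: str) -> bool:
    """True when the path has the warehouse prefix followed by a warehouse ID."""
    return http_path.startswith(WAREHOUSE_PREFIX) and len(http_path) > len(WAREHOUSE_PREFIX)


print(is_sql_warehouse_path("/sql/1.0/warehouses/abc1234def56789a"))  # True
print(is_sql_warehouse_path("/sql/protocolv1/o/0/example-cluster"))   # False (hypothetical cluster path)
```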

dbt troubleshooting

Issue	Solution
`dbt debug` fails	Re-check `.env`
No source / Bronze missing	Run Databricks or Snowflake pipeline first
Snowflake auth error	Check that `SNOWFLAKE_ACCOUNT` uses the `el30551.west-europe.azure` format (no `https://`, no `.snowflakecomputing.com`)
`PASS` not 17 (run) or 22 (test)	Use `--exclude tag:ml tag:streaming` on run and test
Gold in Silver schema	Pull latest `dbt_project/` — gold uses `+schema: dbt_gold`
Streaming on Databricks	`--exclude tag:streaming` — Dynamic Tables are Snowflake-only

Base Exercise

Step 1: Setup

You should have completed Configure .env, so ~/.dbt/profiles.yml is already generated and you can skip the cp profiles.yml.example profiles.yml step.

cd dbt_project/

export ATTENDEE_ID=de_XX_yourname   # same value as in .env — required for dbt

# Verify connection
dbt debug --target snowflake

# Install packages
dbt deps

# Load seed data
dbt seed

If you did not use Codespaces, pick your OS.

Windows (PowerShell):

cd dbt_project/
copy profiles.yml.example profiles.yml
# Open profiles.yml in your editor (e.g. VS Code: `code profiles.yml`)
# Replace <YOUR_SNOWFLAKE_ACCOUNT>, <YOUR_USERNAME>, <YOUR_PASSWORD> with actual values
$env:ATTENDEE_ID = "de_XX_yourname"
dbt debug --target snowflake
dbt deps
dbt seed

macOS / Linux:

cd dbt_project/
cp profiles.yml.example profiles.yml
# Open profiles.yml in your editor (e.g. `code profiles.yml`)
# Replace <YOUR_SNOWFLAKE_ACCOUNT>, <YOUR_USERNAME>, <YOUR_PASSWORD> with actual values
export ATTENDEE_ID=de_XX_yourname
dbt debug --target snowflake
dbt deps
dbt seed

Step 2: Run the Pipeline

# Batch medallion only — exclude optional ML + streaming models (Modules 8–9)
dbt run --target snowflake --exclude tag:ml tag:streaming   # or --target databricks

# You should see PASS=17 in the summary (1 hook + 16 models):
# staging: 2 models (views)
# silver: 2 models (tables)
# gold: 12 models (tables)

Note

Why --exclude tag:ml tag:streaming? The full dbt_project/ includes ML feature models (Module 9) and Dynamic Table models (Module 8). These need extra setup (.env, streaming prerequisites). Excluding them keeps the base run clean — you will add them back in later modules.
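Conceptually, the exclusion drops every model that carries at least one of the listed tags. The sketch below imitates that selection logic in plain Python. It is not dbt's implementation, and the two tagged model names are hypothetical placeholders.

```python
def select_models(models: dict[str, set[str]], excluded_tags: set[str]) -> list[str]:
    """Keep only the models that carry none of the excluded tags."""
    return sorted(name for name, tags in models.items() if not tags & excluded_tags)


models = {
    "silver_nyc_taxi_enriched": set(),
    "kpi_trips_by_hour": set(),
    "ml_feature_example": {"ml"},              # hypothetical model name
    "streaming_example": {"streaming"},        # hypothetical model name
}

print(select_models(models, {"ml", "streaming"}))
# ['kpi_trips_by_hour', 'silver_nyc_taxi_enriched']
```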

Step 3: Run Tests

dbt test --target snowflake --exclude tag:ml tag:streaming   # match Step 2 target

# Expect PASS=22 (silver: 6 tests · gold: 16 tests)
# Tests check: not_null, accepted_values, and value-range constraints

Reading the test summary

dbt prints one PASS/FAIL line per test, then a roll-up like Done. PASS=22 WARN=0 ERROR=0 SKIP=0 TOTAL=22. All 22 should PASS with 0 errors. If a test fails, dbt prints the path to its compiled SQL (a compiled Code at target/compiled/... line). Open that file in the Explorer and run the query in Snowsight to see the exact rows that broke the rule. Every test traces back to a column in the models/**/*.yml files.
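If you capture the dbt output in a script, for example in CI, the roll-up line can be parsed with a small regular expression. This is a sketch that assumes the line keeps the KEY=number layout quoted above; both function names are hypothetical.

```python
import re


def parse_summary(line: str) -> dict[str, int]:
    """Extract the PASS/WARN/ERROR/SKIP/TOTAL counters from a dbt roll-up line."""
    pairs = re.findall(r"\b(PASS|WARN|ERROR|SKIP|TOTAL)=(\d+)", line)
    return {key: int(value) for key, value in pairs}


def all_passed(line: str, expected_total: int) -> bool:
    """True when every test passed and the total matches the expected count."""
    counts = parse_summary(line)
    return counts.get("PASS") == expected_total and counts.get("TOTAL") == expected_total


summary = "Done. PASS=22 WARN=0 ERROR=0 SKIP=0 TOTAL=22"
print(parse_summary(summary))
# {'PASS': 22, 'WARN': 0, 'ERROR': 0, 'SKIP': 0, 'TOTAL': 22}
```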

Note

Why dbt test matters: In Databricks and Snowflake you wrote SELECT COUNT(*) queries to check results manually. dbt runs not_null, accepted_values, unique, and custom assertions every time you invoke dbt test, with no manual verification queries to write. In a production pipeline, dbt build (Stretch goal E) interleaves tests with model runs in dependency order, so a failing Silver test stops the dependent Gold models from being built on bad data.

Step 4: Generate & Explore Documentation

dbt docs generate --target snowflake   # match your run target
dbt docs serve                         # serves at http://localhost:8080

In Codespaces, click Open in Browser on the port-8080 forwarded-port toast (or use the Ports tab). Then explore three things:

Lineage graph — click the blue circle (bottom-right) for the full DAG: source → staging → silver → gold. It is generated automatically from your ref() and source() calls — no manual diagram to maintain.
Column descriptions — open any model (e.g. silver_nyc_taxi_enriched) and scroll to the Columns section. Every column carries a plain-English description sourced from the models/**/*.yml files.
Tests — on a model page, each tested column lists its attached tests (not_null, accepted_values, …) right beside the column name — the same 22 tests you ran in Step 3.

Note

Docs vs. catalog: column descriptions and tests come from your YAML (the manifest), so they show up immediately. Data types and row counts come from the warehouse (the catalog), which is why dbt docs generate connects to Snowflake to populate them.

Step 5: Verify Gold KPIs

dbt gold/ml models land in {ATTENDEE_ID}_dbt_gold / {ATTENDEE_ID}_DBT_GOLD — isolated from the SQL track’s _SQL_GOLD so the two pipelines never overwrite each other. dbt staging/silver models use {ATTENDEE_ID}_dbt / {ATTENDEE_ID}_DBT (see profiles.yml).

Where to run: open app.snowflake.com → Workspaces (or Projects → Workspaces) → + Add New → SQL File. Replace de_XX_yourname with your actual ATTENDEE_ID, then run each statement with ▶ (or Ctrl+Enter).

-- Snowflake SQL File — replace de_XX_yourname with your ATTENDEE_ID
SET attendee_id = 'de_XX_yourname';
SET kpi_trips_by_hour    = 'DE_MASTERCLASS.' || $attendee_id || '_DBT_GOLD.kpi_trips_by_hour';
SET kpi_top_pickup_zones = 'DE_MASTERCLASS.' || $attendee_id || '_DBT_GOLD.kpi_top_pickup_zones';
SET kpi_data_quality     = 'DE_MASTERCLASS.' || $attendee_id || '_DBT_GOLD.kpi_data_quality_metrics';

SELECT * FROM IDENTIFIER($kpi_trips_by_hour)    ORDER BY pickup_hour;
SELECT * FROM IDENTIFIER($kpi_top_pickup_zones) WHERE trip_rank <= 5;
SELECT * FROM IDENTIFIER($kpi_data_quality)     ORDER BY metric_name;

Expect 12 kpi_* tables; kpi_top_pickup_zones and kpi_popular_routes each have 20 rows.

Stretch Goals

A: Switch Target

# Run against the OTHER backend
dbt run --target databricks --exclude tag:ml tag:streaming   # if you ran snowflake first

# Compare outputs — KPI values should match across backends

B: Write a Custom Test

1 — Define the test macro

In your Codespace Explorer (left sidebar), navigate to dbt_project/tests/generic/ (create the folder if it does not exist). Create a new file has_rows.sql:

-- Generic test: fails (returns rows) when the model contains zero rows.
-- dbt tests pass when the query returns 0 rows.
{% test has_rows(model) %}
    select 1
    from {{ model }}
    having count(*) = 0
{% endtest %}

2 — Apply the test to a model

Open dbt_project/models/gold/_gold.yml and add - has_rows under the tests: block of one of the KPI models, e.g. kpi_trips_by_hour:

  - name: kpi_trips_by_hour
    tests:
      - has_rows          # ← add this line
    columns:
      ...

3 — Run the test

dbt test --target snowflake --select kpi_trips_by_hour
# Expect: PASS=1 (has_rows), plus any existing column tests for that model

Note

How dbt generic tests work: the test fails when the SQL query returns rows. has_rows returns a row only when count(*) = 0 (empty table) — so a populated KPI table makes it return nothing → PASS. An empty table returns one row → FAIL.
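The pass/fail rule can be tried outside the warehouse. In the sketch below, SQLite stands in for Snowflake, and the query is rewritten with a subquery because older SQLite versions reject HAVING without GROUP BY. The table names are hypothetical; the point is only that zero returned rows means PASS.

```python
import sqlite3

# Same idea as has_rows: return one row only when the table is empty.
HAS_ROWS_SQL = "select 1 from (select count(*) as n from {table}) where n = 0"


def has_rows_test_passes(conn: sqlite3.Connection, table: str) -> bool:
    """A dbt-style test passes when its query returns no rows."""
    failing_rows = conn.execute(HAS_ROWS_SQL.format(table=table)).fetchall()
    return len(failing_rows) == 0


conn = sqlite3.connect(":memory:")
conn.execute("create table kpi_empty (pickup_hour integer)")    # hypothetical empty table
conn.execute("create table kpi_filled (pickup_hour integer)")   # hypothetical populated table
conn.executemany("insert into kpi_filled values (?)", [(8,), (9,)])

print(has_rows_test_passes(conn, "kpi_filled"))  # True  -> PASS
print(has_rows_test_passes(conn, "kpi_empty"))   # False -> FAIL
```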

C: Add a New KPI Model

In your Codespace Explorer (left sidebar), navigate to dbt_project/models/gold/. Create a new file kpi_avg_tip_by_borough.sql:

select
    pickup_borough,
    count(*) as total_trips,
    round(avg(tip_amount), 2) as avg_tip,
    round(avg(tip_percentage), 2) as avg_tip_pct
from {{ ref('silver_nyc_taxi_enriched') }}
where pickup_borough is not null
group by pickup_borough
order by avg_tip desc

Then run:

dbt run --target snowflake --select kpi_avg_tip_by_borough
dbt test --target snowflake --select kpi_avg_tip_by_borough
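To see what the model computes before running it against the warehouse, the same SELECT can be tried on a handful of made-up rows. This sketch uses an in-memory SQLite table in place of the ref() call; the four input rows are invented for illustration and are not workshop data.

```python
import sqlite3

KPI_SQL = """
select
    pickup_borough,
    count(*) as total_trips,
    round(avg(tip_amount), 2) as avg_tip,
    round(avg(tip_percentage), 2) as avg_tip_pct
from silver_nyc_taxi_enriched
where pickup_borough is not null
group by pickup_borough
order by avg_tip desc
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table silver_nyc_taxi_enriched "
    "(pickup_borough text, tip_amount real, tip_percentage real)"
)
conn.executemany(
    "insert into silver_nyc_taxi_enriched values (?, ?, ?)",
    [
        ("Manhattan", 2.0, 10.0),
        ("Manhattan", 4.0, 20.0),
        ("Queens", 5.0, 25.0),
        (None, 1.0, 5.0),  # null borough: dropped by the WHERE clause
    ],
)

kpi_rows = conn.execute(KPI_SQL).fetchall()
print(kpi_rows)
# [('Queens', 1, 5.0, 25.0), ('Manhattan', 2, 3.0, 15.0)]
```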

D: Explore Macros

In your Codespace Explorer (left sidebar), open dbt_project/macros/time_of_day.sql
Run dbt compile (from dbt_project/) to see how the macro expands in a model
In the Explorer, open the generated file under dbt_project/target/compiled/ — compare with the raw macro

E: Use dbt build

# Run + test in dependency order (most efficient for CI)
dbt build --target snowflake --exclude tag:ml tag:streaming

Ready to compare all three tools?

Head to the Batch Pipeline Comparison page for a full side-by-side breakdown of Databricks, Snowflake, and dbt — verified against official documentation — including storage formats, compute models, cost implications, and key gotchas.

Discussion questions

These wrap up the dbt lab — your facilitator may use them to open the floor.

You ran the same models against Snowflake and Databricks with one flag change (--target). When would you choose a native single-target tool (PySpark or Snowflake SQL) over dbt’s multi-target approach?
The lineage graph was generated automatically from your model code. In a team of 10 data engineers, how would you keep architecture documentation up to date if you were not using dbt?

Ready to compare all three tools?

The full cross-tool observation table, the “what you should have noticed” insights, and the tool-choice discussion questions live on the Batch Pipeline Comparison page — fill them in once after running every pipeline (Databricks, Snowflake, dbt).

Reference — What the Silver layer did (read later)

Not part of the lab steps. dbt models read Bronze/Silver sources your pipeline already built — this explains what those tables contain.

Silver outputs

Table	Contents
`silver_nyc_taxi_cleaned`	Quality filters, corrections, derived columns — before zone join
`silver_nyc_taxi_enriched`	Cleaned trips plus zone lookup — Gold KPIs read this table

1. Filtered out (rows removed)

Silver drops trips that would distort KPIs. Workshop month (Oct 2024): 499,609 of 3,646,319 Bronze rows removed (13.7%); 3,146,710 remain in Silver. See Workshop dataset volumes.
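The row counts quoted above are consistent with each other, as this short arithmetic check shows:

```python
bronze_rows = 3_646_319
removed_rows = 499_609

silver_rows = bronze_rows - removed_rows
removed_pct = round(removed_rows / bronze_rows * 100, 1)

print(silver_rows)   # 3146710
print(removed_pct)   # 13.7
```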

Rule	Condition
Valid timestamps	Pickup and dropoff not null; pickup ≤ dropoff
Reasonable duration	Trip between 1 and 1,440 minutes (at most 24 hours)
Positive distance	`trip_distance > 0`
Positive fare & total	`fare_amount > 0`, `total_amount > 0`
Valid passengers	`passenger_count` between 1 and 8
Airport sanity	No zero-distance trips with airport fee
Deduplication	Exact duplicate rows removed

Databricks only: extra cross-month filter on raw Parquet (adjacent-month outliers).

Full list: Data Model — Data Quality Rules.
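The rule table can be read as a single row-level predicate plus a deduplication pass. The sketch below is an illustration in Python, not the workshop's Silver code: the timestamp keys pickup_datetime and dropoff_datetime are assumed names, and the airport rule is not coded separately because the positive-distance rule already removes zero-distance trips.

```python
from datetime import datetime


def _positive(value) -> bool:
    """True for a non-null number greater than zero."""
    return value is not None and value > 0


def passes_silver_filters(trip: dict) -> bool:
    """Apply the row-level rules from the table to one trip record."""
    pickup = trip.get("pickup_datetime")     # assumed column name
    dropoff = trip.get("dropoff_datetime")   # assumed column name
    if pickup is None or dropoff is None or pickup > dropoff:
        return False
    duration_minutes = (dropoff - pickup).total_seconds() / 60
    if not 1 <= duration_minutes <= 1440:
        return False
    if not _positive(trip.get("trip_distance")):
        return False
    if not (_positive(trip.get("fare_amount")) and _positive(trip.get("total_amount"))):
        return False
    passengers = trip.get("passenger_count")
    return passengers is not None and 1 <= passengers <= 8


def deduplicate(trips: list[dict]) -> list[dict]:
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for trip in trips:
        key = tuple(sorted(trip.items()))
        if key not in seen:
            seen.add(key)
            unique.append(trip)
    return unique


good_trip = {
    "pickup_datetime": datetime(2024, 10, 5, 8, 0),
    "dropoff_datetime": datetime(2024, 10, 5, 8, 30),
    "trip_distance": 5.0,
    "fare_amount": 20.0,
    "total_amount": 26.0,
    "passenger_count": 1,
}
print(passes_silver_filters(good_trip))  # True
```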

2. Corrected (row kept, value fixed)

Field	Change
`tip_amount`	Negative tips set to 0
`payment_type_desc`	Code → label (`Credit card`, `Cash`, `Unknown`, …)
`rate_code_desc`	Code → label (`Standard rate`, `JFK`, `Unknown`, …)
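A correction keeps the row and rewrites a value. The sketch below shows the tip clamp and a code-to-label lookup. The numeric codes 1 and 2 are assumptions taken from the public TLC data dictionary, the mapping is deliberately partial, and falling back to Unknown for every other code is also an assumption.

```python
# Partial mapping; codes are assumed from the TLC data dictionary.
PAYMENT_LABELS = {1: "Credit card", 2: "Cash"}


def correct_trip(trip: dict) -> dict:
    """Return a copy of the trip with corrected values; the row itself is kept."""
    fixed = dict(trip)
    fixed["tip_amount"] = max(trip["tip_amount"], 0)   # negative tips become 0
    fixed["payment_type_desc"] = PAYMENT_LABELS.get(trip["payment_type"], "Unknown")
    return fixed


corrected = correct_trip({"tip_amount": -1.5, "payment_type": 1})
print(corrected["tip_amount"], corrected["payment_type_desc"])  # 0 Credit card
```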

3. Derived (new columns)

Examples: trip_duration_minutes, fare_per_mile, tip_percentage, avg_speed, pickup_hour, day_of_week, is_weekend, is_peak_hour, time_of_day, distance_band.

Catalog: Data Model — Silver schema.
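The exact formulas live in the Silver schema catalog. As a rough illustration only, a few of the derived columns could be computed as below. Every formula here is an assumption (for instance, tip_percentage is taken relative to fare_amount and avg_speed is in miles per hour), as are the timestamp key names. Division is safe because the Silver filters already guarantee positive distance, fare, and duration.

```python
from datetime import datetime


def derive_columns(trip: dict) -> dict:
    """Compute a subset of the derived columns under assumed definitions."""
    pickup = trip["pickup_datetime"]      # assumed column name
    dropoff = trip["dropoff_datetime"]    # assumed column name
    duration_minutes = (dropoff - pickup).total_seconds() / 60
    return {
        "trip_duration_minutes": round(duration_minutes, 2),
        "fare_per_mile": round(trip["fare_amount"] / trip["trip_distance"], 2),
        "tip_percentage": round(trip["tip_amount"] / trip["fare_amount"] * 100, 2),
        "avg_speed": round(trip["trip_distance"] / (duration_minutes / 60), 2),
        "pickup_hour": pickup.hour,
        "is_weekend": pickup.weekday() >= 5,   # Saturday = 5, Sunday = 6
    }


derived = derive_columns({
    "pickup_datetime": datetime(2024, 10, 5, 8, 0),    # a Saturday
    "dropoff_datetime": datetime(2024, 10, 5, 8, 30),
    "trip_distance": 5.0,
    "fare_amount": 20.0,
    "tip_amount": 4.0,
})
print(derived["trip_duration_minutes"], derived["avg_speed"], derived["is_weekend"])
# 30.0 10.0 True
```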

4. Joined / enriched

LEFT JOIN taxi_zone_lookup on pickup_location_id and dropoff_location_id:

pickup_zone, pickup_borough, pickup_service_zone
dropoff_zone, dropoff_borough, dropoff_service_zone
is_same_borough

Join succeeds for TLC zone IDs 1–265 (including catch-all zones below).

5. Kept on purpose (not filtered in Silver)

LocationID	Borough	Zone	Why you may see it in Gold / Power BI
264	`Unknown`	`N/A`	Valid TLC “unknown zone” bucket
265	`N/A`	Outside of NYC	Valid out-of-NYC bucket
1	`EWR`	Newark Airport	Valid airport zone — not a NYC borough

True null zones (ID not in lookup, e.g. corrupt 0) → pickup_zone IS NULL — tracked in kpi_data_quality_metrics; excluded from some Gold KPIs (kpi_top_pickup_zones, kpi_borough_analysis when borough is null).
Power BI Map page: filter EWR, Unknown, and N/A on pickup_borough if you want NYC boroughs only — optional visual filter, not a pipeline change.

Detail: Data Model — Zone lookup edge cases.
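The join behaviour described in sections 4 and 5 can be sketched with a dictionary lookup: known IDs, including the catch-all zones, resolve to a borough and zone, while an ID missing from the lookup yields nulls. Only the three lookup rows from the table above are included, and treating is_same_borough as false when either side is missing is an assumption.

```python
# (borough, zone) for the three rows listed in the table above.
ZONE_LOOKUP = {
    1: ("EWR", "Newark Airport"),
    264: ("Unknown", "N/A"),
    265: ("N/A", "Outside of NYC"),
}


def enrich(trip: dict) -> dict:
    """Mimic the LEFT JOIN: unmatched location IDs produce None, not a dropped row."""
    pickup = ZONE_LOOKUP.get(trip["pickup_location_id"])
    dropoff = ZONE_LOOKUP.get(trip["dropoff_location_id"])
    enriched = dict(trip)
    enriched["pickup_borough"], enriched["pickup_zone"] = pickup or (None, None)
    enriched["dropoff_borough"], enriched["dropoff_zone"] = dropoff or (None, None)
    # Assumption: a missing borough on either side counts as "not the same".
    enriched["is_same_borough"] = (
        pickup is not None and dropoff is not None and pickup[0] == dropoff[0]
    )
    return enriched


corrupt = enrich({"pickup_location_id": 0, "dropoff_location_id": 265})
print(corrupt["pickup_zone"], corrupt["dropoff_zone"])  # None Outside of NYC
```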

6. Silver → Gold → Power BI

Bronze (raw) → Silver cleaned → Silver enriched → 12 Gold kpi_* tables → Power BI

Gold aggregates enriched trips (hour, zone, borough, payment, …). It does not re-apply Silver quality rules — rows in silver_nyc_taxi_enriched can include Unknown / N/A boroughs unless a Gold model filters them. Per-table business meaning and design: Data Model — KPI catalog.

Cleanup

Snowflake: all dbt-created schemas ({ATTENDEE_ID}_DBT and {ATTENDEE_ID}_DBT_GOLD) are removed by the central cleanup script, so no separate step is needed.

Workspaces SQL file: run snowflake/sql/setup/99_cleanup.sql
Workspaces notebook: open 00_setup/99_cleanup and click Run all

Databricks: drop dbt schemas if needed:

DROP SCHEMA IF EXISTS {attendee_id}_dbt CASCADE;
DROP SCHEMA IF EXISTS {attendee_id}_dbt_gold CASCADE;  -- gold/ml models (see Step 5)

Return to module

Module 4 — story wrapper
