Exercise: Databricks Pipeline

YellowLine NYC story · full hands-on lab

Estimated time: 30–35 min (Base: 20 min · Stretch: 10–15 min)

YellowLine NYC context (Module 2)

Elena approved the Databricks prototype. Ingest ADLS2 Parquet into Bronze, clean in Silver, materialize Priya’s 12 Gold KPIs.

Working Environment

Use GitHub Codespaces for dbt, Snowflake scripts, and file editing — open your fork → Code → Codespaces → Create codespace on main.

Databricks notebooks run in your workspace (not Codespaces). Get them there using either:

Normal path: Git folder sync from your fork (editable copy in Home)
Fallback: manual import or facilitator shared folder if Git integration is blocked

GitHub blocked? (emergency only)

The normal path is fork + Codespace (Prerequisites § Step 2). Use Lab source files only if your facilitator approved it — e.g. you cannot create or use GitHub before class.

Before you start

Complete once before Bronze ingestion:

Databricks access + notebooks — Git folder from your fork (or import / facilitator fallback)
Workspace setup below — cluster Running, 00_setup.py Run all, schemas in Catalog
Git → Pull on your folder after facilitator updates

Official reference: Git folders in Databricks.

Cluster vs SQL warehouse

	All-purpose cluster (`de-workshop-…`)	SQL warehouse (`de-workshop-wh`)	Serverless (Connect)
Used for	PySpark notebooks `00`–`04` (Modules 2–3; parts of Module 6)	dbt on Databricks (Module 4), Genie (Module 6)	Databricks-managed Spark — default for many notebooks
Connect	Notebook Connect → cluster	SQL editor or `DATABRICKS_HTTP_PATH` in `.env` (Exercise: dbt § `.env`)	Notebook Connect → Serverless

Module 2 medallion labs attach a classic all-purpose cluster from Connect — not Serverless and not the SQL warehouse. This lab configures ADLS2 with a storage account key in 00_setup.py; production serverless pipelines use Unity Catalog external locations instead (Module 2 § Lab compute).

Create your cluster

Create and Start before 00_setup.py:

In the sidebar (left), click Compute
Click Create compute (top of the Compute page)
Policy: select Workshop from the dropdown
Name: de-workshop-{attendee_id} · Runtime: 16.4 LTS (min 15.4 LTS) · Workers: 1 · Auto termination: 30 min
Wait for the status badge to show Running (green dot, ~2 min) → attach in notebook Connect (toolbar top)

If the cluster auto-terminates while idle: sidebar (left) → Compute → Start → re-attach in Connect (toolbar top). Terminate it yourself when you finish the lab (see Cleanup).

Run `00_setup.py`

In the sidebar (left), open Home → your Git folder → databricks/notebooks/00_setup.py
Connect (toolbar top, dropdown) → select your running cluster (not SQL warehouse)
Set ATTENDEE_ID (not tr_01_trainer) and STORAGE_ACCOUNT_KEY (facilitator provides — never commit to Git)
Run all → confirm {attendee_id}_bronze, _silver, _gold in Catalog

Each notebook has its own Spark session, so 00_setup saves your config to _workshop.user_config. Re-run 00_setup after a cluster restart or after a Git pull that updates the bootstrap. Run all is idempotent, so re-running it is safe.

ADLS2 key

STORAGE_ACCOUNT_KEY is sensitive and lives only in the notebook session. Your facilitator provides it during Module 2.

Prerequisites

Databricks workspace access (setup guide)
Notebooks in workspace (Git folder, shared copy, or import) — see Before you start
00_setup.py completed (schemas created, ADLS2 configured)

Base Exercise

Step 1: Bronze Ingestion

Attach the same cluster you used for 00_setup — Connect dropdown (toolbar top of the notebook).

Note

Why store raw data in Bronze? The medallion architecture separates ingestion (Bronze) from cleaning (Silver) and business logic (Gold). Bronze stores data exactly as it arrived — no filtering, no transforms. This means you can always re-process from Bronze if a Silver quality rule changes, without re-ingesting from the source.

In your Git folder (sidebar → Home), open databricks/notebooks/01_bronze_ingestion.py
Run all cells (toolbar top)
Verify:

Note

Re-run safe: Notebooks 01–03 write with overwrite — Run all again replaces tables (no duplicate rows). Wide samples use preview_df() from _workshop_bootstrap instead of .show() for readable output.

SELECT COUNT(*) FROM mhpdeworkshop_databricks_2026.{attendee_id}_bronze.nyc_taxi_trips;
-- Expected: 3,646,319 rows (Oct 2024 workshop month)

SELECT COUNT(*) FROM mhpdeworkshop_databricks_2026.{attendee_id}_bronze.taxi_zone_lookup;
-- Expected: 265 rows

Step 2: Silver Cleaning

In your Git folder, open databricks/notebooks/02_silver_cleaning.py
Run all cells
Verify:

Note

Why do ~14% of rows disappear? The Silver layer applies quality filters: passenger_count between 1 and 8, fare_amount > 0, trip_distance > 0, and similar rules. For Oct 2024, 499,609 of 3,646,319 Bronze rows are removed (13.7%); 3,146,710 remain. These filters remove corrupt or impossible records that would distort Gold KPIs. The key insight is that Silver is where data quality decisions live: change these filters and every downstream KPI changes.

-- Check row reduction (Oct 2024: 3,646,319 → 3,146,710; 499,609 removed, 13.7%)
SELECT
    (SELECT COUNT(*) FROM mhpdeworkshop_databricks_2026.{attendee_id}_bronze.nyc_taxi_trips) AS bronze_count,
    (SELECT COUNT(*) FROM mhpdeworkshop_databricks_2026.{attendee_id}_silver.silver_nyc_taxi_cleaned) AS silver_count;

-- Check derived columns exist
SELECT trip_duration, fare_per_mile, time_of_day, payment_type_desc
FROM mhpdeworkshop_databricks_2026.{attendee_id}_silver.silver_nyc_taxi_cleaned
LIMIT 5;

-- Check zone enrichment
SELECT pickup_zone, pickup_borough, dropoff_zone, dropoff_borough
FROM mhpdeworkshop_databricks_2026.{attendee_id}_silver.silver_nyc_taxi_enriched
WHERE pickup_zone IS NOT NULL
LIMIT 5;
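The percentage quoted in the Note for this step follows directly from the two counts returned by the row-reduction query. A minimal sketch of that arithmetic, with the Oct 2024 counts from this page hard-coded (the variable names are illustrative and not taken from the notebooks):

```python
# Row counts quoted on this page for the Oct 2024 workshop month.
bronze_count = 3_646_319
silver_count = 3_146_710

# Rows dropped by the Silver quality filters, and their share of Bronze.
removed = bronze_count - silver_count
removed_pct = 100 * removed / bronze_count

print(f"removed rows:  {removed:,}")
print(f"removed share: {removed_pct:.1f}%")
```

If your own counts differ, substitute them for the two constants to see how far your run deviates from the expected 13.7%.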

Step 3: Gold KPIs

In your Git folder, open databricks/notebooks/03_gold_kpis.py
Run all cells
Verify:

Note

Why 12 small tables instead of one big dashboard table? Each Gold table answers one business question (trips by hour, top zones, payment mix). This pattern — one table per KPI — means each can be refreshed independently, tested independently, and consumed by different downstream tools (Power BI, APIs, exports) without coupling.

-- Check a few KPI tables
SELECT * FROM mhpdeworkshop_databricks_2026.{attendee_id}_gold.kpi_trips_by_hour ORDER BY pickup_hour;
SELECT * FROM mhpdeworkshop_databricks_2026.{attendee_id}_gold.kpi_top_pickup_zones ORDER BY trip_rank;
SELECT * FROM mhpdeworkshop_databricks_2026.{attendee_id}_gold.kpi_payment_type_analysis ORDER BY total_trips DESC;

Expected Results

Table	Expected Rows	Notes
nyc_taxi_trips	3,646,319	Raw Parquet, no filters (Oct 2024)
taxi_zone_lookup	265	Reference data
silver_nyc_taxi_cleaned	3,146,710	After quality filters (499,609 removed)
silver_nyc_taxi_enriched	3,146,710	Same count, with zones
kpi_trips_by_hour	~48–96	Hour × weekday/weekend × peak/off-peak slices (not one row per hour)
kpi_trips_by_day	7	One row per weekday (if all days appear in the month)
kpi_top_pickup_zones	20	Top 20 zones

Stretch Goals

A: Add Custom KPI

In 03_gold_kpis.py, add a new cell after df is loaded from Silver enriched:

kpi_tip_by_borough = (
    df
    .filter(col("pickup_borough").isNotNull())
    .groupBy("pickup_borough")
    .agg(
        count("*").alias("total_trips"),
        round(avg("tip_amount"), 2).alias("avg_tip"),
        round(avg("tip_percentage"), 2).alias("avg_tip_pct"),
    )
)
kpi_tip_by_borough.write.mode("overwrite").saveAsTable(f"{CATALOG_NAME}.{GOLD_SCHEMA}.kpi_tip_by_borough")

(count, avg, round, and col come from _workshop_bootstrap.)

B: Explore Delta Lake Features

-- Time travel: see the table as it was before your latest write
SELECT * FROM mhpdeworkshop_databricks_2026.{attendee_id}_bronze.nyc_taxi_trips VERSION AS OF 0 LIMIT 5;

-- History: see all operations on the table
DESCRIBE HISTORY mhpdeworkshop_databricks_2026.{attendee_id}_silver.silver_nyc_taxi_cleaned;

-- Schema: inspect column types
DESCRIBE TABLE EXTENDED mhpdeworkshop_databricks_2026.{attendee_id}_silver.silver_nyc_taxi_enriched;

C: Modify Quality Rules

Edit 02_silver_cleaning.py and change the passenger filter from .between(1, 8) to .between(1, 6). Re-run and compare the Silver row count: how many additional rows were filtered out?
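Before re-running the notebook, it can help to predict the effect on a handful of values. Spark's between is inclusive at both ends, and a null passenger_count does not satisfy the filter. The sketch below imitates that behaviour in plain Python on made-up passenger counts; the helper and the sample list are hypothetical and are not workshop data.

```python
def between(value, low, high):
    """Inclusive range check; None never passes, like a null in a Spark filter."""
    return value is not None and low <= value <= high


# Hypothetical passenger counts, chosen to hit both boundaries.
passenger_counts = [0, 1, 2, 6, 7, 8, 9, None]

kept_1_8 = [p for p in passenger_counts if between(p, 1, 8)]
kept_1_6 = [p for p in passenger_counts if between(p, 1, 6)]

# Rows that survive the original rule but not the stricter one.
additionally_filtered = len(kept_1_8) - len(kept_1_6)

print(kept_1_8)
print(kept_1_6)
print(additionally_filtered)
```

In the real table the same subtraction (old Silver count minus new Silver count) answers the question in this stretch goal.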

Discussion questions

These wrap up the Databricks lab — your facilitator may use them to open the floor.

You ingested data with spark.read.parquet() from ADLS2 directly. Snowflake required a COPY INTO with an external stage. What does the Databricks approach simplify, and what does it make harder to govern?
The cluster kept running after your pipeline finished. What cost and operational decisions does this create for a daily batch job?
Delta Lake gave you time travel and DESCRIBE HISTORY for free. In which situations would you use version history in a production pipeline?

Ready to compare all three tools?

The full cross-tool observation table, the “what you should have noticed” insights, and the tool-choice discussion questions live on the Batch Pipeline Comparison page — fill them in once after running every pipeline (Databricks, Snowflake, dbt).

Cleanup

When you are finished exploring, terminate your cluster to avoid ongoing charges:

Navigate to Compute in the Databricks sidebar
Select your cluster → Terminate

(Optional) Drop your schemas to free Unity Catalog storage (use your catalog and attendee id):

DROP SCHEMA IF EXISTS mhpdeworkshop_databricks_2026.{attendee_id}_pipeline CASCADE;
DROP SCHEMA IF EXISTS mhpdeworkshop_databricks_2026.{attendee_id}_gold CASCADE;
DROP SCHEMA IF EXISTS mhpdeworkshop_databricks_2026.{attendee_id}_silver CASCADE;
DROP SCHEMA IF EXISTS mhpdeworkshop_databricks_2026.{attendee_id}_bronze CASCADE;

Reference — What the Silver layer did (read later)

Not part of the lab steps. Use this when interpreting Gold KPIs or comparing row counts across layers.

Silver outputs

Table	Contents
`silver_nyc_taxi_cleaned`	Quality filters, corrections, derived columns — before zone join
`silver_nyc_taxi_enriched`	Cleaned trips plus zone lookup — Gold KPIs read this table

1. Filtered out (rows removed)

Silver drops trips that would distort KPIs. Workshop month (Oct 2024): 499,609 of 3,646,319 Bronze rows removed (13.7%); 3,146,710 remain in Silver. See Workshop dataset volumes.

Rule	Condition
Valid timestamps	Pickup and dropoff not null; pickup ≤ dropoff
Reasonable duration	Trip ≤ 24 hours (1–1440 minutes)
Positive distance	`trip_distance > 0`
Positive fare & total	`fare_amount > 0`, `total_amount > 0`
Valid passengers	`passenger_count` between 1 and 8
Airport sanity	No zero-distance trips with airport fee
Deduplication	Exact duplicate rows removed

Databricks only: an extra cross-month filter removes adjacent-month outliers found in the raw Parquet.

Full list: Data Model — Data Quality Rules.
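To make the rule table concrete, here is a minimal plain-Python sketch of the same checks applied to one trip at a time. It is not the code in 02_silver_cleaning.py: the keys pickup_ts and dropoff_ts, the function names, and the treatment of missing values as failures are all assumptions made for illustration. The airport rule is not coded separately because, in this sketch, the positive-distance check already rejects every zero-distance trip.

```python
from datetime import datetime


def passes_silver_rules(trip):
    """Return True if a trip dict satisfies the rules in the table (sketch only)."""
    pickup, dropoff = trip.get("pickup_ts"), trip.get("dropoff_ts")
    # Valid timestamps: both present and pickup not after dropoff.
    if pickup is None or dropoff is None or pickup > dropoff:
        return False
    # Reasonable duration: 1 to 1440 minutes.
    minutes = (dropoff - pickup).total_seconds() / 60
    if not 1 <= minutes <= 1440:
        return False
    # Positive distance, fare and total; a missing value fails (assumption).
    for key in ("trip_distance", "fare_amount", "total_amount"):
        if (trip.get(key) or 0) <= 0:
            return False
    # Valid passengers: between 1 and 8 inclusive.
    passengers = trip.get("passenger_count")
    return passengers is not None and 1 <= passengers <= 8


def deduplicate(trips):
    """Drop exact duplicate rows while keeping first-seen order."""
    seen, unique = set(), []
    for trip in trips:
        key = tuple(sorted(trip.items()))
        if key not in seen:
            seen.add(key)
            unique.append(trip)
    return unique


# Made-up sample rows, not workshop data.
good = {
    "pickup_ts": datetime(2024, 10, 1, 8, 0),
    "dropoff_ts": datetime(2024, 10, 1, 8, 20),
    "trip_distance": 3.2,
    "fare_amount": 18.4,
    "total_amount": 24.1,
    "passenger_count": 1,
}
zero_distance = {**good, "trip_distance": 0.0}
too_long = {**good, "dropoff_ts": datetime(2024, 10, 3, 8, 0)}

cleaned = [t for t in deduplicate([good, good, zero_distance, too_long])
           if passes_silver_rules(t)]
print(len(cleaned))
```

Of the four input rows, one is an exact duplicate, one has zero distance, and one lasts two days, so a single row remains.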

2. Corrected (row kept, value fixed)

Field	Change
`tip_amount`	Negative tips set to 0
`payment_type_desc`	Code → label (`Credit card`, `Cash`, `Unknown`, …)
`rate_code_desc`	Code → label (`Standard rate`, `JFK`, `Unknown`, …)
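A correction keeps the row and rewrites a value. The sketch below shows the two kinds of fix from the table on a single trip dict. It is an illustration under stated assumptions: the input key payment_type and the numeric codes are assumed, the dictionary is deliberately partial (only labels named on this page), and the fallback to Unknown for any other code applies to this sketch only, not necessarily to the notebook's full mapping.

```python
# Deliberately partial mapping: only labels named on this page.
# The numeric codes are an assumption, not copied from the notebook.
PAYMENT_LABELS = {1: "Credit card", 2: "Cash"}


def correct(trip):
    """Return a corrected copy of the trip; the input dict is left unchanged."""
    fixed = dict(trip)
    # Negative tips are set to 0 rather than dropping the row.
    fixed["tip_amount"] = max(trip["tip_amount"], 0.0)
    # Code -> label; codes outside the partial mapping become "Unknown" here.
    fixed["payment_type_desc"] = PAYMENT_LABELS.get(trip["payment_type"], "Unknown")
    return fixed


# Made-up sample row.
corrected = correct({"tip_amount": -2.5, "payment_type": 1})
print(corrected)
```

The row count is unchanged by this step, which is why corrections do not appear in the Bronze-to-Silver row reduction.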

3. Derived (new columns)

Examples: trip_duration_minutes, fare_per_mile, tip_percentage, avg_speed, pickup_hour, day_of_week, is_weekend, is_peak_hour, time_of_day, distance_band.

Catalog: Data Model — Silver schema.
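As a rough illustration of what a derived column is, the sketch below computes four of the listed columns for one trip. The formulas are plausible guesses and not the notebook's definitions; the authoritative ones are in the Silver schema catalog. The keys pickup_ts and dropoff_ts are hypothetical, and is_weekend is assumed to mean Saturday or Sunday.

```python
from datetime import datetime


def derive(trip):
    """Add a few derived columns to a trip dict (assumed formulas, sketch only)."""
    minutes = (trip["dropoff_ts"] - trip["pickup_ts"]).total_seconds() / 60
    return {
        **trip,
        "trip_duration_minutes": round(minutes, 2),
        "fare_per_mile": round(trip["fare_amount"] / trip["trip_distance"], 2),
        "pickup_hour": trip["pickup_ts"].hour,
        # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        "is_weekend": trip["pickup_ts"].weekday() >= 5,
    }


# Made-up sample row; 5 Oct 2024 was a Saturday.
derived = derive({
    "pickup_ts": datetime(2024, 10, 5, 18, 0),
    "dropoff_ts": datetime(2024, 10, 5, 18, 30),
    "fare_amount": 21.0,
    "trip_distance": 6.0,
})
print(derived["trip_duration_minutes"], derived["fare_per_mile"],
      derived["pickup_hour"], derived["is_weekend"])
```

Dividing by trip_distance is safe here only because the Silver filters already require a positive distance.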

4. Joined / enriched

LEFT JOIN taxi_zone_lookup on pickup_location_id and dropoff_location_id:

pickup_zone, pickup_borough, pickup_service_zone
dropoff_zone, dropoff_borough, dropoff_service_zone
is_same_borough

Join succeeds for TLC zone IDs 1–265 (including catch-all zones below).
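The effect of a LEFT JOIN is that every trip is kept, and the zone columns are simply empty when an ID has no match. The plain-Python sketch below mimics that with a dictionary lookup. The three lookup entries come from the catch-all table on this page; the function name, the omission of the service_zone columns, and the choice to treat is_same_borough as False when a borough is missing are assumptions of the sketch.

```python
# Tiny stand-in for taxi_zone_lookup: LocationID -> (borough, zone).
ZONE_LOOKUP = {
    1: ("EWR", "Newark Airport"),
    264: ("Unknown", "N/A"),
    265: ("N/A", "Outside of NYC"),
}


def enrich(trip):
    """Left-join style enrichment: unmatched IDs yield None, the row is kept."""
    pickup = ZONE_LOOKUP.get(trip["pickup_location_id"], (None, None))
    dropoff = ZONE_LOOKUP.get(trip["dropoff_location_id"], (None, None))
    return {
        **trip,
        "pickup_borough": pickup[0],
        "pickup_zone": pickup[1],
        "dropoff_borough": dropoff[0],
        "dropoff_zone": dropoff[1],
        # Assumption: a missing borough on either side counts as "not the same".
        "is_same_borough": pickup[0] is not None and pickup[0] == dropoff[0],
    }


# Made-up trips: one with a corrupt pickup ID (0), one airport-to-airport.
unmatched = enrich({"pickup_location_id": 0, "dropoff_location_id": 265})
airport = enrich({"pickup_location_id": 1, "dropoff_location_id": 1})
print(unmatched["pickup_zone"], unmatched["dropoff_zone"])
print(airport["is_same_borough"])
```

The first trip survives the join with pickup_zone set to None, which is the "true null zone" case, while ID 265 resolves to a real lookup row.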

5. Kept on purpose (not filtered in Silver)

LocationID	Borough	Zone	Why you may see it in Gold / Power BI
264	`Unknown`	`N/A`	Valid TLC “unknown zone” bucket
265	`N/A`	Outside of NYC	Valid out-of-NYC bucket
1	`EWR`	Newark Airport	Valid airport zone — not a NYC borough

True null zones (ID not in lookup, e.g. corrupt 0) → pickup_zone IS NULL — tracked in kpi_data_quality_metrics; excluded from some Gold KPIs (kpi_top_pickup_zones, kpi_borough_analysis when borough is null).
Power BI Map page: filter EWR, Unknown, and N/A on pickup_borough if you want NYC boroughs only — optional visual filter, not a pipeline change.

Detail: Data Model — Zone lookup edge cases.

6. Silver → Gold → Power BI

Bronze (raw) → Silver cleaned → Silver enriched → 12 Gold kpi_* tables → Power BI

Gold aggregates enriched trips (hour, zone, borough, payment, …). It does not re-apply Silver quality rules — rows in silver_nyc_taxi_enriched can include Unknown / N/A boroughs unless a Gold model filters them. Per-table business meaning and design: Data Model — KPI catalog.

Return to module

Module 2 — story wrapper
