Exercise: Databricks Pipeline
YellowLine NYC story · full hands-on lab
title: “Exercise: Databricks Pipeline” subtitle: “YellowLine NYC story · full hands-on lab” —
Estimated time: 30–35 min (Base: 20 min · Stretch: 10–15 min)
YellowLine NYC context (Module 2)
Elena approved the Databricks prototype. Ingest ADLS2 Parquet into Bronze, clean in Silver, materialize Priya’s 12 Gold KPIs.
Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. All workshop files are at /workspace/. Notebooks are uploaded to your Databricks workspace by the trainer.
Prerequisites
- Databricks workspace access
00_setup.pycompleted (schemas created, ADLS2 configured)
Base Exercise
Step 1: Bronze Ingestion
Open
databricks/notebooks/01_bronze_ingestion.pyRun all cells
Verify:
SELECT COUNT(*) FROM {attendee_id}_bronze.bronze_nyc_taxi_trips; -- Expected: ~3 million rows (varies by month) SELECT COUNT(*) FROM {attendee_id}_bronze.bronze_taxi_zone_lookup; -- Expected: 265 rows
Step 2: Silver Cleaning
Open
databricks/notebooks/02_silver_cleaning.pyRun all cells
Verify:
-- Check row reduction (quality filters remove ~5-15% of rows) SELECT (SELECT COUNT(*) FROM {attendee_id}_bronze.bronze_nyc_taxi_trips) AS bronze_count, (SELECT COUNT(*) FROM {attendee_id}_silver.silver_nyc_taxi_cleaned) AS silver_count; -- Check derived columns exist SELECT trip_duration, fare_per_mile, time_of_day, payment_type_desc FROM {attendee_id}_silver.silver_nyc_taxi_cleaned LIMIT 5; -- Check zone enrichment SELECT pickup_zone, pickup_borough, dropoff_zone, dropoff_borough FROM {attendee_id}_silver.silver_nyc_taxi_enriched WHERE pickup_zone IS NOT NULL LIMIT 5;
Step 3: Gold KPIs
Open
databricks/notebooks/03_gold_kpis.pyRun all cells
Verify:
-- Check a few KPI tables SELECT * FROM {attendee_id}_gold.gold_kpi_trips_by_hour ORDER BY pickup_hour; SELECT * FROM {attendee_id}_gold.gold_kpi_top_pickup_zones ORDER BY trip_rank; SELECT * FROM {attendee_id}_gold.gold_kpi_payment_type_analysis ORDER BY total_trips DESC;
Expected Results
| Table | Expected Rows | Notes |
|---|---|---|
| bronze_nyc_taxi_trips | ~3M | Raw data, no filters |
| bronze_taxi_zone_lookup | 265 | Reference data |
| silver_nyc_taxi_cleaned | ~2.5-2.8M | After quality filters |
| silver_nyc_taxi_enriched | ~2.5-2.8M | Same count, with zones |
| gold_kpi_trips_by_hour | 24 | One row per hour |
| gold_kpi_trips_by_day | 7 | One row per day |
| gold_kpi_top_pickup_zones | 20 | Top 20 zones |
Stretch Goals
A: Add Custom KPI
Create a new Gold table: average tip by borough
kpi_tip_by_borough = (
enriched_df
.filter(col("pickup_borough").isNotNull())
.groupBy("pickup_borough")
.agg(
F.count("*").alias("total_trips"),
F.round(F.avg("tip_amount"), 2).alias("avg_tip"),
F.round(F.avg("tip_percentage"), 2).alias("avg_tip_pct")
)
)
kpi_tip_by_borough.write.mode("overwrite").saveAsTable(f"{GOLD_SCHEMA}.gold_kpi_tip_by_borough")B: Explore Delta Lake Features
-- Time travel: see the table as it was before your latest write
SELECT * FROM {attendee_id}_bronze.bronze_nyc_taxi_trips VERSION AS OF 0 LIMIT 5;
-- History: see all operations on the table
DESCRIBE HISTORY {attendee_id}_silver.silver_nyc_taxi_cleaned;
-- Schema: inspect column types
DESCRIBE TABLE EXTENDED {attendee_id}_silver.silver_nyc_taxi_enriched;C: Modify Quality Rules
Edit 02_silver_cleaning.py and change the passenger count threshold from 8 to 6. Re-run and compare the Silver row count — how many rows were additionally filtered?
Compare Your Results
Complete this after you have also run the Snowflake exercise and dbt exercise. Filling in the table from memory is fine — the goal is to notice the differences.
Same pipeline, three tools
| What you observed | Databricks (PySpark) | Snowflake (SQL) | Snowflake (Snowpark) | dbt |
|---|---|---|---|---|
| Language | Python (PySpark) | SQL | Python (Snowpark API) | SQL (Jinja) |
| Bronze row count | ~3M | ~3M | ~3M | ~3M |
| Silver row count | ~2.5–2.8M | ~2.5–2.8M | ~2.5–2.8M | ~2.5–2.8M |
| Gold KPI count (trips_by_hour) | 24 rows | 24 rows | 24 rows | 24 rows |
| How long did Silver take? | ______ | ______ | ______ | ______ |
| Where did you see the output? | Unity Catalog | Snowsight | Snowsight | dbt CLI + target DB |
| How did you write output? | .write.saveAsTable() |
CREATE TABLE AS SELECT |
.write.save_as_table() |
Materialisation config |
| Did you need to manage compute? | Cluster size matters | Warehouse size matters | Warehouse size matters | Delegated to DB |
| How were tests run? | Manual SELECT COUNT(*) |
Manual SELECT COUNT(*) |
Manual SELECT COUNT(*) |
dbt test (automated) |
1. The Gold KPI numbers are identical across all tools — 24 rows for trips_by_hour, the same top pickup zones, the same borough distribution. The platform does not change the data — only how you express the transform and where compute runs.
2. PySpark and Snowpark look almost the same — df.groupBy().agg(), df.filter(col(...) > 0), df.write.saveAsTable() — the API surface is deliberately similar. The key difference: Snowpark runs on Snowflake’s compute, PySpark runs on Spark clusters.
3. dbt was the only tool that ran automated tests — dbt test validated nulls, row counts, and accepted values automatically. In Databricks and Snowflake you had to manually write SELECT COUNT(*) queries and visually check results.
4. dbt needed Bronze tables to already exist — it sources from whichever Bronze tables were created by Databricks or Snowflake. dbt is a transformation layer, not an ingestion layer.
5. Snowflake SQL was the most concise — a single CREATE TABLE AS SELECT statement for each layer. PySpark needed type handling, .cache(), and explicit .write calls. SQL is more declarative; PySpark gives more control.
Discussion questions
- All three produced identical KPI numbers. Given that, what criteria would you use to choose a tool for a new project?
- PySpark and Snowpark have almost the same API. Why did Snowflake design it that way? Who is the target audience?
- dbt ran tests automatically; Databricks and Snowflake required manual verification. What does this mean for a team shipping pipelines to production daily?
- Databricks needed a running cluster; Snowflake auto-suspended the warehouse. What does this imply about cost models for a pipeline that runs once per day?
Head to the Batch Pipeline Comparison page for a full side-by-side breakdown of Databricks, Snowflake, and dbt — verified against official documentation — including storage formats, compute models, cost implications, and key gotchas.
Cleanup
When you are finished exploring, stop your cluster to avoid ongoing charges:
Navigate to Compute in the Databricks sidebar
Select your cluster → Terminate
(Optional) Drop your schemas to free Unity Catalog storage:
DROP SCHEMA IF EXISTS {attendee_id}_gold CASCADE; DROP SCHEMA IF EXISTS {attendee_id}_silver CASCADE; DROP SCHEMA IF EXISTS {attendee_id}_bronze CASCADE;
Return to module
Source: merged from frozen workshop-2026-v1/exercises/ex-databricks.qmd — do not edit workshop-2026-v1/.