Exercise: Databricks Pipeline

YellowLine NYC story · full hands-on lab

title: “Exercise: Databricks Pipeline” subtitle: “YellowLine NYC story · full hands-on lab” —

Estimated time: 30–35 min (Base: 20 min · Stretch: 10–15 min)

YellowLine NYC context (Module 2)

Elena approved the Databricks prototype. Ingest ADLS2 Parquet into Bronze, clean in Silver, materialize Priya’s 12 Gold KPIs.

TipWorking Environment

Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. All workshop files are at /workspace/. Notebooks are uploaded to your Databricks workspace by the trainer.

Prerequisites

  • Databricks workspace access
  • 00_setup.py completed (schemas created, ADLS2 configured)

Base Exercise

Step 1: Bronze Ingestion

  1. Open databricks/notebooks/01_bronze_ingestion.py

  2. Run all cells

  3. Verify:

    SELECT COUNT(*) FROM {attendee_id}_bronze.bronze_nyc_taxi_trips;
    -- Expected: ~3 million rows (varies by month)
    
    SELECT COUNT(*) FROM {attendee_id}_bronze.bronze_taxi_zone_lookup;
    -- Expected: 265 rows

Step 2: Silver Cleaning

  1. Open databricks/notebooks/02_silver_cleaning.py

  2. Run all cells

  3. Verify:

    -- Check row reduction (quality filters remove ~5-15% of rows)
    SELECT
        (SELECT COUNT(*) FROM {attendee_id}_bronze.bronze_nyc_taxi_trips) AS bronze_count,
        (SELECT COUNT(*) FROM {attendee_id}_silver.silver_nyc_taxi_cleaned) AS silver_count;
    
    -- Check derived columns exist
    SELECT trip_duration, fare_per_mile, time_of_day, payment_type_desc
    FROM {attendee_id}_silver.silver_nyc_taxi_cleaned
    LIMIT 5;
    
    -- Check zone enrichment
    SELECT pickup_zone, pickup_borough, dropoff_zone, dropoff_borough
    FROM {attendee_id}_silver.silver_nyc_taxi_enriched
    WHERE pickup_zone IS NOT NULL
    LIMIT 5;

Step 3: Gold KPIs

  1. Open databricks/notebooks/03_gold_kpis.py

  2. Run all cells

  3. Verify:

    -- Check a few KPI tables
    SELECT * FROM {attendee_id}_gold.gold_kpi_trips_by_hour ORDER BY pickup_hour;
    SELECT * FROM {attendee_id}_gold.gold_kpi_top_pickup_zones ORDER BY trip_rank;
    SELECT * FROM {attendee_id}_gold.gold_kpi_payment_type_analysis ORDER BY total_trips DESC;

Expected Results

Table Expected Rows Notes
bronze_nyc_taxi_trips ~3M Raw data, no filters
bronze_taxi_zone_lookup 265 Reference data
silver_nyc_taxi_cleaned ~2.5-2.8M After quality filters
silver_nyc_taxi_enriched ~2.5-2.8M Same count, with zones
gold_kpi_trips_by_hour 24 One row per hour
gold_kpi_trips_by_day 7 One row per day
gold_kpi_top_pickup_zones 20 Top 20 zones

Stretch Goals

A: Add Custom KPI

Create a new Gold table: average tip by borough

kpi_tip_by_borough = (
    enriched_df
    .filter(col("pickup_borough").isNotNull())
    .groupBy("pickup_borough")
    .agg(
        F.count("*").alias("total_trips"),
        F.round(F.avg("tip_amount"), 2).alias("avg_tip"),
        F.round(F.avg("tip_percentage"), 2).alias("avg_tip_pct")
    )
)
kpi_tip_by_borough.write.mode("overwrite").saveAsTable(f"{GOLD_SCHEMA}.gold_kpi_tip_by_borough")

B: Explore Delta Lake Features

-- Time travel: see the table as it was before your latest write
SELECT * FROM {attendee_id}_bronze.bronze_nyc_taxi_trips VERSION AS OF 0 LIMIT 5;

-- History: see all operations on the table
DESCRIBE HISTORY {attendee_id}_silver.silver_nyc_taxi_cleaned;

-- Schema: inspect column types
DESCRIBE TABLE EXTENDED {attendee_id}_silver.silver_nyc_taxi_enriched;

C: Modify Quality Rules

Edit 02_silver_cleaning.py and change the passenger count threshold from 8 to 6. Re-run and compare the Silver row count — how many rows were additionally filtered?


Compare Your Results

Complete this after you have also run the Snowflake exercise and dbt exercise. Filling in the table from memory is fine — the goal is to notice the differences.

Same pipeline, three tools

What you observed Databricks (PySpark) Snowflake (SQL) Snowflake (Snowpark) dbt
Language Python (PySpark) SQL Python (Snowpark API) SQL (Jinja)
Bronze row count ~3M ~3M ~3M ~3M
Silver row count ~2.5–2.8M ~2.5–2.8M ~2.5–2.8M ~2.5–2.8M
Gold KPI count (trips_by_hour) 24 rows 24 rows 24 rows 24 rows
How long did Silver take? ______ ______ ______ ______
Where did you see the output? Unity Catalog Snowsight Snowsight dbt CLI + target DB
How did you write output? .write.saveAsTable() CREATE TABLE AS SELECT .write.save_as_table() Materialisation config
Did you need to manage compute? Cluster size matters Warehouse size matters Warehouse size matters Delegated to DB
How were tests run? Manual SELECT COUNT(*) Manual SELECT COUNT(*) Manual SELECT COUNT(*) dbt test (automated)
TipThings you should have noticed

1. The Gold KPI numbers are identical across all tools — 24 rows for trips_by_hour, the same top pickup zones, the same borough distribution. The platform does not change the data — only how you express the transform and where compute runs.

2. PySpark and Snowpark look almost the samedf.groupBy().agg(), df.filter(col(...) > 0), df.write.saveAsTable() — the API surface is deliberately similar. The key difference: Snowpark runs on Snowflake’s compute, PySpark runs on Spark clusters.

3. dbt was the only tool that ran automated testsdbt test validated nulls, row counts, and accepted values automatically. In Databricks and Snowflake you had to manually write SELECT COUNT(*) queries and visually check results.

4. dbt needed Bronze tables to already exist — it sources from whichever Bronze tables were created by Databricks or Snowflake. dbt is a transformation layer, not an ingestion layer.

5. Snowflake SQL was the most concise — a single CREATE TABLE AS SELECT statement for each layer. PySpark needed type handling, .cache(), and explicit .write calls. SQL is more declarative; PySpark gives more control.

Discussion questions

  1. All three produced identical KPI numbers. Given that, what criteria would you use to choose a tool for a new project?
  2. PySpark and Snowpark have almost the same API. Why did Snowflake design it that way? Who is the target audience?
  3. dbt ran tests automatically; Databricks and Snowflake required manual verification. What does this mean for a team shipping pipelines to production daily?
  4. Databricks needed a running cluster; Snowflake auto-suspended the warehouse. What does this imply about cost models for a pipeline that runs once per day?
NoteReady to compare all three tools?

Head to the Batch Pipeline Comparison page for a full side-by-side breakdown of Databricks, Snowflake, and dbt — verified against official documentation — including storage formats, compute models, cost implications, and key gotchas.


Cleanup

When you are finished exploring, stop your cluster to avoid ongoing charges:

  1. Navigate to Compute in the Databricks sidebar

  2. Select your cluster → Terminate

  3. (Optional) Drop your schemas to free Unity Catalog storage:

    DROP SCHEMA IF EXISTS {attendee_id}_gold CASCADE;
    DROP SCHEMA IF EXISTS {attendee_id}_silver CASCADE;
    DROP SCHEMA IF EXISTS {attendee_id}_bronze CASCADE;

Return to module

Source: merged from frozen workshop-2026-v1/exercises/ex-databricks.qmd — do not edit workshop-2026-v1/.