Exercise: Batch Comparison

YellowLine NYC story · full hands-on lab

Estimated time: 10–15 min (fill table from memory: 5 min · discussion: 5–10 min)

YellowLine NYC context (Module 7)

Fill the observation table after Modules 2–4, then discuss.

Note: How to use this page

Complete the Databricks, Snowflake, and dbt exercises first, then come back here.

Fill in the observation table from memory — the exact numbers matter less than noticing where the tools felt different. Your trainer will lead a discussion based on the open questions at the bottom.


What did you observe?

Fill this in after running all three pipelines on the same NYC Taxi data.

| What you ran | Databricks (PySpark) | Snowflake (SQL) | Snowflake (Snowpark) | dbt |
| --- | --- | --- | --- | --- |
| Silver transform: how long? | ______ | ______ | ______ | ______ |
| How long until you could query output? | ______ | ______ | ______ | ______ |
| trips_by_hour row count | ______ | ______ | ______ | ______ |
| Top pickup zone | ______ | ______ | ______ | ______ |
| How did you write the Silver table? | .write.saveAsTable() | CREATE TABLE AS SELECT | .write.save_as_table() | materialized= config |
| How did you verify row counts? | Manual SELECT COUNT(*) | Manual SELECT COUNT(*) | Manual SELECT COUNT(*) | dbt test (automatic) |
| Did the cluster/warehouse need warming up? | ______ | ______ | ______ | N/A |
| Where did you see the output table? | Unity Catalog | Snowsight | Snowsight | dbt CLI + target DB |

Important: The KPI numbers must match across all tools

If your trips_by_hour row counts differ between Databricks and Snowflake, there is a data quality issue. Most common cause: case-sensitive column name mismatch in the Silver filter (FARE_AMOUNT vs fare_amount).
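If you want to check your table for mismatches rather than eyeballing it, the comparison can be sketched in a few lines of plain Python. This is a minimal sketch, not part of the workshop pipelines: the function names `normalize_columns` and `find_mismatches` are hypothetical, and the numbers in `observed` are placeholders to replace with the values from your own observation table.

```python
from typing import Dict, List


def normalize_columns(columns: List[str]) -> List[str]:
    """Lower-case and trim column names so FARE_AMOUNT and fare_amount compare equal."""
    return [c.strip().lower() for c in columns]


def find_mismatches(observed: Dict[str, Dict[str, object]]) -> Dict[str, Dict[str, object]]:
    """Return every KPI whose value is not identical across all tools."""
    kpis = set()
    for values in observed.values():
        kpis.update(values)

    mismatches = {}
    for kpi in sorted(kpis):
        # A tool that did not report this KPI shows up as None.
        per_tool = {tool: values.get(kpi) for tool, values in observed.items()}
        if len(set(per_tool.values())) > 1:
            mismatches[kpi] = per_tool
    return mismatches


# Placeholder values for illustration only; copy in your own observations.
observed = {
    "databricks": {"trips_by_hour_rows": 24, "top_pickup_zone": "Zone A"},
    "snowflake_sql": {"trips_by_hour_rows": 24, "top_pickup_zone": "Zone A"},
    "dbt": {"trips_by_hour_rows": 23, "top_pickup_zone": "Zone A"},
}

for kpi, per_tool in find_mismatches(observed).items():
    print(f"MISMATCH in {kpi}: {per_tool}")
```

If a KPI is flagged, a reasonable first step is to compare the Silver table's column names across tools after passing them through `normalize_columns`, since a filter written against `FARE_AMOUNT` in one tool and `fare_amount` in another is the cause named above.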


Things you probably noticed

These are things nearly every attendee notices without being told. Check which ones you experienced:


Open questions for group discussion

For the trainer: these are designed to be open-ended. There is no single right answer. Use them to surface what attendees found surprising or counterintuitive.


Q1 — All three tools produced the same KPI numbers.

“If the output is identical, what would actually drive your choice between these three tools for a new project at your company?”


Q2 — PySpark and Snowpark have almost the same API.

“Snowflake deliberately designed Snowpark to look like PySpark. Why? Who is the intended audience? Does that change how you would upskill your team?”


Q3 — dbt ran automated tests; Databricks and Snowflake required manual verification.

“In your current team, who would be responsible for data quality checks in the Databricks or Snowflake pipelines? How would you change that before going to production?”


Q4 — Snowflake auto-suspended the warehouse; Databricks kept the cluster running.

“Consider two pipelines: one needs to run every 5 minutes, the other once per day. Which cost model, pay-per-second-active or pay-per-VM-hour, is better for each scenario?”
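To ground the discussion in arithmetic, the two cost models can be sketched as a small calculator. Everything here is an assumption for illustration: the function names, the hourly rates, the run durations, the idle time before suspension, and the 60-second minimum billing increment are hypothetical inputs, not measured values or official pricing. Plug in figures from your own account before drawing conclusions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second_daily_cost(runs_per_day: int,
                          run_seconds: float,
                          rate_per_hour: float,
                          idle_before_suspend_seconds: float = 60,
                          min_billed_seconds: float = 60) -> float:
    """Daily cost when compute is billed only while it is active.

    Assumes each run is followed by some idle time before auto-suspend,
    and that billing never exceeds the interval between two runs
    (at that point the compute simply never suspends).
    """
    interval = SECONDS_PER_DAY / runs_per_day
    active = run_seconds + idle_before_suspend_seconds
    billed_per_run = min(max(active, min_billed_seconds), interval)
    return runs_per_day * billed_per_run / 3600 * rate_per_hour


def vm_hour_daily_cost(uptime_hours: float, rate_per_hour: float) -> float:
    """Daily cost when you pay for every hour the VMs are up, busy or not."""
    return uptime_hours * rate_per_hour


# Hypothetical inputs: the same hourly rate for both models, chosen only
# to make the comparison easy to read.
RATE = 2.0

# Scenario A: every 5 minutes (288 runs/day), 90 s per run.
every_5_min_per_second = per_second_daily_cost(288, 90, RATE)
every_5_min_vm_hour = vm_hour_daily_cost(24, RATE)  # cluster stays up all day

# Scenario B: once per day, 10 min per run, cluster up for 15 min in total.
daily_per_second = per_second_daily_cost(1, 600, RATE)
daily_vm_hour = vm_hour_daily_cost(0.25, RATE)

print(f"every 5 min: per-second={every_5_min_per_second:.2f}, VM-hour={every_5_min_vm_hour:.2f}")
print(f"once a day:  per-second={daily_per_second:.2f}, VM-hour={daily_vm_hour:.2f}")
```

The outcome depends heavily on the inputs. For example, if a run takes nearly the full 5-minute interval, the per-second model never suspends and bills the whole day, and different hourly rates for the two platforms can reverse the comparison. That sensitivity is worth raising in the discussion.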


Q5 — dbt cannot ingest data.

“You want to use dbt for transformation governance and automated testing, but your team also handles ingestion. How would you split responsibilities between dbt and one other tool?”


Q6 — After today’s hands-on exercise:

“What surprised you most? Was there anything that worked better than you expected — or worse?”


Note: Want the full technical deep-dive?

The Module 7: Comparison & Wrap-up page contains the complete reference: full 20-row comparison table, architecture diagram, side-by-side code, key architectural facts from official docs, and all further reading links.


Source: merged from frozen workshop-2026-v1/exercises/ex-batch-comparison.qmd — do not edit workshop-2026-v1/.