Animation fallback scripts
Live narration when module MP4s are unavailable
Use when: The module MP4 (media/modules/mod-XX-*.mp4) will not play — missing file, codec issue, or projector failure.
How to deliver:
- Open the matching module page and show the story callout on screen.
- Read the narration below in scene order (pause at Discuss / Write your recommendation cues).
- Continue with reflection prompts as usual.
Optional modules: Modules 8 and 9 are not part of the main-day set unless you run advanced sessions. When running both, deliver Module 8 before Module 9.
Module 0 — mod-00-welcome.mp4
Marcus
I’m Marcus Chen, Operations Manager at YellowLine NYC. We run yellow taxis across all five boroughs — millions of trips every month.
But here’s my problem: everyone has a different number. Finance uses one spreadsheet. Dispatch uses another. I can’t tell you — with confidence — where we make money, or when we should send drivers where.
Should we add cars in Midtown at six p.m.? Pull them from the airport at midnight? We’re guessing. And while we guess, ride-hail apps don’t.
Narrator
YellowLine NYC doesn’t need another report. They need an analytics foundation.
Elena
Marcus asked MHP for help. I’m Elena Vasquez, Data Architect. We won’t start with tools — we start with architecture and the questions the business actually needs answered.
Bob
I’m Bob. I’ll be building the pipelines — and learning which platforms fit Marcus’s team.
Priya
I’m Priya, BI Analyst. Before Bob writes a line of code, I need to know what Marcus wants on his dashboard.
When are peak revenue hours? Which pickup zones drive volume? Which routes matter? How does revenue vary by borough? And are bad records — null fares, zero-distance trips — skewing our decisions?
James
I’ll validate the logic in SQL once the data is clean. Priya connects the Gold KPIs to Power BI.
Elena
We standardize on medallion architecture. Bronze holds raw trip data exactly as it arrives. Silver cleans, enriches, and joins zone lookups. Gold delivers analytics-ready KPI tables Priya can report from.
Narrator
Same dataset. Clear layers. One path from source to dashboard.
Elena
Then we choose how to build. Lakehouse platforms. Cloud warehouses. Transformation frameworks. Each has strengths. Bob will prototype and we’ll decide what Marcus’s team can run in production.
Bob
Three pipelines. One NYC Taxi dataset. Let’s see what works.
Narrator
Before the first notebook opens — pause and think. You are Bob today.
How would you design the solution for YellowLine NYC?
Module 1 — mod-01-fundamentals.mp4
Elena
Marcus has data. He doesn’t have layers. In medallion architecture, Bronze is raw ingestion — Parquet from Azure, no filters. Silver is where we enforce quality, standardize columns, and join taxi zone lookups. Gold is business-ready — aggregated KPI tables, one per question Priya needs.
Classic ETL transforms before load — great for legacy systems. Modern cloud platforms favor ELT: load raw first, transform in place. That’s our approach. Raw stays reproducible; Silver and Gold can be rebuilt anytime.
Narrator
Today you’ll work with the NYC Taxi dataset — the same public trip records used in data engineering training worldwide.
Priya
My dashboard wireframe is ready. My KPI list is ready. I’m waiting for Bob’s Gold tables. Overview charts, maps, revenue pages — they all depend on the pipeline layers Elena just described.
James
I’ll explore the schema with you before we build — Vendor ID, pickup times, fares, zones — so Silver rules match what Marcus expects.
Elena
Databricks brings Spark and lakehouse scale. Snowflake brings SQL-first analytics. dbt brings transformation as code with tests and lineage. AWS, Cloudera — options in enterprise landscapes. We evaluate against Marcus’s skills, cost, and maintainability — not hype.
Bob
I won’t pick a winner yet. First I need to understand the fundamentals we’re about to use all day.
Narrator
Module 1: fundamentals. After this video — discuss with your partner. What belongs in Silver? What belongs in Gold? Then we go deep on theory and the NYC Taxi schema.
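Trainer note (optional whiteboard sketch): if the room needs something concrete for the Silver versus Gold discussion, the layers Elena describes can be sketched in a few lines of plain Python. This is a minimal illustration, not the lab code. The field names, zone IDs, and sample rows are assumptions made up for the sketch; only the rules (drop null fares and zero-distance trips, join the zone lookup, aggregate per question) come from the script.

```python
from collections import Counter
from datetime import datetime

# Bronze: raw trips exactly as they arrive (hypothetical sample rows).
bronze = [
    {"pickup": "2024-01-05 18:10:00", "zone_id": 161, "fare": 14.5, "distance": 2.1},
    {"pickup": "2024-01-05 18:40:00", "zone_id": 161, "fare": None, "distance": 1.0},
    {"pickup": "2024-01-05 23:55:00", "zone_id": 132, "fare": 52.0, "distance": 0.0},
    {"pickup": "2024-01-06 00:05:00", "zone_id": 132, "fare": 60.0, "distance": 17.3},
]

# Zone lookup used for enrichment (illustrative values).
zones = {161: "Manhattan", 132: "Queens"}


def to_silver(rows):
    """Drop bad records (null fares, zero-distance trips), then add borough and hour."""
    silver = []
    for row in rows:
        if row["fare"] is None or row["distance"] <= 0:
            continue
        pickup = datetime.strptime(row["pickup"], "%Y-%m-%d %H:%M:%S")
        silver.append(
            {**row, "borough": zones.get(row["zone_id"], "Unknown"), "hour": pickup.hour}
        )
    return silver


def to_gold_trips_by_hour(rows):
    """Aggregate Silver into one KPI table: trip count per pickup hour."""
    return dict(Counter(row["hour"] for row in rows))


silver = to_silver(bronze)
gold = to_gold_trips_by_hour(silver)
print(gold)
```

Bronze is never modified, so Silver and Gold can be rebuilt from it at any time, which is the ELT point Elena makes.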
Module 2 — mod-02-databricks.mp4
Bob
Elena, I’ve worked with PySpark before. Databricks is powerful and popular for lakehouse workloads. Can we prototype ingest and transforms there first?
Elena
Approved — for scale and flexibility. Sofia will pair with you. Same medallion design: Bronze, Silver, Gold. Same KPIs Priya needs.
Sofia
Read Parquet directly from Azure ADLS2. Land it in Delta Bronze. Don’t skip raw — Marcus may ask us to reprocess December when January looks wrong.
Narrator
Bronze: external Parquet into Delta Lake. Silver: filter invalid fares, compute trip duration, enrich with borough and zone. Gold: twelve KPI tables — trips by hour, by day, revenue bands, efficiency metrics.
Bob
One dataset. One architecture. Databricks executes the first full path.
Priya
Bob, I just connected kpi_trips_by_hour and kpi_trips_by_day. My Overview page is alive. Marcus can finally see peak hours, but I still need maps and revenue pages from the next layers.
Narrator
Module 2: your hands-on Databricks lab. Think first: how would you ingest Parquet at scale? What breaks if you skip Bronze? Then open the notebooks and build.
Module 3 — mod-03-snowflake.mp4
Marcus
Bob, this works. I believe the numbers. But my team lives in SQL. They’re not going to maintain PySpark notebooks after MHP leaves. Is there a path with less programming — something they can extend?
Elena
Fair constraint. Bob — rebuild the same medallion pipeline on Snowflake. SQL-first. Snowpark optional for engineers who want Python.
Narrator
Identical business logic. Identical Gold table names. Different platform: external stage from ADLS2, COPY INTO Bronze, SQL transforms for Silver and Gold. Same architecture. Different implementation philosophy.
Bob
Marcus’s analysts can read every transform in a worksheet. That’s the maintainability test.
Sofia
Don’t reinvent the rules — port the Silver filters from PySpark line by line. Consistency matters for Priya’s dashboard.
Priya
I pointed the same Power BI report at Snowflake Gold. kpi_borough_analysis, kpi_top_pickup_zones: the Map page works unchanged. Identical KPIs. Different engine. That’s the point.
Narrator
Module 3: Snowflake lab. Discuss: who maintains this after MHP? How do you load Parquet without Spark? Then build Bronze → Silver → Gold in SQL.
Module 4 — mod-04-dbt.mp4
Marcus
Priya’s dashboard is exactly what I wanted. But my board asked a harder question: Where does each number come from? We need documentation. Data lineage. And tests — so an analyst’s Friday edit doesn’t break Monday’s revenue tile silently.
Elena
We keep Snowflake as the engine. We add dbt for transformations — SQL models, automated tests, and generated docs. Bob, dbt is not a replacement warehouse. It’s how we govern the transform layer.
Bob
Every Gold model references Silver with ref(). Tests catch null keys and negative fares. dbt docs generate shows Marcus’s auditors the full chain, from source Parquet to dashboard tile.
Narrator
dbt runs on Snowflake. Same KPIs. Added discipline.
Priya
Revenue by hour, payment type breakdown — connected. And kpi_data_quality_metrics feeds my quality scorecard. When Marcus asks “can we trust this data?”, I show him the number and the lineage behind it.
Elena
The same dbt models can target Databricks or Snowflake. Today we focus on Snowflake. The pattern travels with you.
Narrator
Module 4: dbt lab. Discuss: how do you prove where a KPI comes from? What tests would you add? Then run models, tests, and generate the docs Marcus’s board wants.
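Trainer note (optional): for the "what tests would you add?" discussion, the two checks Bob mentions can be sketched outside dbt. A dbt data test passes when its query returns zero failing rows; the plain-Python stand-in below follows the same convention. It is not dbt syntax, and the column names trip_id and fare_amount and the sample rows are assumptions for illustration, not the lab models.

```python
def null_key_failures(rows, key):
    """Return rows whose key is missing, mirroring a not-null check."""
    return [row for row in rows if row.get(key) is None]


def negative_fare_failures(rows, column):
    """Return rows whose fare value is present but negative."""
    return [row for row in rows if row[column] is not None and row[column] < 0]


# Hypothetical Silver rows; one null key and one negative fare are planted.
silver_rows = [
    {"trip_id": 1, "fare_amount": 12.0},
    {"trip_id": None, "fare_amount": 8.5},
    {"trip_id": 3, "fare_amount": -4.0},
]

failures = {
    "not_null_trip_id": null_key_failures(silver_rows, "trip_id"),
    "non_negative_fare": negative_fare_failures(silver_rows, "fare_amount"),
}

# A check passes only when it finds no failing rows.
for name, failing in failures.items():
    status = "PASS" if not failing else f"FAIL ({len(failing)} rows)"
    print(f"{name}: {status}")
```

Returning the failing rows, rather than a bare true or false, gives the analyst something to inspect when Monday's revenue tile looks wrong.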
Module 5 — mod-05-production.mp4
Elena
The pipeline works when you’re in the room. Marcus needs it to work when you’re not. What runs every night? What happens when Silver fails at two a.m.? Who gets paged?
Marcus
I don’t need heroes. I need schedules, retries, and alerts.
Narrator
Databricks Workflows and Delta Live Tables. Snowflake Tasks and Streams. dbt in CI/CD with pull request checks. Incremental loads instead of full rewrites. Idempotent jobs — safe to rerun.
Bob
Production isn’t more code. It’s the same code — with reliability wrapped around it.
Narrator
Module 5: production patterns. Discuss: what breaks in production that never breaks in class? How would you schedule tonight’s run on each platform?
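Trainer note (optional): if trainees ask what "idempotent, safe to rerun" looks like in practice, the sketch below may help. It assumes a keyed upsert and a simple retry loop in plain Python; it is not Databricks Workflows, Snowflake Tasks, or dbt code. The function names, the trip_id key, and the simulated failure are hypothetical.

```python
def incremental_upsert(target, batch, key="trip_id"):
    """Merge a batch into target by key; a rerun overwrites rows, never duplicates them."""
    for row in batch:
        target[row[key]] = row
    return target


def run_with_retries(job, attempts=3):
    """Call job until it succeeds or attempts run out, then re-raise the last error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except RuntimeError as error:
            last_error = error
            print(f"attempt {attempt} failed: {error}")
    raise last_error


silver_table = {}
tonight = [{"trip_id": 1, "fare": 12.0}, {"trip_id": 2, "fare": 30.5}]
calls = {"count": 0}


def nightly_job():
    """Write part of the batch, fail once, then complete on the retry."""
    calls["count"] += 1
    incremental_upsert(silver_table, tonight[:1])
    if calls["count"] == 1:
        raise RuntimeError("simulated failure at 2 a.m.")
    incremental_upsert(silver_table, tonight)
    return len(silver_table)


row_count = run_with_retries(nightly_job)
print(row_count)
```

The first attempt leaves a half-written table; because the write is keyed, the retry converges on the same two rows instead of appending duplicates.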
Module 6 — mod-06-ai.mp4
Marcus
Priya answers my questions in minutes now. But my analysts ask dozens of questions a day. Can AI help them explore data faster — without breaking what Bob built?
Elena
AI can accelerate exploration — not replace your medallion pipeline or Priya’s KPI definitions.
Priya
I still own the dashboard. AI might help James draft SQL against Silver — under human review.
James
If AI writes the query, I still validate against Gold. Garbage in, garbage out — AI doesn’t fix bad Bronze.
Narrator
Cortex, Genie, Copilot — assistants on top of a governed stack. Phase 1.5, not a shortcut past Bronze, Silver, and Gold.
Narrator
Module 6: AI features. Discuss: where is AI useful in this project? Where would you not trust it? Then try the exercises — always anchored to the pipeline you built.
Module 7 — mod-07-wrapup.mp4
Trainer note: Pause after the Write your recommendation and open-discussion cues — do not read through them; let trainees write and debate.
Narrator
Six months later. MHP’s engagement is over. YellowLine NYC runs the platform now — SQL analysts on rotation, dashboards refreshing nightly. Let’s see what stuck.
Priya
Marcus — your operations dashboard. Overview: trips and revenue by hour and day. Map: borough performance and top pickup zones. Time analysis: rush hours and heatmaps. Revenue and payments: fare breakdown and card versus cash. Efficiency: distance bands, speed, revenue per minute.
Twelve Gold KPI tables. One schema. Built by Bob across three tool paths — consumed here in Power BI.
Marcus
This is what I asked for on day one. Now I need the harder answer.
You proved they all work. What should YellowLine NYC run in production? What can my SQL team maintain? What would you choose — and why?
Elena
There isn’t one universal answer. Platform, transform layer, and dashboard are separate decisions. Today you recommend — not me.
Narrator
One NYC Taxi dataset. Three implementations. Medallion architecture throughout. Priya’s Power BI at the finish line. You built as Bob. Now think as the architect.
Take a moment. Which stack would you recommend for YellowLine NYC? One sentence why. You’ll discuss with the room next.
Compare Databricks, Snowflake, and dbt. Defend your choice. Challenge your classmates. Revisit what you designed this morning — what would you change now?
Elena
The best engineers don’t pick tools from hype. They pick from constraints, skills, and proof.
Narrator
Technology is a decision. Architecture is responsibility.
Module 8 — mod-08-streaming.mp4 (optional)
Marcus
The dashboard Bob and Priya built is perfect — for yesterday. But when a concert lets out in Brooklyn, I need to know now, not at midnight. Can we see demand as it happens?
Elena
Phase 2: streaming. Different latency, different cost, different failure modes. We only build it if the business truly needs sub-hour answers.
Narrator
Batch reads a snapshot on a schedule. Streaming processes events as they arrive — seconds to minutes of latency. Streaming adds checkpoints, watermarks, and always-on compute. Most teams still need batch. Some need both.
James
Rule of thumb: if Marcus can wait an hour, keep batch. If empty taxis in the wrong zone cost money in ten minutes, consider streaming.
Sofia
Events land on a Kafka topic — partitioned for parallel consumers. Offsets track progress. Databricks Structured Streaming reads directly. Snowflake often uses a relay to files, then Snowpipe — same medallion idea, different ingest path.
Elena
For this workshop we stream simulated user-activity events from Aiven — not live taxi GPS. The architecture patterns are the same: Bronze append, Silver clean, Gold windowed aggregates.
Bob
I’ll build streaming Bronze, Silver, and Gold on Databricks and Snowflake — and dbt dynamic tables on Snowflake.
Priya
Import mode is for batch. For streaming Gold, I use DirectQuery with one-minute page refresh — aligned to Snowflake Dynamic Table lag. When events stop, the chart flatlines. When they resume, Marcus sees it within minutes.
Narrator
Module 8: optional advanced lab. Think first: does Marcus really need streaming? Then Kafka, Structured Streaming, Dynamic Tables, and live Power BI.
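Trainer note (optional): the Gold windowed aggregates and the watermark idea can be illustrated without a streaming engine. The sketch below is a simplified plain-Python model, not Structured Streaming or Dynamic Table code. It assumes one-minute tumbling windows keyed by event time and drops any event that arrives more than a fixed delay behind the newest event seen so far. The event tuples, zone names, and the 120-second delay are made up for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60
ALLOWED_LATENESS = 120  # assumed watermark delay, in seconds


def windowed_counts(events):
    """Count events per (window start, zone) using event time.

    Events are processed in arrival order. An event older than
    (newest event time seen so far - ALLOWED_LATENESS) is dropped as too late.
    """
    counts = defaultdict(int)
    dropped = 0
    max_seen = 0
    for event_time, zone in events:
        max_seen = max(max_seen, event_time)
        if event_time < max_seen - ALLOWED_LATENESS:
            dropped += 1
            continue
        window_start = event_time - event_time % WINDOW_SECONDS
        counts[(window_start, zone)] += 1
    return dict(counts), dropped


# (event time in seconds, zone) in arrival order; the last event arrives very late.
events = [
    (5, "Brooklyn"),
    (42, "Brooklyn"),
    (61, "Brooklyn"),
    (30, "Queens"),
    (400, "Brooklyn"),
    (10, "Brooklyn"),
]

counts, dropped = windowed_counts(events)
print(counts, dropped)
```

The Queens event at second 30 arrives out of order but within the allowed delay, so it still counts; the final Brooklyn event is too far behind and is discarded. That trade-off between completeness and latency is one of the failure modes Elena refers to.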
Module 9 — mod-09-ml.mp4 (optional)
Marcus
Priya showed me when we earn — but not why some trips tip well and others don’t. Can we predict tips on card trips? Maybe adjust incentives before the evening rush?
Elena
Phase 2: machine learning. Same NYC Taxi data you already cleaned in Silver — new question, not a new dataset. Different skills, different tools, same medallion discipline.
Narrator
Data engineers own Bronze through Silver — and often the feature table. Data scientists train models. BI consumes predictions from Gold, just like KPIs. Today you play both roles.
James
The target is tip_amount on credit-card trips. But watch for leakage: if your features include the answer, the model looks brilliant and fails in production.
Sofia
Never use total_amount: it already contains the tip. Filter to credit card only; cash trips record zero tip digitally and would fool the model.
Bob
Databricks: sklearn and MLflow — full flexibility. Snowflake Cortex: ML in pure SQL for analysts. Snowpark ML: Python training without moving data out of the warehouse.
Elena
And dbt? It does not train models. It defines and tests the feature table both platforms read. Always separate features from training.
Priya
I’ll add predicted tips beside actuals on a new page — fed from a Gold scoring table Bob batch-writes after training. Same connector, new table, same governance story.
Narrator
Module 9: optional ML lab. Think first: what is leakage? Who owns features? Then train on three paths and compare RMSE, effort, and who can maintain each approach.
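Trainer note (optional): for the leakage discussion, the rules James and Sofia give can be sketched as a small feature-table builder plus an RMSE helper. This is plain Python under stated assumptions, not the sklearn, Cortex, or Snowpark ML lab code. The payment_type code for credit card, the feature columns, and the sample trips are assumptions for illustration; the "model" is only a predict-the-mean baseline.

```python
import math

CREDIT_CARD = 1  # assumed payment_type code for credit-card trips
LEAKY_COLUMNS = {"total_amount"}  # already contains the tip
TARGET = "tip_amount"

# Hypothetical Silver rows.
trips = [
    {"payment_type": 1, "fare_amount": 10.0, "trip_distance": 2.0, "total_amount": 12.0, "tip_amount": 2.0},
    {"payment_type": 1, "fare_amount": 20.0, "trip_distance": 5.0, "total_amount": 24.0, "tip_amount": 4.0},
    {"payment_type": 1, "fare_amount": 30.0, "trip_distance": 9.0, "total_amount": 36.0, "tip_amount": 6.0},
    {"payment_type": 2, "fare_amount": 15.0, "trip_distance": 3.0, "total_amount": 15.0, "tip_amount": 0.0},
]


def build_training_rows(rows):
    """Keep credit-card trips only and strip the target and leaky columns from the features."""
    training = []
    for row in rows:
        if row["payment_type"] != CREDIT_CARD:
            continue
        features = {
            name: value
            for name, value in row.items()
            if name not in LEAKY_COLUMNS and name not in (TARGET, "payment_type")
        }
        training.append((features, row[TARGET]))
    return training


def rmse(actuals, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared) / len(squared))


training = build_training_rows(trips)
tips = [tip for _, tip in training]
mean_tip = sum(tips) / len(tips)
baseline_rmse = rmse(tips, [mean_tip] * len(tips))
print(sorted(training[0][0]), round(baseline_rmse, 3))
```

Any model trained on each of the three paths should beat the mean baseline; if it beats it by a suspiciously wide margin, check the feature list for a column that contains the tip.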