Animation fallback scripts

Live narration when module MP4s are unavailable

Use when: The module MP4 (media/modules/mod-XX-*.mp4) will not play — missing file, codec issue, or projector failure.

How to deliver:

Open the matching module page and show the story callout on screen.
Read the narration below in scene order, pausing at the "Discuss" and "Write your recommendation" cues.
Continue with reflection prompts as usual.

Optional modules: 8 and 9 are not part of the main-day set unless you run advanced sessions. When running both, deliver Module 8 before Module 9.

Module 0 — `mod-00-welcome.mp4`

Marcus

I’m Marcus Chen, Operations Manager at YellowLine NYC. We run yellow taxis across all five boroughs — millions of trips every month.

But here’s my problem: everyone has a different number. Finance uses one spreadsheet. Dispatch uses another. I can’t tell you — with confidence — where we make money, or when we should send drivers where.

Should we add cars in Midtown at six p.m.? Pull them from the airport at midnight? We’re guessing. And while we guess, ride-hail apps don’t.

Narrator

YellowLine NYC doesn’t need another report. They need an analytics foundation.

Elena

Marcus asked MHP for help. I’m Elena Vasquez, Data Architect. We won’t start with tools — we start with architecture and the questions the business actually needs answered.

Bob

I’m Bob. I’ll be building the pipelines — and learning which platforms fit Marcus’s team.

Priya

I’m Priya, BI Analyst. Before Bob writes a line of code, I need to know what Marcus wants on his dashboard.

When are peak revenue hours? Which pickup zones drive volume? Which routes matter? How does revenue vary by borough? And are bad records — null fares, zero-distance trips — skewing our decisions?

James

I’ll validate the logic in SQL once the data is clean. Priya connects the Gold KPIs to Power BI.

Elena

We standardize on medallion architecture. Bronze holds raw trip data exactly as it arrives. Silver cleans, enriches, and joins zone lookups. Gold delivers analytics-ready KPI tables Priya can report from.

Narrator

Same dataset. Clear layers. One path from source to dashboard.

Elena

Then we choose how to build. Lakehouse platforms. Cloud warehouses. Transformation frameworks. Each has strengths. Bob will prototype and we’ll decide what Marcus’s team can run in production.

Bob

Three pipelines. One NYC Taxi dataset. Let’s see what works.

Narrator

Before the first notebook opens — pause and think. You are Bob today.

How would you design the solution for YellowLine NYC?

Module 1 — `mod-01-fundamentals.mp4`

Elena

Marcus has data. He doesn’t have layers. In medallion architecture, Bronze is raw ingestion — Parquet from Azure, no filters. Silver is where we enforce quality, standardize columns, and join taxi zone lookups. Gold is business-ready — aggregated KPI tables, one per question Priya needs.

Classic ETL transforms before load — great for legacy systems. Modern cloud platforms favor ELT: load raw first, transform in place. That’s our approach. Raw stays reproducible; Silver and Gold can be rebuilt anytime.

Narrator

Today you’ll work with the NYC Taxi dataset — the same public trip records used in data engineering training worldwide.

Priya

My dashboard wireframe is ready. My KPI list is ready. I’m waiting for Bob’s Gold tables. Overview charts, maps, revenue pages — they all depend on the pipeline layers Elena just described.

James

I’ll explore the schema with you before we build — Vendor ID, pickup times, fares, zones — so Silver rules match what Marcus expects.

Elena

Databricks brings Spark and lakehouse scale. Snowflake brings SQL-first analytics. dbt brings transformation as code with tests and lineage. AWS, Cloudera — options in enterprise landscapes. We evaluate against Marcus’s skills, cost, and maintainability — not hype.

Bob

I won’t pick a winner yet. First I need to understand the fundamentals we’re about to use all day.

Narrator

Module 1: fundamentals. After this video — discuss with your partner. What belongs in Silver? What belongs in Gold? Then we go deep on theory and the NYC Taxi schema.

Module 2 — `mod-02-databricks.mp4`

Bob

Elena, I’ve worked with PySpark before. Databricks is powerful and popular for lakehouse workloads. Can we prototype ingest and transforms there first?

Elena

Approved — for scale and flexibility. Sofia will pair with you. Same medallion design: Bronze, Silver, Gold. Same KPIs Priya needs.

Sofia

Read Parquet directly from Azure Data Lake Storage Gen2. Land it in Delta Bronze. Don’t skip raw: Marcus may ask us to reprocess December when January looks wrong.

Narrator

Bronze: external Parquet into Delta Lake. Silver: filter invalid fares, compute trip duration, enrich with borough and zone. Gold: twelve KPI tables — trips by hour, by day, revenue bands, efficiency metrics.
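
Trainer note: if trainees ask what the Silver and Gold steps look like in practice, the sketch below walks through the logic the narrator describes in plain Python. It is not the lab notebook, which uses Spark and Delta. The sample records, the `ZONES` lookup, the field names, and the rule that a null or non-positive fare counts as invalid are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical raw trips standing in for Bronze (kept exactly as received).
BRONZE = [
    {"pickup": "2024-01-05 18:02:00", "dropoff": "2024-01-05 18:20:00", "fare": 14.5, "zone_id": 161},
    {"pickup": "2024-01-05 18:40:00", "dropoff": "2024-01-05 18:55:00", "fare": None, "zone_id": 161},
    {"pickup": "2024-01-05 23:10:00", "dropoff": "2024-01-05 23:45:00", "fare": 52.0, "zone_id": 132},
    {"pickup": "2024-01-06 00:05:00", "dropoff": "2024-01-06 00:15:00", "fare": -3.0, "zone_id": 132},
]

# Hypothetical zone lookup: zone_id -> (borough, zone name).
ZONES = {161: ("Manhattan", "Midtown Center"), 132: ("Queens", "JFK Airport")}


def to_silver(rows):
    """Filter invalid fares, compute trip duration, enrich with borough and zone."""
    silver = []
    for row in rows:
        fare = row["fare"]
        if fare is None or fare <= 0:
            continue  # assumed rule: dropped from Silver, still available in Bronze
        pickup = datetime.strptime(row["pickup"], FMT)
        dropoff = datetime.strptime(row["dropoff"], FMT)
        borough, zone = ZONES.get(row["zone_id"], ("Unknown", "Unknown"))
        silver.append({
            "pickup": pickup,
            "fare": fare,
            "duration_min": (dropoff - pickup).total_seconds() / 60,
            "borough": borough,
            "zone": zone,
        })
    return silver


def trips_by_hour(silver_rows):
    """Gold-style KPI: number of valid trips per pickup hour."""
    return dict(Counter(row["pickup"].hour for row in silver_rows))


silver = to_silver(BRONZE)
gold = trips_by_hour(silver)
```

Because `BRONZE` is never modified, Silver and Gold can be rebuilt from it at any time, which is the point Sofia makes about not skipping raw.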

Bob

One dataset. One architecture. Databricks executes the first full path.

Priya

Bob, I just connected kpi_trips_by_hour and kpi_trips_by_day. My Overview page is alive. Marcus can finally see peak hours — but I still need maps and revenue pages from the next layers.

Narrator

Module 2: your hands-on Databricks lab. Think first: how would you ingest Parquet at scale? What breaks if you skip Bronze? Then open the notebooks and build.

Module 3 — `mod-03-snowflake.mp4`

Marcus

Bob, this works. I believe the numbers. But my team lives in SQL. They’re not going to maintain PySpark notebooks after MHP leaves. Is there a path with less programming — something they can extend?

Elena

Fair constraint. Bob — rebuild the same medallion pipeline on Snowflake. SQL-first. Snowpark optional for engineers who want Python.

Narrator

Identical business logic. Identical Gold table names. Different platform: an external stage on ADLS Gen2, COPY INTO Bronze, SQL transforms for Silver and Gold. Same architecture. Different implementation philosophy.

Bob

Marcus’s analysts can read every transform in a SQL file. That’s the maintainability test.

Sofia

Don’t reinvent the rules — port the Silver filters from PySpark line by line. Consistency matters for Priya’s dashboard.

Priya

I pointed the same Power BI report at Snowflake Gold. kpi_borough_analysis, kpi_top_pickup_zones — the Map page works unchanged. Identical KPIs. Different engine. That’s the point.

Narrator

Module 3: Snowflake lab. Discuss: who maintains this after MHP? How do you load Parquet without Spark? Then build Bronze → Silver → Gold in SQL.

Module 4 — `mod-04-dbt.mp4`

Marcus

Priya’s dashboard is exactly what I wanted. But my board asked a harder question: Where does each number come from? We need documentation. Data lineage. And tests — so an analyst’s Friday edit doesn’t break Monday’s revenue tile silently.

Elena

We keep Snowflake as the engine. We add dbt for transformations — SQL models, automated tests, and generated docs. Bob, dbt is not a replacement warehouse. It’s how we govern the transform layer.

Bob

Every Gold model references Silver with ref(). Tests catch null keys and negative fares. dbt docs generate shows Marcus’s auditors the full chain — source Parquet to dashboard tile.
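
Trainer note: dbt tests are written in SQL and YAML, and a test passes when its query returns no failing rows. The Python below is only an analogy for that idea, in case trainees want to see the logic outside dbt. The column names `trip_id` and `fare_amount`, the sample rows, and the helper names are hypothetical.

```python
# Hypothetical Silver rows; one has a null key, one has a negative fare.
SILVER_TRIPS = [
    {"trip_id": "t1", "fare_amount": 12.0},
    {"trip_id": None, "fare_amount": 8.5},
    {"trip_id": "t3", "fare_amount": -4.0},
]


def failing_not_null(rows, column):
    """Return the rows that would fail a not-null check on `column`."""
    return [row for row in rows if row[column] is None]


def failing_non_negative(rows, column):
    """Return the rows whose `column` value is present and below zero."""
    return [row for row in rows if row[column] is not None and row[column] < 0]


def run_tests(rows):
    """Map each test name to its count of failing rows (0 means the test passes)."""
    failures = {
        "not_null_trip_id": failing_not_null(rows, "trip_id"),
        "non_negative_fare_amount": failing_non_negative(rows, "fare_amount"),
    }
    return {name: len(bad_rows) for name, bad_rows in failures.items()}


report = run_tests(SILVER_TRIPS)
all_passed = all(count == 0 for count in report.values())
```

Here both tests report one failing row, so `all_passed` is false and a scheduled run would stop before the bad rows reach a Gold table.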

Narrator

dbt runs on Snowflake. Same KPIs. Added discipline.

Priya

Revenue by hour, payment type breakdown — connected. And kpi_data_quality_metrics feeds my quality scorecard. When Marcus asks “can we trust this data?”, I show him the number and the lineage behind it.

Elena

The same dbt models can target Databricks or Snowflake. Today we focus on Snowflake. The pattern travels with you.

Narrator

Module 4: dbt lab. Discuss: how do you prove where a KPI comes from? What tests would you add? Then run models, tests, and generate the docs Marcus’s board wants.

Module 5 — `mod-05-production.mp4`

Elena

The pipeline works when you’re in the room. Marcus needs it to work when you’re not. What runs every night? What happens when Silver fails at two a.m.? Who gets paged?

Marcus

I don’t need heroes. I need schedules, retries, and alerts.

Narrator

Databricks Workflows and Delta Live Tables. Snowflake Tasks and Streams. dbt in CI/CD with pull request checks. Incremental loads instead of full rewrites. Idempotent jobs — safe to rerun.
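
Trainer note: "incremental" and "idempotent" are easy to say and harder to picture. The following is a minimal sketch, assuming each row carries a unique `trip_id` and a `loaded_at` counter; both names, and the in-memory `target` dictionary standing in for a table, are hypothetical. It illustrates the pattern, not how any of the platforms implements it.

```python
def incremental_upsert(target, source_rows, watermark):
    """Merge rows newer than `watermark` into `target`, keyed by trip_id.

    Returns the new watermark. Because rows are keyed, replaying the same
    batch overwrites identical rows instead of duplicating them.
    """
    new_watermark = watermark
    for row in source_rows:
        if row["loaded_at"] <= watermark:
            continue  # already processed in an earlier run
        target[row["trip_id"]] = row
        new_watermark = max(new_watermark, row["loaded_at"])
    return new_watermark


SOURCE = [
    {"trip_id": "t1", "loaded_at": 1, "fare": 10.0},
    {"trip_id": "t2", "loaded_at": 2, "fare": 20.0},
    {"trip_id": "t3", "loaded_at": 3, "fare": 30.0},
]

# The target already holds t1 from a previous run (watermark 1).
target = {"t1": SOURCE[0]}

first_run = incremental_upsert(target, SOURCE, watermark=1)
snapshot = dict(target)

# Simulated retry: the job failed before saving its watermark and runs again.
retry_run = incremental_upsert(target, SOURCE, watermark=1)
```

The retry leaves `target` unchanged, which is what "safe to rerun" means for a nightly job that gets restarted at two a.m.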

Bob

Production isn’t more code. It’s the same code — with reliability wrapped around it.

Narrator

Module 5: production patterns. Discuss: what breaks in production that never breaks in class? How would you schedule tonight’s run on each platform?

Module 6 — `mod-06-ai.mp4`

Marcus

Priya answers my questions in minutes now. But my analysts ask dozens of questions a day. Can AI help them explore data faster — without breaking what Bob built?

Elena

AI can accelerate exploration — not replace your medallion pipeline or Priya’s KPI definitions.

Priya

I still own the dashboard. AI might help James draft SQL against Silver — under human review.

James

If AI writes the query, I still validate against Gold. Garbage in, garbage out — AI doesn’t fix bad Bronze.

Narrator

Cortex, Genie, Copilot — assistants on top of a governed stack. Phase 1.5, not a shortcut past Bronze, Silver, and Gold.

Narrator

Module 6: AI features. Discuss: where is AI useful in this project? Where would you not trust it? Then try the exercises — always anchored to the pipeline you built.

Module 7 — `mod-07-wrapup.mp4`

Trainer note: Pause after the "Write your recommendation" and open-discussion cues. Do not read straight through them; let trainees write and debate.

Narrator

Six months later. MHP’s engagement is over. YellowLine NYC runs the platform now — SQL analysts on rotation, dashboards refreshing nightly. Let’s see what stuck.

Priya

Marcus — your operations dashboard. Overview: trips and revenue by hour and day. Map: borough performance and top pickup zones. Time analysis: rush hours and heatmaps. Revenue and payments: fare breakdown and card versus cash. Efficiency: distance bands, speed, revenue per minute.

Twelve Gold KPI tables. One schema. Built by Bob across three tool paths — consumed here in Power BI.

Marcus

This is what I asked for on day one. Now I need the harder answer.

You proved they all work. What should YellowLine NYC run in production? What can my SQL team maintain? What would you choose — and why?

Elena

There isn’t one universal answer. Platform, transform layer, and dashboard are separate decisions. Today you recommend — not me.

Narrator

One NYC Taxi dataset. Three implementations. Medallion architecture throughout. Priya’s Power BI at the finish line. You built as Bob. Now think as the architect.

Take a moment. Which stack would you recommend for YellowLine NYC? One sentence why. You’ll discuss with the room next.

Compare Databricks, Snowflake, and dbt. Defend your choice. Challenge your classmates. Revisit what you designed this morning — what would you change now?

Elena

The best engineers don’t pick tools from hype. They pick from constraints, skills, and proof.

Narrator

Technology is a decision. Architecture is responsibility.

Module 8 — `mod-08-streaming.mp4` (optional)

Marcus

The dashboard Bob and Priya built is perfect — for yesterday. But when a concert lets out in Brooklyn, I need to know now, not at midnight. Can we see demand as it happens?

Elena

Phase 2: streaming. Different latency, different cost, different failure modes. We only build it if the business truly needs sub-hour answers.

Narrator

Batch reads a snapshot on a schedule. Streaming processes events as they arrive — seconds to minutes of latency. Streaming adds checkpoints, watermarks, and always-on compute. Most teams still need batch. Some need both.
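
Trainer note: watermarks are often the hardest streaming idea to explain without a picture. The sketch below, in plain Python, counts events in fixed one-minute windows and drops events that arrive too late. The window size, the allowed lateness, and the sample event times are assumptions, and real engines handle this with more nuance than shown here.

```python
from collections import defaultdict


def windowed_counts(event_times, window_s=60, allowed_lateness_s=30):
    """Count events per tumbling window, dropping events older than the watermark.

    `event_times` are event timestamps in seconds, listed in arrival order.
    The watermark trails the largest event time seen so far by the allowed lateness.
    """
    counts = defaultdict(int)
    max_event_time = 0
    dropped = 0
    for event_time in event_times:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness_s
        if event_time < watermark:
            dropped += 1  # too late: its window is considered closed
            continue
        window_start = (event_time // window_s) * window_s
        counts[window_start] += 1
    return dict(counts), dropped


# Arrival order differs from event order: 50 and 70 arrive late.
counts, dropped = windowed_counts([5, 20, 65, 50, 130, 70])
```

The event at 50 seconds arrives late but inside the allowed lateness, so it still counts toward the first window. The event at 70 seconds arrives after the watermark has moved to 100, so it is dropped.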

James

Rule of thumb: if Marcus can wait an hour, keep batch. If empty taxis in the wrong zone cost money within ten minutes, consider streaming.

Sofia

Events land on a Kafka topic — partitioned for parallel consumers. Offsets track progress. Databricks Structured Streaming reads directly. Snowflake often uses a relay to files, then Snowpipe — same medallion idea, different ingest path.

Elena

For this workshop we stream simulated user-activity events from Aiven — not live taxi GPS. The architecture patterns are the same: Bronze append, Silver clean, Gold windowed aggregates.

Bob

I’ll build streaming Bronze, Silver, and Gold on Databricks and Snowflake, plus Snowflake Dynamic Tables defined as dbt models.

Priya

Import mode is for batch. For streaming Gold, I use DirectQuery with one-minute page refresh — aligned to Snowflake Dynamic Table lag. When events stop, the chart flatlines. When they resume, Marcus sees it within minutes.

Narrator

Module 8: optional advanced lab. Think first: does Marcus really need streaming? Then Kafka, Structured Streaming, Dynamic Tables, and live Power BI.

Module 9 — `mod-09-ml.mp4` (optional)

Marcus

Priya showed me when we earn — but not why some trips tip well and others don’t. Can we predict tips on card trips? Maybe adjust incentives before the evening rush?

Elena

Phase 2: machine learning. Same NYC Taxi data you already cleaned in Silver — new question, not a new dataset. Different skills, different tools, same medallion discipline.

Narrator

Data engineers own Bronze through Silver — and often the feature table. Data scientists train models. BI consumes predictions from Gold, just like KPIs. Today you play both roles.

James

The target is tip_amount on credit-card trips. But watch leakage — if your features include the answer, the model looks brilliant and fails in production.

Sofia

Never use total_amount as a feature: it already contains the tip. Filter to credit-card trips only. Cash tips are not captured in the data, so cash trips show a zero tip and would fool the model.
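
Trainer note: the two rules Sofia states can be shown in a few lines of plain Python before trainees open the lab. This is a sketch under assumptions: the sample trips are invented, the column names follow the ones mentioned in the narration plus a few illustrative extras, and payment type code 1 is assumed to mean credit card.

```python
CREDIT_CARD = 1  # assumption: payment_type code for credit-card trips

# Columns that must never become features:
# tip_amount is the target, total_amount already includes the tip,
# and payment_type is constant once cash trips are filtered out.
EXCLUDED_COLUMNS = {"tip_amount", "total_amount", "payment_type"}

TRIPS = [
    {"trip_distance": 2.1, "fare_amount": 11.0, "tip_amount": 2.5, "total_amount": 14.0, "payment_type": 1},
    {"trip_distance": 5.0, "fare_amount": 20.0, "tip_amount": 0.0, "total_amount": 20.5, "payment_type": 2},
    {"trip_distance": 9.3, "fare_amount": 31.0, "tip_amount": 6.0, "total_amount": 37.5, "payment_type": 1},
]


def build_training_rows(trips):
    """Keep credit-card trips only and split them into leak-free features and targets."""
    features, targets = [], []
    for trip in trips:
        if trip["payment_type"] != CREDIT_CARD:
            continue  # cash trip: recorded tip is not trustworthy
        features.append({k: v for k, v in trip.items() if k not in EXCLUDED_COLUMNS})
        targets.append(trip["tip_amount"])
    return features, targets


features, targets = build_training_rows(TRIPS)
```

A quick check for trainees: if any feature row still contains `total_amount`, the model can recover the tip by subtraction and will look far better in class than in production.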

Bob

Databricks: sklearn and MLflow — full flexibility. Snowflake Cortex: ML in pure SQL for analysts. Snowpark ML: Python training without moving data out of the warehouse.

Elena

And dbt? It does not train models. It defines and tests the feature table both platforms read. Always separate features from training.

Priya

I’ll add predicted tips beside actuals on a new page — fed from a Gold scoring table Bob batch-writes after training. Same connector, new table, same governance story.

Narrator

Module 9: optional ML lab. Think first: what is leakage? Who owns features? Then train on three paths and compare RMSE, effort, and who can maintain each approach.
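
Trainer note: if the lab is unavailable and trainees ask how RMSE is computed, it is the square root of the mean squared difference between actual and predicted values. A minimal sketch follows; the function name and the sample numbers are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented tips in dollars: actual versus one model's predictions.
score = rmse([2.0, 4.0, 6.0], [2.0, 5.0, 4.0])
```

Lower is better, and the result is in the same unit as the target, so an RMSE on `tip_amount` reads directly as dollars of typical error.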
