Reflection Prompts

Facilitator guide for Think & Discuss (oral discussion) in every module

Day-of rhythm: Animation → Think & Discuss (this doc) → Theory → Quiz (Google Form) → Practice

Related docs:

facilitator-guide · pre-class-checklist · module-delivery-pattern.qmd
Live form URLs — share with class during delivery
animation-fallback.qmd — live narration if module MP4 is unavailable
open-discussion-guide.qmd — Module 7 extended discussion

How to Facilitate Think & Discuss

Play the module animation — do not skip; trainees need shared context.
State the situation in one sentence — use the Situation recap below.
Ask the questions in order — start with situation, then challenge, then action.
Capture on whiteboard / Miro — 3–5 bullet points only; use trainee words.
Let trainees answer first — use the Trainer answers below only to steer, unblock silence, or validate during Theory.
Bridge to theory — say: “You mentioned [X]. Let’s see how the industry handles that.”

Note: Think & Discuss is oral only — no Google Form at this stage. The Quiz (Google Form) comes after Theory.

Timing

Module	Target duration	Format
Story	12 min	Individual sketch → pairs → whole group
1	8 min	Pairs → share 2 answers
2–6	7 min	Pairs → share 1 answer per question
7	5 min	Silent write → open discussion follows

Trainee role reminder (say once at start of day):

“Today you are Bob, the junior data engineer. Elena designed the architecture. Your job is to build, evaluate, and eventually recommend tools for Marcus.”

After Theory, share the matching Google Form quiz from google-forms-links.md (~2–3 min, mostly multiple choice, auto-scored).

Story — Welcome & Setup

Google Form: Submit reflection

Video: media/modules/mod-00-welcome.mp4
Duration: 12 minutes

Situation recap

YellowLine NYC has millions of taxi trips but no analytics platform. Marcus hired MHP to help optimize revenue and fleet operations. The MHP team has assembled; Priya needs KPIs; Elena wants medallion architecture — but no solution is built yet.

Questions (design worksheet)

Ask trainees to write or sketch answers before discussing:

Situation — What is Marcus’s biggest problem in your own words?
Data — Where does the trip data live today? How often should it be refreshed?
Architecture — If you had to split the data into layers, how many would you use and what would each layer contain?
Tools — Name two tools you would consider for the pipeline. Why those two?
Consumption — Priya needs a dashboard. What must exist in the pipeline before she can build it?
Risks — What could go wrong (data quality, cost, skills, maintenance)?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2). Do not read aloud before trainees discuss.

#	Question	Expected themes
1	Marcus’s biggest problem	No analytics platform — fleet and revenue decisions are blind; millions of trips in files but no KPIs, dashboards, or trusted numbers for operations.
2	Data location & refresh	Today: Parquet + CSV on ADLS2 (`mhpdeworkshopsa` / `nyc-taxi-data`). Workshop uses batch refresh (nightly or on-demand); streaming is Phase 2 (Module 8).
3	Architecture layers	Three layers (medallion): Bronze = raw copy for audit/replay; Silver = cleaned, enriched trip rows; Gold = 12 pre-aggregated KPI tables for Priya. Accept “raw / trusted / consumption” if semantics match.
4	Tools	Any reasonable pair with a criterion — e.g. Databricks (scale, Spark ingest) + Snowflake (SQL ops), or Snowflake + dbt (transform-as-code). No single winner yet.
5	Before Priya’s dashboard	Gold KPI tables must exist (same schema all pipelines); Silver quality must be trustworthy; Bronze ingest must have run. Power BI connects to Gold, not raw files.
6	Risks	Bad data silently breaking KPIs; cost of cloud compute; team lacks PySpark skills; no lineage for auditors; skipping Bronze makes pipelines irreproducible.

Pair prompt

Compare your architecture sketch with your partner. Where do you agree? Where do you disagree?

Whole-group capture (whiteboard)

Column	Fill from trainees
Problems Marcus faces
Layers / zones proposed
Tools mentioned
Risks mentioned

Save this whiteboard — revisit in Module 7.

Bridge to theory

“You just did what consultants do on day one. Next we set up the environment, then Elena will formalize the architecture you sketched.”

Module 1 — Data Engineering Fundamentals

Google Form: Submit reflection

Video: media/modules/mod-01-fundamentals.mp4
Duration: 8 minutes

Situation recap

Elena has proposed medallion architecture: Bronze, Silver, Gold. Priya has listed the business questions but her Power BI dashboard is still empty. The team must agree what belongs in each layer before anyone writes code.

Questions

Situation — Why does Elena want three layers instead of one big table Marcus can report from?
Challenge — What belongs in Silver vs Gold? Give one example column or metric for each.
Action — Priya asks: “When are our peak revenue hours?” Which layer would you query to answer that — Bronze, Silver, or Gold? Why?
ETL vs ELT — Would you transform trip data before or after loading into the platform? What is one reason for your choice?
Tools — Databricks, Snowflake, AWS, Cloudera appear on the board. What is one criterion you would use to compare them — without picking a winner yet?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2).

#	Question	Answer
1	Why three layers?	Separation of concerns: Bronze preserves raw source for replay/audit; Silver is the quality gate; Gold is BI-ready aggregates. One big table mixes dirty raw data with KPIs — errors compound and dashboards slow down.
2	Silver vs Gold examples	Silver: `silver_nyc_taxi_enriched` — one row per trip, cleaned fares, zone names joined, `trip_duration_minutes`. Gold: `kpi_trips_by_hour` — aggregated trip counts/revenue by hour for fast dashboard queries.
3	Peak revenue hours — which layer?	Gold — query `kpi_revenue_by_hour` or `kpi_trips_by_hour`. Silver has row-level trips (works but wrong contract); Bronze is never queried by analysts.
4	ETL vs ELT	ELT for this workshop — load raw Parquet to Bronze first, transform in-place on Databricks/Snowflake. Reason: cheap object storage + elastic compute; iterate on SQL/PySpark without re-extracting from source.
5	Tool comparison criterion	Examples: team skills (SQL vs PySpark), governance (Unity Catalog / Horizon), ingest pattern (Spark vs `COPY INTO`), total cost, time to first dashboard.
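
If trainees struggle to separate Silver from Gold, the difference in grain can be sketched in a few lines. The following is a minimal plain-Python illustration (the workshop itself uses PySpark or SQL); the column names `pickup_datetime`, `total_amount`, `pickup_hour`, `trip_count`, and `total_revenue`, and the sample rows, are assumptions for illustration and not the workshop schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical Silver rows: one cleaned record per trip (sample values are made up).
silver_trips = [
    {"pickup_datetime": datetime(2024, 1, 5, 8, 15), "total_amount": 18.50},
    {"pickup_datetime": datetime(2024, 1, 5, 8, 40), "total_amount": 22.00},
    {"pickup_datetime": datetime(2024, 1, 5, 17, 5), "total_amount": 35.25},
]

def build_kpi_trips_by_hour(trips):
    """Aggregate row-level Silver trips into one Gold row per pickup hour."""
    totals = defaultdict(lambda: {"trip_count": 0, "total_revenue": 0.0})
    for trip in trips:
        hour = trip["pickup_datetime"].hour
        totals[hour]["trip_count"] += 1
        totals[hour]["total_revenue"] += trip["total_amount"]
    # Gold grain: one row per hour, sorted for stable dashboard output.
    return [
        {"pickup_hour": hour, **metrics}
        for hour, metrics in sorted(totals.items())
    ]

kpi_trips_by_hour = build_kpi_trips_by_hour(silver_trips)
```

Three Silver rows collapse into two Gold rows here, which is the point to draw out: Silver keeps trip grain, Gold keeps only what the dashboard needs.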

Whiteboard capture

Silver = (trainee definition)
Gold = (trainee definition)
ELT or ETL = (majority vote + one reason)

Bridge to theory

“Let’s define ETL vs ELT precisely and walk through the NYC Taxi schema Priya and James will use.”

Module 2 — Databricks Pipeline

Google Form: Submit reflection

Video: media/modules/mod-02-databricks.mp4
Duration: 7 minutes

Situation recap

Elena approved Bob’s request to prototype on Databricks. Sofia will mentor. Raw Parquet files sit in Azure ADLS2. Priya is waiting for Gold tables to populate her dashboard Overview page.

Trainer UI readiness

Before starting reflection, confirm attendees see:
- Databricks workspace open in browser with left sidebar visible
- At least one cluster in Terminated state (they will start it during Practice)
- Notebooks 01_bronze_ingestion.py, 02_silver_cleaning.py, 03_gold_kpis.py visible in Workspace → Shared folder

Questions

Situation — Why might MHP start with Databricks for this use case instead of Excel or a single SQL script?
Challenge — How would you ingest Parquet from ADLS2 into a Bronze layer? What tool or command family would you expect to use?
Action — What could go wrong in Bronze if you skip data quality checks and jump straight to KPIs?
Silver — Name two transformations that belong in Silver (not Bronze, not Gold).
Priya — After Gold is built, which two KPI tables would unlock Priya’s trips-by-hour and day-of-week charts?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2).

#	Question	Answer
1	Why Databricks vs Excel/script?	Scale + lakehouse: millions of Parquet rows, distributed `spark.read`, Delta ACID tables, Unity Catalog governance. Excel/SQL script breaks on volume and has no medallion ops story.
2	Ingest Parquet to Bronze	`spark.read.parquet("abfss://nyc-taxi-data@.../raw/trips/")` → write Delta with `saveAsTable` to `{catalog}.de_01_alice_bronze.nyc_taxi_trips` (pattern: `{attendee_id}_bronze`). ADLS2 storage account access key configured in `00_setup.py`.
3	Skip quality → jump to KPIs	Silent wrong numbers: null/zero fares, duplicate trips, cross-month outliers, bad joins — Gold looks plausible but Priya’s dashboard is wrong. Classic “skip Silver” pitfall.
4	Silver transforms (two)	Filter invalid trips (`trip_distance > 0`, valid datetimes); join `taxi_zone_lookup` for borough names; standardize column names; compute `trip_duration_minutes`, fare per mile.
5	Gold tables for Priya	`kpi_trips_by_hour` (Overview time chart) and `kpi_trips_by_day` or `kpi_time_of_day_analysis` (day-of-week / time-of-day). Accept `kpi_revenue_by_hour` if they tie it to revenue view.
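
The Silver transforms in answer 4 can be sketched without a cluster. This is a minimal plain-Python illustration, assuming hypothetical column names (`pickup_zone_id`, `pickup_borough`, `fare_per_mile`) and made-up rows; the notebooks in the workshop would express the same steps in PySpark.

```python
from datetime import datetime

# Hypothetical zone lookup (stand-in for taxi_zone_lookup); values are illustrative.
zone_lookup = {1: "Manhattan", 2: "Queens"}

def to_silver(bronze_rows, zones):
    """Filter invalid trips, join borough names, and derive Silver columns."""
    silver = []
    for row in bronze_rows:
        pickup, dropoff = row["pickup_datetime"], row["dropoff_datetime"]
        # Quality filter: positive distance and a dropoff strictly after pickup.
        if row["trip_distance"] <= 0 or dropoff <= pickup:
            continue
        duration_min = (dropoff - pickup).total_seconds() / 60
        silver.append({
            **row,
            "pickup_borough": zones.get(row["pickup_zone_id"], "Unknown"),
            "trip_duration_minutes": duration_min,
            "fare_per_mile": row["fare_amount"] / row["trip_distance"],
        })
    return silver

bronze = [
    {"pickup_datetime": datetime(2024, 1, 5, 8, 0),
     "dropoff_datetime": datetime(2024, 1, 5, 8, 30),
     "trip_distance": 5.0, "fare_amount": 20.0, "pickup_zone_id": 1},
    {"pickup_datetime": datetime(2024, 1, 5, 9, 0),
     "dropoff_datetime": datetime(2024, 1, 5, 9, 10),
     "trip_distance": 0.0, "fare_amount": 7.5, "pickup_zone_id": 2},  # dropped: zero distance
]
silver = to_silver(bronze, zone_lookup)
```

Filtering before dividing also protects `fare_per_mile` from a divide-by-zero, a useful prompt for why quality checks belong ahead of derived metrics.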

Whiteboard capture

Bronze ingest approach =
Silver transforms =
Gold tables for Priya =

Bridge to theory

“You mentioned [ingest approach]. Let’s look at Unity Catalog, Delta Lake, and the notebook flow ADLS2 → Bronze → Silver → Gold.”

Module 3 — Snowflake Pipeline

Google Form: Submit reflection

Video: media/modules/mod-03-snowflake.mp4
Duration: 7 minutes

Situation recap

Marcus reviewed the Databricks prototype and said: “My team lives in SQL. They won’t maintain PySpark notebooks.” Elena asked Bob to rebuild the same medallion design on Snowflake. Priya needs to connect her Map page to Gold — same KPIs, different engine.

Trainer UI readiness

Module 3 is the first Snowflake module: attendees create their own trial accounts and run setup during this module. Before starting reflection:
- Confirm each attendee has completed Snowflake trial signup (or started the process)
- Distribute ADLS2 SAS tokens needed for External Stage setup
- After 02_account_setup.sql runs: each attendee should have DE_MASTERCLASS database, DE_WORKSHOP_WH warehouse, and DE_WORKSHOP_ROLE role
- Remind attendees to run 05_cortex_access.sql on their trial after account setup (Cortex for Modules 6 and 9; the facilitator cannot grant roles on attendee accounts)
- Attendee switches to DE_WORKSHOP_ROLE in the Snowsight role selector (top-right) before starting exercises

Questions

Situation — What is Marcus really asking for — a different tool, a different skill set, or both?
Challenge — How do you keep the same 12 KPIs without PySpark? What would you rewrite?
Action — Who at YellowLine NYC will maintain this pipeline after MHP leaves? What skills do they need?
Ingest — Snowflake cannot run Spark notebooks. How would you load Parquet from ADLS2 into Snowflake Bronze?
Priya — If Gold schema stays identical, can Priya keep the same Power BI reports? Why or why not?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2).

#	Question	Answer
1	Marcus asking for tool or skills?	Both — he wants a SQL-first platform his team can maintain after MHP leaves, not PySpark notebooks only a consultant understands.
2	Same 12 KPIs without PySpark	Rewrite transforms as Workspaces SQL files or Snowpark stored procedures — same Silver logic (filters, joins, aggregates), different language/runtime. Gold table names and columns stay identical.
3	Who maintains after MHP?	YellowLine NYC internal team — SQL analysts / analytics engineers. They need Workspaces SQL files, stages, tasks — not Databricks cluster admin.
4	Load Parquet without Spark	External stage on ADLS2 (SAS URL) + `COPY INTO` `bronze.nyc_taxi_trips` from `@stage/path`. Two-step but familiar to SQL teams.
5	Same Power BI reports?	Yes, mostly: same Gold schema (`kpi_` tables, same columns) means same DAX and visuals. Only the connector changes (Snowflake vs Databricks SQL warehouse/connector). Semantic model may need a reconnect, not a rebuild.
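
For answer 4, it can help to show what the second step of the stage + `COPY INTO` pattern looks like. The sketch below only builds the statement as a string. The stage name `my_adls_stage` is a placeholder, and the choice of `MATCH_BY_COLUMN_NAME` is an assumption about how the Bronze table is defined, not a statement of what the workshop SQL files contain.

```python
def build_copy_into(table: str, stage: str, path: str) -> str:
    """Return a COPY INTO statement that loads Parquet files from an external stage.

    Identifiers are interpolated as-is, so pass trusted values only.
    """
    return (
        f"COPY INTO {table}\n"
        f"FROM @{stage}/{path}\n"
        "FILE_FORMAT = (TYPE = PARQUET)\n"
        # Map Parquet columns to table columns by name rather than by position.
        "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;"
    )

# Stage name is a placeholder, not the workshop's real value.
sql = build_copy_into("bronze.nyc_taxi_trips", "my_adls_stage", "raw/trips/")
print(sql)
```

The first step, creating the external stage with the SAS token, is deliberately left out because its URL and credentials are specific to each attendee.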

Whiteboard capture

Marcus’s constraint =
Maintainability owner =
Snowflake ingest idea =

Bridge to theory

“Let’s compare external stages, Workspaces SQL files, and Snowpark — and see how Snowflake implements the same Silver logic you built in PySpark.”

Module 4 — dbt Pipeline

Google Form: Submit reflection

Video: media/modules/mod-04-dbt.mp4
Duration: 7 minutes

Situation recap

Marcus’s board asked: “Where does each dashboard number come from?” Elena added dbt on top of Snowflake for SQL transformations, tests, and lineage. Priya connects revenue and payment pages — and wants a data quality scorecard.

Trainer UI readiness

Before starting reflection, confirm attendees see:
- Terminal open in Codespaces or Docker container
- dbt_project/ directory accessible; profiles.yml configured
- dbt --version shows dbt-core 1.11.8, dbt-databricks 1.12+, dbt-snowflake 1.11.5+ (fork pyproject.toml)

Questions

Situation — What problem does Marcus have that Snowflake SQL alone did not fully solve?
Challenge — Is dbt a replacement for Snowflake? If not, what is dbt’s job in one sentence?
Action — Marcus clicks a revenue tile in Power BI. How would you prove which model and source table that number came from?
Tests — Name two tests you would add so bad data never silently breaks a KPI.
Priya — Which Gold KPI table would feed a data quality scorecard? What would it measure?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2).

#	Question	Answer
1	Problem Snowflake SQL alone didn’t solve	Governance for the board — no built-in lineage from KPI tile → model → source; tests not version-controlled; transform SQL scattered across SQL files.
2	Is dbt a Snowflake replacement?	No. One sentence: dbt compiles and runs SQL transformations on top of Snowflake (or Databricks); it owns models, tests, and docs, not storage or ingest.
3	Prove revenue tile lineage	`dbt docs generate` lineage graph; trace `ref()` chain from Gold mart → Silver staging → Bronze source; match table name in Power BI to dbt model `kpi_revenue_by_hour` (or equivalent). Manifest + `exposures` for BI links.
4	Two tests	`not_null` on `pickup_datetime` / `trip_id`; `accepted_values` on `payment_type`; `unique` on grain keys; `relationships` test to zone lookup. Any two that catch silent KPI breakage.
5	Data quality scorecard table	`kpi_data_quality_metrics` — long-format metrics: `records_removed`, `retention_rate_pct`, `data_quality_score`, null-zone rates. Priya displays Overview cards; James validates thresholds in SQL.
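
The lineage trace in answer 3 is a walk over a dependency graph. As a rough sketch, assuming the `parent_map` section of dbt's `manifest.json` artifact (node id mapped to its direct parents), the upstream chain of a Gold model could be listed like this. The project name `workshop` and the node ids are invented for illustration.

```python
def trace_lineage(parent_map, node_id):
    """Return every upstream node of node_id, nearest parents first (breadth-first)."""
    seen, queue, order = {node_id}, [node_id], []
    while queue:
        current = queue.pop(0)
        for parent in parent_map.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

# Hypothetical excerpt shaped like manifest.json["parent_map"]; ids are invented.
parent_map = {
    "model.workshop.kpi_revenue_by_hour": ["model.workshop.silver_nyc_taxi_enriched"],
    "model.workshop.silver_nyc_taxi_enriched": ["source.workshop.bronze.nyc_taxi_trips"],
    "source.workshop.bronze.nyc_taxi_trips": [],
}
lineage = trace_lineage(parent_map, "model.workshop.kpi_revenue_by_hour")
```

In a real project the dictionary would be loaded from the manifest that dbt writes to its target directory; the lineage graph in the generated docs shows the same chain visually.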

Whiteboard capture

dbt’s role = (layer on warehouse, not replacement)
Lineage proof =
Tests to add =

Bridge to theory

“dbt models, ref(), tests, and dbt docs generate — let’s see the lineage graph Marcus’s auditors want.”

Module 5 — Production Patterns

Google Form: Submit reflection

Video: media/modules/mod-05-production.mp4
Duration: 7 minutes

Situation recap

The pipeline works in a notebook and in a manual dbt run. Elena asks: “What executes every night when we are not in the room?” YellowLine NYC needs scheduled, reliable, monitored jobs.

Trainer UI readiness

Before starting reflection, confirm attendees see:
- Databricks Workflows page accessible (left sidebar → Workflows icon)
- Snowflake Workspaces SQL file open with the Task creation SQL ready to paste
- dbt project with production/ directory containing CI config files

Questions

Situation — What is different about production compared to the lab environment you used this morning?
Challenge — What breaks in production that almost never breaks in a classroom exercise?
Action — How would you schedule the nightly Bronze → Silver → Gold run on Databricks? On Snowflake? On dbt?
Failure — If Silver fails at 2 a.m., who should know? What should happen to Gold KPIs?
Change control — An analyst edits a SQL model on Friday afternoon. What process stops that change from breaking Monday’s dashboard?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2).

#	Question	Answer
1	Production vs lab	Lab = manual notebook runs, interactive debug, no alerts. Production = scheduled jobs, retries, monitoring, secrets in vaults, CI/CD, access control — runs when nobody is in the room.
2	What breaks in production	Silent failures at 2 a.m., partial loads, schema drift, credential expiry, cluster/warehouse auto-stop, upstream ADLS2 file missing, dbt test failures blocking deploy.
3	Schedule nightly medallion	Databricks: Lakeflow Workflows job (setup → bronze → silver → gold). Snowflake: Tasks chain on SQL files/procedures. dbt: `dbt build` via GitHub Actions on cron (Core CLI) or an orchestrator; dbt Cloud optional.
4	Silver fails at 2 a.m.	On-call data engineer alerted (email/Slack/PagerDuty); Gold job must not run (dependency failure); dashboard shows stale data or freshness flag; incident ticket opened.
5	Friday analyst edit	Git PR + CI running `dbt test` / SQL lint; code review; deploy only after green pipeline; no direct prod SQL file edits. Asset bundles for Databricks; tagged releases for Snowflake tasks.
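
The rule in answer 4 (Gold must not run when Silver fails) is what every orchestrator in answer 3 enforces. A minimal sketch of that behavior in plain Python follows; the step functions, the error message, and the `alert` callback are hypothetical stand-ins for real jobs and a real notification channel.

```python
def run_pipeline(steps, alert):
    """Run steps in order; on the first failure, alert and skip everything downstream."""
    status = {}
    failed = False
    for name, step in steps:
        if failed:
            status[name] = "skipped"  # dependency failed, so do not touch this layer
            continue
        try:
            step()
            status[name] = "succeeded"
        except Exception as exc:
            failed = True
            status[name] = "failed"
            alert(f"{name} failed: {exc}")
    return status

def silver_step():
    # Simulated 2 a.m. failure (hypothetical error message).
    raise ValueError("schema drift: column fare_amount missing")

alerts = []
status = run_pipeline(
    [("bronze", lambda: None), ("silver", silver_step), ("gold", lambda: None)],
    alert=alerts.append,
)
```

Because Gold is skipped rather than rebuilt from a broken Silver, the dashboard keeps yesterday's numbers, which is the "stale but correct" outcome the answer describes.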

Whiteboard capture

Production vs lab =
Scheduling tool (per platform) =
On failure =

Bridge to theory

“Workflows, Tasks, incremental loads, CI/CD for dbt — the go-live checklist Elena uses with clients.”

Module 6 — AI Features

Google Form: Submit reflection

Video: media/modules/mod-06-ai.mp4
Duration: 7 minutes

Situation recap

Marcus saw Priya’s dashboard and asked: “Can AI help my analysts explore data faster?” MHP will demo Cortex, Genie, and Copilot-style assistants — but the medallion pipeline remains the foundation.

Trainer UI readiness

Before starting reflection, confirm attendees see:
- Snowflake Snowsight open with a SQL file ready for Cortex AI SQL
- Attendees ran 05_cortex_access.sql on their own trial (or can re-run if AI_COMPLETE fails)
- Databricks workspace with Genie icon visible in the left sidebar (under SQL section)
- Gold tables exist from at least one pipeline (Module 2 or 3 completed)

Questions

Situation — What task is Marcus trying to speed up — building pipelines, writing SQL, or reading dashboards?
Challenge — Where could AI help in this NYC Taxi project? Where would you not trust AI?
Action — If an AI tool writes SQL against Silver, what must still exist in your architecture for the answer to be trustworthy?
Governance — Who is accountable if AI-generated SQL exposes wrong revenue numbers to Marcus?
Priya — Could AI replace Priya’s Power BI dashboard? Why or why not?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2).

#	Question	Answer
1	What is Marcus speeding up?	Analyst SQL and exploration — writing queries, classifying data, natural-language questions over Gold. Not replacing the medallion pipeline build (that’s still Bob/Elena).
2	Trust AI / not trust	Help: boilerplate SQL, `AI_COMPLETE` classification, Genie/Cortex Analyst on Gold, dbt doc generation. Don’t trust: architectural tool choices, financial KPI definitions without review, AI on Bronze raw, using LLM output as join keys.
3	AI SQL against Silver — what must exist?	Governed Silver tables (quality filters applied), Unity Catalog/Horizon permissions, documented KPI definitions, human accountability, and preferably dbt tests so the AI is querying trusted data — not garbage-in-garbage-out.
4	Governance / accountability	Priya + James + Elena — AI is a tool; YellowLine leadership still owns numbers shown to Marcus. Log prompts, validate samples, never auto-publish LLM classifications to exec reports.
5	AI replace Power BI?	No — Marcus needs curated dashboards (Overview, Map, Revenue) with fixed KPI contracts, refresh schedules, and mobile-friendly visuals. AI assists exploration; Power BI is the governed consumption layer.

Whiteboard capture

AI helps with =
AI should not =
Still required = (medallion, quality, lineage)

Bridge to theory

“AI augments the stack — it does not replace Bronze, Silver, Gold, or Priya’s KPI definitions.”

Trainer note — Cortex vs Module 9 ML

Say explicitly before exercises:

“Module 6 uses LLM assistants (Snowflake Cortex AI_COMPLETE, Copilot, and Databricks Genie) to help analysts write SQL and explore data. Module 9 (optional) uses Cortex ML functions (ML.FORECAST, ML.ANOMALY_DETECTION) for prediction. Different APIs, different purpose.”

Module 7 — Power BI Payoff & Open Tool Discussion

Google Form: Submit reflection / decision matrix

Video: media/modules/mod-07-wrapup.mp4
Duration: 5 minutes reflection + 20–25 minutes open discussion

Situation recap

Priya presented the finished Power BI dashboard to Marcus — Overview, Map, Time Analysis, Revenue, Efficiency — all fed by Gold KPIs Bob built. Marcus asks: “What should we run in production?” MHP’s engagement is ending; trainees must recommend a tool strategy.

Silent reflection (2 minutes — no talking)

Hand out the architecture decision matrix (one page per trainee), then ask trainees to write privately:

My recommended stack for YellowLine NYC is: Databricks / Snowflake / dbt / combination — (circle one)
One sentence why:
One tool I would not choose as the primary platform and why:

Trainer answers — silent reflection

Warning

Trainer only — No single correct stack. Use to sanity-check trainee reasoning; full facilitation in open-discussion-guide.qmd.

Prompt	Reasonable answers
Recommended stack	Snowflake + dbt + Power BI if Marcus’s team is SQL-only and board wants lineage. Databricks-only if small eng team and ML/streaming on roadmap. Combination is valid but adds integration cost.
One sentence why	Tie to skills, governance, cost, or time-to-dashboard — not “because we used it in lab.”
Tool to avoid as primary	e.g. dbt alone (no ingest), Power BI (consumption only), Databricks if team will never maintain notebooks — must match their stated constraint.

Discussion questions (pick 4–6)

Use after silent reflection. See open-discussion-guide.qmd for full facilitation rounds.

Tool comparison

If Marcus’s team is SQL-only, what do you recommend and why?
Where did Databricks clearly win today? Where did it feel like overkill?
What did dbt add that Snowflake alone did not give you?
Could YellowLine NYC run only one of the three tools? What would they lose?

Decision criteria

Rank for your context: cost, skills, governance, speed to first dashboard — which matters most?
What would you decide differently if Marcus needed streaming (Module 8) or ML (Module 9)?

Architecture revisit

Look at the Story whiteboard. What would you change in your original design now that you have built all three pipelines?
Priya’s dashboard connects to Gold. Does that change which tool you prioritize for ingest vs transform vs consumption?

Trainer answers — discussion questions

#	Question	Answer
1	SQL-only team	Snowflake + dbt primary; Databricks optional for heavy ingest/ML later. Power BI unchanged on Gold.
2	Databricks wins / overkill	Wins: Spark ingest at scale, Delta, one platform for batch + streaming path. Overkill: simple SQL KPIs if team won’t run notebooks.
3	What dbt added	Versioned transform SQL, `dbt test`, `dbt docs` lineage — answers Marcus’s board audit question.
4	Only one tool?	Snowflake only: lose managed Spark ingest story. Databricks only: weaker SQL-analyst ergonomics. dbt only: cannot ingest raw Parquet by itself.
5	Rank criteria	No universal order — SQL team → skills + governance first; startup → speed; regulated → lineage + masking.
6	Streaming / ML changes	Streaming (Mod 8): favors Databricks Structured Streaming or Snowflake Dynamic Tables — batch-only Snowflake feels tight. ML (Mod 9): Databricks MLflow vs Snowflake Cortex — may keep Databricks for training even if Gold is in Snowflake.
7	Revise Story whiteboard	Trainees usually add dbt, explicit Silver quality, separate ingest vs transform owners, and Gold as BI contract. Celebrate what they got right day one.
8	Gold → tool priority	Consumption fixed (Power BI on Gold). Transform → dbt/SQL. Ingest → platform with best ADLS2 path (Spark or stages). Decisions are independent.
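
If trainees want to turn the ranking in answer 5 into numbers on the decision matrix, a weighted score is one simple way to do it. The sketch below is only an illustration: the option names, weights, and 1-5 scores are placeholders that trainees replace with their own, and nothing here implies a recommended stack.

```python
def rank_options(weights, scores):
    """Rank options by weighted score; weights are normalized to sum to 1."""
    total = sum(weights.values())
    ranked = []
    for option, option_scores in scores.items():
        value = sum(weights[c] / total * option_scores[c] for c in weights)
        ranked.append((option, round(value, 2)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Placeholder numbers only: trainees supply their own weights and 1-5 scores.
weights = {"cost": 1, "skills": 3, "governance": 2, "speed": 1}
scores = {
    "Option A": {"cost": 3, "skills": 5, "governance": 4, "speed": 4},
    "Option B": {"cost": 3, "skills": 2, "governance": 4, "speed": 3},
}
ranking = rank_options(weights, scores)
```

Changing the weights and rerunning makes the "no universal order" point visible: the same scores can produce a different winner for a different client context.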

Trainer answers — §3.10 Review Questions (self-check)

Aligns with trainee Module 7 §3.10 Review Questions. Use after Theory, before Quiz and open discussion — steer if pairs struggle; do not read aloud during §2 Think & Discuss.

Warning

Trainer only — Self-check answers; trainees should attempt these from memory first.

#	Question	Answer
1	Databricks winner vs Snowflake winner	Databricks wins: large-scale Spark ingest from ADLS2, Delta/Unity Catalog, Structured Streaming path (Module 8), sklearn + MLflow (Module 9). Snowflake wins: SQL-only maintainability, external stages + `COPY INTO`, elastic warehouse auto-suspend (cost), Cortex ML in SQL, Horizon governance for sharing.
2	Why dbt never alone?	dbt does not store data or ingest raw files — it compiles and runs SQL on Databricks or Snowflake. Without a warehouse, there is nowhere to land Bronze or execute `ref()` models.
3	Marcus’s three constraints	Cost (Year 3 TCO) → primary platform license, warehouse auto-suspend, dbt Core vs paid orchestration. Performance (live dispatch) → streaming-capable platform (Mod 8), right-sized compute, near-real-time Gold. Compliance (Q3 audit) → dbt lineage/tests, Unity Catalog or Horizon, immutable Bronze for replay. See characters and open-discussion-guide.

Whiteboard synthesis (fill during discussion)

Dimension	Databricks	Snowflake	dbt
Best for
Weak for
Fit for Marcus’s SQL team

Trainer close (2 minutes)

There is no single vendor answer.
Platform (where data runs), transform layer (how logic is managed), and consumption (Power BI) are separate decisions.
Revisit Story whiteboard — celebrate what trainees got right on day one.

Optional Power BI demo bridge

“Priya built the dashboard in the story. Let’s connect the same Gold tables live — or see the build guide in powerbi/README.md.”

Quick Reference — All Modules

Module	Core reflection question (one line)
Story	How would you design the solution?
1	What belongs in Silver vs Gold?
2	How do you ingest ADLS2 Parquet at scale?
3	How do you rebuild the same KPIs for a SQL team?
4	How do you prove where a KPI number comes from?
5	What runs every night without you?
6	Where does AI help — and where must the pipeline still govern truth?
7	What would you choose for YellowLine NYC in production?
8 (optional)	When does streaming beat batch — and how does Priya consume live Gold?
9 (optional)	Who owns features vs models — and how do predictions reach Power BI?

Module 8 — Streaming Data Processing (Optional)

Google Form: Submit reflection

Video: media/modules/mod-08-streaming.mp4
Duration: 8 minutes
When delivered: After main day (90 min block) or standalone advanced session
Prerequisites: Modules 2–3 required · Module 4 recommended (dbt dynamic_table track)
Story order: Deliver before Module 9 when following YellowLine NYC Phase 2 timeline

Situation recap

Marcus wants live visibility — dispatch needs demand signals in minutes, not after tonight’s batch job. Elena schedules Phase 2. For training, MHP uses Aiven Kafka user-activity events as a teaching proxy (same streaming patterns; different dataset than NYC Taxi Parquet).

Trainer UI readiness

Before starting reflection, confirm attendees see:
- Databricks cluster running with Kafka Maven libraries installed (check Compute → cluster → Libraries tab)
- Snowflake warehouse started; Workspaces SQL files open for Dynamic Table SQL
- Aiven Kafka topic user-activity has events flowing (trainer verifies via producer logs)

Questions

Situation — What does Marcus need that yesterday’s batch pipeline cannot give him?
Batch vs stream — When is batch still the right answer? Name one reason streaming is not worth the complexity.
Kafka — In one sentence each: what is a topic, a partition, and an offset?
Action — Would you read Kafka directly in Databricks, or relay to ADLS2 then Snowpipe? What trade-off drives that choice?
Windows — Why do stream processors use watermarks? What breaks without them?
Priya — Priya wants a Power BI page that refreshes every minute. Which Gold output from this module supports that — and why DirectQuery instead of Import?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2).

#	Question	Answer
1	What batch can’t give Marcus	Sub-hour dispatch signals — batch Gold refreshes overnight; dispatch needs minutes-old demand/activity counts, not yesterday’s aggregates.
2	When batch still wins	Historical analytics, cost sensitivity, simpler ops, NYC Taxi monthly reports — streaming adds Kafka ops, late data, exactly-once complexity.
3	Topic / partition / offset	Topic = named event stream; partition = ordered shard for parallelism; offset = sequential position of a record within a partition (consumers commit offsets as replay checkpoints).
4	Kafka in Databricks vs ADLS2 relay	Direct Structured Streaming from Kafka = lower latency, more streaming ops. ADLS2 landing + Snowpipe = higher latency, reuses batch skills. Trade-off: latency vs operational familiarity.
5	Watermarks	Handle late-arriving events — bound how long to wait for out-of-order data before closing a window. Without watermarks, windows never close or state grows unbounded.
6	Power BI every minute	Streaming Gold aggregate table (e.g. events-per-minute). DirectQuery hits live warehouse table; Import would snapshot stale data between refreshes.
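
The watermark behavior in answer 5 is easier to discuss with a concrete trace. The following is a simplified plain-Python sketch of tumbling event-time windows with a watermark. It is not how Structured Streaming or Dynamic Tables are implemented, and the window size, lateness, and event times are arbitrary illustrative values.

```python
def tumbling_counts(event_times, window_s=60, lateness_s=30):
    """Count events per event-time window, closing windows once the watermark passes them."""
    open_windows, closed, dropped = {}, {}, []
    watermark = float("-inf")
    for t in event_times:  # events in arrival order, t = event time in seconds
        start = (t // window_s) * window_s
        if start + window_s <= watermark:
            dropped.append(t)  # too late: its window is already closed
            continue
        open_windows[start] = open_windows.get(start, 0) + 1
        # Watermark trails the latest event time seen by the allowed lateness.
        watermark = max(watermark, t - lateness_s)
        # Finalize every window whose end is at or before the watermark; its state is freed.
        for s in [s for s in open_windows if s + window_s <= watermark]:
            closed[s] = open_windows.pop(s)
    return closed, open_windows, dropped

closed, still_open, dropped = tumbling_counts([5, 20, 70, 50, 95, 10, 130])
```

In this trace the event at time 50 arrives out of order but is still counted, while the event at time 10 arrives after its window has closed and is dropped. Without the watermark, the first window would stay open indefinitely and state would keep growing.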

Whiteboard capture

Marcus needs live =
Batch still wins when =
Databricks path = / Snowflake path =
Watermark purpose =

Bridge to theory

“Batch vs streaming, Kafka fundamentals, Structured Streaming, Dynamic Tables, and dbt dynamic_table — then you build the same Bronze → Silver → Gold pattern on live events.”

Trainer note — dataset honesty

Say explicitly:

“YellowLine NYC would stream taxi GPS or dispatch events. We use Aiven user-activity events so every attendee gets a live Kafka topic without NYC TLC streaming infrastructure. The patterns transfer; the schema does not.”

Module 9 — Machine Learning (Optional)

Google Form: Submit reflection

Video: media/modules/mod-09-ml.mp4
Duration: 8 minutes
When delivered: After main day (90 min block) or standalone advanced session
Prerequisites: Modules 2–3 required · Module 4 recommended (dbt feature table track)
Story order: Deliver after Module 8 when following YellowLine NYC Phase 2 timeline

Situation recap

Marcus wants to predict tip amounts on credit-card trips so operations can tune driver incentives and Priya can add a “predicted vs actual tip” view. Elena assigns Phase 2 ML. The lab uses existing NYC Taxi Silver data — same dataset as the main workshop, new use case (unlike Module 8’s Aiven proxy).

Trainer UI readiness

Before starting reflection, confirm attendees see:
- Databricks cluster with Runtime ML (15.x+) available; AI/ML → Experiments in sidebar
- Workspaces SQL files ready for ML.FORECAST SQL; each attendee confirmed SNOWFLAKE.CORTEX_USER on their trial (05_cortex_access.sql)
- silver_nyc_taxi_enriched table exists in both Databricks and Snowflake schemas

Questions

Situation — What business decision could tip prediction support for YellowLine NYC? Who consumes the output — Marcus, drivers, or Priya?
Lifecycle — Where does ML sit relative to Bronze, Silver, Gold, and Power BI? Who builds the feature table vs who trains the model?
Leakage — Why must total_amount not be a feature when predicting tip_amount? Name one other column that would leak the answer.
Filter — Why train on credit card trips only? What would the model learn if cash trips were included?
Tools — Match each to its primary ML role (no winner yet):
- Databricks sklearn + MLflow
- Snowflake Cortex ML (ML.FORECAST, ML.ANOMALY_DETECTION, …)
- Snowflake Snowpark ML
- dbt ml_features_tip_prediction
Priya — How could predicted tips land in Power BI next to existing Gold KPIs? Batch scoring table vs live inference?

Trainer answers

Warning

Trainer only — Aligns with trainee Think & Discuss (module §2).

#	Question	Answer
1	Business decision / consumer	Driver incentive tuning, pricing experiments — consumed by Marcus/ops (decisions) and Priya (predicted vs actual tip dashboard).
2	ML in medallion lifecycle	Feature table in Silver/Gold-adjacent schema (often dbt); training in Databricks/Snowflake; scored predictions written to Gold table; Power BI reads Gold like any KPI. Features = analytics engineer; model = ML engineer / DE with ML skills.
3	Why not `total_amount` as feature	Target leakage: `total_amount` already includes `tip_amount` (fare + surcharges + tolls + tip), so the model would “cheat” by subtracting the other components. Also avoid post-trip fields that encode the tip indirectly.
4	Credit-card trips only	Cash trips often have tip = $0 or null in data — model would learn payment artifact, not tipping behavior. Filter `payment_type_desc = 'Credit card'` for honest signal.
5	Tool roles	Databricks sklearn + MLflow = full algorithm choice + experiment tracking. Cortex `ML.FORECAST` = SQL-native, low code, less flexible. Snowpark ML = Python in Snowflake. dbt `ml_features_tip_prediction` = feature contract + tests, not training.
6	Predictions in Power BI	Batch scoring table `gold.tip_predictions` joined to enriched trips — simplest for workshop. Live inference = real-time endpoint — overkill for Marcus’s nightly ops review.
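
If trainees struggle with answers 3 and 4, a small sketch can make the two rules concrete. The example below is illustrative only and is not taken from the lab notebooks: it uses plain Python dictionaries instead of the real `silver_nyc_taxi_enriched` table, and the `fare_amount` and `trip_distance` columns are assumed for the sake of the example. It keeps credit-card trips and removes `total_amount` before anything reaches a model.

```python
# Illustrative sketch only; fare_amount and trip_distance are assumed columns.
TARGET = "tip_amount"
LEAKY_COLUMNS = {"total_amount"}  # already contains the tip, so it reveals the target


def build_training_rows(trips):
    """Return (features, targets) for credit-card trips with leaky columns removed."""
    features, targets = [], []
    for trip in trips:
        # Cash tips are mostly unrecorded, so keep credit-card trips only.
        if trip["payment_type_desc"] != "Credit card":
            continue
        targets.append(trip[TARGET])
        features.append({
            name: value
            for name, value in trip.items()
            # Drop the target, the leaky columns, and the now-constant filter column.
            if name != TARGET
            and name not in LEAKY_COLUMNS
            and name != "payment_type_desc"
        })
    return features, targets


sample_trips = [
    {"payment_type_desc": "Credit card", "fare_amount": 20.0,
     "trip_distance": 5.2, "total_amount": 24.0, "tip_amount": 4.0},
    {"payment_type_desc": "Cash", "fare_amount": 15.0,
     "trip_distance": 3.1, "total_amount": 15.0, "tip_amount": 0.0},
]

features, targets = build_training_rows(sample_trips)
print(features)  # [{'fare_amount': 20.0, 'trip_distance': 5.2}]
print(targets)   # [4.0]
```

In the sample, the cash trip would have taught a model that "cash means zero tip," which is a recording artifact rather than tipping behavior.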

Whiteboard capture

ML consumer =
Feature table owner = / Model trainer =
Leakage example =
dbt role in ML = (features + tests, not training)

Bridge to theory

“Feature engineering, leakage, credit-card filter — then Databricks + MLflow, Snowflake Cortex vs Snowpark, dbt as the feature contract, and compare effort vs flexibility across all three.”

End-of-module mini-discussion (10 min — after `ex-ml`)

Run after trainees fill the Compare Your Results table in the exercise:

“Lowest effort to first prediction — which approach?”
“Most algorithm flexibility — which approach?”
“SQL-only team — Cortex first or Snowpark?”
“Would you always define features in dbt before training? Why?”

Trainer answers — end-of-module mini-discussion

#	Question	Answer
1	Lowest effort to first prediction	Snowflake Cortex `ML.FORECAST` or similar SQL function — minutes if features exist.
2	Most algorithm flexibility	Databricks sklearn + MLflow — full Python ML ecosystem.
3	SQL-only team	Cortex first for baseline; Snowpark ML when custom sklearn needed inside Snowflake.
4	Features always in dbt?	Best practice yes — versioned feature definitions, tests, lineage to training sets; training can happen elsewhere but eats what dbt serves.

Facilitator close:

“dbt defines what the model eats. Databricks and Snowflake define where it trains. Priya still reads predictions from Gold — same consumption story as KPIs.”
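
To back up the close, the batch-scoring idea can be sketched in a few lines. This is a hedged illustration, not the workshop's implementation: the "model" is a simple average tip rate standing in for sklearn or Cortex, and `trip_id`, `fare_amount`, `predicted_tip`, and `actual_tip` are hypothetical column names. The output rows have the shape a table such as `gold.tip_predictions` might hold for a "predicted vs actual tip" view.

```python
from statistics import mean


def fit_tip_rate(training_trips):
    """Baseline stand-in for a real model: average tip as a fraction of fare."""
    return mean(
        trip["tip_amount"] / trip["fare_amount"]
        for trip in training_trips
        if trip["fare_amount"] > 0
    )


def score_batch(trips, tip_rate):
    """Build one prediction row per trip, ready to load into a Gold table."""
    return [
        {
            "trip_id": trip["trip_id"],
            "predicted_tip": round(trip["fare_amount"] * tip_rate, 2),
            "actual_tip": trip["tip_amount"],
        }
        for trip in trips
    ]


# Hypothetical sample data; real rows would come from the feature table.
training = [
    {"trip_id": 1, "fare_amount": 10.0, "tip_amount": 2.0},
    {"trip_id": 2, "fare_amount": 20.0, "tip_amount": 4.0},
]
new_trips = [{"trip_id": 3, "fare_amount": 30.0, "tip_amount": 5.0}]

tip_rate = fit_tip_rate(training)
predictions = score_batch(new_trips, tip_rate)
print(predictions)
```

Because scoring runs as a batch job and writes ordinary rows, the dashboard needs no live endpoint; it reads the predictions the same way it reads any other Gold KPI table.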

Trainer note — Cortex vs Module 6 AI

“Module 6 was Cortex LLM assistants. This module is predictive ML — sklearn, Snowpark ML, and ML.FORECAST. If trainees say ‘we already did Cortex,’ clarify the distinction.”

End of workshop — Final survey (Google Form)

Not a Think & Discuss step. Run at closing (after Mod 7, or after optional tracks).

When	What
Last 5–10 min of day	QR: End of Workshop Survey

Scan QR to open quiz

Who	What
Trainees	Rate each module (grid), overall workshop, takeaway, recommend, pace, optional written suggestions
Facilitator	Share the final survey URL from google-forms-links.md; review responses before the next cohort

Script:

“Please rate each module you attended, the workshop overall, how much you took away, whether you’d recommend it to a colleague, and whether the pace was right for you. Optional box at the end — tell us what to improve. Skip Mod 8–9 rows if you didn’t do those labs.”

Document History

Date	Change
2026-06-18	Module 7 §3.10 self-check answer key; optional Mod 8–9 Q6 aligned with trainee cards
2026-06-05	Added trainer answer keys for Think & Discuss (all modules)
2026-06-05	Reframed as oral discussion guide (no Google Form at this stage); Quiz moved after Theory
2026-05-31	End-of-workshop final survey (Google Form)
2026-05-23	Initial facilitator reflection prompts for Story–7
2026-05-23	Added Module 8 optional streaming reflection prompts
2026-05-23	Added Module 9 optional ML reflection prompts
2026-05-23	Aligned optional prerequisites (2–3 required, 4 recommended); Cortex 6 vs 9 notes
