Reflection Prompts
Facilitator guide for Think & Discuss — informal oral discussion in every module.
Day-of rhythm: Animation → Think & Discuss (this doc) → Theory → Quiz (Google Form) → Practice
Related docs:
- facilitator-guide.qmd · pre-class-checklist.qmd · module-delivery-pattern.qmd
- Live form URLs — trainee + trainer links (from forms-output.json; run node sync-forms-links.mjs after recreating forms)
- google-forms-reflection-design.md (HTML) — Google Form reflections per module + end-of-workshop survey
- TRAINING_MATERIAL_MIGRATION_PLAN.md
- module-prerequisites-and-order.md
- animation-production-scripts.md
- open-discussion-guide.qmd — Module 7 extended discussion
How to Facilitate Think & Discuss
- Play the module animation — do not skip; trainees need shared context.
- State the situation in one sentence — use the Situation recap below.
- Ask the questions in order — start with situation, then challenge, then action.
- Capture on whiteboard / Miro — 3–5 bullet points only; use trainee words.
- Let trainees answer first — use the Trainer answers below only to steer, unblock silence, or validate during Theory.
- Bridge to theory — say: “You mentioned [X]. Let’s see how the industry handles that.”
Note: Think & Discuss is oral only — no Google Form at this stage. The Quiz (Google Form) comes after Theory.
Timing
| Module | Target duration | Format |
|---|---|---|
| Story | 12 min | Individual sketch → pairs → whole group |
| 1 | 8 min | Pairs → share 2 answers |
| 2–6 | 7 min | Pairs → share 1 answer per question |
| 7 | 5 min | Silent write → open discussion follows |
Trainee role reminder (say once at start of day):
“Today you are Bob, the junior data engineer. Elena designed the architecture. Your job is to build, evaluate, and eventually recommend tools for Marcus.”
After Theory, share the matching Google Form quiz from google-forms-links.md (~2–3 min, mostly multiple choice, auto-scored).
Story — Welcome & Setup
Google Form: Submit reflection
Video: media/modules/mod-00-welcome.mp4
Duration: 12 minutes
Format: Individual (5 min) → pairs (3 min) → whole group (4 min)
Situation recap
YellowLine NYC has millions of taxi trips but no analytics platform. Marcus hired MHP to help optimize revenue and fleet operations. The MHP team has assembled; Priya needs KPIs; Elena wants medallion architecture — but no solution is built yet.
Questions (design worksheet)
Ask trainees to write or sketch answers before discussing:
- Situation — What is Marcus’s biggest problem in your own words?
- Data — Where does the trip data live today? How often should it be refreshed?
- Architecture — If you had to split the data into layers, how many would you use and what would each layer contain?
- Tools — Name two tools you would consider for the pipeline. Why those two?
- Consumption — Priya needs a dashboard. What must exist in the pipeline before she can build it?
- Risks — What could go wrong (data quality, cost, skills, maintenance)?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2). Do not read aloud before trainees discuss.
| # | Question | Expected themes |
|---|---|---|
| 1 | Marcus’s biggest problem | No analytics platform — fleet and revenue decisions are blind; millions of trips in files but no KPIs, dashboards, or trusted numbers for operations. |
| 2 | Data location & refresh | Today: Parquet + CSV on ADLS2 (mhpdeworkshopsa / nyc-taxi-data). Workshop uses batch refresh (nightly or on-demand); streaming is Phase 2 (Module 8). |
| 3 | Architecture layers | Three layers (medallion): Bronze = raw copy for audit/replay; Silver = cleaned, enriched trip rows; Gold = 12 pre-aggregated KPI tables for Priya. Accept “raw / trusted / consumption” if semantics match. |
| 4 | Tools | Any reasonable pair with a criterion — e.g. Databricks (scale, Spark ingest) + Snowflake (SQL ops), or Snowflake + dbt (transform-as-code). No single winner yet. |
| 5 | Before Priya’s dashboard | Gold KPI tables must exist (same schema all pipelines); Silver quality must be trustworthy; Bronze ingest must have run. Power BI connects to Gold, not raw files. |
| 6 | Risks | Bad data silently breaking KPIs; cost of cloud compute; team lacks PySpark skills; no lineage for auditors; skipping Bronze makes pipelines irreproducible. |
Pair prompt
Compare your architecture sketch with your partner. Where do you agree? Where do you disagree?
Whole-group capture (whiteboard)
| Column | Fill from trainees |
|---|---|
| Problems Marcus faces | |
| Layers / zones proposed | |
| Tools mentioned | |
| Risks mentioned | |
Save this whiteboard — revisit in Module 7.
Bridge to theory
“You just did what consultants do on day one. Next we set up the environment, then Elena will formalize the architecture you sketched.”
Module 1 — Data Engineering Fundamentals
Google Form: Submit reflection
Video: media/modules/mod-01-fundamentals.mp4
Duration: 8 minutes
Format: Pairs → share
Situation recap
Elena has proposed medallion architecture: Bronze, Silver, Gold. Priya has listed the business questions but her Power BI dashboard is still empty. The team must agree what belongs in each layer before anyone writes code.
Questions
- Situation — Why does Elena want three layers instead of one big table Marcus can report from?
- Challenge — What belongs in Silver vs Gold? Give one example column or metric for each.
- Action — Priya asks: “When are our peak revenue hours?” Which layer would you query to answer that — Bronze, Silver, or Gold? Why?
- ETL vs ELT — Would you transform trip data before or after loading into the platform? What is one reason for your choice?
- Tools — Databricks, Snowflake, AWS, Cloudera appear on the board. What is one criterion you would use to compare them — without picking a winner yet?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2).
| # | Question | Answer |
|---|---|---|
| 1 | Why three layers? | Separation of concerns: Bronze preserves raw source for replay/audit; Silver is the quality gate; Gold is BI-ready aggregates. One big table mixes dirty raw data with KPIs — errors compound and dashboards slow down. |
| 2 | Silver vs Gold examples | Silver: silver_nyc_taxi_enriched — one row per trip, cleaned fares, zone names joined, trip_duration_minutes. Gold: kpi_trips_by_hour — aggregated trip counts/revenue by hour for fast dashboard queries. |
| 3 | Peak revenue hours — which layer? | Gold — query kpi_revenue_by_hour or kpi_trips_by_hour. Silver has row-level trips (works but wrong contract); Bronze is never queried by analysts. |
| 4 | ETL vs ELT | ELT for this workshop — load raw Parquet to Bronze first, transform in-place on Databricks/Snowflake. Reason: cheap object storage + elastic compute; iterate on SQL/PySpark without re-extracting from source. |
| 5 | Tool comparison criterion | Examples: team skills (SQL vs PySpark), governance (Unity Catalog / Horizon), ingest pattern (Spark vs COPY INTO), total cost, time to first dashboard. |
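
Trainer aid for row 2: if trainees struggle to separate Silver from Gold, a tiny sketch can help. The plain-Python example below is not the workshop's PySpark or SQL; it only illustrates how row-level Silver trips collapse into an hourly Gold table. The sample rows and the field names `pickup_hour`, `total_amount`, `trip_count`, and `revenue` are assumptions for illustration; only the table name `kpi_trips_by_hour` comes from this guide.

```python
from collections import defaultdict

# Hypothetical Silver rows: one row per cleaned trip (illustrative values only).
silver_trips = [
    {"pickup_hour": 8, "total_amount": 12.50},
    {"pickup_hour": 8, "total_amount": 20.00},
    {"pickup_hour": 18, "total_amount": 31.25},
]


def build_kpi_trips_by_hour(trips):
    """Aggregate row-level Silver trips into a Gold-style hourly KPI table."""
    buckets = defaultdict(lambda: {"trip_count": 0, "revenue": 0.0})
    for trip in trips:
        bucket = buckets[trip["pickup_hour"]]
        bucket["trip_count"] += 1
        bucket["revenue"] += trip["total_amount"]
    # One output row per hour, sorted so a chart reads it in time order.
    return [{"pickup_hour": hour, **values} for hour, values in sorted(buckets.items())]


kpi_trips_by_hour = build_kpi_trips_by_hour(silver_trips)
```

Silver keeps every trip; Gold keeps one row per hour. That is why the dashboard query against Gold stays fast as trip volume grows.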
Whiteboard capture
- Silver = (trainee definition)
- Gold = (trainee definition)
- ELT or ETL = (majority vote + one reason)
Bridge to theory
“Let’s define ETL vs ELT precisely and walk through the NYC Taxi schema Priya and James will use.”
Module 2 — Databricks Pipeline
Google Form: Submit reflection
Video: media/modules/mod-02-databricks.mp4
Duration: 7 minutes
Format: Pairs → share one answer each
Situation recap
Elena approved Bob’s request to prototype on Databricks. Sofia will mentor. Raw Parquet files sit in Azure ADLS2. Priya is waiting for Gold tables to populate her dashboard Overview page.
Trainer UI readiness
Before starting reflection, confirm attendees see:
- Databricks workspace open in browser with left sidebar visible
- At least one cluster in Terminated state (they will start it during Practice)
- Notebooks 01_bronze_ingestion.py, 02_silver_cleaning.py, 03_gold_kpis.py visible in Workspace → Shared folder
Questions
- Situation — Why might MHP start with Databricks for this use case instead of Excel or a single SQL script?
- Challenge — How would you ingest Parquet from ADLS2 into a Bronze layer? What tool or command family would you expect to use?
- Action — What could go wrong in Bronze if you skip data quality checks and jump straight to KPIs?
- Silver — Name two transformations that belong in Silver (not Bronze, not Gold).
- Priya — After Gold is built, which two KPI tables would unlock Priya’s trips-by-hour and day-of-week charts?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2).
| # | Question | Answer |
|---|---|---|
| 1 | Why Databricks vs Excel/script? | Scale + lakehouse: millions of Parquet rows, distributed spark.read, Delta ACID tables, Unity Catalog governance. Excel/SQL script breaks on volume and has no medallion ops story. |
| 2 | Ingest Parquet to Bronze | spark.read.parquet("abfss://nyc-taxi-data@.../raw/trips/") → write Delta with saveAsTable to {catalog}.{attendee}_bronze.nyc_taxi_trips. Storage credential configured in 00_setup.py. |
| 3 | Skip quality → jump to KPIs | Silent wrong numbers: null/zero fares, duplicate trips, cross-month outliers, bad joins — Gold looks plausible but Priya’s dashboard is wrong. Classic “skip Silver” pitfall. |
| 4 | Silver transforms (two) | Filter invalid trips (trip_distance > 0, valid datetimes); join taxi_zone_lookup for borough names; standardize column names; compute trip_duration_minutes, fare per mile. |
| 5 | Gold tables for Priya | kpi_trips_by_hour (Overview time chart) and kpi_trips_by_day or kpi_time_of_day_analysis (day-of-week / time-of-day). Accept kpi_revenue_by_hour if they tie it to revenue view. |
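
Trainer aid for row 4: the Silver transforms named above can be sketched in a few lines. This is plain Python rather than the notebook's PySpark, and it is only a sketch under assumptions: the sample rows, the field names `pickup`, `dropoff`, and `fare_amount`, and the timestamp format are hypothetical. `trip_distance` and `trip_duration_minutes` follow the names used in this guide.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format for this sketch

# Hypothetical Bronze rows (raw copy); values are illustrative only.
bronze_trips = [
    {"pickup": "2024-01-05 08:00:00", "dropoff": "2024-01-05 08:30:00",
     "trip_distance": 5.0, "fare_amount": 20.0},
    {"pickup": "2024-01-05 09:00:00", "dropoff": "2024-01-05 09:10:00",
     "trip_distance": 0.0, "fare_amount": 7.5},   # zero distance: filtered out
    {"pickup": "2024-01-05 10:00:00", "dropoff": "2024-01-05 09:50:00",
     "trip_distance": 2.0, "fare_amount": 9.0},   # dropoff before pickup: filtered out
]


def to_silver(rows):
    """Filter invalid trips and add derived columns, as a Silver step would."""
    silver = []
    for row in rows:
        pickup = datetime.strptime(row["pickup"], TS_FORMAT)
        dropoff = datetime.strptime(row["dropoff"], TS_FORMAT)
        if row["trip_distance"] <= 0 or dropoff <= pickup:
            continue  # quality gate: invalid trips never reach Gold
        silver.append({
            **row,
            "trip_duration_minutes": (dropoff - pickup).total_seconds() / 60,
            "fare_per_mile": row["fare_amount"] / row["trip_distance"],
        })
    return silver


silver_trips = to_silver(bronze_trips)
```

Only the first trip survives. The two dropped rows are the kind of data that silently skews KPIs when Silver is skipped.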
Whiteboard capture
- Bronze ingest approach =
- Silver transforms =
- Gold tables for Priya =
Bridge to theory
“You mentioned [ingest approach]. Let’s look at Unity Catalog, Delta Lake, and the notebook flow ADLS2 → Bronze → Silver → Gold.”
Module 3 — Snowflake Pipeline
Google Form: Submit reflection
Video: media/modules/mod-03-snowflake.mp4
Duration: 7 minutes
Format: Pairs → share
Situation recap
Marcus reviewed the Databricks prototype and said: “My team lives in SQL. They won’t maintain PySpark notebooks.” Elena asked Bob to rebuild the same medallion design on Snowflake. Priya needs to connect her Map page to Gold — same KPIs, different engine.
Trainer UI readiness
Module 3 is the first Snowflake module — attendees create their own trial accounts and run setup during this module. Before starting reflection:
- Confirm each attendee has completed Snowflake trial signup (or started the process)
- Distribute SAS tokens needed for External Stage setup
- After 00_account_setup.sql runs: each attendee should have DE_MASTERCLASS database, DE_WORKSHOP_WH warehouse, and DE_WORKSHOP_ROLE role
- Attendee switches to DE_WORKSHOP_ROLE in the Snowsight role selector (top-right) before starting exercises
Questions
- Situation — What is Marcus really asking for — a different tool, a different skill set, or both?
- Challenge — How do you keep the same 12 KPIs without PySpark? What would you rewrite?
- Action — Who at YellowLine NYC will maintain this pipeline after MHP leaves? What skills do they need?
- Ingest — Snowflake cannot run Spark notebooks. How would you load Parquet from ADLS2 into Snowflake Bronze?
- Priya — If Gold schema stays identical, can Priya keep the same Power BI reports? Why or why not?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2).
| # | Question | Answer |
|---|---|---|
| 1 | Marcus asking for tool or skills? | Both — he wants a SQL-first platform his team can maintain after MHP leaves, not PySpark notebooks only a consultant understands. |
| 2 | Same 12 KPIs without PySpark | Rewrite transforms as SQL worksheets or Snowpark stored procedures — same Silver logic (filters, joins, aggregates), different language/runtime. Gold table names and columns stay identical. |
| 3 | Who maintains after MHP? | YellowLine NYC internal team — SQL analysts / analytics engineers. They need Snowflake worksheets, stages, tasks — not Databricks cluster admin. |
| 4 | Load Parquet without Spark | External stage on ADLS2 (SAS URL) + COPY INTO bronze.nyc_taxi_trips from @stage/path. Two-step but familiar to SQL teams. |
| 5 | Same Power BI reports? | Yes, mostly — same Gold schema (kpi_* tables, same columns) means same DAX and visuals. Only the connector changes (Snowflake vs Databricks SQL warehouse/connector). Semantic model may need a reconnect, not a rebuild. |
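
Trainer aid for row 5: "same Gold schema" is a claim that can be checked. The sketch below compares column lists for one Gold table across the two pipelines. The column names are hypothetical, and the comparison ignores case on the assumption that Snowflake reports unquoted identifiers in upper case. In practice the column lists would come from each platform's catalog, which this sketch does not query.

```python
# Hypothetical column lists for one Gold table as produced by each pipeline.
databricks_gold = {"kpi_trips_by_hour": ["pickup_hour", "trip_count", "total_revenue"]}
snowflake_gold = {"kpi_trips_by_hour": ["PICKUP_HOUR", "TRIP_COUNT", "TOTAL_REVENUE"]}


def schema_mismatches(left, right):
    """Return table names whose column lists differ (case-insensitive)."""
    mismatched = []
    for table in sorted(set(left) | set(right)):
        left_cols = [col.lower() for col in left.get(table, [])]
        right_cols = [col.lower() for col in right.get(table, [])]
        if left_cols != right_cols:
            mismatched.append(table)
    return mismatched


mismatches = schema_mismatches(databricks_gold, snowflake_gold)
```

An empty result supports the "reconnect, not rebuild" answer. Any table listed in `mismatches` would need its report fields remapped.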
Whiteboard capture
- Marcus’s constraint =
- Maintainability owner =
- Snowflake ingest idea =
Bridge to theory
“Let’s compare external stages, SQL worksheets, and Snowpark — and see how Snowflake implements the same Silver logic you built in PySpark.”
Module 4 — dbt Pipeline
Google Form: Submit reflection
Video: media/modules/mod-04-dbt.mp4
Duration: 7 minutes
Format: Pairs → share
Situation recap
Marcus’s board asked: “Where does each dashboard number come from?” Elena added dbt on top of Snowflake for SQL transformations, tests, and lineage. Priya connects revenue and payment pages — and wants a data quality scorecard.
Trainer UI readiness
Before starting reflection, confirm attendees see:
- Terminal open in Codespaces or Docker container
- dbt_project/ directory accessible; profiles.yml configured
- dbt --version shows Core 1.8.x with snowflake and databricks adapters
Questions
- Situation — What problem does Marcus have that Snowflake SQL alone did not fully solve?
- Challenge — Is dbt a replacement for Snowflake? If not, what is dbt’s job in one sentence?
- Action — Marcus clicks a revenue tile in Power BI. How would you prove which model and source table that number came from?
- Tests — Name two tests you would add so bad data never silently breaks a KPI.
- Priya — Which Gold KPI table would feed a data quality scorecard? What would it measure?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2).
| # | Question | Answer |
|---|---|---|
| 1 | Problem Snowflake SQL alone didn’t solve | Governance for the board — no built-in lineage from KPI tile → model → source; tests not version-controlled; transform SQL scattered across worksheets. |
| 2 | Is dbt a Snowflake replacement? | No. One sentence: dbt compiles and runs SQL transformations on top of Snowflake (or Databricks) — it owns models, tests, and docs, not storage or ingest. |
| 3 | Prove revenue tile lineage | dbt docs generate lineage graph; trace ref() chain from Gold mart → Silver staging → Bronze source; match table name in Power BI to dbt model kpi_revenue_by_hour (or equivalent). Manifest + exposures for BI links. |
| 4 | Two tests | not_null on pickup_datetime / trip_id; accepted_values on payment_type; unique on grain keys; relationships test to zone lookup. Any two that catch silent KPI breakage. |
| 5 | Data quality scorecard table | kpi_data_quality — measures row counts Bronze vs Silver vs Gold, null rates, filter drop %, test pass/fail summary. Priya displays trend tiles; James validates thresholds. |
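
Trainer aid for row 4: trainees who have not used dbt may ask what these tests check. The sketch below mirrors the intent of three generic tests in plain Python. It is not dbt code, and the sample rows and the `payment_type` values are hypothetical. As in dbt, each check returns the failing rows, and a test passes when nothing is returned.

```python
# Hypothetical Silver rows; trip_id and payment_type values are illustrative.
rows = [
    {"trip_id": 1, "payment_type": "credit_card"},
    {"trip_id": 2, "payment_type": "cash"},
    {"trip_id": 2, "payment_type": "barter"},   # duplicate id, unexpected value
    {"trip_id": None, "payment_type": "cash"},  # missing id
]


def not_null(rows, column):
    """Rows where the column is missing."""
    return [row for row in rows if row[column] is None]


def unique(rows, column):
    """Rows that repeat an already-seen non-null value."""
    seen, duplicates = set(), []
    for row in rows:
        value = row[column]
        if value is None:
            continue  # nulls are the job of not_null, not unique
        if value in seen:
            duplicates.append(row)
        seen.add(value)
    return duplicates


def accepted_values(rows, column, values):
    """Rows whose value is outside the allowed set."""
    return [row for row in rows if row[column] not in values]


failures = {
    "not_null_trip_id": len(not_null(rows, "trip_id")),
    "unique_trip_id": len(unique(rows, "trip_id")),
    "accepted_values_payment_type": len(
        accepted_values(rows, "payment_type", {"credit_card", "cash"})
    ),
}
```

Each check catches a different silent failure: missing keys, double-counted trips, and unmapped categories.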
Whiteboard capture
- dbt’s role = (layer on warehouse, not replacement)
- Lineage proof =
- Tests to add =
Bridge to theory
“dbt models, ref(), tests, and dbt docs generate — let’s see the lineage graph Marcus’s auditors want.”
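
Trainer aid for the lineage question: tracing a ref() chain is a walk over a dependency graph. The sketch below does that walk in plain Python over a hand-written map. The map is hypothetical and only reuses table names mentioned in this guide. In a real project the dependencies come from the dbt manifest, which this sketch does not read.

```python
# Hypothetical dependency map: each model lists the models it selects from,
# similar to what ref() and source() calls declare in a dbt project.
upstream = {
    "kpi_revenue_by_hour": ["silver_nyc_taxi_enriched"],
    "silver_nyc_taxi_enriched": ["nyc_taxi_trips", "taxi_zone_lookup"],
    "nyc_taxi_trips": [],
    "taxi_zone_lookup": [],
}


def trace_lineage(model, graph):
    """Return every upstream dependency of `model`, nearest first."""
    ordered, queue = [], list(graph[model])
    while queue:
        current = queue.pop(0)
        if current in ordered:
            continue  # already visited through another path
        ordered.append(current)
        queue.extend(graph[current])
    return ordered


lineage = trace_lineage("kpi_revenue_by_hour", upstream)
```

Here `lineage` lists the Silver table first, then the two Bronze inputs. That is the path a trainer can draw on the whiteboard when Marcus asks where a revenue tile comes from.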
Module 5 — Production Patterns
Google Form: Submit reflection
Video: media/modules/mod-05-production.mp4
Duration: 7 minutes
Format: Pairs → share
Situation recap
The pipeline works in a notebook and in a manual dbt run. Elena asks: “What executes every night when we are not in the room?” YellowLine NYC needs scheduled, reliable, monitored jobs.
Trainer UI readiness
Before starting reflection, confirm attendees see:
- Databricks Workflows page accessible (left sidebar → Workflows icon)
- Snowflake worksheet open with the Task creation SQL ready to paste
- dbt project with production/ directory containing CI config files
Questions
- Situation — What is different about production compared to the lab environment you used this morning?
- Challenge — What breaks in production that almost never breaks in a classroom exercise?
- Action — How would you schedule the nightly Bronze → Silver → Gold run on Databricks? On Snowflake? On dbt?
- Failure — If Silver fails at 2 a.m., who should know? What should happen to Gold KPIs?
- Change control — An analyst edits a SQL model on Friday afternoon. What process stops that change from breaking Monday’s dashboard?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2).
| # | Question | Answer |
|---|---|---|
| 1 | Production vs lab | Lab = manual notebook runs, interactive debug, no alerts. Production = scheduled jobs, retries, monitoring, secrets in vaults, CI/CD, access control — runs when nobody is in the room. |
| 2 | What breaks in production | Silent failures at 2 a.m., partial loads, schema drift, credential expiry, cluster/warehouse auto-stop, upstream ADLS2 file missing, dbt test failures blocking deploy. |
| 3 | Schedule nightly medallion | Databricks: Lakeflow Workflows job (setup → bronze → silver → gold). Snowflake: Tasks chain on worksheets/procedures. dbt: dbt build in dbt Cloud job or GitHub Actions on cron. |
| 4 | Silver fails at 2 a.m. | On-call data engineer alerted (email/Slack/PagerDuty); Gold job must not run (dependency failure); dashboard shows stale data or freshness flag; incident ticket opened. |
| 5 | Friday analyst edit | Git PR + CI running dbt test / SQL lint; code review; deploy only after green pipeline; no direct prod worksheet edits. Asset bundles for Databricks; tagged releases for Snowflake tasks. |
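
Trainer aid for row 4: the rule "Gold must not run when Silver fails" is dependency handling. The sketch below shows it with a minimal sequential runner in plain Python. It stands in for what a scheduler does and is not the API of Databricks jobs, Snowflake Tasks, or dbt. The step functions and the error message are hypothetical.

```python
def run_chain(steps):
    """Run steps in order; after the first failure, mark the rest as skipped."""
    results = {}
    failed = False
    for name, step in steps:
        if failed:
            results[name] = "skipped"  # downstream of a failure: do not run
            continue
        try:
            step()
            results[name] = "success"
        except Exception as exc:
            results[name] = f"failed: {exc}"
            failed = True  # this is where an alert would be sent
    return results


def bronze():
    pass  # pretend the ingest succeeded


def silver():
    raise ValueError("schema drift in source file")  # simulated 2 a.m. failure


def gold():
    pass  # never reached in this run


results = run_chain([("bronze", bronze), ("silver", silver), ("gold", gold)])
```

Gold ends up "skipped", so the dashboard keeps yesterday's numbers instead of showing KPIs built on a broken Silver table.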
Whiteboard capture
- Production vs lab =
- Scheduling tool (per platform) =
- On failure =
Bridge to theory
“Workflows, Tasks, incremental loads, CI/CD for dbt — the go-live checklist Elena uses with clients.”
Module 6 — AI Features
Google Form: Submit reflection
Video: media/modules/mod-06-ai.mp4
Duration: 7 minutes
Format: Pairs → share
Situation recap
Marcus saw Priya’s dashboard and asked: “Can AI help my analysts explore data faster?” MHP will demo Cortex, Genie, and Copilot-style assistants — but the medallion pipeline remains the foundation.
Trainer UI readiness
Before starting reflection, confirm attendees see:
- Snowflake Snowsight open with a worksheet ready for Cortex AI SQL
- Databricks workspace with Genie icon visible in the left sidebar (under SQL section)
- Gold tables exist from at least one pipeline (Module 2 or 3 completed)
Questions
- Situation — What task is Marcus trying to speed up — building pipelines, writing SQL, or reading dashboards?
- Challenge — Where could AI help in this NYC Taxi project? Where would you not trust AI?
- Action — If an AI tool writes SQL against Silver, what must still exist in your architecture for the answer to be trustworthy?
- Governance — Who is accountable if AI-generated SQL exposes wrong revenue numbers to Marcus?
- Priya — Could AI replace Priya’s Power BI dashboard? Why or why not?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2).
| # | Question | Answer |
|---|---|---|
| 1 | What is Marcus speeding up? | Analyst SQL and exploration — writing queries, classifying data, natural-language questions over Gold. Not replacing the medallion pipeline build (that’s still Bob/Elena). |
| 2 | Trust AI / not trust | Help: boilerplate SQL, AI_COMPLETE classification, Genie/Cortex Analyst on Gold, dbt doc generation. Don’t trust: architectural tool choices, financial KPI definitions without review, AI on Bronze raw, using LLM output as join keys. |
| 3 | AI SQL against Silver — what must exist? | Governed Silver tables (quality filters applied), Unity Catalog/Horizon permissions, documented KPI definitions, human accountability, and preferably dbt tests so the AI is querying trusted data — not garbage-in-garbage-out. |
| 4 | Governance / accountability | Priya + James + Elena — AI is a tool; YellowLine leadership still owns numbers shown to Marcus. Log prompts, validate samples, never auto-publish LLM classifications to exec reports. |
| 5 | AI replace Power BI? | No — Marcus needs curated dashboards (Overview, Map, Revenue) with fixed KPI contracts, refresh schedules, and mobile-friendly visuals. AI assists exploration; Power BI is the governed consumption layer. |
Whiteboard capture
- AI helps with =
- AI should not =
- Still required = (medallion, quality, lineage)
Bridge to theory
“AI augments the stack — it does not replace Bronze, Silver, Gold, or Priya’s KPI definitions.”
Trainer note — Cortex vs Module 9 ML
Say explicitly before exercises:
“Module 6 uses Cortex LLM functions — AI_COMPLETE, Copilot, Genie — to help analysts write SQL and explore data. Module 9 (optional) uses Cortex ML functions — ML.FORECAST, ML.ANOMALY_DETECTION — for prediction. Different APIs, different purpose.”
Module 7 — Power BI Payoff & Open Tool Discussion
Google Form: Submit reflection / decision matrix
Video: media/modules/mod-07-wrapup.mp4
Duration: 5 minutes reflection + 20–25 minutes open discussion
Format: Silent write → structured discussion
Situation recap
Priya presented the finished Power BI dashboard to Marcus — Overview, Map, Time Analysis, Revenue, Efficiency — all fed by Gold KPIs Bob built. Marcus asks: “What should we run in production?” MHP’s engagement is ending; trainees must recommend a tool strategy.
Silent reflection (2 minutes — no talking)
Hand out the architecture decision matrix (one page per trainee), then ask trainees to write privately:
- My recommended stack for YellowLine NYC is: Databricks / Snowflake / dbt / combination — (circle one)
- One sentence why:
- One tool I would not choose as the primary platform and why:
Trainer answers — silent reflection
Trainer only — No single correct stack. Use to sanity-check trainee reasoning; full facilitation in open-discussion-guide.qmd.
| Prompt | Reasonable answers |
|---|---|
| Recommended stack | Snowflake + dbt + Power BI if Marcus’s team is SQL-only and board wants lineage. Databricks-only if small eng team and ML/streaming on roadmap. Combination is valid but adds integration cost. |
| One sentence why | Tie to skills, governance, cost, or time-to-dashboard — not “because we used it in lab.” |
| Tool to avoid as primary | e.g. dbt alone (no ingest), Power BI (consumption only), Databricks if team will never maintain notebooks — must match their stated constraint. |
Discussion questions (pick 4–6)
Use after silent reflection. See open-discussion-guide.qmd for full facilitation rounds.
Tool comparison
- If Marcus’s team is SQL-only, what do you recommend and why?
- Where did Databricks clearly win today? Where did it feel like overkill?
- What did dbt add that Snowflake alone did not give you?
- Could YellowLine NYC run only one of the three tools? What would they lose?
Decision criteria
- Rank for your context: cost, skills, governance, speed to first dashboard — which matters most?
- What would you decide differently if Marcus needed streaming (Module 8) or ML (Module 9)?
Architecture revisit
- Look at the Story whiteboard. What would you change in your original design now that you have built all three pipelines?
- Priya’s dashboard connects to Gold. Does that change which tool you prioritize for ingest vs transform vs consumption?
Trainer answers — discussion questions
| # | Question | Answer |
|---|---|---|
| 1 | SQL-only team | Snowflake + dbt primary; Databricks optional for heavy ingest/ML later. Power BI unchanged on Gold. |
| 2 | Databricks wins / overkill | Wins: Spark ingest at scale, Delta, one platform for batch + streaming path. Overkill: simple SQL KPIs if team won’t run notebooks. |
| 3 | What dbt added | Versioned transform SQL, dbt test, dbt docs lineage — answers Marcus’s board audit question. |
| 4 | Only one tool? | Snowflake only: lose managed Spark ingest story. Databricks only: weaker SQL-analyst ergonomics. dbt only: cannot ingest raw Parquet by itself. |
| 5 | Rank criteria | No universal order — SQL team → skills + governance first; startup → speed; regulated → lineage + masking. |
| 6 | Streaming / ML changes | Streaming (Mod 8): favors Databricks Structured Streaming or Snowflake Dynamic Tables — batch-only Snowflake feels tight. ML (Mod 9): Databricks MLflow vs Snowflake Cortex — may keep Databricks for training even if Gold is in Snowflake. |
| 7 | Revise Story whiteboard | Trainees usually add dbt, explicit Silver quality, separate ingest vs transform owners, and Gold as BI contract. Celebrate what they got right day one. |
| 8 | Gold → tool priority | Consumption fixed (Power BI on Gold). Transform → dbt/SQL. Ingest → platform with best ADLS2 path (Spark or stages). Decisions are independent. |
Whiteboard synthesis (fill during discussion)
| Dimension | Databricks | Snowflake | dbt |
|---|---|---|---|
| Best for | | | |
| Weak for | | | |
| Fit for Marcus’s SQL team | | | |
Trainer close (2 minutes)
- There is no single vendor answer.
- Platform (where data runs), transform layer (how logic is managed), and consumption (Power BI) are separate decisions.
- Revisit Story whiteboard — celebrate what trainees got right on day one.
Optional Power BI demo bridge
“Priya built the dashboard in the story. Let’s connect the same Gold tables live — or see the build guide in powerbi/README.md.”
Quick Reference — All Modules
| Module | Core reflection question (one line) |
|---|---|
| Story | How would you design the solution? |
| 1 | What belongs in Silver vs Gold? |
| 2 | How do you ingest ADLS2 Parquet at scale? |
| 3 | How do you rebuild the same KPIs for a SQL team? |
| 4 | How do you prove where a KPI number comes from? |
| 5 | What runs every night without you? |
| 6 | Where does AI help — and where must the pipeline still govern truth? |
| 7 | What would you choose for YellowLine NYC in production? |
| 8 (optional) | When does streaming beat batch — and at what cost? |
| 9 (optional) | Who owns features vs models — and where does each tool fit? |
Module 8 — Streaming Data Processing (Optional)
Google Form: Submit reflection
Video: media/modules/mod-08-streaming.mp4
Duration: 8 minutes
Format: Pairs → share
When delivered: After main day (90 min block) or standalone advanced session
Prerequisites: Modules 2–3 required · Module 4 recommended (dbt dynamic_table track)
Story order: Deliver before Module 9 when following YellowLine NYC Phase 2 timeline
Situation recap
Marcus wants live visibility — dispatch needs demand signals in minutes, not after tonight’s batch job. Elena schedules Phase 2. For training, MHP uses Aiven Kafka user-activity events as a teaching proxy (same streaming patterns; different dataset than NYC Taxi Parquet).
Trainer UI readiness
Before starting reflection, confirm attendees see:
- Databricks cluster running with Kafka Maven libraries installed (check Compute → cluster → Libraries tab)
- Snowflake warehouse started; worksheets open for Dynamic Table SQL
- Aiven Kafka topic user-activity has events flowing (trainer verifies via producer logs)
Questions
- Situation — What does Marcus need that yesterday’s batch pipeline cannot give him?
- Batch vs stream — When is batch still the right answer? Name one reason streaming is not worth the complexity.
- Kafka — In one sentence each: what is a topic, a partition, and an offset?
- Action — Would you read Kafka directly in Databricks, or relay to ADLS2 then Snowpipe? What trade-off drives that choice?
- Windows — Why do stream processors use watermarks? What breaks without them?
- Priya — Priya wants a Power BI page that refreshes every minute. Which Gold output from this module supports that — and why DirectQuery instead of Import?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2).
| # | Question | Answer |
|---|---|---|
| 1 | What batch can’t give Marcus | Sub-hour dispatch signals — batch Gold refreshes overnight; dispatch needs minutes-old demand/activity counts, not yesterday’s aggregates. |
| 2 | When batch still wins | Historical analytics, cost sensitivity, simpler ops, NYC Taxi monthly reports — streaming adds Kafka ops, late data, exactly-once complexity. |
| 3 | Topic / partition / offset | Topic = named event stream; partition = ordered shard for parallelism; offset = sequential position of a record within a partition, which consumers track as their replay checkpoint. |
| 4 | Kafka in Databricks vs ADLS2 relay | Direct Structured Streaming from Kafka = lower latency, more streaming ops. ADLS2 landing + Snowpipe = higher latency, reuses batch skills. Trade-off: latency vs operational familiarity. |
| 5 | Watermarks | Handle late-arriving events — bound how long to wait for out-of-order data before closing a window. Without watermarks, windows never close or state grows unbounded. |
| 6 | Power BI every minute | Streaming Gold aggregate table (e.g. events-per-minute). DirectQuery hits live warehouse table; Import would snapshot stale data between refreshes. |
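
Trainer aid for row 5: watermarks are easier to explain with a worked trace. The sketch below counts events in 60-second tumbling windows and uses a watermark to decide when a window closes and when a late event is dropped. It is a simplified per-event model in plain Python, not Structured Streaming's micro-batch behavior. The window size, the 30-second allowed lateness, and the event times are hypothetical.

```python
WINDOW_SECONDS = 60
ALLOWED_LATENESS = 30  # hypothetical: how far the watermark trails the newest event


def count_per_window(event_times):
    """Count events per tumbling window; drop events whose window already closed."""
    open_windows = {}   # window start -> running count (state held in memory)
    closed = {}         # window start -> final count
    dropped = []
    max_event_time = 0
    for t in event_times:  # events in arrival order, times in seconds
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - ALLOWED_LATENESS
        start = t - t % WINDOW_SECONDS
        if start + WINDOW_SECONDS <= watermark:
            dropped.append(t)  # too late: its window is already final
        else:
            open_windows[start] = open_windows.get(start, 0) + 1
        # Finalize every window that ends at or before the watermark.
        for window_start in sorted(open_windows):
            if window_start + WINDOW_SECONDS <= watermark:
                closed[window_start] = open_windows.pop(window_start)
    return closed, open_windows, dropped


closed, still_open, dropped = count_per_window([5, 50, 70, 40, 125, 20])
```

The event at 40 s arrives late but inside the allowed lateness, so it still counts. The event at 20 s arrives after the first window closed and is dropped. Without the watermark, `open_windows` would never shrink, which is the unbounded state the answer key warns about.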
Whiteboard capture
- Marcus needs live =
- Batch still wins when =
- Databricks path = / Snowflake path =
- Watermark purpose =
Bridge to theory
“Batch vs streaming, Kafka fundamentals, Structured Streaming, Dynamic Tables, and dbt dynamic_table — then you build the same Bronze → Silver → Gold pattern on live events.”
Trainer note — dataset honesty
Say explicitly:
“YellowLine NYC would stream taxi GPS or dispatch events. We use Aiven user-activity events so every attendee gets a live Kafka topic without NYC TLC streaming infrastructure. The patterns transfer; the schema does not.”
Module 9 — Machine Learning (Optional)
Google Form: Submit reflection
Video: media/modules/mod-09-ml.mp4
Duration: 8 minutes
Format: Pairs → share → brief whole-group (Module 9 includes 10 min comparison after exercises)
When delivered: After main day (90 min block) or standalone advanced session
Prerequisites: Modules 2–3 required · Module 4 recommended (dbt feature table track)
Story order: Deliver after Module 8 when following YellowLine NYC Phase 2 timeline
Situation recap
Marcus wants to predict tip amounts on credit-card trips so operations can tune driver incentives and Priya can add a “predicted vs actual tip” view. Elena assigns Phase 2 ML. The lab uses existing NYC Taxi Silver data — same dataset as the main workshop, new use case (unlike Module 8’s Aiven proxy).
Trainer UI readiness
Before starting reflection, confirm attendees see:
- Databricks cluster with Runtime ML (15.x+) available; AI/ML → Experiments in sidebar
- Snowflake worksheets ready for ML.FORECAST SQL; USE AI FUNCTIONS privilege + CORTEX_USER role granted
- silver_nyc_taxi_enriched table exists in both Databricks and Snowflake schemas
Questions
- Situation — What business decision could tip prediction support for YellowLine NYC? Who consumes the output — Marcus, drivers, or Priya?
- Lifecycle — Where does ML sit relative to Bronze, Silver, Gold, and Power BI? Who builds the feature table vs who trains the model?
- Leakage — Why must total_amount not be a feature when predicting tip_amount? Name one other column that would leak the answer.
- Filter — Why train on credit card trips only? What would the model learn if cash trips were included?
- Tools — Match each to its primary ML role (no winner yet):
  - Databricks sklearn + MLflow
  - Snowflake Cortex ML (ML.FORECAST, ML.ANOMALY_DETECTION, …)
  - Snowflake Snowpark ML
  - dbt ml_features_tip_prediction
- Priya — How could predicted tips land in Power BI next to existing Gold KPIs? Batch scoring table vs live inference?
Trainer answers
Trainer only — Aligns with trainee Think & Discuss (module §2).
| # | Question | Answer |
|---|---|---|
| 1 | Business decision / consumer | Driver incentive tuning, pricing experiments — consumed by Marcus/ops (decisions) and Priya (predicted vs actual tip dashboard). |
| 2 | ML in medallion lifecycle | Feature table in Silver/Gold-adjacent schema (often dbt); training in Databricks/Snowflake; scored predictions written to Gold table; Power BI reads Gold like any KPI. Features = analytics engineer; model = ML engineer / DE with ML skills. |
| 3 | Why not total_amount as feature | Target leakage — total_amount = fare + tolls + tip; model would “cheat.” Also avoid post-trip fields that encode the tip indirectly. |
| 4 | Credit-card trips only | Cash trips often have tip = $0 or null in data — model would learn payment artifact, not tipping behavior. Filter payment_type = credit card for honest signal. |
| 5 | Tool roles | Databricks sklearn + MLflow = full algorithm choice + experiment tracking. Cortex ML.FORECAST = SQL-native, low code, less flexible. Snowpark ML = Python in Snowflake. dbt ml_features_tip_prediction = feature contract + tests, not training. |
| 6 | Predictions in Power BI | Batch scoring table gold.tip_predictions joined to enriched trips — simplest for workshop. Live inference = real-time endpoint — overkill for Marcus’s nightly ops review. |
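
Trainer aid for row 3: leakage is easiest to show with arithmetic. The sketch below uses the simplified identity from the answer key (total_amount = fare + tolls + tip) and shows that a "model" allowed to see `total_amount` recovers the tip exactly by subtraction. The sample trips, the field names `fare_amount` and `tolls_amount`, and the 15% baseline rate are assumptions for illustration, not workshop data.

```python
# Hypothetical trips following the simplified identity from the answer key:
# total_amount = fare + tolls + tip.
trips = [
    {"fare_amount": 10.0, "tolls_amount": 0.0, "tip_amount": 2.0},
    {"fare_amount": 25.0, "tolls_amount": 6.5, "tip_amount": 5.0},
    {"fare_amount": 8.0, "tolls_amount": 0.0, "tip_amount": 0.0},
]
for trip in trips:
    trip["total_amount"] = (
        trip["fare_amount"] + trip["tolls_amount"] + trip["tip_amount"]
    )


def leaky_predict(trip):
    """Cheating 'model': recovers the tip from total_amount by subtraction."""
    return trip["total_amount"] - trip["fare_amount"] - trip["tolls_amount"]


def honest_baseline(trip, tip_rate=0.15):
    """Naive baseline that only uses information known before the tip."""
    return trip["fare_amount"] * tip_rate


def mean_abs_error(predict):
    """Average absolute gap between predicted and actual tip."""
    return sum(abs(predict(t) - t["tip_amount"]) for t in trips) / len(trips)


leaky_mae = mean_abs_error(leaky_predict)
honest_mae = mean_abs_error(honest_baseline)
```

A zero error from `leaky_predict` looks perfect in the lab but is useless in production, because `total_amount` does not exist until the tip is already known.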
Whiteboard capture
- ML consumer =
- Feature table owner = / Model trainer =
- Leakage example =
- dbt role in ML = (features + tests, not training)
Bridge to theory
“Feature engineering, leakage, credit-card filter — then Databricks + MLflow, Snowflake Cortex vs Snowpark, dbt as the feature contract, and compare effort vs flexibility across all three.”
End-of-module mini-discussion (10 min — after ex-ml)
Run after trainees fill the Compare Your Results table in the exercise:
- “Lowest effort to first prediction — which approach?”
- “Most algorithm flexibility — which approach?”
- “SQL-only team — Cortex first or Snowpark?”
- “Would you always define features in dbt before training? Why?”
Trainer answers — end-of-module mini-discussion
| # | Question | Answer |
|---|---|---|
| 1 | Lowest effort to first prediction | Snowflake Cortex ML.FORECAST or similar SQL function — minutes if features exist. |
| 2 | Most algorithm flexibility | Databricks sklearn + MLflow — full Python ML ecosystem. |
| 3 | SQL-only team | Cortex first for baseline; Snowpark ML when custom sklearn needed inside Snowflake. |
| 4 | Features always in dbt? | Best practice yes — versioned feature definitions, tests, lineage to training sets; training can happen elsewhere but eats what dbt serves. |
Facilitator close:
“dbt defines what the model eats. Databricks and Snowflake define where it trains. Priya still reads predictions from Gold — same consumption story as KPIs.”
Trainer note — Cortex vs Module 6 AI
“Module 6 was Cortex LLM assistants. This module is predictive ML — sklearn, Snowpark ML, and ML.FORECAST. If trainees say ‘we already did Cortex,’ clarify the distinction.”
End of workshop — Final survey (Google Form)
Not a Think & Discuss step. Run at closing (after Mod 7, or after optional tracks).
| When | What |
|---|---|
| Last 5–10 min of day | QR: End of Workshop Survey |
| Trainees | Rate each module (grid), overall workshop, takeaway, recommend, pace, optional written suggestions |
| Facilitator | Link to Sheets; review before next cohort — see google-forms-reflection-design.md |
Script:
“Please rate each module you attended, the workshop overall, how much you took away, whether you’d recommend it to a colleague, and whether the pace was right for you. Optional box at the end — tell us what to improve. Skip Mod 8–9 rows if you didn’t do those labs.”
Document History
| Date | Change |
|---|---|
| 2026-06-05 | Added trainer answer keys for Think & Discuss (all modules) |
| 2026-06-05 | Reframed as oral discussion guide (no Google Form at this stage); Quiz moved after Theory |
| 2026-05-31 | End-of-workshop final survey (Google Form) |
| 2026-05-23 | Initial facilitator reflection prompts for Story–7 |
| 2026-05-23 | Added Module 8 optional streaming reflection prompts |
| 2026-05-23 | Added Module 9 optional ML reflection prompts |
| 2026-05-23 | Aligned optional prerequisites (2–3 required, 4 recommended); Cortex 6 vs 9 notes |