# MHP Data Engineer Masterclass 2026 — Story-Driven Training Migration Plan

**Document purpose**: Validate the NYC Taxi use-case storyline, define the narrative arc for an
intro animation and the full training day, and plan migration from the current `workshop-2026-v1/` materials
to a new story-driven training folder without disrupting existing content.

**Status**: Active v2.7 — **Phases 1, 2 & 4 complete**; Phase 3 media (MP4) pending  
**Date**: 2026-05-23 (status refresh: 2026-05-24)  
**Audience**: Trainers, curriculum designers, content authors  
**Legacy material (frozen — do not edit)**: [`workshop-2026-v1/`](../../workshop-2026-v1/)  
**New story-driven material**: [`workshop-2026-v2/`](../../workshop-2026-v2/) *(Quarto site + trainer guides)*

---

## 1. Executive Summary — Is This a Good Idea?

**Verdict: Yes — with minor refinements.**

Your story-driven approach is pedagogically strong and aligns well with the repository you already
have. The NYC Taxi dataset is a proven teaching dataset, and the existing repo already implements
three parallel Bronze → Silver → Gold pipelines on Databricks, Snowflake, and dbt that produce the
same 12 Gold KPIs, which is exactly what the story needs for a fair tool comparison at the end.

### Why the concept works

| Strength | Why it matters |
|----------|----------------|
| **Relatable business problem** | Revenue optimization and operational efficiency are easy for trainees to understand without domain expertise. |
| **Progressive tool introduction** | Each stakeholder constraint (scale → SQL skills → governance) naturally motivates the next tool chapter. |
| **Same dataset, three implementations** | Enables apples-to-apples comparison — already implemented in this repo. |
| **Medallion as architectural spine** | Gives a consistent mental model across all three tools; theory lands before hands-on. |
| **Trainee design question upfront** | Flips the classroom from passive demo to active engineering thinking. |
| **MHP consulting frame** | Positions trainees as practitioners solving a client problem, not just clicking through labs. |

### Recommended refinements

1. **Clarify dbt's role in the story** — dbt is not a replacement for Databricks or Snowflake; it is a
   transformation layer that runs *on* a warehouse. In the story, Bob should say: *"We keep Snowflake
   as the engine, but use dbt for SQL-first transformations, tests, and lineage."*

2. **Keep the Data Architect visible** — Elena (Data Architect) should own medallion and tool-selection
   decisions; Bob executes. This mirrors real project roles and prevents "junior picks tools" from
   feeling unrealistic.

3. **Front-load the trainee design exercise** — After the animation, pause for 10–15 minutes: *"What
   would you design?"* Capture answers on a whiteboard, then reveal how MHP's team structured the
   solution. Revisit the whiteboard at wrap-up.

4. **BI Analyst delivers the payoff** — Priya (BI Analyst) connects Gold KPIs to Power BI early in
   the story as the *north star*, so every pipeline module ends with "does this answer Priya's
   questions?"

5. **Optional modules stay optional** — Streaming (Module 8) and ML (Module 9) can be framed as
   "Phase 2" of the NYC Taxi engagement without breaking the main narrative.

6. **Repeat the same four-step rhythm every module** — Animation → reflection → theory →
   exercises. Trainees always know what comes next.

---

## 2. Standard Module Delivery Pattern

Every module (0–7) follows the same classroom rhythm. This is the core pedagogical design.

```{mermaid}
flowchart LR
    A["1. Animation<br>2–4 min"] --> B["2. Think & Discuss<br>5–10 min"]
    B --> C["3. Theory<br>10–20 min"]
    C --> D["4. Practice<br>remainder"]

    style A fill:#fbbf24,color:#000
    style B fill:#60a5fa,color:#000
    style C fill:#a78bfa,color:#000
    style D fill:#34d399,color:#000
```

### Step 1 — Play module animation (2–4 min)

- One short video **before each module**, not only at the start of the day.
- Sets the client situation and constraint that motivates the module's theory and tool.
- Priya (BI Analyst) appears in an animation whenever new Gold KPIs land or the Power BI dashboard
  advances, because she is building the dashboard in parallel with Bob's pipeline work.

### Step 2 — Trainee reflection (5–10 min)

Facilitator prompts (adapt per module):

- *What is the situation Marcus's team faces right now?*
- *What is the hardest part of this challenge?*
- *What would you try first?*
- *What could go wrong?*

Capture 2–3 answers on a whiteboard or Miro. Do **not** reveal the solution yet — let theory and
labs confirm or challenge their ideas.

### Step 3 — Introduce module theory (10–20 min)

- Teach the concept the animation raised: medallion, ETL/ELT, Databricks ingest, Snowflake SQL,
  dbt lineage, production scheduling, AI features, etc.
- Link explicitly to the whiteboard: *"You mentioned X — here's how the industry handles it."*

### Step 4 — Practice exercises (remainder of module time)

- Hands-on labs where the module has exercises (Modules 2–6).
- Modules 0–1 and 7 may be lighter on coding; Module 7 is discussion-heavy.
- After Gold KPI tables exist, remind trainees: *Priya connects these to Power BI — same schema,
  any platform.*

### Power BI thread across the day

Priya's dashboard is not a single demo at the end. It unfolds in **animation** as KPIs land:

| After module | Animation beat (Priya) | Gold KPIs used |
|--------------|------------------------|----------------|
| Module 2 (Databricks) | First charts populate — trips by hour, overview cards | `kpi_trips_by_hour`, `kpi_trips_by_day` |
| Module 3 (Snowflake) | Map page — borough and zone visuals | `kpi_borough_analysis`, `kpi_top_pickup_zones` |
| Module 4 (dbt) | Revenue and payment pages; data quality scorecard | `kpi_revenue_by_hour`, `kpi_payment_type_analysis`, `kpi_data_quality_metrics` |
| Module 7 (Wrap-up) | Full dashboard walkthrough + live trainee connection | All 12 KPIs |

Trainees **build the data pipeline**; in the story, Priya **consumes Gold in Power BI**. Optionally,
trainers can demo the build-along from `powerbi/README.md` after Module 4 or in Module 7.

**Priya's five KPI questions** (use in reflections and exercises):

1. When are our peak revenue hours?
2. Which pickup zones drive the most trips?
3. Which routes are most popular?
4. How does revenue vary by borough and distance band?
5. Are we losing money on bad data (null fares, zero-distance trips)?

All five map to the 12 Gold tables in [DATA_MODEL.md](DATA_MODEL.md).
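
As an illustration of how one of these questions becomes a Gold aggregation, the sketch below answers question 1 (peak revenue hours) in plain Python. It assumes Silver rows carry the TLC columns `tpep_pickup_datetime` and `total_amount`. The function name `revenue_by_hour`, the output shape, and the sample rows are hypothetical; the real `kpi_revenue_by_hour` schema is whatever DATA_MODEL.md defines.

```python
from collections import defaultdict
from datetime import datetime


def revenue_by_hour(trips):
    """Sum total_amount per pickup hour (0-23) for an iterable of trip dicts."""
    totals = defaultdict(float)
    for trip in trips:
        hour = datetime.fromisoformat(trip["tpep_pickup_datetime"]).hour
        totals[hour] += trip["total_amount"]
    # Sort by revenue, highest first, so the peak hours lead the list.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows for illustration only.
sample_trips = [
    {"tpep_pickup_datetime": "2024-01-05 08:15:00", "total_amount": 18.5},
    {"tpep_pickup_datetime": "2024-01-05 08:40:00", "total_amount": 22.0},
    {"tpep_pickup_datetime": "2024-01-05 17:05:00", "total_amount": 31.0},
]
peak_hours = revenue_by_hour(sample_trips)
print(peak_hours)
```

In the workshop the same grouping would be expressed in PySpark, SQL, or a dbt model; the point for reflections is that each of Priya's questions reduces to a group-by over Silver.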

### Final session — Open tool comparison (Module 7, ~30 min)

After all modules, **trainees lead** a facilitated open discussion — not a lecture.

**Facilitator opens**:

> "MetroYellow asked MHP to evaluate Databricks, Snowflake, and dbt on the same NYC Taxi use
> case. You built all three. How would **you** choose tools for a real project?"

**Discussion prompts** (pick 4–6; allow debate):

1. If Marcus's team is SQL-only, what do you recommend and why?
2. Where did Databricks clearly win? Where did it feel like overkill?
3. What did dbt add that Snowflake alone did not give you?
4. What would you run in production nightly — one stack or a combination?
5. What would you decide differently for streaming (Module 8) or ML (Module 9)?
6. Cost, skills, governance, speed to first dashboard — rank these for **your** context.

**Close**: Elena summarizes patterns heard from the room; there is no single "correct" stack. Revisit
the Module 0 whiteboard (*"What would you design?"*) and ask what trainees would change now.

---

## 3. Use Case Overview

### Client

**MetroYellow NYC** — a fictional NYC yellow taxi operator (inspired by TLC public trip data, not a
real company). They operate a fleet across all five boroughs and compete with ride-hail apps on
price, wait time, and driver earnings.

### Business challenge

Trip volume is stable, but **revenue per mile** and **driver utilization** vary sharply by hour,
borough, and route. Leadership suspects they are:

- Over-supplying taxis in low-demand zones at the wrong times
- Under-pricing or mis-allocating fleet during peak windows
- Losing visibility because reports are built manually from spreadsheets

They need an **analytics platform** — not a one-off report — that their SQL-heavy internal team can
maintain after MHP leaves.

### Success criteria (what "done" looks like)

| # | Criterion | How the training proves it |
|---|-----------|----------------------------|
| 1 | Ingest TLC trip data reliably | Bronze layer in all three tool tracks |
| 2 | Clean, enriched trip records | Silver layer with quality rules + zone lookup |
| 3 | Answer 12 operational KPI questions | Gold layer (identical KPIs across platforms) |
| 4 | Executive dashboard | Power BI demo on Gold tables |
| 5 | Team can extend pipelines | Snowflake SQL track + dbt maintainability story |
| 6 | Documented lineage and quality | dbt docs, tests, lineage graph |
| 7 | Informed platform choice | Module 7 open discussion |
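
To make criterion 2 concrete, here is a minimal sketch of Silver-style quality rules plus a zone lookup in plain Python. It assumes the TLC column names `fare_amount`, `trip_distance`, and `PULocationID`, and a lookup keyed by `LocationID` with `Borough` and `Zone` fields. The two rules shown (null fare, zero-distance trip) are taken from Priya's question 5; the function names, output column names, and sample rows are hypothetical, and the rules in the repo pipelines may differ.

```python
def is_valid_trip(trip):
    """Quality rule: reject null fares and zero-distance trips."""
    fare = trip.get("fare_amount")
    distance = trip.get("trip_distance")
    return fare is not None and distance is not None and distance > 0


def to_silver(bronze_trips, zone_lookup):
    """Keep valid trips and enrich each with pickup borough and zone.

    zone_lookup maps LocationID -> {"Borough": ..., "Zone": ...}.
    Returns (silver_rows, rejected_count).
    """
    silver, rejected = [], 0
    for trip in bronze_trips:
        if not is_valid_trip(trip):
            rejected += 1
            continue
        zone = zone_lookup.get(trip["PULocationID"], {})
        silver.append({
            **trip,
            "pickup_borough": zone.get("Borough", "Unknown"),
            "pickup_zone": zone.get("Zone", "Unknown"),
        })
    return silver, rejected


# Made-up sample rows for illustration only.
zones = {161: {"Borough": "Manhattan", "Zone": "Midtown Center"}}
bronze = [
    {"PULocationID": 161, "fare_amount": 12.5, "trip_distance": 2.1},
    {"PULocationID": 161, "fare_amount": None, "trip_distance": 1.0},
    {"PULocationID": 999, "fare_amount": 8.0, "trip_distance": 0.0},
]
silver_rows, rejected_count = to_silver(bronze, zones)
print(len(silver_rows), rejected_count)
```

Returning the rejected count alongside the clean rows keeps the data-quality story visible, which is what a table like `kpi_data_quality_metrics` would later report on.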

### Data source (unchanged from current repo)

```
Azure ADLS2: mhpdeworkshopsa / nyc-taxi-data
├── raw/trips/          → yellow_tripdata_*.parquet
└── raw/lookup/         → taxi_zone_lookup.csv
```

See [DATA_MODEL.md](DATA_MODEL.md) and [ARCHITECTURE.md](ARCHITECTURE.md) for technical detail.
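
For trainers wiring up ingest, the layout above translates into ADLS Gen2 URIs roughly as follows. This sketch assumes `mhpdeworkshopsa` is the storage account and `nyc-taxi-data` is the container, which the listing implies but does not state; `abfss_uri` is a hypothetical helper, not something in the repo.

```python
STORAGE_ACCOUNT = "mhpdeworkshopsa"  # assumed to be the storage account name
CONTAINER = "nyc-taxi-data"          # assumed to be the container name


def abfss_uri(path):
    """Build an ABFSS URI for a path inside the workshop container."""
    return (
        f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


TRIPS_URI = abfss_uri("raw/trips/")
ZONES_URI = abfss_uri("raw/lookup/taxi_zone_lookup.csv")
print(TRIPS_URI)
print(ZONES_URI)
```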

---

## 4. Characters & Roles

Use these names consistently in the animation, slides, and trainer narration.

| Character | Role | Narrative function |
|-----------|------|-------------------|
| **Marcus Chen** | Operations Manager, MetroYellow NYC | Client sponsor; states business pain and constraints |
| **Elena Vasquez** | Data Architect, MHP | Designs medallion architecture; leads tool evaluation |
| **Bob Müller** | Junior Data Engineer, MHP | Hands-on builder; trainee proxy — learns by doing |
| **Sofia Alvarez** | Senior Data Engineer, MHP | Mentors Bob on Databricks / PySpark patterns |
| **Priya Sharma** | BI Analyst, MHP | Defines KPIs; **builds Power BI dashboard in animations** as Gold tables become available |
| **James Okonkwo** | Data Analyst, MHP | Explores data, validates KPI logic with SQL |

**Trainee instruction**: *"Today you are Bob. Elena has designed the architecture; your job is to
implement and evaluate the tools."*

---

## 5. Detailed Storyline — Full Training Day

Each act below follows the **four-step module pattern**: animation → reflection → theory → practice.
Per-module animation briefs are in Section 7.

### Act 0 — Module 0: Welcome & Setup

| Step | Content |
|------|---------|
| **Animation** | `mod-00-welcome.mp4` — Full use-case intro (Marcus, MHP team, medallion preview, *"What would you design?"*) |
| **Reflection** | Trainee design worksheet (10–15 min): sources, layers, tools, risks |
| **Theory** | Roles, agenda, environment setup, repo tour |
| **Practice** | Codespaces / devcontainer, `.env`, verify CLI connections |

---

### Act 1 — Module 1: Data Engineering Fundamentals

| Step | Content |
|------|---------|
| **Animation** | `mod-01-fundamentals.mp4` — Elena whiteboards Bronze → Silver → Gold; Priya lists KPI questions |
| **Reflection** | *What layers would you create? What belongs in Silver vs Gold?* |
| **Theory** | ETL vs ELT, medallion architecture, NYC Taxi dataset, 12 KPI overview |
| **Practice** | Explore dataset schema; optional short SQL/preview queries — no full pipeline yet |

**Priya animation tag**: Dashboard wireframe empty — *"Waiting for Gold tables from Bob."*

---

### Act 2 — Module 2: Databricks Pipeline

| Step | Content |
|------|---------|
| **Animation** | `mod-02-databricks.mp4` — Bob asks to try Databricks; Sofia pairs; Elena approves |
| **Reflection** | *How would you ingest Parquet from ADLS2 at scale? What goes wrong in Bronze?* |
| **Theory** | Lakehouse, Unity Catalog, Delta Lake, PySpark transform patterns |
| **Practice** | Bronze → Silver → Gold notebooks (`ex-databricks`) |

**Priya animation tag**: Overview page lights up — trips-by-hour line chart, day-of-week bars from first Gold KPIs.

---

### Act 3 — Module 3: Snowflake Pipeline

| Step | Content |
|------|---------|
| **Animation** | `mod-03-snowflake.mp4` — Marcus: *"We need SQL, not notebooks"* |
| **Reflection** | *How do you keep the same KPIs without PySpark? Who maintains this after MHP leaves?* |
| **Theory** | Snowflake architecture, external stages, SQL vs Snowpark |
| **Practice** | Bronze → Silver → Gold SQL / Snowpark (`ex-snowflake`) |

**Priya animation tag**: Map page — borough filled map, top pickup zones; Priya confirms same schema as Databricks Gold.

---

### Act 4 — Module 4: dbt Pipeline

| Step | Content |
|------|---------|
| **Animation** | `mod-04-dbt.mp4` — Marcus demands lineage and documentation; Elena adds dbt on Snowflake |
| **Reflection** | *How do you prove where a KPI number comes from? What tests would you add?* |
| **Theory** | dbt models, tests, docs, lineage; dbt as layer **on** warehouse, not replacement |
| **Practice** | dbt project run, `dbt test`, `dbt docs generate` (`ex-dbt`) |

**Priya animation tag**: Revenue and payment pages; quality scorecard from `kpi_data_quality_metrics`.

---

### Act 5 — Module 5: Production Patterns

| Step | Content |
|------|---------|
| **Animation** | `mod-05-production.mp4` — Elena: *"What runs every night without us?"* |
| **Reflection** | *What breaks in production that works in a notebook? How do you schedule and monitor?* |
| **Theory** | Workflows, Tasks, DLT, incremental loads, CI/CD for dbt |
| **Practice** | Production patterns walkthrough (`ex-production`) |

---

### Act 6 — Module 6: AI Features

| Step | Content |
|------|---------|
| **Animation** | `mod-06-ai.mp4` — Marcus asks if AI can help analysts explore data faster |
| **Reflection** | *Where is AI useful vs hype in this pipeline?* |
| **Theory** | Cortex, Genie, Copilot — augment, not replace, the medallion stack |
| **Practice** | AI feature exercises (`ex-ai-features`) |

**Priya animation tag**: Brief clip — natural-language question over dashboard data.

### Trainer note — Cortex vs Module 9 ML

Module 6 uses Cortex **LLM** assistants. Module 9 (optional) uses Cortex **ML** functions.
See [`official-docs-audit-and-optimizations.md`](./official-docs-audit-and-optimizations.md) § Module 6 & 9.

---

### Act 7 — Module 7: Power BI Payoff & Open Tool Discussion

| Step | Content |
|------|---------|
| **Animation** | `mod-07-wrapup.mp4` — Priya presents finished dashboard to Marcus; all 12 KPIs visible |
| **Reflection** | *(Silent 2 min)* — *Write down: which tool would you pick for MetroYellow and one sentence why.* |
| **Theory** | Short trainer recap only (5 min): three pipelines, one dataset — reference comparison table |
| **Practice** | **Open discussion** (20–25 min): trainees compare tools and defend choices; optional Power BI live demo |

**Discussion structure** (facilitator notes):

1. Collect 2–3 volunteer recommendations for MetroYellow's production stack.
2. Challenge with constraints: SQL-only team, budget, ML later, audit requirements.
3. Poll: *If you started greenfield tomorrow, rank Databricks / Snowflake / dbt.*
4. Revisit Module 0 whiteboard — what changed in their design?

**Trainer demo (optional, 10 min)**: Live or pre-recorded Power BI on Gold tables — see `powerbi/README.md`.

---

### Act 8 — Optional Phase 2 (Modules 8–9)

**Story framing**: Marcus asks about **live fleet demand** and **tip prediction** — scheduled for
the next sprint after the batch platform is live.

| Module | Story hook | Legacy source |
|--------|------------|---------------|
| 8 Streaming | Real-time zone demand signals | `modules/08-streaming-optional.qmd`, `streaming/` |
| 9 ML | Predict tip likelihood / driver incentives | `modules/09-ml-optional.qmd`, `ml/` |

#### Module 8 — four-step pattern *(optional 90 min session)*

| Step | Content |
|------|---------|
| **Animation** | `mod-08-streaming.mp4` — Marcus needs live dispatch; Phase 2 streaming |
| **Reflection** | Batch vs streaming; Kafka basics; when *not* to stream |
| **Theory** | Kafka, Structured Streaming, Snowpipe relay, Dynamic Tables, dbt `dynamic_table` |
| **Practice** | `ex-streaming` — Databricks + Snowflake + dbt; live Power BI DirectQuery demo |

**Dataset note**: Story = MetroYellow live demand. Lab = **Aiven user-activity Kafka events** (teaching
proxy — same patterns, different schema). Trainers must say this explicitly.

**Prerequisites**: Modules **2–3** required · Module **4** recommended (dbt `dynamic_table` track)

**Story order**: Deliver **Module 8 before Module 9** (live ops before tip prediction in MetroYellow Phase 2).

**Trainer docs**: [`workshop-2026-v2/trainer/reflection-prompts.md`](../workshop-2026-v2/trainer/reflection-prompts.md) § Module 8,
[`development-docs/04_production/animation-production-scripts.md`](../../development-docs/04_production/animation-production-scripts.md) § Module 8.
Editorial rules: [`development-docs/03_operations/module-prerequisites-and-order.md`](../../development-docs/03_operations/module-prerequisites-and-order.md).

#### Module 9 — four-step pattern *(optional 90 min session)*

| Step | Content |
|------|---------|
| **Animation** | `mod-09-ml.mp4` — Marcus wants tip prediction; leakage and feature ownership |
| **Reflection** | ML lifecycle, leakage, credit-card filter, tool roles |
| **Theory** | sklearn + MLflow, Cortex ML vs Snowpark ML, dbt feature table |
| **Practice** | `ex-ml` — Databricks, Snowflake, dbt features; 10 min comparison discussion |

**Dataset note**: Same NYC Taxi **Silver** as main workshop — predict `tip_amount` on credit-card
trips only. No proxy dataset (contrast with Module 8).
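
The credit-card filter and the leakage point can be sketched in a few lines of plain Python. This assumes the TLC encoding where `payment_type == 1` means credit card, and it treats `total_amount` as leaky because that total includes the tip. The function name, the feature list, and the sample rows are hypothetical; the `ex-ml` labs may structure this differently.

```python
CREDIT_CARD = 1  # TLC payment_type code for credit card
LEAKY_COLUMNS = {"tip_amount", "total_amount"}  # the target, and a total that includes it


def build_training_rows(silver_trips, feature_columns):
    """Return (features, target) pairs for credit-card trips only."""
    leaked = LEAKY_COLUMNS & set(feature_columns)
    if leaked:
        raise ValueError(f"Leaky feature columns: {sorted(leaked)}")
    pairs = []
    for trip in silver_trips:
        if trip["payment_type"] != CREDIT_CARD:
            continue  # cash tips are not recorded, so these rows would mislead the model
        features = {col: trip[col] for col in feature_columns}
        pairs.append((features, trip["tip_amount"]))
    return pairs


# Made-up sample rows for illustration only.
sample_trips = [
    {"payment_type": 1, "trip_distance": 2.0, "fare_amount": 10.0,
     "tip_amount": 2.0, "total_amount": 12.0},
    {"payment_type": 2, "trip_distance": 3.0, "fare_amount": 14.0,
     "tip_amount": 0.0, "total_amount": 14.0},
]
training_pairs = build_training_rows(sample_trips, ["trip_distance", "fare_amount"])
print(training_pairs)
```

Failing loudly on a leaky column, rather than silently dropping it, gives trainees a concrete moment to discuss why `total_amount` cannot be a feature.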

**Prerequisites**: Modules **2–3** required · Module **4** recommended (dbt feature table track)

**Story order**: Deliver **after Module 8** when running both optional sessions.

**Trainer docs**: [`workshop-2026-v2/trainer/reflection-prompts.md`](../workshop-2026-v2/trainer/reflection-prompts.md) § Module 9,
[`development-docs/04_production/animation-production-scripts.md`](../../development-docs/04_production/animation-production-scripts.md) § Module 9.
Editorial rules: [`development-docs/03_operations/module-prerequisites-and-order.md`](../../development-docs/03_operations/module-prerequisites-and-order.md).

---

## 6. Story-to-Module Mapping (At a Glance)

Each row includes the **four-step flow** time split (approximate).

| Time | Module | Animation | Reflect | Theory | Practice | Priya / Power BI |
|------|--------|-----------|---------|--------|----------|------------------|
| 09:00 | 0 Welcome | 4 min | 12 min | 8 min | 6 min | Wireframe shown |
| 09:30 | 1 Fundamentals | 3 min | 8 min | 15 min | 4 min | KPI list pinned |
| 10:00 | 2 Databricks | 3 min | 7 min | 15 min | 50 min | Overview charts |
| 11:30 | 3 Snowflake | 3 min | 7 min | 15 min | 50 min | Map pages |
| 13:30 | 4 dbt | 3 min | 7 min | 15 min | 50 min | Revenue + quality |
| 15:00 | 5 Production | 3 min | 7 min | 15 min | 20 min | — |
| 15:45 | 6 AI Features | 3 min | 7 min | 15 min | 20 min | NL query clip |
| 16:30 | 7 Wrap-up | 4 min | 5 min | 5 min | 16 min | **Full dashboard + open discussion** |
| — | 8 Streaming *(opt.)* | 3 min | 8 min | 20 min | 59 min | DirectQuery live page |
| — | 9 ML *(opt.)* | 3 min | 8 min | 20 min | 49 min | Predicted vs actual tips page |
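
A quick way to sanity-check the main-day rows is to sum the four steps per module and compare the result with the gap to the next start time. The sketch below copies the Module 0–7 rows from the table; the helper names are hypothetical. The table does not say what the unallocated minutes are for; breaks and lunch are a plausible reading.

```python
from datetime import datetime

# (start, module, animation, reflect, theory, practice) copied from the table above.
SCHEDULE = [
    ("09:00", "0 Welcome", 4, 12, 8, 6),
    ("09:30", "1 Fundamentals", 3, 8, 15, 4),
    ("10:00", "2 Databricks", 3, 7, 15, 50),
    ("11:30", "3 Snowflake", 3, 7, 15, 50),
    ("13:30", "4 dbt", 3, 7, 15, 50),
    ("15:00", "5 Production", 3, 7, 15, 20),
    ("15:45", "6 AI Features", 3, 7, 15, 20),
    ("16:30", "7 Wrap-up", 4, 5, 5, 16),
]


def minutes_between(start, end):
    """Whole minutes from one HH:MM time to a later one on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def slot_report(schedule):
    """Return (module, content_minutes, unallocated_minutes) per row.

    unallocated_minutes is None for the last row, which has no next start time.
    """
    report = []
    for i, (start, name, *steps) in enumerate(schedule):
        content = sum(steps)
        if i + 1 < len(schedule):
            slack = minutes_between(start, schedule[i + 1][0]) - content
        else:
            slack = None
        report.append((name, content, slack))
    return report


report = slot_report(SCHEDULE)
for row in report:
    print(row)
```

A negative unallocated value would mean a module's four steps no longer fit before the next start time, which is the first thing to check after editing any duration.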

### Optional advanced sessions

Modules 8 and 9 use the **same four-step pattern** when delivered. Trainer scaffolding status:

| Module | Animation script | Reflection prompts | Notes |
|--------|------------------|--------------------|-------|
| 8 Streaming | ✅ `mod-08-streaming.mp4` | ✅ Module 8 section | Aiven Kafka lab; **2–3 req, 4 rec**; deliver before 9 |
| 9 ML | ✅ `mod-09-ml.mp4` | ✅ Module 9 section | Same Silver; **2–3 req, 4 rec**; Cortex ML ≠ Module 6 LLM |

**End-to-end data flow** (reference for trainers):

```{mermaid}
flowchart TB
    subgraph sources [Sources]
        ADLS2[("Azure ADLS2<br>NYC Taxi Parquet + Zone CSV")]
    end

    subgraph medallion [Medallion — Bob builds]
        B[Bronze]
        S[Silver]
        G[Gold — 12 KPIs]
    end

    subgraph consume [Consumption — Priya builds]
        PBI[Power BI Dashboard]
    end

    ADLS2 --> B --> S --> G --> PBI
```

---

## 7. Animation Library — Storyboards

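Produce **ten module videos** (Modules 0–9), not a single intro-only clip. Storyboards for Modules
0–7 follow; scripts for optional Modules 8–9 live in
`development-docs/04_production/animation-production-scripts.md`. Shared production notes: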

- Style: clean 2D motion graphics; NYC yellow + MHP navy
- Length: 2–4 min each (Module 0 up to 4 min; others 2–3 min)
- Export: 1920×1080 MP4, 30 fps, English subtitles; optional German track
- Location: `workshop-2026-v2/media/modules/mod-XX-*.mp4`

### Module 0 — `mod-00-welcome.mp4` (full intro, ~4 min)

| Scene | Visual | Voiceover |
|-------|--------|-----------|
| 1 | NYC dispatch, Excel chaos | Marcus: millions of trips, no single source of truth |
| 2 | Zone heatmap | Where to send drivers? When? |
| 3 | MHP team assembles | Analytics foundation, not one report |
| 4 | Priya pins KPI cards | What questions must we answer? |
| 5 | Elena draws medallion | Bronze → Silver → Gold |
| 6 | Tool icons montage | Then choose how to build |
| 7 | Title card | *How would you design the solution?* |

### Module 1 — `mod-01-fundamentals.mp4`

Elena explains each medallion layer with taxi icons. Priya's empty Power BI wireframe. Tool
landscape icons (Databricks, Snowflake, AWS, Cloudera) — evaluation, not decision yet.

### Module 2 — `mod-02-databricks.mp4`

Bob pitches Databricks; Sofia joins. Notebook montage: ADLS2 → Bronze Delta → Silver → Gold.
**Priya**: first visuals on Overview page from `kpi_trips_by_hour`.

### Module 3 — `mod-03-snowflake.mp4`

Marcus: SQL maintainability. Snowflake worksheet montage. **Priya**: Map page connects to same
Gold schema — *"Identical KPIs, different engine."*

### Module 4 — `mod-04-dbt.mp4`

Marcus: lineage and audit. dbt lineage graph animation. **Priya**: revenue pages + quality
scorecard; lineage arrow from dashboard tile to dbt model.

### Module 5 — `mod-05-production.mp4`

Calendar, cron, alert icons. Workflow/Task scheduling without MHP in the room.

### Module 6 — `mod-06-ai.mp4`

Analyst asks question in natural language; AI suggests SQL. Framed as accelerator, not replacement.

### Module 7 — `mod-07-wrapup.mp4`

Priya presents full five-page dashboard to Marcus and Elena. Marcus: *"Now help us choose what
to run in production."* Fade to classroom: *"What would you choose?"*

---

## 8. Migration Plan — `workshop-2026-v1/` → `workshop-2026-v2/`

### 8.1 Principles

1. **Do not modify `workshop-2026-v1/`** — legacy module-centric site stays frozen. All editorial and story
   changes go in `workshop-2026-v2/` only (see [`development-docs/03_operations/module-prerequisites-and-order.md`](../../development-docs/03_operations/module-prerequisites-and-order.md)).
2. **Reuse, don't rewrite** — pipeline code (`databricks/`, `snowflake/`, `dbt_project/`) stays in
   place; only training narrative and website structure change in `workshop-2026-v2/`.
3. **Story wrapper around existing labs** — ~70% of technical content can migrate with editorial
   reframing, not re-authoring.
4. **Single source of truth for KPIs** — keep `DATA_MODEL.md` and `ARCHITECTURE.md` at repo root;
   link from new site, do not duplicate schemas.
5. **Vendor alignment** — when writing theory in `workshop-2026-v2/modules/`, follow
   [`workshop-2026-v2/development-docs/01_system-design/official-docs-audit-and-optimizations.md`](../workshop-2026-v2/development-docs/01_system-design/official-docs-audit-and-optimizations.md)
   (naming, Cortex LLM vs ML, timing).

### 8.2 Target folder structure

**As built (2026-05-23)** — differs slightly from original plan; optional folders marked *planned*:

```
workshop-2026-v2/
├── _quarto.yml                 # Story-driven site config
├── index.qmd                   # MetroYellow landing + agenda
├── styles.css                  # Story callouts (reflect, priya, story)
├── _scaffold_generate.py       # Regenerates modules + exercises from workshop-2026-v1/
│
├── story/                      # ✅ narrative pages (use case, characters, storyline, storyboards)
├── setup/                      # ✅ synced from shared/setup/ (via _sync_shared.py)
├── reference/                  # ✅ synced from shared/reference/ (via _sync_shared.py)
│
├── media/
│   ├── modules/                # ✅ README + filenames; MP4s not yet produced
│   ├── characters/             # *planned*
│   ├── powerbi/                # *planned* — screenshots for animation compositing
│   └── diagrams/               # *planned*
│
├── docs/                       # ✅ prerequisites, official-docs audit
└── trainer/                    # ✅ reflection, voiceovers, open-discussion (.md; .qmd optional)
    ├── module-delivery-pattern.md   # ✅ four-step flow cheat sheet
    ├── facilitator-guide.md         # ✅ day-of runbook
    └── whiteboard-prompts.md        # ✅ Module 0 worksheet + day-end revisit
```

**Original target** (for reference — some items deferred):

```
workshop-2026-v2/
├── story/                      # characters.qmd, use-case-metro-yellow.qmd, etc.
├── setup/                      # copy from workshop-2026-v1/setup/
├── reference/                  # copy from workshop-2026-v1/reference/
└── trainer/*.qmd               # Quarto trainer pages (currently .md)
```

**Quarto render exclusion**: Add `trainer/` to `render:` exclude list in `_quarto.yml` (same pattern
as `workshop-2026-v1/` excludes `demo/`).

### 8.3 Content reuse matrix

| Content type | Action | Source | Destination |
|--------------|--------|--------|-------------|
| Pipeline notebooks & SQL | **Keep in place** | `databricks/`, `snowflake/`, `dbt_project/` | No move |
| Root architecture docs | **Keep in place** | `ARCHITECTURE.md`, `DATA_MODEL.md`, `SETUP.md` | Link from new site |
| Module technical steps | **Merge via generator** | `workshop-2026-v1/modules/*.qmd` | `workshop-2026-v2/modules/*.qmd` (§3 Theory) |
| Exercises | **Merge via generator** | `workshop-2026-v1/exercises/*.qmd` | `workshop-2026-v2/exercises/*.qmd` (full lab content) |
| Setup guides | **Link to legacy site** | `workshop-2026-v1/setup/*.qmd` | Navbar → GitHub Pages URLs (local copy optional) |
| Reference pages | **Link to legacy site** | `workshop-2026-v1/reference/*.qmd` | Navbar → GitHub Pages URLs (local copy optional) |
| Site config | **New nav** | `workshop-2026-v1/_quarto.yml` | `workshop-2026-v2/_quarto.yml` |
| Module videos | **Create new (×10)** | — | `workshop-2026-v2/media/modules/` |
| Power BI assets | **Keep in place** | `powerbi/` | Animation compositing + Module 7 demo |
| Optional streaming/ML | **Keep in place** | `streaming/`, `ml/` | Link from Modules 8–9 |

### 8.4 Migration phases

#### Phase 0 — Planning

- [x] Validate storyline and document migration plan (this file)
- [x] Official vendor docs audit ([`official-docs-audit-and-optimizations.md`](../workshop-2026-v2/development-docs/01_system-design/official-docs-audit-and-optimizations.md))
- [x] Module prerequisites & editorial guide ([`module-prerequisites-and-order.md`](../workshop-2026-v2/development-docs/03_operations/module-prerequisites-and-order.md))
- [ ] Trainer review of characters, client name, recommended stack ending
- [ ] Approve animation style and AI video tool

#### Phase 1 — Scaffold

- [x] Create `workshop-2026-v2/` folder
- [x] `_quarto.yml`, `styles.css`, `index.qmd`, assets, `.gitignore`
- [x] `_scaffold_generate.py` — merge theory + exercises from `workshop-2026-v1/`; setup/reference sync moved to `_sync_shared.py`
- [x] Add `story/` narrative pages (`characters`, `use-case-metro-yellow`, `storyline-full`, `animation-storyboard`)
- [x] Configure Quarto preview on port **4201** (`_quarto.yml` + `preview.ps1` / `preview.sh`)
- [x] Update root `README.md` with dual-site table (legacy + story-driven)
- [x] Local `setup/` and `reference/` copied from frozen `workshop-2026-v1/` (via generator)

#### Phase 2 — Module reframing

For each module `00`–`09`:

- [x] **Four-step structure**: Animation → Reflection → Theory → Practice
- [x] **Reflection prompts** + Priya checkpoints on every module page
- [x] **Full theory** merged from `workshop-2026-v1/modules/` into Section 3
- [x] **Full exercises** merged into `workshop-2026-v2/exercises/`
- [x] **Editorial callouts** (LSDP, Horizon, Cortex 6 vs 9, Aiven, dbt Fusion, prereqs)
- [x] **Per-module official doc footers** (vendor links + link to local `exercises/`)
- [x] **Video embed infrastructure** — auto `{{< video >}}` when MP4 exists in `media/modules/`; placeholder until then

Priority modules `00`–`07` and optional `08`–`09`: **complete**.
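
The video embed fallback in the checklist above can be pictured as a small helper like the one below. This is a sketch under assumptions, not the code in `_scaffold_generate.py`: the function name `video_block`, the placeholder wording, and the site-root-relative path are all hypothetical, and the path may need adjusting relative to the page that embeds it.

```python
from pathlib import Path


def video_block(site_root, filename):
    """Return a Quarto video shortcode if the MP4 exists, else a placeholder note."""
    relative = Path("media") / "modules" / filename
    if (Path(site_root) / relative).is_file():
        # String concatenation avoids escaping the shortcode's double braces.
        return "{{< video " + relative.as_posix() + " >}}"
    return "*Animation `" + filename + "` is not yet available.*"


print(video_block(".", "mod-00-welcome.mp4"))
```

Because the check runs at generation time, dropping a finished MP4 into `media/modules/` and regenerating is enough to swap a placeholder for an embed.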

#### Phase 3 — Animation & media

- [x] Define visual style guide (characters, Priya's Power BI mockup, medallion diagram) — [`development-docs/04_production/animation-style-guide.md`](../workshop-2026-v2/development-docs/04_production/animation-style-guide.md)
- [x] Write Power BI screenshot capture instructions per dashboard page: [`media/powerbi/README.md`](../workshop-2026-v2/media/powerbi/README.md) *(screenshot capture itself still pending)*
- [ ] Produce Modules **0–9** videos from Section 7 storyboards (+ optional 8–9 scripts in trainer docs)
- [ ] Replace video placeholders with embeds at top of each module page
- [ ] Optional: single "recap montage" for marketing (cut from module clips)

#### Phase 4 — Trainer materials

- [x] `trainer/reflection-prompts.md` — Modules 0–9 think/discuss questions
- [x] `trainer/open-discussion-guide.md` — Module 7 facilitation
- [x] `docs/animation-production-scripts.md` — mod-00 … mod-09
- [x] `trainer/architecture-decision-matrix.md` — Module 7 structured worksheet
- [x] `trainer/module-delivery-pattern.md` — four-step flow cheat sheet
- [x] `trainer/facilitator-guide.md` — day-of runbook
- [x] `trainer/whiteboard-prompts.md` — Module 0 worksheet + Day-end revisit
- [x] `trainer/index.qmd` — trainer hub on rendered site

#### Phase 5 — QA & cutover

- [ ] Run full dry-run with co-trainer — checklist: [`development-docs/03_operations/dry-run-checklist.md`](../workshop-2026-v2/development-docs/03_operations/dry-run-checklist.md)
- [ ] Verify all code paths still work (Codespaces, `.env`, dbt targets)
- [x] `quarto render` story site; Module 9 markup fixed; link QA ongoing
- [ ] `quarto render` legacy `workshop-2026-v1/` site
- [x] Publish decision documented — [`development-docs/99_archive/publishing-cutover.md`](../workshop-2026-v2/development-docs/99_archive/publishing-cutover.md) (archived)
- [ ] Deploy both sites to GitHub Pages

#### Phase 6 — Post-delivery iteration

- [ ] Collect trainee feedback on story vs labs balance
- [ ] Trim or expand animation based on timing
- [ ] Archive superseded narrative drafts to `workshop-2026-v2/99_archive/`

### 8.5 What NOT to migrate

| Item | Reason |
|------|--------|
| `workshop-2026-v1/_site/` | Generated output — do not copy |
| `workshop-2026-v1/demo/` | Separate demo site — keep independent |
| `01_archive2025/` | Already archived 2025 material |
| Duplicate KPI/schema docs | Single source: root `DATA_MODEL.md` |

### 8.6 README and publishing updates (when cutover happens)

Suggested root `README.md` addition:

```markdown
## Training Websites

| Version | Folder | Description |
|---------|--------|-------------|
| Story-driven (2026 v2) | [workshop-2026-v2/](workshop-2026-v2/) | NYC Taxi use case narrative |
| Classic modules | [workshop-2026-v1/](workshop-2026-v1/) | Original module-centric site |
```

Preview commands:

```bash
quarto preview workshop-2026-v1/          # Legacy site — port 4200
quarto preview workshop-2026-v2/   # Story-driven site — port 4201
```

---

## 9. Module Editorial Template

Every module page in `workshop-2026-v2/modules/` must follow this structure:

```markdown
---
title: "Module N: [Title]"
---

::: {.callout-time}
**Duration**: XX min — Animation (X) · Reflection (X) · Theory (X) · Practice (X)
:::

## 1. Animation
{{< video media/modules/mod-0N-name.mp4 >}}

## 2. Think & Discuss (5–10 min)
::: {.callout-reflect}
**Situation**: [What happened in the animation]
**Prompts**:
- [Question 1]
- [Question 2]
- [Question 3]
:::

## 3. Theory
### Learning Objectives
...

### [Concept sections from legacy module]

## 4. Practice
[Link to exercises — ex-databricks, ex-snowflake, etc.]

::: {.callout-priya}
**Priya / Power BI**: [Which dashboard pages this module's Gold KPIs feed]
:::

## Next Module Preview
One sentence — what constraint or question the next animation raises.
```

**Module 7 variant** — replace Practice section with:

```markdown
## 4. Open Discussion — Tool Comparison (20–25 min)
[Facilitation guide link: trainer/open-discussion-guide.qmd]

### Optional: Power BI Demo (10 min)
[Link to powerbi/README.md]
```
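
Because every module page must follow this structure, a small check can catch pages that drift from it. The following is a sketch only: it assumes the level-2 headings shown in the two templates above are the contract, matches them by prefix, and does not skip fenced code blocks. The function and constant names are hypothetical and not part of the existing generator.

```python
"""Check that a module page follows the editorial template (illustrative sketch)."""
import re

# Headings required by the template, in order.
STANDARD_HEADINGS = [
    "## 1. Animation",
    "## 2. Think & Discuss",
    "## 3. Theory",
    "## 4. Practice",
    "## Next Module Preview",
]
# Module 7 variant: "Practice" is replaced by "Open Discussion".
MODULE_7_HEADINGS = [
    h if h != "## 4. Practice" else "## 4. Open Discussion"
    for h in STANDARD_HEADINGS
]


def missing_or_misordered(text: str, required: list[str]) -> list[str]:
    """Return the required headings that are absent or out of order."""
    # Collect level-2 headings only; "###" subsections are ignored.
    found = re.findall(r"^## .*$", text, flags=re.MULTILINE)
    problems = []
    position = 0
    for heading in required:
        # Prefix match, because real headings carry suffixes such as "(5–10 min)".
        index = next(
            (i for i in range(position, len(found)) if found[i].startswith(heading)),
            None,
        )
        if index is None:
            problems.append(heading)
        else:
            position = index + 1
    return problems
```

An empty result means the page has all required sections in template order; any returned heading is either missing or appears before a section it should follow.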

---

## 10. Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Eight module videos overrun schedule | Less lab time | Cap each clip at 3 min except Module 0; trim reflection if needed |
| Reflection + discussion eat lunch buffer | Day runs long | Use fixed 5-min reflection timer; Module 7 discussion is the one extended slot |
| Trainees think dbt replaces Snowflake | Conceptual confusion | Elena explicitly states "dbt on top of Snowflake" in Act 5 |
| Bob "chooses" tools unrealistically | Breaks credibility | Elena approves each pivot; Bob executes |
| Dual Quarto sites diverge | Maintenance burden | Shared `reference/` links to root docs; single pipeline codebase |
| AI video quality inconsistent | Weak first impression | Use illustrated style; avoid uncanny avatars; test with pilot audience |

---

## 11. Success Metrics

| Metric | Target |
|--------|--------|
| Trainees can explain medallion layers | ≥90% in end-of-day poll |
| Trainees can name one strength of each tool | ≥90% |
| All three pipelines complete Gold layer | 100% of pairs in standard time |
| Trainees articulate a tool choice with evidence | ≥80% in Module 7 discussion |
| Priya/Power BI thread understood as consumption layer | ≥4/5 in poll |
| Facilitator can deliver without story script memorization | Facilitator guide complete |

---

## 12. Immediate Next Steps

1. **Produce animation MP4s** — priority: Modules 0, 2, and 7 first, then the rest of Modules 1–6, then optional Modules 8–9.
2. **Run `quarto render`** on `workshop-2026-v2/` — fix links; confirm setup/reference URLs work offline if needed.
3. **Update root `README.md`** — dual-site table per §8.6.
4. **Complete trainer runbook** — facilitator guide, delivery-pattern cheat sheet, whiteboard prompts.
5. **Co-trainer dry-run** — validate §6 timing (protect Modules 2–4 lab minutes).
6. **Publish** — deploy `workshop-2026-v2/_site` alongside legacy site; decide primary URL for 2026 classes.
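
Step 2 asks for link fixes after rendering. As a rough aid, and assuming the site sources are `.qmd` files that use ordinary Markdown links with relative targets, a scan like the one below can list targets that do not exist on disk. The function name and the regular expression are illustrative; the sketch ignores external URLs and pure `#anchor` links, and it will misreport site-absolute paths that start with `/`.

```python
"""Report relative Markdown links whose targets do not exist (illustrative sketch)."""
import re
from pathlib import Path

# Matches [text](target); the target stops at the first ")", "#" or whitespace.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)#\s]+)")
# Anything with a URL scheme (https:, mailto:, ...) is treated as external.
SCHEME_PATTERN = re.compile(r"^[a-z][a-z0-9+.-]*:", flags=re.IGNORECASE)


def broken_relative_links(root: Path) -> list[tuple[Path, str]]:
    """Return (source file, link target) pairs that point at missing paths."""
    broken = []
    for source in sorted(root.rglob("*.qmd")):
        text = source.read_text(encoding="utf-8")
        for target in LINK_PATTERN.findall(text):
            if SCHEME_PATTERN.match(target):
                continue  # external link, out of scope here
            # Relative targets are resolved against the linking file's folder.
            if not (source.parent / target).exists():
                broken.append((source, target))
    return broken
```

For example, `broken_relative_links(Path("workshop-2026-v2"))` would return one pair per suspect link; an empty list does not replace a full render, since Quarto shortcodes and generated pages are not covered.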

Regenerate merged content after shared or legacy changes:

```bash
python _sync_shared.py                          # sync shared/ → v1 and v2
python workshop-2026-v2/_scaffold_generate.py   # regenerate modules + exercises
```
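
If the two commands are usually run together, they can be wrapped so that a failed sync stops the run before the scaffold is regenerated. This is a sketch under two assumptions: that it is run from the repository root, and that the sync should precede the scaffold step as listed above. The wrapper itself, including the `regenerate` name and its `dry_run` flag, is hypothetical.

```python
"""Run the two regeneration steps in order (illustrative sketch)."""
import subprocess
import sys

# The same two scripts as in the commands above, in the same order.
STEPS = [
    ["_sync_shared.py"],
    ["workshop-2026-v2/_scaffold_generate.py"],
]


def regenerate(dry_run: bool = True) -> list[list[str]]:
    """Run each step with the current interpreter; stop at the first failure."""
    commands = [[sys.executable, *step] for step in STEPS]
    for command in commands:
        print("+", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run after a failed one.
            subprocess.run(command, check=True)
    return commands
```

Calling `regenerate()` only prints the commands; `regenerate(dry_run=False)` executes them.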

---

## 13. Migration Status Summary (v2.7)

| Area | Completion | Notes |
|------|------------|-------|
| **Story curriculum design** | **100%** | Phases 1–2 complete |
| **Quarto story site** | **~95%** | Renders clean; MP4 files pending |
| **Trainer scaffolding** | **100%** | Runbook, delivery pattern, whiteboard prompts |
| **Animation & media** | **~15%** | Style guide + PBI capture instructions; no MP4s |
| **Publishing & cutover** | **~40%** | Decision doc + dry-run checklist; deploy pending |

### Open optimizations (see audit doc)

| Priority | Item | Status |
|----------|------|--------|
| P0 | LSDP / Lakeflow Jobs glossary (Modules 5–6) | **Done** |
| P1 | Module 1 timing → 35 min | **Done** in `index.qmd` |
| P2 | Module 6 Cortex Analyst / Search one-liners | **Done** |
| P2 | Per-module official doc footer URLs | **Done** |
| P3 | Module 9 `CREATE SNOWFLAKE.ML.FORECAST` note | **Done** |
| — | Module footers → local `exercises/` | **Done** |
| — | Video embed when MP4 exists | **Done** (generator auto-detect) |
| — | Module 7 theory length vs 5 min plan | **Done** — deep dive reference page |

### What trainees can use today

- **Story track**: [`workshop-2026-v2/`](workshop-2026-v2/) — theory + labs self-contained (no `workshop-2026-v1/` edits required)
- **Legacy track**: [`workshop-2026-v1/`](workshop-2026-v1/) — frozen module-centric site
- **Regenerate**: run `python _sync_shared.py` after any `shared/` change, and `python workshop-2026-v2/_scaffold_generate.py` after any change to the frozen v1 modules or exercises

---

## Appendix A — Sample Dialogue (Module 2 Transition)

**Elena**: "We have ADLS2 as our landing zone. Marcus needs answers by Friday. Bob, show us
ingest at scale — Sofia will pair with you on Databricks."

**Bob**: "I'll read the Parquet directly from ADLS2 into Bronze Delta tables, clean in Silver, and
materialize Priya's twelve KPIs in Gold."

**Sofia**: "Remember — we're not optimizing for perfect code. We're proving the medallion pattern
works before we simplify for Marcus's SQL team."

---

## Appendix B — Trainee Design Worksheet (Whiteboard Prompt)

1. Draw your data sources and ingestion frequency.
2. Name three layers and what each contains.
3. List five KPIs Marcus cares about.
4. Pick a primary tool for ingest and a primary tool for transform — defend in one sentence.
5. What tests or documentation would you require before go-live?

*(Facilitator reveals MHP's answers after 10 minutes — see Act 2 and Act 5.)*

---

## Appendix C — Module 7 Open Discussion Facilitation Guide

**Goal**: Trainees compare Databricks, Snowflake, and dbt and defend a real-world tool choice —
not receive a vendor answer from the trainer.

**Before discussion** (during animation + reflection):

- Trainees write privately: *My recommended stack for MetroYellow is ___ because ___.*

**Round 1 — Share (8 min)**

- 3–4 volunteers, 2 min each. No interruptions.

**Round 2 — Challenge (8 min)**

- Facilitator applies constraints: *"Budget cut 40%."* *"Team is 5 SQL analysts, 0 Python."*
  *"Need real-time in 6 months."* Ask: does your recommendation hold?

**Round 3 — Synthesis (8 min)**

- Table on whiteboard — fill from **trainee words**, not slides:

| Dimension | Databricks | Snowflake | dbt |
|-----------|------------|-----------|-----|
| Best for | (trainee input) | | |
| Weak for | | | |
| Marcus's team | | | |

**Trainer close (2 min)**

- Name the pattern: platform + transform layer + consumption (Power BI) are separate decisions.
- Revisit Module 0 design whiteboard.
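
The three rounds and the trainer close above add up to 8 + 8 + 8 + 2 = 26 minutes, while the Module 7 template in §9 gives the discussion a 20–25 min slot. A quick check like the following sketch makes that gap visible whenever segment lengths change; the dictionary keys and the `overrun` helper are illustrative.

```python
"""Sum the Module 7 discussion agenda and compare it with the planned slot."""

# Minutes per segment, copied from the rounds above.
AGENDA = {
    "Round 1 - Share": 8,
    "Round 2 - Challenge": 8,
    "Round 3 - Synthesis": 8,
    "Trainer close": 2,
}
# Upper bound of the "20–25 min" slot in the Module 7 template (section 9).
SLOT_MAX_MINUTES = 25


def overrun(agenda: dict[str, int], slot_max: int) -> int:
    """Return how many minutes the agenda exceeds the slot (0 if it fits)."""
    return max(0, sum(agenda.values()) - slot_max)


total = sum(AGENDA.values())
print(f"Agenda total: {total} min, overrun: {overrun(AGENDA, SLOT_MAX_MINUTES)} min")
# prints "Agenda total: 26 min, overrun: 1 min"
```

Trimming any one round by a minute, or widening the slot, brings the agenda back inside the plan.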

**Reference only if discussion stalls** — do not lead with this table:

| Dimension | Databricks | Snowflake | dbt |
|-----------|------------|-----------|-----|
| Primary user | Data engineer / ML engineer | SQL analyst | Analytics engineer |
| Ingest | Spark, Auto Loader, Lakeflow Declarative Pipelines (formerly DLT) | Stages, Snowpipe | Delegated (dbt does not ingest) |
| Transform | PySpark, SQL, Delta | SQL, Snowpark | SQL + tests + lineage |
| Governance | Unity Catalog | Horizon, masking | Docs, lineage graph |
| Power BI | Gold tables via connector | Gold tables via connector | Gold via warehouse |

---

## Document History

| Date | Version | Change |
|------|---------|--------|
| 2026-05-23 | v1 | Initial migration plan and storyline |
| 2026-05-23 | v2 | Four-step module pattern; eight animations; Power BI thread; open discussion finale |
| 2026-05-23 | v2.1 | Module 8 optional streaming scaffold in trainer docs |
| 2026-05-23 | v2.2 | Module 9 optional ML scaffold in trainer docs |
| 2026-05-23 | v2.3 | Prerequisites, Cortex 6 vs 9, optional order — in workshop-2026 only; quarto reverted |
| 2026-05-23 | v2.4 | Content scaffold complete: merged theory + exercises, Quarto site, phase checklists updated; media & cutover pending |
| 2026-05-23 | v2.5 | **Phase 1 & 2 complete**: story/ pages, setup/reference sync, port 4201, dual README, video embed infra, official doc footers |
| 2026-05-24 | v2.6 | Cross-referenced external storyline review (`Suggestion_of_chatgpt.md`): adopted **architecture decision matrix**, **three-constraint framework**, Module 3/7 taglines, generator EDITORIAL updates. |
| 2026-05-24 | v2.7 | **Phase 4 complete**: facilitator-guide, module-delivery-pattern, whiteboard-prompts. **Module 7 trim**: `reference/tool-comparison-deep-dive.qmd` via generator skip range. **Phase 3/5 partial**: animation style guide, powerbi screenshot README, dry-run checklist, publishing-cutover doc. P0/P1 link fixes (Module 9 callout, setup links, trainer hub). |
