Facilitator Guide — Day-of Runbook

Facilitator-only — not shown to trainees during labs

TRAINER ONLY

YellowLine NYC masterclass · MHP Data Engineer Masterclass 2026

Audience: Lead trainer + co-trainer
Related: pre-class-checklist.qmd · module-delivery-pattern.qmd · whiteboard-prompts.qmd · reflection-prompts.qmd


Pre-class infrastructure setup (1–2 weeks before)

These tasks are done once by the trainer before the workshop day. They cannot be done on the morning of the class — plan ahead.

Important: Shared Azure resource group

ADLS2 (mhpdeworkshopsa) and Databricks (mhpdeworkshop_databricks) are both pre-provisioned under the same resource group:

1000_data_engineering_workshop · subscription MHP Resort Consulting Services · tenant mhpdev.onmicrosoft.com

Databricks Workspace

Note: Pre-provisioned by MHP; do not create a new workspace

The workshop Databricks workspace already exists in resource group 1000_data_engineering_workshop. Your job is to configure users, Git folders, Unity Catalog grants, and clusters — not to provision a new workspace.

| Item | Value |
| --- | --- |
| Workspace name | mhpdeworkshop_databricks |
| Workspace ID | 3359135813781456 |
| Resource group | 1000_data_engineering_workshop |
| Subscription | MHP Resort Consulting Services (ba826c91-8e52-4e07-ac7c-538858bbc813) |
| Azure tenant | mhpdev.onmicrosoft.com |
| Your role | Workspace admin (provisioned by MHP IT) |
| Unity Catalog | mhpdeworkshop_databricks_2026; confirm in Catalog explorer (name must match CATALOG_NAME in databricks/notebooks/00_setup.py) |
| Source repo | github.com/jinjuewei/MHPDataEngineerWorkshop |
| Notebook path in repo | databricks/notebooks/ (00_setup.py to 04_ai_features.py) |

Find it in Azure Portal

  1. Open resource group 1000_data_engineering_workshop.
  2. Under Resources, click Azure Databricks / mhpdeworkshop_databricks.
  3. Click Launch workspace (or copy the workspace URL from Overview).
  4. Workspace URL looks like https://adb-<id>.<random>.azuredatabricks.net — also under Settings → Workspace settings in the Databricks UI.

Official references: Add users (Azure Databricks), Create Git folders, Unity Catalog get started.

This workshop has two trainers (lead + co-trainer). Each trainer creates their own Git folder in their own Home, runs 00_setup.py with a trainer-specific ATTENDEE_ID, then shares that folder to all trainees (Can Run). Trainees open the shared folder to run notebooks; they still use their own {attendee_id} in 00_setup.py (clone that notebook to Home first — see Databricks setup).

Note: Trainer and trainee ATTENDEE_ID naming

| Role | Pattern | Examples |
| --- | --- | --- |
| Trainers | 00_{firstname} (lowercase) | 2026 delivery: 00_juewei, 00_alisa · other cohorts: 00_sam, 00_taylor |
| Trainees | 01_{name}, 02_{name}, … (lowercase) | 01_alice, 02_bob |

  • Databricks / Unity Catalog: schemas are lowercase, e.g. 00_juewei_bronze, 00_juewei_silver, 00_juewei_gold
  • Snowflake: same ID stem, schemas uppercase, e.g. 00_JUEWEI_BRONZE, 00_JUEWEI_SILVER, 00_JUEWEI_GOLD (set in each trainer’s 00_account_setup.sql)
  • Agree both trainer IDs in pre-class; use the same IDs for Databricks demos, Snowflake dry-runs, and Power BI trainer Gold
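The naming convention above can be sketched as a small helper that derives the three schema names for both platforms. This is an illustration only: the function name and the validation regex are assumptions, not code from the workshop repo.

```python
import re

# Assumed ID shape: two digits, underscore, lowercase name (e.g. 00_juewei, 01_alice).
ATTENDEE_ID_RE = re.compile(r"^\d{2}_[a-z][a-z0-9]*$")
LAYERS = ("bronze", "silver", "gold")


def schema_names(attendee_id: str) -> dict:
    """Return Databricks (lowercase) and Snowflake (uppercase) schema names."""
    if not ATTENDEE_ID_RE.match(attendee_id):
        raise ValueError(f"unexpected ATTENDEE_ID: {attendee_id!r}")
    databricks = [f"{attendee_id}_{layer}" for layer in LAYERS]
    return {"databricks": databricks, "snowflake": [s.upper() for s in databricks]}


print(schema_names("00_juewei"))
```

Running it for each agreed trainer ID before class is a quick way to confirm both trainers expect the same schema names on both platforms.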

Step 1 — Confirm workspace admin access

  1. Open the workspace URL and sign in with your MHP account.
  2. Click your username (top bar) → Settings.
  3. Confirm you can open Admin settings (workspace admin) or Identity and access without errors.
  4. Open Catalog → verify catalog mhpdeworkshop_databricks_2026 exists and you can browse it.
Note: 2026 catalog vs 2025

The workspace name stays mhpdeworkshop_databricks (shared Azure resource). Last year’s Unity Catalog is mhpdeworkshop_databricks_2025. For this cohort, create a new catalog mhpdeworkshop_databricks_2026 if it does not exist yet (Catalog → Create catalog), then apply the GRANTs in Step 5. Notebooks use CATALOG_NAME = "mhpdeworkshop_databricks_2026" in 00_setup.py, so do not point trainees at the 2025 catalog.

Step 3 — Each trainer creates a Git folder (lead + co-trainer)

Both trainers repeat this in their own Databricks Home (same steps, different ATTENDEE_ID). Use a Git folder.

| Trainer | ATTENDEE_ID in 00_setup.py | Schemas created (lowercase) |
| --- | --- | --- |
| Lead trainer | 00_{firstname} (2026: 00_juewei) | e.g. 00_juewei_bronze, _silver, _gold |
| Co-trainer | 00_{firstname} (2026: 00_alisa) | e.g. 00_alisa_bronze, _silver, _gold |

Replace {firstname} with each trainer’s assigned ID (naming convention above). Other deliveries pick any two 00_* IDs — keep them unique and agreed before class.

  1. Workspace → Home (/Users/<your-email>/).

  2. Create → Git folder.

  3. Fill in:

    | Field | Value |
    | --- | --- |
    | Git repository URL | https://github.com/jinjuewei/MHPDataEngineerWorkshop.git |
    | Git provider | GitHub |
    | Git folder name | MHP-DE-Workshop-2026 |
    | Sparse checkout mode | ✅ Enable |
    | Cone patterns | databricks/notebooks |

    Optional before Modules 8–9: add cone patterns streaming/databricks and ml/databricks.

  4. Create Git folder → wait for clone.

  5. Open databricks/notebooks/00_setup.py → set ATTENDEE_ID to your trainer ID (e.g. 00_juewei or 00_alisa) → attach/start cluster → Run all.

  6. Catalog → confirm your three schemas exist under mhpdeworkshop_databricks_2026.

  7. Confirm 01_bronze_ingestion.py to 04_ai_features.py are listed under the Git folder.

Sparse checkout must be enabled at creation; you cannot disable sparse mode afterward. Cone patterns can be edited later: Git folder → Settings → Advanced → Cone patterns (Configure sparse checkout).

Step 3b — Sync Git folder with GitHub (Pull — not automatic)

A Git folder is a workspace checkout of the remote repo. Changes on GitHub do not appear in the workspace until someone Pulls.

flowchart LR
    GH["GitHub\njinjuewei/MHPDataEngineerWorkshop"]
    T1["Lead trainer Git folder"]
    T2["Co-trainer Git folder"]
    ST["Trainees\nshared Can Run"]
    GH -->|"git push (CI / developer)"| GH
    T1 -->|"Git → Pull (manual)"| GH
    T2 -->|"Git → Pull (manual)"| GH
    T1 --> ST
    T2 --> ST
    GH -.->|"No auto sync"| T1

| Question | Answer |
| --- | --- |
| Does GitHub auto-update the workspace? | No. Click Pull in the Git dialog, or automate via Repos API / CI/CD (Pull changes) |
| Who Pulls for the shared-folder model? | Both trainers only. Trainees with Can Run cannot run Git operations (permissions) |
| When to Pull? | After any push to main; morning of class; after notebook/doc fixes land in GitHub |
| Trainee with own Git fork? | Trainee creates their own Git folder and Pulls there (collaborate in Git folders) |
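For the "automate via Repos API" route, a minimal sketch of the request is shown here. It assumes the Repos API update endpoint (PATCH /api/2.0/repos/{repo_id} with a branch name), which brings the Git folder to the latest commit on that branch. The host, repo ID, and token are placeholders, and the function only builds the request without sending it.

```python
import json
import urllib.request


def build_pull_request(host: str, repo_id: int, token: str, branch: str = "main"):
    """Build (but do not send) a Repos API request that updates a Git folder to the head of `branch`."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


# Placeholder values; replace with the real workspace URL, Git folder ID, and a PAT.
req = build_pull_request("https://adb-123.4.azuredatabricks.net", 42, "dapi-placeholder")
print(req.get_method(), req.full_url)
# To actually pull: urllib.request.urlopen(req)
```

The manual Pull in the Git dialog remains the simpler path for a two-trainer class; a scripted pull mainly helps if fixes land in GitHub often during the delivery week.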

Pull procedure (each trainer) — UI steps per Access the Git dialog and Pull changes:

Option A — from Workspace (recommended before class)

  1. Left sidebar → Workspace.
  2. Expand Users → your email → MHP-DE-Workshop-2026.
  3. Beside the folder name, click Git (Git icon / Git link). A full-screen Git operations dialog opens.
  4. At the top, confirm the branch dropdown shows main (or your workshop branch). If not, select main before pulling.
  5. Click Pull (in the dialog toolbar — sync/download from remote).
  6. Wait for the dialog to finish. Files under databricks/notebooks/ update to match GitHub.
  7. Close the Git dialog (click outside or X).
  8. Verify: open databricks/notebooks/00_setup.py → check PEP 723 header / recent edits match GitHub main.

Option B — from an open notebook

  1. Open any notebook in the Git folder (e.g. 01_bronze_ingestion.py).
  2. At the top of the notebook, next to the notebook title, click the branch name button (shows current branch, e.g. main).
  3. The same Git operations dialog opens → click Pull → confirm branch main.
  4. Close the dialog and re-open the notebook if cells look stale (Pull can clear notebook session state).

| UI element | Where to find it |
| --- | --- |
| Git button | Workspace tree: beside MHP-DE-Workshop-2026 folder name |
| Branch button | Top bar inside a notebook opened from the Git folder |
| Pull | Git operations dialog toolbar (downloads from remote; no commit message needed) |
| Commit & Push | Same dialog; only use if you intentionally changed files in the Git folder |

If Pull fails or is disabled

| Symptom | Action |
| --- | --- |
| Merge conflict after Pull | Git dialog offers Keep all current / Take all incoming or manual edit; see Resolve merge conflicts. For workshop notebooks, prefer incoming unless you have local edits to keep. |
| Pull grayed out / Git ops disabled | Workspace may need serverless compute for Git UI (Git CLI folders); see Git CLI compute requirements. Fallback: Repos API or ask workspace admin. |
| Uncommitted local changes in folder | Commit or discard before Pull; sparse-checkout folders block pattern changes while files have uncommitted edits. |

Official notes from Databricks:

  • Pull is manual — “click Pull in the Git operations dialog” (source).
  • Pull clears notebook state — warn attendees if they have unsaved notebook session state before you Pull mid-class.
  • One Git operator per folder — Databricks recommends only one user performs Git ops per folder; trainees use Can Run on the shared copy (collaborate).
  • Git UI + serverless — if Git → Pull is disabled, the workspace may need serverless compute (required for Git CLI-enabled folders) — see Git CLI compute requirements.

Never Push secrets from the workspace: STORAGE_ACCOUNT_KEY belongs only in each attendee’s Home clone of 00_setup.py, not in the shared Git folder commit. Source .py notebook outputs are not committed by default (commit and push).

Step 4 — Create workshop groups and invite attendees

Create two groups for the 2026 cohort:

| Group | Members | Workspace access level |
| --- | --- | --- |
| workshop_trainer_2026 | Lead + co-trainer | User (trainers are also workspace admins; see Step 1) |
| workshop_trainees_2026 | All attendees | User (not Admin) |

  1. Settings → Admin settings → Identity and access → Groups → Add group: create both groups.
  2. Add trainer emails to workshop_trainer_2026; add each attendee to workshop_trainees_2026 (or bulk-import).
  3. Users → Add user for anyone not yet in the workspace. Each invitee receives an activation email; confirm all accepted before class.

Entitlements — open each group → Entitlements tab:

| Group | Enable |
| --- | --- |
| workshop_trainees_2026 | Workspace access · Databricks SQL (Module 6 Genie + SQL Editor) |
| workshop_trainer_2026 | Workspace access · Databricks SQL (trainer dry-runs) |

Trainees do not need Allow unrestricted cluster creation — the Workshop cluster policy (Step 6) grants Can use, which is enough to self-create clusters within policy limits.

See Manage users · Manage entitlements.

Step 5 — Unity Catalog permissions (trainees)

Run in SQL Editor (or from 00_setup.py as admin). Grants must exist before attendees run 00_setup.py (CREATE SCHEMA).

Option A — grant to trainee group (recommended):

-- Catalog-level (once, before class) — trainees only
GRANT USE CATALOG ON CATALOG mhpdeworkshop_databricks_2026 TO `workshop_trainees_2026`;
GRANT CREATE SCHEMA ON CATALOG mhpdeworkshop_databricks_2026 TO `workshop_trainees_2026`;

Option B — grant per attendee (if no group):

GRANT USE CATALOG ON CATALOG mhpdeworkshop_databricks_2026 TO `attendee@example.com`;
GRANT CREATE SCHEMA ON CATALOG mhpdeworkshop_databricks_2026 TO `attendee@example.com`;

After each attendee runs 00_setup.py, grant schema access (schemas are lowercase):

GRANT USE SCHEMA ON SCHEMA mhpdeworkshop_databricks_2026.{attendee_id}_bronze TO `workshop_trainees_2026`;
GRANT USE SCHEMA ON SCHEMA mhpdeworkshop_databricks_2026.{attendee_id}_silver TO `workshop_trainees_2026`;
GRANT USE SCHEMA ON SCHEMA mhpdeworkshop_databricks_2026.{attendee_id}_gold TO `workshop_trainees_2026`;
GRANT SELECT ON SCHEMA mhpdeworkshop_databricks_2026.{attendee_id}_gold TO `workshop_trainees_2026`;

Replace {attendee_id} with e.g. 01_alice. The SELECT on _gold is required before Module 6 (ai_query(), Genie). Alternatively use Catalog → catalog → Permissions UI.
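With a full class, the per-attendee grants above are easier to generate than to type. A minimal sketch follows, assuming the catalog and group names from this step. The function name is hypothetical, and the schema names are backtick-quoted defensively because they start with a digit.

```python
CATALOG = "mhpdeworkshop_databricks_2026"
GROUP = "workshop_trainees_2026"


def schema_grants(attendee_id: str, catalog: str = CATALOG, group: str = GROUP) -> list:
    """Return the four per-attendee GRANT statements listed in this step."""
    stmts = [
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.`{attendee_id}_{layer}` TO `{group}`;"
        for layer in ("bronze", "silver", "gold")
    ]
    # SELECT on Gold is the grant Module 6 depends on.
    stmts.append(f"GRANT SELECT ON SCHEMA {catalog}.`{attendee_id}_gold` TO `{group}`;")
    return stmts


# Example attendee IDs; paste the printed statements into SQL Editor.
for attendee in ["01_alice", "02_bob"]:
    print("\n".join(schema_grants(attendee)))
```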

Trainers (workshop_trainer_2026) use workspace-admin privileges for demos — no separate UC group grants needed unless you prefer explicit grants.

Step 7 — Compute for attendees

Option A — attendees create their own cluster (less prep, more day-of support):

| Setting | Value |
| --- | --- |
| Policy | Workshop |
| Runtime | 15.4 LTS (Modules 2–6) · 15.4 LTS ML for optional Module 9 |
| Node type | Standard_DS3_v2 |
| Workers | 1 |
| Auto-terminate | 30 minutes |

Trainee self-create cluster (Option A: brief for class)

Regular users in workshop_trainees_2026 can create a cluster when the Workshop policy has Can use (Step 6). Trainees do not need workspace admin.

  1. Compute → Create compute → Cluster.
  2. Policy: select Workshop (required — limits size and cost).
  3. Databricks runtime: 15.4 LTS (or latest LTS).
  4. Node type: Standard_DS3_v2 · Workers: 1 · name e.g. de-workshop-01_alice.
  5. Create → wait until Running (green).
  6. Open a notebook → compute dropdown (top bar) → attach this cluster → Run all on your Home copy of 00_setup.py first.

Pre-class verify: sign in as a test trainee (or co-trainer in workshop_trainees_2026) and confirm Create compute shows policy Workshop and the cluster starts.

Option B — pre-create one cluster per attendee (faster Module 2 start):

| Setting | Value |
| --- | --- |
| Cluster name | de-workshop-{ATTENDEE_ID} (e.g. de-workshop-01_alice) |
| Policy | Workshop |
| Runtime | 15.4 LTS (ML for Module 9 optional labs) |
| Node type | Standard_DS3_v2 |
| Workers | 1 |
| Auto-terminate | 30 minutes |

Create in Terminated state. Share each cluster: Compute → cluster → Permissions → add attendee (or workshop_trainees_2026) with Can restart.
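If clusters are pre-created by script rather than by hand, the Option B settings map onto a Clusters API create payload roughly as sketched here. The field names are the standard Clusters API ones. The runtime key "15.4.x-scala2.12" and the policy ID placeholder are assumptions to confirm in the workspace before use.

```python
import json


def cluster_spec(attendee_id: str, policy_id: str) -> dict:
    """Build a cluster definition matching the Option B table (one worker, 30-minute auto-terminate)."""
    return {
        "cluster_name": f"de-workshop-{attendee_id}",
        "policy_id": policy_id,                # ID of the Workshop policy (placeholder below)
        "spark_version": "15.4.x-scala2.12",   # assumed key for 15.4 LTS; verify in the workspace
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
        "autotermination_minutes": 30,
    }


print(json.dumps(cluster_spec("01_alice", "<workshop-policy-id>"), indent=2))
```

The printed JSON is only a definition; creating the cluster and setting Can restart permissions still happen through the UI or a separate API call.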

Step 8 — Share Git folder with trainees (required — both trainers)

After each trainer finishes Step 3 (Git folder created + 00_setup.py dry-run), share the folder before class:

  1. Workspace → Home → MHP-DE-Workshop-2026.
  2. Click Share.
  3. Add principals:
    • Group workshop_trainees_2026 → Can Run (recommended: all trainees)
    • Or add individual trainee emails → Can Run
  4. Click Save.
  5. Verify: co-trainer or a test user opens Workspace → Users → <your-email> → MHP-DE-Workshop-2026 → opens 01_bronze_ingestion.py without error.

| Permission | Trainee can | Trainee cannot |
| --- | --- | --- |
| Can Run | Open and run notebooks | Edit or delete files in your folder |

Trainee path to notebooks:

Workspace → Users → <trainer-email> → MHP-DE-Workshop-2026 → databricks/notebooks

Both trainers share their folders — trainees may use either trainer’s copy (same repo content). Tell the class which trainer to contact for lab support (optional split by row/seat).

00_setup.py and trainee IDs: the shared folder is read-only for trainees. They must clone 00_setup.py to their own Home to set their {attendee_id} (e.g. 01_alice), then run it once before notebooks 01 to 04. Steps in Databricks setup.

Fallback: trainee Git folder from their fork (Option A) or manual import (Option B) if sharing is blocked.

Step 9 — What trainees do in Module 2

Brief attendees on:

| Task | Where |
| --- | --- |
| Accept workspace invite | Email link |
| Open trainer notebooks | Workspace → Users → <trainer-email> → MHP-DE-Workshop-2026 |
| Clone 00_setup.py to Home | ⋮ → Clone to [your Home], then set your ATTENDEE_ID (e.g. 01_alice) |
| Run 00_setup.py | Your Home copy; creates your {id}_bronze/silver/gold schemas |
| Run 01 to 04 | Trainer shared folder (or your clone) |
| Paste ADLS2 key (verbal from trainer) | Your 00_setup.py only; never commit |
| Create/start cluster | Compute → Create compute → policy Workshop → Start (see Step 7) |

Step 10 — Databricks SQL Warehouse (Module 6 AI + optional dbt target)

Required for Genie and ai_query() in Module 6 — even when dbt targets Snowflake only.

  • SQL Editor → SQL Warehouses → Create SQL Warehouse
  • Type: Pro or Serverless (not Classic — ai_query() is unsupported on Classic)
  • Name: de-workshop-wh, Size: 2X-Small, Auto-suspend: 5 minutes
  • Permissions → add workshop_trainees_2026 and workshop_trainer_2026 → Can use
  • Trainer dry-run: SQL Editor → attach de-workshop-wh → SELECT 1

Official reference: Create a SQL warehouse

Step 10b — Databricks AI features (Module 6 — account + workspace admin)

Module 6 uses Genie Code (notebook/SQL AI assistant — formerly Databricks Assistant), ai_query() (SQL), and Genie Spaces (natural language over Gold). Configure once before class.

Account admin (Account console → Settings → Feature enablement):

| Setting | Value | Why |
| --- | --- | --- |
| Enable partner-powered AI features | On | Powers Genie Code, Genie Spaces, and related assistive features (Azure OpenAI / Anthropic on Databricks) |
| Enforce data processing within workspace Geography | Review if AI features fail to enable | Workspaces in some regions (e.g. Germany West Central) may need this enforcement turned off so that cross-geo processing is allowed; see Partner-powered AI features |

Workspace admin (username → Settings → Workspace admin → Advanced):

| Setting | Value |
| --- | --- |
| Partner-powered AI features | On (unless account enforces Off) |

No separate “AI-powered assistive features” toggle? That is normal in current Azure Databricks UI. Microsoft consolidated admin control under Partner-powered AI features; Genie Code and other assistive features are enabled when partner-powered AI is On (or use Databricks-hosted models when it is Off in supported regions). Do not block Module 6 prep looking for a second toggle — run the functional checks below instead.

Verify AI is activated (workspace admin, 5 min)

| Check | Pass criteria |
| --- | --- |
| Workspace Advanced | Partner-powered AI features = On |
| Genie Code | Open a notebook on a cluster → Ctrl+I / Cmd+I (or Genie Code icon) → prompt returns a suggestion |
| Genie Spaces | Sidebar Genie → New → add a Gold table → question returns SQL or an answer |
| ai_query() | SQL Editor on de-workshop-wh → one-row test query succeeds (verify model name) |

User entitlements — configured in Step 4 (workshop_trainees_2026: Workspace access + Databricks SQL).

Unity Catalog: GRANT SELECT ON SCHEMA …_gold is in Step 5 (run after each 00_setup.py). Required before Module 6 ai_query() and Genie.

Trainer pre-class dry-run (after Gold tables exist from 03_gold_kpis.py):

  1. Genie Code — open 04_ai_features.py on a cluster → Ctrl+I / Cmd+I → prompt “Show top 5 pickup zones by revenue”

  2. ai_query(): SQL Editor → warehouse de-workshop-wh → run a one-row test (verify model name in workspace first):

    SELECT ai_query(
      'databricks-meta-llama-3-3-70b-instruct',
      'Reply with exactly: OK'
    ) AS test;
  3. Genie: sidebar Genie → New → add your trainer Gold tables (e.g. 00_juewei_gold.kpi_*) → set default warehouse de-workshop-wh → ask “What hour has the most taxi trips?”

Model IDs drift — list available foundation models in the workspace before class. Do not hardcode names from last year’s delivery.

Official references: AI assistive features · Genie setup · Genie Code · ai_query

Step 11 — Databricks CLI authentication (Module 8 Aiven Secrets)

The trainer needs to create the workshop-scope secrets scope before class or guide attendees through it during Module 8. Two options:

Option A: Trainer creates scope via CLI (recommended):

```bash
# Flag syntax below is the legacy Databricks CLI (pip package databricks-cli).
# Newer CLI releases take positional arguments instead (e.g. put-secret <scope> <key>).

# Authenticate CLI to workspace
databricks configure --token
# Prompt: Databricks Host: https://adb-<id>.<random>.azuredatabricks.net
# Prompt: Token: <generate PAT from Settings → User Settings → Access Tokens>

# Create scope (once)
databricks secrets create-scope --scope workshop-scope

# Add secrets (values from Aiven Console)
databricks secrets put --scope workshop-scope --key aiven-bootstrap-servers
databricks secrets put --scope workshop-scope --key aiven-ca-cert
databricks secrets put --scope workshop-scope --key aiven-client-cert
databricks secrets put --scope workshop-scope --key aiven-client-key
databricks secrets put --scope workshop-scope --key aiven-topic

# Allow trainees to read secrets in notebooks (Module 8)
databricks secrets put-acl --scope workshop-scope --principal workshop_trainees_2026 --permission READ
databricks secrets put-acl --scope workshop-scope --principal workshop_trainer_2026 --permission READ
```

Option B — Attendees create their own scope during Module 8 (requires each attendee to have a PAT and Databricks CLI installed).

Verify: as a test trainee, databricks secrets list --scope workshop-scope lists five keys (trainees need READ ACL — not scope admin).

ADLS2 Storage (mhpdeworkshopsa)

Note: Pre-provisioned by MHP; do not create a new account

The workshop storage account already exists. Your job is to upload TLC data, rotate keys, and create SAS tokens — not to provision Azure storage.

| Item | Value |
| --- | --- |
| Storage account | mhpdeworkshopsa |
| Resource group | 1000_data_engineering_workshop |
| Location | Germany West Central (germanywestcentral) |
| Subscription | MHP Resort Consulting Services |
| Subscription ID | ba826c91-8e52-4e07-ac7c-538858bbc813 |
| Azure tenant | mhpdev.onmicrosoft.com |
| Container | nyc-taxi-data (should already exist) |

Find it in Azure Portal

  1. Sign in to Azure Portal with your MHP account (mhpdev.onmicrosoft.com).
  2. Open the resource group directly: 1000_data_engineering_workshop.
  3. Under Resources, click storage account mhpdeworkshopsa.
  4. Confirm Location shows Germany West Central on the Overview blade.

Shortcut: search mhpdeworkshopsa in the portal top bar if you are already in the correct subscription.

Same resource group also contains Databricks workspace mhpdeworkshop_databricks — see Databricks Workspace above.

Shared storage for all attendees. Two credentials — do not mix them up:

| Credential | Used by | Module | Distribution (never commit to Git) |
| --- | --- | --- | --- |
| Storage account key (key1) | Databricks 00_setup.py | 2 | Distribute verbally |
| SAS token (query string) | Snowflake 02_external_stage.sql | 3 | Printed card / slide |

Data layout (container nyc-taxi-data):

| Path | Content |
| --- | --- |
| raw/trips/ | Parquet trip files (yellow_tripdata_YYYY-MM.parquet) |
| raw/lookup/taxi_zone_lookup.csv | Zone lookup CSV (265 zones) |
| streaming/user-activity/ | Module 8 relay output (optional) |

Official reference: Grant limited access with SAS.

0 — Download from TLC and upload to ADLS2

Workshop pipelines read raw files from ADLS2 — they do not download from the internet at runtime. The trainer (or MHP ops) must download once from NYC TLC and upload to mhpdeworkshopsa before rotating keys or distributing SAS tokens.

Source: NYC TLC Trip Record Data → Yellow Taxi Trip Records, Parquet format.

| File | TLC download (direct) | ADLS2 destination |
| --- | --- | --- |
| Trip data | https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet (~61 MB) | nyc-taxi-data/raw/trips/yellow_tripdata_2024-10.parquet |
| Zone lookup | https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv (~12 KB) | nyc-taxi-data/raw/lookup/taxi_zone_lookup.csv |

Workshop default month: October 2024 (dbt_project/dbt_project.yml sets data_year: 2024, data_month: 10). One month (~3M trips) is enough for all labs. Optional: add yellow_tripdata_2024-09.parquet and yellow_tripdata_2024-11.parquet for richer time-series KPIs — keep original TLC filenames.
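When adding the optional extra months, the download URL and blob name follow the same pattern as the October 2024 row in the table above. A small sketch that derives both from a year and month; the function name is hypothetical, and the URL pattern is inferred from the single example given here.

```python
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"
CONTAINER = "nyc-taxi-data"


def trip_file(year: int, month: int) -> dict:
    """Return the TLC download URL and ADLS2 destination for one month of yellow taxi trips."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    name = f"yellow_tripdata_{year:04d}-{month:02d}.parquet"  # keep the original TLC filename
    return {
        "url": f"{TLC_BASE}/{name}",
        "blob": f"raw/trips/{name}",
        "destination": f"{CONTAINER}/raw/trips/{name}",
    }


for month in (9, 10, 11):
    print(trip_file(2024, month)["url"])
```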

Step 1 — Download locally

  1. Open the TLC trip record page.
  2. Under Yellow Taxi Trip Records, choose Parquet (not CSV) for the trip file.
  3. Download October 2024 Parquet (link above or TLC table row yellow_tripdata_2024-10.parquet).
  4. Download Taxi Zone Lookup Table CSV (taxi_zone_lookup.csv — link above or TLC Auxiliary data section).
  5. Confirm locally: trip file is .parquet; lookup is .csv with header LocationID,Borough,Zone,service_zone.
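The local check in the last step can be scripted. The sketch below verifies the lookup header and counts data rows using only the standard library. The two sample rows stand in for the real file, and the function name is an assumption.

```python
import csv
import io

EXPECTED_HEADER = ["LocationID", "Borough", "Zone", "service_zone"]


def check_lookup(text: str) -> dict:
    """Check the zone lookup header and count non-empty data rows."""
    reader = csv.reader(io.StringIO(text.lstrip("\ufeff")))  # tolerate a UTF-8 BOM
    header = next(reader, [])
    rows = sum(1 for row in reader if row)
    return {"header_ok": header == EXPECTED_HEADER, "rows": rows}


# Sample content only; for the real file use:
# check_lookup(open("taxi_zone_lookup.csv", encoding="utf-8").read())
sample = (
    '"LocationID","Borough","Zone","service_zone"\n'
    '1,"EWR","Newark Airport","EWR"\n'
    '2,"Queens","Jamaica Bay","Boro Zone"\n'
)
print(check_lookup(sample))
```

On the downloaded file, the row count should come out at about 265, matching the zone count expected later in 01_bronze_ingestion.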

Step 2 — Upload to ADLS2

Use any method below. Create folders raw/trips/ and raw/lookup/ if they do not exist.

Option A — Azure Portal (no extra tools)

  1. Azure Portal → Storage accounts → mhpdeworkshopsa.
  2. Data storage → Containers → nyc-taxi-data.
  3. Open or create raw/trips/ → Upload → select yellow_tripdata_2024-10.parquet.
  4. Open or create raw/lookup/ → Upload → select taxi_zone_lookup.csv (exact filename; Snowflake stage and Databricks LOOKUP_DATA_PATH expect this name).

Option B — Azure Storage Explorer

  1. Install Azure Storage Explorer.
  2. Connect with your MHP Azure account → mhpdeworkshopsa → nyc-taxi-data.
  3. Drag Parquet into raw/trips/ and CSV into raw/lookup/.

Option C — Azure CLI (trainer workstation with az logged in)

# Set variables — use key1 from Portal → Access keys (trainer only; never commit)
ACCOUNT=mhpdeworkshopsa
KEY="<storage-account-key1>"
CONTAINER=nyc-taxi-data

az storage blob upload \
  --account-name "$ACCOUNT" --account-key "$KEY" \
  --container-name "$CONTAINER" \
  --file ./yellow_tripdata_2024-10.parquet \
  --name raw/trips/yellow_tripdata_2024-10.parquet \
  --overwrite

az storage blob upload \
  --account-name "$ACCOUNT" --account-key "$KEY" \
  --container-name "$CONTAINER" \
  --file ./taxi_zone_lookup.csv \
  --name raw/lookup/taxi_zone_lookup.csv \
  --overwrite

Step 3 — Sanity-check before key/SAS distribution

| Check | Expected |
| --- | --- |
| raw/trips/ | At least one yellow_tripdata_*.parquet visible |
| raw/lookup/taxi_zone_lookup.csv | File present; ~265 data rows (+ header) |
| Bronze ingest (after key) | spark.read.parquet(TRIPS_DATA_PATH).count() returns about 3M for Oct 2024 |
| Lookup ingest | Zone count 265 in 01_bronze_ingestion |

Do this upload before sections A–C below (key rotation and SAS creation).

A — Regenerate storage account key (Databricks)

Use a fresh key before each workshop cohort.

  1. Sign in to Azure Portal.
  2. Search Storage accounts → open mhpdeworkshopsa.
  3. Left menu → Security + networking → Access keys.
  4. Under key1, click Rotate key (or Regenerate — regenerates key1 and invalidates the old one).
  5. Click Show next to key1 → copy the key value.
  6. Store in your password manager — distribute to class verbally during Module 2 only.

Databricks notebooks use STORAGE_ACCOUNT_KEY with the abfss:// path to nyc-taxi-data.
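For orientation, here is a minimal sketch of how an abfss:// path and the matching account-key setting are usually formed for this account and container. The helper names are hypothetical, and the actual variable names in 00_setup.py may differ.

```python
ACCOUNT = "mhpdeworkshopsa"
CONTAINER = "nyc-taxi-data"


def abfss_path(relative: str, account: str = ACCOUNT, container: str = CONTAINER) -> str:
    """Build an abfss:// URI for a path inside the workshop container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative.lstrip('/')}"


def account_key_conf(account: str = ACCOUNT) -> str:
    """Name of the Spark setting that holds the storage account key."""
    return f"fs.azure.account.key.{account}.dfs.core.windows.net"


print(abfss_path("raw/trips/"))
print(account_key_conf())
# In a Databricks notebook (not runnable here):
# spark.conf.set(account_key_conf(), STORAGE_ACCOUNT_KEY)
```

Because the key is tied to the account name in the setting, rotating key1 after attendees have run 00_setup.py means every attendee has to paste the new key again.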

B — Create SAS token in Azure Portal (Snowflake)

Snowflake external stages need Read + List on blobs in nyc-taxi-data. Microsoft documents SAS creation in the portal here: Create SAS tokens (Azure portal).

Recommended: container-scoped SAS (least privilege — only nyc-taxi-data):

  1. Azure Portal → storage account mhpdeworkshopsa.

  2. Left menu → Data storage → Containers → click nyc-taxi-data.

  3. Top menu → Generate SAS.

  4. Set fields:

    | Field | Workshop value |
    | --- | --- |
    | Signing method | Account key (Snowflake AZURE_SAS_TOKEN expects key-signed SAS on trial accounts) |
    | Permissions | ✅ Read, ✅ List only; leave Write / Delete / Add unchecked |
    | Start | Today (or workshop morning) |
    | Expiry | Workshop date + 2 days buffer |
    | Allowed IP addresses | (leave empty for classroom) |
    | Allowed protocols | HTTPS only |
    | Signing key | key1 |

  5. Click Generate SAS token and URL.

  6. Copy only the SAS token field (for a container-scoped SAS, a query string like sp=rl&st=...&se=...&spr=https&sv=...&sr=c&sig=...).

    • For Snowflake CREDENTIALS = (AZURE_SAS_TOKEN = '...'), paste the token without a leading ?.
    • The portal shows the token once — save it immediately; you cannot retrieve it later.
  7. Optional: copy Blob SAS URL to test in a browser or Azure Storage Explorer.
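Before printing the token on cards, a quick offline check catches the usual copy-paste problems (stray spaces, a leading ?, wrong permissions, an expiry that is too early). This is a sketch only: it assumes the se field uses the portal's UTC format (YYYY-MM-DDTHH:MM:SSZ) and raises ValueError if se is missing or in another format. The sample token is made up.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def inspect_sas(raw: str, now=None) -> dict:
    """Normalise a pasted SAS token and report permissions and expiry status."""
    token = raw.strip().lstrip("?")  # Snowflake wants no spaces and no leading '?'
    fields = parse_qs(token)
    perms = fields.get("sp", [""])[0]
    expiry_raw = fields.get("se", [""])[0]
    expiry = datetime.strptime(expiry_raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return {
        "token": token,
        "read_list_only": set(perms) == {"r", "l"},
        "expired": expiry <= now,
    }


# Made-up token for illustration; the signature is not real.
pasted = " ?sp=rl&se=2030-01-01T00%3A00%3A00Z&spr=https&sv=2024-11-04&sr=c&sig=abc "
print(inspect_sas(pasted, now=datetime(2029, 1, 1, tzinfo=timezone.utc)))
```

The check does not contact Azure, so a token that passes here still needs the LIST test in Snowflake before class.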

Alternative: account-level SAS (broader scope — use only if container SAS is unavailable):

  1. Storage account mhpdeworkshopsa → Security + networking → Shared access signature.

  2. Configure:

    | Field | Workshop value |
    | --- | --- |
    | Allowed services | Blob only |
    | Allowed resource types | Container + Object |
    | Allowed permissions | Read + List |
    | Start / Expiry | Workshop date → +2 days |
    | Allowed protocols | HTTPS only |
    | Signing key | key1 |

  3. Generate SAS and connection string → copy the SAS token query string (same rules as above).

C — Verify before class

Azure Portal / Storage Explorer

  1. Containers → nyc-taxi-data → raw/trips/: Parquet files visible.
  2. Confirm raw/lookup/taxi_zone_lookup.csv exists.

Snowflake (trainer account)

Run after pasting SAS into snowflake/setup/02_external_stage.sql:

-- Replace 00_JUEWEI with your trainer ID (uppercase in Snowflake)
LIST @00_JUEWEI_BRONZE.nyc_taxi_trips_stage;
LIST @00_JUEWEI_BRONZE.nyc_taxi_lookup_stage;

Both commands must return file names (not an Access denied error or an empty result).

Databricks (trainer dry-run)

After 00_setup.py with ADLS2 key:

# In a notebook cell — should print a row count, not auth error
spark.read.parquet(TRIPS_DATA_PATH).count()

D — Distribute to attendees

| Item | When | Format |
| --- | --- | --- |
| Storage account key | Module 2 | Verbal only |
| SAS token | Module 3 | Printed card; warn about expiry date |

Common failures: expired SAS, extra spaces when copy-pasting token, using account key in Snowflake stage SQL, or regenerating key1 after Databricks setup without telling the class.

Aiven Kafka (Module 8 only)

  1. Create Aiven Kafka cluster — see aiven-streaming-setup.qmd
  2. Start User Activity generator (4 hours max on free tier — start on the morning of Module 8, not before)
  3. Download SSL certificates: ca.pem, service.cert, service.key
  4. Note the Service URI: kafka-xxxxx.aivencloud.com:12345
  5. Start the relay consumer (streaming/snowflake/00_relay_consumer.py) — requires ADLS2 key

Power BI (Module 7 demo + optional trainee self-paced)

Trainer — build the dashboard (Desktop, once before class)

  1. Use a Windows machine with Power BI Desktop (free to install)
  2. Follow Exercise: Power BI or powerbi/README.md — connect to trainer Gold: Snowflake {trainer_id}_GOLD or Databricks {trainer_id}_gold
  3. Connect via Snowflake or Azure Databricks — load all 12 kpi_* tables · choose Import for the main workshop demo
  4. Build all five pages: Overview, Map, Time Analysis, Revenue, Efficiency — see Module 7 §3.1
  5. File → Save As → YellowLine-NYC-KPIs.pbix (keep a local copy for offline demo fallback)

Trainer — publish to cloud workspace (if you have Power BI Pro / Fabric)

Reports are authored in Desktop; your cloud workspace hosts, refreshes, and shares them. You cannot realistically build this five-page Snowflake/Databricks dashboard from scratch in the browser alone.

| Step | Where | Action |
| --- | --- | --- |
| 1 | Desktop | Sign in with your work Microsoft account (same tenant as the workspace) |
| 2 | Desktop | Home → Publish → select your workshop workspace (not My workspace unless you have no shared workspace) |
| 3 | Service | Open Power BI → workspace → confirm report + semantic model appear |
| 4 | Service | Refresh now on the dataset; Snowflake warehouse DE_WORKSHOP_WH must be Started |
| 5 | Service | Open each of the five report pages, especially Map (Azure Maps geocoding needs network) |
| 6 | Service (optional) | Settings → Scheduled refresh: daily refresh before class if using Import |

Sharing and licensing

| Your setup | What trainees need to view your published report |
| --- | --- |
| Workspace on Premium / Fabric capacity | Often no Pro; share workspace or report link (viewer) |
| Workspace without Premium capacity | Viewers typically need Power BI Pro (or you screen-share only) |
| No org workspace | Demo from Desktop screen share; still works, no attendee license needed |

Co-trainer access: add them as Member or Contributor on the workspace so they can open the report before Module 7.

Module 7 demo (optional, ~10 min) — pick one path:

| Path | When to use |
| --- | --- |
| A: Service (browser) | You published to a cloud workspace; maps and refresh tested in app.powerbi.com |
| B: Desktop (local) | Fallback if Service refresh fails, or you have no shared workspace |
| C: Skip live demo | Point trainees to Exercise: Power BI; animation already showed the dashboard |

Trainees — self-paced after Module 4 (optional)

  • Not part of main-day timing — no classroom block required
  • Prerequisites: Gold KPI tables from Modules 2–4; Windows + free Desktop only
  • Point trainees to Power BI setup and Exercise: Power BI after dbt lab
  • macOS/Linux attendees: read-only / defer to post-workshop Windows machine

Say once after Module 4:

“Priya’s dashboard is optional self-paced work — if you have Windows, install free Power BI Desktop and connect to the same Gold tables you just built. Full steps are in Setup → Power BI.”


Before the room opens (T-30 min)

| Task | Owner | Done |
| --- | --- | --- |
| Test projector / second screen for animations | Lead | [ ] |
| Open story site: https://mhp-data-engineer-2026.pages.dev/ (mirror: Vercel; fallback quarto preview port 4201) | Co-trainer | [ ] |
| Verify Databricks workspace mhpdeworkshop_databricks: Git folder MHP-DE-Workshop-2026, catalog grants, clusters Terminated (see Pre-class setup) | Co-trainer | [ ] |
| Verify Snowflake on trainer’s own trial account (students create theirs during class) | Co-trainer | [ ] |
| Prepare credentials to distribute: SAS token, ADLS2 storage key, attendee IDs | Co-trainer | [ ] |
| Print architecture decision matrix (1 per trainee) | Co-trainer | [ ] |
| Open Google Form URLs; QR / short links ready per module | Co-trainer | [ ] |
| Photo / save blank whiteboard space for Story sketch | Lead | [ ] |
| Power BI: published report opens in cloud workspace or local .pbix on Desktop (Module 7 demo) | Co-trainer | [ ] |
| .env / Codespaces tested on one machine | Co-trainer | [ ] |

Credentials & Materials to Distribute

Each attendee needs the following credentials and materials during the workshop. Prepare these before class and distribute at the appropriate module.

| Item | When to distribute | Format | Notes |
|---|---|---|---|
| ATTENDEE_ID | Start of day (Module 1) | Printed card or slide | e.g., 01_alice, 02_bob — used in every schema/table name |
| Databricks workspace URL | Module 2 | Invite link via email | Trainer sends workspace invite to each attendee’s email before class |
| Trainer notebook paths | Module 2 | Slide or printed | Workspace → Users → <lead or co-trainer email> → MHP-DE-Workshop-2026 — both trainers share Can Run |
| ADLS2 Storage Account Key | Module 2 | Verbal or printed | Used by Databricks 00_setup.py to read Parquet from ADLS2. Never commit to Git. |
| SAS Token | Module 3 | Printed card or slide | Used by Snowflake 02_external_stage.sql to create External Stage. Has an expiry date — generate fresh before each workshop. |
| Databricks Personal Access Token | Only if using dbt with Databricks target | Self-service | Attendee generates their own via Settings → Access Tokens. Not needed if dbt only targets Snowflake. |
| Snowflake account | Self-service (before or during Module 3) | Attendee creates own trial at signup.snowflake.com | Attendee is ACCOUNTADMIN on their own account; 00_account_setup.sql creates DE_WORKSHOP_ROLE |
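The SAS token entry above notes that the token carries an expiry date and should be regenerated before each workshop. As a rough pre-distribution check, the sketch below parses a token's `se` (signed expiry) field and flags stray whitespace. The function name `sas_token_problems` and the sample token in the usage note are hypothetical; only the `se` query parameter is standard Azure SAS syntax.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_token_problems(token: str, now: datetime) -> list[str]:
    """Return human-readable problems found in a SAS token string.

    `now` must be timezone-aware (UTC is the natural choice).
    """
    problems = []

    # Copy/paste from a handout often adds spaces or newlines.
    cleaned = token.strip()
    if cleaned != token:
        problems.append("leading or trailing whitespace")

    # A SAS token is a URL query string; 'se' holds the expiry time.
    params = parse_qs(cleaned.lstrip("?"))
    expiry_values = params.get("se")
    if not expiry_values:
        problems.append("no 'se' (expiry) field found")
        return problems

    # Accept the usual trailing 'Z' on older Python versions too.
    expiry = datetime.fromisoformat(expiry_values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # assumption: SAS times are UTC

    if expiry <= now:
        problems.append(f"expired at {expiry.isoformat()}")
    return problems
```

A call such as `sas_token_problems(token, datetime.now(timezone.utc))` returning an empty list suggests the token is at least well-formed and unexpired; it does not prove the signature or permissions are valid, so the `LIST` check on the trainer account is still needed.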
Important: Snowflake is self-service

Unlike Databricks (trainer-managed workspace), each attendee creates their own Snowflake trial account. This means:

  • Attendees are ACCOUNTADMIN on their own accounts
  • They run 00_account_setup.sql themselves during Module 3 — this creates the database (DE_MASTERCLASS), warehouse (DE_WORKSHOP_WH), role (DE_WORKSHOP_ROLE), and personal schemas
  • The trainer cannot pre-verify attendee Snowflake accounts — only the trainer’s own account can be verified beforehand
  • dbt connects to Snowflake using the attendee’s own credentials (username/password + DE_WORKSHOP_ROLE)

Databricks Workspace

  1. Login — Open workspace mhpdeworkshop_databricks (ID 3359135813781456). Confirm sidebar shows Workspace, Catalog, Compute, SQL Editor.
  2. Git folders (both trainers) — each Home → MHP-DE-Workshop-2026, both trainer schemas exist (e.g. 00_juewei_*, 00_alisa_*), Share shows workshop_trainees_2026 Can Run.
  3. Compute — Compute page: clusters Terminated (not Error); Workshop policy visible; attendees can start or attach to pre-created de-workshop-{id} clusters.
  4. Unity Catalog — Catalog → mhpdeworkshop_databricks_2026. Confirm workshop_trainees_2026 has USE CATALOG + CREATE SCHEMA; test schemas exist after dry-run of 00_setup.py.
  5. SQL Editor — Run SELECT 1 AS test on a SQL warehouse or cluster.
  6. Secrets (Module 8) — databricks secrets list --scope workshop-scope returns five keys.
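Check 4 above expects `workshop_trainees_2026` to hold USE CATALOG and CREATE SCHEMA on the workshop catalog. If those grants turn out to be missing, the statements can be regenerated with a small helper like the one below. This is a sketch, not the Step 5 script: `catalog_grants` and `quote_ident` are hypothetical names, and the output should be compared with the pre-class setup before running it.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def catalog_grants(catalog: str, group: str) -> list[str]:
    """Build one GRANT statement per catalog-level privilege the checklist names."""
    privileges = ["USE CATALOG", "CREATE SCHEMA"]
    return [
        f"GRANT {privilege} ON CATALOG {quote_ident(catalog)} TO {quote_ident(group)}"
        for privilege in privileges
    ]


if __name__ == "__main__":
    for statement in catalog_grants(
        "mhpdeworkshop_databricks_2026", "workshop_trainees_2026"
    ):
        # In a Databricks notebook this line could be spark.sql(statement)
        # instead; printing keeps the sketch runnable anywhere.
        print(statement)
```

Quoting both names is a defensive choice: it costs nothing for these identifiers and avoids surprises if a later cohort uses a group or catalog name with unusual characters.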

Snowflake Snowsight (trainer’s own account)

These checks run on the trainer’s own Snowflake trial account to verify the UI paths work correctly. Attendees create their own accounts during Module 3.

  1. Login — Open your Snowsight URL. Confirm the left sidebar shows: Projects, Data, Compute, Admin sections.
  2. Warehouse — At the top-right, verify DE_WORKSHOP_WH warehouse exists and is selectable. If suspended, click to resume and confirm it shows Started within ~10 seconds.
  3. Role — Verify DE_WORKSHOP_ROLE is available in the role selector dropdown (created by 00_account_setup.sql).
  4. Worksheets — Navigate to Projects → Worksheets. Create a test worksheet → run SELECT CURRENT_VERSION() → confirm the result appears.
  5. Databases — Navigate to Data → Databases → DE_MASTERCLASS. Confirm your own schemas exist (e.g. 00_JUEWEI_BRONZE, _SILVER, _GOLD — uppercase stem from your ATTENDEE_ID).
  6. External Stage — Run LIST @00_JUEWEI_BRONZE.nyc_taxi_trips_stage (replace with your trainer ID) to verify the SAS token works and Parquet files are listed.
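The schema names in check 5 follow from the ATTENDEE_ID: the Snowflake side uses the uppercase stem plus a layer suffix. A minimal sketch of that derivation is shown below, which can help when reading an attendee's screen and working out which schemas they should see. The function names are hypothetical, and the actual naming logic lives in 00_account_setup.sql and 00_setup.py, which this sketch does not reproduce.

```python
LAYERS = ("BRONZE", "SILVER", "GOLD")


def snowflake_schemas(attendee_id: str) -> list[str]:
    """Expected Snowflake schema names: uppercase ATTENDEE_ID stem + layer."""
    stem = attendee_id.strip().upper()
    return [f"{stem}_{layer}" for layer in LAYERS]


def databricks_gold_schema(attendee_id: str) -> str:
    """Expected Databricks Gold schema, following the {attendee_id}_gold pattern."""
    return f"{attendee_id.strip()}_gold"


if __name__ == "__main__":
    print(snowflake_schemas("00_juewei"))
    print(databricks_gold_schema("01_alice"))
```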

dbt (Docker / Codespaces)

  1. Codespaces — Open a Codespace from the fork → run dbt --version in the terminal → confirm Core 1.8.x with adapters snowflake and databricks.
  2. Docker — Pull the workshop image: docker pull ghcr.io/mhp-data-engineer/workshop-dbt:2026 → run docker run --rm ghcr.io/mhp-data-engineer/workshop-dbt:2026 dbt --version.
  3. Connection test — Inside the environment: cd dbt_project && dbt debug --target snowflake → confirm All checks passed!.
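When the connection test in step 3 fails, the usual suspects are the database, role, and credential fields of the Snowflake target. The sketch below checks an already-loaded target mapping against the workshop values. It is an illustration only: `snowflake_target_problems` is a hypothetical helper, and the expectation that the profile names DE_WORKSHOP_WH as its warehouse is an assumption based on the account setup script, not something the dbt project is confirmed to require.

```python
EXPECTED = {
    "database": "DE_MASTERCLASS",
    "role": "DE_WORKSHOP_ROLE",
    "warehouse": "DE_WORKSHOP_WH",  # assumption: profile uses the workshop warehouse
}
REQUIRED = ("account", "user", "password")


def snowflake_target_problems(target: dict) -> list[str]:
    """Compare one dbt output target (a plain dict) with the workshop settings."""
    problems = []

    if target.get("type") != "snowflake":
        problems.append(f"type: expected snowflake, found {target.get('type')!r}")

    # Snowflake resolves unquoted names case-insensitively, so compare in uppercase.
    for key, expected in EXPECTED.items():
        actual = str(target.get(key) or "")
        if actual.upper() != expected:
            problems.append(f"{key}: expected {expected}, found {actual or 'nothing'}")

    # Credentials only need to be present; their values cannot be checked offline.
    for key in REQUIRED:
        if not target.get(key):
            problems.append(f"{key}: missing")

    return problems
```

The target mapping is the entry under `outputs` → `snowflake` in profiles.yml, which can be loaded with a YAML parser such as PyYAML's `yaml.safe_load`. An empty result only means the file looks plausible; `dbt debug --target snowflake` remains the real test.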

Fallback if MP4 missing: Read animation beat from voiceover scripts while showing module story callout on screen.


Trainer roles

| Role | Responsibility |
|---|---|
| Lead | Story narration, reflection facilitation, theory, Module 7 discussion |
| Co-trainer | Lab roaming, environment issues, timing nudges, Power BI demo |
| Both | Never leave a stuck pair >5 min without a hint or checkpoint offer |

Main-day schedule

| Time | Module | Focus | Watch for |
|---|---|---|---|
| 09:00 | Story Welcome | Design worksheet | Save whiteboard photo |
| 09:30 | 1 Fundamentals | Medallion + KPIs | Keep to 35 min |
| 10:00 | 2 Databricks | Core lab | Do not steal lab time for discussion |
| 11:30 | 3 Snowflake | Core lab | Same KPIs narrative |
| 12:45 | Lunch | 45 min | |
| 13:30 | 4 dbt | Core lab | dbt ≠ warehouse |
| 15:00 | 5 Production | Scheduling / CI | LSDP naming |
| 15:45 | 6 AI | Cortex LLM only | Not Module 9 ML |
| 16:30 | 7 Wrap-up | Discussion + optional PBI | Matrix handout at silent write |
| 17:00 | End | | |

Hard stops: Start Module 2 by 10:00 · Start Module 3 by 11:30 · Start Module 4 by 13:30.


Per-module checklist (repeat every module)

Module-specific notes:

| Module | Trainer note |
|---|---|
| Story | Capture design whiteboard — revisit at 16:30 |
| 2 | Sofia voice: prototype before SQL simplification |
| 3 | “Same architecture. Different implementation philosophy.” |
| 4 | Elena: dbt on Snowflake |
| 6 | Do not demo ML.FORECAST |
| 7 | Theory ≤5 min · open discussion guide |

Per-module UI checkpoints (co-trainer verifies before each module)

| Module | Attendee UI should show | Co-trainer check |
|---|---|---|
| 2 Databricks | Cluster Running (green dot) in Compute page; notebooks visible in Workspace | Confirm all attendee clusters started; notebooks accessible in Shared folder |
| 3 Snowflake | Snowsight open on attendee’s own trial account; 00_account_setup.sql completed; warehouse Started | Walk around — confirm each attendee has DE_MASTERCLASS database and DE_WORKSHOP_ROLE created; SAS token distributed and working |
| 4 dbt | Terminal open in Codespaces or Docker with dbt_project/ directory; dbt debug --target snowflake passing | Walk around — check terminals for green All checks passed!; confirm profiles.yml uses DE_WORKSHOP_ROLE and DE_MASTERCLASS |
| 5 Production | Jobs & Pipelines page accessible in Databricks (formerly Workflows → Delta Live Tables; renamed to Lakeflow Declarative Pipelines); Snowflake worksheets with Task SQL ready | Pre-create one Lakeflow pipeline as demo; verify TASK_HISTORY() returns data |
| 6 AI Features | Genie icon visible in Databricks sidebar (under SQL section); Snowflake worksheets ready for Cortex SQL | Confirm AI_COMPLETE returns results (run test query); Genie page loads |
| 7 Wrap-up | No portal needed — whiteboard and discussion only | Print architecture decision matrix handouts |

Module 6 — Databricks AI prerequisites

Complete Step 10 and Step 10b during pre-class setup. The co-trainer verifies the table below before Module 6 (after attendees have Gold tables from Modules 2–4).

| Check | Pass criteria |
|---|---|
| Partner-powered AI | On at account + workspace (Step 10b) |
| SQL warehouse | de-workshop-wh (Pro or Serverless) Started; trainees have Can use |
| Databricks SQL entitlement | Enabled for workshop_trainees_2026 (Step 4) |
| UC data access | SELECT on {attendee_id}_gold / kpi_* tables |
| Gold tables exist | 03_gold_kpis.py completed — Module 6 builds on Gold, not Bronze |
| Assistant | Ctrl+I in a notebook returns a code suggestion |
| ai_query() | Test query on de-workshop-wh returns a response (model name verified) |
| Genie | Sidebar icon loads; trainer test space answers a question on Gold KPIs |
| Snowflake Cortex | AI_COMPLETE test query returns on trainer Snowflake account |

Azure-specific note: ai_query() on Pro SQL warehouses requires Azure Private Link enabled for the workspace. If the test query fails with a Private Link error, use a Serverless SQL warehouse instead or work with MHP IT to enable Private Link.

Throughput (classroom scale): Genie defaults to ~20 questions/min per workspace — sufficient for ~20–30 trainees. Do not run a Genie stress test during the module.

| Module | Attendee UI should show | Co-trainer check |
|---|---|---|
| 8 Streaming | Databricks cluster with Kafka Maven libs installed; Snowflake warehouse running | Verify Kafka libs on cluster (Compute → Libraries tab); Aiven topic has events flowing |
| 9 ML | Databricks AI/ML → Experiments page accessible; Snowflake worksheets ready for ML.FORECAST | Confirm USE AI FUNCTIONS privilege + CORTEX_USER role granted; ML Runtime cluster available |


Module 7 runbook (30 min block)

| Min | Activity | Doc |
|---|---|---|
| 0–3 | Animation mod-07-wrapup.mp4 | voiceovers |
| 3–5 | Silent write + decision matrix | matrix |
| 5–10 | Short theory: Objectives, PBI demo notes, When to Use What | Module 7 |
| 10–28 | Open discussion (Rounds 1–4) | discussion guide |
| 28–30 | Close: three constraints + “Technology is a decision…” | |
| +10 | Optional Power BI live demo | § Power BI — Service or Desktop |

If running PBI demo before discussion, cut Round 3 synthesis to 5 min.


Common classroom fixes

| Situation | Response |
|---|---|
| Pair stuck on Bronze ingest | Point to checkpoint data / co-trainer pairs in |
| “dbt replaces Snowflake” | Draw platform box; dbt inside as transform layer |
| Reflection runs long | 5-min timer; capture 3 bullets max on whiteboard |
| Running late in Module 2–4 | Cut discussion to 5 min — never cut lab |
| Vendor debate in Module 7 | “We’re advising Marcus, not picking a winner for MHP.” |
| Attendee cannot find notebooks in Databricks | Confirm you shared MHP-DE-Workshop-2026 → workshop_trainees_2026 Can Run; guide to Workspace → Users → <trainer-email> → MHP-DE-Workshop-2026 |
| Trainee cannot Create compute / no Workshop policy | Grant Workshop policy Can use to workshop_trainees_2026 (Step 6) |
| Module 8: PermissionDenied on dbutils.secrets.get | Run secrets put-acl READ for workshop_trainees_2026 on workshop-scope (Step 11) |
| Attendee cannot change ATTENDEE_ID | Shared folder is read-only — Clone 00_setup.py to Home |
| Trainees see old notebook content after GitHub update | Git folders do not auto-sync; both trainers must Git → Pull on MHP-DE-Workshop-2026; trainees on shared folder inherit trainer’s checkout |
| “Environment configurations are not saved” on Git .py notebook | Expected until PEP 723 metadata is in the file — repo notebooks include it after pull. Tell attendees to attach a cluster (not Serverless); do not add PySpark in the Environment panel |
| Attendee’s Databricks cluster won’t start | Check Compute page for error message; try Restart; if stuck >5 min, assign a buddy cluster |
| Snowflake trial signup fails | Suggest using a different email; check spam folder for verification email; trial creation can take 5–10 min |
| Snowflake 00_account_setup.sql fails | Check attendee is using ACCOUNTADMIN role (default for trial); confirm SET attendee_id = '...' was run first |
| Snowflake DE_WORKSHOP_ROLE not found | The role is created by 00_account_setup.sql — re-run the script; or manually: CREATE ROLE DE_WORKSHOP_ROLE; |
| Snowflake External Stage cannot list files | SAS token may be expired or have extra spaces; re-copy from trainer handout; verify stage URL matches mhpdeworkshopsa.blob.core.windows.net |
| Snowflake warehouse shows “Suspended” | Click warehouse name at top-right → click Resume; wait ~10 seconds for Started status |
| Snowflake worksheet shows “No results” | Check session variable: SELECT $attendee_id; — if null, re-run the SET statement at the top |
| dbt debug fails with connection error | Check profiles.yml — confirm database: DE_MASTERCLASS, role: DE_WORKSHOP_ROLE, and correct Snowflake account/user/password |
| Databricks Experiments page is empty | The experiment appears after the first mlflow.start_run() call — run the training notebook first |
| Databricks “Cannot see catalog” error | Re-run GRANT on mhpdeworkshop_databricks_2026 from Step 5; or Catalog → Permissions UI |
| Git folder clone fails / repo too large | Enable sparse checkout with cone pattern databricks/notebooks only; see Git folders |
| Git folder commit rejected | New folder outside cone pattern — add pattern under Git folder Settings → Advanced → Cone patterns |
| Databricks CLI 403 Forbidden | PAT may be expired or lack admin scope — generate a new token: Settings → User Settings → Access Tokens; ensure workspace has admin consent for CLI apps |
| Module 8: workshop-scope secrets scope missing | Trainer must create scope before class (see Pre-class setup); or guide attendee: databricks secrets create-scope --scope workshop-scope |
| ML.FORECAST returns error in Snowflake | Verify GOLD_TRIPS_BY_HOUR table exists and has PICKUP_HOUR_TS timestamp column; check Cortex role |
| Genie icon visible but spaces won’t open | Enable Partner-powered AI features at account level — see Step 10b |
| ai_query() fails or “not supported” | Attach Pro or Serverless warehouse (not Classic); on Azure Pro, check Private Link or switch to Serverless |
| ai_query() model not found | List foundation models in workspace; update endpoint name in lab — see Module 6 model drift callout |
| Genie returns empty / permission error | Grant SELECT on Gold kpi_* tables in UC; confirm Genie space default warehouse is de-workshop-wh |
| Genie Code / Assistant missing in notebooks | Confirm Partner-powered AI features = On (workspace + account); try Ctrl+I in a notebook — no separate assistive toggle in current UI |
| Power BI cannot connect to Snowflake | Verify warehouse is Started; check server URL matches <account>.snowflakecomputing.com; use DirectQuery mode |

End-of-day close (2 min script)

One dataset. Three implementations. Priya’s dashboard didn’t care which engine built Gold.

Three constraints — cost, performance, compliance. Three decisions — platform, transform, consumption.

Look at your Story sketch. You weren’t wrong to guess. Now you’ve proved it in code.

Technology is a decision. Architecture is responsibility.

Optional: 1–5 finger poll — “I could defend my tool choice to a client.”


After class

| Task | Owner |
|---|---|
| Save Story + Module 7 whiteboard photos | Lead |
| Note timing overruns for next delivery | Both |
| Log environment issues (catalog, warehouse, dbt target) | Co-trainer |
| Share tool comparison deep dive link for self-study | Lead |

Full dry-run checklist: docs/dry-run-checklist.md · pre-class-checklist.qmd


Document history

| Date | Change |
|---|---|
| 2026-06-05 | Git folder sync: manual Pull required (Step 3b); cluster vs Serverless; environment banner |
| 2026-06-05 | Added Power BI cloud workspace publish workflow for trainers (Desktop build → Service demo); aligned with five-page / 12-KPI exercise |
| 2026-06-05 | Updated per-module checklist for five-step rhythm; fixed aiven-kafka → workshop-scope secret scope |
| 2026-06-05 | Databricks + ADLS2 both documented under RG 1000_data_engineering_workshop |
| 2026-06-05 | ADLS2: direct Azure Portal link to RG 1000_data_engineering_workshop (mhpdev.onmicrosoft.com) |
| 2026-06-05 | ADLS2: document pre-provisioned mhpdeworkshopsa (RG 1000_data_engineering_workshop, Germany West Central) |
| 2026-06-05 | ADLS2: TLC download + upload to raw/trips/ and raw/lookup/ before key/SAS distribution |
| 2026-06-05 | ADLS2: Azure Portal steps for storage key + container SAS; two-trainer Git folder model |
| 2026-06-05 | Groups workshop_trainees_2026 / workshop_trainer_2026; Step 7 trainee self-create cluster; SELECT + secrets READ ACL; entitlements in Step 4 |
| 2026-06-05 | Added Module 6 Databricks AI prerequisites (Steps 10/10b, pre-class checks, classroom fixes) |
| 2026-06-05 | Step 10b: Partner-powered AI is sole workspace toggle; Genie Code verify checklist (no separate assistive toggle) |
| 2026-06-05 | Trainer IDs 00_{firstname} (2026: 00_juewei/00_alisa); reusable naming convention for other cohorts |
| 2026-06-05 | Two-trainer model, mandatory Git folder share to trainees; Cloudflare URL |
| 2026-06-05 | Expanded Databricks workspace admin guide (mhpdeworkshop_databricks, Git folders, UC grants); Cloudflare production URL |
| 2026-06-04 | Added pre-class infrastructure setup section (Databricks workspace, ADLS2, Aiven, Power BI); added Unity Catalog GRANT statements, cluster creation guide, Databricks CLI auth, SQL Warehouse setup |
| 2026-05-24 | Initial day-of facilitator runbook |