Prerequisites

What you need before the training day

Before You Start — 3 Steps

Follow these steps before the workshop day. They take about 10–15 minutes.


Step 1: Fork the Repository

  1. Open github.com/jinjuewei/MHPDataEngineerWorkshop
  2. Click Fork (top-right) → keep the defaults → Create fork
  3. You now have your own copy at github.com/<your-username>/MHPDataEngineerWorkshop

Why fork? Your fork is your personal working copy. You can commit changes, push code, and open Codespaces from it without affecting other trainees.


Step 2: Open the Training Website

The training site (module content, setup guides, exercises) is built into the repository. Open it from your fork using GitHub Codespaces — no installation needed.

  1. Go to your fork on GitHub → Code → Codespaces → Create codespace on main
  2. Wait ~2 minutes for the container to build
  3. In the Codespace terminal:
cd workshop-2026-v2/_site
python -m http.server 8000
  4. Codespaces auto-forwards port 8000. Click Open in Browser in the notification.

Trainer-provided URL (recommended): https://mhp-data-engineer-2026.pages.dev/ — open directly in your browser; enter the access token your trainer shares. Mirror: Vercel.

Local machine (no Quarto needed):

cd workshop-2026-v2/_site
python -m http.server 8000
# Open http://localhost:8000

GitHub Pages (public repos only):

  1. Fork → Settings → Pages
  2. Branch: main, Folder: /workshop-2026-v2/_site → Save
  3. Live at https://<your-username>.github.io/MHPDataEngineerWorkshop


Step 3: Open Your Working Environment

Your Codespace from Step 2 is already your working environment — all tools are pre-installed at /workspace/. The only remaining step is configuring your credentials (see Configure Credentials below).

Tool | Status
Python 3.13+ | ✅ pre-installed
dbt Core 1.8+ | ✅ pre-installed
Git | ✅ pre-installed
VS Code | ✅ browser-based

Local machine (instead of Codespaces): clone your fork, then install the dependencies:

git clone https://github.com/<your-username>/MHPDataEngineerWorkshop.git
cd MHPDataEngineerWorkshop

# Install dependencies
uv pip install --system .

# Verify dbt
dbt --version
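
For local setups, the tool list above can also be checked in one go. The following is a minimal sketch, not part of the workshop repository: the function name check_local_tools is hypothetical, and it only tests whether dbt and git are on the PATH and whether the running Python meets the minimum version. It does not check the dbt version, so dbt --version is still needed for that.

```python
import shutil
import sys


def check_local_tools(min_python=(3, 13)):
    """Return a mapping of tool name -> True if the tool looks available.

    Hypothetical helper: checks the running interpreter version and
    whether `dbt` and `git` can be found on the PATH.
    """
    return {
        "python": sys.version_info >= min_python,
        "dbt": shutil.which("dbt") is not None,
        "git": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for tool, ok in check_local_tools().items():
        print(f"{tool}: {'found' if ok else 'missing'}")
```

Any entry reported as missing points to a tool that still needs to be installed before the workshop.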

Accounts & Access

Databricks

Snowflake

Workshop Credentials (Provided by Trainer)

dbt

  • Codespaces: pre-installed ✅ — no action needed
  • Local: Python 3.13+ and dbt Core 1.8+ required (dbt --version)

Power BI (optional — self-paced after Module 4)

Not required for the main workshop day. Priya’s dashboard exercise connects to your Gold kpi_* tables after the pipeline labs.

Requirement | Notes
Windows PC | Power BI Desktop is Windows only (free download)
Microsoft account | Optional; only needed if you publish to My Workspace (free personal workspace)
Gold KPI tables | Complete Modules 2–4 exercises first (Databricks / Snowflake + dbt)
Same warehouse credentials | Snowflake login or Databricks PAT, already in your .env

Module 7: Trainer may demo a pre-built dashboard. Trainees without Windows can follow the story and read the Power BI dashboard guide on this site.


Configure Credentials

Where to run these commands?

  • Codespaces users: run everything below in the Codespace terminal (the VS Code terminal inside your browser-based Codespace, not your local machine). Your Codespace is your working environment.
  • Local machine users: run everything in your local terminal after cd MHPDataEngineerWorkshop.

1. Environment Variables (.env)

Many scripts and notebooks read credentials from environment variables. Set them up first:

# Run this in your Codespace terminal (or local terminal)
cp .env.template .env

This creates a personal .env file inside your Codespace (or local repo). It will not be uploaded to GitHub — .gitignore keeps it private.

Open .env in the Codespace editor and replace every placeholder with your real credentials:

Variables you must fill in

Variable | Where to find it | What it’s used for
ATTENDEE_ID | Provided by trainer (e.g., 01_alice) | All modules: creates your personal schemas (01_alice_BRONZE, _SILVER, _GOLD) in Databricks, Snowflake, and dbt so each trainee works in isolation
DATABRICKS_HOST | Your Databricks workspace URL (without https://) | Modules 2, 4, 8: connects Python scripts, dbt, and the Databricks CLI to your workspace
DATABRICKS_TOKEN | Databricks PAT (User Settings → Access Tokens) | Modules 2, 4, 8: authenticates API calls and dbt runs against your Databricks workspace
DATABRICKS_HTTP_PATH | SQL Warehouse HTTP path (SQL Warehouses → Connection details) | Module 4: dbt connects to the SQL Warehouse to run models on Databricks
SNOWFLAKE_ACCOUNT | Your Snowflake account URL (e.g., abc12345.west-europe.azure) | Modules 3, 4, 9: connects Snowpark Python scripts and dbt to your Snowflake trial
SNOWFLAKE_USER | Your Snowflake login username | Modules 3, 4, 9: Snowflake authentication
SNOWFLAKE_PASSWORD | Your Snowflake login password | Modules 3, 4, 9: Snowflake authentication
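
To make the ATTENDEE_ID naming pattern concrete, here is a small sketch that derives the three schema names from an attendee ID. It only follows the pattern shown in the table (01_alice_BRONZE, _SILVER, _GOLD). The function name schema_names is hypothetical, and how the workshop scripts actually build these names is not shown here.

```python
def schema_names(attendee_id):
    """Build per-trainee schema names, e.g. 01_alice -> 01_alice_BRONZE.

    Hypothetical helper that mirrors the naming pattern from the table;
    the workshop's own scripts may construct the names differently.
    """
    attendee_id = attendee_id.strip()
    if not attendee_id:
        raise ValueError("ATTENDEE_ID is empty")
    # One schema per medallion layer: bronze, silver, gold.
    return {
        layer: f"{attendee_id}_{layer.upper()}"
        for layer in ("bronze", "silver", "gold")
    }
```

Calling schema_names("01_alice") yields one schema per layer, which is why two trainees with different IDs never write to the same schema.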

DATABRICKS_HOST

  1. Log in to your Databricks workspace in the browser
  2. Copy the URL from the address bar. It looks like https://adb-1234567890.1.azuredatabricks.net
  3. Remove the https:// prefix → paste adb-1234567890.1.azuredatabricks.net
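
The trimming in step 3 is easy to get wrong when the address bar also contains a path or query string. As a sketch of the intended result (the helper name normalize_databricks_host is an assumption, not something the repository provides):

```python
def normalize_databricks_host(url):
    """Reduce a copied workspace URL to the bare host name.

    Hypothetical helper: strips the scheme and anything after the host,
    which is the form DATABRICKS_HOST is described as expecting.
    """
    host = url.strip()
    for prefix in ("https://", "http://"):
        if host.lower().startswith(prefix):
            host = host[len(prefix):]
    # Drop any path or query string that came along from the address bar.
    return host.split("/", 1)[0]
```

A value that is already in the bare form passes through unchanged, so it is safe to apply to whatever was copied.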

DATABRICKS_TOKEN

  1. In your Databricks workspace, click your username/email (top-right corner)
  2. Select Settings
  3. Go to the Developer tab
  4. Next to Access tokens, click Manage
  5. Click Generate new token
  6. Give it a name (e.g., workshop), set lifetime to 90 days
  7. Click Generate → copy the token immediately (it’s only shown once!)

DATABRICKS_HTTP_PATH

  1. In the Databricks sidebar, click SQL Warehouses
  2. Click the warehouse name (trainer will tell you which one, e.g., de-workshop-wh)
  3. Click the Connection details tab
  4. Copy the HTTP Path value. It looks like /sql/1.0/warehouses/abc123def456
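
A quick shape check can catch a mis-copied HTTP path before dbt reports a connection error. This sketch is based only on the example value above. The helper name looks_like_http_path is hypothetical, and a passing check does not prove that the warehouse exists.

```python
def looks_like_http_path(value):
    """Return True if value has the shape /sql/1.0/warehouses/<id>.

    Hypothetical shape check derived from the example path; it does not
    contact Databricks or verify that the warehouse exists.
    """
    prefix = "/sql/1.0/warehouses/"
    value = value.strip()
    warehouse_id = value[len(prefix):]
    return value.startswith(prefix) and warehouse_id.isalnum()
```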

SNOWFLAKE_ACCOUNT

  1. Log in to Snowsight
  2. Click your account name (bottom-left corner)
  3. Select View account details
  4. In the Account Details dialog, find Account URL
  5. Copy just the domain part, without https:// and without .snowflakecomputing.com
     • Example full URL: https://abc12345.west-europe.azure.snowflakecomputing.com
     • What to paste in .env: abc12345.west-europe.azure

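Tip: if you can’t find it, run this SQL in a Snowsight worksheet. It returns the account locator and the region in Snowflake's internal notation (for example AZURE_WESTEUROPE), so convert the region to the URL form shown above before pasting: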

SELECT CURRENT_ACCOUNT() || '.' || CURRENT_REGION();
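
The trimming of the Account URL described for SNOWFLAKE_ACCOUNT can be sketched as follows. The helper name snowflake_account_from_url is an assumption for illustration; it only removes the scheme and the .snowflakecomputing.com suffix from a copied URL.

```python
def snowflake_account_from_url(url):
    """Turn a full Snowflake account URL into the value for SNOWFLAKE_ACCOUNT.

    Hypothetical helper: strips https://, any trailing path, and the
    .snowflakecomputing.com suffix.
    """
    account = url.strip()
    if account.lower().startswith("https://"):
        account = account[len("https://"):]
    account = account.split("/", 1)[0]
    suffix = ".snowflakecomputing.com"
    if account.lower().endswith(suffix):
        account = account[: -len(suffix)]
    return account
```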

SNOWFLAKE_USER / SNOWFLAKE_PASSWORD: these are the credentials you created when you signed up for the Snowflake trial at signup.snowflake.com

Pre-filled defaults (no change needed)

These values are already correct — they match what snowflake/setup/00_account_setup.sql creates:

Variable | Default value | What it’s used for
SNOWFLAKE_WAREHOUSE | DE_WORKSHOP_WH | The compute warehouse created by the setup script; used by all Snowflake queries
SNOWFLAKE_DATABASE | DE_MASTERCLASS | The main database that holds your Bronze/Silver/Gold schemas
SNOWFLAKE_SCHEMA | PUBLIC | Default schema (your actual work goes to {ATTENDEE_ID}_SILVER etc.)
SNOWFLAKE_ROLE | DE_WORKSHOP_ROLE | The role with permissions to read stages and write to your schemas

Important: .env is listed in .gitignore — it will never be committed to Git. Never share this file.
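
Before moving on, it can help to confirm that none of the required variables was left empty. The sketch below assumes .env uses plain KEY=VALUE lines. The names parse_env and missing_variables are hypothetical, and because the placeholder text used in .env.template is not shown here, only empty or absent values are flagged.

```python
REQUIRED = (
    "ATTENDEE_ID",
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "DATABRICKS_HTTP_PATH",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(text, required=REQUIRED):
    """Return the required variable names that are absent or empty."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]
```

For example, missing_variables(open(".env").read()) returns an empty list once every required variable has a value. Unreplaced placeholders would still pass this check, so a visual review of the file remains worthwhile.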

Tip

Codespaces users — the Codespace starts before you create .env, so the variables are not loaded yet. You have two options:

Option A (quick): Run the setup script now to load everything immediately:

bash .devcontainer/setup-environment.sh

This loads .env, generates ~/.dbt/profiles.yml, and shows ✅ for each variable.

Option B (persistent): Restart the Codespace (⌘/Ctrl+Shift+P → Codespaces: Restart). On restart, .env is loaded automatically into all terminal tabs — no manual step needed.

2. dbt Profiles

~/.dbt/profiles.yml is generated automatically when you either:

  • Run bash .devcontainer/setup-environment.sh (Option A above), or
  • Restart the Codespace (Option B above)

Verify it:

dbt debug --target snowflake

Local machine: create profiles.yml by hand instead:

cd dbt_project/
cp profiles.yml.example profiles.yml
# Edit profiles.yml with your Databricks and Snowflake credentials

dbt debug   # verify connection
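
For orientation when editing profiles.yml by hand, the sketch below shows roughly how the .env variables could map onto the connection fields that the dbt Snowflake and Databricks adapters expect. It is an illustration under assumptions, not the output of setup-environment.sh: the function name build_dbt_outputs is hypothetical, and the Databricks schema value simply reuses the Silver naming pattern as a guess.

```python
def build_dbt_outputs(env):
    """Map .env-style values onto dbt target settings (illustrative only).

    `env` is any mapping with the workshop variables, e.g. os.environ.
    The defaults mirror the pre-filled values listed earlier.
    """
    return {
        "snowflake": {
            "type": "snowflake",
            "account": env["SNOWFLAKE_ACCOUNT"],
            "user": env["SNOWFLAKE_USER"],
            "password": env["SNOWFLAKE_PASSWORD"],
            "role": env.get("SNOWFLAKE_ROLE", "DE_WORKSHOP_ROLE"),
            "database": env.get("SNOWFLAKE_DATABASE", "DE_MASTERCLASS"),
            "warehouse": env.get("SNOWFLAKE_WAREHOUSE", "DE_WORKSHOP_WH"),
            "schema": env.get("SNOWFLAKE_SCHEMA", "PUBLIC"),
        },
        "databricks": {
            "type": "databricks",
            "host": env["DATABRICKS_HOST"],
            "http_path": env["DATABRICKS_HTTP_PATH"],
            "token": env["DATABRICKS_TOKEN"],
            # Assumption: the real profile may use a different schema.
            "schema": f"{env['ATTENDEE_ID']}_SILVER",
        },
    }
```

The two top-level keys correspond to the target names, which is why dbt debug --target snowflake selects the Snowflake connection. If dbt debug fails, comparing the fields in your profiles.yml against this mapping is a reasonable first check.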

Troubleshooting

Tip

If you have trouble, see the tool-specific setup guides:


Day-of Checklist

Before the training starts, verify: