flowchart LR
subgraph notebook ["Notebook"]
N["Your .py cells"]
end
subgraph compute ["Compute dropdown — pick one"]
C["Cluster<br/>Spark + Delta + UC"]
S["Serverless<br/>lightweight Python REPL"]
end
subgraph env ["Environment side panel"]
E["Python packages<br/>Serverless REPL only"]
end
N --> C
N -.-> S
E -.-> S
classDef cluster fill:#ff3621,color:#fff,stroke:#c42a15
classDef serverless fill:#94a3b8,color:#fff,stroke:#64748b
classDef panel fill:#e2e8f0,color:#1e293b,stroke:#94a3b8
classDef cells fill:#0057b8,color:#fff,stroke:#003d82
class N cells
class C cluster
class S serverless
class E panel
Module 2: Databricks Pipeline
PySpark, Unity Catalog, and Delta Lake
Duration: 75 min — Animation (3) · Think & Discuss (7) · Theory (15) · Quiz (3) · Practice (47)
1. Animation
2. Think & Discuss
Situation: Elena approved Databricks for the prototype. Raw Parquet sits in ADLS2. Priya waits for Gold tables for her Overview page.
Prompts:
- How would you ingest Parquet from ADLS2 into Bronze?
- What could go wrong if you skip quality checks and jump to KPIs?
- Which two Gold KPI tables unlock trips-by-hour and day-of-week charts?
3. Theory
Production Databricks pipelines use LSDP (formerly DLT). Lab notebooks use interactive PySpark — clearer for learning. Both implement the same medallion layers.
3.1 Theory: Databricks Essentials
Architecture
Databricks separates the control plane (notebooks, job definitions, user management — hosted by Databricks) from the data plane (Spark clusters running in your cloud account). Your code executes on cluster nodes; your data stays in your cloud storage (Azure ADLS2 in this workshop). This separation means compute and storage scale independently.
A core architecture component shown on Slide 11 is the Photon Engine — Databricks’ vectorized C++ execution engine that replaces the default JVM-based Spark runtime. Photon processes data in batches and significantly speeds up SQL queries, joins, and aggregations. Clusters with Photon enabled can achieve up to 25% lower latency with no code changes. Photon is enabled per-cluster (via cluster configuration) and is especially powerful for BI workloads where the same queries run repeatedly against large Gold tables.
- Workspace: Your Databricks environment — notebooks, jobs, catalogs, and collaboration features
- Cluster: Compute resource running Apache Spark. Clusters can auto-scale (e.g., 2–8 workers) and auto-terminate after inactivity. All-purpose clusters are for interactive development; job clusters spin up for a single pipeline run and terminate automatically — significantly cheaper for production.
Today’s notebooks are PySpark medallion pipelines (spark.read, spark.sql, saveAsTable) over ~3M rows in ADLS2. That work runs on a Spark cluster (data plane in your Azure subscription).
| UI control | What it is | This lab |
|---|---|---|
| Compute dropdown (top of notebook) | Where Spark runs | Attach your de-workshop-… cluster — start it if needed |
| Environment side panel | Serverless Python REPL (packages, base image) | Not used for this pipeline — do not add PySpark there |
Solid arrow = this lab (PySpark on a cluster). Dotted = Serverless path (not used for 01–03). Environment panel applies only to Serverless REPL — not to Spark on a cluster (serverless environment).
Why not Serverless for Module 2? Serverless configures a Python REPL (Environment side pane); high-memory settings there affect REPL memory, not Spark session size on a cluster (docs). Distributed spark.read / saveAsTable for this lab use an attached cluster. Databricks warns: do not install PySpark in the serverless Environment panel — it stops the session.
Production may use serverless SQL, serverless GPU, or Lakeflow pipelines elsewhere — this lab teaches the classic cluster + notebook path Elena approved for the prototype.
Operational steps (Git folder, Pull/sync, clone 00_setup, environment banner): see Exercise: Databricks — Before you start and the Databricks setup guide (Git folder sync).
- Unity Catalog: Centralized governance layer with a hierarchical object model:
flowchart TD
MS["Metastore<br/><i>governance boundary</i>"]
MS --> CAT["Catalog<br/><i>mhpdeworkshop_databricks_2026</i>"]
CAT --> S1["Schema: 01_alice_bronze"]
CAT --> S2["Schema: 01_alice_silver"]
CAT --> S3["Schema: 01_alice_gold"]
S1 --> T1["Tables"]
S1 --> V1["Views"]
S2 --> T2["Tables"]
S2 --> V2["Views"]
S3 --> T3["Tables"]
S3 --> V3["Views"]
S3 --> MV3["Materialized Views"]
classDef metastore fill:#01065c,color:#fff,stroke:#000940
classDef catalog fill:#0057b8,color:#fff,stroke:#003d82
classDef schema fill:#0369a1,color:#fff,stroke:#075985
classDef objects fill:#475569,color:#fff,stroke:#334155
class MS metastore
class CAT catalog
class S1,S2,S3 schema
class T1,T2,T3,V1,V2,V3,MV3 objects
The three-level namespace (catalog.schema.object) is what you use in SQL and PySpark. The metastore sits above all catalogs and is the trust boundary for access control, audit logging, and column-level lineage.
Unity Catalog now supports Metric Views — semantic layers that let you define KPIs once (e.g., “daily revenue” or “average fare per mile”) and reuse them consistently across dashboards, Genie Spaces, and alerts. This eliminates the risk of different teams computing the same metric with slightly different logic.
You will work primarily with tables (Bronze, Silver, Gold). Volumes and functions exist in the same namespace but are not used in this module.
- Delta Lake: Open-source table format built on Parquet. Every Delta table has a
_delta_log/directory containing JSON transaction files that record every write, update, and delete. This log enables ACID transactions (concurrent reads/writes without corruption), time travel (query the table as it was at any point:VERSION AS OF 0), and schema enforcement (reject writes that don’t match the expected schema). Periodic checkpoint files (Parquet snapshots of the log) keep read performance fast even on tables with thousands of commits.
flowchart LR
subgraph DeltaTable ["Delta Table (Parquet files)"]
P1["part-00000.parquet"]
P2["part-00001.parquet"]
P3["part-00002.parquet"]
end
subgraph DeltaLog ["_delta_log/"]
L0["00000.json<br/><i>CREATE TABLE</i>"]
L1["00001.json<br/><i>INSERT</i>"]
L2["00002.json<br/><i>UPDATE</i>"]
CK["00002.checkpoint.parquet"]
end
L0 --> P1
L0 --> P2
L1 --> P3
L2 -.->|"modifies"| P1
classDef parquet fill:#E25A1C,color:#fff,stroke:#c44a15
classDef log fill:#0369a1,color:#fff,stroke:#075985
classDef checkpoint fill:#6d28d9,color:#fff,stroke:#5b21b6
class P1,P2,P3 parquet
class L0,L1,L2 log
class CK checkpoint
Cluster cost management
Use auto-scaling clusters for interactive development — they scale down when load drops. For production pipelines, always use job clusters with auto-termination so you only pay for the compute a pipeline actually needs. An idle all-purpose cluster that nobody stopped is the #1 cost surprise for Databricks newcomers.
ADLS2 Integration
Databricks reads directly from Azure ADLS2 using the abfss:// protocol. No staging step is required — Spark infers the Parquet schema from file metadata in a single read:
spark.conf.set(
f"fs.azure.account.key.{STORAGE_ACCOUNT}.dfs.core.windows.net",
STORAGE_ACCOUNT_KEY
)
df = spark.read.parquet(f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net/raw/trips/")This one-line read is architecturally simpler than Snowflake’s two-step stage + COPY INTO, but gives less control over error handling for malformed files.
PySpark Pattern
The standard pattern throughout this module is Read → Transform → Write:
# Read from ADLS2 or a catalog table
df = spark.read.parquet(path)
# Transform using DataFrame operations (filter, join, aggregate)
df_cleaned = df.filter(...).withColumn(...).dropDuplicates()
# Write to a Unity Catalog table (overwrite or append)
df_cleaned.write.mode("overwrite").saveAsTable("catalog.schema.table")PySpark DataFrames are lazy — transformations are not executed until an action (.count(), .show(), .write) triggers computation. This allows Spark to optimize the entire execution plan before running anything.
Don’t forget to stop your cluster
Forgetting to stop a cluster after a lab session is the #1 cost surprise for Databricks newcomers. When you finish the exercise, go to Compute in the sidebar and stop your cluster. In production, use job clusters with auto-termination instead.
3.2 Demo: Trainer Walkthrough
The trainer runs all four notebooks in sequence:
- 00_setup.py — Configure ATTENDEE_ID, ADLS2 credentials, create schemas
- 01_bronze_ingestion.py — Read Parquet + CSV from ADLS2, write Bronze tables
- 02_silver_cleaning.py — Quality filters, derived metrics, zone enrichment
- 03_gold_kpis.py — 12 KPI aggregations written as Gold tables
3.3 Key Takeaways
- Databricks excels at complex PySpark transformations and ML workloads
- Unity Catalog provides centralized governance across all data assets
- Delta Lake gives ACID guarantees, time travel, and schema evolution
- In production, replace notebooks with LSDP + Lakeflow Jobs (see Module 5)
4. Quiz
Quiz: Module 2 — Databricks Pipeline Quiz
Before moving on, make sure you can answer:
- What are the three levels of the Unity Catalog namespace, and which level contains your schemas?
- What does Delta Lake provide that plain Parquet files do not?
- Why is
.write.mode("overwrite")used instead of.mode("append")for Gold KPI tables? - For today’s PySpark lab, should you attach Serverless or a cluster from the compute dropdown — and why?
5. Practice
Hands-on lab
- Exercise: Databricks Pipeline — start with Before you start (cluster, Git folder,
00_setupclone)
Notebooks: databricks/notebooks/00_setup.py through 03_gold_kpis.py.
Priya / Power BI: Overview page lights up — kpi_trips_by_hour and kpi_trips_by_day populate first charts.
Next module
Module 3: Snowflake Pipeline — Marcus reviews the prototype: My team lives in SQL — not PySpark notebooks.