Module 2: Databricks Pipeline

PySpark, Unity Catalog, and Delta Lake

Duration: 75 min — Animation (3) · Think & Discuss (7) · Theory (15) · Quiz (3) · Practice (47)

1. Animation

2. Think & Discuss

Situation: Elena approved Databricks for the prototype. Raw Parquet sits in ADLS2. Priya waits for Gold tables for her Overview page.

Prompts:

  • How would you ingest Parquet from ADLS2 into Bronze?
  • What could go wrong if you skip quality checks and jump to KPIs?
  • Which two Gold KPI tables unlock trips-by-hour and day-of-week charts?

3. Theory

Note: Lakeflow Spark Declarative Pipelines (LSDP)

Production Databricks pipelines use LSDP (formerly DLT). Lab notebooks use interactive PySpark — clearer for learning. Both implement the same medallion layers.

3.1 Theory: Databricks Essentials

Architecture

Databricks separates the control plane (notebooks, job definitions, user management — hosted by Databricks) from the data plane (Spark clusters running in your cloud account). Your code executes on cluster nodes; your data stays in your cloud storage (Azure ADLS2 in this workshop). This separation means compute and storage scale independently.

Note: Photon Engine

A core architecture component shown on Slide 11 is the Photon Engine, Databricks’ vectorized C++ execution engine. Photon runs supported SQL and DataFrame operations natively instead of on the default JVM-based Spark engine, and falls back to the JVM engine for operations it does not support. It processes data in batches and significantly speeds up SQL queries, joins, and aggregations. Clusters with Photon enabled can achieve up to 25% lower latency with no code changes. Photon is enabled per cluster (via cluster configuration) and is especially powerful for BI workloads where the same queries run repeatedly against large Gold tables.

  • Workspace: Your Databricks environment — notebooks, jobs, catalogs, and collaboration features
  • Cluster: Compute resource running Apache Spark. Clusters can auto-scale (e.g., 2–8 workers) and auto-terminate after inactivity. All-purpose clusters are for interactive development; job clusters spin up for a single pipeline run and terminate automatically — significantly cheaper for production.
Important: Lab compute: cluster, not Serverless

Today’s notebooks are PySpark medallion pipelines (spark.read, spark.sql, saveAsTable) over ~3M rows in ADLS2. That work runs on a Spark cluster (data plane in your Azure subscription).

UI control | What it is | This lab
--- | --- | ---
Compute dropdown (top of notebook) | Where Spark runs | Attach your de-workshop-… cluster; start it if needed
Environment side panel | Serverless Python REPL (packages, base image) | Not used for this pipeline; do not add PySpark there

flowchart LR
    subgraph notebook ["Notebook"]
        N["Your .py cells"]
    end
    subgraph compute ["Compute dropdown — pick one"]
        C["Cluster<br/>Spark + Delta + UC"]
        S["Serverless<br/>lightweight Python REPL"]
    end
    subgraph env ["Environment side panel"]
        E["Python packages<br/>Serverless REPL only"]
    end
    N --> C
    N -.-> S
    E -.-> S

    classDef cluster fill:#ff3621,color:#fff,stroke:#c42a15
    classDef serverless fill:#94a3b8,color:#fff,stroke:#64748b
    classDef panel fill:#e2e8f0,color:#1e293b,stroke:#94a3b8
    classDef cells fill:#0057b8,color:#fff,stroke:#003d82

    class N cells
    class C cluster
    class S serverless
    class E panel

Solid arrow = this lab (PySpark on a cluster). Dotted arrows = Serverless path (not used for notebooks 01–03). The Environment panel applies only to the Serverless REPL, not to Spark on a cluster.

Why not Serverless for Module 2? Serverless configures a Python REPL through the Environment side panel; high-memory settings there affect REPL memory, not the Spark session size on a cluster (see the Databricks docs). The distributed spark.read / saveAsTable calls in this lab run on an attached cluster. Databricks warns against installing PySpark in the serverless Environment panel, because doing so stops the session.

Production may use serverless SQL, serverless GPU, or Lakeflow pipelines elsewhere — this lab teaches the classic cluster + notebook path Elena approved for the prototype.

Operational steps (Git folder, Pull/sync, clone 00_setup, environment banner): see Exercise: Databricks — Before you start and the Databricks setup guide (Git folder sync).

  • Unity Catalog: Centralized governance layer with a hierarchical object model:

flowchart TD
    MS["Metastore<br/><i>governance boundary</i>"]
    MS --> CAT["Catalog<br/><i>mhpdeworkshop_databricks_2026</i>"]

    CAT --> S1["Schema: 01_alice_bronze"]
    CAT --> S2["Schema: 01_alice_silver"]
    CAT --> S3["Schema: 01_alice_gold"]

    S1 --> T1["Tables"]
    S1 --> V1["Views"]

    S2 --> T2["Tables"]
    S2 --> V2["Views"]

    S3 --> T3["Tables"]
    S3 --> V3["Views"]
    S3 --> MV3["Materialized Views"]

    classDef metastore fill:#01065c,color:#fff,stroke:#000940
    classDef catalog fill:#0057b8,color:#fff,stroke:#003d82
    classDef schema fill:#0369a1,color:#fff,stroke:#075985
    classDef objects fill:#475569,color:#fff,stroke:#334155

    class MS metastore
    class CAT catalog
    class S1,S2,S3 schema
    class T1,T2,T3,V1,V2,V3,MV3 objects

The three-level namespace (catalog.schema.object) is what you use in SQL and PySpark. The metastore sits above all catalogs and is the trust boundary for access control, audit logging, and column-level lineage.
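As a small illustration of the three-level namespace, the sketch below assembles a fully qualified name from its parts. The helper names are made up for this example, and the table name "trips" is a hypothetical placeholder; the catalog and schema names are taken from the diagram above. Wrapping each part in backticks is a cautious choice for names such as 01_alice_bronze that begin with a digit.

```python
def quote_identifier(name: str) -> str:
    """Wrap one identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def table_fqn(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog name: catalog.schema.table."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# Catalog and schema come from the diagram; "trips" is a hypothetical table name.
BRONZE_TRIPS = table_fqn("mhpdeworkshop_databricks_2026", "01_alice_bronze", "trips")
print(BRONZE_TRIPS)
```

In a notebook, the resulting string could be passed to spark.table(BRONZE_TRIPS) or interpolated into a spark.sql(...) statement.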

Tip: Metric Views (2025)

Unity Catalog now supports Metric Views — semantic layers that let you define KPIs once (e.g., “daily revenue” or “average fare per mile”) and reuse them consistently across dashboards, Genie Spaces, and alerts. This eliminates the risk of different teams computing the same metric with slightly different logic.

Note: Objects in today’s lab

You will work primarily with tables (Bronze, Silver, Gold). Volumes and functions exist in the same namespace but are not used in this module.

  • Delta Lake: Open-source table format built on Parquet. Every Delta table has a _delta_log/ directory containing JSON transaction files that record every write, update, and delete. This log enables ACID transactions (concurrent reads/writes without corruption), time travel (query the table as it was at any point: VERSION AS OF 0), and schema enforcement (reject writes that don’t match the expected schema). Periodic checkpoint files (Parquet snapshots of the log) keep read performance fast even on tables with thousands of commits.

flowchart LR
    subgraph DeltaTable ["Delta Table (Parquet files)"]
        P1["part-00000.parquet"]
        P2["part-00001.parquet"]
        P3["part-00002.parquet"]
    end

    subgraph DeltaLog ["_delta_log/"]
        L0["00000.json<br/><i>CREATE TABLE</i>"]
        L1["00001.json<br/><i>INSERT</i>"]
        L2["00002.json<br/><i>UPDATE</i>"]
        CK["00002.checkpoint.parquet"]
    end

    L0 --> P1
    L0 --> P2
    L1 --> P3
    L2 -.->|"modifies"| P1

    classDef parquet fill:#E25A1C,color:#fff,stroke:#c44a15
    classDef log fill:#0369a1,color:#fff,stroke:#075985
    classDef checkpoint fill:#6d28d9,color:#fff,stroke:#5b21b6

    class P1,P2,P3 parquet
    class L0,L1,L2 log
    class CK checkpoint
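To make the role of the transaction log concrete, here is a toy model in plain Python (not the Delta Lake library) that replays simplified "add" and "remove" actions to find which data files are visible at a given version. This is the idea behind time travel. The commit contents are heavily reduced: real commit files carry more fields, and the rewritten file name part-00003.parquet is an assumption. The sketch models the UPDATE in the classic copy-on-write way, where the old Parquet file is retired and a rewritten file is added rather than edited in place.

```python
import json

# Simplified commits, loosely modeled on the "add" / "remove" actions in _delta_log/.
COMMITS = {
    0: ['{"add": {"path": "part-00000.parquet"}}',
        '{"add": {"path": "part-00001.parquet"}}'],
    1: ['{"add": {"path": "part-00002.parquet"}}'],
    # UPDATE: retire the old file and add a rewritten one (hypothetical file name).
    2: ['{"remove": {"path": "part-00000.parquet"}}',
        '{"add": {"path": "part-00003.parquet"}}'],
}


def live_files(commits: dict[int, list[str]], version: int) -> set[str]:
    """Replay commits 0..version and return the data files visible at that version."""
    files: set[str] = set()
    for v in sorted(commits):
        if v > version:
            break
        for line in commits[v]:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


for v in sorted(COMMITS):
    print(v, sorted(live_files(COMMITS, v)))
```

Reading the table at version 0 only needs the files added by commit 0, which is why a VERSION AS OF 0 query can still return the original rows after later commits.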

Cluster cost management

Use auto-scaling clusters with auto-termination for interactive development: they scale down when load drops and shut off after a period of inactivity. For production pipelines, use job clusters, which start for a single run and terminate when it finishes, so you pay only for the compute a pipeline actually needs. An idle all-purpose cluster that nobody stopped is the #1 cost surprise for Databricks newcomers.
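The auto-scaling and auto-termination settings can be written down as a cluster definition. The sketch below uses a Python dict in the shape of a Databricks Clusters API request. The 2–8 worker range comes from the Cluster description earlier in this module; the cluster name, runtime version, node type, and the 30-minute timeout are placeholder assumptions, not the workshop's actual configuration.

```python
import json

cluster_spec = {
    "cluster_name": "example-dev-cluster",               # hypothetical name
    "spark_version": "15.4.x-scala2.12",                 # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",                   # placeholder Azure VM type
    "autoscale": {"min_workers": 2, "max_workers": 8},   # scale between 2 and 8 workers
    "autotermination_minutes": 30,                       # assumed idle timeout
    "runtime_engine": "PHOTON",                          # enable Photon on this cluster
}

print(json.dumps(cluster_spec, indent=2))
```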

ADLS2 Integration

Databricks reads directly from Azure ADLS2 using the abfss:// protocol. No staging step is required — Spark infers the Parquet schema from file metadata in a single read:

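# Register the storage account key for this Spark session
spark.conf.set(
    f"fs.azure.account.key.{STORAGE_ACCOUNT}.dfs.core.windows.net",
    STORAGE_ACCOUNT_KEY
)
# Use the same account name in the abfss:// path as in the key setting above
df = spark.read.parquet(f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/raw/trips/")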

This one-line read is architecturally simpler than Snowflake’s two-step stage + COPY INTO, but gives less control over error handling for malformed files.
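One way to keep the abfss:// path and the account-key setting consistent is to build both from the same account name. The helpers below are a sketch with made-up function names. Reading the key from a Databricks secret scope via dbutils.secrets.get is one option that avoids pasting the key into a notebook, but whether the lab workspace has such a scope is an assumption, and the container and account names in the usage line are placeholders.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside an ADLS2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def configure_account_key(spark, dbutils, account: str, scope: str, key_name: str) -> None:
    """Read the storage key from a secret scope and register it for this Spark session."""
    # scope and key_name are hypothetical; they must exist in your workspace
    storage_key = dbutils.secrets.get(scope=scope, key=key_name)
    spark.conf.set(f"fs.azure.account.key.{account}.dfs.core.windows.net", storage_key)


# Placeholder container and account names
print(abfss_uri("examplecontainer", "examplestorage", "raw/trips/"))
```

In a notebook, calling configure_account_key(spark, dbutils, ...) once and then spark.read.parquet(abfss_uri(...)) would mirror the two steps shown above.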

PySpark Pattern

The standard pattern throughout this module is Read → Transform → Write:

# Read from ADLS2 or a catalog table
df = spark.read.parquet(path)

# Transform using DataFrame operations (filter, join, aggregate)
df_cleaned = df.filter(...).withColumn(...).dropDuplicates()

# Write to a Unity Catalog table (overwrite or append)
df_cleaned.write.mode("overwrite").saveAsTable("catalog.schema.table")

PySpark DataFrames are lazy — transformations are not executed until an action (.count(), .show(), .write) triggers computation. This allows Spark to optimize the entire execution plan before running anything.
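A slightly fuller sketch of Read → Transform → Write, with comments marking where work is actually triggered. The function name, the column names (trip_distance, fare_amount) and the quality rule are hypothetical and are not taken from the lab notebooks. selectExpr stands in for withColumn here so that the example needs no extra imports.

```python
def bronze_to_silver(spark, source_table: str, target_table: str) -> None:
    """Read -> Transform -> Write for one table. Column names are hypothetical."""
    # Read: returns a DataFrame handle, no rows are scanned yet
    df = spark.table(source_table)

    # Transform: each call only extends the logical plan
    df_cleaned = (
        df.filter("trip_distance > 0 AND fare_amount > 0")
          .selectExpr("*", "fare_amount / trip_distance AS fare_per_mile")
          .dropDuplicates()
    )

    # Action: the write triggers execution of the whole optimized plan
    df_cleaned.write.mode("overwrite").saveAsTable(target_table)
```

In a notebook attached to a cluster, this would be called as bronze_to_silver(spark, "catalog.schema.bronze_table", "catalog.schema.silver_table") with real three-level names.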

Don’t forget to stop your cluster

When you finish the exercise, go to Compute in the sidebar and stop your cluster; an idle cluster keeps billing until it terminates. In production, use job clusters, which terminate when the run finishes.

3.2 Demo: Trainer Walkthrough

The trainer runs all four notebooks in sequence:

  1. 00_setup.py — Configure ATTENDEE_ID, ADLS2 credentials, create schemas
  2. 01_bronze_ingestion.py — Read Parquet + CSV from ADLS2, write Bronze tables
  3. 02_silver_cleaning.py — Quality filters, derived metrics, zone enrichment
  4. 03_gold_kpis.py — 12 KPI aggregations written as Gold tables
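As a rough idea of what one Gold KPI step could look like, the sketch below builds a SQL statement that recomputes a trips-by-hour table from a Silver table. It is an illustration, not the content of 03_gold_kpis.py: the function name, the column pickup_datetime and the table names in the usage line are assumptions. CREATE OR REPLACE TABLE rebuilds the table on every run, which plays the same role as write.mode("overwrite").

```python
def kpi_trips_by_hour_sql(silver_table: str, gold_table: str) -> str:
    """Return a statement that rebuilds the KPI table from scratch on every run."""
    return (
        f"CREATE OR REPLACE TABLE {gold_table} AS "
        f"SELECT hour(pickup_datetime) AS pickup_hour, COUNT(*) AS trip_count "
        f"FROM {silver_table} "
        f"GROUP BY hour(pickup_datetime)"
    )


# Placeholder three-level names
print(kpi_trips_by_hour_sql("catalog.silver.trips", "catalog.gold.kpi_trips_by_hour"))
```

In a notebook, the string would be passed to spark.sql(...). Because the table is fully recomputed, rerunning the notebook does not double-count trips.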

3.3 Key Takeaways

  • Databricks excels at complex PySpark transformations and ML workloads
  • Unity Catalog provides centralized governance across all data assets
  • Delta Lake gives ACID guarantees, time travel, and schema enforcement
  • In production, replace notebooks with LSDP + Lakeflow Jobs (see Module 5)

4. Quiz

Quiz: Module 2 — Databricks Pipeline Quiz

Before moving on, make sure you can answer:

  1. What are the three levels of the Unity Catalog namespace, and which level contains your schemas?
  2. What does Delta Lake provide that plain Parquet files do not?
  3. Why is .write.mode("overwrite") used instead of .mode("append") for Gold KPI tables?
  4. For today’s PySpark lab, should you attach Serverless or a cluster from the compute dropdown — and why?

5. Practice

Hands-on lab

Notebooks: databricks/notebooks/00_setup.py through 03_gold_kpis.py.

Priya / Power BI: the Overview page lights up once kpi_trips_by_hour and kpi_trips_by_day populate the first charts.

Next module

Module 3: Snowflake Pipeline. Marcus reviews the prototype: "My team lives in SQL, not PySpark notebooks."

