Architecture Reference

Three parallel pipelines, one shared dataset

title: “Architecture Reference” subtitle: “Three parallel pipelines, one shared dataset” —

High-Level Architecture

flowchart TD
    ADLS2[("Azure ADLS2\nmhpdeworkshopsa / nyc-taxi-data\nraw/trips/  Parquet ~3M trips\nraw/lookup/ CSV 265 zones")]

    ADLS2 --> DB_B
    ADLS2 --> SF_B
    ADLS2 --> DBT_B

    subgraph DB ["Databricks Pipeline"]
        direction TB
        DB_B["Bronze\nPySpark · Delta Lake\nUnity Catalog"]
        DB_S["Silver\nCleaned + Enriched\n~2.5-2.8M rows"]
        DB_G["Gold\n12 KPI tables\nUnity Catalog"]
        DB_B --> DB_S --> DB_G
    end

    subgraph SF ["Snowflake Pipeline"]
        direction TB
        SF_B["Bronze\nCOPY INTO · External Stage\nVirtual Warehouse"]
        SF_S["Silver\nSQL cleaned + Snowpark\n~2.5-2.8M rows"]
        SF_G["Gold\n12 KPI tables\nSnowflake native"]
        SF_B --> SF_S --> SF_G
    end

    subgraph DBT ["dbt Pipeline"]
        direction TB
        DBT_B["Staging\ndbt source refs\n(uses existing Bronze)"]
        DBT_S["Silver\ndbt SQL models\ndbt test built-in"]
        DBT_G["Gold\ndbt SQL models\nDatabricks or Snowflake"]
        DBT_B --> DBT_S --> DBT_G
    end

    DB_G --> PBI
    SF_G --> PBI
    DBT_G --> PBI

    PBI["Power BI Dashboard\ntrainer demo"]

    style ADLS2 fill:#0057b8,color:#fff,stroke:#003d82
    style DB_B fill:#475569,color:#fff,stroke:#334155
    style DB_S fill:#0369a1,color:#fff,stroke:#075985
    style DB_G fill:#01065c,color:#fff,stroke:#000940
    style SF_B fill:#475569,color:#fff,stroke:#334155
    style SF_S fill:#0369a1,color:#fff,stroke:#075985
    style SF_G fill:#01065c,color:#fff,stroke:#000940
    style DBT_B fill:#475569,color:#fff,stroke:#334155
    style DBT_S fill:#0369a1,color:#fff,stroke:#075985
    style DBT_G fill:#01065c,color:#fff,stroke:#000940
    style PBI fill:#107c10,color:#fff,stroke:#0a5c0a

Three parallel medallion pipelines processing the same NYC Taxi dataset from a shared Azure ADLS2 source.

Pipeline Comparison

Layer Databricks Snowflake (SQL) Snowflake (Snowpark) dbt
Ingestion spark.read.parquet() COPY INTO from stage session.read.parquet() N/A (uses existing Bronze)
Quality PySpark .filter() SQL WHERE clauses Snowpark .filter() SQL + macros
Enrichment PySpark .join() SQL LEFT JOIN Snowpark .join() SQL LEFT JOIN + ref()
KPIs .groupBy().agg() GROUP BY + CREATE TABLE AS .group_by().agg() GROUP BY in model SQL
Storage Delta Lake (Unity Catalog) Snowflake native tables Snowflake native tables Delegated to backend
Testing Manual verification Manual verification Manual verification Built-in dbt test

Data Flow Detail

Bronze Layer

  • Input: Raw Parquet files (trips) + CSV (zone lookup) from ADLS2
  • Output: Exact copy of source data + metadata columns
  • Metadata added: source_file, bronze_processing_timestamp, data_year, data_month
  • Row count: ~3 million trips + 265 zones

Silver Layer

  • Input: Bronze tables
  • Transformations:
    • Data quality filters (7 rules)
    • Column standardization
    • Derived metrics (trip_duration, fare_per_mile, tip_percentage, avg_speed)
    • Time features (pickup_hour, day_of_week, is_weekend, is_peak_hour, time_of_day)
    • Descriptive labels (payment_type_desc, rate_code_desc, distance_band)
    • Zone enrichment (LEFT JOIN with lookup)
  • Output: silver_nyc_taxi_cleaned + silver_nyc_taxi_enriched
  • Row count: ~2.5-2.8M (after quality filters)

Gold Layer

  • Input: Silver enriched table
  • Output: 12 KPI aggregation tables
  • Designed for: Direct consumption by dashboards, reports, analysts

Production Architecture

flowchart LR
    subgraph TODAY ["Today — Training"]
        T1["Interactive notebooks"]
        T2["SQL worksheets"]
        T3["Manual dbt run"]
    end

    subgraph PROD ["Production — Real World"]
        P1["DLT + Workflows + DABs"]
        P2["Tasks + Streams + Stored Procs"]
        P3["dbt Cloud / GitHub Actions CI"]
    end

    TODAY -->|"productionise"| PROD

    style TODAY fill:#dbeafe,stroke:#0057b8,stroke-width:2px
    style PROD  fill:#dcfce7,stroke:#107c10,stroke-width:2px

Azure Infrastructure

All workshop Azure resources are pre-provisioned in resource group 1000_data_engineering_workshop (subscription MHP Resort Consulting Services).

Resource Details
Resource Group 1000_data_engineering_workshop
Subscription MHP Resort Consulting Services (ba826c91-8e52-4e07-ac7c-538858bbc813)
Azure Tenant mhpdev.onmicrosoft.com
Databricks Workspace mhpdeworkshop_databricks (ID 3359135813781456)
Unity Catalog (2026 cohort) mhpdeworkshop_databricks_2026 (separate from 2025 catalog mhpdeworkshop_databricks_2025)
Storage Account mhpdeworkshopsa
Container nyc-taxi-data
Trip Data Path raw/trips/ (Parquet files)
Lookup Data Path raw/lookup/taxi_zone_lookup.csv
Protocol abfss:// (ADLS2 Gen2)
Storage Region Germany West Central (germanywestcentral)