flowchart TD
ADLS2[("Azure ADLS2\nmhpdeworkshopsa / nyc-taxi-data\nraw/trips/ Parquet ~3M trips\nraw/lookup/ CSV 265 zones")]
ADLS2 --> DB_B
ADLS2 --> SF_B
ADLS2 --> DBT_B
subgraph DB ["Databricks Pipeline"]
direction TB
DB_B["Bronze\nPySpark · Delta Lake\nUnity Catalog"]
DB_S["Silver\nCleaned + Enriched\n~2.5-2.8M rows"]
DB_G["Gold\n12 KPI tables\nUnity Catalog"]
DB_B --> DB_S --> DB_G
end
subgraph SF ["Snowflake Pipeline"]
direction TB
SF_B["Bronze\nCOPY INTO · External Stage\nVirtual Warehouse"]
SF_S["Silver\nSQL cleaned + Snowpark\n~2.5-2.8M rows"]
SF_G["Gold\n12 KPI tables\nSnowflake native"]
SF_B --> SF_S --> SF_G
end
subgraph DBT ["dbt Pipeline"]
direction TB
DBT_B["Staging\ndbt source refs\n(uses existing Bronze)"]
DBT_S["Silver\ndbt SQL models\ndbt test built-in"]
DBT_G["Gold\ndbt SQL models\nDatabricks or Snowflake"]
DBT_B --> DBT_S --> DBT_G
end
DB_G --> PBI
SF_G --> PBI
DBT_G --> PBI
PBI["Power BI Dashboard\ntrainer demo"]
style ADLS2 fill:#0057b8,color:#fff,stroke:#003d82
style DB_B fill:#475569,color:#fff,stroke:#334155
style DB_S fill:#0369a1,color:#fff,stroke:#075985
style DB_G fill:#01065c,color:#fff,stroke:#000940
style SF_B fill:#475569,color:#fff,stroke:#334155
style SF_S fill:#0369a1,color:#fff,stroke:#075985
style SF_G fill:#01065c,color:#fff,stroke:#000940
style DBT_B fill:#475569,color:#fff,stroke:#334155
style DBT_S fill:#0369a1,color:#fff,stroke:#075985
style DBT_G fill:#01065c,color:#fff,stroke:#000940
style PBI fill:#107c10,color:#fff,stroke:#0a5c0a
Architecture Reference
Three parallel pipelines, one shared dataset
title: “Architecture Reference” subtitle: “Three parallel pipelines, one shared dataset” —
High-Level Architecture
Pipeline Comparison
| Layer | Databricks | Snowflake (SQL) | Snowflake (Snowpark) | dbt |
|---|---|---|---|---|
| Ingestion | spark.read.parquet() |
COPY INTO from stage |
session.read.parquet() |
N/A (uses existing Bronze) |
| Quality | PySpark .filter() |
SQL WHERE clauses |
Snowpark .filter() |
SQL + macros |
| Enrichment | PySpark .join() |
SQL LEFT JOIN |
Snowpark .join() |
SQL LEFT JOIN + ref() |
| KPIs | .groupBy().agg() |
GROUP BY + CREATE TABLE AS |
.group_by().agg() |
GROUP BY in model SQL |
| Storage | Delta Lake (Unity Catalog) | Snowflake native tables | Snowflake native tables | Delegated to backend |
| Testing | Manual verification | Manual verification | Manual verification | Built-in dbt test |
Data Flow Detail
Bronze Layer
- Input: Raw Parquet files (trips) + CSV (zone lookup) from ADLS2
- Output: Exact copy of source data + metadata columns
- Metadata added:
source_file,bronze_processing_timestamp,data_year,data_month - Row count: ~3 million trips + 265 zones
Silver Layer
- Input: Bronze tables
- Transformations:
- Data quality filters (7 rules)
- Column standardization
- Derived metrics (trip_duration, fare_per_mile, tip_percentage, avg_speed)
- Time features (pickup_hour, day_of_week, is_weekend, is_peak_hour, time_of_day)
- Descriptive labels (payment_type_desc, rate_code_desc, distance_band)
- Zone enrichment (LEFT JOIN with lookup)
- Output:
silver_nyc_taxi_cleaned+silver_nyc_taxi_enriched - Row count: ~2.5-2.8M (after quality filters)
Gold Layer
- Input: Silver enriched table
- Output: 12 KPI aggregation tables
- Designed for: Direct consumption by dashboards, reports, analysts
Production Architecture
flowchart LR
subgraph TODAY ["Today — Training"]
T1["Interactive notebooks"]
T2["SQL worksheets"]
T3["Manual dbt run"]
end
subgraph PROD ["Production — Real World"]
P1["DLT + Workflows + DABs"]
P2["Tasks + Streams + Stored Procs"]
P3["dbt Cloud / GitHub Actions CI"]
end
TODAY -->|"productionise"| PROD
style TODAY fill:#dbeafe,stroke:#0057b8,stroke-width:2px
style PROD fill:#dcfce7,stroke:#107c10,stroke-width:2px
Azure Infrastructure
All workshop Azure resources are pre-provisioned in resource group 1000_data_engineering_workshop (subscription MHP Resort Consulting Services).
| Resource | Details |
|---|---|
| Resource Group | 1000_data_engineering_workshop |
| Subscription | MHP Resort Consulting Services (ba826c91-8e52-4e07-ac7c-538858bbc813) |
| Azure Tenant | mhpdev.onmicrosoft.com |
| Databricks Workspace | mhpdeworkshop_databricks (ID 3359135813781456) |
| Unity Catalog (2026 cohort) | mhpdeworkshop_databricks_2026 (separate from 2025 catalog mhpdeworkshop_databricks_2025) |
| Storage Account | mhpdeworkshopsa |
| Container | nyc-taxi-data |
| Trip Data Path | raw/trips/ (Parquet files) |
| Lookup Data Path | raw/lookup/taxi_zone_lookup.csv |
| Protocol | abfss:// (ADLS2 Gen2) |
| Storage Region | Germany West Central (germanywestcentral) |