Architecture Reference

Three parallel pipelines, one shared dataset


High-Level Architecture

flowchart TD
    ADLS2[("Azure ADLS2<br>mhpdeworkshopsa / nyc-taxi-data<br>raw/trips/  Parquet 3,646,319 trips<br>raw/lookup/ CSV 265 zones")]

    ADLS2 --> DB_B
    ADLS2 --> SF_B

    subgraph DB ["Databricks Pipeline"]
        direction TB
        DB_B["Bronze<br>PySpark · Delta Lake<br>Unity Catalog"]
        DB_S["Silver<br>Cleaned + Enriched<br>3,146,710 rows"]
        DB_G["Gold<br>12 KPI tables<br>Unity Catalog"]
        DB_B --> DB_S --> DB_G
    end

    subgraph SF ["Snowflake Pipeline"]
        direction TB
        SF_B["Bronze<br>COPY INTO · External Stage<br>Virtual Warehouse"]
        SF_S["Silver<br>SQL cleaned + Snowpark<br>3,146,710 rows"]
        SF_G["Gold<br>12 KPI tables<br>Snowflake native"]
        SF_B --> SF_S --> SF_G
    end

    subgraph DBT ["dbt Pipeline"]
        direction TB
        DBT_B["Staging<br>dbt source refs<br>(reads existing Bronze)"]
        DBT_S["Silver<br>dbt SQL models<br>dbt test built-in"]
        DBT_G["Gold<br>dbt SQL models<br>Databricks or Snowflake"]
        DBT_B --> DBT_S --> DBT_G
    end

    DB_B -.->|"source()"| DBT_B
    SF_B -.->|"source()"| DBT_B

    DB_G --> PBI
    SF_G --> PBI
    DBT_G --> PBI

    PBI["Power BI Dashboard<br>five-page KPI report"]

    style ADLS2 fill:#0057b8,color:#fff,stroke:#003d82
    style DB_B fill:#475569,color:#fff,stroke:#334155
    style DB_S fill:#0369a1,color:#fff,stroke:#075985
    style DB_G fill:#01065c,color:#fff,stroke:#000940
    style SF_B fill:#475569,color:#fff,stroke:#334155
    style SF_S fill:#0369a1,color:#fff,stroke:#075985
    style SF_G fill:#01065c,color:#fff,stroke:#000940
    style DBT_B fill:#475569,color:#fff,stroke:#334155
    style DBT_S fill:#0369a1,color:#fff,stroke:#075985
    style DBT_G fill:#01065c,color:#fff,stroke:#000940
    style PBI fill:#107c10,color:#fff,stroke:#0a5c0a

Three parallel medallion pipelines process the same NYC Taxi dataset from a shared Azure ADLS2 source. dbt does not ingest from ADLS2; it reads the Bronze tables that the Databricks or Snowflake pipeline has already created.

Note

dbt does not ingest from ADLS2. Run the Databricks notebooks or the Snowflake Snowpark Bronze step first, then run dbt run. On Snowflake, dbt expects the Snowpark Bronze table (BRONZE_NYC_TAXI_TRIPS), not the SQL-script Bronze table (bronze_nyc_taxi_trips). See recommended workshop tracks.

Workshop schema isolation

Each attendee works in isolated schemas so pipelines do not overwrite each other:

Platform	Typical pattern	Example
Databricks	`{attendee_id}_bronze` / `_silver` / `_gold`	`de_01_alice_bronze.nyc_taxi_trips`
Snowflake SQL	`{attendee_id}_SQL_*`	`DE_01_ALICE_SQL_SILVER`
Snowflake Snowpark	`{attendee_id}_SP_*`	`DE_01_ALICE_SP_BRONZE`
dbt (batch)	`{attendee_id}_DBT` / `_DBT_GOLD`	`DE_01_ALICE_DBT_GOLD`
dbt (streaming)	`{attendee_id}_DBT_STREAMING`	`DE_01_ALICE_DBT_STREAMING`

Full naming tables and track guidance: Data Model — recommended workshop tracks.
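As a rough illustration of the naming patterns in the table above, the sketch below builds the schema names for one attendee. The function name `workshop_schemas` is hypothetical. Expanding the `_SQL_*` and `_SP_*` wildcards to the three medallion layers is an assumption; the table only confirms `_SQL_SILVER` and `_SP_BRONZE`.

```python
from __future__ import annotations

# Assumption: the "*" in the Snowflake patterns stands for the medallion layers.
LAYERS = ("bronze", "silver", "gold")


def workshop_schemas(attendee_id: str) -> dict[str, list[str]]:
    """Return the isolated schema names for one attendee, grouped by platform."""
    lower = attendee_id.lower()   # Databricks examples are lower case
    upper = attendee_id.upper()   # Snowflake and dbt examples are upper case
    return {
        "databricks": [f"{lower}_{layer}" for layer in LAYERS],
        "snowflake_sql": [f"{upper}_SQL_{layer.upper()}" for layer in LAYERS],
        "snowflake_snowpark": [f"{upper}_SP_{layer.upper()}" for layer in LAYERS],
        "dbt_batch": [f"{upper}_DBT", f"{upper}_DBT_GOLD"],
        "dbt_streaming": [f"{upper}_DBT_STREAMING"],
    }


schemas = workshop_schemas("de_01_alice")
# Fully qualified Databricks Bronze table, matching the example in the table
bronze_table = f"{schemas['databricks'][0]}.nyc_taxi_trips"
```

Because every name is derived from the attendee ID, two attendees can run all three pipelines against the same account without writing to each other's schemas.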

Pipeline Comparison

Layer	Databricks	Snowflake (SQL)	Snowflake (Snowpark)	dbt
Ingestion	`spark.read.parquet()`	`COPY INTO` from stage	`session.read.parquet()`	N/A (uses existing Bronze)
Quality	PySpark `.filter()`	SQL `WHERE` clauses	Snowpark `.filter()`	SQL + macros
Enrichment	PySpark `.join()`	SQL `LEFT JOIN`	Snowpark `.join()`	SQL `LEFT JOIN` + `ref()`
KPIs	`.groupBy().agg()`	`GROUP BY` + `CREATE TABLE AS`	`.group_by().agg()`	`GROUP BY` in model SQL
Storage	Delta Lake (Unity Catalog)	Snowflake native tables	Snowflake native tables	Delegated to backend
Testing	Manual verification	Manual verification	Manual verification	Built-in `dbt test`
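The Enrichment row uses left-join semantics on every platform: each trip is kept even when its zone is missing from the lookup. The plain-Python sketch below shows that behaviour without any platform dependency. The column names `PULocationID`, `LocationID`, `Zone` and `Borough` are assumed from the public NYC TLC files, and `pickup_zone` and `pickup_borough` are hypothetical output names, not the workshop's confirmed schema.

```python
def left_join_zones(trips, zones):
    """Attach zone attributes to each trip, keeping trips with no matching zone."""
    zone_by_id = {zone["LocationID"]: zone for zone in zones}
    enriched = []
    for trip in trips:
        # A missing key yields an empty dict, so the zone fields become None,
        # which mirrors the NULLs a SQL LEFT JOIN would produce.
        zone = zone_by_id.get(trip["PULocationID"], {})
        enriched.append(
            {
                **trip,
                "pickup_zone": zone.get("Zone"),
                "pickup_borough": zone.get("Borough"),
            }
        )
    return enriched


# Illustrative rows only
zones = [{"LocationID": 132, "Zone": "JFK Airport", "Borough": "Queens"}]
trips = [
    {"trip_id": 1, "PULocationID": 132},
    {"trip_id": 2, "PULocationID": 999},  # not present in the lookup
]
enriched = left_join_zones(trips, zones)
```

An inner join would silently drop the second trip, which is why the row count stays unchanged through enrichment when a left join is used.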

Data Flow Detail

Bronze Layer

Input: Raw Parquet files (trips) + CSV (zone lookup) from ADLS2
Output: Exact copy of source data + metadata columns
Metadata added (names vary by ingest path):

Path	Typical metadata
Databricks	`_source_path`, `_bronze_loaded_at`
Snowflake SQL	`_source_file`, `_loaded_at`
Snowflake Snowpark	`_LOADED_AT`

See Bronze schema for the full column list.

Row count: 3,646,319 trips + 265 zones (Oct 2024 workshop month — see Workshop dataset volumes)
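A minimal sketch of the Bronze contract, assuming rows are plain dictionaries: the source fields are copied unchanged and only the metadata columns from the table above are added. The helper `stamp_bronze`, the path keys, and the file name `raw/trips/example.parquet` are hypothetical; the real pipelines do this in PySpark, `COPY INTO`, or Snowpark.

```python
from datetime import datetime, timezone

# Metadata column names per ingest path, taken from the table above.
# The Snowpark path lists only a load timestamp, so its source column is None.
METADATA_COLUMNS = {
    "databricks": {"source": "_source_path", "loaded_at": "_bronze_loaded_at"},
    "snowflake_sql": {"source": "_source_file", "loaded_at": "_loaded_at"},
    "snowflake_snowpark": {"source": None, "loaded_at": "_LOADED_AT"},
}


def stamp_bronze(rows, path_key, source, loaded_at=None):
    """Copy rows unchanged and add the metadata columns for one ingest path."""
    columns = METADATA_COLUMNS[path_key]
    loaded_at = loaded_at or datetime.now(timezone.utc)
    stamped = []
    for row in rows:
        out = dict(row)  # copy, so the raw record is never mutated
        if columns["source"] is not None:
            out[columns["source"]] = source
        out[columns["loaded_at"]] = loaded_at
        stamped.append(out)
    return stamped


raw = [{"fare_amount": 12.5}]
when = datetime(2024, 10, 1, tzinfo=timezone.utc)
databricks_rows = stamp_bronze(raw, "databricks", "raw/trips/example.parquet", when)
snowpark_rows = stamp_bronze(raw, "snowflake_snowpark", "raw/trips/example.parquet", when)
```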

Silver Layer

Input: Bronze tables
Transformations:
- Data quality rules (9 rules: filters, negative-tip correction, deduplication)
- Column standardization
- Derived metrics (trip_duration, fare_per_mile, tip_percentage, avg_speed)
- Time features (pickup_hour, day_of_week, is_weekend, is_peak_hour, time_of_day)
- Descriptive labels (payment_type_desc, rate_code_desc, distance_band)
- Zone enrichment (LEFT JOIN with lookup)
Output: silver_nyc_taxi_cleaned + silver_nyc_taxi_enriched
Row count: 3,146,710 after quality filters (499,609 removed, 13.7%); Databricks may be marginally smaller due to an extra cross-month filter — see Workshop dataset volumes
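The derived metrics listed above can be sketched for a single trip as follows. The formulas and units here are assumptions (duration in minutes, speed in miles per hour, tip as a percentage of the fare), as are the input field names `pickup_datetime`, `dropoff_datetime`, `trip_distance`, `fare_amount` and `tip_amount`; the workshop's Silver notebooks remain the reference for the exact definitions.

```python
from datetime import datetime


def derive_metrics(trip):
    """Return a copy of the trip with the four derived Silver metrics added."""
    elapsed = trip["dropoff_datetime"] - trip["pickup_datetime"]
    duration_min = elapsed.total_seconds() / 60
    distance = trip["trip_distance"]
    fare = trip["fare_amount"]
    tip = trip["tip_amount"]
    return {
        **trip,
        "trip_duration": round(duration_min, 2),
        # Guard every ratio so a zero denominator yields None instead of an error
        "fare_per_mile": round(fare / distance, 2) if distance > 0 else None,
        "tip_percentage": round(tip / fare * 100, 2) if fare > 0 else None,
        "avg_speed": round(distance / (duration_min / 60), 2) if duration_min > 0 else None,
    }


# Illustrative trip: 30 minutes, 5 miles, 20.00 fare, 4.00 tip
sample = derive_metrics(
    {
        "pickup_datetime": datetime(2024, 10, 1, 8, 0),
        "dropoff_datetime": datetime(2024, 10, 1, 8, 30),
        "trip_distance": 5.0,
        "fare_amount": 20.0,
        "tip_amount": 4.0,
    }
)
```

In the real pipelines the quality filters run before this step, so zero-distance or zero-fare rows may already have been removed; the guards are kept here so the sketch is safe on unfiltered input.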

Gold Layer

Input: Silver enriched table
Output: 12 KPI aggregation tables (kpi_*)
Designed for: Direct consumption by dashboards, reports, analysts
Metric Views (optional, Databricks): Unity Catalog Metric Views (2025 preview) offer a governed alternative — define KPI logic once and share across Databricks BI and Genie without extra physical tables. This workshop materialises physical Gold tables for cross-platform parity (Databricks, Snowflake, dbt) and Power BI Import mode; production teams may add Metric Views on top of curated Gold.
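Each Gold table is a grouped aggregation over Silver enriched rows. The sketch below shows the shape of one such aggregation in plain Python. The KPI itself (trips and average fare per pickup hour), the function name `kpi_trips_by_hour`, and the `fare_amount` field are hypothetical examples, not one of the workshop's 12 confirmed tables.

```python
from collections import defaultdict


def kpi_trips_by_hour(enriched_rows):
    """Aggregate trip count and average fare per pickup_hour, sorted by hour."""
    buckets = defaultdict(lambda: {"trip_count": 0, "fare_total": 0.0})
    for row in enriched_rows:
        bucket = buckets[row["pickup_hour"]]
        bucket["trip_count"] += 1
        bucket["fare_total"] += row["fare_amount"]
    return [
        {
            "pickup_hour": hour,
            "trip_count": bucket["trip_count"],
            "avg_fare": round(bucket["fare_total"] / bucket["trip_count"], 2),
        }
        for hour, bucket in sorted(buckets.items())
    ]


# Illustrative Silver rows only
silver_rows = [
    {"pickup_hour": 8, "fare_amount": 10.0},
    {"pickup_hour": 8, "fare_amount": 20.0},
    {"pickup_hour": 9, "fare_amount": 5.0},
]
kpi_rows = kpi_trips_by_hour(silver_rows)
```

The result has one row per group, which is what makes Gold tables small enough for Power BI Import mode.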

Optional Extensions (Modules 8–9)

The batch architecture above is the core workshop path. Optional modules add parallel use cases — they do not feed NYC Taxi Bronze from Kafka.

flowchart TD
    subgraph BATCH ["Batch (Modules 2–4) — NYC Taxi"]
        DB_S2["Silver enriched<br>(all three paths)"]
    end

    DB_S2 -.->|"Module 9 (Optional)"| ML["ML feature table<br>Tip prediction<br>Databricks · Snowflake · dbt"]

    KAFKA[("Aiven Kafka<br>User activity events")]

    subgraph STREAM ["Module 8 (Optional) — Streaming"]
        KAFKA --> DB_ST["Databricks<br>Structured Streaming"]
        KAFKA --> SF_ST["Snowflake<br>Dynamic Tables"]
        KAFKA --> DBT_ST["dbt<br>dynamic_table models"]
    end

    style KAFKA fill:#d97706,color:#fff,stroke:#b45309
    style ML fill:#6d28d9,color:#fff,stroke:#5b21b6
    style DB_ST fill:#01065c,color:#fff,stroke:#000940
    style SF_ST fill:#01065c,color:#fff,stroke:#000940
    style DBT_ST fill:#01065c,color:#fff,stroke:#000940

Optional Module 8 (streaming) uses a separate user-activity dataset from Aiven Kafka. Optional Module 9 (ML) extends batch Silver into tip-prediction features.

Module 8 (Streaming): Live user-activity events from Aiven Kafka — Databricks Structured Streaming, Snowflake Dynamic Tables (+ optional dbt dynamic_table in {attendee}_DBT_STREAMING).
Module 9 (ML): silver_nyc_taxi_enriched feeds tip-prediction features — Databricks (sklearn + MLflow), Snowflake (Cortex ML + Snowpark ML), dbt feature models in _DBT_GOLD.

Module 6 (AI) consumes Gold/Silver through Snowflake Cortex (AI_COMPLETE), Databricks Genie, and Workspaces SQL. It is a consumption layer on top of batch outputs, not a fourth ingest pipeline.

Production Architecture

flowchart LR
    subgraph TODAY ["Today — Training"]
        T1["Interactive notebooks"]
        T2["SQL files (Workspaces)"]
        T3["Manual dbt run"]
    end

    subgraph PROD ["Production — Real World"]
        P1["LSDP + Lakeflow Jobs + DABs"]
        P2["Tasks + Streams + Stored Procs"]
        P3["GitHub Actions CI / dbt Core"]
    end

    TODAY -->|"productionise"| PROD

    style TODAY fill:#dbeafe,stroke:#0057b8,stroke-width:2px
    style PROD  fill:#dcfce7,stroke:#107c10,stroke-width:2px