Exercise: Snowflake Pipeline

YellowLine NYC story · full hands-on lab

Estimated time: 30–35 min (Base: 20 min · Stretch: 10–15 min)

YellowLine NYC context (Module 3)

Marcus needs SQL maintainability. Rebuild the same medallion and KPIs on Snowflake.

Working Environment

Workshop files live in your fork (snowflake/sql/setup/, snowflake/sql/, snowflake/snowpark/). Default: link your fork as a Git workspace in Snowsight (setup § Git). Fallback: copy-paste from Codespace if Git integration is blocked.

Method	When to use
Git workspace (recommended)	Default — open `.sql` files in Snowsight; Pull facilitator updates
Copy-paste (fallback)	Git/OAuth blocked — open `.sql` in Codespace, paste into Snowsight

Setup guide: Snowflake § Access workshop files

GitHub blocked? (emergency only)

The normal path is fork + Codespace (Prerequisites § Step 2). Use Lab source files only if your facilitator approved it — e.g. you cannot create or use GitHub before class.

Account Setup (run once)

Run the scripts below to create your Snowflake objects — database, warehouse, role, schemas, and external stage. After this, your environment is ready for all subsequent labs.

Prerequisites (complete these first)

Requirement	Where
Snowflake trial account	Snowflake setup guide
Workshop repo forked on GitHub	Prerequisites
Git workspace linked to your fork	Snowflake setup § Git

If you are using copy-paste instead of Git, skip the Git row above and go directly to the Copy-paste tab below.

How to run SQL in Workspaces

Snowsight Run executes only the statement at the cursor — not the whole file. For setup scripts, always use Run all (Ctrl+Shift+Enter / Cmd+Shift+Return) or select all (Ctrl+A) → Run.

Goal	Action
One statement	Cursor in statement → Run (Ctrl+Enter)
Several statements	Select the SQL block → Run
Entire script	Run all (Ctrl+Shift+Enter)

Setup scripts — run in order

Run this section only once

If you re-run the Bronze/Silver/Gold steps later, go straight to Base Exercise. There is no need to re-run setup.

Each script creates different objects. Every script includes USE ROLE / USE WAREHOUSE at the top, so it sets its own context — just open and Run all.

Replace placeholders before running

Only two files need editing — all other scripts auto-load your ID from a central config table:

Script	Placeholder	Replace with
`02_account_setup.sql`	`de_XX_yourname`	Your assigned ID (e.g. `de_01_alice` from your credential card)
`04_external_stage.sql`	`<SAS_TOKEN>`	ADLS2 SAS token provided by the trainer

02_account_setup.sql saves your ID to a config table (_workshop_config). Every subsequent script loads it automatically — no more editing needed.

Each script has a built-in validation check that stops execution if the placeholder is not replaced — you will see an ✖ ERROR message.
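The fail-fast idea behind that validation check can be sketched in a few lines of Python. This is only an illustration, not the workshop's SQL check: the helper names and the sample `SET` statements are hypothetical, and only the two placeholder strings come from the table above.

```python
# Placeholder strings taken from the table above; the helpers are hypothetical.
PLACEHOLDERS = ("de_XX_yourname", "<SAS_TOKEN>")


def find_unreplaced_placeholders(script_text: str) -> list[str]:
    """Return every known placeholder that still appears in the script text."""
    return [p for p in PLACEHOLDERS if p in script_text]


def assert_ready_to_run(script_text: str) -> None:
    """Raise before execution if any placeholder was left in place."""
    leftover = find_unreplaced_placeholders(script_text)
    if leftover:
        raise ValueError(f"Replace placeholders before running: {', '.join(leftover)}")


# Illustrative script lines (not the real file contents).
edited = "SET attendee_id = 'de_01_alice';"
unedited = "SET attendee_id = 'de_XX_yourname';"

assert_ready_to_run(edited)                    # passes silently
print(find_unreplaced_placeholders(unedited))  # lists what still needs editing
```

Checking before any statement runs means a forgotten placeholder cannot create objects under the wrong name.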

Script	What it creates
`01_git_api_integration.sql`	API integration for Git workspace sync (skip if workspace created via UI — integration already exists)
`02_account_setup.sql`	`DE_MASTERCLASS` database, `DE_WORKSHOP_WH` warehouse, `{attendee_id}` schemas, `DE_WORKSHOP_ROLE`
`03_git_workshop_grants.sql`	Git usage grants for `DE_WORKSHOP_ROLE` (Git path only)
`04_external_stage.sql`	`@nyc_taxi_trips_stage` external stage (points to facilitator’s ADLS2 with SAS token)

Git workspace path: this assumes your Git workspace is already linked. In the toolbar (top), make sure a Started warehouse (e.g. COMPUTE_WH) is selected; it is only needed as initial compute. Each script then sets its own role and warehouse via USE statements.

Warning

Skip 01_git_api_integration.sql if you created the Git workspace via the Snowsight UI dialog — the integration was already created automatically. Running it would CREATE OR REPLACE the integration and break your workspace link.

Open each file from the workspace file tree and Run all:

~~snowflake/sql/setup/01_git_api_integration.sql~~ — skip if workspace was created via UI
snowflake/sql/setup/02_account_setup.sql
snowflake/sql/setup/03_git_workshop_grants.sql
snowflake/sql/setup/04_external_stage.sql

Copy-paste path: if Git integration or OAuth is blocked, open the .sql files in Codespaces and paste their content into Snowsight. In the toolbar (top), select a Started warehouse (e.g. COMPUTE_WH); each script then sets its own role and warehouse via USE statements.

Workspaces (or Projects → Workspaces) → + Add New → SQL File
Paste 02_account_setup.sql from Codespaces → Run all
Paste 04_external_stage.sql → Run all

Skip 01_git_api_integration.sql and 03_git_workshop_grants.sql — they are only needed for Git workspace sync.

Note

Before Module 6, run 05_cortex_access.sql (see Module 6 § Cortex access).

Do not run reference/storage_integration.sql — production pattern only; trials use SAS in 04.

Verify setup

After 04_external_stage.sql completes, confirm everything is ready:

SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE();
-- Expect: DE_WORKSHOP_ROLE, DE_WORKSHOP_WH

SHOW SCHEMAS IN DATABASE DE_MASTERCLASS;

LIST @nyc_taxi_trips_stage PATTERN = '.*parquet';

Reference: Warehouse & role switching

No action needed now — keep this table handy for later labs.

Role	When to use
`ACCOUNTADMIN`	Admin scripts — `01`, `02`, `03`, Module 8 streaming setup
`DE_WORKSHOP_ROLE`	All pipeline labs, Snowpark, dbt, Cortex / ML, Power BI

DE_WORKSHOP_WH is created by 02_account_setup.sql, so do not create a second warehouse in the UI. It is X-Small with a 60-second auto-suspend (see Virtual warehouses). Suspend it when finished: ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND; (see Cleanup)

Base Exercise

Option A: SQL Path

A.1 — Snowsight (recommended for workshop)

Open .sql files via Git workspace in Snowsight and run them directly in the browser.

Prefer a notebook? (optional)

Workspaces SQL files are the recommended route for the SQL path. Optional SQL-kernel notebooks live in snowflake/notebooks/01_sql/ (bronze/silver/gold) and write to separate _NB_SQL_* schemas if you prefer the notebook UX.

Step 1: Bronze Ingestion

Open snowflake/sql/bronze/01_ingest_trips.sql
Run the whole script with Run all (Ctrl+Shift+Enter), or select all → Run. The script auto-loads your attendee ID from the config table.

Why COPY INTO from an external stage?

Databricks reads Parquet directly from ADLS2 with spark.read.parquet(). In this lab, Snowflake reads the same files through an external stage (@nyc_taxi_trips_stage) with COPY INTO, Snowflake’s bulk-load command. COPY INTO tracks which files have already been loaded, supports format validation, and can be configured to skip bad rows instead of failing the entire load. For one-time workshop ingestion the difference is small; in production it matters for idempotency.

Reference: Loading data using COPY
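To see why file-level load tracking makes re-runs safe, here is a toy Python model of the idea. It is a sketch under simplifying assumptions, not Snowflake's implementation: the class, the file name, and the row shape are hypothetical, and files are keyed on name only.

```python
class ToyLoader:
    """Toy model of file-level load tracking; not Snowflake's implementation."""

    def __init__(self) -> None:
        self.loaded_files: set[str] = set()  # stands in for the load metadata
        self.rows: list[dict] = []           # stands in for the Bronze table

    def copy_into(self, staged_files: dict[str, list[dict]]) -> int:
        """Load only files not seen before; return the number of rows added."""
        added = 0
        for name, file_rows in sorted(staged_files.items()):
            if name in self.loaded_files:
                continue  # already loaded, so a re-run skips this file
            self.rows.extend(file_rows)
            self.loaded_files.add(name)
            added += len(file_rows)
        return added


# Hypothetical staged file with two rows.
stage = {"trips_part_0.parquet": [{"trip_id": 1}, {"trip_id": 2}]}
loader = ToyLoader()
first = loader.copy_into(stage)   # first run loads the file
second = loader.copy_into(stage)  # second run finds nothing new
print(first, second, len(loader.rows))
```

Running the load twice leaves the table with the same rows, which is the idempotency property the note above refers to.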

Optional: the no-code Snowsight Ingestion UI

The same external stage can be created point-and-click via Ingestion → Add Data → Create an external stage → Microsoft Azure in Snowsight. Snowflake addresses ADLS2 (Data Lake Storage Gen2) through the blob.core.windows.net endpoint, so “Microsoft Azure Blob Storage” is the path to our mhpdeworkshopsa storage; there is no separate ADLS2 option. The wizard simply generates the same CREATE STAGE … URL='azure://…' SQL you ran above.

We teach the SQL-first path because it is version-controlled in the repo and reproducible across attendees. The UI is a handy alternative for ad-hoc loads. Note the Ingestion menu is a newer Snowsight nav item (Preview) and may not yet appear on every trial account.

Reference: Create an Azure stage using Snowsight

Verify (build the object name in a variable first — Snowflake does not allow IDENTIFIER($attendee_id || '...')):

SET attendee_id = (SELECT value FROM DE_MASTERCLASS.PUBLIC._workshop_config WHERE key = 'attendee_id');
SET bronze_trips = 'DE_MASTERCLASS.' || $attendee_id || '_SQL_BRONZE.BRONZE_NYC_TAXI_TRIPS';
SET bronze_zones = 'DE_MASTERCLASS.' || $attendee_id || '_SQL_BRONZE.BRONZE_TAXI_ZONE_LOOKUP';
SELECT COUNT(*) FROM IDENTIFIER($bronze_trips);
SELECT COUNT(*) FROM IDENTIFIER($bronze_zones);

Step 2: Silver Transforms

Open snowflake/sql/silver/01_create_cleaned.sql
Use Run all to execute the CTAS statement

Note

Why CTAS instead of INSERT INTO? CREATE TABLE AS SELECT is atomic: if the query fails, no table is created. A separate CREATE TABLE followed by INSERT INTO could leave an empty table behind if the INSERT fails. Written as CREATE OR REPLACE TABLE … AS SELECT, CTAS also replaces the entire table on re-run (idempotent), which is ideal for workshop re-runs and batch pipelines.

Open snowflake/sql/silver/02_create_enriched.sql
Use Run all to execute the CTAS statement

Verify:

SET attendee_id = (SELECT value FROM DE_MASTERCLASS.PUBLIC._workshop_config WHERE key = 'attendee_id');
SET silver_cleaned = 'DE_MASTERCLASS.' || $attendee_id || '_SQL_SILVER.SILVER_NYC_TAXI_CLEANED';
SET silver_enriched = 'DE_MASTERCLASS.' || $attendee_id || '_SQL_SILVER.SILVER_NYC_TAXI_ENRICHED';
SELECT COUNT(*) FROM IDENTIFIER($silver_cleaned);
SELECT COUNT(*) FROM IDENTIFIER($silver_enriched);

Step 3: Gold KPIs

Open snowflake/sql/gold/01_create_kpis.sql
Run all 12 CTAS statements with Run all (Ctrl+Shift+Enter)

Verify (final block in the script returns PASS/FAIL):

SET attendee_id = (SELECT value FROM DE_MASTERCLASS.PUBLIC._workshop_config WHERE key = 'attendee_id');
SET kpi_trips_by_hour = 'DE_MASTERCLASS.' || $attendee_id || '_SQL_GOLD.kpi_trips_by_hour';
SET kpi_top_pickup_zones = 'DE_MASTERCLASS.' || $attendee_id || '_SQL_GOLD.kpi_top_pickup_zones';
SELECT * FROM IDENTIFIER($kpi_trips_by_hour) ORDER BY pickup_hour;
SELECT * FROM IDENTIFIER($kpi_top_pickup_zones) ORDER BY trip_rank;

Option B: Snowpark Python Path

Snowpark lets you write Python DataFrame code that Snowflake translates to SQL and executes on the warehouse — data never leaves Snowflake.

Why does Snowpark exist?

Snowflake designed Snowpark’s DataFrame API (session.table(), .filter(col(...)), .groupBy().agg()) to mirror PySpark. A PySpark developer can read Snowpark code immediately. The key difference: PySpark runs on Spark clusters (you manage the cluster); Snowpark runs on Snowflake’s warehouse (Snowflake manages the compute). Similar API surface, completely different infrastructure.

Reference: Snowpark Python Developer Guide

B.1 — Workspaces Notebook (recommended for workshop)

Open ready-to-run Python Snowpark notebooks from snowflake/notebooks/02_snowpark/ in Snowsight Workspaces. Each notebook uses get_active_session() — no local credentials needed. Data is written to your _NB_SP_* schemas (e.g. DE_01_ALICE_NB_SP_BRONZE).

Separate schemas by design: the Option A Workspaces SQL file path writes to _SQL_*; these Python Snowpark notebooks write to _NB_SP_*. This keeps the tracks isolated so you can compare results.
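The schema naming scheme can be written down as a small helper, which is handy when comparing tracks. This is a minimal sketch: the function names are hypothetical, while the suffix families (`_SQL_*`, `_SP_*`, `_NB_SQL_*`, `_NB_SP_*`), the layer names, and the `DE_MASTERCLASS` database come from this page.

```python
TRACKS = {"SQL", "SP", "NB_SQL", "NB_SP"}  # suffix families used on this page
LAYERS = {"BRONZE", "SILVER", "GOLD"}


def schema_name(attendee_id: str, track: str, layer: str) -> str:
    """Build e.g. DE_01_ALICE_NB_SP_BRONZE from de_01_alice, NB_SP, BRONZE."""
    track, layer = track.upper(), layer.upper()
    if track not in TRACKS:
        raise ValueError(f"unknown track: {track}")
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{attendee_id.upper()}_{track}_{layer}"


def qualified_table(attendee_id: str, track: str, layer: str, table: str) -> str:
    """Prefix the schema with the workshop database name."""
    return f"DE_MASTERCLASS.{schema_name(attendee_id, track, layer)}.{table}"


print(schema_name("de_01_alice", "NB_SP", "BRONZE"))
print(qualified_table("de_01_alice", "SQL", "SILVER", "SILVER_NYC_TAXI_CLEANED"))
```

Because only the track suffix differs, the same table name can be compared across the SQL and notebook tracks by changing one argument.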

How Workspaces notebooks work

Workspaces is the default editor in Snowsight (as of April 2026). When you run a .ipynb or .py file in Workspaces, Snowflake creates a notebook service — a Snowflake-managed container that hosts the notebook kernel and executes your code. The Snowpark session is obtained via get_active_session().

Two layers:

Compute pool (infrastructure): provides the CPU/memory nodes. Set up once by an admin.
Notebook service (service): runs on a compute pool node. Created once when you first open a .ipynb or .py file and click Connect → Create new service. The same service is then reused by all notebooks (.ipynb) and Python files (.py) in the same workspace — you do not create a new one each time.

To manage your running service, use the Connected dropdown (top of the file) → Manage service — this opens the Services & jobs page. You can also reach it directly via Monitoring → Services & Jobs (left navigation).

Compared to PySpark: similar DataFrame API (session.table(), .filter(), .groupBy()), but Snowflake manages the compute — no cluster to configure.

SPCS setup

Running notebooks or Python files in Workspaces requires two layers: a compute pool (infrastructure, set up once by an admin) and a notebook service (created per user when opening a file).

Step 1: One-time compute pool setup (ACCOUNTADMIN)

Your compute pool setup depends on your account type:

Trial accounts include two pre-provisioned system compute pools — SYSTEM_COMPUTE_POOL_CPU (CPU_X64_S) and SYSTEM_COMPUTE_POOL_GPU (GPU_NV_SM). For this workshop, we only need the CPU pool for running notebook services. No creation needed, and the setup script (snowflake/sql/setup/02_account_setup.sql) already granted USAGE to DE_WORKSHOP_ROLE — just verify.

In the navigation menu, select Compute → Compute Pools
Verify both SYSTEM_COMPUTE_POOL_CPU and SYSTEM_COMPUTE_POOL_GPU are listed
If USAGE was not granted during setup, open a new SQL file in Workspaces (+ → SQL File) and run:

-- Already run by 02_account_setup.sql; use this only to verify or re-grant
USE ROLE ACCOUNTADMIN;

GRANT USAGE ON COMPUTE POOL SYSTEM_COMPUTE_POOL_CPU TO ROLE DE_WORKSHOP_ROLE;

SHOW COMPUTE POOLS;
-- Expect: SYSTEM_COMPUTE_POOL_CPU and SYSTEM_COMPUTE_POOL_GPU (both ACTIVE or IDLE)

Non-trial accounts can create a custom compute pool for finer cost control:

In the navigation menu, select Compute → Compute Pools
Switch to the ACCOUNTADMIN role (bottom-left of the navigation bar)
Select + Compute Pool
Fill in the New compute pool dialog:
- Name: WORKSHOP_CP
- Instance family: CPU_X64_XS
- Node limit: 1
- Idle time: 5 minutes
Select Create Compute Pool

USE ROLE ACCOUNTADMIN;

CREATE COMPUTE POOL WORKSHOP_CP
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = CPU_X64_XS
  AUTO_SUSPEND_SECS = 300;

Grant USAGE and verify:

GRANT USAGE ON COMPUTE POOL WORKSHOP_CP TO ROLE DE_WORKSHOP_ROLE;

SHOW COMPUTE POOLS;
-- Expect: WORKSHOP_CP with state = ACTIVE or IDLE

Step 2: Create your notebook service — one-time setup (all users)

You create a notebook service once. After linking your Git workspace (see the Open from Git tab below), the first time you open a .ipynb or .py file in Workspaces click Connect → Create new service and fill in the dialog:

Service name: accept the default or give it a name
Compute pool: SYSTEM_COMPUTE_POOL_CPU (trial) or WORKSHOP_CP (non-trial)
Python version: select the latest available (3.11+)
Runtime version: select the latest Snowflake Container Runtime
Idle timeout: 30 minutes (saves trial credits during long breaks)

Click Create to start the service. Once running, subsequent .ipynb and .py files in the same workspace can reuse the service — no need to create a new one each time.

Managing services

To manage your service: open the Connected dropdown (top of the notebook file) → Manage service — this takes you to the Services & jobs page where you can view status, suspend, or drop services. Alternatively, navigate directly via Monitoring → Services & Jobs (left navigation).

Note: Creating a notebook service requires the compute pool to allow the NOTEBOOK workload type (ALLOWED_SPCS_WORKLOAD_TYPES). The default value is ALL, which already includes NOTEBOOK — no action needed unless an admin has restricted it.

Cost management

Layer	Setting	Value	Effect
Compute pool	Idle time / `AUTO_SUSPEND_SECS`	5 min (300 s)	Suspends the pool when no services are running
Notebook service	Idle timeout	30 min	Suspends the service after 30 min of inactivity, freeing the pool node

Cost: ~0.5 credits/hour (CPU_X64_XS) or ~1 credit/hour (CPU_X64_S / trial) — well within the $400 trial budget.
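A rough upper bound for one notebook session can be estimated from the two idle settings in the table. The sketch below assumes a simplified billing model (the node stays billable until the service idles out and the pool then auto-suspends) and takes the hourly rate as a parameter; the rates used in the example are the approximate figures quoted above, so check the current Snowflake consumption table before relying on them. The function name is hypothetical.

```python
def notebook_session_credits(
    active_minutes: float,
    credits_per_hour: float,
    service_idle_timeout_min: float = 30,
    pool_auto_suspend_min: float = 5,
) -> float:
    """Upper-bound credits for one session under a simplified billing model."""
    # Assumption: billing continues through both idle windows after the last activity.
    billed_minutes = active_minutes + service_idle_timeout_min + pool_auto_suspend_min
    return round(billed_minutes / 60 * credits_per_hour, 3)


# One hour of active work at the approximate rates quoted on this page.
print(notebook_session_credits(60, credits_per_hour=1.0))  # CPU_X64_S (trial pool)
print(notebook_session_credits(60, credits_per_hour=0.5))  # CPU_X64_XS (custom pool)
```

The idle windows add a fixed 35 minutes per session in this model, which is why stopping the service manually after a lab saves more than shortening the lab itself.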

Connect your Git workspace to the fork (setup § Git)
Navigate to snowflake/notebooks/02_snowpark/ and open a Python notebook (e.g. 01_bronze_ingestion.ipynb)
Click Connect and select your notebook service (first time only: Create new service — see Step 2 dialog)
Run cells in order — Bronze → Silver → Gold
Look for [PASS] on all verification lines after each step

Notebook	Writes to
`01_bronze_ingestion.ipynb`	`_NB_SP_BRONZE`
`02_silver_cleaning.ipynb`	`_NB_SP_SILVER`
`03_gold_kpis.ipynb`	`_NB_SP_GOLD`

If you have not linked a Git workspace, create a blank notebook and paste code from the generated notebooks:

In Workspaces, select + Notebook — a new .ipynb file is created
Click Connect (top bar) and select your notebook service (first time only: Create new service — see Step 2 dialog)
Copy the cell contents from snowflake/notebooks/02_snowpark/01_bronze_ingestion.ipynb (available in the repo) and paste into your blank notebook — the first cell already contains the setup code with get_active_session() and _NB_SP_* schema variables
Repeat for 02_silver_cleaning.ipynb and 03_gold_kpis.ipynb
Look for [PASS] on all verification lines after each step

The Snowpark scripts in snowflake/snowpark/ are designed for local terminal execution with _SP_* schemas. The .ipynb notebooks in snowflake/notebooks/02_snowpark/ are the Workspaces-native equivalent — use those instead for the workshop.

If you want to run the raw .py files in Workspaces:

Open the .py file from your Git workspace

Remove the from _snowpark_bootstrap import ... block and add the setup manually:

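# Use the session that Workspaces provides; no local credentials are needed
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Load the attendee ID saved by 02_account_setup.sql and upper-case it for schema names
ATTENDEE_ID = session.sql("SELECT value FROM DE_MASTERCLASS.PUBLIC._workshop_config WHERE key='attendee_id'").collect()[0][0].upper()
# Notebook-track Snowpark schemas (_NB_SP_*), kept separate from the terminal _SP_* schemas
NB_SP_BRONZE_SCHEMA = f"{ATTENDEE_ID}_NB_SP_BRONZE"
NB_SP_SILVER_SCHEMA = f"{ATTENDEE_ID}_NB_SP_SILVER"
NB_SP_GOLD_SCHEMA   = f"{ATTENDEE_ID}_NB_SP_GOLD"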

Replace SP_BRONZE_SCHEMA/SP_SILVER_SCHEMA/SP_GOLD_SCHEMA with NB_SP_* variants in the script body
Click Connect and select your notebook service, then run

Reference:

Topic	Link
Workspaces overview	Notebooks in Workspaces
Compute pool setup	Compute setup
Snowpark DataFrame API	Python Snowpark API
Running .py in Workspaces	Python files in Workspaces

Expected Results

Same as Databricks exercise — Bronze 3,646,319 rows, Silver 3,146,710 rows (499,609 removed, 13.7%), Gold KPIs identical. See Workshop dataset volumes.
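The expected counts reconcile with simple arithmetic, which is a quick sanity check to run against your own numbers. A minimal sketch, using only the figures stated above; the function name is hypothetical.

```python
BRONZE_ROWS = 3_646_319
SILVER_ROWS = 3_146_710
REMOVED_ROWS = 499_609


def removal_summary(bronze: int, silver: int) -> tuple[int, float]:
    """Return (rows removed, percent removed rounded to one decimal place)."""
    removed = bronze - silver
    return removed, round(100 * removed / bronze, 1)


removed, pct = removal_summary(BRONZE_ROWS, SILVER_ROWS)
assert removed == REMOVED_ROWS  # Bronze minus Silver matches the stated removals
print(f"{removed:,} rows removed ({pct}%)")
```

If your own Bronze or Silver count differs, substituting it here shows immediately whether the gap is in ingestion or in the Silver filters.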

Beyond the workshop — when to use Snowsight vs a terminal

The Snowsight paths you just completed — A.1 (SQL in browser) and B.1 (Notebook in Workspaces) — are designed for interactive learning, ad-hoc queries, and quick testing. They work well for workshops, data exploration, and one-off analysis because everything runs inside the browser with no local setup.

In production projects, data engineers typically run pipelines from a terminal or CI/CD system instead:

Path	Tool	Typical use
A.2 — Snowflake CLI	`snow sql -f <file>.sql`	Scheduled SQL runs, CI/CD pipelines, version-controlled deployments
B.2 — Snowpark (terminal)	`python snowflake/snowpark/...`	Automated data pipelines, GitHub Actions, Airflow DAGs

These codebase paths use .env credentials and run from your machine — the same pattern you would use in a real project.

If you have extra time, try the optional codebase lab to practice these production-style workflows.

Stretch Goals

A: Compare SQL vs Snowpark Output

Run both recommended paths (A.1 Workspaces SQL file + B.1 notebook) and compare row counts:

SET attendee_id = (SELECT value FROM DE_MASTERCLASS.PUBLIC._workshop_config WHERE key = 'attendee_id');
SET sql_silver = 'DE_MASTERCLASS.' || $attendee_id || '_SQL_SILVER.SILVER_NYC_TAXI_CLEANED';
SET sp_silver = 'DE_MASTERCLASS.' || $attendee_id || '_NB_SP_SILVER.SILVER_NYC_TAXI_CLEANED';
SELECT 'SQL Silver' AS source, COUNT(*) AS cnt FROM IDENTIFIER($sql_silver)
UNION ALL
SELECT 'Snowpark Silver', COUNT(*) FROM IDENTIFIER($sp_silver);

B: Use Query Profile

Run one of the Gold KPI queries
Click the Query ID link in the results pane
Explore the Query Profile — identify the most expensive operator
How does Snowflake distribute the GROUP BY across warehouse nodes?

C: Visualize Gold KPIs (Streamlit or Power BI)

Legacy Snowsight Dashboards are retired in 2026 — use one of these instead:

Streamlit in Snowflake (native): Projects → Streamlit → create an app → query kpi_trips_by_hour (bar chart) and kpi_top_pickup_zones (table) from your {attendee_id}_SQL_GOLD schema. Reference: Streamlit in Snowflake · Dashboard deprecation
Power BI (optional): if you completed Exercise: Power BI, connect the same Gold tables via your Snowflake account host

D: Compare Naming Conventions

Snowflake resolves unquoted identifiers to UPPER_CASE by default. Databricks uses lower_case. Both are valid. What implications does this have for cross-platform dbt models?
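As a starting point for that discussion, the Snowflake side of the rule can be modeled in a few lines: unquoted identifiers fold to upper case, while double-quoted identifiers keep their exact case. The function below is a hypothetical illustration of that rule only, not a parser for real SQL.

```python
def snowflake_resolved_name(identifier: str) -> str:
    """Model how Snowflake resolves an identifier as written in a query."""
    quoted = len(identifier) >= 2 and identifier.startswith('"') and identifier.endswith('"')
    if quoted:
        # Quoted: case is preserved; a doubled quote inside stands for one quote.
        return identifier[1:-1].replace('""', '"')
    # Unquoted: folded to upper case.
    return identifier.upper()


for written in ("pickup_hour", "PICKUP_HOUR", '"pickup_hour"'):
    print(f"{written!r:>16} -> {snowflake_resolved_name(written)}")
```

The first two spellings resolve to the same column, but the quoted lower-case form names a different one, so a model that quotes lower-case column names produces columns that unquoted queries will not find.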

Discussion questions

These wrap up the Snowflake lab — your facilitator may use them to open the floor.

Snowflake auto-suspended the warehouse after 60 seconds of inactivity. Databricks kept the cluster running until you stopped it. What does this mean for the cost of running the pipeline daily vs. hourly?
The COPY INTO Bronze ingestion needed an external stage configured. How does this compare to the Databricks approach of reading directly from ADLS2?
UPPER_CASE column names in Snowflake vs lower_case in Databricks — you ran the same dbt models against both targets. How did dbt handle the case difference?

Ready to compare all three tools?

The full cross-tool observation table, the “what you should have noticed” insights, and the SQL-vs-Snowpark decision questions live on the Batch Pipeline Comparison page. Fill them in once after running every pipeline (Databricks, Snowflake, dbt). The page is verified against official documentation and covers storage formats, compute models, cost implications, and key gotchas.

Reference — What the Silver layer did (read later)

Not part of the lab steps. Use this when interpreting Gold KPIs or comparing SQL vs Snowpark outputs.

Silver outputs

Table	Contents
`silver_nyc_taxi_cleaned`	Quality filters, corrections, derived columns — before zone join
`silver_nyc_taxi_enriched`	Cleaned trips plus zone lookup — Gold KPIs read this table

1. Filtered out (rows removed)

Silver drops trips that would distort KPIs. Workshop month (Oct 2024): 499,609 of 3,646,319 Bronze rows removed (13.7%); 3,146,710 remain in Silver. See Workshop dataset volumes.

Rule	Condition
Valid timestamps	Pickup and dropoff not null; pickup ≤ dropoff
Reasonable duration	Trip between 1 minute and 24 hours (1–1440 minutes)
Positive distance	`trip_distance > 0`
Positive fare & total	`fare_amount > 0`, `total_amount > 0`
Valid passengers	`passenger_count` between 1 and 8
Airport sanity	No zero-distance trips with airport fee
Deduplication	Exact duplicate rows removed
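The row-level rules in the table can be read as a single predicate per trip. The sketch below is an illustration, not the workshop's SQL: `pickup_datetime` and `dropoff_datetime` are hypothetical column names, null values are treated as failing (as a SQL WHERE would), and deduplication is left out because it operates on the whole table. The airport rule needs no separate check here, since zero-distance trips already fail the distance rule.

```python
from datetime import datetime


def _positive(value) -> bool:
    """True only for non-null values greater than zero."""
    return value is not None and value > 0


def passes_silver_filters(trip: dict) -> bool:
    """Apply the row-level Silver rules to one trip record."""
    pickup = trip.get("pickup_datetime")    # hypothetical column name
    dropoff = trip.get("dropoff_datetime")  # hypothetical column name
    if pickup is None or dropoff is None or pickup > dropoff:
        return False  # valid timestamps
    minutes = (dropoff - pickup).total_seconds() / 60
    if not 1 <= minutes <= 1440:
        return False  # reasonable duration
    if not _positive(trip.get("trip_distance")):
        return False  # positive distance (also covers zero-distance airport trips)
    if not (_positive(trip.get("fare_amount")) and _positive(trip.get("total_amount"))):
        return False  # positive fare and total
    passengers = trip.get("passenger_count")
    if passengers is None or not 1 <= passengers <= 8:
        return False  # valid passengers
    return True


good_trip = {
    "pickup_datetime": datetime(2024, 10, 5, 8, 0),
    "dropoff_datetime": datetime(2024, 10, 5, 8, 20),
    "trip_distance": 3.0,
    "fare_amount": 15.0,
    "total_amount": 20.0,
    "passenger_count": 1,
}
print(passes_silver_filters(good_trip))
print(passes_silver_filters({**good_trip, "trip_distance": 0.0}))
```

Writing the rules as one predicate makes it easy to test a suspicious row by hand before looking for it in the Silver table.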

Databricks only: extra cross-month filter on raw Parquet (adjacent-month outliers).

Full list: Data Model — Data Quality Rules.

2. Corrected (row kept, value fixed)

Field	Change
`tip_amount`	Negative tips set to 0
`payment_type_desc`	Code → label (`Credit card`, `Cash`, `Unknown`, …)
`rate_code_desc`	Code → label (`Standard rate`, `JFK`, `Unknown`, …)

3. Derived (new columns)

Examples: trip_duration_minutes, fare_per_mile, tip_percentage, avg_speed, pickup_hour, day_of_week, is_weekend, is_peak_hour, time_of_day, distance_band.
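A few of these derived columns can be illustrated with plain Python. The formulas below are assumed definitions chosen for illustration (distance in miles, speed in miles per hour, weekend meaning Saturday or Sunday); the authoritative definitions are in the Silver scripts and the Data Model catalog. The function name is hypothetical, and the input is assumed to have already passed the Silver filters, so distance and duration are non-zero.

```python
from datetime import datetime


def derive_columns(pickup: datetime, dropoff: datetime,
                   trip_distance: float, fare_amount: float) -> dict:
    """Compute a subset of derived columns under assumed definitions."""
    minutes = (dropoff - pickup).total_seconds() / 60
    return {
        "trip_duration_minutes": round(minutes, 2),
        "fare_per_mile": round(fare_amount / trip_distance, 2),
        "avg_speed": round(trip_distance / (minutes / 60), 2),  # miles per hour
        "pickup_hour": pickup.hour,
        "is_weekend": pickup.weekday() >= 5,  # Monday=0, so 5 and 6 are Sat/Sun
    }


# A 20-minute, 3-mile trip on Saturday 5 October 2024.
example = derive_columns(
    pickup=datetime(2024, 10, 5, 8, 0),
    dropoff=datetime(2024, 10, 5, 8, 20),
    trip_distance=3.0,
    fare_amount=15.0,
)
print(example)
```

Every value is computed from columns that survive the Silver filters, which is why the filters run before derivation.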

Catalog: Data Model — Silver schema.

4. Joined / enriched

LEFT JOIN taxi_zone_lookup on pickup_location_id and dropoff_location_id:

pickup_zone, pickup_borough, pickup_service_zone
dropoff_zone, dropoff_borough, dropoff_service_zone
is_same_borough

Join succeeds for TLC zone IDs 1–265 (including catch-all zones below).

5. Kept on purpose (not filtered in Silver)

LocationID	Borough	Zone	Why you may see it in Gold / Power BI
264	`Unknown`	`N/A`	Valid TLC “unknown zone” bucket
265	`N/A`	Outside of NYC	Valid out-of-NYC bucket
1	`EWR`	Newark Airport	Valid airport zone — not a NYC borough

True null zones (ID not in lookup, e.g. corrupt 0) → pickup_zone IS NULL — tracked in kpi_data_quality_metrics; excluded from some Gold KPIs (kpi_top_pickup_zones, kpi_borough_analysis when borough is null).
Power BI Map page: filter EWR, Unknown, and N/A on pickup_borough if you want NYC boroughs only — optional visual filter, not a pipeline change.
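The difference between a catch-all zone and a true null zone comes down to LEFT JOIN semantics, sketched here with a dictionary lookup. The lookup holds only the three rows from the table above (the real lookup covers IDs 1–265), and the function name is hypothetical.

```python
# Three rows from the table above: LocationID -> (borough, zone).
ZONE_LOOKUP = {
    1: ("EWR", "Newark Airport"),
    264: ("Unknown", "N/A"),
    265: ("N/A", "Outside of NYC"),
}


def enrich_pickup(trip: dict) -> dict:
    """LEFT JOIN semantics: keep the trip even when the ID has no match."""
    borough, zone = ZONE_LOOKUP.get(trip["pickup_location_id"], (None, None))
    return {**trip, "pickup_borough": borough, "pickup_zone": zone}


for location_id in (264, 1, 0):
    print(enrich_pickup({"pickup_location_id": location_id}))
```

ID 264 joins successfully and carries the label `Unknown`, while the corrupt ID 0 finds no match and ends up with a null zone, which is the case the data quality metrics track.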

Detail: Data Model — Zone lookup edge cases.

6. Silver → Gold → Power BI

Bronze (raw) → Silver cleaned → Silver enriched → 12 Gold kpi_* tables → Power BI

Gold aggregates enriched trips (hour, zone, borough, payment, …). It does not re-apply Silver quality rules — rows in silver_nyc_taxi_enriched can include Unknown / N/A boroughs unless a Gold model filters them. Per-table business meaning and design: Data Model — KPI catalog.

Cleanup

When you are finished with the workshop, run the central cleanup script to remove all your schemas and objects across every track in one shot:

Workspaces SQL file: open snowflake/sql/setup/99_cleanup.sql and run it.
Workspaces notebook: open 00_setup/99_cleanup and click Run all.

The script drops all attendee-specific schemas (_SQL_*, _SP_*, _NB_SQL_*, _NB_SP_*, _STREAMING, _DBT, _DBT_GOLD, _DBT_STREAMING) with CASCADE, removes ML tables and views, and suspends the warehouse.

Note

If you used Workspaces notebooks with SPCS (path B.1), stop your notebook service before running the cleanup:

Primary path: Connected dropdown (top of notebook) → Manage service → Stop
Alternative: Monitoring → Services & Jobs → select service → Stop
Or via SQL: ALTER SERVICE <service_name> SUSPEND;

Trial accounts use SYSTEM_COMPUTE_POOL_CPU (Snowflake-managed) — no pool cleanup needed.

Tip

The cleanup script is idempotent — safe to run more than once. It uses DROP … IF EXISTS throughout.

Return to module

Module 3 — story wrapper