Exercise: Snowflake Pipeline
YellowLine NYC story · full hands-on lab
title: “Exercise: Snowflake Pipeline” subtitle: “YellowLine NYC story · full hands-on lab” —
Estimated time: 30–35 min (Base: 20 min · Stretch: 10–15 min)
YellowLine NYC context (Module 3)
Marcus needs SQL maintainability. Rebuild the same medallion and KPIs on Snowflake.
Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. All workshop files are at /workspace/. SQL scripts can be copied directly into a Snowflake worksheet.
Prerequisites
- Snowflake trial account
- Setup scripts (
00_account_setup.sqlthrough02_external_stage.sql) completed
Base Exercise — Choose Your Path
Option A: SQL Path
Step 1: Bronze Ingestion
Open
snowflake/sql/bronze/01_ingest_trips.sqlin SnowsightSet your attendee ID:
SET attendee_id = 'XX_yourname';Run all statements
Verify:
SELECT COUNT(*) FROM IDENTIFIER($attendee_id || '_BRONZE.BRONZE_NYC_TAXI_TRIPS'); SELECT COUNT(*) FROM IDENTIFIER($attendee_id || '_BRONZE.BRONZE_TAXI_ZONE_LOOKUP');
Step 2: Silver Transforms
Open
snowflake/sql/silver/01_create_cleaned.sqlRun the CTAS statement
Open
snowflake/sql/silver/02_create_enriched.sqlRun the CTAS statement
Verify:
SELECT COUNT(*) FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_CLEANED'); SELECT COUNT(*) FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_ENRICHED');
Step 3: Gold KPIs
Open
snowflake/sql/gold/01_create_kpis.sqlRun all 12 CTAS statements
Verify:
SELECT * FROM IDENTIFIER($attendee_id || '_GOLD.GOLD_KPI_TRIPS_BY_HOUR') ORDER BY PICKUP_HOUR;
Option B: Snowpark Python Path
Step 1: Bronze Ingestion
- Open a Snowflake Python worksheet (or use local Python with Snowpark)
- Run
snowflake/snowpark/01_bronze_ingestion.py
Step 2: Silver Transforms
- Run
snowflake/snowpark/02_silver_cleaning.py
Step 3: Gold KPIs
- Run
snowflake/snowpark/03_gold_kpis.py
Expected Results
Same as Databricks exercise — Bronze ~3M rows, Silver ~2.5-2.8M, Gold KPIs identical.
Stretch Goals
A: Compare SQL vs Snowpark Output
Run both paths and compare row counts:
-- If you ran both, counts should match exactly
SELECT 'SQL Silver' AS source, COUNT(*) AS cnt
FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_CLEANED')
UNION ALL
SELECT 'Snowpark Silver', COUNT(*)
FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_CLEANED_SP');B: Use Query Profile
- Run one of the Gold KPI queries
- Click the Query ID link in the results pane
- Explore the Query Profile — identify the most expensive operator
- How does Snowflake distribute the GROUP BY across warehouse nodes?
C: Create a Snowsight Dashboard
- Go to Dashboards in Snowsight
- Create a new dashboard
- Add a tile for
GOLD_KPI_TRIPS_BY_HOURas a bar chart - Add a tile for
GOLD_KPI_TOP_PICKUP_ZONESas a table
D: Compare Naming Conventions
Snowflake uses UPPER_CASE by default. Databricks uses lower_case. Both are valid — what implications does this have for cross-platform dbt models?
Compare Your Results
Complete this after you have also run the Databricks exercise. Having both sets of results makes the comparison concrete.
Snowflake SQL vs Snowpark vs Databricks
| What you observed | Snowflake SQL | Snowflake Snowpark | Databricks PySpark |
|---|---|---|---|
| Language | SQL | Python (Snowpark) | Python (PySpark) |
| Where you wrote the transform | Snowsight worksheet | Python file / notebook | Databricks notebook |
| How you ingested Bronze | COPY INTO from external stage |
Snowpark read.parquet() |
spark.read.parquet() from ADLS2 |
| Silver row count | ~2.5–2.8M | ~2.5–2.8M | ~2.5–2.8M |
| How you ran quality filters | WHERE clause in CTAS |
.filter(col(...)) |
.filter(col(...)) |
| Did data leave Snowflake? | No | No | No (data in ADLS2 / Delta) |
| How long did the Gold KPIs take? | ______ | ______ | ______ |
| How did you verify results? | Manual SELECT COUNT(*) |
Manual SELECT COUNT(*) |
Manual SELECT COUNT(*) |
1. Snowflake SQL was the most concise for standard transforms — each layer was a CREATE TABLE AS SELECT with filters and joins inline. No session management, no DataFrame API, no imports.
2. Snowpark and PySpark look nearly identical — this is intentional. session.table() vs spark.table(), .filter(col("fare_amount") > 0), .write.save_as_table() vs .write.saveAsTable(). A PySpark developer can read Snowpark code immediately.
3. The critical difference: where compute runs — Snowpark runs Python code on the Snowflake warehouse. PySpark runs on Spark executors. For large data, both scale — but the cost model, latency, and operational overhead differ.
4. COPY INTO vs spark.read.parquet() — Snowflake’s Bronze ingestion requires external stage setup and uses a SQL COPY INTO command. Databricks reads parquet files directly from cloud storage with one line of PySpark. Both approaches work; COPY INTO gives more control over format handling and error files.
5. Neither Snowflake path ran automated tests — you validated manually. Contrast this with dbt, where dbt test checks all models automatically after each run.
Discussion questions
- You ran the same Silver transform in SQL and Snowpark — the code looks different but the output is identical. When would you choose Snowpark over SQL in a real project?
- Snowflake auto-suspended the warehouse after 60 seconds of inactivity. Databricks kept the cluster running until you stopped it. What does this mean for the cost of running the pipeline daily vs. hourly?
- The
COPY INTOBronze ingestion needed an external stage configured. How does this compare to the Databricks approach of reading directly from ADLS2? - UPPER_CASE column names in Snowflake vs lower_case in Databricks — you ran the same dbt models against both targets. How did dbt handle the case difference?
Head to the Batch Pipeline Comparison page for a full side-by-side breakdown of Databricks, Snowflake, and dbt — verified against official documentation — including storage formats, compute models, cost implications, and key gotchas.
Cleanup
When you are finished exploring, suspend your warehouse to avoid ongoing charges:
-- Suspend the warehouse (auto-suspend handles this, but explicit is safer)
ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND;
-- (Optional) Drop your schemas to free storage
DROP SCHEMA IF EXISTS IDENTIFIER($attendee_id || '_GOLD') CASCADE;
DROP SCHEMA IF EXISTS IDENTIFIER($attendee_id || '_SILVER') CASCADE;
DROP SCHEMA IF EXISTS IDENTIFIER($attendee_id || '_BRONZE') CASCADE;Return to module
Source: merged from frozen workshop-2026-v1/exercises/ex-snowflake.qmd — do not edit workshop-2026-v1/.