Exercise: Snowflake Pipeline

YellowLine NYC story · full hands-on lab

title: “Exercise: Snowflake Pipeline” subtitle: “YellowLine NYC story · full hands-on lab” —

Estimated time: 30–35 min (Base: 20 min · Stretch: 10–15 min)

YellowLine NYC context (Module 3)

Marcus needs SQL maintainability. Rebuild the same medallion and KPIs on Snowflake.

TipWorking Environment

Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. All workshop files are at /workspace/. SQL scripts can be copied directly into a Snowflake worksheet.

Prerequisites

  • Snowflake trial account
  • Setup scripts (00_account_setup.sql through 02_external_stage.sql) completed

Base Exercise — Choose Your Path

Option A: SQL Path

Step 1: Bronze Ingestion

  1. Open snowflake/sql/bronze/01_ingest_trips.sql in Snowsight

  2. Set your attendee ID: SET attendee_id = 'XX_yourname';

  3. Run all statements

  4. Verify:

    SELECT COUNT(*) FROM IDENTIFIER($attendee_id || '_BRONZE.BRONZE_NYC_TAXI_TRIPS');
    SELECT COUNT(*) FROM IDENTIFIER($attendee_id || '_BRONZE.BRONZE_TAXI_ZONE_LOOKUP');

Step 2: Silver Transforms

  1. Open snowflake/sql/silver/01_create_cleaned.sql

  2. Run the CTAS statement

  3. Open snowflake/sql/silver/02_create_enriched.sql

  4. Run the CTAS statement

  5. Verify:

    SELECT COUNT(*) FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_CLEANED');
    SELECT COUNT(*) FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_ENRICHED');

Step 3: Gold KPIs

  1. Open snowflake/sql/gold/01_create_kpis.sql

  2. Run all 12 CTAS statements

  3. Verify:

    SELECT * FROM IDENTIFIER($attendee_id || '_GOLD.GOLD_KPI_TRIPS_BY_HOUR') ORDER BY PICKUP_HOUR;

Option B: Snowpark Python Path

Step 1: Bronze Ingestion

  1. Open a Snowflake Python worksheet (or use local Python with Snowpark)
  2. Run snowflake/snowpark/01_bronze_ingestion.py

Step 2: Silver Transforms

  1. Run snowflake/snowpark/02_silver_cleaning.py

Step 3: Gold KPIs

  1. Run snowflake/snowpark/03_gold_kpis.py

Expected Results

Same as Databricks exercise — Bronze ~3M rows, Silver ~2.5-2.8M, Gold KPIs identical.

Stretch Goals

A: Compare SQL vs Snowpark Output

Run both paths and compare row counts:

-- If you ran both, counts should match exactly
SELECT 'SQL Silver' AS source, COUNT(*) AS cnt
FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_CLEANED')
UNION ALL
SELECT 'Snowpark Silver', COUNT(*)
FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_CLEANED_SP');

B: Use Query Profile

  1. Run one of the Gold KPI queries
  2. Click the Query ID link in the results pane
  3. Explore the Query Profile — identify the most expensive operator
  4. How does Snowflake distribute the GROUP BY across warehouse nodes?

C: Create a Snowsight Dashboard

  1. Go to Dashboards in Snowsight
  2. Create a new dashboard
  3. Add a tile for GOLD_KPI_TRIPS_BY_HOUR as a bar chart
  4. Add a tile for GOLD_KPI_TOP_PICKUP_ZONES as a table

D: Compare Naming Conventions

Snowflake uses UPPER_CASE by default. Databricks uses lower_case. Both are valid — what implications does this have for cross-platform dbt models?


Compare Your Results

Complete this after you have also run the Databricks exercise. Having both sets of results makes the comparison concrete.

Snowflake SQL vs Snowpark vs Databricks

What you observed Snowflake SQL Snowflake Snowpark Databricks PySpark
Language SQL Python (Snowpark) Python (PySpark)
Where you wrote the transform Snowsight worksheet Python file / notebook Databricks notebook
How you ingested Bronze COPY INTO from external stage Snowpark read.parquet() spark.read.parquet() from ADLS2
Silver row count ~2.5–2.8M ~2.5–2.8M ~2.5–2.8M
How you ran quality filters WHERE clause in CTAS .filter(col(...)) .filter(col(...))
Did data leave Snowflake? No No No (data in ADLS2 / Delta)
How long did the Gold KPIs take? ______ ______ ______
How did you verify results? Manual SELECT COUNT(*) Manual SELECT COUNT(*) Manual SELECT COUNT(*)
TipThings you should have noticed

1. Snowflake SQL was the most concise for standard transforms — each layer was a CREATE TABLE AS SELECT with filters and joins inline. No session management, no DataFrame API, no imports.

2. Snowpark and PySpark look nearly identical — this is intentional. session.table() vs spark.table(), .filter(col("fare_amount") > 0), .write.save_as_table() vs .write.saveAsTable(). A PySpark developer can read Snowpark code immediately.

3. The critical difference: where compute runs — Snowpark runs Python code on the Snowflake warehouse. PySpark runs on Spark executors. For large data, both scale — but the cost model, latency, and operational overhead differ.

4. COPY INTO vs spark.read.parquet() — Snowflake’s Bronze ingestion requires external stage setup and uses a SQL COPY INTO command. Databricks reads parquet files directly from cloud storage with one line of PySpark. Both approaches work; COPY INTO gives more control over format handling and error files.

5. Neither Snowflake path ran automated tests — you validated manually. Contrast this with dbt, where dbt test checks all models automatically after each run.

Discussion questions

  1. You ran the same Silver transform in SQL and Snowpark — the code looks different but the output is identical. When would you choose Snowpark over SQL in a real project?
  2. Snowflake auto-suspended the warehouse after 60 seconds of inactivity. Databricks kept the cluster running until you stopped it. What does this mean for the cost of running the pipeline daily vs. hourly?
  3. The COPY INTO Bronze ingestion needed an external stage configured. How does this compare to the Databricks approach of reading directly from ADLS2?
  4. UPPER_CASE column names in Snowflake vs lower_case in Databricks — you ran the same dbt models against both targets. How did dbt handle the case difference?
NoteReady to compare all three tools?

Head to the Batch Pipeline Comparison page for a full side-by-side breakdown of Databricks, Snowflake, and dbt — verified against official documentation — including storage formats, compute models, cost implications, and key gotchas.


Cleanup

When you are finished exploring, suspend your warehouse to avoid ongoing charges:

-- Suspend the warehouse (auto-suspend handles this, but explicit is safer)
ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND;

-- (Optional) Drop your schemas to free storage
DROP SCHEMA IF EXISTS IDENTIFIER($attendee_id || '_GOLD') CASCADE;
DROP SCHEMA IF EXISTS IDENTIFIER($attendee_id || '_SILVER') CASCADE;
DROP SCHEMA IF EXISTS IDENTIFIER($attendee_id || '_BRONZE') CASCADE;

Return to module

Source: merged from frozen workshop-2026-v1/exercises/ex-snowflake.qmd — do not edit workshop-2026-v1/.