Module 3: Snowflake Pipeline

SQL, Snowpark, and external stages

Duration: 75 min — Animation (3) · Think & Discuss (7) · Theory (15) · Quiz (3) · Practice (47)

1. Animation

2. Think & Discuss

Situation: Marcus wants SQL maintainability. Elena asks Bob to rebuild the same 12 KPIs on Snowflake. Priya connects her Map page — identical Gold schema.

Prompts:

  • What is Marcus asking for — a different tool, skill set, or both?
  • How do you load Parquet from ADLS2 into Snowflake Bronze without Spark?
  • If Gold schema stays identical, can Priya keep the same Power BI reports?

3. Theory

Marcus: “We need SQL, not notebooks.” Elena: Rebuild the same medallion on Snowflake — identical Gold KPIs, SQL-first maintainability.

Same architecture. Different implementation philosophy.

NoteExternal stages from ADLS2

Snowflake loads Parquet via external stage + COPY INTO — two steps vs Databricks one-line spark.read.parquet(). More control over error handling; familiar pattern for SQL teams.

3.1 Theory: Snowflake Essentials

Architecture

Snowflake uses a three-layer architecture that separates storage, compute, and cloud services into independent, scalable tiers:

  • Storage layer: Data is stored in Snowflake’s proprietary micro-partitioned columnar format — compressed, automatically clustered, and not directly accessible as files (unlike Delta Lake’s open Parquet). Data is organized into databases and schemas, which are logical namespaces only (they don’t map to file paths).

  • Compute layer (Virtual Warehouses): Independent compute clusters that execute queries. Warehouses range from X-Small (1 credit/hour) to 6X-Large (512 credits/hour). They auto-suspend after a configurable idle period (default: 10 minutes) and resume in 2–5 seconds — so you only pay while queries are running. For ETL workloads that finish quickly, set auto-suspend to 60 seconds to minimize idle billing.

  • Cloud services layer: Authentication, query parsing, metadata management, and access control — all managed by Snowflake. This includes the Snowflake Horizon Catalog (governance: tags, masking policies, lineage), which is the parallel to Databricks Unity Catalog.

flowchart TD
    subgraph CloudServices ["Cloud Services Layer"]
        AUTH["Authentication & RBAC"]
        QP["Query Parsing & Optimization"]
        META["Metadata & Lineage<br/><i>Horizon Catalog</i>"]
    end

    subgraph Compute ["Compute Layer — Virtual Warehouses"]
        WH1["X-Small<br/><i>1 credit/hr</i>"]
        WH2["Small<br/><i>2 credits/hr</i>"]
        WH3["Medium+<br/><i>auto-scale</i>"]
    end

    subgraph Storage ["Storage Layer"]
        DB["Databases & Schemas"]
        MP["Micro-partitions<br/><i>compressed columnar</i>"]
        STG["External Stages<br/><i>ADLS2 pointers</i>"]
    end

    CloudServices --> Compute
    Compute --> Storage

    classDef cloud fill:#29B5E8,color:#fff,stroke:#1e9bc9
    classDef compute fill:#0369a1,color:#fff,stroke:#075985
    classDef storage fill:#475569,color:#fff,stroke:#334155

    class AUTH,QP,META cloud
    class WH1,WH2,WH3 compute
    class DB,MP,STG storage

Warehouse sizing for ETL

Set auto-suspend to 60 seconds for ETL warehouses — the default 10 minutes wastes credits for batch workloads that finish quickly. Use X-Small or Small for development and most Silver/Gold transforms; only scale up when you see spilling to remote storage or long query times.

External Stages + COPY INTO

Snowflake requires a two-step ingestion pattern: first create a stage pointing to cloud storage, then execute COPY INTO to bulk-load data. This is more verbose than Databricks’ one-line spark.read.parquet(), but gives explicit control over file format options, error handling (ON_ERROR = CONTINUE logs bad rows to an error table), and schema inference (INFER_SCHEMA = TRUE for Parquet/Avro/ORC/JSON/CSV).

-- Create external stage pointing to ADLS2
CREATE STAGE my_stage
  URL = 'azure://account.blob.core.windows.net/container/path/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '...');

-- Bulk load Parquet into table (INFER_SCHEMA auto-detects columns)
COPY INTO my_table FROM @my_stage
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

flowchart LR
    ADLS2[("Azure ADLS2<br/>Parquet files")]
    STAGE["@my_stage<br/><i>External Stage</i>"]
    TABLE[("my_table<br/><i>micro-partitions</i>")]

    ADLS2 -->|"CREATE STAGE<br/>URL + SAS token"| STAGE
    STAGE -->|"COPY INTO<br/>FILE_FORMAT = PARQUET"| TABLE

    classDef source fill:#0057b8,color:#fff,stroke:#003d82
    classDef stage fill:#29B5E8,color:#fff,stroke:#1e9bc9
    classDef table fill:#01065c,color:#fff,stroke:#000940

    class ADLS2 source
    class STAGE stage
    class TABLE table

Column casing — a cross-platform gotcha

Snowflake normalizes unquoted identifiers to UPPER_CASE. Databricks defaults to lower_case. When you write fare_amount in a Snowflake query, Snowflake interprets it as FARE_AMOUNT. This matters when: (a) hand-writing cross-platform SQL, (b) reading Delta Lake tables from Snowflake or vice versa, (c) debugging why a SELECT fare_amount returns nulls from an externally-created table. dbt handles this transparently via { adapter.quote() }, but hand-written SQL does not.

Snowpipe and Time Travel

Beyond batch ingestion with COPY INTO, Snowflake provides two additional capabilities that are essential in production:

  • Snowpipe enables continuous, serverless ingestion — files arriving in cloud storage are automatically loaded into tables within minutes, without managing warehouse uptime. It pairs well with event-driven architectures (e.g., Azure Event Grid notifications triggering a pipe).
  • Time Travel allows querying historical data at any point within a configurable retention period (up to 90 days for Enterprise edition). This is invaluable for auditing, debugging pipeline errors, and recovering accidentally dropped tables (UNDROP TABLE).

These features have no direct equivalent in Databricks — Snowpipe replaces the need for external streaming ingestion for near-real-time batch loads, and Time Travel goes beyond Delta Lake’s version history by operating at the platform level across all tables.

Snowpark vs PySpark — Side by Side

Beyond API syntax, the bigger question for consultants is when to choose Snowflake vs Databricks. The next slide compares the two platforms at a strategic level.

Snowpark was deliberately designed to feel like PySpark — the API surface is nearly identical. However, there are important differences to be aware of:

Operation PySpark (Databricks) Snowpark (Snowflake)
Read table spark.table("my_table") session.table("MY_TABLE")
Filter df.filter(col("x") > 0) df.filter(col("X") > 0)
Add column df.withColumn("y", ...) df.with_column("Y", ...)
Group + Agg df.groupBy("x").agg(...) df.group_by("X").agg(...)
Write df.write.saveAsTable("t") df.write.save_as_table("T")
Import from pyspark.sql.functions from snowflake.snowpark.functions

Where the APIs diverge: Snowpark uses snake_case method names (with_column, group_by, save_as_table) while PySpark uses camelCase (withColumn, groupBy, saveAsTable). Column names are case-sensitive in Snowpark (reflecting Snowflake’s UPPER_CASE normalization) but case-insensitive by default in PySpark. Snowpark also lacks a .cache() equivalent — Snowflake manages caching internally at the warehouse level.

3.2 Demo: Trainer Walkthrough

Part 1: SQL Approach

  1. 00_account_setup.sql — Create database, warehouse, schemas
  2. 01_storage_integration.sql + 02_external_stage.sql — Configure ADLS2 access
  3. bronze/01_ingest_trips.sql — COPY INTO from external stage
  4. silver/01_create_cleaned.sql + 02_create_enriched.sql — SQL transforms
  5. gold/01_create_kpis.sql — 12 KPI tables

Part 2: Snowpark Python

The trainer shows the same Silver/Gold transforms using Snowpark: - Compare 02_silver_cleaning.py (Snowpark) with 01_create_cleaned.sql (SQL) - Note the API similarities with Databricks PySpark

3.3 Key Takeaways

  • Snowflake excels at SQL-native workloads with instant, elastic compute
  • External Stages + COPY INTO provide efficient bulk ingestion from cloud storage
  • Snowpark brings Python DataFrame processing to Snowflake’s engine
  • PySpark → Snowpark migration is straightforward due to similar APIs
  • In production, use Tasks + Streams + Stored Procedures (see Module 5)

4. Quiz

Quiz: Module 3 — Snowflake Pipeline Quiz

Scan QR to open quiz

Before moving on, make sure you can answer:

  1. What Snowflake object points to external cloud storage, and what command loads data from it?
  2. Name two differences between Snowpark and PySpark syntax (method naming, case sensitivity).
  3. What happens to a virtual warehouse after its auto-suspend period expires, and why does this matter for cost?

5. Practice

Hands-on lab

Scripts under snowflake/sql/ and snowflake/snowpark/.

Priya / Power BI: Map page — borough filled map and top pickup zones from kpi_borough_analysis, kpi_top_pickup_zones.

Next module

Module 4: dbt Pipeline — Marcus’s board asks: Where does each dashboard number come from?


Official Documentation