flowchart TD
subgraph CloudServices ["Cloud Services Layer"]
AUTH["Authentication & RBAC"]
QP["Query Parsing & Optimization"]
META["Metadata & Lineage<br/><i>Horizon Catalog</i>"]
end
subgraph Compute ["Compute Layer — Virtual Warehouses"]
WH1["X-Small<br/><i>1 credit/hr</i>"]
WH2["Small<br/><i>2 credits/hr</i>"]
WH3["Medium+<br/><i>auto-scale</i>"]
end
subgraph Storage ["Storage Layer"]
DB["Databases & Schemas"]
MP["Micro-partitions<br/><i>compressed columnar</i>"]
STG["External Stages<br/><i>ADLS2 pointers</i>"]
end
CloudServices --> Compute
Compute --> Storage
classDef cloud fill:#29B5E8,color:#fff,stroke:#1e9bc9
classDef compute fill:#0369a1,color:#fff,stroke:#075985
classDef storage fill:#475569,color:#fff,stroke:#334155
class AUTH,QP,META cloud
class WH1,WH2,WH3 compute
class DB,MP,STG storage
Module 3: Snowflake Pipeline
SQL, Snowpark, and external stages
Duration: 75 min — Animation (3) · Think & Discuss (7) · Theory (15) · Quiz (3) · Practice (47)
1. Animation
2. Think & Discuss
Situation: Marcus wants SQL maintainability. Elena asks Bob to rebuild the same 12 KPIs on Snowflake. Priya connects her Map page — identical Gold schema.
Prompts:
- What is Marcus asking for — a different tool, skill set, or both?
- How do you load Parquet from ADLS2 into Snowflake Bronze without Spark?
- If Gold schema stays identical, can Priya keep the same Power BI reports?
3. Theory
Marcus: “We need SQL, not notebooks.” Elena: Rebuild the same medallion on Snowflake — identical Gold KPIs, SQL-first maintainability.
Same architecture. Different implementation philosophy.
Snowflake loads Parquet via external stage + COPY INTO — two steps vs Databricks one-line spark.read.parquet(). More control over error handling; familiar pattern for SQL teams.
3.1 Theory: Snowflake Essentials
Architecture
Snowflake uses a three-layer architecture that separates storage, compute, and cloud services into independent, scalable tiers:
Storage layer: Data is stored in Snowflake’s proprietary micro-partitioned columnar format — compressed, automatically clustered, and not directly accessible as files (unlike Delta Lake’s open Parquet). Data is organized into databases and schemas, which are logical namespaces only (they don’t map to file paths).
Compute layer (Virtual Warehouses): Independent compute clusters that execute queries. Warehouses range from X-Small (1 credit/hour) to 6X-Large (512 credits/hour). They auto-suspend after a configurable idle period (default: 10 minutes) and resume in 2–5 seconds — so you only pay while queries are running. For ETL workloads that finish quickly, set auto-suspend to 60 seconds to minimize idle billing.
Cloud services layer: Authentication, query parsing, metadata management, and access control — all managed by Snowflake. This includes the Snowflake Horizon Catalog (governance: tags, masking policies, lineage), which is the parallel to Databricks Unity Catalog.
Warehouse sizing for ETL
Set auto-suspend to 60 seconds for ETL warehouses — the default 10 minutes wastes credits for batch workloads that finish quickly. Use X-Small or Small for development and most Silver/Gold transforms; only scale up when you see spilling to remote storage or long query times.
External Stages + COPY INTO
Snowflake requires a two-step ingestion pattern: first create a stage pointing to cloud storage, then execute COPY INTO to bulk-load data. This is more verbose than Databricks’ one-line spark.read.parquet(), but gives explicit control over file format options, error handling (ON_ERROR = CONTINUE logs bad rows to an error table), and schema inference (INFER_SCHEMA = TRUE for Parquet/Avro/ORC/JSON/CSV).
-- Create external stage pointing to ADLS2
CREATE STAGE my_stage
URL = 'azure://account.blob.core.windows.net/container/path/'
CREDENTIALS = (AZURE_SAS_TOKEN = '...');
-- Bulk load Parquet into table (INFER_SCHEMA auto-detects columns)
COPY INTO my_table FROM @my_stage
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;flowchart LR
ADLS2[("Azure ADLS2<br/>Parquet files")]
STAGE["@my_stage<br/><i>External Stage</i>"]
TABLE[("my_table<br/><i>micro-partitions</i>")]
ADLS2 -->|"CREATE STAGE<br/>URL + SAS token"| STAGE
STAGE -->|"COPY INTO<br/>FILE_FORMAT = PARQUET"| TABLE
classDef source fill:#0057b8,color:#fff,stroke:#003d82
classDef stage fill:#29B5E8,color:#fff,stroke:#1e9bc9
classDef table fill:#01065c,color:#fff,stroke:#000940
class ADLS2 source
class STAGE stage
class TABLE table
Column casing — a cross-platform gotcha
Snowflake normalizes unquoted identifiers to UPPER_CASE. Databricks defaults to lower_case. When you write fare_amount in a Snowflake query, Snowflake interprets it as FARE_AMOUNT. This matters when: (a) hand-writing cross-platform SQL, (b) reading Delta Lake tables from Snowflake or vice versa, (c) debugging why a SELECT fare_amount returns nulls from an externally-created table. dbt handles this transparently via { adapter.quote() }, but hand-written SQL does not.
Snowpipe and Time Travel
Beyond batch ingestion with COPY INTO, Snowflake provides two additional capabilities that are essential in production:
- Snowpipe enables continuous, serverless ingestion — files arriving in cloud storage are automatically loaded into tables within minutes, without managing warehouse uptime. It pairs well with event-driven architectures (e.g., Azure Event Grid notifications triggering a pipe).
- Time Travel allows querying historical data at any point within a configurable retention period (up to 90 days for Enterprise edition). This is invaluable for auditing, debugging pipeline errors, and recovering accidentally dropped tables (
UNDROP TABLE).
These features have no direct equivalent in Databricks — Snowpipe replaces the need for external streaming ingestion for near-real-time batch loads, and Time Travel goes beyond Delta Lake’s version history by operating at the platform level across all tables.
Snowpark vs PySpark — Side by Side
Beyond API syntax, the bigger question for consultants is when to choose Snowflake vs Databricks. The next slide compares the two platforms at a strategic level.
Snowpark was deliberately designed to feel like PySpark — the API surface is nearly identical. However, there are important differences to be aware of:
| Operation | PySpark (Databricks) | Snowpark (Snowflake) |
|---|---|---|
| Read table | spark.table("my_table") |
session.table("MY_TABLE") |
| Filter | df.filter(col("x") > 0) |
df.filter(col("X") > 0) |
| Add column | df.withColumn("y", ...) |
df.with_column("Y", ...) |
| Group + Agg | df.groupBy("x").agg(...) |
df.group_by("X").agg(...) |
| Write | df.write.saveAsTable("t") |
df.write.save_as_table("T") |
| Import | from pyspark.sql.functions |
from snowflake.snowpark.functions |
Where the APIs diverge: Snowpark uses snake_case method names (with_column, group_by, save_as_table) while PySpark uses camelCase (withColumn, groupBy, saveAsTable). Column names are case-sensitive in Snowpark (reflecting Snowflake’s UPPER_CASE normalization) but case-insensitive by default in PySpark. Snowpark also lacks a .cache() equivalent — Snowflake manages caching internally at the warehouse level.
3.2 Demo: Trainer Walkthrough
Part 1: SQL Approach
- 00_account_setup.sql — Create database, warehouse, schemas
- 01_storage_integration.sql + 02_external_stage.sql — Configure ADLS2 access
- bronze/01_ingest_trips.sql — COPY INTO from external stage
- silver/01_create_cleaned.sql + 02_create_enriched.sql — SQL transforms
- gold/01_create_kpis.sql — 12 KPI tables
Part 2: Snowpark Python
The trainer shows the same Silver/Gold transforms using Snowpark: - Compare 02_silver_cleaning.py (Snowpark) with 01_create_cleaned.sql (SQL) - Note the API similarities with Databricks PySpark
3.3 Key Takeaways
- Snowflake excels at SQL-native workloads with instant, elastic compute
- External Stages + COPY INTO provide efficient bulk ingestion from cloud storage
- Snowpark brings Python DataFrame processing to Snowflake’s engine
- PySpark → Snowpark migration is straightforward due to similar APIs
- In production, use Tasks + Streams + Stored Procedures (see Module 5)
4. Quiz
Quiz: Module 3 — Snowflake Pipeline Quiz
Before moving on, make sure you can answer:
- What Snowflake object points to external cloud storage, and what command loads data from it?
- Name two differences between Snowpark and PySpark syntax (method naming, case sensitivity).
- What happens to a virtual warehouse after its auto-suspend period expires, and why does this matter for cost?
5. Practice
Hands-on lab
Scripts under snowflake/sql/ and snowflake/snowpark/.
Priya / Power BI: Map page — borough filled map and top pickup zones from kpi_borough_analysis, kpi_top_pickup_zones.
Next module
Module 4: dbt Pipeline — Marcus’s board asks: Where does each dashboard number come from?