Tool Comparison — Deep Dive

Self-study reference · Module 7 supplement

title: “Tool Comparison — Deep Dive” subtitle: “Self-study reference · Module 7 supplement” —

Use this page after class or when preparing Module 7. In-room delivery covers Objectives, Power BI demo notes, and When to Use What only.

Full Tool Comparison

Aspect Databricks Snowflake dbt
Role in pipeline End-to-end: ingestion + transform + storage End-to-end: ingestion + transform + storage Transform only — no ingestion
Primary language PySpark (Python) + SQL SQL + Snowpark (Python) SQL (Jinja-templated)
Ingestion spark.read.parquet() / Auto Loader — direct cloud read, no staging step COPY INTO from external stage; requires CREATE STAGE first; INFER_SCHEMA for Parquet/Avro/ORC/JSON/CSV N/A — reads existing tables via source()
Python API PySpark DataFrame — runs on Spark executors Snowpark DataFrame — same API surface, runs on Snowflake warehouse Python models (table + incremental only); Snowpark or PySpark depending on adapter
Storage format Delta Lake (open source): Parquet + JSON transaction log; ACID, time travel, schema enforcement on write Snowflake proprietary columnar: micro-partitioning, automatic clustering — not accessible as raw files Delegated to backend
Open format ✅ Readable by Spark, DuckDB, Trino, Polars — no vendor lock-in ❌ Proprietary — must COPY INTO stage to export N/A
ACID transactions ✅ Delta Lake serializable isolation ✅ Full ACID at statement level Delegated
Time travel VERSION AS OF / TIMESTAMP AS OF via Delta transaction log AT(OFFSET =>) / AT(TIMESTAMP =>) via automatic retention ❌ Not built in
Compute model Spark clusters (classic: cold start 3–5 min; serverless: sub-second) or SQL warehouses Virtual warehouses XS–6XL; auto-suspend default 10 min; resumes in 2–5 sec; billed per second active No own compute — billed through backend
Cost model DBU per VM-hour; idle cluster still billed; use job clusters or serverless to avoid idle cost Credit per second; auto-suspend eliminates idle cost entirely No compute cost
Data quality / testing Manual filters; Delta Constraints (NOT NULL, CHECK); DLT Expectations (separate product) Manual filters; DML error logging on INSERT/UPDATE/MERGE (ERROR_LOGGING = TRUE) Built-in: dbt test runs not_null, unique, accepted_values, relationships automatically; custom SQL tests
Schema enforcement Delta validates on write; mergeSchema to evolve COPY INTO options: MATCH_BY_COLUMN_NAME, INFER_SCHEMA, ERROR_ON_COLUMN_COUNT_MISMATCH dbt contracts enforce column-level constraints at run time
Materializations dbt-databricks: table, view, incremental, materialized_view dbt-snowflake: table, view, incremental, dynamic_table (not materialized_view — Snowflake-specific) 5 native types: view, table, incremental, ephemeral, materialized_view
Governance Unity Catalog: 3-level namespace, column tags, row filters, audit logs, lineage RBAC: role hierarchy, row access policies, column masking, data sharing Auto-generated data docs; column-level lineage graph; description: in YAML
Production scheduling Databricks Workflows: DAG of tasks (notebook, Python, dbt, SQL); Git source, retries, notifications Snowflake Tasks: DAG with cron or AFTER dependency; serverless or warehouse dbt Cloud (managed) or dbt run in GitHub Actions / Azure DevOps
Column name case lower_case (Spark default) UPPER_CASE (Snowflake normalises unquoted identifiers) Handles transparently via { adapter.quote() }
AI features Databricks Assistant, Genie data analyst, ai_query() SQL function Snowflake Copilot, Cortex Analyst, AI_COMPLETE() (replaces legacy SNOWFLAKE.CORTEX.COMPLETE()) dbt Copilot (dbt Cloud), MCP server
Streaming Structured Streaming (micro-batch or continuous-experimental); Kafka, Auto Loader Snowpipe Streaming (insert-only) + Dynamic Tables (TARGET_LAG) dynamic_table materialization (Snowflake adapter only)
Machine Learning sklearn, XGBoost, MLflow, AutoML; full OSS ecosystem Cortex ML: FORECAST, ANOMALY_DETECTION, CLASSIFICATION, TOP_INSIGHTS (SQL only); Snowpark ML (Python) Feature table prep + quality gates; no model training
Best for Complex Python/Spark transforms, ML, open-format data lake SQL-heavy workloads, elastic scaling, cross-org data sharing Transformation governance, CI/CD pipelines, automated testing, multi-backend portability
TipTrainer tip — the key insight to land

All three tools produced the same 12 KPIs from the same data. The platform does not change the answer — it changes the engineering experience, cost model, operational overhead, and team fit. That is the whole point of the workshop.


Key architectural facts (trainer reference)

Caution1. Column casing — a real integration gotcha

Databricks defaults to lower_case columns. Snowflake normalises unquoted SQL to UPPER_CASE. When attendees ran the same dbt models against both backends, dbt’s { adapter.quote() } handled this silently. In hand-written SQL queries, unquoted cross-platform column references will silently return nulls or fail.

Caution2. Ingestion is architecturally different
  • Databricks: spark.read.parquet(path) — one line, schema auto-inferred from Parquet metadata.
  • Snowflake: requires CREATE STAGE first, then COPY INTO <table> from the stage. Two steps, but INFER_SCHEMA automates column detection from Parquet/Avro/ORC/JSON/CSV; ON_ERROR = CONTINUE logs bad rows to an error table.

The Snowflake two-step gives more control over error handling. Databricks is simpler for Parquet sources.

Caution3. Auto-suspend changes the Snowflake cost model entirely

Snowflake warehouses auto-suspend (default 10 min) and resume in 2–5 seconds. An all-purpose Databricks cluster keeps billing until manually stopped (or auto-terminated at cluster creation).

Daily pipeline: Snowflake charges minutes. Databricks charges hours if the cluster is left running. High-frequency pipeline (every 5 min): Databricks serverless or a job cluster amortises startup cost better; Snowflake’s per-second billing is still fine but startup latency is a non-issue.

Caution4. Only dbt ran automated tests in this workshop

dbt test checked nulls, uniqueness, and accepted values automatically after every model run. In the Databricks and Snowflake paths, attendees had to write and run SELECT COUNT(*) manually.

In production, skipping automated tests is how data quality issues reach dashboards undetected. You can add dbt on top of a Databricks or Snowflake pipeline purely for its testing framework.

Caution5. Delta Lake is open; Snowflake storage is not

Delta Lake: open-source Parquet + JSON transaction log, readable by Spark, DuckDB, Trino, Polars, any Delta reader. Snowflake storage: proprietary micro-partition format — data is only accessible via Snowflake SQL or by unloading (COPY INTO <stage>).

This matters for vendor-lock-in risk, ad-hoc querying from external tools, and disaster recovery planning.


Return to Module 7