Tool Comparison — Deep Dive

Self-study reference · Module 7 supplement

Use this page after class or when preparing Module 7. In-room delivery covers Objectives, Power BI demo notes, and When to Use What only.

Full Tool Comparison

Aspect	Databricks	Snowflake	dbt
Role in pipeline	End-to-end: ingestion + transform + storage	End-to-end: ingestion + transform + storage	Transform only — no ingestion
Primary language	PySpark (Python) + SQL	SQL + Snowpark (Python)	SQL (Jinja-templated)
Ingestion	`spark.read.parquet()` / Auto Loader — direct cloud read, no staging step	`COPY INTO` from external stage; requires CREATE STAGE first; `INFER_SCHEMA` for Parquet/Avro/ORC/JSON/CSV	N/A — reads existing tables via `source()`
Python API	PySpark DataFrame — runs on Spark executors	Snowpark DataFrame — same API surface, runs on Snowflake warehouse	Python models (table + incremental only); Snowpark or PySpark depending on adapter
Storage format	Delta Lake (open source): Parquet + JSON transaction log; ACID, time travel, schema enforcement on write	Snowflake proprietary columnar: micro-partitioning, automatic clustering — not accessible as raw files	Delegated to backend
Open format	✅ Readable by Spark, DuckDB, Trino, Polars — no vendor lock-in	❌ Proprietary — must `COPY INTO` stage to export	N/A
ACID transactions	✅ Delta Lake serializable isolation	✅ Full ACID at statement level	Delegated
Time travel	✅ `VERSION AS OF` / `TIMESTAMP AS OF` via Delta transaction log	✅ `AT(OFFSET =>)` / `AT(TIMESTAMP =>)` via automatic retention	❌ Not built in
Compute model	Spark clusters (classic: cold start 3–5 min; serverless: typically a few seconds) or SQL warehouses	Virtual warehouses XS–6XL; auto-suspend default 10 min; resumes in 2–5 sec; billed per second while active (60-second minimum per resume)	No compute product of its own; runs on the backend cluster/warehouse
Cost model	DBU per VM-hour; idle cluster still billed; use job clusters or serverless to avoid idle cost	Credits billed per second; auto-suspend stops compute charges once the idle timeout elapses	No standalone product fee; you pay backend DBUs/credits when models run
Data quality / testing	Manual filters in labs; Delta Constraints (`NOT NULL`, `CHECK`); LSDP `@dp.expect_or_drop` / `@dp.expect` in production (formerly DLT Expectations)	Manual filters in labs; DML error logging on INSERT/UPDATE/MERGE (`ERROR_LOGGING = TRUE`)	Built-in: `dbt test` runs `not_null`, `unique`, `accepted_values`, `relationships` automatically; custom SQL tests
Schema enforcement	Delta validates on write; `mergeSchema` to evolve	`COPY INTO` options: `MATCH_BY_COLUMN_NAME`, `INFER_SCHEMA`, `ERROR_ON_COLUMN_COUNT_MISMATCH`	`dbt test` + YAML schema tests in this workshop; optional model contracts (`contract: {enforced: true}`) in production (not used in base lab models)
Materializations	dbt-databricks: table, view, incremental, `materialized_view`	dbt-snowflake: table, view, incremental, `dynamic_table` (not `materialized_view` — Snowflake-specific)	Core types: view, table, incremental, ephemeral, `materialized_view`; base workshop uses table/view only
Governance	Unity Catalog: 3-level namespace, column tags, row filters, audit logs, lineage	RBAC: role hierarchy, row access policies, column masking, data sharing	Auto-generated data docs; column-level lineage graph; `description:` in YAML
Production scheduling	Lakeflow Jobs (formerly Workflows): DAG of tasks (notebook, LSDP pipeline, Python, dbt, SQL); Git source, retries, notifications	Snowflake Tasks: DAG with cron or `AFTER` dependency; serverless or warehouse	`dbt build` in GitHub Actions / cron (Core CLI); dbt Cloud optional (managed)
Production deployment	DABs — `databricks bundle deploy`; Git as source of truth for jobs/pipelines	Snowflake CLI + Git (`snow git fetch` / `execute`); stored procedures encapsulate logic	GitHub Actions CI; manifest download + `--defer` (Slim CI pattern in production lab)
Column name case	`lower_case` (Spark default)	`UPPER_CASE` (Snowflake normalises unquoted identifiers)	Handles transparently via `{{ adapter.quote() }}`
AI features (Module 6)	Databricks Assistant, Genie, `ai_query()` SQL function	Snowflake CoCo (formerly Cortex Code / Copilot in Workspaces), Cortex Analyst, `AI_COMPLETE()` (replaces legacy `SNOWFLAKE.CORTEX.COMPLETE()`)	dbt MCP server (local Core); dbt Copilot (dbt Cloud, optional)
Streaming	Structured Streaming; Kafka, Auto Loader. Workshop M8: Aiven Kafka → streaming Delta (separate user-activity dataset, not NYC Taxi batch bronze)	Dynamic Tables (`TARGET_LAG`); Snowpipe Streaming or file Snowpipe. Workshop M8: relay consumer → ADLS2 → file Snowpipe → Dynamic Tables	`dynamic_table` materialization (Snowflake adapter). Workshop M8: models in `{attendee}_DBT_STREAMING`
Machine Learning (Module 9)	sklearn, XGBoost, MLflow, AutoML; full OSS ecosystem	Cortex ML: `ML.FORECAST`, `ML.ANOMALY_DETECTION`, etc. (SQL); Snowpark ML (Python stretch)	Optional feature table + tests in `_DBT_GOLD`; no model training in dbt
Best for	Complex Python/Spark transforms, ML, open-format data lake	SQL-heavy workloads, elastic scaling, cross-org data sharing	Transformation governance, CI/CD pipelines, automated testing, multi-backend portability

Key takeaway

All three tools produced the same 12 KPIs from the same data. The platform does not change the answer — it changes the engineering experience, cost model, operational overhead, and team fit. Power BI (and other BI tools) consume Gold the same way regardless of which engine built the tables.

Key architectural facts

1. Column casing — a real integration gotcha

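Databricks defaults to lower_case columns. Snowflake normalises unquoted identifiers to UPPER_CASE. When attendees ran the same dbt models against both backends, dbt’s {{ adapter.quote() }} handled this without any model changes. Hand-written SQL and downstream code get no such help: a reference that hard-codes one platform’s casing (a quoted lower-case identifier on Snowflake, or a case-sensitive column lookup in a BI tool or DataFrame) can fail outright or come back as nulls.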
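
The casing rule can be sketched in a few lines of plain Python. This is an illustrative model only, not Snowflake or dbt code: `snowflake_resolve` and `normalise_keys` are hypothetical helpers, and the column names are made up for the example. It assumes the documented Snowflake behaviour that unquoted identifiers fold to upper case while quoted identifiers keep their case.

```python
def snowflake_resolve(identifier: str) -> str:
    """Return the name Snowflake would look up for an identifier as typed in SQL."""
    if len(identifier) >= 2 and identifier.startswith('"') and identifier.endswith('"'):
        return identifier[1:-1]  # quoted: case is preserved exactly
    return identifier.upper()    # unquoted: folded to upper case


def normalise_keys(row: dict) -> dict:
    """Lower-case every column name so downstream code sees one convention."""
    return {key.lower(): value for key, value in row.items()}


# A table created with unquoted column names stores them upper-cased.
stored_columns = {snowflake_resolve("trip_distance"), snowflake_resolve("fare_amount")}

# Unquoted references match however they were typed.
unquoted_matches = snowflake_resolve("Trip_Distance") in stored_columns

# A quoted lower-case reference keeps its case and no longer matches.
quoted_matches = snowflake_resolve('"trip_distance"') in stored_columns

# Outside SQL, a case-sensitive lookup has the same problem.
snowflake_row = {"TRIP_DISTANCE": 2.4, "FARE_AMOUNT": 11.5}
missing = snowflake_row.get("trip_distance")             # None: key case differs
found = normalise_keys(snowflake_row)["trip_distance"]   # 2.4 after normalising
```

Normalising column names once, at the boundary where data leaves the warehouse, is usually simpler than remembering the casing rule at every call site.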

2. Ingestion is architecturally different

Databricks: spark.read.parquet(path) — one line, schema auto-inferred from Parquet metadata.
Snowflake: requires CREATE STAGE first, then COPY INTO <table> from the stage. Two steps, but INFER_SCHEMA automates column detection from Parquet/Avro/ORC/JSON/CSV; ON_ERROR = CONTINUE skips bad rows and keeps loading, and the rejected rows can be reviewed afterwards with the VALIDATE table function.

The Snowflake two-step gives more control over error handling. Databricks is simpler for Parquet sources.
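
To make the two-step shape concrete, the sketch below renders the Snowflake statements as strings, with the Databricks one-liner shown as a comment for contrast. The helper `snowflake_load_statements`, the table and stage names, and the URL placeholder are all hypothetical; credentials or a storage integration are deliberately left out, and the target table is assumed to exist already.

```python
from typing import List

# Databricks equivalent, for comparison (needs a running Spark session):
#   df = spark.read.parquet(path)


def snowflake_load_statements(table: str, stage: str, url: str) -> List[str]:
    """Render the CREATE STAGE + COPY INTO pair for a Parquet source."""
    create_stage = (
        f"CREATE STAGE IF NOT EXISTS {stage} "
        f"URL = '{url}' "
        "FILE_FORMAT = (TYPE = PARQUET)"
    )
    copy_into = (
        f"COPY INTO {table} FROM @{stage} "
        "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE "  # map Parquet fields to columns by name
        "ON_ERROR = CONTINUE"                       # skip bad rows instead of aborting
    )
    return [create_stage, copy_into]


statements = snowflake_load_statements(
    table="bronze_trips",                 # hypothetical table name
    stage="trips_stage",                  # hypothetical stage name
    url="azure://<account>/<container>/trips/",  # placeholder, not a real location
)
for statement in statements:
    print(statement + ";")
```

The extra statement is where the added control comes from: file format, column matching, and error handling are all explicit options rather than reader defaults.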

3. Auto-suspend changes the Snowflake cost model entirely

Snowflake warehouses auto-suspend (default 10 min) and resume in 2–5 seconds. An all-purpose Databricks cluster keeps billing until it is stopped manually or reaches the auto-termination timeout set when the cluster was created.

Daily pipeline: Snowflake charges for minutes; Databricks charges for hours if the cluster is left running. High-frequency pipeline (every 5 min): Databricks serverless or a job cluster amortises startup cost better; Snowflake’s per-second billing is still fine, and its startup latency is a non-issue.
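
A rough back-of-the-envelope model shows why the schedule matters as much as the platform. The function below is a simplification under stated assumptions, not a billing calculator: `billed_seconds` is a hypothetical helper, run lengths are invented, warehouse size and credit or DBU rates are ignored, and resume latency is treated as zero. It assumes compute is billed from the first run until the idle timeout expires, with a 60-second minimum per session.

```python
def billed_seconds(run_starts, run_seconds, auto_suspend_seconds, minimum_seconds=60):
    """Seconds billed for compute that suspends after a fixed idle timeout."""
    total = 0
    session_start = None
    session_end = None  # moment the compute would suspend if nothing else arrives
    for start in sorted(run_starts):
        busy_until = start + run_seconds
        if session_end is not None and start <= session_end:
            # Compute is still up: the run extends the current session.
            session_end = max(session_end, busy_until + auto_suspend_seconds)
        else:
            # Compute had suspended: close the previous session, open a new one.
            if session_start is not None:
                total += max(session_end - session_start, minimum_seconds)
            session_start = start
            session_end = busy_until + auto_suspend_seconds
    if session_start is not None:
        total += max(session_end - session_start, minimum_seconds)
    return total


DAY = 24 * 60 * 60

# One 5-minute run per day with a 10-minute idle timeout: 15 minutes billed.
daily = billed_seconds([0], run_seconds=300, auto_suspend_seconds=600)

# A 1-minute run every 5 minutes with the same timeout never suspends.
every_five = billed_seconds(range(0, DAY, 300), run_seconds=60, auto_suspend_seconds=600)

# Same schedule with a 60-second timeout suspends between runs.
every_five_short = billed_seconds(range(0, DAY, 300), run_seconds=60, auto_suspend_seconds=60)

# A cluster left running all day, for comparison.
always_on = DAY

print(daily, every_five, every_five_short, always_on)
```

Under these assumptions, a high-frequency schedule with the default idle timeout bills roughly like an always-on cluster, so the timeout deserves tuning alongside the schedule.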

4. Only dbt ran automated tests in this workshop (batch path)

dbt test checked nulls, uniqueness, and accepted values automatically after every model run. In the Databricks and Snowflake batch paths, attendees had to write and run SELECT COUNT(*) manually.

In production, skipping automated tests is how data quality issues reach dashboards undetected. You can add dbt on top of a Databricks or Snowflake pipeline purely for its testing framework.
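
For readers who have not looked inside a dbt test, the sketch below mimics the three generic checks named above in plain Python. It is an analogy, not dbt code: dbt compiles each test to a SQL query that returns failing rows, and these hypothetical functions do the same over a list of dicts. The sample rows and column names are invented.

```python
def not_null(rows, column):
    """Rows where the column is missing or null."""
    return [row for row in rows if row.get(column) is None]


def unique(rows, column):
    """Non-null values that appear more than once in the column."""
    counts = {}
    for row in rows:
        value = row.get(column)
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    return [value for value, count in counts.items() if count > 1]


def accepted_values(rows, column, allowed):
    """Rows whose non-null value falls outside the allowed set."""
    return [
        row for row in rows
        if row.get(column) is not None and row[column] not in allowed
    ]


rows = [
    {"trip_id": 1, "payment_type": "card"},
    {"trip_id": 2, "payment_type": "cash"},
    {"trip_id": 2, "payment_type": "barter"},
    {"trip_id": None, "payment_type": "card"},
]

failures = {
    "not_null_trip_id": not_null(rows, "trip_id"),
    "unique_trip_id": unique(rows, "trip_id"),
    "accepted_values_payment_type": accepted_values(rows, "payment_type", {"card", "cash"}),
}

# A test passes only when it returns no failing rows.
failed_tests = [name for name, result in failures.items() if result]
print(failed_tests)
```

The point of the framework is that these checks run on every build without anyone remembering to write the query.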

5. Delta Lake is open; Snowflake storage is not

Delta Lake: open-source Parquet + JSON transaction log, readable by Spark, DuckDB, Trino, Polars, any Delta reader. Snowflake storage: proprietary micro-partition format — data is only accessible via Snowflake SQL or by unloading (COPY INTO <stage>).

This matters for vendor-lock-in risk, ad-hoc querying from external tools, and disaster recovery planning.
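
What "open" means in practice is that the table state can be rebuilt by anything that can read JSON and Parquet. The sketch below replays a heavily trimmed transaction log in plain Python. It is a simplified illustration: `live_files` is a hypothetical helper, real commits carry more fields per action (sizes, timestamps, partition values), and checkpoints and other protocol features are ignored. Production code should use an actual Delta reader.

```python
import json
from typing import List, Optional, Set


def live_files(commits: List[str], as_of_version: Optional[int] = None) -> Set[str]:
    """Replay add/remove actions to find the data files in a table version."""
    files: Set[str] = set()
    for version, commit in enumerate(commits):
        if as_of_version is not None and version > as_of_version:
            break  # time travel: stop replaying at the requested version
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Version 0 writes two files; version 1 rewrites one of them.
commit_0 = (
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}'
)
commit_1 = (
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}'
)

current = live_files([commit_0, commit_1])
version_0 = live_files([commit_0, commit_1], as_of_version=0)
print(sorted(current), sorted(version_0))
```

The same replay, stopped early, is the idea behind VERSION AS OF: older versions stay readable as long as the files they reference have not been cleaned up.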

6. Production adds declarative quality on Databricks

Batch labs use manual WHERE filters, which drop bad rows without leaving a record. Production Databricks adds LSDP @dp.expect_or_drop: violating rows are still dropped, but the count is tracked in the pipeline event log. Same business rules; different enforcement model (Module 5).
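
The difference between a bare filter and a declared expectation can be sketched without any Databricks dependency. The code below is not the Lakeflow API: `apply_expectation` is a hypothetical stand-in that imitates the warn-only and drop behaviours by returning the surviving rows together with a small metrics record, which plays the role of the event-log entry. The rule and sample rows are invented.

```python
def apply_expectation(rows, name, predicate, drop):
    """Check each row against a rule; optionally drop violations, always count them."""
    kept = []
    failed = 0
    for row in rows:
        if predicate(row):
            kept.append(row)
        else:
            failed += 1
            if not drop:
                kept.append(row)  # warn-only mode keeps the violating row
    metrics = {"expectation": name, "failed_records": failed, "dropped": drop}
    return kept, metrics


trips = [
    {"trip_id": 1, "fare_amount": 12.5},
    {"trip_id": 2, "fare_amount": -3.0},
    {"trip_id": 3, "fare_amount": 8.0},
]


def positive_fare(row):
    return row["fare_amount"] > 0


# Manual WHERE-style filter: the bad row disappears and nothing is recorded.
filtered = [row for row in trips if positive_fare(row)]

# Drop mode: same surviving rows, plus a count of what was removed.
dropped_rows, drop_metrics = apply_expectation(trips, "positive_fare", positive_fare, drop=True)

# Warn-only mode: every row survives, but the violation is still counted.
warned_rows, warn_metrics = apply_expectation(trips, "positive_fare", positive_fare, drop=False)

print(drop_metrics, warn_metrics)
```

The surviving data is identical in the first two cases; what changes is that someone can later ask how many rows were rejected and why.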

When to Use What

Map recommendations to Marcus’s Three Constraints: Cost, Performance, Compliance (Module 7 discussion).

Choose Databricks when:

You need complex Python/Spark transformations
ML/AI workloads are part of the pipeline (sklearn, MLflow, full algorithm control)
You want unified analytics + ML on one platform
Streaming workloads require sub-second latency (Structured Streaming)
Team is comfortable with PySpark

Choose Snowflake when:

SQL is your team’s primary language
You need instant, elastic compute scaling
Data sharing across organizations is important
You want both SQL and Python (Snowpark) options
No-code ML (Cortex ML.FORECAST, ML.ANOMALY_DETECTION) covers your use case
Near-real-time analytics (1-min lag) are sufficient — Dynamic Tables are simpler than Structured Streaming

Choose dbt when:

You want transformation-layer standardization
Testing and documentation are priorities
You need to run the same logic across multiple backends
Your team values version-controlled, testable SQL
You want an optional governed feature table (ml_features_tip_prediction in _DBT_GOLD) with automated column tests — not a shared runtime dependency for Databricks sklearn or Cortex ML

Combine tools when:

Databricks + dbt: Databricks for ingestion/ML, dbt for transformation governance
Snowflake + dbt: Snowflake for compute/storage, dbt for testing/docs/CI
All three: Different teams, different strengths — dbt as the common contract layer

Return to Module 7

Module 7 — Comparison & Wrap-up — open discussion and tool recommendation for Marcus