Tool Comparison — Deep Dive
Self-study reference · Module 7 supplement
title: “Tool Comparison — Deep Dive” subtitle: “Self-study reference · Module 7 supplement” —
Use this page after class or when preparing Module 7. In-room delivery covers Objectives, Power BI demo notes, and When to Use What only.
Full Tool Comparison
| Aspect | |||
|---|---|---|---|
| Role in pipeline | End-to-end: ingestion + transform + storage | End-to-end: ingestion + transform + storage | Transform only — no ingestion |
| Primary language | PySpark (Python) + SQL | SQL + Snowpark (Python) | SQL (Jinja-templated) |
| Ingestion | spark.read.parquet() / Auto Loader — direct cloud read, no staging step |
COPY INTO from external stage; requires CREATE STAGE first; INFER_SCHEMA for Parquet/Avro/ORC/JSON/CSV |
N/A — reads existing tables via source() |
| Python API | PySpark DataFrame — runs on Spark executors | Snowpark DataFrame — same API surface, runs on Snowflake warehouse | Python models (table + incremental only); Snowpark or PySpark depending on adapter |
| Storage format | Delta Lake (open source): Parquet + JSON transaction log; ACID, time travel, schema enforcement on write | Snowflake proprietary columnar: micro-partitioning, automatic clustering — not accessible as raw files | Delegated to backend |
| Open format | ✅ Readable by Spark, DuckDB, Trino, Polars — no vendor lock-in | ❌ Proprietary — must COPY INTO stage to export |
N/A |
| ACID transactions | ✅ Delta Lake serializable isolation | ✅ Full ACID at statement level | Delegated |
| Time travel | ✅ VERSION AS OF / TIMESTAMP AS OF via Delta transaction log |
✅ AT(OFFSET =>) / AT(TIMESTAMP =>) via automatic retention |
❌ Not built in |
| Compute model | Spark clusters (classic: cold start 3–5 min; serverless: sub-second) or SQL warehouses | Virtual warehouses XS–6XL; auto-suspend default 10 min; resumes in 2–5 sec; billed per second active | No own compute — billed through backend |
| Cost model | DBU per VM-hour; idle cluster still billed; use job clusters or serverless to avoid idle cost | Credit per second; auto-suspend eliminates idle cost entirely | No compute cost |
| Data quality / testing | Manual filters; Delta Constraints (NOT NULL, CHECK); DLT Expectations (separate product) |
Manual filters; DML error logging on INSERT/UPDATE/MERGE (ERROR_LOGGING = TRUE) |
Built-in: dbt test runs not_null, unique, accepted_values, relationships automatically; custom SQL tests |
| Schema enforcement | Delta validates on write; mergeSchema to evolve |
COPY INTO options: MATCH_BY_COLUMN_NAME, INFER_SCHEMA, ERROR_ON_COLUMN_COUNT_MISMATCH |
dbt contracts enforce column-level constraints at run time |
| Materializations | dbt-databricks: table, view, incremental, materialized_view |
dbt-snowflake: table, view, incremental, dynamic_table (not materialized_view — Snowflake-specific) |
5 native types: view, table, incremental, ephemeral, materialized_view |
| Governance | Unity Catalog: 3-level namespace, column tags, row filters, audit logs, lineage | RBAC: role hierarchy, row access policies, column masking, data sharing | Auto-generated data docs; column-level lineage graph; description: in YAML |
| Production scheduling | Databricks Workflows: DAG of tasks (notebook, Python, dbt, SQL); Git source, retries, notifications | Snowflake Tasks: DAG with cron or AFTER dependency; serverless or warehouse |
dbt Cloud (managed) or dbt run in GitHub Actions / Azure DevOps |
| Column name case | lower_case (Spark default) |
UPPER_CASE (Snowflake normalises unquoted identifiers) |
Handles transparently via { adapter.quote() } |
| AI features | Databricks Assistant, Genie data analyst, ai_query() SQL function |
Snowflake Copilot, Cortex Analyst, AI_COMPLETE() (replaces legacy SNOWFLAKE.CORTEX.COMPLETE()) |
dbt Copilot (dbt Cloud), MCP server |
| Streaming | Structured Streaming (micro-batch or continuous-experimental); Kafka, Auto Loader | Snowpipe Streaming (insert-only) + Dynamic Tables (TARGET_LAG) |
dynamic_table materialization (Snowflake adapter only) |
| Machine Learning | sklearn, XGBoost, MLflow, AutoML; full OSS ecosystem | Cortex ML: FORECAST, ANOMALY_DETECTION, CLASSIFICATION, TOP_INSIGHTS (SQL only); Snowpark ML (Python) | Feature table prep + quality gates; no model training |
| Best for | Complex Python/Spark transforms, ML, open-format data lake | SQL-heavy workloads, elastic scaling, cross-org data sharing | Transformation governance, CI/CD pipelines, automated testing, multi-backend portability |
All three tools produced the same 12 KPIs from the same data. The platform does not change the answer — it changes the engineering experience, cost model, operational overhead, and team fit. That is the whole point of the workshop.
Key architectural facts (trainer reference)
Databricks defaults to lower_case columns. Snowflake normalises unquoted SQL to UPPER_CASE. When attendees ran the same dbt models against both backends, dbt’s { adapter.quote() } handled this silently. In hand-written SQL queries, unquoted cross-platform column references will silently return nulls or fail.
- Databricks:
spark.read.parquet(path)— one line, schema auto-inferred from Parquet metadata. - Snowflake: requires
CREATE STAGEfirst, thenCOPY INTO <table>from the stage. Two steps, butINFER_SCHEMAautomates column detection from Parquet/Avro/ORC/JSON/CSV;ON_ERROR = CONTINUElogs bad rows to an error table.
The Snowflake two-step gives more control over error handling. Databricks is simpler for Parquet sources.
Snowflake warehouses auto-suspend (default 10 min) and resume in 2–5 seconds. An all-purpose Databricks cluster keeps billing until manually stopped (or auto-terminated at cluster creation).
Daily pipeline: Snowflake charges minutes. Databricks charges hours if the cluster is left running. High-frequency pipeline (every 5 min): Databricks serverless or a job cluster amortises startup cost better; Snowflake’s per-second billing is still fine but startup latency is a non-issue.
dbt test checked nulls, uniqueness, and accepted values automatically after every model run. In the Databricks and Snowflake paths, attendees had to write and run SELECT COUNT(*) manually.
In production, skipping automated tests is how data quality issues reach dashboards undetected. You can add dbt on top of a Databricks or Snowflake pipeline purely for its testing framework.
Delta Lake: open-source Parquet + JSON transaction log, readable by Spark, DuckDB, Trino, Polars, any Delta reader. Snowflake storage: proprietary micro-partition format — data is only accessible via Snowflake SQL or by unloading (COPY INTO <stage>).
This matters for vendor-lock-in risk, ad-hoc querying from external tools, and disaster recovery planning.
Return to Module 7
- Module 7 — Comparison & Wrap-up — open discussion and tool recommendation for Marcus