flowchart LR
SETUP["setup<br/><i>Create schemas</i>"] --> BRONZE["bronze_ingestion<br/><i>Read Parquet</i>"]
BRONZE --> SILVER["silver_cleaning<br/><i>Filter & enrich</i>"]
SILVER --> GOLD["gold_kpis<br/><i>12 aggregations</i>"]
GOLD --> VALIDATE["validate<br/><i>SQL checks</i>"]
classDef task fill:#0369a1,color:#fff,stroke:#075985
classDef final fill:#107c10,color:#fff,stroke:#0a5c0a
class SETUP,BRONZE,SILVER,GOLD task
class VALIDATE final
Module 5: Production Patterns
Scheduling, CI/CD, and go-live checklist
Duration: 45 min — Animation (3) · Think & Discuss (7) · Theory (15) · Quiz (3) · Practice (17)
1. Animation
2. Think & Discuss
Situation: Pipelines work in notebooks and manual dbt runs. YellowLine NYC needs scheduled, monitored, reliable jobs.
Prompts:
- What is different about production vs this morning’s lab?
- How would you schedule nightly Bronze → Silver → Gold on each platform?
- If Silver fails at 2 a.m., who should know?
3. Theory
| Legacy name in repo | Say in class |
|---|---|
| Delta Live Tables (DLT) | Lakeflow Spark Declarative Pipelines (LSDP) |
| Databricks Workflows / Jobs | Lakeflow Jobs |
@dlt.table |
@dp.table / LSDP decorators (old API still works) |
Lab file dlt_pipeline.py keeps the legacy import — no migration required for training.
3.1 The Big Question
You just learned to build pipelines. But you wouldn’t run notebooks in production.
This module bridges the training-to-production gap — showing how each tool deploys pipelines in the real world.
3.2 Training vs Production
| Aspect | Training (Today) | Production (Real World) |
|---|---|---|
| Interactive notebooks | LSDP + Lakeflow Jobs + Asset Bundles | |
| SQL Worksheets / Snowpark scripts | Tasks + Streams + Stored Procs | |
Manual dbt run in terminal |
dbt Cloud jobs / GitHub Actions CI | |
| Orchestration | Manual, sequential | Scheduled, event-driven, automated |
| Error handling | Interactive debugging | Retries, alerts, logging |
| Deployment | Copy/paste or import | Git + CI/CD + infrastructure as code |
| Testing | Ad-hoc SELECT COUNT(*) |
Automated tests on every change |
3.3
Databricks Production Patterns
Lakeflow Spark Declarative Pipelines (LSDP, formerly DLT)
The code below uses the legacy @dlt.table / @dlt.expect_or_drop decorators for training compatibility. In production, Lakeflow Spark Declarative Pipelines use @dp.table — same API, new name. No migration is required for this workshop.
LSDP replaces notebooks with declarative pipeline definitions:
@dlt.table(name="silver_nyc_taxi_cleaned")
@dlt.expect_or_drop("valid_distance", "trip_distance > 0")
def silver_nyc_taxi_cleaned():
return dlt.read("bronze_nyc_taxi_trips").filter(...)Key benefits:
- You declare what each table should contain — LSDP figures out how to run it
@dp.expectenforces data quality with automatic violation metrics- Auto-scaling, auto-retry, built-in lineage in Unity Catalog
- No explicit
spark.read()/.write.saveAsTable()— LSDP manages I/O
Lakeflow Jobs (formerly Databricks Workflows)
Schedule notebook pipelines as multi-task jobs with dependency DAGs:
- Task chaining: setup → bronze → silver → gold
- Retry policies per task
- Email/Slack alerting on failure
- Cluster policies for cost control
Databricks Asset Bundles (DABs)
Infrastructure-as-code for Databricks:
databricks bundle deploy -t prod # Deploy pipeline + job from GitFile: databricks/production/asset_bundle/databricks.yml — packages everything into a deployable bundle.
Job clusters vs all-purpose clusters
In production, always use job clusters (created fresh per run and terminated on completion) instead of all-purpose clusters left running. Job clusters cost ~5x less per DBU and eliminate the #1 cause of unexpected cloud bills in Databricks: forgotten interactive clusters. Combine with cluster policies to cap maximum worker count and prevent runaway costs.
3.4
Snowflake Production Patterns
Streams + Tasks
Streams detect changes (CDC). Tasks run on schedule or when triggered:
CREATE STREAM bronze_stream ON TABLE bronze_nyc_taxi_trips;
CREATE TASK refresh_silver
WAREHOUSE = DE_WORKSHOP_WH
SCHEDULE = 'USING CRON 0 6 * * * Europe/Berlin'
WHEN SYSTEM$STREAM_HAS_DATA('bronze_stream')
AS
-- Silver transformation SQLflowchart TD
BRONZE_TABLE[("Bronze Table")] --> STREAM["Stream<br/><i>detects new rows (CDC)</i>"]
STREAM -->|"SYSTEM$STREAM_HAS_DATA()"| TASK_SILVER["Task: refresh_silver<br/><i>SCHEDULE = '0 6 * * *'</i>"]
TASK_SILVER --> SILVER_TABLE[("Silver Table")]
SILVER_TABLE --> TASK_GOLD["Task: refresh_gold"]
TASK_GOLD --> GOLD_TABLE[("Gold Tables")]
classDef table fill:#01065c,color:#fff,stroke:#000940
classDef stream fill:#29B5E8,color:#fff,stroke:#1e9bc9
classDef task fill:#0369a1,color:#fff,stroke:#075985
class BRONZE_TABLE,SILVER_TABLE,GOLD_TABLE table
class STREAM stream
class TASK_SILVER,TASK_GOLD task
Stored Procedures
Wrap pipeline logic in callable units:
CREATE PROCEDURE run_silver_pipeline(attendee VARCHAR)
RETURNS VARCHAR
AS $$
BEGIN
-- Silver transformation logic
RETURN 'Silver pipeline completed';
END $$;
CALL run_silver_pipeline('01_alice');Snowpark Stored Procedures
Deploy Python code as native Snowflake procedures:
session.sproc.register(
func=run_silver_pipeline,
name="RUN_SILVER_PIPELINE_PY",
stage_location="@my_stage"
)3.5
dbt Production Patterns
GitHub Actions CI
Automated testing on every pull request:
- name: dbt build (slim CI)
run: dbt build --select state:modified+ --defer --state ./targetSlim CI only tests changed models + downstream — saves time and cost.
dbt now supports state-aware runs that detect which models have changed since the last run and only rebuild those — saving significant compute time on large projects. Combined with --select state:modified+, dbt compares the current project state against a previous manifest and skips unchanged models entirely. This is especially valuable for Gold-heavy projects where upstream Silver models rarely change.
dbt Cloud
Managed scheduling + monitoring:
- Scheduled jobs: daily
dbt buildwith alerting - CI jobs: automatic PR checks
- Hosted documentation
- Copilot: AI-assisted SQL, docs, tests (Starter+ plans)
“Works on my machine” is not production-ready
The most common production incident pattern: a pipeline runs fine in a development notebook or local dbt environment, then fails in production due to (a) different schema permissions, (b) hard-coded paths or attendee IDs, or (c) missing dependencies that were loaded interactively. Always test with dbt build in a clean environment that mirrors production — and never hard-code attendee IDs, warehouse names, or file paths that change between environments.
Blue-green deployment for data pipelines
For mission-critical pipelines, consider a blue-green deployment pattern: maintain two parallel environments (blue = current production, green = new version). Deploy changes to green, validate with dbt build and automated tests, then switch traffic by updating the production schedule to point at green. This enables zero-downtime releases and instant rollback — if green produces incorrect results, simply re-enable blue’s schedule. Both Databricks (via Asset Bundles with --target prod-blue / --target prod-green) and dbt Cloud (via separate environments) support this pattern natively.
3.6 Key Takeaways
- Production ≠ notebooks — scheduled, tested, monitored pipelines replace interactive development
- Databricks uses LSDP (declarative) + Lakeflow Jobs (orchestration) + Asset Bundles (IaC deployment)
- Snowflake uses Streams (CDC) + Tasks (scheduling) + Stored Procedures (encapsulated logic)
- dbt uses
dbt build(run + test) + GitHub Actions CI (slim CI) or dbt Cloud (managed scheduling) - Cost control is a first-class production concern: job clusters, auto-suspend, and cluster policies
- Testing in CI catches bad models before they reach Gold tables and dashboards
4. Quiz
Quiz: Module 5 — Production Patterns Quiz
Before moving on, make sure you can answer:
- What is the difference between a Databricks job cluster and an all-purpose cluster, and why does it matter for production cost?
- How does Snowflake’s
SYSTEM$STREAM_HAS_DATA()function enable event-driven task execution? - What does “slim CI” mean in dbt, and how does it save time and compute cost?
5. Practice
Hands-on lab
Priya / Power BI: Production SLAs define how fresh Priya’s dashboard can be — batch refresh vs DirectQuery comes in Module 8.
Next module
Module 6: AI Features — Marcus asks whether AI can help analysts explore data faster.