Exercise: Production Patterns

YellowLine NYC story · full hands-on lab


Estimated time: 35–40 min (Databricks: 10 min · Snowflake: 20 min · dbt CI: 5–10 min)

YellowLine NYC context (Module 5)

What runs every night when the team is not in the room?

Working Environment

Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on master. All production files (databricks/production/, snowflake/sql/production/, dbt_project/production/) are at /workspace/.

GitHub blocked? (emergency only)

The normal path is fork + Codespace (Prerequisites § Step 2). Use Lab source files only if your facilitator approved it — e.g. you cannot create or use GitHub before class.

Prerequisites

Completed at least one pipeline module (Databricks, Snowflake, or dbt)
Production files available in databricks/production/, snowflake/sql/production/, dbt_project/production/

Databricks Production Exercise

Review Lakeflow Declarative Pipeline (formerly DLT)

In your Codespace Explorer (left sidebar), open databricks/production/dlt_pipeline.py
Compare the @dp.expect_or_drop() decorators with the manual WHERE filters in 02_silver_cleaning.py. (The workshop file uses from pyspark import pipelines as dp with @dp.materialized_view and spark.read.table(); the legacy import dlt / @dlt decorators still work.)
Answer:
- How many data quality expectations are defined? 10 (9× @dp.expect_or_drop + 1× @dp.expect on Silver)
- What happens to rows that fail an expectation?
- How does LSDP know the execution order (Bronze → Silver → Gold)? spark.read.table() references between views
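
The ordering question above can be illustrated with a small sketch. It is a plain-Python toy, not the pipeline engine: it assumes each table declares the tables it reads (as spark.read.table() references would) and derives one valid run order with the standard library's graphlib. The bronze_trips and silver_trips names are hypothetical; the two KPI names come from the demo pipeline.

```python
from graphlib import TopologicalSorter

# Each key reads from the tables in its set. The names are illustrative and
# stand in for what spark.read.table("...") calls would declare.
reads_from = {
    "bronze_trips": set(),
    "silver_trips": {"bronze_trips"},
    "kpi_trips_by_hour": {"silver_trips"},
    "kpi_revenue_by_hour": {"silver_trips"},
}


def execution_order(dependencies):
    """Return one valid run order: every table comes after the tables it reads."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(reads_from)
```

The point of the sketch is that no explicit "run Bronze first" instruction exists anywhere: the order falls out of the read references alone.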

Note

Why @expect_or_drop instead of manual WHERE? In the base exercise you wrote WHERE passenger_count > 0 in every query. DLT turns these into declarative expectations — you state the rule once, and the pipeline engine enforces it at every run. Failed rows are dropped and the drop count is tracked in the DLT event log — unlike a silent WHERE clause, violations are always auditable.
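
To make the difference from a silent WHERE clause concrete, here is a minimal plain-Python simulation of the drop-and-count behaviour described above. It does not use the pyspark.pipelines API; the helper apply_expect_or_drop, the rule names, and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Rule = Callable[[Row], bool]


def apply_expect_or_drop(
    rows: List[Row], expectations: Dict[str, Rule]
) -> Tuple[List[Row], Dict[str, int]]:
    """Keep rows that pass every rule and count the failures per rule."""
    dropped = {name: 0 for name in expectations}
    kept = []
    for row in rows:
        failed = [name for name, rule in expectations.items() if not rule(row)]
        for name in failed:
            dropped[name] += 1
        if not failed:
            kept.append(row)
    return kept, dropped


# Rules are stated once, by name (both names are illustrative).
expectations = {
    "positive_passenger_count": lambda r: r["passenger_count"] > 0,
    "non_negative_total": lambda r: r["total_amount"] >= 0,
}

rows = [
    {"passenger_count": 2, "total_amount": 18.5},
    {"passenger_count": 0, "total_amount": 7.0},
    {"passenger_count": 1, "total_amount": -3.0},
]

kept, dropped = apply_expect_or_drop(rows, expectations)
```

A WHERE filter would return the same kept rows, but the per-rule dropped counts are what make violations auditable after the fact.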

Review Asset Bundle

In your Codespace Explorer (left sidebar), open databricks/production/asset_bundle/databricks.yml
Answer:
- How many deployment targets are defined?
- What changes between dev and prod targets?
- What does databricks bundle deploy -t prod do?
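
One way to think about targets is as small override sets layered on shared defaults. The sketch below models that idea in plain Python. The dictionary shape is hypothetical and is not the databricks.yml schema, and the dev/staging values are assumptions; only the prod overrides (development: false, pause_status: UNPAUSED) and serverless: true are taken from the workshop bundle.

```python
import copy


def resolve_target(bundle, target):
    """Overlay one target's overrides on the shared defaults."""
    resolved = copy.deepcopy(bundle["defaults"])
    resolved.update(bundle["targets"][target])
    return resolved


# Hypothetical, simplified view of a bundle with three targets.
bundle = {
    "defaults": {"serverless": True, "development": True, "pause_status": "PAUSED"},
    "targets": {
        "dev": {},
        "staging": {},
        "prod": {"development": False, "pause_status": "UNPAUSED"},
    },
}

dev = resolve_target(bundle, "dev")
prod = resolve_target(bundle, "prod")
changed_keys = sorted(k for k in prod if prod[k] != dev[k])
```

Here changed_keys lists exactly what differs between the two targets, which is the same comparison the review questions ask you to make by reading the file.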

Compute: Serverless vs. Classic

The pipeline editor defaults to Serverless compute, and this is the Databricks-recommended production path: it starts instantly and uses enhanced autoscaling (scaling both the node count and node size automatically), and it always runs on Unity Catalog for built-in governance and lineage.

Switch to Classic compute — via the pipeline Settings (uncheck Serverless) or a clusters: block in asset_bundle/databricks.yml — only when you need specific hardware (ARM nodes for cost, GPUs for ML), a cluster policy, spot instances, or a legacy Hive metastore. The bundle ships serverless: true with a commented-out classic example so you can see both.

Stretch: Create a Lakeflow Declarative Pipeline

In the Databricks sidebar (left), click Jobs & Pipelines → Create → ETL pipeline (Delta Live Tables was renamed to Lakeflow Declarative Pipelines in 2025; the ETL pipeline option is the one you want — not Job or Ingestion pipeline.)
Databricks auto-creates a pipeline with a default name and opens the Lakeflow Pipelines Editor (default settings: Unity Catalog, Serverless compute). At the top, rename the pipeline, then click the catalog/schema shown next to the name and set both:
- Default catalog → mhpdeworkshop_databricks_2026
- Default schema → {attendee_id}_pipeline (pre-created by 00_setup.py — isolated from your notebook schemas)
The pipeline code uses unqualified table names, so these defaults decide where every bronze_*/silver_*/kpi_* table is written and read. All tables land in the single {attendee_id}_pipeline schema (unlike the notebooks, which split across _bronze/_silver/_gold).
In the editor’s left asset browser there are two tabs — Pipeline (files in this pipeline) and All files (the rest of your workspace). Switch to All files; this is where your linked Git folder appears.
Navigate to databricks/production/dlt_pipeline.py, click its ⋮ (kebab) menu, and choose Include in pipeline to register it as source code. Confirm the Pipeline tab now lists dlt_pipeline.py, and delete the blank my_transformation starter file the editor created. (Point the source at the dlt_pipeline.py file, not the whole databricks/production folder — that folder also contains workflows_job.yml and asset_bundle/databricks.yml, and Lakeflow rejects any non-.py/.sql file in a source path with UNSUPPORTED_LIBRARY_FILE_TYPE.)
Click Run pipeline (top-right of the editor) and observe the Bronze → Silver → Gold execution graph
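
The reason for pointing the source at a single file can be checked with a short sketch. It assumes, as stated above, that only .py and .sql files are accepted in a source path; the helper name unsupported_sources is hypothetical, and the file list mirrors the production folder.

```python
from pathlib import PurePosixPath

# Assumption from the step above: only these suffixes are accepted as pipeline source.
SUPPORTED_SUFFIXES = {".py", ".sql"}


def unsupported_sources(paths):
    """Return the paths that a folder-level source would trip over."""
    return [
        p for p in paths
        if PurePosixPath(p).suffix.lower() not in SUPPORTED_SUFFIXES
    ]


production_folder = [
    "databricks/production/dlt_pipeline.py",
    "databricks/production/workflows_job.yml",
    "databricks/production/asset_bundle/databricks.yml",
]

rejected = unsupported_sources(production_folder)
```

With the whole folder as source, both YAML files end up in rejected; with only dlt_pipeline.py included, the list is empty.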

LSDP demo vs full 12 KPI contract

This production pipeline implements 4 sample Gold KPIs (kpi_trips_by_hour, kpi_trips_by_day, kpi_top_pickup_zones, kpi_revenue_by_hour) to keep the declarative demo runnable in class. The full 12 KPI contract is built in 03_gold_kpis.py and scheduled via workflows_job.yml. See Data Model — cross-platform differences.

Troubleshooting: NO_TABLES_IN_PIPELINE

If the run fails with NO_TABLES_IN_PIPELINE (SQL state 42617), the pipeline read your source but found zero table definitions. The usual cause is the source being a Databricks notebook whose Python code is not in a real code cell — make sure a # COMMAND ---------- separator sits between the top # MAGIC %md block and the first from pyspark import pipelines as dp line. Without it, the whole notebook is treated as one markdown cell and the @dp.materialized_view definitions never execute.
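
If you want to check a notebook source file before running the pipeline, a rough local check might look like the sketch below. It only looks at the exported source text and assumes the cell markers named above (# MAGIC lines and the # COMMAND ---------- separator); the function name is hypothetical and this is not a Databricks tool.

```python
SEPARATOR = "# COMMAND ----------"
IMPORT_LINE = "from pyspark import pipelines as dp"


def import_is_in_code_cell(source: str) -> bool:
    """True if the pipelines import sits in a cell that has no '# MAGIC' lines."""
    cell_has_magic = False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == SEPARATOR:
            cell_has_magic = False  # a new cell starts here
        elif stripped.startswith("# MAGIC"):
            cell_has_magic = True
        elif stripped == IMPORT_LINE:
            return not cell_has_magic
    return False  # the import line was not found at all


broken = "\n".join([
    "# Databricks notebook source",
    "# MAGIC %md",
    "# MAGIC # Production pipeline",
    "from pyspark import pipelines as dp",
])

fixed = "\n".join([
    "# Databricks notebook source",
    "# MAGIC %md",
    "# MAGIC # Production pipeline",
    "",
    "# COMMAND ----------",
    "",
    "from pyspark import pipelines as dp",
])
```

import_is_in_code_cell(broken) returns False because the import shares a cell with the markdown block, while the fixed variant returns True.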

Storage access on serverless

The pipeline runs on serverless compute, where account keys (fs.azure.account.key) are blocked. It reads ADLS2 through a Unity Catalog external location that your trainer provisioned — so, unlike 00_setup.py, the pipeline sets no storage key. If a read fails with Invalid configuration value detected for fs.azure.account.key, the external location or its READ FILES grant is missing — flag it to your trainer.

Review Workflows Job (Lakeflow Jobs)

In your Codespace Explorer (left sidebar), open databricks/production/workflows_job.yml
Replace repo paths if needed: notebook paths use /Repos/MHPDataEngineerWorkshop/... (your fork). Trainers use 2026MHPDataEngineerWorkshop in the canonical repo.
Answer:
- How is the task DAG defined (Bronze → Silver → Gold)?
- What cluster configuration is used for production?
- How does this compare to Asset Bundles for deployment?

Snowflake Production Exercise

Create a Scheduled Task

In Snowsight: Workspaces → + Add New → SQL File
In the toolbar, select a Started warehouse (e.g. DE_WORKSHOP_WH or COMPUTE_WH)
Paste and run the SQL below (Run all or select all → Run)

Warning

Literal names in task bodies — A scheduled task runs in its own session and cannot read SET variables from your editor. Replace de_XX_yourname with your attendee_id (e.g. de_01_alice from your credential card). To look it up first, run SELECT value FROM DE_MASTERCLASS.PUBLIC._workshop_config WHERE key = 'attendee_id';

USE ROLE      DE_WORKSHOP_ROLE;
USE DATABASE  DE_MASTERCLASS;
USE WAREHOUSE DE_WORKSHOP_WH;

-- 1. Create a scheduled task that refreshes a Gold KPI.
--    Replace de_XX_yourname with your ATTENDEE_ID. A scheduled task runs in its
--    own session, so it cannot read SET/session variables: identifiers in the
--    task body must be literal, schema-qualified names.
CREATE OR REPLACE TASK de_XX_yourname_SQL_GOLD.REFRESH_TRIPS_BY_HOUR
  WAREHOUSE = DE_WORKSHOP_WH
  SCHEDULE = 'USING CRON 0 6 * * * Europe/Berlin'
AS
  CREATE OR REPLACE TABLE de_XX_yourname_SQL_GOLD.kpi_trips_by_hour AS
  SELECT
      pickup_hour,
      CASE WHEN is_peak_hour THEN 'Peak Hour' ELSE 'Off-Peak Hour' END AS peak_status,
      CASE WHEN is_weekend THEN 'Weekend' ELSE 'Weekday' END AS day_type,
      COUNT(*) AS total_trips,
      ROUND(SUM(total_amount), 2) AS total_revenue
  FROM de_XX_yourname_SQL_SILVER.silver_nyc_taxi_enriched
  GROUP BY pickup_hour, is_peak_hour, is_weekend;

-- 2. Start the task
ALTER TASK de_XX_yourname_SQL_GOLD.REFRESH_TRIPS_BY_HOUR RESUME;

-- 3. Check task history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE NAME = 'REFRESH_TRIPS_BY_HOUR'
  AND SCHEMA_NAME = 'DE_XX_YOURNAME_SQL_GOLD'   -- uppercase: Snowflake stores unquoted names in upper case
ORDER BY SCHEDULED_TIME DESC
LIMIT 5;

-- 4. Clean up (suspend when done)
ALTER TASK de_XX_yourname_SQL_GOLD.REFRESH_TRIPS_BY_HOUR SUSPEND;
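
Because the task body needs literal names, it can help to derive every identifier from your attendee_id in one place before pasting. The following is a small helper sketch, not part of the workshop files; the function task_names and its dictionary keys are hypothetical, while the schema suffixes and object names are the ones used in the SQL above.

```python
import re


def task_names(attendee_id: str) -> dict:
    """Build the literal identifiers the task SQL needs from an attendee_id."""
    # Only accept plain unquoted identifiers, so the names are safe to paste into SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", attendee_id):
        raise ValueError(f"not a plain unquoted identifier: {attendee_id!r}")
    gold = f"{attendee_id}_SQL_GOLD"
    silver = f"{attendee_id}_SQL_SILVER"
    return {
        "task": f"{gold}.REFRESH_TRIPS_BY_HOUR",
        "target_table": f"{gold}.kpi_trips_by_hour",
        "source_table": f"{silver}.silver_nyc_taxi_enriched",
        # Unquoted names are stored in upper case, so the SCHEMA_NAME filter
        # on TASK_HISTORY() must use this form.
        "history_schema_filter": gold.upper(),
    }


names = task_names("de_01_alice")
```

For de_01_alice, names["history_schema_filter"] is DE_01_ALICE_SQL_GOLD, which is the value to use in the SCHEMA_NAME comparison in step 3.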

Review Stored Procedures

In your Codespace Explorer (left sidebar), open snowflake/sql/production/stored_procedures.sql
Answer:
- How does error handling work in the stored procedure?
- How would you call this procedure from a Task?

Review Tasks + Streams (CDC)

In your Codespace Explorer (left sidebar), open snowflake/sql/production/tasks_and_streams.sql
This demonstrates a production CDC pattern using Streams + Tasks
Answer:
- How does the Stream detect new data in Bronze?
- What does WHEN SYSTEM$STREAM_HAS_DATA() do?
- How is this different from a cron-scheduled Task?
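
The last question can be explored with a toy simulation. It assumes one scheduler tick per day and treats "the stream has data" as "at least one new row arrived since the last tick"; the arrival numbers are invented, and nothing here talks to Snowflake.

```python
def count_runs(arrivals_per_tick, event_driven):
    """Count task runs over a series of scheduler ticks.

    arrivals_per_tick: new rows that landed before each tick.
    event_driven: if True, skip ticks where the stream would be empty.
    """
    runs = 0
    for new_rows in arrivals_per_tick:
        if event_driven and new_rows == 0:
            # Analogous to a WHEN SYSTEM$STREAM_HAS_DATA() guard evaluating to false.
            continue
        runs += 1
    return runs


week = [1200, 0, 0, 950, 0, 1100, 0]  # hypothetical daily arrivals

cron_runs = count_runs(week, event_driven=False)
stream_runs = count_runs(week, event_driven=True)
```

In this made-up week the cron-style task runs on all 7 ticks, while the guarded task runs only on the 3 ticks that had new rows.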

Review Snowpark Stored Procedure

In your Codespace Explorer (left sidebar), open snowflake/sql/production/snowpark_sproc.py
This shows deploying Snowpark Python as a Snowflake stored procedure (uses _snowpark_bootstrap.py for the session)
Answer:
- How is the pipeline logic packaged via session.sproc.register()?
- What are the advantages of running Python inside Snowflake vs locally?
- How would you call this procedure from a Task?

Review Snowflake DevOps (CI/CD Deployment)

In your Codespace Explorer (left sidebar), open snowflake/sql/production/README.md and review the deployment patterns table
In production Snowflake projects, deployment and scheduling are separate concerns:
- Deploy: Snowflake CLI (snow git fetch + snow git execute) pushes code from Git to Snowflake
- Schedule: Tasks + Streams define when deployed objects run

Review this GitHub Actions workflow that deploys SQL scripts to Snowflake:

name: Deploy to Snowflake
on:
  push:
    branches: [ main ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      SNOWFLAKE_CONNECTIONS_DEFAULT_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
      SNOWFLAKE_CONNECTIONS_DEFAULT_USER: ${{ secrets.SNOWFLAKE_USER }}
      SNOWFLAKE_CONNECTIONS_DEFAULT_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
      - uses: Snowflake-Labs/snowflake-cli-action@v1.5
      - run: snow git fetch my_git_repo
      - run: |
          snow git execute @my_git_repo/branches/main/scripts/* \
            -D "environment='prod'"

Answer:
- What does snow git fetch do vs snow git execute?
- How does -D "environment='prod'" enable the same script to deploy to different environments?
- How does this compare to Databricks Asset Bundles (databricks bundle deploy -t prod)?
- Why is it better to deploy stored procedures via CI/CD rather than running them manually in Snowsight?
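
The idea behind the -D variable can be sketched as simple template substitution: one script, rendered differently per environment. The toy renderer below assumes a {{ name }} placeholder style and bare string values. It is not the Snowflake CLI's templating engine and does not model its quoting rules, and the script text is hypothetical.

```python
import re


def render(script: str, variables: dict) -> str:
    """Replace {{ name }} placeholders; fail loudly on an undefined variable."""
    def substitute(match):
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"undefined template variable: {key}")
        return variables[key]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, script)


# One script in Git (illustrative), two rendered outcomes.
script = "CREATE SCHEMA IF NOT EXISTS analytics_{{ environment }};"

prod_sql = render(script, {"environment": "prod"})
dev_sql = render(script, {"environment": "dev"})
```

Raising on an undefined variable is a deliberate choice in this sketch: a deployment that silently renders an empty environment name is harder to notice than one that stops.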

dbt Production Exercise

Review CI Configuration

In your Codespace Explorer (left sidebar), open dbt_project/production/github_actions_ci.yml
Trace through the job steps and answer:
- What triggers the CI workflow?
- What does --select state:modified+ do?
- What does --defer do and why is it useful? (Requires a production manifest.json in ./target — the workflow downloads artifact dbt-prod-manifest from your nightly prod job; falls back to full build if missing.)
- What happens if a dbt test fails?
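
To reason about the selector and --defer questions, it can help to model them on a tiny graph. The sketch below is plain Python, not dbt: it assumes the set of modified models is already known, walks downstream to mimic the trailing + in state:modified+, and then lists the unselected parents that a deferred run would read from production. All model names except the KPI names are hypothetical.

```python
from collections import deque


def modified_plus(children, modified):
    """Return the modified models plus everything downstream of them."""
    selected = set(modified)
    queue = deque(modified)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


# Hypothetical model graph: parent -> models that ref() it.
children = {
    "stg_trips": ["silver_trips"],
    "stg_zones": ["silver_zones"],
    "silver_trips": ["kpi_trips_by_hour", "kpi_revenue_by_hour"],
    "silver_zones": ["kpi_top_pickup_zones"],
}

selected = modified_plus(children, {"silver_zones"})

# Parents of selected models that are not selected themselves: with deferral
# these would be resolved against production instead of being rebuilt.
deferred_parents = {
    parent
    for parent, kids in children.items()
    if parent not in selected and any(kid in selected for kid in kids)
}
```

In this example a change to silver_zones selects two models out of seven, and stg_zones is the only upstream model that has to come from production.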

Discussion Questions

How does slim CI save time and compute cost?
When would you choose dbt Cloud over GitHub Actions?
What would you add to this CI pipeline for production readiness?

Expected Results

| Exercise | What you should see |
| --- | --- |
| Lakeflow Declarative Pipeline review | `@dp.expect_or_drop` / `@dp.expect` decorators define data quality rules (10 total on Silver + Bronze); execution order from `spark.read.table()` references; demo builds 4 Gold KPIs (full 12 via notebooks / Workflows) |
| Asset Bundle review | `databricks.yml` defines 3 targets (`dev`, `staging`, `prod`); `prod` differs from dev/staging in `development: false` and `pause_status: UNPAUSED` (schedule active); the pipeline runs on serverless compute (`serverless: true`), the Databricks-recommended default, with a commented-out classic `clusters:` block for the hardware-specific cases |
| Workflows review | `workflows_job.yml` defines a sequential task DAG with cluster config, an alternative to Asset Bundles; notebook paths must match your Repos folder name |
| Snowflake Task | Task appears in `TASK_HISTORY()` after `RESUME`; runs on cron schedule; Gold KPI table is refreshed |
| Tasks + Streams | Stream tracks CDC changes; Task runs only when stream has data (event-driven vs cron) |
| Snowpark SProc | Python pipeline deployed as stored procedure via `sproc.register()`; runs inside Snowflake compute |
| Snowflake DevOps | `snow git fetch` pulls latest code from Git; `snow git execute` deploys parameterized SQL/Python to Snowflake; GitHub Actions orchestrates the pipeline |
| dbt CI review | `state:modified+` selects only changed models and downstream; `--defer` uses production `manifest.json` when present; CI downloads `dbt-prod-manifest` artifact or runs full build on first run |

Things you should have noticed

1. Every platform has the same three concerns: scheduling, quality, and deployment — Databricks uses Workflows + DLT + Asset Bundles; Snowflake uses Tasks + Streams + Snowflake CLI; dbt uses GitHub Actions + dbt test + --defer. The vocabulary differs but the architecture is identical: deploy code → run on schedule → validate output.

2. Declarative quality rules beat manual filters: in the base exercises you wrote WHERE clauses by hand. In production, LSDP’s @dp.expect_or_drop and dbt’s tests: apply the same rules on every run. If a new data source introduces bad rows, @dp.expect_or_drop drops them and records the count in the event log, and a failing dbt test fails the build, so violations surface instead of passing through silently.

3. Cron vs event-driven scheduling is a real design choice — the Snowflake Task ran on a cron (0 6 * * * = 6 AM daily). The Tasks + Streams pattern runs only when new data arrives. Cron is simpler but wastes compute when there is no new data. Event-driven is more efficient but adds complexity (stream management, error handling).

4. Deployment and scheduling are separate concerns: snow git fetch + snow git execute and databricks bundle deploy push code to the platform, and the dbt CI workflow builds the changed models. Tasks, Workflows, and CI triggers define when that code runs. Collapsing the two into one manual step (e.g., running SQL by hand in Snowsight) works in development but is not repeatable in production.

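5. dbt’s state:modified+ and --defer together are a cost optimiser: state:modified+ compares the current branch against the production manifest and selects only the models that changed plus their downstream dependents, and --defer resolves references to unchanged upstream models against production instead of rebuilding them. Without them, every CI run would rebuild all 16 models. With them, a typical PR rebuilds 2–3 models, saving compute time and warehouse credits.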

Discussion questions

LSDP uses declarative expectations; the base exercise used manual WHERE filters. In a team of 5 engineers, what happens to data quality when someone forgets to add a filter in a new notebook?
Snowflake Tasks ran on cron; Tasks + Streams ran on data arrival. For a pipeline that processes daily taxi trip exports, which model would you choose and why?
dbt slim CI (state:modified+) only rebuilds changed models and their downstream dependents. What would happen in production if you ran dbt run (full rebuild) on every pull request?
You reviewed three deployment patterns (Asset Bundles, Snowflake CLI, GitHub Actions). Which would you choose for a team that deploys to Snowflake and Databricks simultaneously?

Cleanup

-- Suspend the Snowflake task to stop scheduled runs
SET attendee_id = (SELECT value FROM DE_MASTERCLASS.PUBLIC._workshop_config WHERE key = 'attendee_id');
SET task_refresh = 'DE_MASTERCLASS.' || $attendee_id || '_SQL_GOLD.REFRESH_TRIPS_BY_HOUR';
ALTER TASK IDENTIFIER($task_refresh) SUSPEND;

-- Suspend warehouse
ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND;

In Databricks, terminate any running clusters via Compute → Terminate.
