Exercise: Production Patterns

YellowLine NYC story · full hands-on lab

title: “Exercise: Production Patterns” subtitle: “YellowLine NYC story · full hands-on lab” —

Estimated time: 30–35 min (Databricks: 10 min · Snowflake: 15 min · dbt CI: 5–10 min)

YellowLine NYC context (Module 5)

What runs every night when the team is not in the room?

TipWorking Environment

Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. All production files (databricks/production/, snowflake/production/, dbt_project/production/) are at /workspace/.

Prerequisites

  • Completed at least one pipeline module (Databricks, Snowflake, or dbt)
  • Production files available in databricks/production/, snowflake/production/, dbt_project/production/

Databricks Production Exercise

Review Lakeflow Declarative Pipeline (formerly DLT)

  1. Open databricks/production/dlt_pipeline.py
  2. Compare @dlt.expect_or_drop() decorators with the manual WHERE filters in 02_silver_cleaning.py (Note: In 2026, the recommended API is from pyspark import pipelines as dp with @dp.table() / @dp.expect_or_drop(). The legacy import dlt / @dlt decorators still work.)
  3. Answer:
    • How many data quality expectations are defined?
    • What happens to rows that fail an expectation?
    • How does DLT know the execution order (Bronze → Silver → Gold)?

Review Asset Bundle

  1. Open databricks/production/asset_bundle/databricks.yml
  2. Answer:
    • How many deployment targets are defined?
    • What changes between dev and prod targets?
    • What does databricks bundle deploy -t prod do?

Stretch: Create a Lakeflow Declarative Pipeline

  1. In Databricks, go to Jobs & Pipelines in the sidebar → Create → Pipeline (Delta Live Tables was renamed to Lakeflow Declarative Pipelines in 2025.)
  2. Create a new pipeline pointing to dlt_pipeline.py
  3. Set target catalog to your attendee catalog
  4. Start the pipeline and observe the execution graph

Review Workflows Job (Lakeflow Jobs)

  1. Open databricks/production/workflows_job.yml
  2. Answer:
    • How is the task DAG defined (Bronze → Silver → Gold)?
    • What cluster configuration is used for production?
    • How does this compare to Asset Bundles for deployment?

Snowflake Production Exercise

Create a Scheduled Task

-- 1. Create a simple task that refreshes a Gold KPI
SET attendee_id = 'XX_yourname';

CREATE OR REPLACE TASK IDENTIFIER($attendee_id || '_GOLD.REFRESH_TRIPS_BY_HOUR')
  WAREHOUSE = DE_WORKSHOP_WH
  SCHEDULE = 'USING CRON 0 6 * * * Europe/Berlin'
AS
  CREATE OR REPLACE TABLE IDENTIFIER($attendee_id || '_GOLD.GOLD_KPI_TRIPS_BY_HOUR') AS
  SELECT
      EXTRACT(HOUR FROM PICKUP_DATETIME) AS PICKUP_HOUR,
      COUNT(*) AS TOTAL_TRIPS
  FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_ENRICHED')
  GROUP BY PICKUP_HOUR;

-- 2. Start the task
ALTER TASK IDENTIFIER($attendee_id || '_GOLD.REFRESH_TRIPS_BY_HOUR') RESUME;

-- 3. Check task history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE NAME = UPPER($attendee_id) || '_GOLD.REFRESH_TRIPS_BY_HOUR'
ORDER BY SCHEDULED_TIME DESC
LIMIT 5;

-- 4. Clean up (suspend when done)
ALTER TASK IDENTIFIER($attendee_id || '_GOLD.REFRESH_TRIPS_BY_HOUR') SUSPEND;

Review Stored Procedures

  1. Open snowflake/production/stored_procedures.sql
  2. Answer:
    • How does error handling work in the stored procedure?
    • How would you call this procedure from a Task?

Review Tasks + Streams (CDC)

  1. Open snowflake/production/tasks_and_streams.sql
  2. This demonstrates a production CDC pattern using Streams + Tasks
  3. Answer:
    • How does the Stream detect new data in Bronze?
    • What does WHEN SYSTEM$STREAM_HAS_DATA() do?
    • How is this different from a cron-scheduled Task?

Review Snowpark Stored Procedure

  1. Open snowflake/production/snowpark_sproc.py
  2. This shows deploying Snowpark Python as a Snowflake stored procedure
  3. Answer:
    • How is the pipeline logic packaged via session.sproc.register()?
    • What are the advantages of running Python inside Snowflake vs locally?
    • How would you call this procedure from a Task?

dbt Production Exercise

Review CI Configuration

  1. Open dbt_project/production/github_actions_ci.yml
  2. Trace through the job steps and answer:
    • What triggers the CI workflow?
    • What does --select state:modified+ do?
    • What does --defer do and why is it useful?
    • What happens if a dbt test fails?

Discussion Questions

  1. How does slim CI save time and compute cost?
  2. When would you choose dbt Cloud over GitHub Actions?
  3. What would you add to this CI pipeline for production readiness?

Expected Results

Exercise What you should see
Lakeflow Declarative Pipeline review @dlt.expect_or_drop (or @dp.expect_or_drop) decorators define data quality rules; the pipeline auto-discovers Bronze → Silver → Gold execution order from dlt.read() calls
Asset Bundle review databricks.yml defines dev and prod targets — prod uses a different catalog and job cluster config
Workflows review workflows_job.yml defines a sequential task DAG with cluster config — alternative to Asset Bundles
Snowflake Task Task appears in TASK_HISTORY() after RESUME; runs on cron schedule; Gold KPI table is refreshed
Tasks + Streams Stream tracks CDC changes; Task runs only when stream has data — event-driven vs cron
Snowpark SProc Python pipeline deployed as stored procedure via sproc.register(); runs inside Snowflake compute
dbt CI review state:modified+ selects only changed models and their downstream dependents; --defer uses production state for unmodified models

Cleanup

-- Suspend the Snowflake task to stop scheduled runs
ALTER TASK IDENTIFIER($attendee_id || '_GOLD.REFRESH_TRIPS_BY_HOUR') SUSPEND;

-- Suspend warehouse
ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND;

In Databricks, terminate any running clusters via Compute → Terminate.


Return to module

Source: merged from frozen workshop-2026-v1/exercises/ex-production.qmd — do not edit workshop-2026-v1/.