Exercise: AI Features
YellowLine NYC story · full hands-on lab
title: “Exercise: AI Features” subtitle: “YellowLine NYC story · full hands-on lab” —
Estimated time: 20–25 min (Cortex AI: 10 min · Databricks AI: 10 min · Discussion: 5 min)
YellowLine NYC context (Module 6)
Demo Cortex LLM assistants on governed Silver/Gold — not predictive ML.
Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. All workshop files are at /workspace/. AI features run in Databricks and Snowflake web UIs.
Prerequisites
- Completed at least one pipeline (Gold tables must exist)
- Snowflake trial account (for Cortex AI —
AI_COMPLETEfunction) - Snowflake role has
USE AI FUNCTIONSprivilege andCORTEX_USERdatabase role (trainer grants this) - Databricks workspace (for Genie features)
Snowflake Cortex AI (Hands-on)
Exercise 1: Trip Classification
Run this in a Snowflake SQL worksheet. (Note: AI_COMPLETE replaces the legacy SNOWFLAKE.CORTEX.COMPLETE function, which will be deprecated by end of 2026.)
SET attendee_id = 'XX_yourname';
SELECT
PICKUP_ZONE,
DROPOFF_ZONE,
TIME_OF_DAY,
TRIP_DISTANCE,
AI_COMPLETE(
'mistral-large2',
'Based on this NYC taxi trip, classify the likely purpose as one of: '
|| 'business_commute, leisure, airport_transfer, short_errand. '
|| 'Trip details: From ' || PICKUP_ZONE || ' to ' || DROPOFF_ZONE
|| ', distance ' || TRIP_DISTANCE || ' miles, during ' || TIME_OF_DAY
|| '. Respond with ONLY the classification label.'
) AS trip_purpose
FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_ENRICHED')
LIMIT 10;Exercise 2: KPI Insights
SELECT AI_COMPLETE(
'mistral-large2',
'Analyze these NYC taxi KPI results and provide 3 business insights: '
|| (SELECT LISTAGG(
'Hour ' || PICKUP_HOUR || ': ' || TOTAL_TRIPS || ' trips, $' || TOTAL_REVENUE || ' revenue',
'; '
)
FROM IDENTIFIER($attendee_id || '_GOLD.GOLD_KPI_TRIPS_BY_HOUR'))
) AS kpi_insights;Exercise 3: Anomaly Detection
SELECT
PICKUP_HOUR,
TOTAL_TRIPS,
TOTAL_REVENUE,
AI_COMPLETE(
'mistral-large2',
'Is this hour anomalous for NYC taxi data? Hour: ' || PICKUP_HOUR
|| ', Trips: ' || TOTAL_TRIPS
|| ', Revenue: $' || TOTAL_REVENUE
|| '. Respond: normal or anomalous, with brief reason.'
) AS anomaly_assessment
FROM IDENTIFIER($attendee_id || '_GOLD.GOLD_KPI_TRIPS_BY_HOUR')
ORDER BY PICKUP_HOUR;Databricks Genie (Guided)
Exercise 4: Databricks Assistant
Open a notebook and try these prompts in the Assistant:
- “Write PySpark code to find the top 5 routes by total revenue from silver_nyc_taxi_enriched”
- “Explain what this filter does: df.filter((col(‘trip_duration’) > 0) & (col(‘trip_duration’) <= 1440))”
- “Create a visualization of trips by hour using matplotlib”
Exercise 5: AI Functions
Run in a Databricks SQL cell:
SELECT
pickup_zone,
ai_query(
'databricks-meta-llama-3-3-70b-instruct',
CONCAT('Is this NYC zone primarily residential, commercial, or tourist area? Zone: ',
pickup_zone, '. Answer with one word.')
) AS zone_type
FROM {attendee_id}_gold.gold_kpi_top_pickup_zones
WHERE trip_rank <= 10;Exercise 6: Genie Space (Optional)
- In the Databricks sidebar, click Genie (under the SQL section)
- Create a Genie space connected to your Gold schema
- Try asking: “What hour has the highest revenue?”
- Try: “Compare weekend vs weekday trip patterns”
Discussion
- Which AI feature felt most useful for your daily work?
- Where do you see AI adding the most value in data engineering?
- What are the limitations of using LLMs on structured data?
Expected Results
| Exercise | What you should see |
|---|---|
| Trip Classification | Each row returns a label: business_commute, leisure, airport_transfer, or short_errand — classifications vary per trip (LLM is probabilistic) |
| KPI Insights | A paragraph of 3 business insights generated from the hourly KPI data — e.g., “peak hours are 7-9 AM and 5-7 PM” |
| Anomaly Detection | Each hour classified as normal or anomalous with a brief reason — low-trip hours (2-4 AM) often flagged as anomalous |
| Databricks Assistant | Generates PySpark code from natural language — code quality varies; always verify against known-good queries |
| AI Functions | Zone classification (residential/commercial/tourist) — results are reasonable but not 100% accurate |
AI outputs vary between runs
Because LLMs are probabilistic, running the same query twice may produce different classifications. This is expected behavior — never use AI-generated columns as primary keys or join conditions.
Cleanup
No specific cleanup needed — AI queries are read-only and do not create new tables. If you created a Genie space you no longer need, delete it from the Genie page in Databricks.
Suspend your Snowflake warehouse when finished:
ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND;Return to module
Source: merged from frozen workshop-2026-v1/exercises/ex-ai-features.qmd — do not edit workshop-2026-v1/.