Exercise: AI Features

YellowLine NYC story · full hands-on lab

title: “Exercise: AI Features” subtitle: “YellowLine NYC story · full hands-on lab” —

Estimated time: 20–25 min (Cortex AI: 10 min · Databricks AI: 10 min · Discussion: 5 min)

YellowLine NYC context (Module 6)

Demo Cortex LLM assistants on governed Silver/Gold — not predictive ML.

TipWorking Environment

Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. All workshop files are at /workspace/. AI features run in Databricks and Snowflake web UIs.

Prerequisites

  • Completed at least one pipeline (Gold tables must exist)
  • Snowflake trial account (for Cortex AI — AI_COMPLETE function)
  • Snowflake role has USE AI FUNCTIONS privilege and CORTEX_USER database role (trainer grants this)
  • Databricks workspace (for Genie features)

Snowflake Cortex AI (Hands-on)

Exercise 1: Trip Classification

Run this in a Snowflake SQL worksheet. (Note: AI_COMPLETE replaces the legacy SNOWFLAKE.CORTEX.COMPLETE function, which will be deprecated by end of 2026.)

SET attendee_id = 'XX_yourname';

SELECT
    PICKUP_ZONE,
    DROPOFF_ZONE,
    TIME_OF_DAY,
    TRIP_DISTANCE,
    AI_COMPLETE(
        'mistral-large2',
        'Based on this NYC taxi trip, classify the likely purpose as one of: '
        || 'business_commute, leisure, airport_transfer, short_errand. '
        || 'Trip details: From ' || PICKUP_ZONE || ' to ' || DROPOFF_ZONE
        || ', distance ' || TRIP_DISTANCE || ' miles, during ' || TIME_OF_DAY
        || '. Respond with ONLY the classification label.'
    ) AS trip_purpose
FROM IDENTIFIER($attendee_id || '_SILVER.SILVER_NYC_TAXI_ENRICHED')
LIMIT 10;

Exercise 2: KPI Insights

SELECT AI_COMPLETE(
    'mistral-large2',
    'Analyze these NYC taxi KPI results and provide 3 business insights: '
    || (SELECT LISTAGG(
        'Hour ' || PICKUP_HOUR || ': ' || TOTAL_TRIPS || ' trips, $' || TOTAL_REVENUE || ' revenue',
        '; '
       )
       FROM IDENTIFIER($attendee_id || '_GOLD.GOLD_KPI_TRIPS_BY_HOUR'))
) AS kpi_insights;

Exercise 3: Anomaly Detection

SELECT
    PICKUP_HOUR,
    TOTAL_TRIPS,
    TOTAL_REVENUE,
    AI_COMPLETE(
        'mistral-large2',
        'Is this hour anomalous for NYC taxi data? Hour: ' || PICKUP_HOUR
        || ', Trips: ' || TOTAL_TRIPS
        || ', Revenue: $' || TOTAL_REVENUE
        || '. Respond: normal or anomalous, with brief reason.'
    ) AS anomaly_assessment
FROM IDENTIFIER($attendee_id || '_GOLD.GOLD_KPI_TRIPS_BY_HOUR')
ORDER BY PICKUP_HOUR;

Databricks Genie (Guided)

Exercise 4: Databricks Assistant

Open a notebook and try these prompts in the Assistant:

  1. “Write PySpark code to find the top 5 routes by total revenue from silver_nyc_taxi_enriched”
  2. “Explain what this filter does: df.filter((col(‘trip_duration’) > 0) & (col(‘trip_duration’) <= 1440))”
  3. “Create a visualization of trips by hour using matplotlib”

Exercise 5: AI Functions

Run in a Databricks SQL cell:

SELECT
    pickup_zone,
    ai_query(
        'databricks-meta-llama-3-3-70b-instruct',
        CONCAT('Is this NYC zone primarily residential, commercial, or tourist area? Zone: ',
               pickup_zone, '. Answer with one word.')
    ) AS zone_type
FROM {attendee_id}_gold.gold_kpi_top_pickup_zones
WHERE trip_rank <= 10;

Exercise 6: Genie Space (Optional)

  1. In the Databricks sidebar, click Genie (under the SQL section)
  2. Create a Genie space connected to your Gold schema
  3. Try asking: “What hour has the highest revenue?”
  4. Try: “Compare weekend vs weekday trip patterns”

Discussion

  • Which AI feature felt most useful for your daily work?
  • Where do you see AI adding the most value in data engineering?
  • What are the limitations of using LLMs on structured data?

Expected Results

Exercise What you should see
Trip Classification Each row returns a label: business_commute, leisure, airport_transfer, or short_errand — classifications vary per trip (LLM is probabilistic)
KPI Insights A paragraph of 3 business insights generated from the hourly KPI data — e.g., “peak hours are 7-9 AM and 5-7 PM”
Anomaly Detection Each hour classified as normal or anomalous with a brief reason — low-trip hours (2-4 AM) often flagged as anomalous
Databricks Assistant Generates PySpark code from natural language — code quality varies; always verify against known-good queries
AI Functions Zone classification (residential/commercial/tourist) — results are reasonable but not 100% accurate

AI outputs vary between runs

Because LLMs are probabilistic, running the same query twice may produce different classifications. This is expected behavior — never use AI-generated columns as primary keys or join conditions.

Cleanup

No specific cleanup needed — AI queries are read-only and do not create new tables. If you created a Genie space you no longer need, delete it from the Genie page in Databricks.

Suspend your Snowflake warehouse when finished:

ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND;

Return to module

Source: merged from frozen workshop-2026-v1/exercises/ex-ai-features.qmd — do not edit workshop-2026-v1/.