Exercise: AI Features

YellowLine NYC story · full hands-on lab


Estimated time: 25–35 min (Cortex AI: 12 min · Databricks AI: 12 min · Discussion: 5 min)

YellowLine NYC context (Module 6)

Demo LLM assistants (Snowflake Cortex and Databricks AI) on the governed Silver/Gold layers. This lab is not about predictive ML.

Working Environment

Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. All workshop files are at /workspace/. AI features run in Databricks and Snowflake web UIs.

GitHub blocked? (emergency only)

The normal path is fork + Codespace (Prerequisites § Step 2). Use Lab source files only if your facilitator has approved it, e.g. because you cannot create or use a GitHub account before class.

Prerequisites

Completed at least one pipeline (Gold tables must exist)
Snowflake: role DE_WORKSHOP_ROLE active and warehouse DE_WORKSHOP_WH started, both needed for AI_COMPLETE() (workshop role); the SNOWFLAKE.CORTEX_USER database role granted to that role (Module 6 § Cortex)
Databricks: cluster attached for notebook Assistant and ai_query() on 04_ai_features (cluster); Genie uses SQL warehouse de-workshop-wh (cluster vs warehouse)

Note

Model availability: LLM model IDs (mistral-large2, databricks-meta-llama-3-3-70b-instruct, etc.) change over time. If a query fails with an unknown model, ask your facilitator for the current model list in your account — see Snowflake Cortex LLM functions and Databricks foundation model endpoints.

Lab numbering vs repo files

Lab exercise	Source
1–4	`snowflake/sql/ai/cortex_exercises.sql` (Workspaces SQL — `_SQL_*` schemas)
1–4 (notebook track)	`snowflake/notebooks/01_sql/ai/cortex_exercises.ipynb` — same queries, `_NB_SQL_*` schemas
6–8	`databricks/notebooks/04_ai_features.py` Exercises 1–3

Snowflake Cortex AI (Hands-on)

Important

Trial accounts need a payment method to run Cortex AI. On a trial account, AI_COMPLETE (and the legacy SNOWFLAKE.CORTEX.COMPLETE) fail with “AI function … is not available for trial accounts” until you add a card. If you want to actually run these exercises, add a payment method in Snowsight → Admin → Billing & Terms → Payment Method, then re-run. If you would rather not add a card, just review the code below — you can still see exactly how Cortex AI is called without executing it.

Run in a Workspaces SQL file (+ Add New → SQL File) or the notebook above. Start each session with:

SET attendee_id = (SELECT value FROM DE_MASTERCLASS.PUBLIC._workshop_config WHERE key = 'attendee_id');
SET silver_schema   = $attendee_id || '_SQL_SILVER';
SET gold_schema     = $attendee_id || '_SQL_GOLD';
SET silver_enriched = $silver_schema || '.silver_nyc_taxi_enriched';
SET kpi_trips_by_hour  = $gold_schema || '.kpi_trips_by_hour';

USE ROLE DE_WORKSHOP_ROLE;
USE DATABASE DE_MASTERCLASS;
USE WAREHOUSE DE_WORKSHOP_WH;

(Notebook SQL track: replace _SQL_ with _NB_SQL_ in schema names.)
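
The SET statements above only concatenate the attendee ID with a track-specific suffix. As a minimal sketch of that naming rule, the Python below derives the same names for either track, which can help when checking which objects a session will resolve to. The helper `workshop_names` and the sample ID are illustrative assumptions, not part of the repo.

```python
def workshop_names(attendee_id: str, notebook_track: bool = False) -> dict:
    """Mirror the SET statements: build schema and table names for one attendee."""
    # Workspaces SQL track uses _SQL_ schemas; the notebook track uses _NB_SQL_.
    infix = "_NB_SQL_" if notebook_track else "_SQL_"
    silver_schema = f"{attendee_id}{infix}SILVER"
    gold_schema = f"{attendee_id}{infix}GOLD"
    return {
        "silver_schema": silver_schema,
        "gold_schema": gold_schema,
        "silver_enriched": f"{silver_schema}.silver_nyc_taxi_enriched",
        "kpi_trips_by_hour": f"{gold_schema}.kpi_trips_by_hour",
    }


if __name__ == "__main__":
    # "DE_01_ALICE" is a made-up attendee ID used only for illustration.
    for key, value in workshop_names("DE_01_ALICE").items():
        print(f"{key:18} {value}")
```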

(Note: AI_COMPLETE replaces the legacy SNOWFLAKE.CORTEX.COMPLETE function, deprecated by end of 2026.)

Exercise 1: Trip Purpose Classification

Note

Why AI_COMPLETE instead of a trained classifier? A traditional ML approach would require labelling hundreds of trips, training a model, and deploying it. AI_COMPLETE lets you classify trips using a natural-language prompt — zero training data needed. The tradeoff: it is slower per row and probabilistic, so it is best for exploration and prototyping, not for production scoring of millions of rows.

Matches cortex_exercises.sql → Exercise 1.

SELECT
    pickup_zone,
    dropoff_zone,
    time_of_day,
    pickup_hour,
    ROUND(trip_distance, 1) AS distance_mi,
    ROUND(total_amount, 2)  AS fare_usd,
    AI_COMPLETE(
        'mistral-large2',
        'Based on this NYC taxi trip, classify the likely purpose in 1-2 words. ' ||
        'Pickup: ' || COALESCE(pickup_zone, 'Unknown') || '. ' ||
        'Dropoff: ' || COALESCE(dropoff_zone, 'Unknown') || '. ' ||
        'Time: ' || time_of_day || ' (hour ' || pickup_hour || '). ' ||
        'Distance: ' || ROUND(trip_distance, 1) || ' miles.'
    ) AS trip_purpose_ai
FROM IDENTIFIER($silver_enriched)
WHERE pickup_zone IS NOT NULL AND dropoff_zone IS NOT NULL
LIMIT 5;
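
To see exactly what text the model receives for one row, the prompt concatenation in the query above can be approximated in plain Python. This is a sketch only: `build_trip_prompt` is a hypothetical helper and the sample trip is invented, not taken from the workshop data.

```python
def build_trip_prompt(trip: dict) -> str:
    """Assemble the Exercise 1 prompt roughly the way the SQL concatenation does."""
    # `or "Unknown"` plays the role of COALESCE(..., 'Unknown') for missing zones.
    pickup = trip.get("pickup_zone") or "Unknown"
    dropoff = trip.get("dropoff_zone") or "Unknown"
    return (
        "Based on this NYC taxi trip, classify the likely purpose in 1-2 words. "
        f"Pickup: {pickup}. "
        f"Dropoff: {dropoff}. "
        f"Time: {trip['time_of_day']} (hour {trip['pickup_hour']}). "
        f"Distance: {round(trip['trip_distance'], 1)} miles."
    )


if __name__ == "__main__":
    # Invented sample row, for illustration only.
    sample = {
        "pickup_zone": "JFK Airport",
        "dropoff_zone": None,
        "time_of_day": "Morning",
        "pickup_hour": 7,
        "trip_distance": 17.26,
    }
    print(build_trip_prompt(sample))
```

One difference worth keeping in mind: in SQL, concatenating a NULL with `||` yields NULL, so a row with a NULL `time_of_day` would produce a NULL prompt unless that column is also wrapped in COALESCE.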

Exercise 2: KPI Insight Generation

Matches cortex_exercises.sql → Exercise 2. Uses weekday hours only.

WITH hourly_data AS (
    SELECT
        LISTAGG('Hour ' || pickup_hour || ': ' || total_trips || ' trips, $' || total_revenue, '; ')
            WITHIN GROUP (ORDER BY pickup_hour) AS hourly_summary
    FROM IDENTIFIER($kpi_trips_by_hour)
    WHERE day_type = 'Weekday'
)
SELECT
    AI_COMPLETE(
        'mistral-large2',
        'Analyze these NYC taxi hourly patterns and provide 3 key insights in bullet points: ' ||
        hourly_summary
    ) AS ai_insights
FROM hourly_data;
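
The CTE above collapses the weekday rows into a single string before the model sees anything. A rough Python analogue of that LISTAGG step is sketched below; `hourly_summary` is a hypothetical helper and the rows are invented for illustration.

```python
def hourly_summary(rows: list[dict]) -> str:
    """Python analogue of LISTAGG(...) WITHIN GROUP (ORDER BY pickup_hour)."""
    # Keep weekday rows only, matching WHERE day_type = 'Weekday'.
    weekday = [r for r in rows if r["day_type"] == "Weekday"]
    ordered = sorted(weekday, key=lambda r: r["pickup_hour"])
    return "; ".join(
        f"Hour {r['pickup_hour']}: {r['total_trips']} trips, ${r['total_revenue']}"
        for r in ordered
    )


if __name__ == "__main__":
    # Invented rows, for illustration only.
    rows = [
        {"day_type": "Weekday", "pickup_hour": 8, "total_trips": 120, "total_revenue": 2150.5},
        {"day_type": "Weekend", "pickup_hour": 8, "total_trips": 60, "total_revenue": 990.0},
        {"day_type": "Weekday", "pickup_hour": 3, "total_trips": 9, "total_revenue": 140.0},
    ]
    print(hourly_summary(rows))
```

Because the whole summary travels in one prompt, this pattern makes a single model call for the entire result set, unlike Exercise 1, which makes one call per row.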

Exercise 3: Anomaly Detection (Hourly KPIs)

Matches cortex_exercises.sql → Exercise 3.

SELECT
    pickup_hour,
    total_trips,
    total_revenue,
    AI_COMPLETE(
        'mistral-large2',
        'Is this hour anomalous for NYC taxi data? Hour: ' || pickup_hour
        || ', Trips: ' || total_trips
        || ', Revenue: $' || total_revenue
        || '. Respond: normal or anomalous, with brief reason.'
    ) AS anomaly_assessment
FROM IDENTIFIER($kpi_trips_by_hour)
ORDER BY pickup_hour;
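
The prompt above asks for "normal or anomalous, with brief reason", but the reply is still free text. If you want to count or filter the assessments afterwards, one option is to split each reply into a label and a reason. The sketch below assumes the reply starts with the label, which a model will not always honor; `parse_assessment` is a hypothetical helper.

```python
import re

# Leading punctuation or markdown, then the label, then the remaining text.
_PATTERN = re.compile(r"\W*(normal|anomalous)\b\W*(.*)", re.IGNORECASE | re.DOTALL)


def parse_assessment(text: str) -> tuple[str, str]:
    """Split a reply into (label, reason); label is 'unknown' if it is unclear."""
    cleaned = text.strip()
    match = _PATTERN.match(cleaned)
    if match is None:
        # Keep the full reply so nothing is lost for manual review.
        return "unknown", cleaned
    return match.group(1).lower(), match.group(2).strip()


if __name__ == "__main__":
    # Invented replies, for illustration only.
    for reply in (
        "Anomalous: very low trip count for 3 AM.",
        "**Normal** - typical evening volume",
        "This hour looks fine.",
    ):
        print(parse_assessment(reply))
```

Replies that land in `unknown` are a useful signal that the prompt wording needs tightening.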

Exercise 4: Snowflake CoCo (Snowsight)

Matches cortex_exercises.sql → Exercise 4. CoCo (the 2026 rebrand of Cortex Code) is an agentic assistant in Snowsight Workspaces (Cortex Code in Snowsight). According to the Snowflake docs, it supports multi-step plans, natural-language schema search, SQL generation, and query execution within your RBAC permissions. It is the closest in-Snowsight natural-language experience to Databricks Genie for ad hoc KPI questions.

Projects → Workspaces — open a SQL file; set DE_MASTERCLASS and your Gold schema
CoCo / Cortex Code icon (lower-right) or Cmd+I / Ctrl+I
Type @ to add catalog objects, or describe data in plain language
Review response; run SQL from the panel if offered

Also try:

“Show me average trip distance by borough from the silver table”
“Top 5 routes by trip count from my Gold KPI tables”
“Find the hour with highest revenue per trip”
“Compare credit card vs cash tipping behavior”

Databricks AI/BI (Guided)

Open databricks/notebooks/04_ai_features.py in your Git folder on your cluster (after 00_setup and pipeline 01–03).

Exercise 6: Databricks Assistant

Matches 04_ai_features.py → Exercise 1.

Databricks Assistant is an AI pair programmer in the notebook UI. Click the Assistant icon or press Cmd+I / Ctrl+I. It understands Unity Catalog and can reference your tables.

Try these prompts (same as the notebook):

“Show me the top 5 pickup zones by revenue from my Gold KPI tables”
“Write a PySpark query to find the busiest hour on weekends vs weekdays”
“Create a visualization of trips by hour using matplotlib”
“Explain the data quality metrics in my Gold layer”

Demo cell: Run the notebook cell after “Assistant Demo: Auto-Generated Code” — it lists every Gold kpi_* table with row and column counts. The checkpoint should pass when ≥ 12 tables exist in your Gold schema.
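
The checkpoint logic described above amounts to counting the `kpi_*` tables in your Gold schema. A minimal sketch of that check, assuming you already have the table names as a list (for example from a SHOW TABLES result); `gold_checkpoint` is a hypothetical helper, not the notebook's actual cell.

```python
def gold_checkpoint(table_names: list[str], expected: int = 12) -> tuple[bool, list[str]]:
    """Return (passed, kpi_tables) for a list of table names in the Gold schema."""
    # Only tables following the kpi_ naming convention count toward the checkpoint.
    kpi_tables = sorted(name for name in table_names if name.startswith("kpi_"))
    return len(kpi_tables) >= expected, kpi_tables


if __name__ == "__main__":
    # Partial, illustrative list; a real Gold schema should contain 12 kpi_* tables.
    names = ["kpi_trips_by_hour", "kpi_revenue_by_hour", "kpi_top_pickup_zones", "scratch_tmp"]
    passed, found = gold_checkpoint(names)
    print(f"passed={passed}, found {len(found)} kpi tables: {found}")
```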

Exercise 7: `ai_query()` — LLM-Powered Data Enrichment

Matches 04_ai_features.py → Exercise 2. Two equivalent ways to classify trip purpose from Silver enriched — run Option A, Option B, or both.

Option	Style	Notes
A	PySpark DataFrame API	Recommended for Git `.py` notebooks in this repo
B	SQL via `spark.sql()`	Same cluster; mirrors SQL warehouse syntax

Git `.py` notebooks — SQL pitfalls

Do not use a %sql cell in Git-backed .py files — Databricks may run it as Python (SyntaxError).
Do not use $ or deprecated ${param} in SQL strings (e.g. 'Fare: $' triggers a deprecation error). Use format_string(..., %s ...) and :param in native SQL notebooks instead.
Prefer a temp view (as in the notebook) or IDENTIFIER(:table) rather than embedding catalog.schema.table in hand-written SQL.
If you get a model-access error, ask the trainer — some workspaces require a SQL warehouse or Model Serving endpoint.

Run the notebook cells under Option A, or use this pattern (prompt matches _prompt_template in the notebook):

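# After %run ./_workshop_bootstrap (see notebook for full cell)
# Imports are listed here in case the bootstrap has not already provided them.
# Note: this `round` is the Spark column function and shadows Python's built-in.
from pyspark.sql.functions import col, expr, format_string, round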
_enriched = f"{CATALOG_NAME}.{SILVER_SCHEMA}.silver_nyc_taxi_enriched"
_llm_model = "databricks-meta-llama-3-3-70b-instruct"
_prompt_template = (
    "Classify the likely trip purpose in 1-2 words for this NYC taxi trip. "
    "Pickup: %s (%s). Dropoff: %s (%s). Time: %s, hour %s. Distance: %s mi. Fare: %s USD."
)

spark.table(_enriched).createOrReplaceTempView("_workshop_enriched_ai")

df = (
    spark.table("_workshop_enriched_ai")
    .filter(col("pickup_zone").isNotNull() & col("dropoff_zone").isNotNull())
    .limit(10)
    .withColumn("_prompt", format_string(
        _prompt_template,
        col("pickup_zone"), col("pickup_borough"), col("dropoff_zone"), col("dropoff_borough"),
        col("time_of_day"), col("pickup_hour").cast("string"),
        round(col("trip_distance"), 1).cast("string"),
        round(col("total_amount"), 2).cast("string"),
    ))
    .select(
        "pickup_zone", "dropoff_zone", "pickup_hour", "time_of_day",
        "trip_distance", "total_amount",
        expr(f"ai_query('{_llm_model}', _prompt)").alias("trip_purpose_ai"),
    )
)
preview_df(df, n=10, title="Option A — PySpark ai_query() sample:", truncate=False)
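
Before spending model calls on a modified template, it can help to check it locally against the two pitfalls listed earlier: a stray `$`, and a placeholder count that no longer matches the columns passed to `format_string`. The sketch below is plain Python with no Spark dependency; `check_template` is a hypothetical helper, and the assumption is that a mismatch would otherwise only surface when the query runs.

```python
PROMPT_TEMPLATE = (
    "Classify the likely trip purpose in 1-2 words for this NYC taxi trip. "
    "Pickup: %s (%s). Dropoff: %s (%s). Time: %s, hour %s. Distance: %s mi. Fare: %s USD."
)

# Column order must match the order of the %s placeholders.
PROMPT_COLUMNS = [
    "pickup_zone", "pickup_borough", "dropoff_zone", "dropoff_borough",
    "time_of_day", "pickup_hour", "trip_distance", "total_amount",
]


def check_template(template: str, columns: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the template looks safe."""
    problems = []
    placeholders = template.count("%s")
    if placeholders != len(columns):
        problems.append(f"{placeholders} placeholders but {len(columns)} columns")
    if "$" in template:
        problems.append("template contains '$'")
    return problems


if __name__ == "__main__":
    print(check_template(PROMPT_TEMPLATE, PROMPT_COLUMNS))
    print(check_template("Fare: $%s", ["total_amount"]))
```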

Same prompt via spark.sql() on temp view _workshop_enriched_ai (notebook cell under Option B):

SELECT
    pickup_zone,
    dropoff_zone,
    pickup_hour,
    time_of_day,
    trip_distance,
    total_amount,
    ai_query(
        'databricks-meta-llama-3-3-70b-instruct',
        format_string(
            'Classify the likely trip purpose in 1-2 words for this NYC taxi trip. '
            || 'Pickup: %s (%s). Dropoff: %s (%s). Time: %s, hour %s. Distance: %s mi. Fare: %s USD.',
            pickup_zone, pickup_borough, dropoff_zone, dropoff_borough,
            time_of_day, CAST(pickup_hour AS STRING),
            CAST(ROUND(trip_distance, 1) AS STRING),
            CAST(ROUND(total_amount, 2) AS STRING)
        )
    ) AS trip_purpose_ai
FROM _workshop_enriched_ai
WHERE pickup_zone IS NOT NULL AND dropoff_zone IS NOT NULL
LIMIT 10;

Note

In a SQL warehouse or native SQL notebook, register the table with IDENTIFIER(:catalog || '.' || :schema || '.silver_nyc_taxi_enriched') or use :param markers — not ${param}.

Try it yourself (notebook stretch) — change _prompt_template to:

Estimate traffic conditions based on speed and time of day
Suggest optimal pricing based on trip characteristics
Detect anomalous trips that might indicate fraud
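
When experimenting with alternative templates like these, free-text answers quickly become hard to group or count. One common tactic is to list the allowed labels in the prompt and normalize whatever comes back. The sketch below assumes a made-up label set and hypothetical helpers; it is not part of the notebook.

```python
ALLOWED_LABELS = ("commute", "airport", "leisure", "other")  # hypothetical label set


def constrained_prompt(trip_description: str, labels=ALLOWED_LABELS) -> str:
    """Ask for exactly one label from a fixed set instead of a free description."""
    return (
        f"Classify this NYC taxi trip as exactly one of: {', '.join(labels)}. "
        f"Reply with the label only. Trip: {trip_description}"
    )


def normalize_label(reply: str, labels=ALLOWED_LABELS) -> str:
    """Map a model reply onto the allowed set; anything else becomes 'other'."""
    # Strip whitespace, then trailing punctuation or quotes the model may add.
    candidate = reply.strip().strip(".\"'").lower()
    return candidate if candidate in labels else "other"


if __name__ == "__main__":
    print(constrained_prompt("Pickup: Midtown. Dropoff: JFK Airport. Hour 6."))
    print(normalize_label(" Airport. "))
    print(normalize_label("Business meeting"))
```

The fallback to `other` assumes that label is part of the set; with a different label set, choose an explicit catch-all.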

Exercise 8: Genie — Natural Language Data Exploration (Optional)

Matches 04_ai_features.py → Exercise 3. Do after Exercises 6–7, or skip here if you are short on time.

Compute and prerequisites

Requirement	Detail
Compute	SQL warehouse `de-workshop-wh` — Genie does not use your PySpark cluster (cluster vs warehouse)
Data	All 12 `kpi_*` tables in `{attendee_id}_gold` (from `03_gold_kpis.py`)
Catalog	`mhpdeworkshop_databricks_2026.{attendee_id}_gold`

If Genie cannot see tables, re-run 00_setup.py and Gold pipeline 03 first.

Setup

In the Databricks sidebar (left), click Genie (under the SQL / AI/BI section)
Click Create (or New Genie space)
Name the space: NYC Taxi Analysis - {attendee_id}
Example: NYC Taxi Analysis - de_01_alice
Connect data sources — Unity Catalog:
- Catalog: mhpdeworkshop_databricks_2026
- Schema: {attendee_id}_gold
- Add your Gold KPI tables (kpi_trips_by_hour, kpi_revenue_by_hour, kpi_top_pickup_zones, …) or select the whole Gold schema if the UI allows
SQL warehouse: choose de-workshop-wh (facilitator-managed) when prompted for compute

Genie space empty or wrong answers?

The space must point at {attendee_id}_gold, not Bronze or Silver. If results look empty, open Catalog and confirm the 12 kpi_* tables exist, then edit the Genie space data sources and re-add the Gold schema.

Try these questions

Ask in plain English (same prompts as the notebook):

“What are the busiest hours for taxi trips?”
“Which borough generates the most revenue?”
“Compare weekday vs weekend trip patterns”
“Show me the top routes between Manhattan and JFK”
“What payment method has the highest average tip?”

Also try (shorter checks):

“What hour has the highest revenue?”
“Compare weekend vs weekday trip patterns”

What success looks like

Genie proposes SQL against your Gold kpi_* tables and shows a result grid
You can edit and re-run the suggested query before trusting the answer
Peak-hour and borough answers should be directionally consistent with Gold queries you ran in Modules 2–3 (exact wording may differ)

Optional stretch

Databricks AI/BI Dashboards exist, but this workshop uses Power BI for Priya’s dashboard (Module 7). Skip building a second dashboard in Databricks unless you have extra time.

Reference: Databricks Genie documentation

Discussion

Which AI feature felt most useful for your daily work?
Where do you see AI adding the most value in data engineering?
What are the limitations of using LLMs on structured data?

Things you should have noticed

1. Snowflake AI_COMPLETE and Databricks ai_query() do the same thing — both call a hosted LLM from inside a SQL query or DataFrame. The syntax differs (AI_COMPLETE(model, prompt) vs ai_query(model, prompt)), but the pattern is identical: build a prompt string, pass it to the function, get text back. This means the skill transfers across platforms.

2. LLM outputs are probabilistic, so never use them as keys or in joins. Running the same AI_COMPLETE query twice can produce different trip classifications. This is fundamentally different from the deterministic transforms in Modules 2–4, where the same input always produces the same output. Treat AI-generated columns as advisory labels, not ground truth.

3. Prompt engineering replaces feature engineering. In traditional ML (Module 9) you selected features, tuned hyperparameters, and evaluated RMSE. Here, the quality of the result depends heavily on the model and on how you phrase the prompt. A small change in wording (“classify as one of: …” vs “describe the trip purpose”) can significantly change the output.

4. AI functions are slow at scale — each AI_COMPLETE call makes an API request to the LLM. Ten rows return in seconds; 10,000 rows could take minutes and cost real credits. In production, you would batch AI calls or apply them only to a sample, not run them on the entire Silver table.

5. The Databricks Assistant generated code, not data — it is a different use case from ai_query(). The Assistant helps you write PySpark, while ai_query() processes data inside a pipeline. Both use LLMs but solve different problems: developer productivity vs data enrichment.
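
As an illustration of point 4, one simple way to reduce the number of model calls is to call the model once per distinct prompt and reuse the answer for duplicates. The sketch below uses a stand-in function instead of a real LLM; `classify_unique` and `fake_llm` are hypothetical names, not platform APIs.

```python
from typing import Callable


def classify_unique(prompts: list[str], call_llm: Callable[[str], str]) -> list[str]:
    """Call the model once per distinct prompt and reuse the answer for duplicates."""
    cache: dict[str, str] = {}
    results = []
    for prompt in prompts:
        if prompt not in cache:
            cache[prompt] = call_llm(prompt)
        results.append(cache[prompt])
    return results


if __name__ == "__main__":
    calls = []

    def fake_llm(prompt: str) -> str:
        # Stand-in for AI_COMPLETE / ai_query; records each call for counting.
        calls.append(prompt)
        return f"label-for-{len(prompt)}"

    prompts = ["Midtown to JFK", "SoHo to Harlem", "Midtown to JFK"]
    print(classify_unique(prompts, fake_llm))
    print(f"{len(prompts)} rows, {len(calls)} model calls")
```

How much this saves depends on how repetitive the prompts are: coarser prompts (zone pair and hour, without exact distance or fare) repeat far more often than fully detailed ones.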

Expected Results

Exercise	What you should see
1 — Trip classification	1–2 word labels per row (`trip_purpose_ai`) for 5 Silver trips with zones, hour, distance — varies per run
2 — KPI insights	Bullet-style 3 key insights from weekday hourly KPI summary
3 — Anomaly detection	Each hour `normal` or `anomalous` with brief reason — low-trip hours (2–4 AM) often flagged
4 — Snowflake CoCo	Agentic NL search + SQL on Gold/Silver tables; may discover `kpi_*` and return results (Genie-like in Workspaces)
6 — Databricks Assistant	PySpark/SQL/code from prompts; Gold table summary lists ≥ 12 `kpi_*` tables
7 — `ai_query()` (`04_ai_features` Ex. 2)	Trip-purpose labels (1–2 words) for 10 Silver rows — Option A or B; probabilistic
8 — Genie (`04_ai_features` Ex. 3)	NL questions → suggested SQL + grid from `{attendee_id}_gold` `kpi_*` tables

AI outputs vary between runs

Because LLMs are probabilistic, running the same query twice may produce different classifications. This is expected behavior — never use AI-generated columns as primary keys or join conditions.
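
To put a number on that variability, you can run the same classification twice on the same rows and measure how often the labels match. A minimal sketch, with invented labels and a hypothetical helper `agreement_rate`:

```python
def agreement_rate(run_a: list[str], run_b: list[str]) -> float:
    """Share of rows where two runs of the same query returned the same label."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same rows in the same order")
    if not run_a:
        return 1.0
    # Compare case-insensitively and ignore surrounding whitespace.
    matches = sum(a.strip().lower() == b.strip().lower() for a, b in zip(run_a, run_b))
    return matches / len(run_a)


if __name__ == "__main__":
    # Invented labels from two hypothetical runs over the same four trips.
    first = ["Commute", "Airport", "Leisure", "Commute"]
    second = ["commute", "Airport", "Business", "Commute "]
    print(f"agreement: {agreement_rate(first, second):.0%}")
```

The comparison only makes sense if both runs cover the same rows. The exercise queries use LIMIT without ORDER BY, so the sampled rows themselves may differ between runs; fix the row set first (for example by ordering on a stable column) before comparing labels.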

Cleanup

No specific cleanup needed — AI queries are read-only and do not create new tables. If you created a Genie space you no longer need, delete it from the Genie page in Databricks.

Suspend your Snowflake warehouse when finished:

ALTER WAREHOUSE DE_WORKSHOP_WH SUSPEND;

Return to module

Module 6 — story wrapper
