Exercise: Machine Learning
YellowLine NYC story · full hands-on lab
title: “Exercise: Machine Learning” subtitle: “YellowLine NYC story · full hands-on lab” —
Estimated time: 45–50 min (Databricks sklearn: 20 min · Snowflake Cortex ML: 10 min · Snowpark ML: 10 min · dbt features: 5 min · Discussion: 5 min)
YellowLine NYC context (Module 9)
Predict tip amounts on credit-card NYC Taxi Silver.
Modules 2–3 required · Module 4 recommended for the dbt feature-table track. This is Cortex ML (Module 9) — not Module 6 Cortex LLM assistants.
Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. ML notebooks run in Databricks Runtime ML; Cortex runs in Snowflake worksheets.
This exercise requires Module 2 (Databricks) and Module 3 (Snowflake) Silver enriched tables to exist with data. Check with your trainer that the Gold KPI tables exist too (GOLD_TRIPS_BY_HOUR for Cortex ML.FORECAST).
Prerequisites
Base Exercise — Snowflake Cortex ML (SQL, ~20 min)
Start here — no Python, no setup. Cortex ML runs directly in a Snowsight SQL Worksheet.
Step 1: Run ML.FORECAST on trip demand
- Open a new Snowsight SQL Worksheet
- Open
ml/snowflake/01_cortex_ml.sql - Replace all
{ATTENDEE_ID}with your ID (uppercase) - Run Part A —
ML.FORECASTsection
SELECT *
FROM TABLE(
ML.FORECAST(
INPUT_DATA => SYSTEM$REFERENCE('TABLE',
'DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.GOLD_TRIPS_BY_HOUR'),
TIMESTAMP_COLNAME => 'PICKUP_HOUR_TS',
TARGET_COLNAME => 'TOTAL_TRIPS',
SERIES_COLNAME => 'PICKUP_BOROUGH',
CONFIG_OBJECT => {'prediction_interval': 24}
)
)
ORDER BY SERIES, TS;Verify: You should see rows with SERIES (borough), TS (future timestamp), FORECAST, LOWER_BOUND, UPPER_BOUND.
Questions: - Which borough has the highest forecasted demand for the next peak hour? - How wide are the confidence intervals (UPPER_BOUND - LOWER_BOUND)? What does a wider interval mean?
Step 2: Run ML.ANOMALY_DETECTION on average fares
Still in 01_cortex_ml.sql, run Part B — ML.ANOMALY_DETECTION.
First create the hourly fare view:
CREATE OR REPLACE VIEW DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.V_HOURLY_FARES AS
SELECT
DATE_TRUNC('hour', PICKUP_DATETIME) AS fare_hour,
PICKUP_BOROUGH,
AVG(FARE_AMOUNT) AS avg_fare
FROM DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.SILVER_NYC_TAXI_ENRICHED
WHERE FARE_AMOUNT > 0
GROUP BY 1, 2;Then run anomaly detection:
SELECT
SERIES AS pickup_borough,
TS AS fare_hour,
Y AS avg_fare,
ANOMALY_SCORE,
CASE WHEN ANOMALY THEN '⚠️ ANOMALY' ELSE '✅ Normal' END AS status
FROM TABLE(
ML.ANOMALY_DETECTION(
INPUT_DATA => SYSTEM$REFERENCE('VIEW',
'DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.V_HOURLY_FARES'),
TIMESTAMP_COLNAME => 'FARE_HOUR',
TARGET_COLNAME => 'AVG_FARE',
SERIES_COLNAME => 'PICKUP_BOROUGH',
LABEL_COLNAME => NULL
)
)
WHERE ANOMALY = TRUE
ORDER BY ANOMALY_SCORE DESC;Questions: - Which boroughs have anomalous fare hours? - Look up the anomalous timestamps — do they correspond to late night, holidays, or storms?
Base Exercise — Databricks sklearn + MLflow (~25 min)
Step 1: Run the training notebook
- Open
ml/databricks/01_ml_tip_prediction.pyin your Databricks workspace - Set
ATTENDEE_ID(should already be set from00_setup) - Adjust
SAMPLE_FRACTION = 0.10(leave as-is for speed) - Run all cells in order — training takes ~1–2 minutes
Watch for the output:
✅ MLflow Run ID: <run_id>
RMSE: $1.xx
MAE: $0.xx
R²: 0.xx
Step 2: Explore the MLflow Experiment
- In the Databricks sidebar: AI/ML → Experiments → find
tip_prediction_{ATTENDEE_ID} - Click your run — inspect:
- Parameters:
n_estimators,learning_rate,max_depth - Metrics: RMSE, MAE, R²
- Artifacts: the model pickle,
input_example.json
- Parameters:
- Click Feature Importances artifact (if present) or look at the notebook output
Questions: - Which feature has the highest importance? Does this make business sense? - Is fare_amount the top feature? Is that a problem? (Hint: it’s not leakage — tip is not part of fare — but it is highly correlated)
Step 3: Inspect the Gold predictions table
After the notebook finishes, run this SQL query in a new cell:
SELECT
pickup_borough,
ROUND(AVG(tip_amount), 2) AS avg_actual_tip,
ROUND(AVG(predicted_tip), 2) AS avg_predicted_tip,
ROUND(AVG(ABS(prediction_error)), 2) AS avg_abs_error,
COUNT(*) AS trips
FROM {catalog}.{attendee_id}_gold.ml_tip_predictions
GROUP BY pickup_borough
ORDER BY avg_actual_tip DESC;Questions: - In which borough are predictions most accurate (lowest avg_abs_error)? - In which borough does the model overestimate tips?
Base Exercise — dbt Feature Table (~10 min)
Step 1: Run the feature model
cd dbt_project
dbt run --select ml_features_tip_prediction
dbt test --select ml_features_tip_predictionAll 11 tests should pass. If any fail, look at: - dbt_utils.expression_is_true: tip_amount >= 0 — are there negative tips in the data? - accepted_values: payment_type_desc = [Credit Card] — are non-credit-card rows leaking through?
Step 2: View the lineage graph
dbt docs generate
dbt docs serveOpen the lineage graph — find ml_features_tip_prediction and click it. Trace the full path from Bronze source to the feature table.
Question: If a data engineer modifies silver_nyc_taxi_enriched (e.g., changes how time_of_day is calculated), how would the ML team know? (Answer: dbt lineage + CI tests)
Stretch Goal — Retrain with a different feature set
Stretch A: Remove fare_amount and observe the impact
fare_amount is the strongest predictor — but you might argue it’s “too close” to the target. In a real scenario, you might not know the fare until after the trip (though for NYC taxi you do).
- In
01_ml_tip_prediction.py, editFEATURE_COLS— remove"fare_amount" - Retrain the model
- Compare RMSE before and after in MLflow Experiments
- Check new feature importances — which feature rises to #1?
Expected: RMSE increases. This shows how much predictive power fare_amount adds.
Stretch Goal — Snowpark ML
Stretch B: Run Snowpark ML tip prediction
Install:
pip install snowflake-ml-python>=1.5.0Set environment variables:
export SNOWFLAKE_ACCOUNT=xy12345.eu-west-1 export SNOWFLAKE_USER=your_user export SNOWFLAKE_PASSWORD=your_password export ATTENDEE_ID=01_ALICERun:
python ml/snowflake/02_snowpark_ml_tip_prediction.pyIn Snowsight: Data → Databases → DE_MASTERCLASS → {schema} → Models — confirm
TIP_PREDICTOR_{ATTENDEE_ID}appears
Compare with Databricks: - Which RMSE is lower? - Which took longer to run? - In which approach did data leave Snowflake?
Stretch Goal — Databricks AutoML
Stretch C: Databricks AutoML
- In Databricks sidebar: AI/ML → Experiments → Create AutoML Experiment
- Prediction type: Regression
- Dataset:
{catalog}.{attendee_id}_silver.silver_nyc_taxi_enriched - Target column:
tip_amount - Metric: RMSE
- Click Start
After it completes (~5–10 minutes): - Which algorithm won (XGBoost, LightGBM, or sklearn)? - Compare the winning RMSE to your manual GBR — which is better? - Open the best-run notebook — what preprocessing did AutoML apply that you didn’t?
Compare Your Results
After running all three tools, fill in the table below with your own observed values. This is the core learning of the module — the same prediction problem, three very different experiences.
Observable differences from the exercises
| What you observed | Cortex ML (SQL) | Databricks (sklearn) | Snowflake (Snowpark ML) |
|---|---|---|---|
| Lines of code you wrote | ~5 (one SQL call) | ~80 (notebook) | ~60 (Python script) |
| Time to first result | ~10 seconds | ~2 minutes | ~1–2 minutes |
| Your RMSE / accuracy | N/A — forecasting, not regression | $______ | $______ |
| Did data leave Snowflake? | No | Yes — .toPandas() to driver |
No |
| Where did you see the model? | SQL result set | MLflow Experiments UI | Snowflake Model Registry |
| Could you see feature importances? | No | Yes — GBR built-in importance | Limited |
| Setup steps before running | 0 (just SQL) | Runtime ML cluster | pip install snowflake-ml-python |
1. Cortex ML is dramatically simpler — the entire forecast is one SELECT statement. No training loop, no model artifact, no hyperparameters. The tradeoff: you can only run FORECAST or ANOMALY_DETECTION — no other algorithms.
2. Databricks gives you the most visibility — MLflow Experiments shows every parameter, metric, and artifact. You can compare runs side-by-side. Cortex gives you results but no model introspection.
3. The RMSE from Databricks and Snowpark ML should be similar — because they use the same algorithm (GBR), same features, and same data. Small differences come from sampling randomness and warehouse compute differences.
4. Data movement is the hidden cost in Databricks — the .toPandas() call transfers data from the cluster to the driver. For 10% of the Silver table this is fast, but at 100% or on a larger dataset it becomes a bottleneck. Snowpark ML avoids this entirely.
5. The dbt tests caught the credit card filter — if you forgot and tried to include cash trips, the accepted_values: [Credit Card] test would catch it before training.
Discussion questions for the group
- Which approach would you choose for a production weekly tip prediction job? Consider: maintenance burden, model visibility, data governance.
- Cortex ML.FORECAST needed a Gold aggregation table as input (
GOLD_TRIPS_BY_HOUR). Why can’t it use the raw Silver trip table directly? - MLflow vs Snowflake Model Registry — both track models. What would make you choose one over the other in a real project?
- If your company’s data scientists use Python notebooks daily (Databricks), but your data platform is Snowflake — which training approach do you use and why?
- dbt ran in ~5 seconds; training took ~2 minutes — if the feature table changes daily, when should
dbt runandmodel.fit()be sequenced in a production pipeline?
Expected Results
| Exercise | Expected output |
|---|---|
| Cortex ML.FORECAST | 24 rows per borough with forecast + confidence bounds |
| Cortex ML.ANOMALY_DETECTION | Handful of anomalous fare hours (usually late-night outliers) |
| Databricks RMSE | ~$1.00–$2.00 (depends on sample size and features) |
| Databricks R² | ~0.60–0.80 for credit card trips |
| dbt tests | 11 passing tests on ML feature table |
| Snowpark ML RMSE | Similar to Databricks (same algorithm, same data) |
Clean Up
Databricks:
DROP TABLE IF EXISTS {catalog}.{attendee_id}_gold.ml_tip_predictions;Snowflake:
DROP TABLE IF EXISTS DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.ML_TRIP_DEMAND_FORECAST;
DROP TABLE IF EXISTS DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.ML_TIP_PREDICTIONS_SNOWPARK;
DROP VIEW IF EXISTS DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.V_HOURLY_FARES;
DROP MODEL IF EXISTS DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.TIP_PREDICTOR_{ATTENDEE_ID};Return to module
Source: merged from frozen workshop-2026-v1/exercises/ex-ml.qmd — do not edit workshop-2026-v1/.