Machine Learning Setup Guide

Databricks ML Runtime, Snowflake Cortex, and Snowpark ML for Module 9 (Optional)


Note: Optional Module

This setup is only needed if you plan to run Module 9 (Machine Learning). Modules 2 and 3 (Databricks and Snowflake batch pipelines) must be completed first — the ML module reads from the Silver enriched tables produced by those pipelines.

Prerequisites

Before starting Module 9:

If any of these are missing, run the Bronze → Silver → Gold notebooks from Modules 2 and 3 first.


Databricks Setup

Runtime requirement

The ML notebooks require Databricks Runtime ML (not standard Runtime).

How to check your cluster runtime:

  1. Compute → select your cluster → Edit
  2. Under Databricks Runtime Version, look for a version ending with ML — e.g., 15.4 LTS ML
  3. If you have standard Runtime, create a new cluster with an ML runtime

Tip: What’s in the ML Runtime?

Databricks Runtime ML pre-installs: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, MLflow. You don’t need to pip install any of these.

Verify scikit-learn, MLflow, and XGBoost are available

In a notebook cell:

import sklearn
import mlflow
import xgboost

print(f"sklearn: {sklearn.__version__}")
print(f"mlflow:  {mlflow.__version__}")
print(f"xgboost: {xgboost.__version__}")

Expected: all three import without error.
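
If one import fails, the traceback stops at the first missing package. The following is a minimal sketch, using only the Python standard library, that reports every missing package in one pass. The helper name `find_missing` and the `REQUIRED` list are illustrative and not part of the course notebooks.

```python
from importlib import util

# Import names used by the check above; extend as needed.
REQUIRED = ["sklearn", "mlflow", "xgboost"]


def find_missing(modules):
    """Return the import names from `modules` that cannot be located."""
    return [name for name in modules if util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty result suggests the cluster is on an ML runtime (or that the fallback install has already been run); any names listed point to packages that still need installing.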

MLflow Experiment access

The ML notebook logs runs to MLflow Experiments (built into Databricks).

  1. Click Experiments in the left sidebar
  2. Verify you can create an experiment (or that a shared one exists)
  3. After training, your run appears here with metrics, params, and model artifacts
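
To confirm access before training anything, one option is to log a throwaway run. This is a sketch rather than the course notebook's code: the function name `log_smoke_run`, the run name, and the example experiment path are assumptions. The optional `mlflow_module` argument exists only so the function can be exercised with a stand-in object outside Databricks.

```python
def log_smoke_run(experiment_name, mlflow_module=None):
    """Log one tiny run to `experiment_name` and return its run ID.

    Leave `mlflow_module` as None in a notebook; the real mlflow
    package is imported in that case.
    """
    if mlflow_module is None:
        import mlflow as mlflow_module  # pre-installed on Databricks Runtime ML

    # Creates the experiment if it does not exist yet.
    mlflow_module.set_experiment(experiment_name)
    with mlflow_module.start_run(run_name="setup-smoke-test") as run:
        mlflow_module.log_param("purpose", "setup check")
        mlflow_module.log_metric("dummy_metric", 1.0)
        return run.info.run_id


# Example call (hypothetical path; Databricks experiment names are workspace paths):
# run_id = log_smoke_run("/Users/<your-email>/module9-setup-check")
```

If the call succeeds, the run should appear under Experiments with one parameter and one metric.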

Fallback: standard Runtime

If only standard Runtime is available, install the required packages at the top of the notebook:

%pip install scikit-learn xgboost mlflow pandas matplotlib

Snowflake Setup

1. Grant the CORTEX_USER database role (Trainer action)

Cortex ML Functions require the SNOWFLAKE.CORTEX_USER database role. The trainer runs this once on the shared account:

USE ROLE ACCOUNTADMIN;
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE SYSADMIN;

After this, all users with SYSADMIN (or their own role inheriting from it) can call ML.FORECAST and ML.ANOMALY_DETECTION.

2. Verify Cortex ML access

Run this in a Snowsight worksheet to confirm Cortex is enabled:

SELECT AI_COMPLETE('mistral-large2', 'Say hello') AS test;

Expected: a short text response. If you get an ACCESS_DENIED error, ask the trainer to grant USE AI FUNCTIONS and CORTEX_USER to your role.

3. Verify ML training data

-- Verify Silver enriched table exists with data
SELECT
    COUNT(*)                          AS total_rows,
    COUNT(CASE WHEN PAYMENT_TYPE_DESC = 'Credit Card' THEN 1 END) AS credit_card_rows,
    AVG(TIP_AMOUNT)                   AS avg_tip
FROM DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.SILVER_NYC_TAXI_ENRICHED;

Expected: total_rows > 200,000 and credit_card_rows > 100,000.

-- Verify Gold hourly trips table (needed for Cortex FORECAST)
SELECT COUNT(*), MIN(PICKUP_HOUR_TS), MAX(PICKUP_HOUR_TS)
FROM DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.GOLD_TRIPS_BY_HOUR;

Expected: at least 500 rows spanning multiple days.
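
The expected values from both queries can also be checked from Python. This is a minimal sketch: the helper names `table_names` and `check_counts` are hypothetical, the thresholds are copied from the expectations stated in this section, and the commented usage assumes an existing Snowpark `session`.

```python
DATABASE = "DE_MASTERCLASS"


def table_names(attendee_id):
    """Build the fully qualified Silver and Gold table names for one attendee."""
    return {
        "silver": f"{DATABASE}.{attendee_id}_SILVER.SILVER_NYC_TAXI_ENRICHED",
        "gold": f"{DATABASE}.{attendee_id}_GOLD.GOLD_TRIPS_BY_HOUR",
    }


def check_counts(total_rows, credit_card_rows, hourly_rows):
    """Compare row counts with the expected minimums; return a list of problems."""
    problems = []
    if total_rows <= 200_000:
        problems.append(f"Silver total_rows is {total_rows}, expected > 200,000")
    if credit_card_rows <= 100_000:
        problems.append(f"credit_card_rows is {credit_card_rows}, expected > 100,000")
    if hourly_rows < 500:
        problems.append(f"Gold hourly rows is {hourly_rows}, expected >= 500")
    return problems


# Usage with an existing Snowpark session (assumed, not created here):
# from snowflake.snowpark.functions import col
# names = table_names("your-attendee-id")
# silver = session.table(names["silver"])
# problems = check_counts(
#     silver.count(),
#     silver.filter(col("PAYMENT_TYPE_DESC") == "Credit Card").count(),
#     session.table(names["gold"]).count(),
# )
```

An empty list means the row counts meet the expectations; otherwise each entry names the count that fell short. The check does not cover the "multiple days" condition, which still needs the MIN and MAX timestamps from the query above.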

4. Snowpark ML — local Python (optional)

If you want to run Snowpark ML scripts locally (not in Snowsight notebooks):

uv pip install --system "snowflake-snowpark-python[pandas]>=1.14.0" snowflake-ml-python

Configure connection:

from snowflake.snowpark import Session

# Placeholder: replace with the attendee ID assigned to you for the workshop.
ATTENDEE_ID = "your-attendee-id"

connection_params = {
    "account":   "your-account-id",
    "user":      "your-username",
    "password":  "your-password",
    "warehouse": "DE_WORKSHOP_WH",
    "database":  "DE_MASTERCLASS",
    "schema":    f"{ATTENDEE_ID}_SILVER",
    "role":      "SYSADMIN",
}
session = Session.builder.configs(connection_params).create()

Note: In Snowsight notebooks

Snowpark Python is pre-installed in Snowsight notebooks — no local setup needed. Open your Snowflake account → Projects → Notebooks → create a new notebook.
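
For local runs, the connection parameters can be read from environment variables so that no password ends up in a script. This is a sketch under assumptions: the variable names (`SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`, `ATTENDEE_ID`) and the helper `connection_params_from_env` are hypothetical, while the warehouse, database, and role values are the ones used in this guide.

```python
import os

# Hypothetical variable names; set them in your shell before running.
REQUIRED_VARS = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD", "ATTENDEE_ID")


def connection_params_from_env(env=None):
    """Build the Snowpark connection dict from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "account": env["SNOWFLAKE_ACCOUNT"],
        "user": env["SNOWFLAKE_USER"],
        "password": env["SNOWFLAKE_PASSWORD"],
        "warehouse": "DE_WORKSHOP_WH",
        "database": "DE_MASTERCLASS",
        "schema": f"{env['ATTENDEE_ID']}_SILVER",
        "role": "SYSADMIN",
    }


# Usage (requires snowflake-snowpark-python):
# from snowflake.snowpark import Session
# session = Session.builder.configs(connection_params_from_env()).create()
```

Failing early with the list of missing variables tends to be easier to diagnose than a connection error raised later by the driver.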


dbt Setup for ml_features Model

The dbt ML model (ml/dbt/models/ml_features_tip_prediction.sql) materialises the feature table as a standard dbt table model — no special config needed.

Run it with your existing dbt Snowflake profile:

cd dbt_project/
dbt run --target snowflake --select ml_features_tip_prediction
dbt test --target snowflake --select ml_features_tip_prediction

Verify:

SELECT COUNT(*), AVG(fare_amount), AVG(tip_amount)
FROM DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.ML_FEATURES_TIP_PREDICTION;

Day-of Checklist

Before Module 9 starts:


Troubleshooting

| Issue | Solution |
|---|---|
| ModuleNotFoundError: sklearn on Databricks | Switch to ML Runtime (cluster edit → Runtime version) |
| MLflow experiment not visible | Click Experiments in sidebar; first run creates it automatically |
| Cortex ACCESS_DENIED | Ask trainer to grant CORTEX_USER to your role |
| Cortex ML.FORECAST returns 0 rows | Check GOLD_TRIPS_BY_HOUR is not empty; verify column names match |
| Snowpark Session.builder fails | Check account ID format: orgname-accountname or abc12345.west-europe.azure |
| dbt model fails with missing source | Run Silver and Gold batch pipeline first (Modules 2–4) |
| tip_amount all zeros in predictions | Filter to payment_type_desc = 'Credit Card'; cash tips are always 0 |
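
A common cause of the Session.builder failure is pasting the full Snowflake URL where only the account identifier is expected. The sketch below strips a pasted URL down to the identifier; the helper name `normalize_account` is hypothetical, and it does not verify that the account actually exists.

```python
def normalize_account(raw):
    """Reduce a pasted Snowflake URL or host name to the bare account identifier."""
    account = raw.strip()
    # Drop the URL scheme if one was pasted.
    for prefix in ("https://", "http://"):
        if account.lower().startswith(prefix):
            account = account[len(prefix):]
    # Drop any path after the host name.
    account = account.split("/", 1)[0]
    # Drop the Snowflake domain, which is not part of the identifier.
    suffix = ".snowflakecomputing.com"
    if account.lower().endswith(suffix):
        account = account[: -len(suffix)]
    return account


print(normalize_account("https://myorg-myaccount.snowflakecomputing.com/"))
```

Identifiers already in one of the two accepted formats pass through unchanged.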