Glossary
Key terms and concepts
A
- ADLS2 (Azure Data Lake Storage Gen2)
- Azure’s scalable cloud storage service optimized for big data analytics. Supports hierarchical namespace and the abfss:// protocol.
- Asset Bundles (DABs)
- Databricks’ infrastructure-as-code tool for deploying pipelines, jobs, and notebooks via CLI. See databricks bundle deploy.
B
- Bronze Layer
- First layer of medallion architecture. Raw data ingested as-is from source, with metadata added. No transformations applied.
C
- Catalog (Unity Catalog)
- Top-level namespace in Databricks: Catalog > Schema > Table. Provides centralized governance, auditing, and lineage.
- Checkpoint
- A file on cloud storage that records the exact offset a streaming query last processed. Enables exactly-once semantics and fault recovery in Structured Streaming.
- COPY INTO
- Snowflake SQL command for bulk loading data from external stages into tables. Efficient for Parquet, CSV, JSON.
- Cortex AI
- Snowflake’s suite of AI features including AI_COMPLETE(), AI_EXTRACT(), Cortex Code, and Cortex Analyst.
- Cortex Analyst
- Snowflake service that translates natural language questions into SQL against governed semantic models (YAML definitions of tables and metrics).
- Cortex Search
- Snowflake hybrid search service (semantic + keyword) over unstructured and structured content. Enables RAG patterns within Snowflake.
- Cross-validation
- ML evaluation technique that splits data into k folds, training on k-1 and testing on 1, rotating through all folds. More robust than a single train/test split.
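The fold rotation described in the Cross-validation entry can be sketched in plain Python. This is a minimal illustration, not the implementation used in the training: the function name `k_fold_indices` is hypothetical, folds are contiguous, and no shuffling is applied.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        # Everything outside the current test fold is training data.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


if __name__ == "__main__":
    for train, test in k_fold_indices(n_samples=10, k=5):
        print("train:", train, "test:", test)
```

Every sample lands in exactly one test fold, which is why the averaged score is less sensitive to a lucky or unlucky split than a single train/test partition.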
D
- Data Leakage
- Including information in ML features that wouldn’t be available at prediction time or that mathematically encodes the target. The #1 ML mistake (e.g., total_amount includes tip_amount).
- dbt (data build tool)
- Transformation framework that turns SQL SELECT statements into managed tables/views with testing, documentation, and lineage.
- Delta Lake
- Open-source storage format providing ACID transactions, time travel, and schema evolution on top of Parquet files.
- Delta Live Tables (DLT)
- Databricks’ declarative ETL framework using @dlt.table decorators. Handles orchestration, quality enforcement, and lineage automatically.
- Dynamic Table (Snowflake)
- A Snowflake object defined by a SQL SELECT that refreshes automatically based on TARGET_LAG. Declarative streaming: you write SQL, Snowflake manages incremental refresh.
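To make the Data Leakage entry above concrete, here is a small sketch of a check for a feature that arithmetically encodes the target. The rows are made up, the helper `encodes_target` is hypothetical, and the assumption that total_amount equals fare_amount plus tip_amount is a simplification for illustration (real trip totals may include other charges).

```python
from typing import Dict, List

Row = Dict[str, float]

# Hypothetical trip rows; the values are invented for this example.
TRIPS: List[Row] = [
    {"fare_amount": 10.0, "tip_amount": 2.0, "total_amount": 12.0},
    {"fare_amount": 20.0, "tip_amount": 0.0, "total_amount": 20.0},
    {"fare_amount": 7.5, "tip_amount": 1.5, "total_amount": 9.0},
]


def encodes_target(rows: List[Row], candidate: str, known: str,
                   target: str, tol: float = 1e-9) -> bool:
    """Return True if candidate - known equals target on every row."""
    return all(abs(r[candidate] - r[known] - r[target]) <= tol for r in rows)


if __name__ == "__main__":
    # total_amount - fare_amount reproduces tip_amount exactly, so a model
    # predicting tip_amount must not see total_amount as a feature.
    print(encodes_target(TRIPS, "total_amount", "fare_amount", "tip_amount"))
```

A check like this only catches exact arithmetic leakage; features that are merely unavailable at prediction time need a review of when each column is populated.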
E
- ELT (Extract, Load, Transform)
- Pattern where raw data is loaded first, then transformed in-place within the data warehouse. Used by all three pipelines in this training.
- Exactly-once Semantics
- Guarantee that each event is processed exactly once, even during failures. Achieved in Structured Streaming via checkpoint files + Delta Lake transaction log.
- External Stage
- Snowflake object pointing to a cloud storage location (ADLS2, S3, GCS) for reading external files.
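The interplay of checkpoint and sink described in the Checkpoint and Exactly-once Semantics entries can be illustrated with a toy in pure Python. This is a sketch of the idea only, not how Structured Streaming or Delta Lake implement it: the function names, the JSON checkpoint layout, and the dictionary sink are all assumptions made for the example.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional


def read_checkpoint(path: str) -> int:
    """Return the next offset to process, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_offset"]


def write_checkpoint(path: str, next_offset: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp, path)


def run(events: List[str], sink: Dict[int, str], path: str,
        crash_before_commit_at: Optional[int] = None) -> None:
    """Process events starting from the checkpointed offset."""
    for offset in range(read_checkpoint(path), len(events)):
        # The sink is keyed by offset, so replaying an event overwrites
        # the same entry instead of creating a duplicate.
        sink[offset] = events[offset].upper()
        if offset == crash_before_commit_at:
            raise RuntimeError("simulated crash after write, before checkpoint")
        write_checkpoint(path, offset + 1)


if __name__ == "__main__":
    events = ["a", "b", "c", "d"]
    sink: Dict[int, str] = {}
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "checkpoint.json")
        try:
            run(events, sink, ckpt, crash_before_commit_at=2)
        except RuntimeError:
            pass
        run(events, sink, ckpt)  # restart resumes at the last committed offset
        print(sink)
```

The event at the crash point is processed twice, but because the write is idempotent the result contains it once. The checkpoint alone gives at-least-once delivery; the idempotent or transactional sink is what turns that into exactly-once results.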
F
- Feature Engineering
- Process of creating input variables (features) for ML models from raw data. In this training, dbt defines the canonical feature table that both Databricks and Snowflake training pipelines consume.
G
- Genie Space
- Databricks AI/BI feature. Natural language interface over data tables — ask questions in plain English, get SQL + results.
- Gold Layer
- Third layer of medallion architecture. Pre-aggregated business KPIs and metrics optimized for reporting and dashboards.
I
- Incremental Materialization
- dbt strategy that appends or merges only new rows instead of rebuilding the entire table. Requires is_incremental() logic and a unique_key. Essential for large Silver tables.
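The merge behavior behind Incremental Materialization can be sketched outside dbt. The following is a conceptual Python analogy, not dbt code: the function `incremental_merge`, the `updated_at` column, and the sample rows are hypothetical, and real dbt delegates the merge to the warehouse.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def incremental_merge(target: List[Row], source: Iterable[Row],
                      unique_key: str, updated_at: str) -> List[Row]:
    """Upsert only source rows newer than the target's high-water mark."""
    # An empty target corresponds to the first run, which loads everything.
    high_water = max((r[updated_at] for r in target), default=None)
    merged = {r[unique_key]: r for r in target}
    for row in source:
        if high_water is None or row[updated_at] > high_water:
            # Insert a new key or overwrite the existing row for that key.
            merged[row[unique_key]] = row
    return list(merged.values())


if __name__ == "__main__":
    target = [
        {"trip_id": 1, "updated_at": 10, "fare": 5.0},
        {"trip_id": 2, "updated_at": 11, "fare": 7.0},
    ]
    source = target + [
        {"trip_id": 2, "updated_at": 12, "fare": 8.0},  # changed row
        {"trip_id": 3, "updated_at": 13, "fare": 9.0},  # new row
    ]
    print(incremental_merge(target, source, "trip_id", "updated_at"))
```

The high-water-mark filter plays the role of the is_incremental() branch, and the dictionary keyed by unique_key plays the role of the merge: without the key, the changed row would be appended as a duplicate.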
J
- Jinja
- Templating engine embedded in dbt SQL files. Enables conditional compilation (target.type), reusable macros, ref()/source() dependency management, and configuration blocks.
K
- KPI (Key Performance Indicator)
- Pre-computed business metric. This training computes 12 KPIs from taxi trip data (trips by hour, top zones, revenue, etc.).
M
- Macro (dbt)
- Reusable SQL function defined in .sql files under macros/. Similar to functions in Python: accepts parameters and returns SQL snippets. Example: {{ time_of_day('pickup_hour') }}.
- Materialization
- Strategy for how a dbt model becomes a database object. Types: view, table, incremental, ephemeral, materialized_view. Controls the tradeoff between freshness, cost, and complexity.
- Medallion Architecture
- Multi-layer data design pattern: Bronze (raw) → Silver (cleaned) → Gold (aggregated). Also called “multi-hop architecture.”
- MLflow
- Open-source platform for ML lifecycle management. Tracks experiments (hyperparameters, metrics, model artifacts). Databricks includes MLflow autolog for automatic experiment capture.
- Model (dbt)
- A SQL SELECT statement that dbt materializes as a table or view. Each .sql file in models/ is one model.
P
- PySpark
- Python API for Apache Spark. Used in Databricks notebooks for distributed data processing.
R
- ref() (dbt)
- dbt function that creates a dependency between models. {{ ref('my_model') }} references another model and builds the DAG.
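How ref() calls turn into a build order can be sketched with the standard library. This is an illustration of the idea, not dbt's parser: the regular expression, the model names, and the SQL strings are hypothetical, and only the source name comes from this glossary.

```python
import re
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical model SQL keyed by model name.
MODELS: Dict[str, str] = {
    "gold_trips_by_hour": "select pickup_hour, count(*) from {{ ref('silver_trips') }} group by 1",
    "gold_revenue": "select sum(total_amount) from {{ ref('silver_trips') }}",
    "silver_trips": "select * from {{ source('bronze', 'trips') }}",
}


def build_dag(models: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each model to the set of models it references via ref()."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def build_order(models: Dict[str, str]) -> List[str]:
    """Return the models ordered so that every dependency is built first."""
    return list(TopologicalSorter(build_dag(models)).static_order())


if __name__ == "__main__":
    print(build_order(MODELS))
```

Because dependencies are declared inside the SQL itself, the order of files on disk does not matter; the graph is recovered from the ref() calls.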
S
- SAS Token
- Shared Access Signature — a URL-based authentication method for Azure Storage. Used for Snowflake external stage access in trial accounts.
- Silver Layer
- Second layer of medallion architecture. Cleaned, validated, enriched data with derived metrics. Trustworthy data for analysis.
- Slim CI
- dbt testing pattern that only tests changed models + their downstream dependencies instead of the entire project. Uses --select state:modified+.
- Snowpark
- Snowflake’s Python DataFrame API. Runs Python code natively on Snowflake compute. API is similar to PySpark.
- Snowflake Intelligence
- Umbrella brand (announced 2025) bundling all Cortex AI capabilities — LLM functions, Cortex Analyst, Cortex Search, and ML functions — into a unified AI layer.
- source() (dbt)
- dbt function referencing a raw table not managed by dbt. {{ source('bronze', 'trips') }} references an external source table.
- Stream (Snowflake)
- Snowflake object that tracks changes (inserts, updates, deletes) on a table. Used for change data capture (CDC).
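The selection logic behind the Slim CI entry above (changed models plus everything downstream) amounts to a graph traversal. The sketch below shows that traversal in plain Python; the graph, the model names, and the function `modified_plus_downstream` are hypothetical, and dbt itself determines the modified set by comparing against a previous run's artifacts.

```python
from collections import deque
from typing import Dict, Iterable, Set

# Hypothetical parent -> children edges of a small project.
CHILDREN: Dict[str, Set[str]] = {
    "bronze_trips": {"silver_trips"},
    "silver_trips": {"gold_trips_by_hour", "gold_revenue"},
    "gold_trips_by_hour": set(),
    "gold_revenue": set(),
    "silver_zones": {"gold_top_zones"},
    "gold_top_zones": set(),
}


def modified_plus_downstream(children: Dict[str, Set[str]],
                             modified: Iterable[str]) -> Set[str]:
    """Return the modified models and all of their descendants."""
    selected = set(modified)
    queue = deque(selected)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


if __name__ == "__main__":
    # Only the silver_trips branch is selected; the zones branch is skipped.
    print(sorted(modified_plus_downstream(CHILDREN, {"silver_trips"})))
```

Upstream models and unrelated branches are left out, which is where the time saving over a full project run comes from.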
T
- Task (Snowflake)
- Snowflake object that runs SQL on a schedule or when triggered by a condition. Used with Streams for event-driven pipelines.
U
- Unity Catalog
- Databricks’ centralized governance layer. Manages catalogs, schemas, tables, permissions, lineage, and data sharing.
V
- Virtual Warehouse (Snowflake)
- Snowflake compute resource. Auto-suspends when idle, auto-resumes on query. Sizes from X-Small to 6X-Large.
W
- Watermark
- In streaming, a threshold that tells the engine to ignore events whose event time is more than X minutes behind the latest event time seen. Enables window finalization and state memory cleanup. Tradeoff: a shorter watermark gives lower latency but drops more late events.
- Workflows (Databricks)
- Databricks’ job scheduler. Runs notebooks, DLT pipelines, or Python scripts as multi-task DAGs with dependencies and retries.
Y
- YellowLine NYC
- Fictional NYC yellow-taxi operator used as the training client story (inspired by TLC public trip data, not a real company). Marcus Chen (Operations Manager) works for YellowLine NYC; MHP is the consulting vendor engaged to build analytics. On-screen dispatch UI and Marcus’s badge use YellowLine branding, not MHP.
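The tradeoff named in the Watermark entry above can be seen in a toy simulation. This is a simplified sketch, not the Structured Streaming algorithm (which advances the watermark per micro-batch and drops events by window rather than by event): the function `windowed_counts` and the sample event times are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def windowed_counts(event_times: Iterable[int], window: int,
                    delay: int) -> Tuple[Dict[int, int], List[int]]:
    """Count events per tumbling window, dropping events behind the watermark.

    Times are in minutes. The watermark is the maximum event time seen so far
    minus the allowed delay.
    """
    counts: Dict[int, int] = defaultdict(int)
    dropped: List[int] = []
    max_seen: Optional[int] = None
    for t in event_times:
        if max_seen is not None and t < max_seen - delay:
            dropped.append(t)  # too late: older than the current watermark
            continue
        counts[(t // window) * window] += 1  # key is the window start time
        max_seen = t if max_seen is None else max(max_seen, t)
    return dict(counts), dropped


if __name__ == "__main__":
    arrivals = [1, 12, 3, 25, 11]  # event times in arrival order
    print(windowed_counts(arrivals, window=10, delay=10))
    print(windowed_counts(arrivals, window=10, delay=5))
```

With the shorter delay the late event at minute 3 is dropped as well, so the first window's count is lower: less waiting and less state, at the cost of completeness.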