Building Medallion Architectures in Databricks: What Actually Works
Every team that adopts Databricks eventually faces the same question: what is the right way to build a Bronze-Silver-Gold pipeline? There are at least half a dozen legitimate approaches, each with its own trade-offs, and the documentation rarely puts them side by side.
So I built the same pipeline seven different ways — same CSV data, same source schema, same Kimball star schema at the end — and compared the results. The entire project is deployed as a Databricks Asset Bundle, and the code is in this repo.
This post will walk you through each one, from the most manual to the most declarative. If you just want the recommendation, skip to the end — but the journey is worth it.
The test case
The dataset is deliberately small: customers, products, orders, and order lines, split across two batches. Batch 1 is the initial load. Batch 2 introduces an SCD2 trigger — two customers change their details, one new customer appears, and there is a new order.
The gold layer is a classic Kimball dimensional model: dim_customer (with SCD2 history), dim_product, dim_date, and fact_order_line. After all batches are processed, every approach must produce exactly 8 customer rows (5 current + 2 historical + 1 new), 5 products, 91 dates (full months Jan–Mar 2024), and 11 fact rows.
If the numbers don't match, something is wrong with the implementation. That constraint keeps the comparison honest.
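In practice that check is just a handful of count queries run once the second batch has been processed. A minimal sketch, assuming the gold tables are named as described above:
-- Expected row counts after batch 2
SELECT 'dim_customer'    AS table_name, COUNT(*) AS actual, 8  AS expected FROM dim_customer
UNION ALL SELECT 'dim_product',     COUNT(*), 5  FROM dim_product
UNION ALL SELECT 'dim_date',        COUNT(*), 91 FROM dim_date
UNION ALL SELECT 'fact_order_line', COUNT(*), 11 FROM fact_order_line;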
1. Python notebooks — the manual baseline
Code: bronze.py, silver.py, gold.py
This is where most teams start. Three PySpark notebooks — one per layer — doing everything by hand. Bronze reads CSVs and tags them with metadata. Silver deduplicates (latest batch wins via row_number over a window), casts types, standardizes text. Gold builds the dimensional model.
from pyspark.sql import functions as F, Window

# Deduplicate: the latest batch wins for each customer.
window = (Window
    .partitionBy("customer_id")
    .orderBy(F.col("_batch_id").desc()))

silver_customers = (bronze_customers
    .withColumn("rn", F.row_number().over(window))
    .where("rn = 1"))
The SCD2 logic in Gold is the most painful part. You end up manually comparing batch snapshots, building historical rows, and stitching together valid_from / valid_to / is_current. It works, but it is verbose and fragile.
The upside is full control. You can do whatever you want. The downside is that you have to do everything yourself — including things the platform can handle for you.
2. SQL notebooks with COPY INTO
Code: bronze.sql, silver.sql, gold.sql
Same architecture, but everything is SQL. Bronze uses COPY INTO instead of spark.read. Silver is a CREATE OR REPLACE TABLE AS SELECT with CTE-based dedup. Gold builds SCD2 with LEFT ANTI JOIN and change detection — still manual, but arguably more readable if your team thinks in SQL.
COPY INTO bronze_orders
FROM '/Volumes/.../orders'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');
For SQL-heavy teams, this approach feels natural. Code reviews are easier. But the fundamental problem is the same: orchestration is still task-by-task, incremental processing is manual, and SCD2 requires the same careful merge logic.
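To make that concrete, the manual pattern boils down to two statements per dimension. This is a condensed sketch, not the exact code in gold.sql: the tracked columns are illustrative, and the valid_from / valid_to handling is simplified to current_date().
-- 1) Close the current row for customers whose tracked attributes changed
MERGE INTO dim_customer AS d
USING silver_customers AS s
  ON d.customer_id = s.customer_id AND d.is_current = true
WHEN MATCHED AND (d.email <> s.email OR d.address <> s.address OR d.segment <> s.segment)
  THEN UPDATE SET d.is_current = false, d.valid_to = current_date();

-- 2) Insert new current rows: brand-new customers plus new versions of the ones just closed
INSERT INTO dim_customer
SELECT s.*, current_date() AS valid_from, CAST(NULL AS DATE) AS valid_to, true AS is_current
FROM silver_customers AS s
LEFT ANTI JOIN dim_customer AS d
  ON s.customer_id = d.customer_id AND d.is_current = true;
The second statement works because, after the merge, changed customers and brand-new customers no longer have a current row, so the anti join picks up exactly the rows that need a fresh version.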
3. Materialized Views + Streaming Tables
Code: setup.sql, scd2_merge.sql
This is the first step toward declarative thinking. Instead of writing imperative ETL, you tell Databricks what the table should look like, and it figures out how to get there.
CREATE OR REFRESH STREAMING TABLE bronze_orders AS
SELECT *
FROM STREAM read_files('/Volumes/.../orders', format => 'csv');
CREATE OR REPLACE MATERIALIZED VIEW silver_orders AS
SELECT ...
FROM bronze_orders;
Bronze tables become streaming tables (Auto Loader under the hood — it tracks which files have been processed). Silver tables become materialized views that Databricks refreshes automatically. The SQL is remarkably short and clean.
The catch? SCD2. Materialized views do not natively support slowly changing dimensions. You need a separate MERGE notebook as a workaround, which breaks the clean declarative pattern. If your use case does not require SCD2, this approach is hard to beat for simplicity. If it does, you end up mixing paradigms.
4. dbt-core on Databricks
Code: src/dbt_project
dbt brings a different philosophy. Instead of thinking about pipelines, you think about models — SQL SELECT statements that define what a table should contain. dbt handles dependency ordering, testing, and documentation.
SCD2 is handled via dbt snapshots, which compare row states between runs:
{% snapshot snap_dim_customer %}
{{ config(strategy='check', unique_key='customer_id',
          check_cols=['email', 'address', 'city', 'country', 'segment']) }}
SELECT * FROM {{ ref('silver_customers') }}
{% endsnapshot %}
The workflow is a bit unusual. Because dbt snapshots capture changes between runs (not between batches), the bundle runs a two-phase process: load batch 1 and snapshot it, then load batch 2 and snapshot again, and only then build gold. The valid_from / valid_to values are derived from _batch_id rather than snapshot timestamps, since we need deterministic business dates, not execution timestamps.
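The gold dimension is then mostly a projection of the snapshot. A minimal sketch (dbt_valid_from / dbt_valid_to are dbt's standard snapshot metadata columns; the model path and the straight column mapping are illustrative, since the repo swaps in batch-derived dates as described above):
-- models/gold/dim_customer.sql (hypothetical path)
SELECT
    customer_id,
    email,
    address,
    city,
    country,
    segment,
    dbt_valid_from       AS valid_from,
    dbt_valid_to         AS valid_to,
    dbt_valid_to IS NULL AS is_current
FROM {{ ref('snap_dim_customer') }}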
The main advantage of dbt is organizational. It enforces a project structure, ref-based dependencies, and a testing culture (not_null, unique, relationships). If your team already uses dbt, or you value portable SQL skills, this is a strong choice.
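Beyond the YAML-defined generic tests, dbt also accepts singular tests written as plain SQL, where any returned row counts as a failure. A hypothetical example guarding the SCD2 invariant:
-- tests/assert_one_current_row_per_customer.sql (hypothetical)
SELECT customer_id
FROM {{ ref('dim_customer') }}
WHERE is_current
GROUP BY customer_id
HAVING COUNT(*) > 1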
The trade-off: you are adding a tool. There is a project to maintain (dbt_project.yml, profiles.yml, schema.yml), and dbt's model of the world does not map perfectly to streaming or pipeline-native features in Databricks. If you want to try it, the dbt-databricks adapter is the starting point.
5. Delta Live Tables — the classic syntax
Code: pipeline.sql
Delta Live Tables (DLT) was Databricks' first real attempt at declarative pipelines. It introduced some genuinely useful ideas: built-in data quality expectations, native SCD2 via APPLY CHANGES, and a managed runtime that handles incremental processing for you.
CREATE STREAMING LIVE TABLE bronze_customers AS
SELECT * FROM STREAM read_files('/Volumes/.../customers', format => 'csv');

-- The SCD2 target must be declared before changes can be applied into it.
CREATE OR REFRESH STREAMING LIVE TABLE dim_customer;

APPLY CHANGES INTO LIVE.dim_customer
FROM STREAM(LIVE.bronze_customers)
KEYS (customer_id)
SEQUENCE BY _batch_id  -- ordering column for change events
STORED AS SCD TYPE 2;
The SCD2 handling is the highlight. A few declarative lines replace dozens of lines of manual merge logic. The runtime produces __START_AT / __END_AT columns automatically (slightly different naming from the valid_from / valid_to convention, but semantically equivalent).
So why not stop here? Because Databricks has since renamed and modernized this syntax. DLT keywords like CREATE STREAMING LIVE TABLE and APPLY CHANGES INTO still work, but they are legacy. New projects should use the current naming.
6. Declarative Pipelines (Lakeflow) — the recommended approach
Code: SQL version, Python version
This is the modern evolution of DLT — now called Lakeflow Declarative Pipelines. Same runtime, same capabilities, but with updated syntax that aligns with standard SQL conventions:
CREATE OR REFRESH STREAMING TABLE bronze_customers AS
SELECT * FROM STREAM read_files('/Volumes/.../customers', format => 'csv');

-- Declare the SCD2 target, then define the change flow into it.
CREATE OR REFRESH STREAMING TABLE dim_customer;

CREATE FLOW scd2_dim_customer AS AUTO CDC INTO dim_customer
FROM STREAM(bronze_customers)
KEYS (customer_id)
SEQUENCE BY _batch_id  -- ordering column for change events
STORED AS SCD TYPE 2;
The SQL version reads like a plain-language description of what it does. The Python version uses @dlt.table decorators and create_auto_cdc_flow() — same logic, different syntax depending on your team's preference.
What makes this approach stand out:
- Incremental by default. Streaming tables track what has been processed. You do not write checkpoint logic.
- Native SCD2. AUTO CDC INTO handles the history tracking. No manual merge, no snapshot workaround.
- Data quality built in. CONSTRAINT ... EXPECT ... ON VIOLATION DROP ROW validates data inline (see the sketch after this list).
- Single pipeline definition. The entire Bronze-Silver-Gold flow lives in one file. Databricks handles dependency resolution and execution ordering.
- Less code overall. The SQL pipeline is one file that replaces three notebooks worth of manual logic.
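The data quality piece looks like this in the SQL pipeline. A minimal sketch; the constraint name and rule are illustrative, not taken from the repo:
CREATE OR REFRESH STREAMING TABLE silver_customers (
  -- Drop rows that fail the expectation; violations are counted in the pipeline's quality metrics.
  CONSTRAINT valid_customer_id EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
) AS
SELECT * FROM STREAM(bronze_customers);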
The trade-offs are real but manageable. The pipeline syntax has its own semantics that you need to learn. Debugging happens in the pipeline runtime, not in a notebook you can step through. And the __START_AT / __END_AT column names differ from the valid_from / valid_to convention (minor, but worth documenting for your team).
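If the generated names bother your consumers, a thin view outside the pipeline can restore the familiar convention. A sketch, assuming the SCD2 target is the dim_customer table from above:
-- Convenience view (name is arbitrary) mapping the generated columns back to the usual names
CREATE OR REPLACE VIEW dim_customer_v AS
SELECT * EXCEPT (__START_AT, __END_AT),
       __START_AT AS valid_from,
       __END_AT   AS valid_to,
       __END_AT IS NULL AS is_current
FROM dim_customer;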
For most teams building new medallion pipelines on Databricks, this is where I would start.
So which one should you use?
It depends on your team and your situation, but here is a practical decision framework:
Start with Declarative Pipelines if you are building something new on Databricks. They give you the best balance of readability, built-in SCD2 support, data quality checks, and incremental processing — with the least amount of custom code.
Use dbt if your team is already invested in the dbt ecosystem, or if portability across platforms matters. dbt's testing and documentation culture is a real advantage, even if it adds project overhead.
Use Python or SQL notebooks for prototypes, experiments, or when you need full control over the processing logic. They are fast to start and easy to understand, but they accumulate complexity quickly in production.
Use Materialized Views + Streaming Tables for compact, SQL-only pipelines where SCD2 is not a hard requirement. When it works, it is the cleanest option.
Avoid classic DLT syntax for new projects. It still runs fine, but there is no reason to start with deprecated keywords when the modern equivalents do the same thing.
The full source code is at github.com/romaklimenko/databricks-medallion. If you want to dig into a specific approach, the README has detailed notes on each one.