Databricks Metric Views
I denne artikel tager vi et hurtigt kig på Databricks metric views -- hvorfor vi har brug for dem, hvordan de fungerer, og hvilke problemer de løser. Lad os starte med et simpelt datasæt og bygge op derfra.
Den fulde kildekode til demoen er tilgængelig i dette GitHub Gist.
Vi har tre tabeller: regions, customers og orders. Tre regioner (EMEA, AMER, APAC), otte kunder fordelt på dem og ti ordrer. Intet fancy -- bare nok data til at se problemet:
CREATE SCHEMA IF NOT EXISTS demo.metric_views;
USE demo.metric_views;
-- Regions
CREATE OR REPLACE TABLE regions (region_id INT, region_name STRING);
INSERT INTO regions
VALUES (1, 'EMEA'), (2, 'AMER'), (3, 'APAC');
-- Customers
CREATE OR REPLACE TABLE customers (customer_id INT, customer_name STRING, region_id INT);
INSERT INTO customers
VALUES
(101, 'Acme Corp', 1), -- EMEA
(102, 'Globex Ltd', 1), -- EMEA
(103, 'Initech Inc', 2), -- AMER
(104, 'Umbrella Co', 2), -- AMER
(105, 'Wonka Industries', 2), -- AMER
(106, 'Stark Enterprises', 3), -- APAC
(107, 'Wayne Corp', 3), -- APAC
(108, 'Oscorp', 3);
-- APAC
-- Orders
CREATE OR REPLACE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
amount DECIMAL(10, 2)
);
INSERT INTO orders
VALUES
(1, 101, '2025-01-15', 1000.00), -- Acme (EMEA)
(2, 101, '2025-02-20', 1500.00), -- Acme (EMEA) — repeat customer
(3, 102, '2025-01-10', 800.00), -- Globex (EMEA)
(4, 103, '2025-03-05', 2000.00), -- Initech (AMER)
(5, 104, '2025-03-12', 500.00), -- Umbrella (AMER)
(6, 104, '2025-04-01', 700.00), -- Umbrella (AMER) — repeat customer
(7, 105, '2025-02-28', 1200.00), -- Wonka (AMER)
(8, 106, '2025-01-20', 3000.00), -- Stark (APAC)
(9, 107, '2025-04-15', 900.00), -- Wayne (APAC)
(10, 108, '2025-03-22', 1100.00); -- Oscorp (APAC)
SELECT
'regions' AS tbl,
COUNT(*) AS cnt
FROM
regions
UNION ALL
SELECT
'customers' AS tbl,
COUNT(*) AS cnt
FROM
customers
UNION ALL
SELECT
'orders' AS tbl,
COUNT(*) AS cnt
FROM
orders;
| tbl | cnt |
|---|---|
| regions | 3 |
| customers | 8 |
| orders | 10 |
Vi opretter også et denormaliseret view, der joiner alt sammen. Det bliver vores kilde til alle de følgende forespørgsler:
CREATE OR REPLACE VIEW orders_denorm AS
SELECT
o.order_id,
o.customer_id,
c.customer_name,
r.region_id,
r.region_name,
o.order_date,
o.amount
FROM
orders o
JOIN customers c
ON o.customer_id = c.customer_id
JOIN regions r
ON c.region_id = r.region_id;
SELECT
*
FROM
orders_denorm
ORDER BY
order_id;
| order_id | customer_id | customer_name | region_id | region_name | order_date | amount |
|---|---|---|---|---|---|---|
| 1 | 101 | Acme Corp | 1 | EMEA | 2025-01-15 | 1000.00 |
| 2 | 101 | Acme Corp | 1 | EMEA | 2025-02-20 | 1500.00 |
| 3 | 102 | Globex Ltd | 1 | EMEA | 2025-01-10 | 800.00 |
| 4 | 103 | Initech Inc | 2 | AMER | 2025-03-05 | 2000.00 |
| 5 | 104 | Umbrella Co | 2 | AMER | 2025-03-12 | 500.00 |
| 6 | 104 | Umbrella Co | 2 | AMER | 2025-04-01 | 700.00 |
| 7 | 105 | Wonka Industries | 2 | AMER | 2025-02-28 | 1200.00 |
| 8 | 106 | Stark Enterprises | 3 | APAC | 2025-01-20 | 3000.00 |
| 9 | 107 | Wayne Corp | 3 | APAC | 2025-04-15 | 900.00 |
| 10 | 108 | Oscorp | 3 | APAC | 2025-03-22 | 1100.00 |
Lad os nu bygge et simpelt KPI-view -- samlet omsætning, unikke kunder og omsætning pr. kunde, grupperet efter region:
CREATE OR REPLACE VIEW region_kpis AS
SELECT
r.region_name,
SUM(o.amount) AS total_revenue,
COUNT(DISTINCT o.customer_id) AS distinct_customers,
SUM(o.amount) / COUNT(DISTINCT o.customer_id) AS revenue_per_customer
FROM
orders o
JOIN customers c
ON o.customer_id = c.customer_id
JOIN regions r
ON c.region_id = r.region_id
GROUP BY
r.region_name;
-- This view looks great per region:
SELECT
*
FROM
region_kpis
ORDER BY
region_name;
| region_name | total_revenue | distinct_customers | revenue_per_customer |
|---|---|---|---|
| AMER | 4400.00 | 3 | 1466.67 |
| APAC | 5000.00 | 3 | 1666.67 |
| EMEA | 3300.00 | 2 | 1650.00 |
Per region ser det fint ud. Men hvad sker der, når nogen wrapper dette view i en anden forespørgsel for at få en samlet total?
-- THE PROBLEM — re-aggregating the view gives WRONG results
-- Someone wraps the view in another query to get a grand total...
SELECT
SUM(total_revenue) AS grand_total_revenue,
SUM(distinct_customers) AS grand_total_customers, -- WRONG: 3+3+2=8, happens to be correct here but only by luck
SUM(revenue_per_customer) AS grand_revenue_per_customer -- WRONG!
FROM
region_kpis;
-- WHY: Summing per-group ratios ≠ the global ratio.
-- Groups have different denominators (2, 3, 3 customers),
-- so the ratios aren't comparable units you can add up.
-- The ratio must be recomputed from the raw numerator and denominator.
-- (a1 + a2 + a3) / (b1 + b2 + b3) != (a1 / b1 + a2 / b2 + a3 / b3) ¯\_(ツ)_/
| grand_total_revenue | grand_total_customers | grand_revenue_per_customer |
|---|---|---|
| 12700.00 | 8 | 4783.33 |
Omsætning pr. kunde er forkert. Vi summerer tre forholdstiltal med forskellige nævnere (2, 3 og 3 kunder). Sådan fungerer matematik ikke.
Og nej, at tage gennemsnittet løser det heller ikke:
-- You can't average averages!
SELECT
AVG(revenue_per_customer) AS avg_revenue_per_customer -- STILL WRONG
FROM
region_kpis;
-- Result: (1466.67 + 1666.67 + 1650.00) / 3 = 1594.44
-- Correct: 1587.50
-- Close-ish, but wrong. And the error grows with uneven group sizes.
-- You cannot safely re-aggregate ratio measures.
-- The measure must be recomputed from raw data at every level of aggregation.
| avg_revenue_per_customer |
|---|
| 1594.44 |
Tæt på, men forkert. Og fejlen vokser med ujævne gruppestørrelser. Man kan ikke tage gennemsnittet af gennemsnit. Konklusionen er, at man ikke sikkert kan re-aggregere forholdstiltal. Metrikken skal genberegnes fra rådata på hvert aggregeringsniveau.
En traditionel tilgang er GROUP BY CUBE, som beregner alle mulige kombinationer af dimensioner:
-- With one dimension it's fine.
SELECT
COALESCE(region_name, 'All Regions') AS region_name,
SUM(amount) AS total_revenue,
COUNT(DISTINCT customer_id) AS distinct_customers,
SUM(amount) / COUNT(DISTINCT customer_id) AS revenue_per_customer,
region_name AS _sort
FROM
orders_denorm
GROUP BY
CUBE (region_name)
ORDER BY
_sort NULLS LAST;
| region_name | total_revenue | distinct_customers | revenue_per_customer | _sort |
|---|---|---|---|---|
| AMER | 4400.00 | 3 | 1466.67 | AMER |
| APAC | 5000.00 | 3 | 1666.67 | APAC |
| EMEA | 3300.00 | 2 | 1650.00 | EMEA |
| All Regions | 12700.00 | 8 | 1587.50 | null |
Med én dimension er det fint. Men med tre dimensioner
producerer CUBE 2^3 = 8 grupperingsniveauer.
Med 10 dimensioner er man allerede på 1024 grupperingsniveauer, og med flere dimensioner vokser det derfra.
Det kan hurtigt blive til millioner af rækker, som de fleste aldrig forespørger. Det skalerer ikke:
-- CUBE with 3 dimensions produces 2^3 = 8 grouping levels:
-- every possible subset of {region, customer, date}.
-- With 10 source rows this is manageable, but in production
-- with 10+ dimensions you'd get 1024+ grouping levels and
-- millions of rows — most of which nobody ever queries.
SELECT
COALESCE(region_name, 'All Regions') AS region_name,
COALESCE(customer_name, 'All Customers') AS customer_name,
COALESCE(CAST(order_date AS STRING), 'All Dates') AS order_date,
SUM(amount) AS total_revenue,
COUNT(DISTINCT customer_id) AS distinct_customers,
SUM(amount) / COUNT(DISTINCT customer_id) AS revenue_per_customer
FROM
orders_denorm
GROUP BY
CUBE (region_name, customer_name, order_date)
ORDER BY
region_name,
customer_name,
order_date
Forespørgslen ovenfor returnerer 60 rækker.
Her kommer metric views ind i billedet. Man definerer sine dimensioner og mål én gang, og motoren genberegner alt korrekt ved forespørgselstidspunktet, uanset hvordan man skærer det:
CREATE OR REPLACE VIEW order_metrics_view WITH METRICS LANGUAGE YAML AS
$$
version: "1.1"
source: demo.metric_views.orders_denorm
dimensions:
- name: region_name
expr: region_name
display_name: Region
- name: customer_name
expr: customer_name
display_name: Customer
- name: order_date
expr: order_date
display_name: Order Date
measures:
- name: total_revenue
expr: SUM(amount)
display_name: Total Revenue
- name: customer_count
expr: COUNT(DISTINCT customer_id)
display_name: Distinct Customers
- name: order_count
expr: COUNT(order_id)
display_name: Order Count
- name: revenue_per_customer
expr: SUM(amount) / COUNT(DISTINCT customer_id)
display_name: Revenue per Customer
$$;
display_name-felterne er en del af Databricks' semantiske metadata for metric views.
Forespørg nu efter region. Mål skal wrappes i funktionen
MEASURE, så motoren kan evaluere dem på forespørgslens grupperingsniveau:
-- Query by region — same results as the regular view
-- Measures MUST be wrapped in MEASURE()
SELECT
region_name,
MEASURE(total_revenue) AS total_revenue,
MEASURE(customer_count) AS distinct_customers,
MEASURE(revenue_per_customer) AS revenue_per_customer
FROM
order_metrics_view
GROUP BY
region_name
ORDER BY
region_name;
-- Expected:
-- AMER: 4400 / 3 = 1466.67
-- APAC: 5000 / 3 = 1666.67
-- EMEA: 3300 / 2 = 1650.0
| region_name | total_revenue | distinct_customers | revenue_per_customer |
|---|---|---|---|
| AMER | 4400.00 | 3 | 1466.67 |
| APAC | 5000.00 | 3 | 1666.67 |
| EMEA | 3300.00 | 2 | 1650.00 |
For den samlede total udelader vi bare GROUP BY-klausulen:
-- Grand total:
-- No GROUP BY -> the engine computes from ALL raw data
-- -----------------------------------------------------------------------------
SELECT
MEASURE(total_revenue) AS total_revenue,
MEASURE(customer_count) AS distinct_customers,
MEASURE(revenue_per_customer) AS revenue_per_customer -- The ratio is recomputed, not summed!
FROM
order_metrics_view;
-- Expected: 12700 / 8 = 1587.50 <- CORRECT
| total_revenue | distinct_customers | revenue_per_customer |
|---|---|---|
| 12700.00 | 8 | 1587.50 |
Vi kan også gruppere efter en anden dimension uden at omskrive viewet:
-- Group by a different dimension — no view rewrite needed
SELECT
customer_name,
MEASURE(total_revenue) AS total_revenue,
MEASURE(order_count) AS orders,
MEASURE(revenue_per_customer) AS revenue_per_customer
FROM
order_metrics_view
GROUP BY
customer_name
ORDER BY
total_revenue DESC;
-- Revenue per customer when grouped by customer is just their total revenue / 1,
-- because each group contains exactly one distinct customer
| customer_name | total_revenue | orders | revenue_per_customer |
|---|---|---|---|
| Stark Enterprises | 3000.00 | 1 | 3000.00 |
| Acme Corp | 2500.00 | 2 | 2500.00 |
| Initech Inc | 2000.00 | 1 | 2000.00 |
| Wonka Industries | 1200.00 | 1 | 1200.00 |
| Umbrella Co | 1200.00 | 2 | 1200.00 |
| Oscorp | 1100.00 | 1 | 1100.00 |
| Wayne Corp | 900.00 | 1 | 900.00 |
| Globex Ltd | 800.00 | 1 | 800.00 |
Og vi kan også filtrere efter specifikke regioner:
SELECT
region_name,
MEASURE(total_revenue) AS total_revenue,
MEASURE(revenue_per_customer) AS revenue_per_customer
FROM
order_metrics_view
WHERE -- Filtered query — also correct automatically
region_name IN ('EMEA', 'AMER')
GROUP BY
region_name
| region_name | total_revenue | revenue_per_customer |
|---|---|---|
| EMEA | 3300.00 | 1650.00 |
| AMER | 4400.00 | 1466.67 |
Og her er den samlede total for de to regioner:
-- Grand total for just EMEA + AMER:
SELECT
MEASURE(total_revenue) AS total_revenue,
MEASURE(customer_count) AS distinct_customers,
MEASURE(revenue_per_customer) AS revenue_per_customer
FROM
order_metrics_view
WHERE
region_name IN ('EMEA', 'AMER');
-- Expected: (3300 + 4400) / (2 + 3) = 7700 / 5 = 1540.00
-- MEASURE() tells the engine to recompute the expression at whatever grain the query defines.
-- You define the metric ONCE and slice it freely
| total_revenue | distinct_customers | revenue_per_customer |
|---|---|---|
| 7700.00 | 5 | 1540.00 |
Materialisering
Aggregeret materialisering
Til dashboards der gentagne gange rammer de samme udsnit, kan materialisering af metric views øge hastigheden. Ifølge dokumentationen er denne funktion eksperimentel og kræver serverless compute samt Databricks Runtime 17.2 eller nyere. Der er to typer: aggregeret og uaggregeret materialisering:
-- Aggregated Materialization
-- Pre-compute specific dimension + measure combinations.
-- "these are my hot dashboard queries."
CREATE OR REPLACE VIEW order_metrics_mat_agg WITH METRICS LANGUAGE YAML AS
$$
version: "1.1"
source: demo.metric_views.orders_denorm
dimensions:
- name: region_name
expr: region_name
- name: customer_name
expr: customer_name
- name: order_date
expr: order_date
measures:
- name: total_revenue
expr: SUM(amount)
- name: customer_count
expr: COUNT(DISTINCT customer_id)
- name: order_count
expr: COUNT(order_id)
- name: revenue_per_customer
expr: SUM(amount) / COUNT(DISTINCT customer_id)
materialization:
mode: "relaxed"
schedule: "EVERY 1 HOUR"
materialized_views:
- name: mv_by_region
type: aggregated
dimensions:
- region_name
measures:
- total_revenue
- customer_count
- revenue_per_customer
$$;
Lad os prøve denne forespørgsel. Når materialiseringen er tilgængelig, kan optimizeren servere den fra det materialiserede resultat:
-- this SHOULD hit the materialized view (fast path)
-- REFRESH MATERIALIZED VIEW order_metrics_mat_agg;
SELECT
region_name,
MEASURE(total_revenue),
MEASURE(revenue_per_customer)
FROM
order_metrics_mat_agg
GROUP BY
region_name
I planen kan vi se, at optimizeren bruger den materialiserede sti:
Lad os nu prøve en anden forespørgsel.
Den falder tilbage til kildedataene, fordi customer_name-dimensionen ikke er materialiseret:
-- this falls back to source data (no materialization covers this combo)
-- REFRESH MATERIALIZED VIEW order_metrics_mat_agg;
SELECT
customer_name,
MEASURE(total_revenue)
FROM
order_metrics_mat_agg
GROUP BY
customer_name
Denne forespørgsel falder tilbage til kildedataene. Beregningerne er stadig korrekte; de er bare ikke forudberegnede:
Vi kan også tilføje en separat materialisering for customer_name:
CREATE OR REPLACE VIEW order_metrics_mat_agg WITH METRICS LANGUAGE YAML AS
$$
version: "1.1"
source: demo.metric_views.orders_denorm
dimensions:
- name: region_name
expr: region_name
- name: customer_name
expr: customer_name
- name: order_date
expr: order_date
measures:
- name: total_revenue
expr: SUM(amount)
- name: customer_count
expr: COUNT(DISTINCT customer_id)
- name: order_count
expr: COUNT(order_id)
- name: revenue_per_customer
expr: SUM(amount) / COUNT(DISTINCT customer_id)
materialization:
mode: "relaxed"
schedule: "EVERY 1 HOUR"
materialized_views:
- name: mv_by_region
type: aggregated
dimensions:
- region_name
measures:
- total_revenue
- customer_count
- revenue_per_customer
# Separate materialized view at customer grain.
# Can't just add customer_name to mv_by_region — that changes
# the grain to (region, customer), and ratio measures like
# revenue_per_customer can't be safely re-aggregated from
# a finer grain back to region-only. One grain = one MV.
- name: mv_by_customer
type: aggregated
dimensions:
- customer_name
measures:
- total_revenue
$$;
Når opdateringen er færdig, kan denne forespørgsel bruge det nye materialiserede resultat.
Et forbehold: lige efter man ændrer viewet, kan man stadig se den gamle plan, fordi materialiseringspipelinen kører asynkront i baggrunden og har brug for tid til at blive færdig.
Man kan også udløse materialisering af viewet manuelt:
REFRESH MATERIALIZED VIEW order_metrics_mat_agg;
Jo flere dimensioner man materialiserer sammen, jo mere beregning og lagring kræver systemet, så brug denne funktion selektivt.
Uaggregeret materialisering
En anden materialiseringstype er uaggregeret. Med denne tilgang kan vi forudmaterialisere den joinede datamodel (fakta plus dimensioner fladgjort). Det er meget nyttigt, når joins er den dyre del, ikke aggregeringen:
-- Unaggregated Materialization
-- Pre-materialize the joined data model (fact + dimensions flattened).
-- Useful when joins are the expensive part, not the aggregation.
CREATE OR REPLACE VIEW order_metrics_mat_unagg WITH METRICS LANGUAGE YAML AS
$$
version: "1.1"
source: demo.metric_views.orders_denorm
dimensions:
- name: region_name
expr: region_name
- name: customer_name
expr: customer_name
- name: order_date
expr: order_date
measures:
- name: total_revenue
expr: SUM(amount)
- name: customer_count
expr: COUNT(DISTINCT customer_id)
- name: revenue_per_customer
expr: SUM(amount) / COUNT(DISTINCT customer_id)
materialization:
mode: "relaxed"
schedule: "EVERY 1 HOUR"
materialized_views:
- name: mv_unaggregated
type: unaggregated
$$;
-- Both of these queries benefit from the pre-joined data:
SELECT
region_name,
MEASURE(total_revenue)
FROM
order_metrics_mat_unagg
GROUP BY
region_name;
SELECT
customer_name,
MEASURE(total_revenue)
FROM
order_metrics_mat_unagg
GROUP BY
customer_name
Det var Databricks metric views. Funktionssættet er stadig under udvikling, men det er allerede en stærk måde at definere genbrugelige forretningsmetrikker ét sted.
Power BI-understøttelse af metric views er i øjeblikket i Beta og bygger på BI-kompatibilitetstilstand, som omskriver standard Power BI-aggregeringer til den underliggende metric view-definition. Dokumentationen bemærker også, at metric views med materialiseringer ikke understøttes i denne tilstand.
Metric views er Unity Catalog-sikrede objekter, så man kan administrere rettigheder der, og Unity Catalog lineage hjælper med at spore, hvordan data produceres og forbruges.