Master Data Management (MDM) Functions

Master Data Management (MDM) functions help consolidate and standardize data when you have multiple records representing the same entity. These functions are designed for scenarios where you need to select the "best" or most representative value from a group of duplicate or related records.

Common use cases:

Deduplicating customer records from multiple source systems
Selecting the most complete or accurate value when merging records
Building golden records from fragmented data
Ranking data quality across multiple sources

Aggregation Functions
- tdw_mdm_longest - Select longest string
- tdw_mdm_shortest - Select shortest string
- tdw_mdm_max_by - Select by maximum attribute
- tdw_mdm_min_by - Select by minimum attribute
- tdw_mdm_coalesce - Select first non-empty value
- tdw_mdm_majority - Select most frequent value
- tdw_mdm_prioritized - Select by priority list
- tdw_mdm_regex_prioritized - Select by regex priority
Window Functions
- tdw_rank_by_frequency - Rank by occurrence count
- tdw_rank_by_priority - Rank by priority list
- tdw_rank_by_regex_priority - Rank by regex priority
Function Selection Guide
Common Patterns

Aggregation Functions

Aggregation functions process multiple rows within a GROUP BY and return a single consolidated result per group.

tdw_mdm_longest

tdw_mdm_longest(value VARCHAR) → VARCHAR

Returns the longest non-null string value from multiple rows. If all values are null, returns null. Useful for selecting the most detailed or complete value.

Behavior:

Compares string lengths and returns the value with the maximum length
Empty strings are considered valid values with length 0
NULL values are skipped
If multiple values have the same maximum length, one is returned (not guaranteed which)

Example: Select the most detailed address

-- Input data
+-------------+---------------------------+
| customer_id |          address          |
+-------------+---------------------------+
|           1 | 123 Main St               |
|           1 | 123 Main Street, Apt 4B   |
|           1 | NULL                      |
|           2 | ''                        |
|           2 | PO Box 123                |
|           3 | NULL                      |
+-------------+---------------------------+

-- Query
SELECT customer_id,
       tdw_mdm_longest(address) AS primary_address
FROM customer_sources
GROUP BY customer_id;

-- Output
+-------------+---------------------------+
| customer_id |      primary_address      |
+-------------+---------------------------+
|           1 | 123 Main Street, Apt 4B   |  -- 24 chars (longest)
|           2 | PO Box 123                |  -- 10 chars (beats empty string)
|           3 | NULL                      |  -- All values null
+-------------+---------------------------+

tdw_mdm_shortest

tdw_mdm_shortest(value VARCHAR) → VARCHAR

Returns the shortest non-null, non-empty string value from multiple rows. If all values are null or empty, returns null. Useful for selecting concise display names or abbreviations.

Behavior:

Compares string lengths and returns the value with the minimum length
NULL values and empty strings are skipped
If multiple values have the same minimum length, one is returned (not guaranteed which)

Example: Select concise display name

-- Input data
+-------------+-------------------+
| customer_id |       name        |
+-------------+-------------------+
|           1 | John Smith        |
|           1 | J. Smith          |
|           1 | NULL              |
|           1 | ''                |
|           2 | ''                |
|           2 | NULL              |
|           3 | Alice             |
|           3 | Alice B. Johnson  |
+-------------+-------------------+

-- Query
SELECT customer_id,
       tdw_mdm_shortest(name) AS display_name
FROM customer_sources
GROUP BY customer_id;

-- Output
+-------------+--------------+
| customer_id | display_name |
+-------------+--------------+
|           1 | J. Smith     |  -- 8 chars (shortest non-empty)
|           2 | NULL         |  -- All values null or empty
|           3 | Alice        |  -- 5 chars (shortest)
+-------------+--------------+

tdw_mdm_max_by

tdw_mdm_max_by(value V, attribute K) → V

Returns the value associated with the maximum attribute. Automatically skips rows where the value is NULL, ensuring you get the best available non-null data. Supports any comparable attribute type (BIGINT, DOUBLE, TIMESTAMP, VARCHAR, etc.) and any value type.

Behavior:

Rows where the value is NULL are automatically skipped
Returns the value associated with the maximum (latest/largest) attribute
If all values are null, returns null
Attribute type must be comparable (numeric, timestamp, string, etc.)

Tip: Ideal for selecting the most recent value (by timestamp), or the value from the most trusted source (by trust score).

Example 1: Select most recent name by timestamp

-- Input data
+-------------+-------------+--------------+
| customer_id |    name     | last_updated |
+-------------+-------------+--------------+
|           1 | John Smith  | 2023-01-15   |
|           1 | J. Smith    | 2024-03-20   |
|           1 | NULL        | 2024-12-01   |  -- Skipped (null value)
|           2 | Alice       | 2023-06-10   |
|           2 | Alice Jones | 2024-01-15   |
+-------------+-------------+--------------+

-- Query
SELECT customer_id,
       tdw_mdm_max_by(name, last_updated) AS current_name
FROM customer_history
GROUP BY customer_id;

-- Output
+-------------+--------------+
| customer_id | current_name |
+-------------+--------------+
|           1 | J. Smith     |  -- Latest non-null (2024-03-20)
|           2 | Alice Jones  |  -- Latest (2024-01-15)
+-------------+--------------+

Example 2: Select by highest confidence score

-- Input data
+------------+-------+------------------+
| product_id | price | confidence_score |
+------------+-------+------------------+
|          1 | 99.99 |             0.95 |
|          1 | 89.99 |             0.80 |
|          1 | NULL  |             0.99 |  -- Skipped (null value)
|          2 | 49.99 |             0.75 |
+------------+-------+------------------+

-- Query
SELECT product_id,
       tdw_mdm_max_by(price, confidence_score) AS best_price
FROM product_sources
GROUP BY product_id;

-- Output
+------------+------------+
| product_id | best_price |
+------------+------------+
|          1 |      99.99 |  -- Confidence 0.95 (highest with non-null value)
|          2 |      49.99 |  -- Only non-null value
+------------+------------+

tdw_mdm_min_by

tdw_mdm_min_by(value V, attribute K) → V

Returns the value associated with the minimum attribute. Automatically skips rows where the value is NULL. Supports any comparable attribute type (BIGINT, DOUBLE, TIMESTAMP, VARCHAR, etc.) and any value type.

Behavior:

Rows where the value is NULL are automatically skipped
Returns the value associated with the minimum (earliest/smallest) attribute
If all values are null, returns null
Attribute type must be comparable (numeric, timestamp, string, etc.)

Tip: Ideal for selecting the original/earliest value (by registration date), or selecting by lowest priority score.

Example: Select original name from earliest record

-- Input data
+-------------+-------------+-------------------+
| customer_id |    name     | registration_date |
+-------------+-------------+-------------------+
|           1 | John Smith  | 2023-01-15        |
|           1 | J. Smith    | 2024-03-20        |
|           1 | NULL        | 2022-12-01        |  -- Skipped (null value)
|           2 | Alice Jones | 2023-05-01        |
|           2 | Alice       | 2023-06-10        |
+-------------+-------------+-------------------+

-- Query
SELECT customer_id,
       tdw_mdm_min_by(name, registration_date) AS original_name
FROM customer_history
GROUP BY customer_id;

-- Output
+-------------+---------------+
| customer_id | original_name |
+-------------+---------------+
|           1 | John Smith    |  -- Earliest non-null (2023-01-15)
|           2 | Alice Jones   |  -- Earliest (2023-05-01)
+-------------+---------------+

tdw_mdm_coalesce

tdw_mdm_coalesce(value VARCHAR) → VARCHAR

Returns the first non-null, non-empty string value encountered from multiple rows. Once a valid value is found, subsequent rows are ignored for efficiency. Similar to SQL COALESCE but works across rows in an aggregation.

Behavior:

Returns the first non-null, non-empty value encountered
Stops processing once a valid value is found (efficient for large datasets)
Row processing order is not guaranteed in distributed execution
Best used when any non-empty value is acceptable

Note: In distributed query execution, the order in which rows are processed is non-deterministic. If you need deterministic selection based on specific criteria (like priority or timestamp), use tdw_mdm_prioritized, tdw_mdm_max_by, or tdw_mdm_min_by instead.

Example: Get first available name

-- Input data
+-------------+----------+
| customer_id |   name   |
+-------------+----------+
|           1 | NULL     |
|           1 | ''       |
|           1 | John Doe |
|           1 | Jane Doe |
|           2 | NULL     |
|           2 | NULL     |
+-------------+----------+

-- Query
SELECT customer_id,
       tdw_mdm_coalesce(name) AS primary_name
FROM customer_sources
GROUP BY customer_id;

-- Output
+-------------+--------------+
| customer_id | primary_name |
+-------------+--------------+
|           1 | John Doe     |  -- First non-empty value
|           2 | NULL         |  -- All values null or empty
+-------------+--------------+

tdw_mdm_majority

tdw_mdm_majority(value VARCHAR) → VARCHAR

tdw_mdm_majority(value VARCHAR, weight BIGINT) → VARCHAR

Returns the most frequently occurring (plurality) non-null, non-empty string value. Supports optional weighting to give certain values more influence. In case of a tie, returns the lexicographically smallest value.

Behavior:

Counts occurrences of each non-null, non-empty value
Returns the value with the highest count (or weighted sum)
Ties are broken by lexicographic order (alphabetically smallest wins)
Maximum 1,000 unique values per group (safety limit for memory)
Non-positive weights are ignored

Example 1: Basic usage (unweighted)

-- Input data
+-------------+----------+
| customer_id |   city   |
+-------------+----------+
|           1 | New York |
|           1 | New York |
|           1 | NYC      |
|           2 | Boston   |
|           2 | Boston   |
|           2 | Boston   |
+-------------+----------+

-- Query
SELECT customer_id,
       tdw_mdm_majority(city) AS primary_city
FROM customer_addresses
GROUP BY customer_id;

-- Output
+-------------+--------------+
| customer_id | primary_city |
+-------------+--------------+
|           1 | New York     |  -- Appears 2 times (most frequent)
|           2 | Boston       |  -- Appears 3 times
+-------------+--------------+

Example 2: Tie-breaking (lexicographic order)

-- Input data
+-------------+----------+
| customer_id |  status  |
+-------------+----------+
|           1 | Inactive |
|           1 | Active   |
+-------------+----------+

-- Query
SELECT customer_id,
       tdw_mdm_majority(status) AS primary_status
FROM customer_data
GROUP BY customer_id;

-- Output
+-------------+----------------+
| customer_id | primary_status |
+-------------+----------------+
|           1 | Active         |  -- Tie: both appear once, 'Active' < 'Inactive'
+-------------+----------------+

Example 3: With weights

-- Input data (with source trust scores as weights)
+-------------+----------+-------------+
| customer_id |   city   | trust_score |
+-------------+----------+-------------+
|           1 | Toronto  |          15 |
|           1 | Montreal |           5 |
|           1 | Montreal |           5 |
+-------------+----------+-------------+

-- Query
SELECT customer_id,
       tdw_mdm_majority(city, trust_score) AS primary_city
FROM customer_addresses
GROUP BY customer_id;

-- Output
+-------------+--------------+
| customer_id | primary_city |
+-------------+--------------+
|           1 | Toronto      |  -- Total weight 15 beats Montreal's 10
+-------------+--------------+

tdw_mdm_prioritized

tdw_mdm_prioritized(value V, reference VARCHAR, priorities ARRAY<VARCHAR>) → V

Returns the value with the highest priority based on matching the reference to a priority array. Earlier positions in the array have higher priority (index 0 = highest). The value can be any type, while the reference and priorities must be VARCHAR.

Behavior:

Exact string matching between reference and priority array elements
Earlier positions in the array = higher priority
NULL values and empty strings are skipped
References that don't match any priority are skipped
Ties (same priority) return any matching value (non-deterministic in distributed execution)

Tip: Ideal for selecting values from trusted source systems in a specific order (e.g., CRM over ERP over Legacy).

Example: Select name from highest-priority source

-- Input data
+-------------+--------------+---------+
| customer_id |     name     | source  |
+-------------+--------------+---------+
|           1 | John Smith   | Legacy  |
|           1 | J. Smith     | ERP     |
|           1 | Johnny Smith | CRM     |
|           2 | Alice Jones  | Legacy  |
|           2 | Alice J.     | ERP     |
|           3 | Bob Brown    | Legacy  |
+-------------+--------------+---------+

-- Query
SELECT customer_id,
       tdw_mdm_prioritized(name, source, ARRAY['CRM', 'ERP', 'Legacy']) AS primary_name
FROM customer_data
GROUP BY customer_id;

-- Output
+-------------+--------------+
| customer_id | primary_name |
+-------------+--------------+
|           1 | Johnny Smith |  -- CRM (priority 0) beats ERP and Legacy
|           2 | Alice J.     |  -- ERP (priority 1) beats Legacy
|           3 | Bob Brown    |  -- Only Legacy available
+-------------+--------------+

Example: Numeric values with prioritization

-- Input data
+-------------+--------+--------+
| customer_id | amount | source |
+-------------+--------+--------+
|           1 |  100.0 | API    |
|           1 |  150.0 | DB     |
|           1 |   95.0 | File   |
|           2 |  200.0 | File   |
+-------------+--------+--------+

-- Query
SELECT customer_id,
       tdw_mdm_prioritized(amount, source, ARRAY['DB', 'API', 'File']) AS trusted_amount
FROM transactions
GROUP BY customer_id;

-- Output
+-------------+----------------+
| customer_id | trusted_amount |
+-------------+----------------+
|           1 |          150.0 |  -- DB (priority 0) beats API and File
|           2 |          200.0 |  -- Only File available
+-------------+----------------+

tdw_mdm_regex_prioritized

tdw_mdm_regex_prioritized(value V, reference VARCHAR, patterns ARRAY<VARCHAR>) → V

Returns the value with the highest priority based on matching the reference to a regex pattern array. Earlier positions in the array have higher priority (index 0 = highest). Uses partial matching - patterns can match anywhere in the string.

Behavior:

Uses Java regex Pattern.matcher().find() for partial matching
Use ^ and $ anchors for full-string matching
First matching pattern wins (earlier index = higher priority)
NULL values and empty strings are skipped
Non-matching references are skipped

Example 1: Domain-based email prioritization

-- Input data
+-------------+------------------------+
| customer_id |         email          |
+-------------+------------------------+
|           1 | john@company.com       |
|           1 | john123@gmail.com      |
|           1 | john.smith@yahoo.com   |
|           2 | alice@gmail.com        |
|           2 | alice.j@hotmail.com    |
+-------------+------------------------+

-- Query
SELECT customer_id,
       tdw_mdm_regex_prioritized(email, email,
           ARRAY['@company\.com$', '@gmail\.com$', '@.*']) AS primary_email
FROM customer_contacts
GROUP BY customer_id;

-- Output
+-------------+------------------------+
| customer_id |     primary_email      |
+-------------+------------------------+
|           1 | john@company.com       |  -- Matches priority 0 (company domain)
|           2 | alice@gmail.com        |  -- Matches priority 1 (gmail)
+-------------+------------------------+

Example 2: Source system priority by pattern

-- Input data
+-------------+---------+------------------+
| customer_id |  name   |      source      |
+-------------+---------+------------------+
|           1 | John S. | legacy-system-v1 |
|           1 | John    | crm-prod         |
|           1 | Johnny  | api-integration  |
|           2 | Alice   | unknown-source   |
|           2 | Alice J | crm-prod         |
+-------------+---------+------------------+

-- Query
SELECT customer_id,
       tdw_mdm_regex_prioritized(name, source,
           ARRAY['^crm-.*', '^api-.*', '^legacy-.*']) AS primary_name
FROM customer_data
GROUP BY customer_id;

-- Output
+-------------+--------------+
| customer_id | primary_name |
+-------------+--------------+
|           1 | John         |  -- crm-prod matches ^crm-.*
|           2 | Alice J      |  -- crm-prod matches ^crm-.*
+-------------+--------------+
-- Note: 'unknown-source' doesn't match any pattern, so it's skipped

Window Functions

Window functions compute a value for each row based on all rows in its partition. Unlike aggregation functions, they return a value for every input row rather than collapsing rows into groups.

tdw_rank_by_frequency

tdw_rank_by_frequency(value VARCHAR) OVER (PARTITION BY ...) → BIGINT

tdw_rank_by_frequency(value VARCHAR, weight BIGINT) OVER (PARTITION BY ...) → BIGINT

Returns a rank based on frequency of occurrence within the partition. More frequent values receive better (lower) ranks. Uses RANK semantics with gaps - tied values get the same rank, and the next distinct frequency gets a rank with a gap.

Behavior:

Most frequent values get rank 1
Ties in frequency get the same rank
Next distinct frequency gets rank = (count of values with higher frequency) + 1
NULL values get NULL rank (not counted in frequency)
Empty strings are skipped (not counted in frequency)
Maximum 1,000 unique values per partition (safety limit)
Weighted variant: sums weights instead of counting occurrences

Example: Rank cities by frequency

-- Input data
+-------------+----------+
| customer_id |   city   |
+-------------+----------+
|           1 | New York |
|           1 | New York |
|           1 | New York |
|           1 | Boston   |
|           1 | Boston   |
|           1 | Chicago  |
+-------------+----------+

-- Query
SELECT
    customer_id,
    city,
    tdw_rank_by_frequency(city) OVER (PARTITION BY customer_id) AS city_rank
FROM customer_addresses;

-- Output
+-------------+----------+-----------+
| customer_id |   city   | city_rank |
+-------------+----------+-----------+
|           1 | New York |         1 |  -- 3 occurrences (most frequent)
|           1 | New York |         1 |
|           1 | New York |         1 |
|           1 | Boston   |         2 |  -- 2 occurrences
|           1 | Boston   |         2 |
|           1 | Chicago  |         3 |  -- 1 occurrence
+-------------+----------+-----------+

Example: Weighted frequency (by trust score)

-- Query with trust_score weighting
SELECT
    customer_id,
    city,
    source,
    trust_score,
    tdw_rank_by_frequency(city, trust_score)
        OVER (PARTITION BY customer_id) AS city_rank
FROM customer_addresses;

-- A city appearing once from CRM (trust_score=100) will rank higher than
-- a city appearing twice from Legacy (trust_score=10 each, total=20)

tdw_rank_by_priority

tdw_rank_by_priority(reference VARCHAR, priorities ARRAY<VARCHAR>) OVER (...) → BIGINT

Returns a rank based on exact string matching against a priority array. Earlier positions in the array receive better (lower) ranks. Uses index-based ranking where rank equals array position + 1.

Behavior:

Matching values get rank = array_index + 1 (1-based)
Multiple rows matching the same priority get the same rank
Non-matching values get rank = NULL
NULL reference value returns NULL rank
NULL values in the priorities array are skipped
Case-sensitive comparison

Example: Rank by source system priority

-- Input data
+-------------+---------+---------+
| customer_id | source  |  name   |
+-------------+---------+---------+
|           1 | CRM     | John    |
|           1 | CRM     | Jane    |
|           1 | ERP     | J. Doe  |
|           1 | Legacy  | Johnny  |
|           1 | Unknown | Jon     |
+-------------+---------+---------+

-- Query
SELECT
    customer_id,
    source,
    name,
    tdw_rank_by_priority(source, ARRAY['CRM', 'ERP', 'Legacy'])
        OVER (PARTITION BY customer_id) AS source_rank
FROM customer_data;

-- Output
+-------------+---------+--------+-------------+
| customer_id | source  |  name  | source_rank |
+-------------+---------+--------+-------------+
|           1 | CRM     | John   |           1 |
|           1 | CRM     | Jane   |           1 |
|           1 | ERP     | J. Doe |           2 |
|           1 | Legacy  | Johnny |           3 |
|           1 | Unknown | Jon    |        NULL |  -- Not in priority array
+-------------+---------+--------+-------------+

Tip: Combine with ORDER BY source_rank ASC NULLS LAST to sort by priority, or filter with WHERE source_rank = 1 to get only the highest-priority records.

tdw_rank_by_regex_priority

tdw_rank_by_regex_priority(reference VARCHAR, patterns ARRAY<VARCHAR>) OVER (...) → BIGINT

Returns a rank based on regex pattern matching against a priority array. Earlier positions in the array receive better (lower) ranks. Uses partial matching - patterns can match anywhere in the string.

Behavior:

Matching values get rank = array_index + 1 (1-based)
First matching pattern wins (earlier index = better rank)
Partial matching: pattern can match substring (use ^ and $ for full match)
Non-matching values get rank = NULL
Invalid regex pattern throws an error with descriptive message

Example: Rank emails by domain pattern

-- Query
SELECT
    customer_id,
    email,
    tdw_rank_by_regex_priority(email, ARRAY[
        '@company\.com$',   -- Rank 1: Company emails
        '@partner\.org$',   -- Rank 2: Partner emails
        '@.*\.edu$',        -- Rank 3: Educational
        '@gmail\.com$'      -- Rank 4: Gmail
    ]) OVER (PARTITION BY customer_id) AS email_rank
FROM customer_data;

-- Output
+-------------+------------------------+------------+
| customer_id |         email          | email_rank |
+-------------+------------------------+------------+
|           1 | john@company.com       |          1 |
|           1 | john@partner.org       |          2 |
|           1 | john@university.edu    |          3 |
|           1 | john@gmail.com         |          4 |
|           1 | john@unknown.com       |       NULL |  -- No pattern match
+-------------+------------------------+------------+

Function Selection Guide

Scenario	Recommended Function	Why
Most complete/detailed value	`tdw_mdm_longest`	Longer strings often contain more detail
Concise display name	`tdw_mdm_shortest`	Shorter strings work better for display
Most recent value	`tdw_mdm_max_by(value, timestamp)`	Select by maximum timestamp
Original/earliest value	`tdw_mdm_min_by(value, timestamp)`	Select by minimum timestamp
Most trusted source	`tdw_mdm_max_by(value, trust_score)`	Select by highest trust/confidence score
Any available value	`tdw_mdm_coalesce`	Fast, stops at first valid value
Most common value	`tdw_mdm_majority`	Democratic selection by frequency
Weighted voting	`tdw_mdm_majority(value, weight)`	Source trust affects vote weight
Specific source priority	`tdw_mdm_prioritized`	Exact match against priority list
Pattern-based priority	`tdw_mdm_regex_prioritized`	Flexible regex matching
Rank all rows by frequency	`tdw_rank_by_frequency`	Window function, returns rank per row
Rank all rows by source priority	`tdw_rank_by_priority`	Window function with priority array

Common Patterns

These patterns are typically used when creating Data Products to consolidate and deduplicate records from multiple sources.

Building a Golden Record

-- Combine multiple MDM functions to build a golden record
SELECT
    customer_id,
    -- Most recent name from CRM system
    tdw_mdm_prioritized(name, source, ARRAY['CRM', 'ERP', 'Legacy']) AS golden_name,
    -- Most recent email
    tdw_mdm_max_by(email, last_updated) AS golden_email,
    -- Most common city (weighted by source trust)
    tdw_mdm_majority(city, trust_score) AS golden_city,
    -- Most complete address (longest)
    tdw_mdm_longest(address) AS golden_address
FROM customer_sources
GROUP BY customer_id;

Filtering to Best Records

-- Use window functions to filter to highest-priority records
WITH ranked AS (
    SELECT
        *,
        tdw_rank_by_priority(source, ARRAY['CRM', 'ERP', 'Legacy'])
            OVER (PARTITION BY customer_id) AS source_rank
    FROM customer_data
)
SELECT *
FROM ranked
WHERE source_rank = 1;

Handling Multiple Attributes

-- Select different fields based on different criteria
SELECT
    customer_id,
    -- Name from most trusted source
    tdw_mdm_max_by(name, trust_score) AS name,
    -- Email that's most recently verified
    tdw_mdm_max_by(email, email_verified_date) AS email,
    -- Phone from most complete record
    tdw_mdm_max_by(phone, CAST(LENGTH(phone) AS BIGINT)) AS phone
FROM customer_sources
GROUP BY customer_id;

Multi-Criteria Ranking with Window Functions

Use ranking window functions to score records, then use Trino's min_by to select the best value for each attribute based on its rank. This allows each field to be selected independently using its own quality criteria.

-- Rank records by multiple criteria, then pick best value per attribute
WITH ranked AS (
    SELECT
        *,
        -- Rank by source system priority (for name selection)
        tdw_rank_by_priority(source, ARRAY['CRM', 'ERP', 'Legacy'])
            OVER (PARTITION BY customer_id) AS source_rank,
        -- Rank by data frequency (for city selection)
        tdw_rank_by_frequency(city)
            OVER (PARTITION BY customer_id) AS city_frequency_rank,
        -- Rank by email domain priority (for email selection)
        tdw_rank_by_regex_priority(email, ARRAY['@company\.com$', '@gmail\.com$', '.*'])
            OVER (PARTITION BY customer_id) AS email_rank
    FROM customer_data
)
SELECT
    customer_id,
    -- Best name: from highest-priority source
    min_by(name, source_rank) AS best_name,
    -- Best email: from highest-priority domain
    min_by(email, email_rank) AS best_email,
    -- Best city: most frequently occurring
    min_by(city, city_frequency_rank) AS best_city
FROM ranked
GROUP BY customer_id;

How it works:

ranked CTE: Apply ranking window functions to score each record by different quality dimensions
min_by aggregation: For each attribute, select the value with the lowest (best) rank

Key advantage: Each attribute can come from a different source record. The best name might come from CRM while the best email comes from a different record with a company domain.

Example with sample data:

-- Input data
+-------------+---------+----------------------+----------+--------+
| customer_id | source  |        email         |   city   |  name  |
+-------------+---------+----------------------+----------+--------+
|           1 | CRM     | john@yahoo.com       | Boston   | John S |
|           1 | ERP     | john@company.com     | New York | J Smith|
|           1 | Legacy  | john@gmail.com       | New York | Johnny |
+-------------+---------+----------------------+----------+--------+

-- After ranking and aggregation:
-- best_name: "John S" (from CRM, source_rank=1)
-- best_email: "john@company.com" (from ERP, email_rank=1)
-- best_city: "New York" (appears twice, city_frequency_rank=1)

-- Output (golden record with best of each attribute)
+-------------+-----------+----------------------+-----------+
| customer_id | best_name |      best_email      | best_city |
+-------------+-----------+----------------------+-----------+
|           1 | John S    | john@company.com     | New York  |
+-------------+-----------+----------------------+-----------+

Search

How can we help you?

Mastering Functions

Master Data Management (MDM) Functions

Table of Contents

Aggregation Functions

tdw_mdm_longest

tdw_mdm_shortest

tdw_mdm_max_by

tdw_mdm_min_by

tdw_mdm_coalesce

tdw_mdm_majority

tdw_mdm_prioritized

tdw_mdm_regex_prioritized

Window Functions

tdw_rank_by_frequency

tdw_rank_by_priority

tdw_rank_by_regex_priority

Function Selection Guide

Common Patterns

Building a Golden Record

Filtering to Best Records

Handling Multiple Attributes

Multi-Criteria Ranking with Window Functions