Master Data Management (MDM) Functions
Master Data Management (MDM) functions help consolidate and standardize data when you have multiple records representing the same entity. These functions are designed for scenarios where you need to select the "best" or most representative value from a group of duplicate or related records.
Common use cases:
- Deduplicating customer records from multiple source systems
- Selecting the most complete or accurate value when merging records
- Building golden records from fragmented data
- Ranking data quality across multiple sources
Table of Contents
-
Aggregation Functions
- tdw_mdm_longest - Select longest string
- tdw_mdm_shortest - Select shortest string
- tdw_mdm_max_by - Select by maximum attribute
- tdw_mdm_min_by - Select by minimum attribute
- tdw_mdm_coalesce - Select first non-empty value
- tdw_mdm_majority - Select most frequent value
- tdw_mdm_prioritized - Select by priority list
- tdw_mdm_regex_prioritized - Select by regex priority
-
Window Functions
- tdw_rank_by_frequency - Rank by occurrence count
- tdw_rank_by_priority - Rank by priority list
- tdw_rank_by_regex_priority - Rank by regex priority
- Function Selection Guide
- Common Patterns
Aggregation Functions
Aggregation functions process multiple rows within a GROUP BY and return a single consolidated result per group.
tdw_mdm_longest
tdw_mdm_longest(value VARCHAR) → VARCHAR
Returns the longest non-null string value from multiple rows. If all values are null, returns null. Useful for selecting the most detailed or complete value.
Behavior:
- Compares string lengths and returns the value with the maximum length
- Empty strings are considered valid values with length 0
- NULL values are skipped
- If multiple values have the same maximum length, one is returned (not guaranteed which)
Example: Select the most detailed address
-- Input data
+-------------+---------------------------+
| customer_id | address |
+-------------+---------------------------+
| 1 | 123 Main St |
| 1 | 123 Main Street, Apt 4B |
| 1 | NULL |
| 2 | '' |
| 2 | PO Box 123 |
| 3 | NULL |
+-------------+---------------------------+
-- Query
SELECT customer_id,
tdw_mdm_longest(address) AS primary_address
FROM customer_sources
GROUP BY customer_id;
-- Output
+-------------+---------------------------+
| customer_id | primary_address |
+-------------+---------------------------+
| 1 | 123 Main Street, Apt 4B | -- 24 chars (longest)
| 2 | PO Box 123 | -- 10 chars (beats empty string)
| 3 | NULL | -- All values null
+-------------+---------------------------+
tdw_mdm_shortest
tdw_mdm_shortest(value VARCHAR) → VARCHAR
Returns the shortest non-null, non-empty string value from multiple rows. If all values are null or empty, returns null. Useful for selecting concise display names or abbreviations.
Behavior:
- Compares string lengths and returns the value with the minimum length
- NULL values and empty strings are skipped
- If multiple values have the same minimum length, one is returned (not guaranteed which)
Example: Select concise display name
-- Input data
+-------------+-------------------+
| customer_id | name |
+-------------+-------------------+
| 1 | John Smith |
| 1 | J. Smith |
| 1 | NULL |
| 1 | '' |
| 2 | '' |
| 2 | NULL |
| 3 | Alice |
| 3 | Alice B. Johnson |
+-------------+-------------------+
-- Query
SELECT customer_id,
tdw_mdm_shortest(name) AS display_name
FROM customer_sources
GROUP BY customer_id;
-- Output
+-------------+--------------+
| customer_id | display_name |
+-------------+--------------+
| 1 | J. Smith | -- 8 chars (shortest non-empty)
| 2 | NULL | -- All values null or empty
| 3 | Alice | -- 5 chars (shortest)
+-------------+--------------+
tdw_mdm_max_by
tdw_mdm_max_by(value V, attribute K) → V
Returns the value associated with the maximum attribute. Automatically skips rows where the value is NULL, ensuring you get the best available non-null data. Supports any comparable attribute type (BIGINT, DOUBLE, TIMESTAMP, VARCHAR, etc.) and any value type.
Behavior:
- Rows where the value is NULL are automatically skipped
- Returns the value associated with the maximum (latest/largest) attribute
- If all values are null, returns null
- Attribute type must be comparable (numeric, timestamp, string, etc.)
Tip: Ideal for selecting the most recent value (by timestamp), or the value from the most trusted source (by trust score).
Example 1: Select most recent name by timestamp
-- Input data
+-------------+-------------+--------------+
| customer_id | name | last_updated |
+-------------+-------------+--------------+
| 1 | John Smith | 2023-01-15 |
| 1 | J. Smith | 2024-03-20 |
| 1 | NULL | 2024-12-01 | -- Skipped (null value)
| 2 | Alice | 2023-06-10 |
| 2 | Alice Jones | 2024-01-15 |
+-------------+-------------+--------------+
-- Query
SELECT customer_id,
tdw_mdm_max_by(name, last_updated) AS current_name
FROM customer_history
GROUP BY customer_id;
-- Output
+-------------+--------------+
| customer_id | current_name |
+-------------+--------------+
| 1 | J. Smith | -- Latest non-null (2024-03-20)
| 2 | Alice Jones | -- Latest (2024-01-15)
+-------------+--------------+
Example 2: Select by highest confidence score
-- Input data
+------------+-------+------------------+
| product_id | price | confidence_score |
+------------+-------+------------------+
| 1 | 99.99 | 0.95 |
| 1 | 89.99 | 0.80 |
| 1 | NULL | 0.99 | -- Skipped (null value)
| 2 | 49.99 | 0.75 |
+------------+-------+------------------+
-- Query
SELECT product_id,
tdw_mdm_max_by(price, confidence_score) AS best_price
FROM product_sources
GROUP BY product_id;
-- Output
+------------+------------+
| product_id | best_price |
+------------+------------+
| 1 | 99.99 | -- Confidence 0.95 (highest with non-null value)
| 2 | 49.99 | -- Only non-null value
+------------+------------+
tdw_mdm_min_by
tdw_mdm_min_by(value V, attribute K) → V
Returns the value associated with the minimum attribute. Automatically skips rows where the value is NULL. Supports any comparable attribute type (BIGINT, DOUBLE, TIMESTAMP, VARCHAR, etc.) and any value type.
Behavior:
- Rows where the value is NULL are automatically skipped
- Returns the value associated with the minimum (earliest/smallest) attribute
- If all values are null, returns null
- Attribute type must be comparable (numeric, timestamp, string, etc.)
Tip: Ideal for selecting the original/earliest value (by registration date), or selecting by lowest priority score.
Example: Select original name from earliest record
-- Input data
+-------------+-------------+-------------------+
| customer_id | name | registration_date |
+-------------+-------------+-------------------+
| 1 | John Smith | 2023-01-15 |
| 1 | J. Smith | 2024-03-20 |
| 1 | NULL | 2022-12-01 | -- Skipped (null value)
| 2 | Alice Jones | 2023-05-01 |
| 2 | Alice | 2023-06-10 |
+-------------+-------------+-------------------+
-- Query
SELECT customer_id,
tdw_mdm_min_by(name, registration_date) AS original_name
FROM customer_history
GROUP BY customer_id;
-- Output
+-------------+---------------+
| customer_id | original_name |
+-------------+---------------+
| 1 | John Smith | -- Earliest non-null (2023-01-15)
| 2 | Alice Jones | -- Earliest (2023-05-01)
+-------------+---------------+
tdw_mdm_coalesce
tdw_mdm_coalesce(value VARCHAR) → VARCHAR
Returns the first non-null, non-empty string value encountered from multiple rows. Once a valid value is found, subsequent rows are ignored for efficiency. Similar to SQL COALESCE but works across rows in an aggregation.
Behavior:
- Returns the first non-null, non-empty value encountered
- Stops processing once a valid value is found (efficient for large datasets)
- Row processing order is not guaranteed in distributed execution
- Best used when any non-empty value is acceptable
Note: In distributed query execution, the order in which rows are processed is non-deterministic. If you need deterministic selection based on specific criteria (like priority or timestamp), use tdw_mdm_prioritized, tdw_mdm_max_by, or tdw_mdm_min_by instead.
Example: Get first available name
-- Input data
+-------------+----------+
| customer_id | name |
+-------------+----------+
| 1 | NULL |
| 1 | '' |
| 1 | John Doe |
| 1 | Jane Doe |
| 2 | NULL |
| 2 | NULL |
+-------------+----------+
-- Query
SELECT customer_id,
tdw_mdm_coalesce(name) AS primary_name
FROM customer_sources
GROUP BY customer_id;
-- Output
+-------------+--------------+
| customer_id | primary_name |
+-------------+--------------+
| 1 | John Doe | -- First non-empty value
| 2 | NULL | -- All values null or empty
+-------------+--------------+
tdw_mdm_majority
tdw_mdm_majority(value VARCHAR) → VARCHAR
tdw_mdm_majority(value VARCHAR, weight BIGINT) → VARCHAR
Returns the most frequently occurring (plurality) non-null, non-empty string value. Supports optional weighting to give certain values more influence. In case of a tie, returns the lexicographically smallest value.
Behavior:
- Counts occurrences of each non-null, non-empty value
- Returns the value with the highest count (or weighted sum)
- Ties are broken by lexicographic order (alphabetically smallest wins)
- Maximum 1,000 unique values per group (safety limit for memory)
- Non-positive weights are ignored
Example 1: Basic usage (unweighted)
-- Input data
+-------------+----------+
| customer_id | city |
+-------------+----------+
| 1 | New York |
| 1 | New York |
| 1 | NYC |
| 2 | Boston |
| 2 | Boston |
| 2 | Boston |
+-------------+----------+
-- Query
SELECT customer_id,
tdw_mdm_majority(city) AS primary_city
FROM customer_addresses
GROUP BY customer_id;
-- Output
+-------------+--------------+
| customer_id | primary_city |
+-------------+--------------+
| 1 | New York | -- Appears 2 times (most frequent)
| 2 | Boston | -- Appears 3 times
+-------------+--------------+
Example 2: Tie-breaking (lexicographic order)
-- Input data
+-------------+----------+
| customer_id | status |
+-------------+----------+
| 1 | Inactive |
| 1 | Active |
+-------------+----------+
-- Query
SELECT customer_id,
tdw_mdm_majority(status) AS primary_status
FROM customer_data
GROUP BY customer_id;
-- Output
+-------------+----------------+
| customer_id | primary_status |
+-------------+----------------+
| 1 | Active | -- Tie: both appear once, 'Active' < 'Inactive'
+-------------+----------------+
Example 3: With weights
-- Input data (with source trust scores as weights)
+-------------+----------+-------------+
| customer_id | city | trust_score |
+-------------+----------+-------------+
| 1 | Toronto | 15 |
| 1 | Montreal | 5 |
| 1 | Montreal | 5 |
+-------------+----------+-------------+
-- Query
SELECT customer_id,
tdw_mdm_majority(city, trust_score) AS primary_city
FROM customer_addresses
GROUP BY customer_id;
-- Output
+-------------+--------------+
| customer_id | primary_city |
+-------------+--------------+
| 1 | Toronto | -- Total weight 15 beats Montreal's 10
+-------------+--------------+
tdw_mdm_prioritized
tdw_mdm_prioritized(value V, reference VARCHAR, priorities ARRAY<VARCHAR>) → V
Returns the value with the highest priority based on matching the reference to a priority array. Earlier positions in the array have higher priority (index 0 = highest). The value can be any type, while the reference and priorities must be VARCHAR.
Behavior:
- Exact string matching between reference and priority array elements
- Earlier positions in the array = higher priority
- NULL values and empty strings are skipped
- References that don't match any priority are skipped
- Ties (same priority) return any matching value (non-deterministic in distributed execution)
Tip: Ideal for selecting values from trusted source systems in a specific order (e.g., CRM over ERP over Legacy).
Example: Select name from highest-priority source
-- Input data
+-------------+--------------+---------+
| customer_id | name | source |
+-------------+--------------+---------+
| 1 | John Smith | Legacy |
| 1 | J. Smith | ERP |
| 1 | Johnny Smith | CRM |
| 2 | Alice Jones | Legacy |
| 2 | Alice J. | ERP |
| 3 | Bob Brown | Legacy |
+-------------+--------------+---------+
-- Query
SELECT customer_id,
tdw_mdm_prioritized(name, source, ARRAY['CRM', 'ERP', 'Legacy']) AS primary_name
FROM customer_data
GROUP BY customer_id;
-- Output
+-------------+--------------+
| customer_id | primary_name |
+-------------+--------------+
| 1 | Johnny Smith | -- CRM (priority 0) beats ERP and Legacy
| 2 | Alice J. | -- ERP (priority 1) beats Legacy
| 3 | Bob Brown | -- Only Legacy available
+-------------+--------------+
Example: Numeric values with prioritization
-- Input data
+-------------+--------+--------+
| customer_id | amount | source |
+-------------+--------+--------+
| 1 | 100.0 | API |
| 1 | 150.0 | DB |
| 1 | 95.0 | File |
| 2 | 200.0 | File |
+-------------+--------+--------+
-- Query
SELECT customer_id,
tdw_mdm_prioritized(amount, source, ARRAY['DB', 'API', 'File']) AS trusted_amount
FROM transactions
GROUP BY customer_id;
-- Output
+-------------+----------------+
| customer_id | trusted_amount |
+-------------+----------------+
| 1 | 150.0 | -- DB (priority 0) beats API and File
| 2 | 200.0 | -- Only File available
+-------------+----------------+
tdw_mdm_regex_prioritized
tdw_mdm_regex_prioritized(value V, reference VARCHAR, patterns ARRAY<VARCHAR>) → V
Returns the value with the highest priority based on matching the reference to a regex pattern array. Earlier positions in the array have higher priority (index 0 = highest). Uses partial matching - patterns can match anywhere in the string.
Behavior:
- Uses Java regex
Pattern.matcher().find()for partial matching - Use
^and$anchors for full-string matching - First matching pattern wins (earlier index = higher priority)
- NULL values and empty strings are skipped
- Non-matching references are skipped
Example 1: Domain-based email prioritization
-- Input data
+-------------+------------------------+
| customer_id | email |
+-------------+------------------------+
| 1 | john@company.com |
| 1 | john123@gmail.com |
| 1 | john.smith@yahoo.com |
| 2 | alice@gmail.com |
| 2 | alice.j@hotmail.com |
+-------------+------------------------+
-- Query
SELECT customer_id,
tdw_mdm_regex_prioritized(email, email,
ARRAY['@company\.com$', '@gmail\.com$', '@.*']) AS primary_email
FROM customer_contacts
GROUP BY customer_id;
-- Output
+-------------+------------------------+
| customer_id | primary_email |
+-------------+------------------------+
| 1 | john@company.com | -- Matches priority 0 (company domain)
| 2 | alice@gmail.com | -- Matches priority 1 (gmail)
+-------------+------------------------+
Example 2: Source system priority by pattern
-- Input data
+-------------+---------+------------------+
| customer_id | name | source |
+-------------+---------+------------------+
| 1 | John S. | legacy-system-v1 |
| 1 | John | crm-prod |
| 1 | Johnny | api-integration |
| 2 | Alice | unknown-source |
| 2 | Alice J | crm-prod |
+-------------+---------+------------------+
-- Query
SELECT customer_id,
tdw_mdm_regex_prioritized(name, source,
ARRAY['^crm-.*', '^api-.*', '^legacy-.*']) AS primary_name
FROM customer_data
GROUP BY customer_id;
-- Output
+-------------+--------------+
| customer_id | primary_name |
+-------------+--------------+
| 1 | John | -- crm-prod matches ^crm-.*
| 2 | Alice J | -- crm-prod matches ^crm-.*
+-------------+--------------+
-- Note: 'unknown-source' doesn't match any pattern, so it's skipped
Window Functions
Window functions compute a value for each row based on all rows in its partition. Unlike aggregation functions, they return a value for every input row rather than collapsing rows into groups.
tdw_rank_by_frequency
tdw_rank_by_frequency(value VARCHAR) OVER (PARTITION BY ...) → BIGINT
tdw_rank_by_frequency(value VARCHAR, weight BIGINT) OVER (PARTITION BY ...) → BIGINT
Returns a rank based on frequency of occurrence within the partition. More frequent values receive better (lower) ranks. Uses RANK semantics with gaps - tied values get the same rank, and the next distinct frequency gets a rank with a gap.
Behavior:
- Most frequent values get rank 1
- Ties in frequency get the same rank
- Next distinct frequency gets rank = (count of values with higher frequency) + 1
- NULL values get NULL rank (not counted in frequency)
- Empty strings are skipped (not counted in frequency)
- Maximum 1,000 unique values per partition (safety limit)
- Weighted variant: sums weights instead of counting occurrences
Example: Rank cities by frequency
-- Input data
+-------------+----------+
| customer_id | city |
+-------------+----------+
| 1 | New York |
| 1 | New York |
| 1 | New York |
| 1 | Boston |
| 1 | Boston |
| 1 | Chicago |
+-------------+----------+
-- Query
SELECT
customer_id,
city,
tdw_rank_by_frequency(city) OVER (PARTITION BY customer_id) AS city_rank
FROM customer_addresses;
-- Output
+-------------+----------+-----------+
| customer_id | city | city_rank |
+-------------+----------+-----------+
| 1 | New York | 1 | -- 3 occurrences (most frequent)
| 1 | New York | 1 |
| 1 | New York | 1 |
| 1 | Boston | 2 | -- 2 occurrences
| 1 | Boston | 2 |
| 1 | Chicago | 3 | -- 1 occurrence
+-------------+----------+-----------+
Example: Weighted frequency (by trust score)
-- Query with trust_score weighting
SELECT
customer_id,
city,
source,
trust_score,
tdw_rank_by_frequency(city, trust_score)
OVER (PARTITION BY customer_id) AS city_rank
FROM customer_addresses;
-- A city appearing once from CRM (trust_score=100) will rank higher than
-- a city appearing twice from Legacy (trust_score=10 each, total=20)
tdw_rank_by_priority
tdw_rank_by_priority(reference VARCHAR, priorities ARRAY<VARCHAR>) OVER (...) → BIGINT
Returns a rank based on exact string matching against a priority array. Earlier positions in the array receive better (lower) ranks. Uses index-based ranking where rank equals array position + 1.
Behavior:
- Matching values get rank = array_index + 1 (1-based)
- Multiple rows matching the same priority get the same rank
- Non-matching values get rank = NULL
- NULL reference value returns NULL rank
- NULL values in the priorities array are skipped
- Case-sensitive comparison
Example: Rank by source system priority
-- Input data
+-------------+---------+---------+
| customer_id | source | name |
+-------------+---------+---------+
| 1 | CRM | John |
| 1 | CRM | Jane |
| 1 | ERP | J. Doe |
| 1 | Legacy | Johnny |
| 1 | Unknown | Jon |
+-------------+---------+---------+
-- Query
SELECT
customer_id,
source,
name,
tdw_rank_by_priority(source, ARRAY['CRM', 'ERP', 'Legacy'])
OVER (PARTITION BY customer_id) AS source_rank
FROM customer_data;
-- Output
+-------------+---------+--------+-------------+
| customer_id | source | name | source_rank |
+-------------+---------+--------+-------------+
| 1 | CRM | John | 1 |
| 1 | CRM | Jane | 1 |
| 1 | ERP | J. Doe | 2 |
| 1 | Legacy | Johnny | 3 |
| 1 | Unknown | Jon | NULL | -- Not in priority array
+-------------+---------+--------+-------------+
Tip: Combine with ORDER BY source_rank ASC NULLS LAST to sort by priority, or filter with WHERE source_rank = 1 to get only the highest-priority records.
tdw_rank_by_regex_priority
tdw_rank_by_regex_priority(reference VARCHAR, patterns ARRAY<VARCHAR>) OVER (...) → BIGINT
Returns a rank based on regex pattern matching against a priority array. Earlier positions in the array receive better (lower) ranks. Uses partial matching - patterns can match anywhere in the string.
Behavior:
- Matching values get rank = array_index + 1 (1-based)
- First matching pattern wins (earlier index = better rank)
- Partial matching: pattern can match substring (use
^and$for full match) - Non-matching values get rank = NULL
- Invalid regex pattern throws an error with descriptive message
Example: Rank emails by domain pattern
-- Query
SELECT
customer_id,
email,
tdw_rank_by_regex_priority(email, ARRAY[
'@company\.com$', -- Rank 1: Company emails
'@partner\.org$', -- Rank 2: Partner emails
'@.*\.edu$', -- Rank 3: Educational
'@gmail\.com$' -- Rank 4: Gmail
]) OVER (PARTITION BY customer_id) AS email_rank
FROM customer_data;
-- Output
+-------------+------------------------+------------+
| customer_id | email | email_rank |
+-------------+------------------------+------------+
| 1 | john@company.com | 1 |
| 1 | john@partner.org | 2 |
| 1 | john@university.edu | 3 |
| 1 | john@gmail.com | 4 |
| 1 | john@unknown.com | NULL | -- No pattern match
+-------------+------------------------+------------+
Function Selection Guide
| Scenario | Recommended Function | Why |
|---|---|---|
| Most complete/detailed value | tdw_mdm_longest |
Longer strings often contain more detail |
| Concise display name | tdw_mdm_shortest |
Shorter strings work better for display |
| Most recent value | tdw_mdm_max_by(value, timestamp) |
Select by maximum timestamp |
| Original/earliest value | tdw_mdm_min_by(value, timestamp) |
Select by minimum timestamp |
| Most trusted source | tdw_mdm_max_by(value, trust_score) |
Select by highest trust/confidence score |
| Any available value | tdw_mdm_coalesce |
Fast, stops at first valid value |
| Most common value | tdw_mdm_majority |
Democratic selection by frequency |
| Weighted voting | tdw_mdm_majority(value, weight) |
Source trust affects vote weight |
| Specific source priority | tdw_mdm_prioritized |
Exact match against priority list |
| Pattern-based priority | tdw_mdm_regex_prioritized |
Flexible regex matching |
| Rank all rows by frequency | tdw_rank_by_frequency |
Window function, returns rank per row |
| Rank all rows by source priority | tdw_rank_by_priority |
Window function with priority array |
Common Patterns
These patterns are typically used when creating Data Products to consolidate and deduplicate records from multiple sources.
Building a Golden Record
-- Combine multiple MDM functions to build a golden record
SELECT
customer_id,
-- Most recent name from CRM system
tdw_mdm_prioritized(name, source, ARRAY['CRM', 'ERP', 'Legacy']) AS golden_name,
-- Most recent email
tdw_mdm_max_by(email, last_updated) AS golden_email,
-- Most common city (weighted by source trust)
tdw_mdm_majority(city, trust_score) AS golden_city,
-- Most complete address (longest)
tdw_mdm_longest(address) AS golden_address
FROM customer_sources
GROUP BY customer_id;
Filtering to Best Records
-- Use window functions to filter to highest-priority records
WITH ranked AS (
SELECT
*,
tdw_rank_by_priority(source, ARRAY['CRM', 'ERP', 'Legacy'])
OVER (PARTITION BY customer_id) AS source_rank
FROM customer_data
)
SELECT *
FROM ranked
WHERE source_rank = 1;
Handling Multiple Attributes
-- Select different fields based on different criteria
SELECT
customer_id,
-- Name from most trusted source
tdw_mdm_max_by(name, trust_score) AS name,
-- Email that's most recently verified
tdw_mdm_max_by(email, email_verified_date) AS email,
-- Phone from most complete record
tdw_mdm_max_by(phone, CAST(LENGTH(phone) AS BIGINT)) AS phone
FROM customer_sources
GROUP BY customer_id;
Multi-Criteria Ranking with Window Functions
Use ranking window functions to score records, then use Trino's min_by to select the best value for each attribute based on its rank. This allows each field to be selected independently using its own quality criteria.
-- Rank records by multiple criteria, then pick best value per attribute
WITH ranked AS (
SELECT
*,
-- Rank by source system priority (for name selection)
tdw_rank_by_priority(source, ARRAY['CRM', 'ERP', 'Legacy'])
OVER (PARTITION BY customer_id) AS source_rank,
-- Rank by data frequency (for city selection)
tdw_rank_by_frequency(city)
OVER (PARTITION BY customer_id) AS city_frequency_rank,
-- Rank by email domain priority (for email selection)
tdw_rank_by_regex_priority(email, ARRAY['@company\.com$', '@gmail\.com$', '.*'])
OVER (PARTITION BY customer_id) AS email_rank
FROM customer_data
)
SELECT
customer_id,
-- Best name: from highest-priority source
min_by(name, source_rank) AS best_name,
-- Best email: from highest-priority domain
min_by(email, email_rank) AS best_email,
-- Best city: most frequently occurring
min_by(city, city_frequency_rank) AS best_city
FROM ranked
GROUP BY customer_id;
How it works:
- ranked CTE: Apply ranking window functions to score each record by different quality dimensions
- min_by aggregation: For each attribute, select the value with the lowest (best) rank
Key advantage: Each attribute can come from a different source record. The best name might come from CRM while the best email comes from a different record with a company domain.
Example with sample data:
-- Input data +-------------+---------+----------------------+----------+--------+ | customer_id | source | email | city | name | +-------------+---------+----------------------+----------+--------+ | 1 | CRM | john@yahoo.com | Boston | John S | | 1 | ERP | john@company.com | New York | J Smith| | 1 | Legacy | john@gmail.com | New York | Johnny | +-------------+---------+----------------------+----------+--------+ -- After ranking and aggregation: -- best_name: "John S" (from CRM, source_rank=1) -- best_email: "john@company.com" (from ERP, email_rank=1) -- best_city: "New York" (appears twice, city_frequency_rank=1) -- Output (golden record with best of each attribute) +-------------+-----------+----------------------+-----------+ | customer_id | best_name | best_email | best_city | +-------------+-----------+----------------------+-----------+ | 1 | John S | john@company.com | New York | +-------------+-----------+----------------------+-----------+