Data Matching
Matching automates the process of identifying records that represent the same real-world entity (aka duplicate records) and merging the identified records into a single consolidated record. To implement matching you first create a matching specification. Then you can:
- Execute it and see the result by right clicking on the specification name;
- Or create the MatchOneSource operation in a scenario that executes the specification.
Matching Specification
Matching specifications are located in the Matching subsection of the Specifications section in the navigation tree. To create a matching specification:
- Right click on the Matching subsection of the Specifications section of the navigation tree and select Create match specification.
- In the pop-up window enter a match specification name and select a data source which contains data you want to match.
A matching specification consists of:
- Match table - name of the input table provided by the user that contains duplicate records to match.
- Consolidated table - name of the result table created by the system that contains consolidated records compiled from matched records.
- One or more match runs. Each match run defines rules to identify duplicates.
- Configuration for manual editing - name of tables that will be created by the system to store the results of manual editing.
- Consolidation rules - Rules to compile a consolidated (aka merged) record from a set of duplicate records.
Match table must have the following columns:
- data_id nvarchar(900) - must be a primary key, a unique identifier that cannot be null.
- set_id nvarchar(900) - must be the same as data_id. The result of matching will be stored in this column. Records with the same set_id are duplicates.
- prev_set_id nvarchar(900) - must be null. It will be used for debugging. It will store the results of the previous match run so that you know what duplicates were identified on the current match run.
- match_specs nvarchar(4000) - must be null. It will be used to store a list of the names of matching runs that effected this record.
Matching specification is executed as follows:
- Match run specifications are executed in the order as specified in the matching specification. Each match run takes the match table as an input and updates the set_id column.
- The manual editing is applied. You can review and edit the result of matching and consolidation in the Stewardship User Interface (UI) described below.
- Consolidation rules are executed. Consolidation takes the match table as an input and creates the consolidated table.
Match Run Specification
The matching specification can contain one or more match runs that are executed in the order as specified. Match runs define rules to match duplicates. Each match run specification consists of:
- Name of the match run that is used for diagnostics and also will be appended to the match_specs column of the match table.
- Type of the match run. It can be Conditional or Probabilistic.
- Match columns that are used to identify duplicates.
Conditional Type of Match Run
In a conditional type of match run, list match column names. Records which have equal values in the match columns are considered duplicates. The conditional type of match run is simple, yet very powerful.
Probabilistic type of match run
Probabilistic type of matching allows you to match records by a combination of columns without specifying the exact set of columns that must be matched. It is implemented by comparing all pairs of records, computing a composite score for each pair of records, and comparing the composite score to a cutoff value.
In the probabilistic type of matching you can use various comparison functions (specified in the comparison column in the list of Match columns) for comparing column values:
- equal - exact equality. The result of an equal operation is computed as percentage: 0 - not equal; 100 - equal.
- edit_distance - is a way of quantifying how dissimilar two strings are from one another by counting the minimum number of operations required to transform one string into the other. The result is computed as percentage: from 0 - not equal to 100 - exactly equal.
The composite score of comparing two records is computed as a sum of scores of column comparisons for each column in the list ofMatch columns divided by the number of the Match columns. The score of column comparison is computed as follows:
<Comparison function result> * <Column Uniqueness> * <Column Weight>
Column Uniqueness is a decimal value in in the range of 0 to 1. Uniqueness is inversely proportional to the probability of a random chance agreement of column values in a pair of nonmatched records. For example, the probability that two gender values agree at random is 0.5 (i.e. uniqueness is 0.5 in this case); the probability that two SSN values agree at random is 0 thus uniqueness is 1.
Column Weight is a decimal in the range of 0 to 1. It is used to incorporate any additional knowledge about column importance.
Performance optimization tip:
For tables of reasonable size, it is infeasible to compare all record
pairs, because the number of possible pairs is the product of the
number of records in such tables. To void all-pairs comparison,
internally the optimizer implements a blocking technique. Blocking
partitions records into mutually exclusive and exhaustive subsets, and
the matching process searches for matches only within a subset.
Blocking is implemented by selecting blocking columns (i.e. match
columns which comparison function is equal) and preliminarily grouping
records by the blocking columns. Thus practical recommendation is to
have at least one blocking column (the more the better) in each Match
Run.
There are two types of cutoff values: Match Cutoff, which is considered a perfect match; and Clerical Cutoff, which is a match that should be reviewed. Precise rules are as follows:
- Two records are a Match if the composite weight is greater or equal to Match Cutoff value;
- Two records are Clerical if the composite weight is greater or equal to Clerical Cutoff value and less than Match Cutoff value;
- Two records are a nonmatch if the Composite Weight is less than Clerical Cutoff Value.
To determine Match and Clerical Cutoff values you should review the Weight Histogram Table, which will be created as result of running the match run in the determine cutoffs mode. The Weight Histogram Table contains two columns:
- weight - the composite score
- c - the number of pairs that have this composite score.
Manual Matching
In the Manual matching section of matching, you specify the names of two tables that will be used to store the results of manual editing in the Stewardship User Interface. These tables will be created and managed by the system so you do not need to understand their structure.
Consolidation Rules
Consolidation rules define how to compile a consolidated (aka merged) record from a set of matched duplicates. Consolidation takes the match table as an input and creates a new consolidated table with the name defined in the matching specification (i.e. Consolidated table field). A set of matched records are those that have the same set_id in the match table.
Consolidation rules consists of a list of columns of the consolidated table. Those columns are called Survival columns. For each column you specify a strategy to compute the value of the column (Survivorship Function). You can select from the following options:
- max - select the maximum value among all values from the matched records;
- min - select the minimum value among all values from the matched records;
- most_frequent - select the most frequent value among all values from the matched records.
In addition to the per-column compilation, you can specify a strategy to select a primary record from the matched records and then prefer the value of the corresponding column from the primary record when you compile a value for a consolidated column. Preferring primary record value (Prefer Primary Record in the Survival columns) means that if the value of the column in the primary record is not null then it will be chosen as the value of the consolidated column; if the value of the column in the primary record is null then the per-column computation strategy (Survivorship Function) is used to compute the value of the consolidated column. Preferring the value from the primary record is useful to get a consistent set of values across the columns of the consolidated record. For example, you want to have all fields of address to be chosen from the same record, as different duplicate records might have different addresses. There are two options for selecting a primary record:
- any - selects a primary record randomly.
- source_priority - selects a primary record based on sources priority specified in thesources priority table.Sources priority table must contain two columns: source_name(nvarchar) and priority (int). You also specify Source column name in the match result table that must contains values matchingsource_name in the sources priority table.
Executing Matching Specification
- Right click on the matching specification in the navigation tree. Matching specifications are located in the Matching subsection of the Specifications section of the navigation tree.
-
You will see the following options you can select from:
- Execute all will execute all the steps of the specification: matching, applying manual matching, and computing consolidated records.
- Execute all match runs executes only match runs.
- Execute consolidation executes only consolidation rules.
- Execute match run… executes a particular match run (selected from a dropdown list) in any of the two modes:match executes the match run completely; determine cutoffs executes only part of the match run to generate the Weight Histogram Table which is used to determine cutoff values in the Probabilistic type of match run.
Stewardship User Interface
Via Stewardship user interface (UI) you can view and edit the result of matching. Viewing allows you to see consolidated records with links to source records. Editing allows you to manually merge and unmerge records, and change the selection of the consolidated record values.
To open Stewardship UI:
- Right click on the matching specification in thenavigation tree and select Stewardship UI.
View/Search Matching Results
In the Stewardship UI you can:
- Search : You can search through consolidated records. Use the Search field to enter any predicate in SQL language. You can limit the result of the search by changing the Limit field. Click the Apply button after you change the search parameters to update the result of the search.
- Sort : You can sort the result of searching by selecting the Order by column from the dropdown menu and specifying the direction of ordering: desc (descending) or asc (ascending).
- Show only summary columns. Select the check box called Only summary columns.
· See source records. Click on the triangle sign on the left of a consolidated record to see source records, which are duplicate records used to compile the consolidated record.
Manually Merge Records
If the result of matching still contains duplicates you can merge them manually as follows:
- Right click on a consolidated record you want to merge and select Make target. The record will appear at the top of the window in the target box.
- You can use search to find another record you want to merge with the target.
- Right click on a record that you want to merge with the target and select Merge with target. The clicked record and the target records will be merged. The target will now represent the merged record.
- You can repeat the previous two steps for any number of records.
- The results of manual merging will be saved and applied on subsequent executions of the matching specification.
Manually Unmerge Records
If you see that some non-duplicate records were merged, you can unmerge them manually as follows:
- Right click on a consolidated record that is compiled from the source records that you want to unmerge. Select Unmerge. Each source record will become a separate record from which the consolidated record will be compiled.
Note: You might not (and usually will not) see the unmerged records in the current window because the current window shows only (consolidated) records that meet the search criteria specified in the search fields and unmerge records might not meet the criteria. Nevertheless you often want to merge the unmerged records with each other or some other records. To find recently unmerged records click on the Unmerge manually tab at the top of the window.
The results of manual unmerging will be saved and applied on subsequent executions of the matching specification.
Change Selection of Consolidated Record Values
The values of a consolidated record are selected form the corresponding columns of the source records by the Consolidation rules. If you do not like the selection for a particular consolidated value you can change it as follows:
- Click on the triangle sign on the left of a consolidated record to see the source records.
- Right click on a value in a source record and click Select as master value. The source value will appear in the corresponding column of the consolidated record.
The result of manual selection will be saved and applied on subsequent executions of the matching specification.