Configuring the match rules
On the Rule Configuration panel, you can:
- Configure one of the predefined rules from Template Rules, located in the left corner of the page, which you can use as-is, or
- Configure a new match rule and publish it to the repository for re-use. You can configure a new match rule without using one of the predefined match rules. For more information, see Creating a match rule.
Creating a match rule
- On the Rules panel, click Create new rule.
Note: By default, Custom is added as a prefix to the Rule Name followed by the name you define in Step 2, for example, Custom - xyz.
- Specify the dataflow fields, parent or child, that you want to use in the match rule and match rule hierarchy.
- Click button and enter a name for the parent below Match when not true.
Note: The name you enter for your first parent in the hierarchy is used as the match rule name, for example, Custom - xyz.
- Click button and select a field to add to the parent from the drop-down list below Match when not true.
Note: All children under a parent must use the same logical operator. If you want to use different logical operators between fields, you must first create intermediate parents.
- Click button and enter a name for the parent
below Match when not true.
- Define these parent options as listed in the table below, which are
displayed on the parent node:
Option Description
Match when not true
It changes the logical operator for the parent from and to and not. If you select this option, records will only match if they do not match the logic defined in this parent.
Note: If you select the Match when not true option, it negates the Matching Method options. For more information, see Negative Match Conditions.
Matching Method
Select one of these from the drop-down list to determine if a parent is a match or non-match:
- All true: A parent is considered a match if all children are determined to match. This method creates an "and" connector between children.
- Any true: A parent is considered a match if at least one child is determined to match. This method creates an "or" connector between children.
- Based on threshold: A parent is considered a match if its score is at or above the threshold you specify. If you select this option, the Threshold field enables you to specify a threshold value. The Scoring Method determines which logical connector to use. The threshold at the parent cannot be higher than the threshold of the children.
For more information, see the matching method-to-scoring method matrix below this table.
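The three Matching Method options and the Match when not true option can be sketched as a small function. This is a minimal illustration, not the product's implementation: the function name, the option strings, and the rule that Based on threshold compares the parent score against the threshold with `>=` are all assumptions made for the example.

```python
from typing import Sequence


def parent_matches(
    child_matches: Sequence[bool],
    method: str,
    parent_score: float = 0.0,
    threshold: float = 0.0,
    match_when_not_true: bool = False,
) -> bool:
    """Decide whether a parent node matches (hypothetical helper)."""
    if method == "all_true":
        # "and" connector: every child must match.
        result = all(child_matches)
    elif method == "any_true":
        # "or" connector: one matching child is enough.
        result = any(child_matches)
    elif method == "based_on_threshold":
        # Assumption: the parent score is compared with the threshold.
        result = parent_score >= threshold
    else:
        raise ValueError(f"unknown matching method: {method}")
    # Match when not true negates the outcome of the matching method.
    return not result if match_when_not_true else result
```

For example, `parent_matches([True, False], "any_true", match_when_not_true=True)` negates an "or" result that would otherwise be a match.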
Missing Data
Select one of these from the drop-down list to specify how to score blank data in a field:
- Ignore blanks: Ignores the field if it contains blank data.
- Count as 0: Scores the field as 0 if it contains blank data.
- Count as 100: Scores the field as 100 if it contains blank data.
- Compare blanks: Scores the suspect and candidate fields as 100 if they both contain blank data; otherwise, scores the suspect and candidate fields as 0.
Scoring Method
Select one of these from the drop-down list to determine the matching score:
- Weighted Average: Uses the weight of each child to determine the average match score.
- Average: Uses the average score of each child to determine the score of a parent.
- Maximum: Uses the highest child score to determine the score of a parent.
- Minimum: Uses the lowest child score to determine the score of a parent.
- Vector Summation: Uses the vector summation of each child score to determine the score of the parent. The formula for calculation is:
sqrt(a^2+b^2+c^2) / sqrt(n), where a, b, and c are the scores of three children, and n is the number of children.
For more information, see the matching method-to-scoring method matrix below this table.
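The five scoring methods can be compared side by side with a short sketch. The function and option names are hypothetical, and normalizing the weighted average by the sum of the weights is an assumption; only the Vector Summation formula is taken directly from the text above.

```python
import math
from typing import Optional, Sequence


def combine_scores(
    scores: Sequence[float],
    method: str,
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine child scores (0-100) into one parent score (hypothetical helper)."""
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    if method == "weighted_average":
        if weights is None or len(weights) != n:
            raise ValueError("one weight per score is required")
        total = sum(weights)
        if total == 0:
            raise ValueError("weights must not sum to zero")
        # Assumption: weights are normalized by their sum.
        return sum(s * w for s, w in zip(scores, weights)) / total
    if method == "average":
        return sum(scores) / n
    if method == "maximum":
        return max(scores)
    if method == "minimum":
        return min(scores)
    if method == "vector_summation":
        # sqrt(a^2 + b^2 + c^2 + ...) / sqrt(n)
        return math.sqrt(sum(s * s for s in scores)) / math.sqrt(n)
    raise ValueError(f"unknown scoring method: {method}")
```

With scores of 100, 80, and 60, Average gives 80 while Vector Summation gives roughly 81.65, because squaring lets the higher scores contribute proportionately more.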
Evaluate
For more information, see Evaluating a match rule.
Copy settings to
It allows you to copy the same settings for any number of elements.
- Use the drop-down list to select or de-select the elements.
- Click Apply adjacent to the Copy Settings to field to copy and apply the same settings for the selected elements.
Note: You can copy the parent settings to a parent element and child settings to a child element only.
Matching Method-to-Scoring Method Matrix
The table below shows the logical relationship between Matching Method and Scoring Method and how each combination changes the logic used during match processing.
Scoring Method | Matching Method: Any true | Matching Method: All true | Matching Method: Based on threshold | Comments
Weighted Average | NA | and | and | Only available when All true or Based on threshold is selected as the Matching Method.
Average | NA | and | and |
Vector Summation | NA | and | and |
Maximum | or | NA | or | Only available when Any true or Based on threshold is selected as the Matching Method.
Minimum | or | NA | or |
- Define these child options as listed in the table below, which are displayed on the child node:
Option Description
Match when not true
It changes the logical operator from and to not. If you select this option, the match rule will only evaluate to true if the records do not match the logic defined in this child.
For example, if you want to identify individuals who are associated with multiple accounts, you could create a match rule that matches the name but where the account number does not match. You would use the Match when not true option for the child that matches the account number.
Candidate field
Select this to map the child record field you select from the drop-down list to a field in the input file.
Cross match against
Select this to choose one or more field names from the drop-down list to match different fields to one another between two records.
Threshold
Enter the threshold that must be met at the individual field level for that field to be determined a match.
Missing Data
Select one of these from the drop-down list to specify how to score blank data in a field:
- Ignore blanks: Ignores the field if it contains blank data.
- Count as 0: Scores the field as 0 if it contains blank data.
- Count as 100: Scores the field as 100 if it contains blank data.
- Compare blanks: Scores the suspect and candidate fields as 100 if they both contain blank data; otherwise, scores the suspect and candidate fields as 0.
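The four Missing Data options can be sketched as follows. The function names and option strings are hypothetical, and two details are assumptions: a field counts as blank when either the suspect or the candidate value is empty, and Ignore blanks is modeled by returning `None` so that the caller can leave the field out of the score.

```python
from typing import Callable, Optional


def is_blank(value: Optional[str]) -> bool:
    """Treat None, empty, and whitespace-only values as blank."""
    return value is None or value.strip() == ""


def score_field(
    suspect: Optional[str],
    candidate: Optional[str],
    option: str,
    compare: Callable[[str, str], float],
) -> Optional[float]:
    """Score one field, applying a Missing Data option when data is blank."""
    s_blank, c_blank = is_blank(suspect), is_blank(candidate)
    if not (s_blank or c_blank):
        # No blank data: score the field with the configured comparison.
        return compare(suspect, candidate)
    if option == "ignore_blanks":
        return None  # field is left out of the score
    if option == "count_as_0":
        return 0.0
    if option == "count_as_100":
        return 100.0
    if option == "compare_blanks":
        # 100 only when both sides are blank, otherwise 0.
        return 100.0 if (s_blank and c_blank) else 0.0
    raise ValueError(f"unknown missing-data option: {option}")
```

Any comparison function that returns a score from 0 to 100 can be passed as `compare`, for example an exact-match check.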
Scoring Method
Select one of these from the drop-down list to determine the matching score:
- Weighted Average: Uses the weight of each algorithm to determine the average match score.
- Average: Uses the average score of each algorithm to determine the match score.
- Maximum: Uses the highest algorithm score to determine the match score.
- Minimum: Uses the lowest algorithm score to determine the match score.
- Vector Summation: Uses vector summation of the score of each algorithm to determine the match score. This scoring method is useful if you want a higher match score from one or more algorithms to be proportionately represented in the final match score. The formula for calculating the final score is:
sqrt(a^2+b^2+c^2) / sqrt(n), where a, b, and c are the scores of three different algorithms, and n is the number of algorithms used.
Evaluate
For more information, see Evaluating a match rule.
Copy Settings to
It allows you to copy the same settings for any number of elements.
- Use the drop-down list to select or de-select the elements.
- Click Apply adjacent to the Copy Settings to field to copy and apply the same settings for the selected elements.
Note: You can copy the parent settings to a parent element and child settings to a child element only.
- To configure algorithms for your child type, click Configure Algorithms on the child options node to add one or more of these algorithms to determine the match in the field values:
Note: Use Search to selectively configure the algorithms.
String Matching Algorithms
- Acronym
- It determines whether a business name matches its acronym by
looking for acronym data; otherwise, it creates an acronym using
the first character of every word.
For example, Internal Revenue Service and its acronym IRS would be considered a match and return a match score of 100.
- Character Frequency
- It determines the frequency of occurrence of each character in a string and compares the overall frequencies between two strings.
- Exact Match
- It determines if two strings are the same.
- Initials
- It matches the initials for parsed personal names.
- Name Variant
- It determines whether two names are variants of each other. The
algorithm returns a match score of 100 if two names are
variations of each other, and a match score of 0 if two names
are not variations of each other.
For example, JOHN is a variation of JAKE and returns a match score of 100. JOHN is not a variant of HENRY and returns a match score of 0.
Click Edit to specify the name variant options. For more information, see Name Variant Finder.
- Numeric String
- It compares address lines by separating the numerical attributes
of an address line from the characters. See the examples
below.
- In the string address 1234 Main Street Apt 567, the
numerical attributes of the string (1234567) are parsed
and handled differently from the remaining string value
(Main Street Apt). The algorithm first matches numeric
data in the string with the numeric algorithm. If the
numeric data match is 100, the alphabetic data is
matched using Edit distance and Character Frequency. The
final match score is calculated as
follows:
(numericScore + (EditDistanceScore + CharacterFrequencyScore) / 2) / 2
- If you calculate the match score of these two addresses:
123 Main St Apt 567
123 Maon St Apt 567
the match score would be 95.5, calculated as follows:
Numeric Score = 100
Edit Distance = 91
Character Frequency = 91
91 + 91 = 182
182/2 = 91
100 + 91 = 191
191/2 = 95.5
- SubString
- It determines whether one string occurs within another.
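The Numeric String calculation described earlier in this list can be reproduced with a short sketch. The helper names are hypothetical; the splitting step is an assumed way to separate digits from the remaining text, and the component scores are passed in rather than computed, since the exact numeric, Edit Distance, and Character Frequency implementations are not specified here.

```python
import re
from typing import Tuple


def split_numeric(address: str) -> Tuple[str, str]:
    """Separate the digits of an address line from the remaining text."""
    digits = "".join(re.findall(r"\d", address))
    text = " ".join(re.sub(r"\d+", " ", address).split())
    return digits, text


def numeric_string_score(
    numeric_score: float,
    edit_distance_score: float,
    character_frequency_score: float,
) -> float:
    """(numericScore + (EditDistanceScore + CharacterFrequencyScore) / 2) / 2"""
    alphabetic_score = (edit_distance_score + character_frequency_score) / 2
    return (numeric_score + alphabetic_score) / 2


digits, text = split_numeric("1234 Main Street Apt 567")
# digits is "1234567" and text is "Main Street Apt"
score = numeric_string_score(100, 91, 91)
# score is 95.5, as in the worked example
```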
Phonetic Algorithms
- Daitch-Mokotoff Soundex
- A Phonetic algorithm that allows greater accuracy in matching of Slavic and Yiddish surnames with similar pronunciation but differences in spelling. Coded names are six digits long, and multiple possible encodings can be returned for a single name. This option was developed to respond to the limitations of Soundex in the processing of Germanic or Slavic surnames.
- Double Metaphone
- It determines the similarity between two strings based on a phonetic representation of their characters. Double Metaphone is an improved version of the Metaphone algorithm and attempts to account for the many irregularities found in different languages.
- Koeln
- Indexes names by sound as they are pronounced in German. Allows names with the same pronunciation to be encoded to the same representation so that they can be matched, despite minor differences in spelling. The result is always a sequence of numbers; special characters and white spaces are ignored. This option was developed to respond to the limitations of Soundex.
- Metaphone
- It determines the similarity between two English-language strings based on a phonetic representation of their characters. This option was developed to respond to the limitations of Soundex.
- Metaphone (Spanish)
- It determines the similarity between two strings based on a phonetic representation of their characters. This option was developed to respond to the limitations of Soundex.
- Metaphone3
- It improves upon the Metaphone and Double Metaphone algorithms with more exact consonant and internal vowel settings that allow you to produce words or names more or less closely matched to search terms on a phonetic basis. Metaphone3 increases the accuracy of phonetic encoding to 98%. This option was developed to respond to the limitations of Soundex.
- Nysiis
- It is a phonetic code algorithm, part of the New York State Identification and Intelligence System, that matches an approximate pronunciation to an exact spelling and indexes words that are pronounced similarly. For example, consider that you are looking for someone's information in a database of people. You believe that the person's name sounds like "John Smith," but it is spelled "Jon Smath". If you conducted a search looking for an exact match for "John Smith," no results would be returned. However, if you index the database using the NYSIIS algorithm and search using the NYSIIS algorithm again, the correct match will be returned because both "John Smith" and "Jon Smath" are indexed as "JANSNATH" by the algorithm. This option was developed to respond to limitations of Soundex; it handles some multicharacter n-grams and maintains relative vowel positioning, whereas Soundex does not.
Note: This algorithm does not process non-alpha characters; records containing them will fail during processing.
- Phonix
- It preprocesses name strings by applying more than 100 transformation rules to single characters or sequences of several characters. Nineteen of those rules are applied only if the characters are at the beginning of the string, while 12 of the rules are applied only if they are at the middle of the string, and 28 of the rules are applied only if they are at the end of the string. The transformed name string is encoded into a code that is comprised of a starting letter followed by three digits (removing zeros and duplicate numbers). This option was developed to respond to the limitations of Soundex; it is more complex and, therefore, slower than Soundex.
- Sonnex
- It determines the similarity between two French-language strings based on the phonetic representation of their characters. It returns a Sonnex coded key of the selected fields.
- Soundex
- It determines the similarity between two strings based on a phonetic representation of their characters.
- Syllable Alignment
- It combines phonetic information with edit distance based calculations. Converts the strings to be compared into their corresponding sequences of syllables and calculates the number of edits required to convert one sequence of syllables to the other.
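Several of the phonetic algorithms above are described as responses to the limitations of Soundex, so a sketch of the classic American Soundex encoding may help as a baseline. This is the standard textbook algorithm, not necessarily the exact variant the product uses, and it does not show how the product turns two keys into a match score.

```python
# Consonant groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex key for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets the previous code
    return (key + "000")[:4]
```

Names that sound alike, such as Robert and Rupert, receive the same key (R163), which is what allows them to be matched despite different spellings.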
Similarity and Distance Measures
- Edit Distance
- It determines the similarity between two strings based on the number of deletions, insertions, or substitutions required to transform one string into another.
- Euclidean Distance
- It provides a similarity measure between two strings using the vector space of combined terms as the dimensions. Each string is represented as a vector of term counts, and the distance is the square root of the sum of the squared differences between corresponding counts; a smaller distance indicates more similar strings.
- Jaro-Winkler Distance
- It determines the similarity between two strings based on the number of character replacements it takes to transform one string into another. This option was developed for short strings, such as personal names.
- Keyboard Distance
- It determines the similarity between two strings based on the
number of deletions, insertions, or substitutions required to
transform one string to the other, weighted by the position of
the keys on the keyboard.
Click Edit to specify the type of keyboard you are using: QWERTY (U.S.), QWERTZ (Austria and Germany), or AZERTY (France).
- Kullback-Liebler Distance
- It determines the similarity between two strings based on the differences between the distribution of words in the two strings.
- NGram Distance
- It calculates in text or speech the probability of the next term based on the previous n terms, which can include phonemes, syllables, letters, words, or base pairs and can consist of any combination of letters.
Click Edit to enter the size of the NGram; the default is 2.
- NGram Similarity
- It determines the similarity between two strings based on the
length of the longest common subsequence of phonemes, syllables,
letters, words, or base pairs.
Click Edit to specify these options:
- Ngram size: Enter the size of the NGram. The default is 2.
- Drop Noise Characters: Select to replace punctuation with space.
- Drop Spaces: Select to merge words.
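As an illustration of the Edit Distance measure in this list, the sketch below counts deletions, insertions, and substitutions with the standard Levenshtein dynamic-programming method. Converting the count to a 0-100 score by dividing by the longer string's length is an assumption for this example, not a documented product formula.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of deletions, insertions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def edit_distance_score(a: str, b: str) -> float:
    """Assumed normalization: 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)
```

Under this assumed normalization, "Main St Apt" and "Maon St Apt" differ by one substitution in eleven characters and score about 90.9.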
Date Algorithms
- Date
- It compares date fields regardless of the date format in the
input records. Click Edit to specify
these:
- General Options—Require Month: It prevents a date that consists only of a year from matching.
- General Options—Require Day: It prevents a date that consists only of a month and year from matching.
- General Options—Match Transposed MM/DD: Where month and day are provided in numeric format, it compares suspect month to candidate day and suspect day to candidate month as well as the standard comparison of the suspect month to candidate month and suspect day to candidate day.
- General Options—Prefer DD/MM/YYYY format over
MM/DD/YYYY: It contributes to date parsing in
cases where both month and day are provided in numeric
format, and their identification cannot be determined
by context.
For example, given the numbers 5 and 13, the parser will automatically assign 5 to the month and 13 to the day because there are only 12 months in a year. However, given the numbers 5 and 12 (or any two numbers 12 and under), the parser will assume whichever number is first to be the month.
If you select this option, it ensures that the parser reads the first number as the day rather than the month.
- Range Options—Overall: It allows you to set the
maximum number of days between matching dates. See the
examples below.
- If you enter an overall range of 35 days and your candidate date is December 31, 2000, a suspect date of February 5, 2001, would be a match, but a suspect date of February 6 would not.
- If you enter an overall range of 1 day and your candidate date is January 2000, a suspect date of 1999 would be a match (comparing December 31, 1999), but a suspect date of January 2001 would not.
- Range Options—Year: It allows you to set the
number of years between matching dates, independent of
month, and day. See the examples below.
- If you enter a year range of 3 and your candidate date is January 31, 2000, a suspect date of January 31, 2003, would be a match, but a suspect date of February 2003 would not.
- If your candidate date is 2000, a suspect date of March 2003 would be a match because months are not in conflict, and it's within the three-year range.
- Range Options—Month: It allows you to set the
number of months between matching dates, independent of
year and day.
For example, if you enter a month range of 4 and your candidate date is January 1, 2000, a suspect date of May 2000 is a match because there is no day conflict and it's within the four-month range, but a suspect date of May 2, 2000, is not, because of the day's conflict.
- Range Options—Day: It allows you to set the
number of days between matching dates, independent of
year and month.
For example, if you enter a day range of 5 and your candidate date is January 1, 2000, a suspect date of January 2000 is a match because there is no day conflict but a suspect date of December 27, 1999, is not, because of the month's conflict.
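The month and day disambiguation described for the Prefer DD/MM/YYYY format over MM/DD/YYYY option can be sketched as follows. The function name and its `prefer_day_first` parameter are hypothetical, and the sketch covers only the decision between two numbers, not full date parsing.

```python
from typing import Tuple


def resolve_month_day(
    first: int, second: int, prefer_day_first: bool = False
) -> Tuple[int, int]:
    """Return (month, day) for two numbers taken from a numeric date."""
    first_is_month = 1 <= first <= 12
    second_is_month = 1 <= second <= 12
    if not (first_is_month or second_is_month):
        raise ValueError("neither number can be a month")
    if first_is_month and second_is_month:
        # Ambiguous: the first number is the month unless DD/MM is preferred.
        month, day = (second, first) if prefer_day_first else (first, second)
    elif first_is_month:
        month, day = first, second
    else:
        month, day = second, first
    if not 1 <= day <= 31:
        raise ValueError("day out of range")
    return month, day
```

Given 5 and 13, the result is month 5 and day 13 regardless of the preference; given 5 and 12, the preference decides which number is read as the month.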
The table below shows whether you can use a single algorithm or multiple algorithms with each parent Scoring Method.
Scoring Method | Single algorithm | Multiple algorithms
Weighted Average | NA | Yes
Average | NA | Yes
Vector Summation | Yes | Yes
Maximum | NA | Yes
Minimum | NA | Yes
- Click Ok.
Note:
- If you define a large number of parent and child elements, use Filter to selectively look for the elements.
- If you want to expand or collapse all the tree nodes, click the Expand all and Collapse all buttons.
- (Optional) If you want to store the match rule in the repository for re-use, click Publish.
The match rule is stored in the Repository Rules panel, located in the left corner of the page.
- In the top-right corner of the page, click Apply.
Your match rule applies to the Intraflow Match stage in the dataflow.
Evaluating a match rule
- In the match rule hierarchy, choose the node you want to test and click
Evaluate.
You see the Evaluate page.
- In the Input panel, enter the test data in these
ways:
- To enter the test data manually:
- Enter a suspect record under the Suspect column and up to ten candidates under Candidate columns.
- Click Export to save the records to a file that you can import later instead of reentering the data manually.
- To import the test data from a file:
- Click Import.
- Select the file that contains the sample records. Delimited files can be comma, pipe, or tab-delimited and should have a header record with header fields that match the field names shown under Candidates. For example, a sample header record for household input would be Name,AddressLine1,City,StateProvince.
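A delimited test file with the header format described above can be generated with a short script. This is a sketch only: the function name is hypothetical, the record values are made-up sample data, and the header uses the household example fields, which you would replace with the field names shown under Candidates.

```python
import csv
import io
from typing import Iterable, Sequence

# Header fields from the household example; adjust to your own field names.
HEADER = ["Name", "AddressLine1", "City", "StateProvince"]


def build_test_file(rows: Iterable[Sequence[str]], delimiter: str = ",") -> str:
    """Return delimited text with a header record followed by the rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample records for illustration.
sample_rows = [
    ["John Smith", "123 Main St Apt 567", "Boston", "MA"],
    ["Jon Smath", "123 Maon St Apt 567", "Boston", "MA"],
]
content = build_test_file(sample_rows)
```

Passing `delimiter="|"` or `delimiter="\t"` produces the pipe-delimited or tab-delimited variants.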
- Select one of these methods to evaluate your match rule:
- Current Rule: This runs the rule defined on the Match Rule panel. The results are displayed for one suspect and candidate pair at a time. To cycle through the results, select the arrows. Scores for fields and algorithms are displayed in a tree format similar to the match rule control.
Note: If you make changes to the match rule and want to apply the changes to the stage's match rule, click Save in the top-right corner of the page.
- All Algorithms: This ignores the match rule and instead runs all algorithms against each field for suspect and candidate pairs. Results are displayed for one suspect and candidate pair at a time and can be cycled through using the arrows.
Results are color-coded as follows:
- Green: The rule resulted in a match.
- Red: The rule did not result in a match.
- Gray: The rule was ignored.
- Purple: The results for individual algorithms within the rule.