Building a Match Rule
Match rules are used in Interflow Match, Intraflow Match, and Transactional Match to define the criteria that determine if one record matches another. Match rules specify the fields to compare, how to compare the fields, and a hierarchy of comparisons for complex matching rules.
You can build match rules in Interflow Match, Intraflow Match, and Transactional Match. You can also build match rules in the Enterprise Designer Match Rule Management tool. Building a rule in the Match Rule Management tool makes the rule available to use in any dataflow, and also makes it available to other users. Building a match rule in one of the matcher stages makes the rule available only for that stage, unless you save the rule by clicking the Save button, which makes it available to other stages and users.
- Open Enterprise Designer.
-
Do one of the following:
- If you want to define a match rule in Interflow Match, Intraflow Match, or Transactional Match, double-click the match stage for which you want to define a match rule. In the Load match rule field, choose a predefined match rule as a starting point. If you want to start with a blank match rule, click New.
- If you want to define a match rule in the Match Rule Management tool, select Tools > Match Rule Management. If you want to use an existing rule as a starting point for your rule, check the Copy from box and select the rule to use as a starting point.
-
Specify the dataflow fields you want to use in the match rule as well as the
match rule hierarchy.
- Click Add Parent.
- Type in a name for the parent. The name must be unique and it cannot be a field. The first parent in the hierarchy is used as the match rule name in the Load match rule field. All custom match rules that you create and predefined rules that you modify are saved with the word "Custom" prepended to the name.
-
Click Add Child. A drop-down menu appears in the
rule hierarchy. Select a field to add to the parent.
Note: All children under a parent must use the same logical operator. If you want to use different logical operators between fields you must first create intermediate parents.
- Repeat to complete your matching hierarchy.
-
Define parent options. Parent options are displayed to the right of the rule
hierarchy when a parent node is selected.
-
Click Match when not true to change the logical
operator for the parent from AND to AND NOT. If you select this option,
records will only match if they do not match the logic defined in this
parent.
Note: Checking the Match when not true option has the effect of negating the Matching Method options. For more information, see Negative Match Conditions.
-
In the Matching Method field, specify how to
determine if a parent is a match or a non-match. One of the
following:
- All true
- A parent is considered a match if all children are determined to match. This method creates an "AND" connector between children.
- Any true
- A parent is considered a match if at least one child is determined to match. This method creates an "OR" connector between children.
- Based on threshold
- A parent is considered a match if the score of the parent is
greater than or equal to the parent's threshold. When you
select this option, the Threshold
slider appears. Use this slider to specify a threshold. The
scoring method determines which logical connector to use.
Thresholds at the parent cannot be higher than the threshold
of the children.Note: The threshold set here can be overridden at runtime in the Dataflow Options dialog box. Go to Edit > Dataflow Options and click Add. Expand the stage, click Top level threshold, and enter the threshold in the Default value field.
-
In the Missing Data field, specify how to score
blank data in a field. One of the following:
- Ignore blanks
- Ignores the field if it contains blank data.
- Count as 0
- Scores the field as 0 if it contains blank data.
- Count as 100
- Scores the field as 100 if it contains blank data.
- Compare Blanks
-
Scores the suspect and candidate fields as 100 if they both contain blank data; otherwise, scores the suspect and candidate fields as 0.
-
In the Scoring method field, select the method
used for determining the matching score. One of the following:
- Weighted Average
- Uses the weight of each child to determine the average match score.
- Average
- Uses the average score of each child to determine the score of a parent.
- Maximum
- Uses the highest child score to determine the score of a parent.
- Minimum
- Uses the lowest child score to determine the score of a parent.
- Vector Summation
- Uses the vector summation of each child score to determine
the score of the parent. The formula for calculation
is:
sqrt(a^2 + b^2 + c^2) / sqrt(n), where: a, b, and c are the scores of three children and n is the number of children.
The following table shows the logical relationship between matching methods and scoring methods and how each combination changes the logic used during match processing.
Table 1. Matching Method-to-Scoring Method Matrix Scoring Method Matching Method Comments Any True All True Based on Threshold Weighted Average n/a AND AND Only available when All True or Based on Threshold are selected as the Matching Method. Average n/a AND AND Vector Summation n/a AND AND Maximum OR n/a OR Only available when Any True or Based on Threshold are selected as the Matching Method. Minimum OR n/a OR
-
Click Match when not true to change the logical
operator for the parent from AND to AND NOT. If you select this option,
records will only match if they do not match the logic defined in this
parent.
-
Define child options. Child options are displayed to the right of the rule
hierarchy when a child is selected.
- Check the option Candidate field to map the child record field selected to a field in the input file.
- Check the option Cross match against and select one or more items from the dropdown list to match different fields to one another between two records. If you are using the Match Rule Management tool to create or edit a match rule, there will be no dropdown and you will instead need to enter each field name, separated by commas.
-
Click Match when not true to change the logical
operator from AND to NOT. If you select this option, the match rule will
only evaluate to true if the records do not match the logic defined in
this child.
For example, if you want to identify individuals who are associated with multiple accounts, you could create a match rule that matches on name but where the account number does not match. You would use the Match when not true option for the child that matches the account number.
-
In the Missing Data field, specify how to score
blank data in a field. One of the following:
- Ignore blanks
- Ignores the field if it contains blank data.
- Count as 0
- Scores the field as 0 if it contains blank data.
- Count as 100
- Scores the field as 100 if it contains blank data.
- Compare Blanks
-
Scores the suspect and candidate fields as 100 if they both contain blank data; otherwise, scores the suspect and candidate fields as 0.
- In the Threshold field, specify the threshold that must be met at the individual field level in order for that field to be determined a match.
-
In the Scoring method field, select the method
used for determining the matching score. One of the following:
- Weighted Average
- Uses the weight of each algorithm to determine the average match score.
- Average
- Uses the average score of each algorithm to determine the match score.
- Maximum
- Uses the highest algorithm score to determine the match score.
- Minimum
- Uses the lowest algorithm score to determine the match score.
- Vector Summation
- Uses vector summation of the score of each algorithm to
determine the match score. This scoring method is useful if
you want a higher match score in one or more algorithms to
get proportionately represented in the final match score.
The formula used for calculating the final score is:
sqrt(a^2 + b^2 + c^2) / sqrt(n), where: a, b, and c are the scores of three different algorithms and n is the number of algorithms used.
-
Choose one or more algorithms to use to determine if the values in the
field match. One of the following.
- Acronym
- Determines whether a business name matches its acronym by looking for acronym data; otherwise it creates an acronym using the first character of every word. Example: Internal Revenue Service and its acronym IRS would be considered a match and return a match score of 100.
- Character Frequency
- Determines the frequency of occurrence of each character in a string and compares the overall frequencies between two strings.
- Daitch-Mokotoff Soundex
- Phoenetic algorithm that allows greater accuracy in matching of Slavic and Yiddish surnames with similar pronunciation but differences in spelling. Coded names are six digits long, and multiple possible encodings can be returned for a single name. This option was developed to respond to limitations of Soundex in the processing of Germanic or Slavic surnames.
- Date
- Compare date fields regardless of the date format in the
input records. Click Edit in the Options column to specify
the following:
- Require Month: prevents a date that consists only of a year from matching
- Require Day: prevents a date that consists only of a month and year from matching
- Match Transposed MM/DD: where month and day are provided in numeric format, compares suspect month to candidate day and suspect day to candidate month as well as the standard comparison of suspect month to candidate month and suspect day to candidate day
- Prefer DD/MM/YYYY format over MM/DD/YYYY: contributes to date parsing in cases where both month and day are provided in numeric format and their identification can not be determined by context. For example, given the numbers 5 and 13, the parser will automatically assign 5 to the month and 13 to the day because there are only 12 months in a year. However, given the numbers 5 and 12 (or any two numbers 12 and under), the parser will assume whichever number is first to be the month. Checking this option will ensure that the parser reads the first number as the day rather than the month.
- Range Options—Overall: allows you to set the maximum number of days between matching dates. For example, if you enter an overall range of 35 days and your candidate date is December 31st, 2000, a suspect date of February 5, 2001 would be a match, but a suspect date of February 6 would not. If you enter an overall range of 1 day and your candidate date is January 2000, a suspect date of 1999 would be a match (comparing December 31, 1999) but a suspect date of January 2001 would not.
- Range Options—Year: allows you to set the number of years between matching dates, independent of month and day. For example, if you enter a year range of 3 and your candidate date is January 31, 2000, a suspect date of January 31, 2003, would be a match but a suspect date of February 2003 would not. Similarly, if your candidate date is 2000, a suspect date of March 2003 would be a match because months are not in conflict and it's within the three-year range.
- Range Options—Month: allows you to set the number of months between matching dates, independent of year and day. For example, if you enter a month range of 4 and your candidate date is January 1, 2000, a suspect date of May 2000 is a match because there is no day conflict and it's within the four-month range, but a suspect date of May 2, 2000, is not, because the days conflict.
- Range Options—Day: allows you to set the number of days between matching dates, independent of year and month. For example, if you enter a day range of 5 and your candidate date is January 1, 2000, a suspect date of January 2000 is a match because there is no day conflict but a suspect date of December 27, 1999, is not, because the months conflict.
- Double Metaphone
- Determines the similarity between two strings based on a phonetic representation of their characters. Double Metaphone is an improved version of the Metaphone algorithm, and attempts to account for the many irregularities found in different languages.
- Edit Distance
- Determines the similarity between two strings based on the number of deletions, insertions, or substitutions required to transform one string into another.
- Euclidean Distance
- Provides a similarity measure between two strings using the vector space of combined terms as the dimensions. It also determines the greatest common divisor of two integers. It takes a pair of positive integers and forms a new pair that consists of the smaller number and the difference between the larger and smaller numbers. The process repeats until the numbers are equal. That number then is the greatest common divisor of the original pair. For example, 21 is the greatest common divisor of 252 and 105: (252 = 12 × 21; 105 = 5 × 21); since 252 − 105 = (12 − 5) × 21 = 147, the GCD of 147 and 105 is also 21.
- Exact Match
- Determines if two strings are the same.
- Initials
- Used to match initials for parsed personal names.
- Jaro-Winkler Distance
- Determines the similarity between two strings based on the number of character replacements it takes to transform one string into another. This option was developed for short strings, such as personal names.
- Keyboard Distance
- Determines the similarity between two strings based on the number of deletions, insertions, or substitutions required to transform one string to the other, weighted by the position of the keys on the keyboard. Click Edit in the Options column to specify the type of keyboard you are using: QWERTY (U.S.), QWERTZ (Austria and Germany), or AZERTY (France).
- Koeln
- Indexes names by sound as they are pronounced in German. Allows names with the same pronunciation to be encoded to the same representation so that they can be matched, despite minor differences in spelling. The result is always a sequence of numbers; special characters and white spaces are ignored. This option was developed to respond to limitations of Soundex.
- Kullback-Liebler Distance
- Determines the similarity between two strings based on the differences between the distribution of words in the two strings.
- Metaphone
- Determines the similarity between two English-language strings based on a phonetic representation of their characters. This option was developed to respond to limitations of Soundex.
- Metaphone (Spanish)
- Determines the similarity between two strings based on a phonetic representation of their characters. This option was developed to respond to limitations of Soundex.
- Metaphone 3
- Improves upon the Metaphone and Double Metaphone algorithms with more exact consonant and internal vowel settings that allow you to produce words or names more or less closely matched to search terms on a phonetic basis. Metaphone 3 increases the accuracy of phonetic encoding to 98%. This option was developed to respond to limitations of Soundex.
- Name Variant
- Determines whether two names are variants of each other. The algorithm returns a match score of 100 if two names are variations of each other, and a match score of 0 if two names are not variations of each other. For example, JOHN is a variation of JAKE and returns a match score of 100. JOHN is not a variant of HENRY and returns a match score of 0. Click Edit in the Options column to select Name Variant options. For more information, see Name Variant Finder.
- NGram Distance
-
Calculates in text or speech the probability of the next term based on the previous n terms, which can include phonemes, syllables, letters, words, or base pairs and can consist of any combination of letters. This algorithm includes an option to enter the size of the NGram; the default is 2.
- NGram Similarity
-
Determines similarity between two strings based on the length of the longest common subsequence of phonemes, syllables, letters, words or base pairs.
The algorithm includes the following options:
- Ngram size: Enter the size of the NGram. The default value is 2.
- Drop Noise Characters: Select the check-box to replace punctuation with space.
- Drop Spaces: Select the check-box to merge words.
- Numeric String
- Compares address lines by separating the numerical
attributes of an address line from the characters. For
example, in the string address 1234 Main Street Apt 567, the
numerical attributes of the string (1234567) are parsed and
handled differently from the remaining string value (Main
Street Apt). The algorithm first matches numeric data in the
string with the numeric algorithm. If the numeric data match
is 100, the alphabetic data is matched using Edit distance
and Character Frequency. The final match score is calculated
as follows:
(numericScore + (EditDistanceScore + CharacterFrequencyScore) / 2) / 2
For example, the match score of these two addresses is 95.5, calculated as follows:
123 Main St Apt 567
123 Maon St Apt 567Numeric Score = 100
Edit Distance = 91
Character Frequency = 9191 + 91 = 182
182/2 = 91
100 + 91 = 191
191/2 = 95.5 - Nysiis
- Phonetic code algorithm that matches an approximate
pronunciation to an exact spelling and indexes words that
are pronounced similarly. Part of the New York State
Identification and Intelligence System. Say, for example,
that you are looking for someone's information in a database
of people. You believe that the person's name sounds like
"John Smith", but it is in fact spelled "Jon Smath". If you
conducted a search looking for an exact match for "John
Smith" no results would be returned. However, if you index
the database using the NYSIIS algorithm and search using the
NYSIIS algorithm again, the correct match will be returned
because both "John Smith" and "Jon Smath" are indexed as
"JANSNATH" by the algorithm. This option was developed to
respond to limitations of Soundex; it handles some
multi-character n-grams and maintains relative vowel
positioning, whereas Soundex does not.Note: This algorithm does not process non-alpha characters; records containing them will fail during processing.
- Phonix
- Preprocesses name strings by applying more than 100 transformation rules to single characters or to sequences of several characters. 19 of those rules are applied only if the character(s) are at the beginning of the string, while 12 of the rules are applied only if they are at the middle of the string, and 28 of the rules are applied only if they are at the end of the string. The transformed name string is encoded into a code that is comprised by a starting letter followed by three digits (removing zeros and duplicate numbers). This option was developed to respond to limitations of Soundex; it is more complex and therefore slower than Soundex.
- Soundex
- Determines the similarity between two strings based on a phonetic representation of their characters.
- SubString
- Determines whether one string occurs within another.
- Syllable Alignment
- Combines phonetic information with edit distance-based calculations. Converts the strings to be compared into their corresponding sequences of syllables and calculates the number of edits required to convert one sequence of syllables to the other.
The following table describes the logical relationship between the number of algorithms you can use based on the parent scoring method selected.
Table 2. Matching Algorithm-to-Scoring Method Matrix Scoring Method Algorithms Single Multiple Weighted Average n/a Yes Average n/a Yes Maximum Yes Yes Minimum n/a Yes Vector Summation n/a Yes
- If you are defining a rule in Interflow Match, Intraflow Match, or Transactional Match, and you want to share the rule with other stages and/or users, click the Save button at the top of the window.