Match Definitions

Match definitions contain data and logic that can be used to generate a match code for a data string. A match code is an encrypted string that represents portions of the original input string, like tokens, that are semantically significant. Two data strings are said to match if the same match code is generated for each data string.

Match definitions can be applied at different sensitivity levels, which determine the information that is stored in the match codes. Use higher sensitivity levels when you need matches that are more exact. Use lower sensitivity levels to sort data into general categories, or to capture values that use different spellings to convey the same information.

For example, consider these two names:

Bill Smith
William L. Smith

A Name match definition can be applied to these strings that will generate the same match code for each string. The match codes can then be used to cluster the data. Strings with the same match code are assigned the same Cluster ID, therefore they match:

Input Match Code Cluster ID
Bill Smith 4B~2$$$LWB$$$$$ 1
William L. Smith 4B~2$$$LWB$$$$$ 1

Match codes have many potential uses. You can use them to identify and eliminate duplicate records in a table, or to do a fuzzy search for an item in a table. You can also use match codes as keys when joining two tables. In this way you are really doing a fuzzy join, which increases the effectiveness of your data integration efforts.

To learn more about the uses of match definitions, refer to your SAS data management product documentation.

Multiple Match Codes

With DataFlux Data Management Studio, match definitions have the option to generate one or more match codes for a single input string. This is done by using token combination rules or alternative suggestions for words.

Multiple match codes enable two strings to match even when portions of one or both strings differ. Consequently, a record can be assigned to more than one cluster. Therefore, special attention should be given to the entity resolution process when using a definition that enables multiple match codes.

The score assigned to each match code produced by the definition can be used as a factor when resolving conflicts between clusters. In addition, the Cluster Aggregation node in DataFlux Data Management Studio allows several methods for aggregating match code scores, such as minimum, maximum, or mean across instances of a record or all records in a cluster. See the latest DataFlux Data Management Studio User's Guide for more information about the Cluster Aggregation node.

Combination-Based Matching

Combination-based match definitions create multiple match codes for a single input string by manipulating the tokens of the input string with token combination rules. The input string can be manipulated in many ways to produce useful match codes. A rule can do one or more of the following:

Rules can be applied if and only if the input string meets specified conditions. When a rule is applied, it internally creates a copy of the input to produce the corresponding match code. Records can cluster together in more than one cluster depending on the rules.

Each token combination rule has a weight that is used with the sensitivity level to assign a score to the match code. Scores can be used to rank records within a cluster.

For example, suppose that you have a majority of calendar dates stored in the order of Day, Month, Year with some stored as Month, Day, Year by mistake. When the data is numeric and the day is less than 13, the mistake is not obvious. The goal is to match records based on date even if the dates were entered incorrectly. You can use a combination-based match definition for dates with the token combination rules that are described in the following table:

Rule Tokens included in the match code Weight Sensitivity Level
Default Year, Month, Day 100

All

1 Year, Month, Day (but the values in Month and Day are swapped if they are both numbers between 1 and 12) 40 50 to 85

If you use the match definition to process the following input data at the default sensitivity (85):

ID Data
1 Feb 1 2019
2 January 2 2019
3 1/2/19
4 2/1/2019

the output contains the following multi-record clusters:

ID Data Cluster ID Score
1 Feb 1 2019 0 100
3 1/2/19 0 100
4 2/1/19 0 40
2 January 2 2019 1 100
3 1/2/19 1 40
4 2/1/19 1 100

The records with a score of 100 were created by the default rule, which is equivalent to the logic of a regular match definition. There is always a default rule with all tokens applied at all sensitivities.

Records 3 and 4 occur twice because an additional match code was created for them by Rule 1. The other records do not get additional match codes because they do not satisfy the conditions of Rule 1. The records created by Rule 1 have a lower score because Rule 1 has a lower weight. If you executed the definition at sensitivity 90 or 95, you would not get the additional records from Rule 1 because Rule 1 is applied only from sensitivities 50 to 85.

Suggestion-Based Matching

Suggestion-based match definitions use an internal spell-checking mechanism and optional term frequencies that have been obtained from reference sources or through data mining. Potentially misspelled words generate spelling suggestions from a dictionary of known words. Suggested corrections are based on character deletions, insertions, replacements, and transpositions. Whitespace character insertion, lettercasing, and context-dependent pronunciation can also be taken into account.

The match code for each suggestion is associated with a score that reflects the closeness of that suggestion to the input word. The scores are based on the transformations that were made to the input and frequency counts from trusted reference data. If two suggestions are equally close to the input word, the more frequent suggestion will score higher.

For example, suppose you processed the following name data at the default sensitivity (85) using a Name suggestion-based match definition that uses census information to make suggestions:

ID Data
1 Tom Smith
2 Tim Smith
3 Kim Smyth
4 Thomas Smithe
5 Tommy Smith
6 Timmy Smythe

Each input can produce more than one suggested alternative spelling. These suggestions are processed as if they were additional input records. The resulting match codes from the original and suggested records are used to group the input into clusters. Since each input record can produce multiple match codes, records can appear in more than one cluster.

ID Data Cluster ID Score
1 Tom Smith 0 89.54
2 Tim Smith 0 84.89
4 Thomas Smithe 0 89.37
5 Tommy Smith 0 89.96
6 Timmy Smythe 0 81.4
1 Tom Smith 1 74.73
2 Tim Smith 1 70.08
5 Tommy Smith 1 75.15
3 Kim Smyth 12 80.39
6 Timmy Smythe 12 76.91
3 Kim Smyth 13 70.4
6 Timmy Smythe 13 75.58
4 Thomas Smithe 15 73.76
6 Timmy Smythe 15 80.06