Name (with Combinations)

SAS Quality Knowledge Base for Contact Information 27

Match Definition

Name (with Combinations)
Description	The Name (with Combinations) match definition generates match codes that can be used to cluster records containing the names of individuals.
Max Length of Match Code	27 characters
Example 1	Input	Cluster ID	Conditions
	Mary Shafer	0	Sensitivities 50-100, Weight 100
	Shaffer, Mary	0	Sensitivities 50-100, Weight 100
Example 2	Input	Cluster ID	Conditions
	D Harrison Wood	1	Sensitivities 50-100, Weight 20
	Harrison Wood	1	Sensitivities 50-100, Weight 20
Example 3	Input	Cluster ID	Conditions
	Martha Newell-Warner	2	Sensitivities 50-100, Weight 80
	Martha Newell	2	Sensitivities 50-100, Weight 80
Example 4	Input	Cluster ID	Conditions
	John Elton	3	Sensitivities 50-100, Weight 50
	Elton John	3	Sensitivities 50-100, Weight 50
Remarks	This definition generates one or more match codes for each input string; the number generated depends on the content of the string. Each match code represents a combination of different parts of the input string, which enables two strings to be matched even when some parts of one or both strings differ. The examples above illustrate clusters that might be produced using match codes generated by this definition.
	Because multiple match codes are generated, a subsequent clustering operation might place a record in more than one cluster. Give special attention to the entity resolution process when using this definition.
	Multiple match codes are generated through token-combination rules in the match definition. Each match code is associated with exactly one token-combination rule, and each rule is assigned a weight. The score for a match code equals the weight of the rule that generated it times the sensitivity that is selected when the definition is executed. When a record is clustered, the score for the record’s match code represents the confidence that the record belongs in the cluster.
	When different rules lead to identical clustering results, the scores of the match codes generated by those rules can be aggregated using the Cluster Aggregation node in a Data Job. The node supports several aggregation methods, such as minimum, maximum, or mean across instances of a record, or minimum, maximum, or mean across all records in a cluster. For information on the Cluster Aggregation node, refer to the documentation provided with the DataFlux Data Management Studio installation.
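The mechanics described in the Remarks can be illustrated with a small stand-alone Python sketch. It is not the QKB implementation and does not call any SAS or DataFlux API. The two rules, their names, the code format, and the helper functions `match_codes`, `cluster`, and `aggregate` are all hypothetical. Tokens are compared exactly, so the fuzzy matching that lets "Shafer" match "Shaffer" is not modeled. The score is computed literally as weight times sensitivity; the source does not state the scale of the result, so no normalization is applied. The `aggregate` function is only a loose analogy for the Cluster Aggregation node.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical token-combination rules: (rule name, weight, tokens per code).
# None means "use every token". These are NOT the rules shipped in the QKB.
RULES = [("all_tokens", 100, None), ("token_pair", 20, 2)]

# Inputs taken from Examples 2 and 4 above; the record ids are made up.
RECORDS = {1: "D Harrison Wood", 2: "Harrison Wood",
           3: "John Elton", 4: "Elton John"}


def match_codes(name, sensitivity):
    """Return one (code, score) pair per rule application for one input."""
    # Sorting makes the code independent of word order ("Elton John").
    tokens = sorted(set(name.lower().replace(",", " ").split()))
    results = []
    for rule, weight, size in RULES:
        k = len(tokens) if size is None else size
        for combo in combinations(tokens, k):
            code = rule + ":" + "|".join(combo)
            # Score as described in the Remarks: rule weight times sensitivity.
            results.append((code, weight * sensitivity))
    return results


def cluster(records, sensitivity):
    """Group record ids by shared match code; keep codes shared by 2+ records."""
    by_code = defaultdict(dict)
    for rec_id, name in records.items():
        for code, score in match_codes(name, sensitivity):
            by_code[code][rec_id] = score
    return {code: ids for code, ids in by_code.items() if len(ids) > 1}


def aggregate(clusters, how):
    """Combine the scores of clusters whose membership is identical."""
    grouped = defaultdict(list)
    for members in clusters.values():
        grouped[frozenset(members)].extend(members.values())
    return {tuple(sorted(ids)): how(vals) for ids, vals in grouped.items()}


clusters = cluster(RECORDS, sensitivity=85)
for code, members in sorted(clusters.items()):
    print(code, members)
print(aggregate(clusters, max))
print(aggregate(clusters, mean))
```

In this sketch, records 1 and 2 share only a code from the low-weight pair rule, while records 3 and 4 share a code from each rule and therefore appear in two clusters with identical membership. That second case is the situation in which aggregating scores (here by maximum or mean) yields a single confidence value per cluster.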