E-mail (with Combinations)

SAS Quality Knowledge Base for Contact Information

Match Definition

E-mail (with Combinations)
Description	The E-mail (with Combinations) match definition generates match codes that can be used to cluster records containing e-mail addresses. It generates one or more match codes for each input string and assigns a score to each match code. Mailboxes are matched aggressively; the weight and the sensitivity range at which each type of match applies are noted in the examples. The types of matches produced by this definition include, but are not limited to, those shown below. The delimiter characters are the hyphen, underscore, and period. Delimiters do not need to match as long as the other conditions are satisfied.
Max Length of Match Code	54 characters
Example 1	Input	Cluster ID	Conditions
	info@dataflux.com	0	Sensitivities 50 - 100 Weight 100
	info1@dataflux.com	0
	info2@dataflux.com	0
Remarks	Single trailing digits in the mailbox do not affect the match.
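The rule in Example 1 can be approximated outside the QKB with a short normalization step. The following is a minimal sketch, not the definition's actual logic: the function names are hypothetical, and it assumes that only a single trailing digit (one that is not part of a longer number) is ignored and that comparison is case-insensitive.

```python
import re


def strip_single_trailing_digit(mailbox: str) -> str:
    """Drop one trailing digit, but only if it is not part of a longer number."""
    # (?<!\d) ensures the digit before the end is not itself preceded by a digit.
    return re.sub(r"(?<!\d)\d$", "", mailbox)


def mailbox_key(address: str) -> str:
    """Build a comparison key (hypothetical helper, not a QKB match code)."""
    mailbox, _, domain = address.lower().rpartition("@")
    return f"{strip_single_trailing_digit(mailbox)}@{domain}"


addresses = ["info@dataflux.com", "info1@dataflux.com", "info2@dataflux.com"]
keys = {mailbox_key(a) for a in addresses}
```

Under these assumptions all three inputs from Example 1 collapse to one key, while a mailbox such as `soon1923` is left unchanged because its trailing digit belongs to a longer number.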
Example 2	Input	Cluster ID	Conditions
	dave.wagner@acme.com	1	Sensitivities 50 - 89 Weight 100
	wagner.dave@acme.com	1
	DaveWagner@acme.com	1
	WagnerDave@acme.com	1
Remarks	As long as given and family names are delimited, they can occur in either order in the mailbox and still match. Casing can be used as a method of delimiting the names.
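One way to picture the behavior in Example 2 is to split the mailbox on delimiters and on case changes, then compare the resulting tokens without regard to order. This is an illustrative sketch only; the tokenizing regular expression and the function names are assumptions, and the definition itself may identify names differently.

```python
import re

# A token is a run of lowercase letters (optionally led by one capital),
# a run of capitals, or a run of digits. Delimiters are skipped.
TOKEN = re.compile(r"[A-Z]?[a-z]+|[A-Z]+|\d+")


def name_tokens(mailbox: str) -> list[str]:
    """Split a mailbox on delimiters and on lowercase-to-uppercase boundaries."""
    return TOKEN.findall(mailbox)


def order_free_key(address: str) -> str:
    """Hypothetical comparison key that ignores token order and casing."""
    mailbox, _, domain = address.rpartition("@")
    tokens = sorted(t.lower() for t in name_tokens(mailbox))
    return "|".join(tokens) + "@" + domain.lower()


example_2 = [
    "dave.wagner@acme.com",
    "wagner.dave@acme.com",
    "DaveWagner@acme.com",
    "WagnerDave@acme.com",
]
keys = {order_free_key(a) for a in example_2}
```

All four inputs produce the same key here. An undelimited, all-lowercase mailbox such as `davewagner` stays a single token, which is consistent with the remark that the names must be delimited.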
Example 3	Input	Cluster ID	Conditions
	john.doe@mailbox.com	2	Sensitivities 50 - 89 Weight 100
	john.doe+spam_tracker@mailbox.com	2
	john.doe+spam_tracker_2@mailbox.com	2
Remarks	An address tag (a sub-part of the mailbox delimited by the plus sign) does not affect the match.
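The address-tag rule in Example 3 amounts to discarding everything from the first plus sign to the end of the mailbox. A minimal sketch, with a hypothetical function name and assuming the tag always begins at the first plus sign:

```python
def strip_address_tag(address: str) -> str:
    """Remove a '+tag' suffix from the mailbox part of an e-mail address."""
    mailbox, _, domain = address.rpartition("@")
    base = mailbox.split("+", 1)[0]  # keep only the text before the first '+'
    return f"{base}@{domain}"


example_3 = [
    "john.doe@mailbox.com",
    "john.doe+spam_tracker@mailbox.com",
    "john.doe+spam_tracker_2@mailbox.com",
]
untagged = {strip_address_tag(a) for a in example_3}
```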
Example 4	Input	Cluster ID	Conditions
	bepstein@acme.com	3	Sensitivities 50 - 89 Weight 25
	epstein@acme.com	3	Sensitivities 50 - 89 Weight 25
Remarks	A letter preceding a family name in the mailbox does not affect the match.
Example 5	Input	Cluster ID	Conditions
	davidw@acme.com	4	Sensitivities 50 - 89 Weight 50
	david@acme.com	4	Sensitivities 50 - 89 Weight 50
Remarks	A letter following a given name in the mailbox does not affect the match.
Example 6	Input	Cluster ID	Conditions
	br-epstein@acme.com	5	Sensitivities 50 - 89 Weight 75
	brian.epstein@acme.com	5	Sensitivities 50 - 89 Weight 75
Remarks	Two letters will match a full given name starting with those two letters as long as they are delimited from the family name.
Example 7	Input	Cluster ID	Conditions
	b-epstein@acme.com	6	Sensitivities 50 - 89 Weight 50
	brian.epstein@acme.com	6	Sensitivities 50 - 89 Weight 50
Remarks	One letter will match a full given name starting with that letter as long as it is delimited from the family name.
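Examples 6 and 7 can be read as a prefix test on the given-name token. The sketch below is a pairwise check rather than match code generation, and it rests on assumptions that the source does not state: the mailbox has exactly two delimited tokens, the first is the given name, and the second is the family name. All function names are hypothetical.

```python
import re

DELIMS = re.compile(r"[-_.]")  # hyphen, underscore, period


def split_mailbox(address: str) -> list[str]:
    """Lowercase the mailbox and split it on the delimiter characters."""
    mailbox = address.rpartition("@")[0].lower()
    return [t for t in DELIMS.split(mailbox) if t]


def given_name_compatible(a: str, b: str) -> bool:
    """True if the names are equal or the shorter is a 1-2 letter prefix of the longer."""
    short, long_ = sorted((a, b), key=len)
    return short == long_ or (len(short) <= 2 and long_.startswith(short))


def abbreviated_match(addr_a: str, addr_b: str) -> bool:
    """Compare two addresses of the assumed form given<delim>family@domain."""
    ta, tb = split_mailbox(addr_a), split_mailbox(addr_b)
    if len(ta) != 2 or len(tb) != 2:
        return False  # undelimited mailboxes are out of scope for this rule
    same_domain = addr_a.rpartition("@")[2].lower() == addr_b.rpartition("@")[2].lower()
    return same_domain and ta[1] == tb[1] and given_name_compatible(ta[0], tb[0])
```

With this check, `br-epstein@acme.com` and `b-epstein@acme.com` both pair with `brian.epstein@acme.com`, while the undelimited `bepstein@acme.com` does not, since it falls under the separate rule shown in Example 4.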
Example 8	Input	Cluster ID	Conditions
	dave-william.wagner@acme.com	7	Sensitivities 50 - 89 Weight 50
	dave.c.wagner@acme.com	7
	dave_wagner@acme.com	7
Remarks	Any delimited words between a given name and family name do not affect the match.
Example 9	Input	Cluster ID	Conditions
	dave-william.wagner@acme.com	8	Sensitivities 50 - 89 Weight 25
	dave@acme.com	8	Sensitivities 50 - 89 Weight 25
Remarks	Any delimited words following a given name do not affect the match.
Example 10	Input	Cluster ID	Conditions
	andersen@acme.com	9	Sensitivities 50 - 89 Weight 25
	n_rask_andersen@acme.com	9	Sensitivities 50 - 89 Weight 25
Remarks	Any delimited words preceding a family name do not affect the match.
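Examples 8 through 10 suggest how several keys per mailbox allow partial matches. The following sketch generates three hypothetical keys per address (first and last token, first token only, last token only) and treats two addresses as related if they share any key. It assumes the given name is the first delimited token and the family name is the last; how the definition actually identifies these parts is not described here, and the key format is invented for illustration.

```python
import re

DELIMS = re.compile(r"[-_.]")  # hyphen, underscore, period


def combination_keys(address: str) -> set[str]:
    """Return illustrative keys built from combinations of mailbox tokens."""
    mailbox, _, domain = address.lower().rpartition("@")
    tokens = [t for t in DELIMS.split(mailbox) if t]
    if not tokens:
        return set()
    keys = {
        f"first:{tokens[0]}@{domain}",   # cf. Example 9
        f"last:{tokens[-1]}@{domain}",   # cf. Example 10
    }
    if len(tokens) >= 2:
        # Middle tokens are ignored, cf. Example 8.
        keys.add(f"first+last:{tokens[0]}|{tokens[-1]}@{domain}")
    return keys


def share_key(addr_a: str, addr_b: str) -> bool:
    """True if the two addresses have at least one key in common."""
    return bool(combination_keys(addr_a) & combination_keys(addr_b))
```

Because `dave-william.wagner@acme.com` shares one key with `dave_wagner@acme.com` and a different key with `dave@acme.com`, the same record can end up related to more than one group, which mirrors its appearance in both cluster 7 and cluster 8 above.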
Example 11	Input	Cluster ID	Conditions
	soon1923@lgphilips-lcd.com	10	Sensitivities 50 - 89 Weight 50
	soon1g23@lgphilips-lcd.com	10	Sensitivities 50 - 89 Weight 50
Remarks	In otherwise identical mailboxes, the lowercase letter "g" matches the digit 9.
Example 12	Input	Cluster ID	Conditions
	soonl923@lgphilips-lcd.com	11	Sensitivities 50 - 89 Weight 50
	soon1923@lgphilips-lcd.com	11	Sensitivities 50 - 89 Weight 50
Remarks	In otherwise identical mailboxes, the lowercase letter "l" matches the digit 1.
Example 13	Input	Cluster ID	Conditions
	abc1O23@lgphilips-lcd.com	12	Sensitivities 50 - 89 Weight 50
	abc1o23@lgphilips-lcd.com	12
	abc1023@lgphilips-lcd.com	12
Remarks	In otherwise identical mailboxes, the letter "O" (uppercase or lowercase) matches the digit 0.
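The look-alike substitutions in Examples 11 through 13 can be modeled by mapping each confusable letter to its digit before comparing. This is a sketch under assumptions: the substitution is applied to every occurrence in the mailbox (not only those next to digits), only the lowercase forms of "g" and "l" are mapped because that is all the examples show, and the function name is hypothetical.

```python
# Map confusable letters to the digits they resemble.
# Only lowercase g and l are mapped; O is mapped in both cases.
LOOKALIKES = str.maketrans({"g": "9", "l": "1", "o": "0", "O": "0"})


def lookalike_key(address: str) -> str:
    """Hypothetical comparison key that treats look-alike characters as equal."""
    mailbox, _, domain = address.rpartition("@")
    # Translate before lowercasing so uppercase G and L are not mapped.
    return mailbox.translate(LOOKALIKES).lower() + "@" + domain.lower()


example_13 = [
    "abc1O23@lgphilips-lcd.com",
    "abc1o23@lgphilips-lcd.com",
    "abc1023@lgphilips-lcd.com",
]
keys_13 = {lookalike_key(a) for a in example_13}
```

The domain is left untouched, so `lgphilips-lcd.com` is not altered by the substitution.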

Remarks	The number of match codes generated for an input string depends on the content of the string. Each match code represents a combination of different parts of the input string, which enables two strings to be matched even when some parts of one or both strings differ. See the examples above for an illustration of the clusters that may be produced using match codes generated by this definition.

One consequence of generating multiple match codes is that a subsequent clustering operation might place a record in more than one cluster. Special attention should therefore be given to the entity resolution process when using this definition.

Multiple match codes are generated through token-combination rules in the match definition. Each match code is associated with one token-combination rule, and each rule is assigned a weight. The score for a match code is equal to the weight of the rule that generated it times the sensitivity selected when the definition is executed. When a record is clustered, the score for the record’s match code represents the confidence with which the record can be said to belong in the cluster.

When different rules lead to identical clustering results, the scores of the match codes generated by those rules may be aggregated using the Cluster Aggregation node in a Data Job. The Cluster Aggregation node offers several methods for aggregating match code scores, such as the minimum, maximum, or mean across instances of a record, or the minimum, maximum, or mean across all records in a cluster. For information on the Cluster Aggregation node, refer to your DataFlux Data Management Studio documentation.
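The scoring and aggregation described above can be illustrated with a small calculation. This is a sketch, not the behavior of the Cluster Aggregation node: the record IDs, the sensitivity value of 85, and the function names are hypothetical, and the score is left as the raw product of weight and sensitivity because the scale of the result is not stated here. The weights of 50 and 25 follow Examples 8 and 9, where one record falls into two clusters.

```python
from collections import defaultdict
from statistics import mean


def match_code_score(weight: int, sensitivity: int) -> int:
    """Score as described: rule weight times the selected sensitivity (unscaled)."""
    return weight * sensitivity


SENSITIVITY = 85  # hypothetical run-time setting

# (record_id, cluster_id, rule_weight); record r1 appears in two clusters.
instances = [
    ("r1", 7, 50),
    ("r1", 8, 25),
    ("r2", 7, 50),
    ("r3", 8, 25),
]

# Collect the score of every instance of each record.
scores_by_record: dict[str, list[int]] = defaultdict(list)
for record_id, _cluster_id, weight in instances:
    scores_by_record[record_id].append(match_code_score(weight, SENSITIVITY))


def aggregate(scores: list[int], method: str) -> float:
    """Aggregate scores with one of the methods named in the text."""
    funcs = {"min": min, "max": max, "mean": mean}
    return funcs[method](scores)


r1_max = aggregate(scores_by_record["r1"], "max")
r1_min = aggregate(scores_by_record["r1"], "min")
r1_mean = aggregate(scores_by_record["r1"], "mean")
```

The same `aggregate` helper could be applied to the scores of all records in one cluster instead of all instances of one record, which corresponds to the second family of methods mentioned above.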