SAS Quality Knowledge Base for Contact Information 25

Global Definitions

In the Quality Knowledge Base, definitions designated as Global are shared by all locales. The Global definitions are described below.

Case Definitions
Gender Analysis Definitions

Identification Analysis Definitions

Match Definitions

Parse Definitions

Pattern Analysis Definitions

Standardization Definitions

Case Definitions

Lower
Description The Lower case definition lowercases general text.
Example Input Output
Lowercase General Text lowercase general text
Remarks  

 

Proper
Description The Proper case definition performs generic propercasing.
Example Input Output
propercase general text Propercase General Text
Remarks  

 

Proper (Address Number)
Description The Proper (Address Number) case definition propercases address numbers.
Examples Input Output
box 123a Box 123A
4201a 4201A
Remarks  

 

Upper
Description The Upper case definition uppercases general text.
Example Input Output
uppercase general text UPPERCASE GENERAL TEXT
Remarks  

Gender Analysis Definitions

None.

Identification Analysis Definitions

Date (DMY Validation - Numeric Only)
Description The Date (DMY Validation - Numeric Only) identification analysis definition identifies date strings as valid or invalid.
Examples Input String Identity
10/18/2007 INVALID
10/5/07 VALID
October 5, 2007 INVALID
07/10/05 VALID
5/10/07 VALID
Remarks  

 

Date (MDY Validation - Numeric Only)
Description The Date (MDY Validation - Numeric Only) identification analysis definition identifies date strings as valid or invalid.
Examples Input String Identity
10/18/2007 VALID
10/5/07 VALID
October 5, 2007 INVALID
07/10/05 VALID
5/10/07 VALID
Remarks  

 

Date (YMD Validation - Numeric Only)
Description The Date (YMD Validation - Numeric Only) identification analysis definition identifies date strings as valid or invalid.
Examples Input String Identity
10/18/2007 INVALID
10/5/07 VALID
October 5, 2007 INVALID
07/10/05 VALID
5/10/07 INVALID
Remarks  
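
The three numeric-only date validations above differ only in the order in which the fields are read. A minimal sketch follows, assuming (these rules are inferred from the examples, not stated by the QKB) that years must have two or four digits, that days and months have at most two digits, and that "/", "-", and "." are accepted separators. The function name `validate_numeric_date` is hypothetical.

```python
import re
from datetime import date

# Three runs of digits separated by "/", "-", or "." (separator set is an assumption).
_NUMERIC_DATE = re.compile(r"(\d+)[/.-](\d+)[/.-](\d+)")

def validate_numeric_date(text, order):
    """Return "VALID" or "INVALID" for text read in the given order ("DMY", "MDY", "YMD")."""
    match = _NUMERIC_DATE.fullmatch(text.strip())
    if match is None:
        return "INVALID"  # not numeric-only, for example "October 5, 2007"
    parts = dict(zip(order, match.groups()))
    # Assumption: years have 2 or 4 digits; days and months have 1 or 2 digits.
    if len(parts["Y"]) not in (2, 4) or len(parts["M"]) > 2 or len(parts["D"]) > 2:
        return "INVALID"
    year = int(parts["Y"])
    if len(parts["Y"]) == 2:
        year += 2000  # assumed century, needed only for the calendar check
    try:
        date(year, int(parts["M"]), int(parts["D"]))  # rejects month 18, day 32, and so on
    except ValueError:
        return "INVALID"
    return "VALID"
```

Under these assumptions, `validate_numeric_date("10/18/2007", "DMY")` is rejected because 18 is not a month, while the same string read as "MDY" is accepted.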

 

E-mail (Country Identification)
Description The E-mail (Country Identification) identification analysis definition identifies the country of an e-mail address.
Examples Input Output
john.smith@hotmail.co.uk UNITED KINGDOM
john.smith@example.com UNKNOWN
john.smith@harvard.edu UNITED STATES
john.smith@example.co.jp JAPAN
Remarks  
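
The country identification above can be approximated by a lookup on the last label of the domain. This is only a sketch: the table `COUNTRY_BY_TLD` is a hypothetical excerpt built from the examples, and the actual definition may use other rules.

```python
# Hypothetical excerpt of a suffix-to-country table, taken from the examples above.
COUNTRY_BY_TLD = {
    "uk": "UNITED KINGDOM",
    "jp": "JAPAN",
    "edu": "UNITED STATES",  # mirrors the harvard.edu example
}

def email_country(address):
    """Return the country for the address's top-level domain, or "UNKNOWN"."""
    domain = address.rpartition("@")[2]
    tld = domain.rpartition(".")[2].lower()
    return COUNTRY_BY_TLD.get(tld, "UNKNOWN")
```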

 

E-mail (Validation)
Description The E-mail (Validation) identification analysis definition identifies whether or not an e-mail address is syntactically correct.
Examples Input Output
VIRGINIE.MAZOUINGMAIL.COM INVALID
WBARBER@EARTHLINK.NCT INVALID
wrobinson@helixtechnology INVALID
Ajari@evolvingedge.net VALID
al@crestnational.com VALID
ALIEMARATI@YAHOO.COM VALID
Remarks This definition performs a simple validation of the syntax of the e-mail input. It is not meant to be 100% correct according to international standards, but it is a practical approach to screen your e-mail addresses. Addresses with embedded comments are recognized as invalid, even though the address may be correct. Certain rarely-occurring but otherwise valid constructions (such as valid control characters in a quoted mailbox) are recognized as invalid. Certain rarely-occurring but otherwise invalid constructions (such as double-hyphens in a sub-domain name) are recognized as valid.
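
A simple syntactic screen of the kind described above might look like the following sketch. It is not the QKB implementation: the regular expression, the function name, and the `KNOWN_TLDS` excerpt are assumptions chosen to reproduce the examples (a missing "@", a missing top-level domain, and an unrecognized top-level domain are all rejected).

```python
import re

# Hypothetical excerpt of accepted top-level domains; a real list would be much longer.
KNOWN_TLDS = {"com", "net", "org", "edu", "uk", "jp"}

_ADDRESS = re.compile(
    r"[A-Za-z0-9._%+-]+"      # mailbox
    r"@"
    r"(?:[A-Za-z0-9-]+\.)+"   # one or more sub-domain labels, each followed by a dot
    r"([A-Za-z]+)"            # top-level domain
)

def validate_email(address):
    """Return "VALID" or "INVALID" based on a simple syntax check."""
    match = _ADDRESS.fullmatch(address.strip())
    if match is None:
        return "INVALID"
    return "VALID" if match.group(1).lower() in KNOWN_TLDS else "INVALID"
```

Like the definition it imitates, this pattern is deliberately loose: it accepts double hyphens in a sub-domain and rejects quoted mailboxes and embedded comments.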

Match Definitions

Account Number
Description The Account Number match definition generates match codes which can be used to cluster records containing account numbers. Data values are rendered right-to-left in the match codes, so matches are found based on the least-significant positions in the input data.
Max Length of Match Code 16 characters
Examples Input Cluster ID
111-AB-3333-48-26656 0
211-AB-3333-48-26656 0
111-AB-3338-48-26651 1
Remarks

Note: The results listed above reflect the default match sensitivity (85).
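
The right-to-left rendering described for the Account Number definition can be illustrated with a reversed, truncated key. This is a sketch only: real match codes are encoded differently, and the `positions` parameter is a hypothetical stand-in for whatever the chosen sensitivity controls.

```python
def account_match_key(value, positions=12):
    """Build an illustrative key from the least-significant characters of value."""
    # Keep letters and digits only, so separators such as "-" do not matter.
    cleaned = "".join(ch for ch in value.upper() if ch.isalnum())
    # Reverse so the least-significant positions come first, then truncate.
    return cleaned[::-1][:positions]
```

With this sketch the first two example inputs share a key because they differ only in the leading (most-significant) character, while the third differs in its trailing digits and gets a different key.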

 

E-mail
Description The E-mail match definition generates match codes which can be used to cluster records containing e-mail addresses.
Max Length of Match Code 38 characters
Examples Input Cluster ID
jane.smith@gmail.com 0
jane-smith@gmail.com 0
janesmith@gmail.com 0
<mailto:janesmith@gmail.com> 0

Remarks

Note: The results listed above reflect the default match sensitivity (85).

The E-mail (v23) match definition is now deprecated and will be removed in a future release of the QKB.

The E-mail match definition has been replaced with a copy of the E-mail (v23) definition, which takes advantage of updated processing. If you changed your jobs to use the E-mail (v23) definition, change them back to use the E-mail definition.
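
The clustering shown in the E-mail examples is consistent with a key that ignores case, a surrounding `<mailto:...>` wrapper, and hyphens or full stops in the mailbox. The following sketch assumes exactly those rules; the function name is hypothetical and the actual match code is an encoded value, not a readable address.

```python
import re

def email_match_key(value):
    """Return an illustrative clustering key for an e-mail address."""
    text = value.strip().lower()
    # Drop an optional <...> or <mailto:...> wrapper.
    text = re.sub(r"^<(?:mailto:)?(.*)>$", r"\1", text)
    mailbox, _, domain = text.partition("@")
    # Ignore hyphens and full stops inside the mailbox.
    mailbox = re.sub(r"[-.]", "", mailbox)
    return f"{mailbox}@{domain}"
```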

 

E-mail (v23)
Description The E-mail (v23) match definition generates match codes which can be used to cluster records containing e-mail addresses.
Max Length of Match Code 38 characters
Examples Input Cluster ID
jane.smith@gmail.com 0
jane-smith@gmail.com 0
janesmith@gmail.com 0
<mailto:janesmith@gmail.com> 0

Remarks

Note: The results listed above reflect the default match sensitivity (85).

The E-mail (v23) match definition is now deprecated and will be removed in a future release of the QKB.

The E-mail match definition has been replaced with a copy of the E-mail (v23) definition, which takes advantage of updated processing. If you changed your jobs to use the E-mail (v23) definition, change them back to use the E-mail definition.

 

E-mail (with Combinations)
Description The E-mail (with Combinations) match definition generates match codes which can be used to cluster records containing e-mail addresses, with a score for each match code. The types of matches produced by this definition include, but are not limited to, those shown in the examples below.
Max Length of Match Code 54 characters
Example 1 Input Cluster ID
Sensitivities 50 - 100
Weight 100
info@dataflux.com 0
info1@dataflux.com 0
info2@dataflux.com 0
Remarks An e-mail address with a single trailing digit in the mailbox shall match the same e-mail address with no trailing digit in the mailbox.
Example 2 Input Cluster ID
Sensitivities 50 - 89
Weight 100
dave.wagner@acme.com 1
wagner.dave@acme.com 1
DaveWagner@acme.com 1
WagnerDave@acme.com 1
Remarks Two e-mail addresses shall match if their sub-domains match and their mailboxes contain a two-word name (delimited by hyphen, underscore, full stop, or change of case) in the same order or opposite order.
Example 3 Input Cluster ID
Sensitivities 50 - 89
Weight 100
john.doe@mailbox.com 2
john.doe+spam_tracker@mailbox.com 2
john.doe+spam_tracker_2@mailbox.com 2
Remarks Two e-mail addresses shall match if their sub-domains match and their mailboxes differ by the inclusion, omission, or value of a sub-part of the e-mail address (also called an address tag) delimited by the plus sign.
Example 4 Input Cluster ID
Sensitivities 50 - 89
Weight 25
bepstein@acme.com 3
epstein@acme.com 3
Remarks Two e-mail addresses shall match if their sub-domains match and one mailbox contains a family name while the other contains the same family name preceded by the first letter of a given name.
Example 5 Input Cluster ID
Sensitivities 50 - 89
Weight 50
davidw@acme.com 4
david@acme.com 4
Remarks Two e-mail addresses shall match if their sub-domains match and one mailbox contains a given name while the other contains the same given name followed by the first letter of a family name.
Example 6 Input Cluster ID
Sensitivities 50 - 89
Weight 75
br-epstein@acme.com 5
brian.epstein@acme.com 5
Remarks Two e-mail addresses shall match if their sub-domains match and the first two letters of the given name and the delimited family name (delimited by hyphen, underscore, or full stop) match.
Example 7 Input Cluster ID
Sensitivities 50 - 89
Weight 50
b-epstein@acme.com 6
brian.epstein@acme.com 6
Remarks Two e-mail addresses shall match if their sub-domains match and the first letter of the given name and the delimited family name (delimited by hyphen, underscore, or full stop) match.
Example 8 Input Cluster ID
Sensitivities 50 - 89
Weight 50
dave-william.wagner@acme.com 7
dave.c.wagner@acme.com 7
dave_wagner@acme.com 7
Remarks Two e-mail addresses shall match if their sub-domains match and the first and last parts of the mailbox (delimited by hyphen, underscore, or full stop) match.
Example 9 Input Cluster ID
Sensitivities 50 - 89
Weight 25
dave-william.wagner@acme.com 8
dave@acme.com 8
Remarks Two e-mail addresses shall match if their sub-domains match and the first part of the mailbox (delimited by hyphen, underscore, or full stop and recognized as a given name) matches.
Example 10 Input Cluster ID
Sensitivities 50 - 89
Weight 25
andersen@acme.com 9
n_rask_andersen@acme.com 9
Remarks Two e-mail addresses shall match if their sub-domains match and the last part of the mailbox (delimited by hyphen, underscore, or full stop and recognized as a family name) matches.
Example 11 Input Cluster ID
Sensitivities 50 - 89
Weight 50
soon1923@lgphilips-lcd.com 10
soon1g23@lgphilips-lcd.com 10
Remarks Two e-mail addresses shall match if one contains the lowercase letter "g" and the other contains the digit 9 at the corresponding position.
Example 12 Input Cluster ID
Sensitivities 50 - 89
Weight 50
soonl923@lgphilips-lcd.com 11
soon1923@lgphilips-lcd.com 11
Remarks Two e-mail addresses shall match if one contains the lowercase letter "l" and the other contains the digit 1 at the corresponding position.
Example 13 Input Cluster ID
Sensitivities 50 - 89
Weight 50
abc1O23@lgphilips-lcd.com 12
abc1o23@lgphilips-lcd.com 12
abc1023@lgphilips-lcd.com 12
Remarks Two e-mail addresses shall match if one contains the letter "O" (uppercase or lowercase) and the other contains the digit 0 at the corresponding position.
 
Remarks

This definition generates one or more match codes for each input string. The number of match codes generated for an input string depends on the content of the string. Each match code represents a combination of different parts of the input string; this enables two strings to be matched even when some parts of one or both of the strings differ. See the examples above for an illustration of clusters that may be produced using match codes generated by this definition.

Note that a consequence of the generation of multiple match codes is that a record may be placed in more than one cluster by a subsequent clustering operation. Therefore, special attention should be given to the entity resolution process when using this definition.
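
The idea of generating several match codes per input, one per rule, can be sketched as follows. The three rules shown are simplified readings of Examples 1 to 3; the key format, the function names, and the decision to ignore sensitivity are all assumptions made for illustration, not the QKB's token-combination rules.

```python
import re

def candidate_keys(address):
    """Return a set of (key, weight) pairs, one pair per rule that applies."""
    mailbox, _, domain = address.strip().partition("@")
    domain = domain.lower()
    keys = set()

    # Rule A (Example 3): ignore an address tag introduced by "+".
    untagged = mailbox.split("+", 1)[0]
    keys.add((f"{untagged.lower()}@{domain}", 100))

    # Rule B (Example 1): ignore a single trailing digit in the mailbox.
    trailing = re.fullmatch(r"(.*\D)\d", untagged)
    if trailing:
        keys.add((f"{trailing.group(1).lower()}@{domain}", 100))

    # Rule C (Example 2): a two-word name matches in either order. Words are
    # delimited by hyphen, underscore, full stop, or a change of case.
    words = re.findall(r"[A-Z]?[a-z]+", untagged)
    if len(words) == 2 and "".join(words) == re.sub(r"[-_.]", "", untagged):
        first, second = sorted(word.lower() for word in words)
        keys.add((f"{first}|{second}@{domain}", 100))
    return keys

def share_key(a, b):
    """Two addresses can cluster together if any of their keys coincide."""
    return bool({k for k, _ in candidate_keys(a)} & {k for k, _ in candidate_keys(b)})
```

Because one address can yield several keys, it can coincide with different addresses through different rules, which is why a record may land in more than one cluster.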

Generation of multiple match codes is achieved through the use of token-combination rules in the Match definition. Each match code generated by the definition is associated with one token-combination rule. There is a weight assigned to each rule; each rule's weight is used to calculate a score that is assigned to the match code that is generated by that rule. The score for a match code is equal to the weight of the rule used to generate the match code times the sensitivity that is selected when the definition is executed.

When a record is clustered, the score for the record’s match code represents the confidence with which we can assert that the record belongs in the cluster. Note that when different rules lead to identical clustering results, the scores of the match codes generated by the different rules may be aggregated using the Cluster Aggregation node in a Data Job. The Cluster Aggregation node allows several different methods for aggregating match code scores, such as minimum, maximum, or mean across instances of a record, or minimum, maximum, or mean across all records in a cluster. For information on the Cluster Aggregation node, please refer to your DataFlux Data Management Studio documentation.
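
The scoring and aggregation described above reduce to simple arithmetic. The sketch below takes the stated formula literally (weight times sensitivity); the text does not say whether the product is rescaled, so the raw product is shown. The function names are hypothetical and are not part of the Cluster Aggregation node.

```python
from statistics import mean

def match_code_score(rule_weight, sensitivity):
    """Score of a match code: weight of its rule times the selected sensitivity."""
    return rule_weight * sensitivity  # raw product; any rescaling is not specified

def aggregate_scores(scores, method="mean"):
    """Aggregate the scores of several match codes by minimum, maximum, or mean."""
    methods = {"minimum": min, "maximum": max, "mean": mean}
    return methods[method](scores)
```

For example, at sensitivity 85 a rule with weight 50 and a rule with weight 100 give scores of 4250 and 8500, which `aggregate_scores` can reduce to a single value per record or per cluster.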

Parse Definitions

E-mail
Description The E-mail parse definition parses e-mail addresses into a set of tokens.
Output Tokens Mailbox
Sub-Domain
Top-Level Domain
Additional Info
Example 1 Input Output
info@dataflux.com Mailbox info
Sub-Domain dataflux
Top-Level Domain com
Additional Info  
Example 2 Input Output
John Smith <johnsmith@dataflux.com> Mailbox johnsmith
Sub-Domain dataflux
Top-Level Domain com
Additional Info John Smith
Remarks  
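
The token layout of the E-mail parse definition can be reproduced for the two examples with a short sketch. The regular expression and the function name are assumptions; inputs that fit neither example form are not handled.

```python
import re

# Optional display text followed by an address in angle brackets.
_ANGLE_FORM = re.compile(r"\s*(.*?)\s*<\s*([^<>\s]+)\s*>\s*")

def parse_email(text):
    """Split an e-mail string into the four output tokens listed above."""
    match = _ANGLE_FORM.fullmatch(text)
    if match:
        additional, address = match.group(1), match.group(2)
    else:
        additional, address = "", text.strip()
    mailbox, _, domain = address.rpartition("@")
    sub_domain, _, tld = domain.rpartition(".")
    return {
        "Mailbox": mailbox,
        "Sub-Domain": sub_domain,
        "Top-Level Domain": tld,
        "Additional Info": additional,
    }
```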

 

IBAN
Description The IBAN parse definition parses International Bank Account Numbers into a general set of tokens.
Output Tokens Country Code
Basic Bank Account Number
Key
Example 1 Input Output
NL91ABNA0417164300 Country Code NL
Basic Bank Account Number ABNA0417164300
Key 91
Example 2 Input Output
CH6906470016006671002 Country Code CH
Basic Bank Account Number 06470016006671002
Key 69
Example 3 Input Output
FR7030002005500000157845Z02 Country Code FR
Basic Bank Account Number 30002005500000157845Z02
Key 70
Remarks This parse definition has been configured for 59 country codes according to the ECBS definition. The ISO 13616 standard specifies the structure of an ISO-compliant national IBAN format. A copy of the ISO 13616 standard can be obtained through the ISO home page, http://www.iso.org/.
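
The general IBAN tokens follow directly from fixed positions: two characters of country code, two characters of key, and the remainder as the Basic Bank Account Number. A minimal sketch, with a hypothetical function name and no country-specific validation:

```python
def parse_iban(iban):
    """Split an IBAN into the three general tokens by position."""
    compact = "".join(iban.split()).upper()  # ignore spaces used in printed form
    return {
        "Country Code": compact[:2],
        "Basic Bank Account Number": compact[4:],
        "Key": compact[2:4],
    }
```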

 

IBAN (Detailed)
Description The IBAN (Detailed) parse definition parses International Bank Account Numbers into a set of tokens.
Output Tokens Country Code
Control Key
Basic Bank Account Number
Bank Code
Sort Code
Account Number
Key
Example 1 Input Output
NL91ABNA0417164300 Country Code NL
Control Key 91
Basic Bank Account Number
Bank Code ABNA
Sort Code  
Account Number 0417164300
Key  
Example 2 Input Output
CH6906470016006671002 Country Code CH
Control Key 69
Basic Bank Account Number  
Bank Code 06470
Sort Code  
Account Number 016006671002
Key  
Example 3 Input Output
FR7030002005500000157845Z02 Country Code FR
Control Key 70
Basic Bank Account Number  
Bank Code 30002
Sort Code 00550
Account Number 0000157845Z
Key 02
Remarks This parse definition has been configured for 59 country codes according to the ECBS definition. The ISO 13616 standard specifies the structure of an ISO-compliant national IBAN format. A copy of the ISO 13616 standard can be obtained through the ISO home page, http://www.iso.org/.
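
The detailed tokens depend on a per-country layout of the account portion. The sketch below covers only the three countries in the examples, with field widths read off those examples; the table `BBAN_LAYOUT` and the function name are hypothetical, and the Basic Bank Account Number token is left empty as in the example output.

```python
# Widths of (Bank Code, Sort Code, Account Number, Key), inferred from the examples.
BBAN_LAYOUT = {
    "NL": (4, 0, 10, 0),
    "CH": (5, 0, 12, 0),
    "FR": (5, 5, 11, 2),
}
FIELDS = ("Bank Code", "Sort Code", "Account Number", "Key")

def parse_iban_detailed(iban):
    """Split an IBAN into detailed tokens using a per-country layout."""
    compact = "".join(iban.split()).upper()
    tokens = {
        "Country Code": compact[:2],
        "Control Key": compact[2:4],
        "Basic Bank Account Number": "",  # empty in the detailed examples above
    }
    widths = BBAN_LAYOUT[tokens["Country Code"]]  # KeyError for other countries
    position = 4
    for field, width in zip(FIELDS, widths):
        tokens[field] = compact[position:position + width]
        position += width
    return tokens
```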

 

Website
Description The Website parse definition parses Web sites into a set of tokens.
Output Tokens Scheme
Hostname
Project
Example Input Output
http://www.dataflux.com/News-and-Events/ Scheme http://
Hostname www.dataflux.com
Project News-and-Events
Remarks  
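
For the example above, the three tokens can be obtained with Python's standard `urllib.parse.urlsplit`. This is a sketch with a hypothetical function name; it treats the whole path as the Project token, which is an assumption for paths with more than one segment.

```python
from urllib.parse import urlsplit

def parse_website(url):
    """Split a Web site string into Scheme, Hostname, and Project tokens."""
    parts = urlsplit(url)
    return {
        "Scheme": f"{parts.scheme}://" if parts.scheme else "",
        "Hostname": parts.netloc,           # netloc keeps the original letter case
        "Project": parts.path.strip("/"),   # path without leading/trailing slashes
    }
```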

Pattern Analysis Definitions

Character
Description The Character pattern analysis definition determines the pattern of characters in the input string.
Output Symbols Symbol Meaning
A uppercase letter
a lowercase letter
9 numeric digit
* other (punctuation, and so on)
Examples Input Output
1 877-846-Flux 9 999*999*Aaaa
JND 5134 AAA 9999
Remarks Whitespace in the input string is represented as whitespace in the output.
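
The character-level pattern can be sketched as a per-character mapping. The function name is hypothetical, and the treatment of letters that have no case (they fall through to "*") is an assumption.

```python
def character_pattern(text):
    """Map each character to A, a, 9, or *, keeping whitespace unchanged."""
    symbols = []
    for ch in text:
        if ch.isspace():
            symbols.append(ch)       # whitespace is copied through
        elif ch.isdigit():
            symbols.append("9")
        elif ch.isalpha() and ch.isupper():
            symbols.append("A")
        elif ch.isalpha() and ch.islower():
            symbols.append("a")
        else:
            symbols.append("*")      # punctuation and anything else
    return "".join(symbols)
```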

 

Character (Script Identification)
Description The Character (Script Identification) pattern analysis definition determines the Unicode script of each character in the input, and outputs a character representing that script.
Output Symbols Symbol Meaning
L Uppercase Latin character
l Lowercase Latin character
漢 Kanji/Han
ア Katakana
Hiragana
Hangul
Я Uppercase Cyrillic character
я Lowercase Cyrillic character
Θ Uppercase Greek character
θ Lowercase Greek character
Thai
أ Arabic character
א Hebrew character
9 Numeric digit
* other (punctuation, and so on)
Examples Input Output
1ー13ー1 イヌイビル・カチドキ8F 501号室 9*99*9 アアアアア*アアアア9L 999漢漢
JOHN DOE LLLL LLL
(7F, SAS Institute)スズキイチロウ *9L* LLL Lllllllll*アアアアアアア
李大伟 赛仕(北京) 漢漢漢 漢漢*漢漢*
爱新觉罗·溥仪 漢漢漢漢*漢漢
陈耀昌(Chan,Ed Yiu-Cheong) 漢漢漢*Llll*Ll Lll*Llllll*
星光大道62号海王星科技大厦A座6楼 漢漢漢漢99漢漢漢漢漢漢漢漢L漢9漢
珠海市 245400(玫瑰楼) 漢漢漢 999999*漢漢漢*
二零零九年十月二十一日 漢漢漢漢漢漢漢漢漢漢漢
14Mar, 2001 99Lll* 9999
2009/10/21 9999*99*99
H134981(5)------ L999999*9*******
0174685503(D) 9999999999*L*
22020319691106184X 99999999999999999L
碧丽服装(北京)有限公司 漢漢漢漢*漢漢*漢漢漢漢
电话(+86)10-12345678 漢漢**99*99*99999999
Fax:01082741510 Lll*99999999999
(010)82741510-345 *999*99999999*999
Αθήνα Θθθθθ
Банк Яяяя
רודיה סקאלה כשאני אוהב (הערות Liner) Sonotone (1990) אאאאא אאאאא אאאאא אאאא אאאאא Lllll Llllllll 9999
Remarks  

 

Word
Description The Word pattern analysis definition determines the pattern of words in the input string.
Output Symbols Symbol Meaning
A alphabetic
9 numeric digit
M mixed alphabetic/numeric
* other (punctuation, and so on)
Examples Input Output
1 877-846-Flux 9 9*9*A
JND 5134 A 9
216 E 116th St 9 A M A
Remarks Whitespace in the input string is represented as whitespace in the output.
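
The word-level pattern can be sketched by classifying each run of letters and digits. The function name is hypothetical, and collapsing a run of adjacent punctuation characters into a single "*" is an assumption that the examples above do not test.

```python
import re

# A token is a run of ASCII letters/digits, a run of whitespace, or a run of other characters.
_TOKEN = re.compile(r"[A-Za-z0-9]+|\s+|[^A-Za-z0-9\s]+")

def word_pattern(text):
    """Map each word to A, 9, or M and other runs to *, keeping whitespace."""
    def symbol(match):
        token = match.group(0)
        if token.isspace():
            return token   # whitespace is copied through
        if token.isdigit():
            return "9"     # digits only
        if token.isalpha():
            return "A"     # letters only
        if token.isalnum():
            return "M"     # mixed letters and digits, such as "116th"
        return "*"         # punctuation and anything else
    return _TOKEN.sub(symbol, text)
```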

 

Word (Script Identification)
Description The Word (Script Identification) pattern analysis definition determines the Unicode script of each word in the input, and outputs a character representing that script.
Output Symbols Symbol Meaning
L Latin character
漢 Kanji/Han
ア Katakana
Hiragana
Hangul
Я Cyrillic
Θ Greek
Thai
أ Arabic
א Hebrew
9 Numeric digit
* other (punctuation, and so on)
Examples Input Output
1ー13ー1 イヌイビル・カチドキ8F 501号室 9*9*9 ア*ア9L 9漢
JOHN DOE L L
(7F, SAS Institute)スズキイチロウ *9L* L L*ア
李大伟 赛仕(北京) 漢 漢*漢*
爱新觉罗·溥仪 漢*漢
陈耀昌(Chan,Ed Yiu-Cheong) 漢*L*L L*L*
星光大道62号海王星科技大厦A座6楼 漢9漢L漢9漢
珠海市 245400(玫瑰楼) 漢 9*漢*
二零零九年十月二十一日
14Mar, 2001 9L* 9
2009/10/21 9*9*9
H134981(5)------ L9*9*
0174685503(D) 9*L*
22020319691106184X 9L
碧丽服装(北京)有限公司 漢*漢*漢
电话(+86)10-12345678 漢*9*9*9
Fax:01082741510 L*9
(010)82741510-345 *9*9*9
ΑNDREΑS ZIΑKΑS W W
רודיה סקאלה כשאני אוהב (הערות Liner) Sonotone (1990) א א א א א L L 9
Remarks If a word contains a mix of Greek and Cyrillic, Latin and Cyrillic, or Latin and Greek glyphs (as in the example ΑNDREΑS ZIΑKΑS, wherein the character Α is the Greek "Alpha" glyph), this definition outputs a W, indicating a warning of potentially fraudulent data.
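
The mixed-script warning can be approximated with the standard `unicodedata` module by looking at the script word that begins each character's Unicode name. This is a sketch with hypothetical function names; it only distinguishes Latin, Greek, and Cyrillic and is not the QKB's script detection.

```python
import unicodedata

_WATCHED_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")

def scripts_in(word):
    """Return the watched scripts that occur in word, based on Unicode character names."""
    found = set()
    for ch in word:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        for script in _WATCHED_SCRIPTS:
            if name.startswith(script):
                found.add(script)
    return found

def is_suspicious(word):
    """True if the word mixes at least two of Latin, Greek, and Cyrillic."""
    return len(scripts_in(word)) > 1
```

A word such as "\u0391NDRE\u0391S", which looks like Latin text but contains the Greek capital Alpha (U+0391), is flagged, while a word written entirely in one script is not.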

Standardization Definitions

ASCII Non-Printable Character Removal
Description The ASCII Non-Printable Character Removal standardization definition removes control characters and other non-printable characters.
Examples Input Output
Mr. John Smith[DELETE] Mr. John Smith
2004 Honda Accord[NEXT LINE] 2004 Honda Accord
Remarks  

 

E-mail
Description The E-mail standardization definition standardizes e-mail addresses.
Examples Input Output
John Smith <john.smith@dataflux.com> john.smith@dataflux.com
JOHN.SMITH@DATAFLUX.COM john.smith@dataflux.com
mail: john.smith@dataflux.com john.smith@dataflux.com
"john.smith@dataflux.com" john.smith@dataflux.com
"john.smith@hotmail.com" john.smith@hotmail.com
john.Smith.@hotmail ..com. john.smith@hotmail.com
Remarks The E-mail standardization definition removes unnecessary additional information. In some cases, it is also able to correct typos.
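
The examples above are consistent with three steps: tidy stray dots and spaces, extract the address from surrounding text, and lowercase it. The sketch below implements only those steps; the regular expressions and function name are assumptions, and the definition's other typo corrections are not reproduced.

```python
import re

# A loose address pattern: mailbox characters, "@", then domain characters.
_ADDRESS = re.compile(r"[\w.+-]+@[\w.-]+")

def standardize_email(text):
    """Extract and normalize the e-mail address found in text."""
    text = re.sub(r"\s*\.\s*", ".", text)  # remove whitespace around dots
    text = re.sub(r"\.{2,}", ".", text)    # collapse repeated dots
    match = _ADDRESS.search(text)
    if match is None:
        return text.strip()                # nothing recognizable; leave as is
    mailbox, _, domain = match.group(0).rpartition("@")
    # Drop dots left dangling at either end of the mailbox or the domain.
    return f"{mailbox.strip('.')}@{domain.strip('.')}".lower()
```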

 

Hyphen/Dash Removal
Description The Hyphen/Dash Removal standardization definition removes hyphen and dash characters.
Examples Input Output
Mary-Ann MaryAnn
12-12-2000 12122000
Remarks  

 

Hyphen/Dash Space Replacement
Description The Hyphen/Dash Space Replacement standardization definition replaces hyphen and dash characters with a space character.
Examples Input Output
North-Carolina North Carolina
12-12-2000 12 12 2000
Remarks  

 

IBAN (Electronic)
Description The IBAN (Electronic) standardization definition standardizes International Bank Account Numbers for electronic storage.
Examples Input Output
NL91ABNA0417164300 NL91ABNA0417164300
CH6906470016006671002 CH6906470016006671002
FR7030002005500000157845Z02 FR7030002005500000157845Z02
MT84 MALT 0110 0001 2345 MTLC AST0 01S MT84MALT011000012345MTLCAST001S
Remarks  

 

IBAN (Printed)
Description The IBAN (Printed) standardization definition standardizes International Bank Account Numbers for printout.
Examples Input Output
NL91ABNA0417164300 IBAN NL91 ABNA 0417 1643 00
CH6906470016006671002 IBAN CH69 0647 0016 0066 7100 2
FR7030002005500000157845Z02 IBAN FR70 3000 2005 5000 0015 7845 Z02
MT84 MALT 0110 0001 2345 MTLC AST0 01S IBAN MT84 MALT 0110 0001 2345 MTLC AST0 01S
Remarks  
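
The electronic and printed forms differ only in spacing and the "IBAN" prefix. A minimal sketch that reproduces the examples for both definitions, with hypothetical function names:

```python
def iban_electronic(iban):
    """Electronic form: no spaces, uppercase."""
    return "".join(iban.split()).upper()

def iban_printed(iban):
    """Printed form: the word IBAN followed by groups of four characters."""
    compact = iban_electronic(iban)
    groups = [compact[i:i + 4] for i in range(0, len(compact), 4)]
    return "IBAN " + " ".join(groups)
```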

 

Multiple Space Collapse
Description The Multiple Space Collapse standardization definition collapses multiple space characters to one space character.
Examples Input Output
Jack    Miller Jack Miller
12   12 12 12 12 12
Remarks  

 

Non-Alphanumeric Removal
Description The Non-Alphanumeric Removal standardization definition removes all non-alphanumeric characters including spaces.
Examples Input Output
Cary.NC.27513 CARYNC27513
#AA-456-A12 AA456A12
Remarks Output will be in uppercase.

 

Non-Number Removal
Description The Non-Number Removal standardization definition removes all non-number characters.
Examples Input Output
John Smith 123 123
AAA111 111 111111
Remarks  

 

Number Removal
Description The Number Removal standardization definition removes all number characters.
Examples Input Output
John Smith 123 John Smith
AAA111 111 AAA
Remarks  

 

Phone Country Code to Country Name
Description The Phone Country Code to Country Name standardization definition transforms a phone country code into its corresponding country name.
Examples Input Output
+1 United States/Canada
+49 Germany
33 France
0034 Spain
Remarks  
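
The examples show that a leading "+" or "00" is ignored before the code is looked up. A sketch of that behavior follows; `COUNTRY_BY_CODE` is a hypothetical excerpt limited to the examples, and returning `None` for an unknown code is an assumption.

```python
import re

# Hypothetical excerpt of the code-to-country table, taken from the examples above.
COUNTRY_BY_CODE = {
    "1": "United States/Canada",
    "33": "France",
    "34": "Spain",
    "49": "Germany",
}

def phone_country_name(code):
    """Return the country name for a phone country code, or None if unknown."""
    # Strip an international prefix written as "+" or "00".
    digits = re.sub(r"^\s*(?:\+|00)", "", code).strip()
    return COUNTRY_BY_CODE.get(digits)
```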

 

Punctuation Removal
Description The Punctuation Removal standardization definition removes all punctuation except hyphen/dash characters.
Examples Input Output
100 Main St. Apt. #100 100 Main St Apt 100
Joan Allen:Steve Allen Joan AllenSteve Allen
Remarks  

 

Punctuation Space Replacement
Description The Punctuation Space Replacement standardization definition replaces all punctuation except hyphen/dash characters with a space character.
Examples Input Output
100 Main St. Apt. #100 100 Main St Apt 100
Joan Allen:Steve Allen Joan Allen Steve Allen
Remarks  

 

Space Removal
Description The Space Removal standardization definition removes all space characters.
Examples Input Output
10 : 10 10:10
N A NA
Remarks  

 

Surrounding Quote Removal
Description The Surrounding Quote Removal standardization definition removes quote characters surrounding an entire string.
Examples Input Output
"1" Steel Tube" 1" Steel Tube
"John O'Malley" John O'Malley
Remarks  
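
As the first example shows, only the quotes that enclose the entire string are removed; quotes inside the value are kept. A minimal sketch with a hypothetical function name, handling double quotes only:

```python
def remove_surrounding_quotes(text):
    """Remove one pair of double quotes if they enclose the entire string."""
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text
```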

 

URL
Description The URL standardization definition standardizes URLs.
Examples Input Output
http://www.dataflux.com/News-and-Events/ http://www.dataflux.com/news-and-events
ftp:/file.txt ftp://file.txt
Remarks  

 

Website
Description The Website standardization definition standardizes Web sites.
Examples Input Output
WWW.DATAFLUX.COM www.dataflux.com
http://www.dataflux.com/News-and-Events/ www.dataflux.com/news-and-events
Remarks