DataFlux Data Management Studio 2.7: User Guide
Match definitions generate match codes for input strings. The match codes represent the character content of the tokens in the input strings. If two input strings generate the same match code, then those two strings are intended to represent the same entity.
Match codes are valuable because two strings do not need to match character for character in order to generate the same match code. For example, the string Washington DC should generate the same match code as Washington D.C. Similar strings can then be grouped by match code and standardized into a single, consistent representation.
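The grouping and standardization idea can be sketched in a few lines of Python. This is only an illustration: `toy_match_code` is a hypothetical stand-in that uppercases the input and drops punctuation and spaces. It is not the algorithm that a real match definition uses. The `standardize` helper and the sample data are also assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def toy_match_code(text):
    """Hypothetical stand-in for a match definition: uppercase, keep only letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())

def standardize(values):
    """Group values by toy match code and map each value to its group's most common spelling."""
    clusters = defaultdict(list)
    for value in values:
        clusters[toy_match_code(value)].append(value)
    mapping = {}
    for members in clusters.values():
        survivor = Counter(members).most_common(1)[0][0]
        for member in members:
            mapping[member] = survivor
    return mapping

cities = ["Washington DC", "Washington D.C.", "washington dc", "Washington DC", "Boston"]
mapping = standardize(cities)
print(mapping["Washington D.C."])  # Washington DC
```

Choosing the most frequent spelling as the standard value is one possible policy. Any other rule for picking the surviving representation would work equally well with the same grouping step.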
A numeric sensitivity value sets how similar two strings must be in order to generate the same match code. By changing the sensitivity value, you change the number of matches. Within the range of 50-100, a lower sensitivity value generates a simpler match code. Simpler match codes produce more matches, between strings that are less similar. Higher sensitivity values generate more complex match codes and fewer matches, which reflects a higher degree of similarity between the matched strings.
In a match definition, input strings are parsed into component substrings called tokens. The tokens are structured to remove unimportant content, and then they are combined into a string. The token string is used to generate the match code.
The match definition is executed at a specified level of sensitivity. Internally, the match definition is configured with 10 sensitivity ranges. The input sensitivity maps into one of these ranges, and the range determines the content of the token string that generates the match code. For each sensitivity range, the match definition specifies how the tokens are assembled into the token string. At lower sensitivities, tokens may be truncated or removed to enable matches with less similarity.
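The mapping from a sensitivity value to a range, and from a range to a token-assembly rule, might look like the following sketch. The equal-width ranges, the `CHARS_KEPT` table, and both function names are assumptions made for illustration. The actual range boundaries and per-range rules are set inside each match definition.

```python
def sensitivity_range(sensitivity):
    """Map a sensitivity in 50..100 to a range index 0..9 (assumed equal-width ranges)."""
    if not 50 <= sensitivity <= 100:
        raise ValueError("sensitivity must be between 50 and 100")
    return min((sensitivity - 50) // 5, 9)

# Hypothetical per-range rule: number of leading characters kept from each token.
CHARS_KEPT = [2, 3, 3, 4, 4, 5, 6, 7, 8, 10]

def assemble_token_string(tokens, sensitivity):
    """Truncate each token according to the range's rule, then combine the pieces."""
    keep = CHARS_KEPT[sensitivity_range(sensitivity)]
    return "".join(token.upper()[:keep] for token in tokens)

# Lower sensitivity truncates harder, so less similar inputs collapse together.
print(assemble_token_string(["Jonathan", "McDonald"], 50))  # JOMC
print(assemble_token_string(["John", "McDonald"], 50))      # JOMC
print(assemble_token_string(["Jonathan", "McDonald"], 95))  # JONATHANMCDONALD
```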
The match code generation process is summarized as follows:
receive input string
parse tokens from string
assemble token string based on sensitivity
generate a match code
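The four steps above can be sketched end to end as follows. Every function here is a hypothetical simplification: the tokenizer is a plain regular expression, the assembly step drops middle tokens below an assumed sensitivity threshold of 85, and the encoding merely removes vowels. A real match definition uses its own parse definition, sensitivity ranges, and encoding.

```python
import re

def parse_tokens(text):
    """Step 2: split the input string into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

def assemble_token_string(tokens, sensitivity):
    """Step 3: below the assumed threshold of 85, keep only the first and last tokens."""
    if sensitivity < 85 and len(tokens) > 2:
        tokens = [tokens[0], tokens[-1]]
    return " ".join(token.upper() for token in tokens)

def generate_match_code(token_string):
    """Step 4: toy encoding that keeps each word's first character and drops later vowels."""
    words = token_string.split()
    return "~".join(word[0] + re.sub(r"[AEIOU]", "", word[1:]) for word in words)

def match_code(text, sensitivity):
    """Step 1: receive the input string, then run steps 2 through 4."""
    tokens = parse_tokens(text)
    return generate_match_code(assemble_token_string(tokens, sensitivity))

print(match_code("John James McDonald", 70))  # JHN~MCDNLD
print(match_code("John McDonald", 70))        # JHN~MCDNLD
print(match_code("John James McDonald", 90))  # JHN~JMS~MCDNLD
```

At the lower sensitivity the middle name is removed, so the two inputs share a match code. At the higher sensitivity the middle name is retained and the codes differ.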
Using Customize, you can create or edit match definitions to fine-tune your match codes for your applications. One way to customize match definitions is to implement suggestion-based matching, which helps you detect and correct errors in spelling and typography. You can also use Customize to implement token-based matching, which helps you detect and correct tokens that are missing or out of position.
Input: a string, or pre-parsed input, and a sensitivity level
Example:
"John James McDonald"
Output: one or more match codes (an encoded string of characters)
Hierarchy | Node/Group | Container Group | Count |
---|---|---|---|
1 | Match Definition Head Node | | 1 |
2 | Preprocessing Regex Libraries Group | | 1 |
2.1 | Preprocessing Regex Library Node | Preprocessing Regex Libraries Group | 0 or more |
3 | Preprocessing Schemes Group | | 1 |
3.1 | (Preprocessing) Transformation Scheme Node | Preprocessing Schemes Group | 0 or more |
4 | Tokenization Node | | 1 |
5 | Match Morph Analysis Token Group | | 1 or more (* and **) |
5.1 | Morph Analysis Group | Match Morph Analysis Token Group | 1 (*) |
5.1.1 | Lookup Group | Morph Analysis Group | 1 (*) |
5.1.1.1 | Uppercasing Node | Lookup Group | 1 (*) |
5.1.1.2 | Normalization Regex Library Group | Lookup Group | 1 (*) |
5.1.1.2.1 | Regex Library Node | Normalization Regex Library Group | 0 or more (*) |
5.1.2 | Vocabularies Group | Lookup Group | 1 (*) |
5.1.2.1 | Vocabulary Node | Vocabularies Group | 0 or more (*) |
5.1.3 | Categorization Regex Libraries Group | Morph Analysis Group | 1 (*) |
5.1.3.1 | Categorization Regex Library Node | Categorization Regex Libraries Group | 0 or more (*) |
5.1.4 | Number Check Node | Morph Analysis Group | 1 (*) |
5.1.5 | Default Categories Node | Morph Analysis Group | 1 (*) |
6 | Token Combination Rules Group | | 1 (*) |
6.1 | Token Combination Rule Node | Token Combination Rules Group | 0 or more (*) |
7 | Match Token Group | | 1 or more (* and **) |
7.1 | Suggestions Group | Match Token Group | 1 |
7.1.1 | Suggestions Node | Suggestions Group | 0 or 1 |
7.2 | Match Normalization Group | Match Token Group | 1 |
7.2.1 | Uppercasing Node | Match Normalization Group | 1 |
7.2.2 | Pre-Scheme Regex Libraries Group | Match Normalization Group | 1 |
7.2.2.1 | Regex Library Node | Pre-Scheme Regex Libraries Group | 0 or more |
7.3 | Noise Word Removal Group | Match Token Group | 1 |
7.3.1 | Noise Word Vocabulary Node | Noise Word Removal Group | 0 or more |
7.4 | Table-Based Transformations Group | Match Token Group | 1 |
7.4.1 | Transformation Scheme Node | Table-Based Transformations Group | 0 or more |
7.5 | Post-Scheme Regex Libraries Group | Match Token Group | 1 |
7.5.1 | Regex Library Node | Post-Scheme Regex Libraries Group | 0 or more |
7.6 | Phonetics Group | Match Token Group | 1 |
7.6.1 | Phonetics Library Node | Phonetics Group | 0 or more |
8 | Matchcode Layout Node | | 1 |
9 | Match Score Threshold Node | | 1 |
10 | Match Encoding Node | | 1 |
(*) not displayed when there is no Tokenization definition
(**) up to maximum number of tokens in data type
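The Count column in the table uses a small vocabulary ("1", "0 or 1", "0 or more", "1 or more") plus footnote markers. As a rough sketch, those cells can be turned into numeric bounds and used to check how many nodes of a given type a definition contains. The helper names `count_bounds` and `count_allowed` are hypothetical and are not part of any DataFlux interface.

```python
def count_bounds(spec):
    """Turn a Count cell such as '0 or more (*)' into (minimum, maximum).

    A maximum of None means no fixed upper bound is stated in the cell. For rows
    marked (**), the real limit is the number of tokens in the data type.
    """
    spec = spec.split("(")[0].strip()  # drop footnote markers such as (*) or (* and **)
    if spec.endswith("or more"):
        return int(spec.split()[0]), None
    parts = [int(part) for part in spec.split(" or ")]
    return min(parts), max(parts)

def count_allowed(spec, actual):
    """Check whether an actual node count satisfies a Count cell."""
    low, high = count_bounds(spec)
    return actual >= low and (high is None or actual <= high)

print(count_bounds("1 or more (* and **)"))  # (1, None)
print(count_bounds("0 or 1"))                # (0, 1)
print(count_allowed("0 or 1", 2))            # False
```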
Doc ID: DMCust_12800.html