DataFlux Data Management Studio 2.7: User Guide

Match Definitions

Match definitions generate match codes for input strings. The match codes represent the character content of the tokens in the input strings. If two input strings generate the same match code, then those two strings are intended to represent the same entity.

Match codes are valuable because two strings do not need to match character for character in order to generate the same match code. For example, the string Washington DC should generate the same match code as Washington D.C. Similar strings can then be grouped by match code and standardized into a single, consistent representation.
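The grouping idea can be illustrated with a minimal Python sketch. The function `toy_match_code` is a hypothetical stand-in that only uppercases and strips punctuation and spaces; it is not the encoding that Data Management Studio produces, and is used here only to show how strings that share a code fall into one group.

```python
import re
from collections import defaultdict

def toy_match_code(text: str) -> str:
    """Hypothetical stand-in for a match code: uppercase letters and digits only."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())

def group_by_match_code(strings):
    """Collect input strings under the (toy) match code they generate."""
    groups = defaultdict(list)
    for s in strings:
        groups[toy_match_code(s)].append(s)
    return dict(groups)

groups = group_by_match_code(["Washington DC", "Washington D.C.", "Boston MA"])
for code, members in groups.items():
    print(code, members)
```

Each group can then be standardized to one representation, for example by picking the most frequent member.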

The degree of similarity that strings need in order to generate the same match code is set by a numeric sensitivity value. By changing the sensitivity value, you change the number of matches between match codes. Within the range of 50-100, a low sensitivity value generates a simpler match code. Simpler match codes generate more matches, including matches between strings that are less similar. Higher sensitivity values generate more complex match codes, which produce fewer matches and require a higher degree of similarity between strings.
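The effect of sensitivity on the number of matches can be sketched as follows. The thresholds (70 and 90) and the rule of keeping only the first few characters of each word are assumptions made for illustration; the actual match definitions use their own rules.

```python
import re

def sensitive_match_code(text: str, sensitivity: int) -> str:
    """Toy code that keeps more characters per word as sensitivity rises.

    Assumption: 3 characters below 70, 5 below 90, whole words otherwise.
    """
    if sensitivity < 70:
        keep = 3
    elif sensitivity < 90:
        keep = 5
    else:
        keep = None  # slice with None keeps the whole word
    words = re.findall(r"[A-Z0-9]+", text.upper())
    return "".join(w[:keep] for w in words)

names = ["Jon Smith", "Jonathan Smith", "Jonathon Smyth"]
for level in (50, 95):
    codes = {sensitive_match_code(n, level) for n in names}
    print(level, len(codes), "distinct codes")
```

With these toy rules, the simpler codes at the low level merge more of the names than the complex codes at the high level.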

In a match definition, input strings are parsed into component substrings called tokens. The tokens are restructured to remove unimportant content, and then they are combined into a string. The token string is used to generate the match code.

The match definition is executed at a specified level of sensitivity. Internally, the match definition is configured with 10 sensitivity ranges. The input sensitivity maps into one of these ranges, and the range determines the content of the token string that generates the match code. For each sensitivity range, the match definition specifies how the tokens are assembled into the token string. At lower sensitivities, tokens may be truncated or removed to enable matches between strings that are less similar.
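A minimal sketch of this lookup, in Python, is shown below. It assumes that the 10 ranges are equal-width bins across 50-100, which the text does not state. The token names (`given`, `middle`, `family`) and the per-range rules are also hypothetical, and several ranges share one rule set to keep the example short.

```python
MIN_SENSITIVITY, MAX_SENSITIVITY, RANGE_COUNT = 50, 100, 10

def sensitivity_range(sensitivity: int) -> int:
    """Return a range index 0..9, assuming equal-width ranges over 50-100."""
    if not MIN_SENSITIVITY <= sensitivity <= MAX_SENSITIVITY:
        raise ValueError("sensitivity must be between 50 and 100")
    width = (MAX_SENSITIVITY - MIN_SENSITIVITY) // RANGE_COUNT  # 5
    return min((sensitivity - MIN_SENSITIVITY) // width, RANGE_COUNT - 1)

def rules_for(range_index: int) -> dict:
    """Hypothetical rules: characters kept per token (None = all, 0 = drop)."""
    if range_index <= 2:
        return {"given": 1, "middle": 0, "family": 4}
    if range_index <= 6:
        return {"given": 3, "middle": 1, "family": None}
    return {"given": None, "middle": None, "family": None}

def assemble_token_string(tokens: dict, sensitivity: int) -> str:
    """Truncate or drop tokens according to the rules of the matching range."""
    rules = rules_for(sensitivity_range(sensitivity))
    parts = []
    for name, value in tokens.items():
        limit = rules[name]
        kept = value.upper() if limit is None else value.upper()[:limit]
        if kept:  # a limit of 0 removes the token entirely
            parts.append(kept)
    return "|".join(parts)

tokens = {"given": "John", "middle": "James", "family": "McDonald"}
for level in (55, 75, 95):
    print(level, assemble_token_string(tokens, level))
```

Keeping the rules in a table keyed by range index, rather than in branching logic per sensitivity value, mirrors the description of a definition that is configured once per range.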

The match code generation process is summarized as follows:
receive input string
    parse tokens from string
        assemble token string based on sensitivity
            generate a match code
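The four steps above can be sketched as a small Python pipeline. Every function here is a hypothetical simplification: tokens are split on non-alphanumeric characters, middle tokens are removed below an assumed sensitivity of 75, and the encoding step simply drops vowels that do not begin a word. None of this reflects the actual encoding that the product uses.

```python
import re

def parse_tokens(text: str) -> list:
    """Step 2: split the input string into tokens (toy rule: alphanumeric runs)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def assemble_token_string(tokens: list, sensitivity: int) -> str:
    """Step 3: at low sensitivity (assumed < 75), keep only first and last tokens."""
    if sensitivity < 75 and len(tokens) > 2:
        tokens = [tokens[0], tokens[-1]]
    return " ".join(t.upper() for t in tokens)

def encode(token_string: str) -> str:
    """Step 4: toy encoding that drops vowels after the first letter of each word."""
    words = token_string.split()
    return "".join(w[0] + re.sub(r"[AEIOU]", "", w[1:]) for w in words)

def generate_match_code(text: str, sensitivity: int) -> str:
    """Step 1 receives the input; the remaining steps run in sequence."""
    return encode(assemble_token_string(parse_tokens(text), sensitivity))

print(generate_match_code("John James McDonald", 60))
print(generate_match_code("John McDonald", 60))
print(generate_match_code("John James McDonald", 90))
```

In this sketch, the two names produce the same code at sensitivity 60 because the middle token is removed, and different codes at 90 because it is kept.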

Using Customize, you can create or edit match definitions to fine-tune your match codes for your applications. One way to customize match definitions is to implement suggestion-based matching, which helps you detect and correct errors in spelling and typography. You can also use Customize to implement token-based matching, which helps you detect and correct tokens that are missing or out of position.

Input: a string, or pre-parsed input, and a sensitivity level

Example:

"John James McDonald"

Output: one or more match codes (an encoded string of characters)

Nodes

Hierarchy | Node/Group | Container Group | Count
1 | Match Definition Head Node | | 1
2 | Preprocessing Regex Libraries Group | | 1
2.1 | Preprocessing Regex Library Node | Preprocessing Regex Libraries Group | 0 or more
3 | Preprocessing Schemes Group | | 1
3.1 | (Preprocessing) Transformation Scheme Node | Preprocessing Schemes Group | 0 or more
4 | Tokenization Node | | 1
5 | Match Morph Analysis Token Group | | 1 or more (* and **)
5.1 | Morph Analysis Group | Match Morph Analysis Token Group | 1 (*)
5.1.1 | Lookup Group | Morph Analysis Group | 1 (*)
5.1.1.1 | Uppercasing Node | Lookup Group | 1 (*)
5.1.1.2 | Normalization Regex Library Group | Lookup Group | 1 (*)
5.1.1.2.1 | Regex Library Node | Normalization Regex Library Group | 0 or more (*)
5.1.2 | Vocabularies Group | Lookup Group | 1 (*)
5.1.2.1 | Vocabulary Node | Vocabularies Group | 0 or more (*)
5.1.3 | Categorization Regex Libraries Group | Morph Analysis Group | 1 (*)
5.1.3.1 | Categorization Regex Library Node | Categorization Regex Libraries Group | 0 or more (*)
5.1.4 | Number Check Node | Morph Analysis Group | 1 (*)
5.1.5 | Default Categories Node | Morph Analysis Group | 1 (*)
6 | Token Combination Rules Group | | 1 (*)
6.1 | Token Combination Rule Node | Token Combination Rule Group | 0 or more (*)
7 | Match Token Group | | 1 or more (* and **)
7.1 | Suggestions Group | Match Token Group | 1
7.1.1 | Suggestions Node | Suggestions Group | 0 or 1
7.2 | Match Normalization Group | Match Token Group | 1
7.2.1 | Uppercasing Node | Match Normalization Group | 1
7.2.2 | Pre-Scheme Regex Libraries Group | Match Normalization Group | 1
7.2.2.1 | Regex Library Node | Pre-Scheme Regex Libraries Group | 0 or more
7.3 | Noise Word Removal Group | Match Token Group | 1
7.3.1 | Noise Word Vocabulary Node | Noise Word Removal Group | 0 or more
7.4 | Table-Based Transformations Group | Match Token Group | 1
7.4.1 | Transformation Scheme Node | Table-Based Transformations Group | 0 or more
7.5 | Post-Scheme Regex Libraries Group | Match Token Group | 1
7.5.1 | Regex Library Node | Post-Scheme Regex Libraries Group | 0 or more
7.6 | Phonetics Group | Match Token Group | 1
7.6.1 | Phonetics Library Node | Phonetics Group | 0 or more
8 | Matchcode Layout Node | | 1
9 | Match Score Threshold Node | | 1
10 | Match Encoding Node | | 1

(*) not displayed when there is no Tokenization definition

(**) up to maximum number of tokens in data type
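
The groups listed under the Match Token Group (uppercasing, pre-scheme regex libraries, noise word removal, table-based transformations, post-scheme regex libraries, and phonetics) suggest a chain of transformations applied to each token. The Python sketch below assumes that the stages run in the order in which they are listed, which this page does not confirm. The noise words, scheme entries, regular expressions, and the phonetic rule are all invented for illustration.

```python
import re

NOISE_WORDS = {"THE", "INC"}        # hypothetical noise word vocabulary
SCHEME = {"CORP": "CORPORATION"}    # hypothetical transformation scheme

def uppercase(token: str) -> str:
    return token.upper()

def pre_scheme_regex(token: str) -> str:
    """Toy pre-scheme rule: remove punctuation."""
    return re.sub(r"[^A-Z0-9 ]", "", token)

def remove_noise_words(token: str) -> str:
    return " ".join(w for w in token.split() if w not in NOISE_WORDS)

def apply_scheme(token: str) -> str:
    """Toy table-based transformation: replace words found in the scheme."""
    return " ".join(SCHEME.get(w, w) for w in token.split())

def post_scheme_regex(token: str) -> str:
    """Toy post-scheme rule: collapse repeated whitespace."""
    return re.sub(r"\s+", " ", token).strip()

def phonetics(token: str) -> str:
    """Toy phonetic rule: treat PH as F."""
    return token.replace("PH", "F")

# Assumed order, following the node listing for the Match Token Group.
STAGES = [uppercase, pre_scheme_regex, remove_noise_words,
          apply_scheme, post_scheme_regex, phonetics]

def process_match_token(token: str) -> str:
    for stage in STAGES:
        token = stage(token)
    return token

print(process_match_token("The Phoenix Corp."))
print(process_match_token("phoenix corporation"))
```

Both inputs reduce to the same processed token in this sketch, which is the property that lets differently written values share a match code.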
