DataFlux Data Management Studio 2.6: User Guide

Match Definitions

Match definitions generate match codes for input strings. The match codes represent the character content of the tokens in the input strings. If two input strings generate the same match code, then those two strings are considered to represent the same entity.

Match codes are valuable because two strings do not need to match character for character in order to generate the same match code. For example, the string Washington DC should generate the same match code as Washington D.C. Similar strings can be grouped by match code and standardized into a single, consistent representation.
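The grouping idea can be illustrated with a minimal Python sketch. It is not the DataFlux algorithm: toy_match_code and group_by_match_code are hypothetical names, and the "match code" here is assumed to be nothing more than the uppercased input with punctuation and spaces removed.

```python
import re
from collections import defaultdict


def toy_match_code(text: str) -> str:
    """Hypothetical stand-in for a match code: uppercase, then drop
    every character that is not a letter or digit."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def group_by_match_code(strings):
    """Cluster input strings that share the same toy match code."""
    groups = defaultdict(list)
    for s in strings:
        groups[toy_match_code(s)].append(s)
    return dict(groups)


groups = group_by_match_code(["Washington DC", "Washington D.C.", "Boston MA"])
# "Washington DC" and "Washington D.C." land in the same group, so either
# spelling could be chosen as the standardized representation.
```

A real match definition does far more than strip punctuation, but the grouping step works the same way: strings are bucketed by the code they generate.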

A numeric sensitivity value sets the degree of similarity that strings must have in order to generate the same match code. By changing the sensitivity value, you change the number of strings that share a match code. Within the range of 50-100, a lower sensitivity value generates a simpler match code. Simpler match codes produce more matches, including matches between strings that are less similar. Higher sensitivity values generate more complex match codes, which produce fewer matches and require a higher degree of similarity between strings.

In a match definition, input strings are parsed into component substrings called tokens. The tokens are structured to remove unimportant content, then they are combined into a string. The token string is used to generate the match code.

The match definition is executed at a specified level of sensitivity. Internally, the match definition is configured with 10 sensitivity ranges. The input sensitivity maps to one of these ranges, and the range determines the content of the token string that generates the match code. For each sensitivity range, the match definition specifies how the tokens are assembled into the token string. At lower sensitivities, tokens may be truncated or removed to enable matches with less similarity.
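As a rough illustration of how a sensitivity value might select a range and how that range might control truncation, consider the Python sketch below. The equal-width split of 50-100 into 10 ranges, the per-range character limits, and the names sensitivity_range and truncate_tokens are all assumptions made for this example; the actual range boundaries and token rules are set in each match definition.

```python
def sensitivity_range(sensitivity: int, low: int = 50, high: int = 100,
                      ranges: int = 10) -> int:
    """Map a sensitivity in [low, high] to a range index 0..ranges-1.
    Equal-width ranges are an assumption, not the product's layout."""
    if not low <= sensitivity <= high:
        raise ValueError(f"sensitivity must be between {low} and {high}")
    width = (high - low + 1) / ranges
    return min(int((sensitivity - low) / width), ranges - 1)


def truncate_tokens(tokens, sensitivity: int) -> str:
    """Build a token string, keeping fewer characters per token in lower
    ranges. The limits (index + 2, unlimited in the top range) are hypothetical."""
    idx = sensitivity_range(sensitivity)
    max_chars = None if idx == 9 else idx + 2
    return "".join(t.upper()[:max_chars] for t in tokens)


low_a = truncate_tokens(["Jonathan", "Smith"], 50)    # "JOSM"
low_b = truncate_tokens(["Jon", "Smith"], 50)         # "JOSM"
high_a = truncate_tokens(["Jonathan", "Smith"], 100)  # "JONATHANSMITH"
high_b = truncate_tokens(["Jon", "Smith"], 100)       # "JONSMITH"
```

At sensitivity 50 the two names collapse to the same token string, while at 100 they stay distinct, which mirrors the behavior described above.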

The match code generation process is summarized as follows:
receive input string
    parse tokens from string
        assemble token string based on sensitivity
            generate a match code
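
The four steps above can be sketched end to end in Python. This is a toy model under stated assumptions, not the product's implementation: the noise-word list, the cutoff of 85 below which middle tokens are removed, and the vowel-dropping encoding are all invented for illustration, as are the function names.

```python
import re

NOISE_WORDS = {"MR", "MRS", "DR", "JR"}  # hypothetical noise-word vocabulary


def parse_tokens(text):
    """Step 2: split the input into uppercase word tokens, dropping punctuation."""
    return re.findall(r"[A-Z0-9]+", text.upper())


def assemble_token_string(tokens, sensitivity):
    """Step 3: drop noise words; below an assumed cutoff of 85, keep only
    the first and last remaining tokens."""
    kept = [t for t in tokens if t not in NOISE_WORDS]
    if sensitivity < 85 and len(kept) > 2:
        kept = [kept[0], kept[-1]]
    return " ".join(kept)


def generate_match_code(token_string):
    """Step 4: toy encoding that keeps each token's first character and
    removes vowels from the rest of the token."""
    parts = []
    for token in token_string.split():
        parts.append(token[0] + re.sub(r"[AEIOU]", "", token[1:]))
    return "".join(parts)


def match_code(text, sensitivity):
    """Steps 1-4: receive the input string and return its toy match code."""
    tokens = parse_tokens(text)
    return generate_match_code(assemble_token_string(tokens, sensitivity))


full = match_code("John James McDonald", 95)   # "JHNJMSMCDNLD"
loose = match_code("John James McDonald", 70)  # "JHNMCDNLD"
other = match_code("Mr. John McDonald", 70)    # "JHNMCDNLD"
```

In this sketch, the middle name is removed at sensitivity 70, so John James McDonald and Mr. John McDonald share a code; at 95 the middle name is retained and the codes differ.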

Using Customize, you can create or edit match definitions to fine-tune your match codes for your applications. One way to customize match definitions is to implement suggestion-based matching, which helps you detect and correct errors in spelling and typography. You can also use Customize to implement token-based matching, which helps you detect and correct tokens that are missing or out of position.

Input: a string, or pre-parsed input, and a sensitivity level

Example:

"John James McDonald"

Output: one or more match codes (an encoded string of characters)

Nodes

Hierarchy | Node/Group                          | Container Group               | Count
1         | Match Definition Head Node          |                               | 1
2         | Preprocessing Group                 |                               | 1
2.1       | Preprocessing Scheme Node           | Preprocessing Group           | 0 or more
3         | Parsing Node                        |                               | 1
4         | Match Token Group                   |                               | 1 or more (*)
4.1       | Suggestions Node                    | Suggestions Group             | 1 or more
4.1.1     | Lookup Normalization Group          |                               | 1
4.1.2     | Uppercasing Node                    |                               | 1
4.1.2.1   | Normalization Regex Libraries Group |                               | 1
4.1.2.2   | Normalization Regex Library Node    |                               | 0 or more
4.2       | Noise Word Removal Group            |                               | 1
4.2.1     | Noise Word Vocabulary Node          | Noise Word Removal Group      | 0 or more
4.3       | Transformations Schemes Group       |                               | 1
4.3.1     | Transformation Scheme Node          | Transformations Schemes Group | 0 or more
4.4       | Phonetics Group                     |                               | 1
4.4.1     | Phonetics Library Node              | Phonetics Group               | 0 or more
5         | Matchcode Layout Node               |                               | 1
6         | Match Score Threshold Node          |                               | 1
7         | Match Encoding Node                 |                               | 1

(*) Up to the maximum number of tokens in the data type.

Doc ID: DMCust_12800.html