Match Definitions

Match definitions generate match codes for input strings. The match codes represent the character content of the tokens in the input strings. If two input strings generate the same match code, then those two strings are intended to represent the same entity.

Match codes are valuable because two strings do not need to match character for character in order to generate the same match code. For example, the string Washington DC should generate the same match code as Washington D.C. Similar strings can then be grouped by match code and standardized into a single, consistent representation.
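The grouping idea can be illustrated with a minimal sketch. The function `toy_match_code` below is a hypothetical stand-in, not the algorithm a real match definition uses: it assumes that uppercasing the input and dropping everything except letters and digits is enough to make similar strings collide.

```python
from collections import defaultdict


def toy_match_code(text: str) -> str:
    """Hypothetical stand-in for a match definition.

    Keeps only letters and digits, uppercased. Real match codes are
    encoded strings produced by parsing and normalization steps.
    """
    return "".join(ch for ch in text.upper() if ch.isalnum())


def group_by_match_code(values):
    """Group input strings that share the same (toy) match code."""
    groups = defaultdict(list)
    for value in values:
        groups[toy_match_code(value)].append(value)
    return dict(groups)


records = ["Washington DC", "Washington D.C.", "washington d.c.", "Seattle WA"]
groups = group_by_match_code(records)

# Each group can then be standardized to a single representation,
# for example the first value seen in the group.
standardized = {code: members[0] for code, members in groups.items()}
```

In this sketch the three Washington variants fall into one group and Seattle WA falls into another, so a single standard value can be chosen for each group.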

The degree of similarity that strings need in order to generate the same match code is set by a numeric sensitivity value. Changing the sensitivity value changes the number of matches. Sensitivity values range from 50 to 100. A lower sensitivity value generates a simpler match code, and simpler match codes produce more matches among strings that are less similar. A higher sensitivity value generates a more complex match code, which produces fewer matches and requires a higher degree of similarity between strings.

In a match definition, input strings are parsed into component substrings called tokens. The tokens are processed to remove unimportant content, then they are combined into a string. The token string is used to generate the match code.

The match definition is executed at a specified level of sensitivity. Internally, the match definition is configured with 10 sensitivity ranges. The input sensitivity maps into one of these ranges, and the range determines the content of the token string that generates the match code. For each sensitivity range, the match definition specifies how the tokens are assembled into the token string. At lower sensitivities, tokens may be truncated or removed to enable matches between strings that are less similar.
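The sketch below shows one way a sensitivity value could select a range, and how that range could control token truncation. The equal-width ranges (50-54, 55-59, and so on, with 95-100 as the last range) and the per-range character limits in `MAX_TOKEN_LENGTH_BY_RANGE` are assumptions made for illustration; the actual boundaries and assembly rules are set in the match definition.

```python
# Assumed per-range limit on characters kept from each token.
# None means the token is kept whole. These values are hypothetical.
MAX_TOKEN_LENGTH_BY_RANGE = [2, 3, 4, 5, 6, 7, 8, 9, 10, None]


def sensitivity_range_index(sensitivity: int) -> int:
    """Map a sensitivity of 50-100 to a range index 0-9.

    Assumes equal-width ranges of five values, with 100 folded into
    the last range.
    """
    if not 50 <= sensitivity <= 100:
        raise ValueError("sensitivity must be between 50 and 100")
    return min((sensitivity - 50) // 5, 9)


def assemble_token_string(tokens, sensitivity: int) -> str:
    """Truncate each token according to the selected range, then join."""
    limit = MAX_TOKEN_LENGTH_BY_RANGE[sensitivity_range_index(sensitivity)]
    return "".join(token.upper()[:limit] for token in tokens)


name_a = ["John", "James", "McDonald"]
name_b = ["Jon", "James", "McDonald"]

low_a = assemble_token_string(name_a, 50)
low_b = assemble_token_string(name_b, 50)
high_a = assemble_token_string(name_a, 100)
high_b = assemble_token_string(name_b, 100)
```

Under these assumptions, the two names produce the same token string at sensitivity 50, because each token is cut to two characters, and different token strings at sensitivity 100, where the tokens are kept whole.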

The match code generation process is summarized as follows:
receive input string
    parse tokens from string
        assemble token string based on sensitivity
            generate a match code
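The four steps above can be sketched end to end as follows. Every function here is hypothetical: the regex tokenizer, the truncation rule, and the use of a SHA-256 digest as the encoding are assumptions chosen to keep the example short, and they do not reproduce the encoding that a real match definition applies.

```python
import hashlib
import re


def parse_tokens(text: str):
    """Step 2: split the input into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_token_string(tokens, sensitivity: int) -> str:
    """Step 3: assemble the token string based on sensitivity.

    Assumption: lower sensitivity keeps fewer characters per token.
    """
    keep = (sensitivity - 50) // 5 + 2
    return "".join(token.upper()[:keep] for token in tokens)


def encode(token_string: str) -> str:
    """Step 4: turn the token string into an opaque code (assumed encoding)."""
    digest = hashlib.sha256(token_string.encode("utf-8")).hexdigest()
    return digest[:12].upper()


def generate_match_code(text: str, sensitivity: int = 85) -> str:
    """Step 1 receives the input string, then runs the remaining steps."""
    if not 50 <= sensitivity <= 100:
        raise ValueError("sensitivity must be between 50 and 100")
    tokens = parse_tokens(text)
    token_string = build_token_string(tokens, sensitivity)
    return encode(token_string)


code_a = generate_match_code("Washington DC", 85)
code_b = generate_match_code("Washington D.C.", 85)
```

Both inputs reduce to the same token string before encoding, so `code_a` and `code_b` are equal in this sketch.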

Using Customize, you can create or edit match definitions to fine-tune your match codes for your applications. One way to customize match definitions is to implement suggestion-based matching, which helps you detect and correct errors in spelling and typography. You can also use Customize to implement token-based matching, which helps you detect and correct tokens that are missing or out of position.

Input: a string, or preparsed input, and a sensitivity level

Example:

"John James McDonald"

Output: one or more match codes (an encoded string of characters)

Nodes

Hierarchy	Node/Group	Container Group	Count
1	Match Definition Head Node		1
2	Preprocessing Regex Libraries Group		1
2.1	Preprocessing Regex Library Node	Preprocessing Regex Libraries Group	0 or more
3	Preprocessing Schemes Group		1
3.1	(Preprocessing) Transformation Scheme Node	Preprocessing Schemes Group	0 or more
4	Tokenization Node		1
5	Match Morph Analysis Token Group		1 or more (* and **)
5.1	Morph Analysis Group	Match Morph Analysis Token Group	1 (*)
5.1.1	Lookup Group	Morph Analysis Group	1 (*)
5.1.1.1	Uppercasing Node	Lookup Group	1 (*)
5.1.1.2	Normalization Regex Library Group	Lookup Group	1 (*)
5.1.1.2.1	Regex Library Node	Normalization Regex Library Group	0 or more (*)
5.1.2	Vocabularies Group	Lookup Group	1 (*)
5.1.2.1	Vocabulary Node	Vocabularies Group	0 or more (*)
5.1.3	Categorization Regex Libraries Group	Morph Analysis Group	1 (*)
5.1.3.1	Categorization Regex Library Node	Categorization Regex Libraries Group	0 or more (*)
5.1.4	Number Check Node	Morph Analysis Group	1 (*)
5.1.5	Default Categories Node	Morph Analysis Group	1 (*)
6	Token Combination Rules Group		1 (*)
6.1	Token Combination Rule Node	Token Combination Rules Group	0 or more (*)
7	Match Token Group		1 or more (* and **)
7.1	Suggestions Group	Match Token Group	1
7.1.1	Suggestions Node	Suggestions Group	0 or 1
7.2	Match Normalization Group	Match Token Group	1
7.2.1	Uppercasing Node	Match Normalization Group	1
7.2.2	Pre-Scheme Regex Libraries Group	Match Normalization Group	1
7.2.2.1	Regex Library Node	Pre-Scheme Regex Libraries Group	0 or more
7.3	Noise Word Removal Group	Match Token Group	1
7.3.1	Noise Word Vocabulary Node	Noise Word Removal Group	0 or more
7.4	Table-Based Transformations Group	Match Token Group	1
7.4.1	Transformation Scheme Node	Table-Based Transformations Group	0 or more
7.5	Post-Scheme Regex Libraries Group	Match Token Group	1
7.5.1	Regex Library Node	Post-Scheme Regex Libraries Group	0 or more
7.6	Phonetics Group	Match Token Group	1
7.6.1	Phonetics Library Node	Phonetics Group	0 or more
8	Matchcode Layout Node		1
9	Match Score Threshold Node		1
10	Match Encoding Node		1

(*) not displayed when there is no Tokenization definition

(**) up to maximum number of tokens in data type