N-Gram Scheme Node

DataFlux Data Management Studio 2.7: User Guide

The N-Gram Scheme Node holds training data known to be in the language of interest. The data is stored as N-Grams: short segments of the input text produced by sliding a window of N characters over the text, one character at a time. N is the number of characters in each segment (for example, 2, 3, or 4).

For example, the N-Grams of size 3 in the string "Hello Bob" are:

Hel

ell

llo

lo[space]

o[space]B

[space]Bo

Bob

If bookends are used, there are two additional 3-Grams:

[bookend]He

ob[bookend]

The bookends represent the beginnings and ends of lines.
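The sliding-window extraction shown in the example above can be sketched in Python. This is an illustration only, not the product's implementation: the function name `ngrams`, the `bookend` parameter, and the choice of "^" as a marker character are assumptions made for this sketch, and adding exactly one bookend at each end of the line is inferred from the 3-Gram example.

```python
def ngrams(text, n=3, bookend=None):
    """Return the N-Grams of `text` as a list of strings.

    A window of size n slides over the text one character at a time.
    `bookend` is a hypothetical single marker character; when given,
    one copy is added at the start and one at the end of the line.
    """
    if bookend is not None:
        text = bookend + text + bookend
    # A text shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


plain = ngrams("Hello Bob", 3)
with_bookends = ngrams("Hello Bob", 3, bookend="^")
```

Here `plain` holds the seven 3-Grams listed above, and `with_bookends` holds those seven plus the two bookend 3-Grams, one at each end.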

Used in:

Language Guess Definitions

Properties

Scheme

Select an N-Gram Scheme to use. By default, only files from the locale and its ancestors appear in the drop-down list. If you do not see the desired scheme, click Tools > Options, click Display, and then select Show files for all locales under Library file selection drop-down lists to view QKB files from all locales.

Be sure to select an N-Gram Scheme whose N-Grams have the same size as the value of the Window Size property in the Language Guess Definition Head Node. For more information on Window Sizes, see Using N-Grams in your Language Guess Definition.

Output

Individual N-Gram Scheme Nodes have no output, because all the schemes are combined to produce the output.

Usage Note

An N-Gram scheme used within a Language Guess Definition always performs a case-insensitive match and always collapses multiple white spaces into a single white space. The scheme options that are accessible from the Options button in the lower-right corner of the Scheme Builder are not used in N-Gram processing.
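The matching behavior described in this usage note can be approximated by normalizing text before it is compared. The following Python sketch is an assumption-laden illustration, not the product's code: the function name `normalize` is hypothetical, and the source does not specify which characters count as white space or how case is folded, so the sketch treats any run of Unicode white space as one space and uses simple lowercasing.

```python
import re


def normalize(text):
    """Approximate the fixed N-Gram matching rules.

    Assumptions for this sketch: any run of white space becomes a
    single space, and case is ignored by lowercasing.
    """
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.lower()


same = normalize("Hello   Bob") == normalize("hello bob")
```

Under these assumptions `same` is true, so the two strings would yield the same N-Grams regardless of the options set in the Scheme Builder.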