DataFlux Data Management Studio 2.6: User Guide
The N-Gram Scheme Node holds training data known to be in the language of interest. The data is stored as N-Grams: short segments of the input text produced by sliding a window of N characters over the text, one character at a time. N is the number of characters in each segment (for example, 2, 3, or 4).
For example, the N-Grams of size 3 in the string "Hello Bob" are:
- Hel
- ell
- llo
- lo[space]
- o[space]B
- [space]Bo
- Bob
If bookends are used, there are two additional 3-Grams:
- [bookend]He
- ob[bookend]
The bookends represent the beginnings and ends of lines.
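The sliding-window process described above can be sketched in a few lines of Python. This is only an illustration, not how Data Management Studio generates or stores N-Grams. The function name `ngrams` and the `|` bookend marker are assumptions made for this sketch, and the sketch assumes that a bookend occupies one character position in the window.

```python
def ngrams(text, n=3, bookends=False, bookend="|"):
    """Return the N-Grams produced by sliding a window of n characters over text.

    The bookend marker is a hypothetical single-character placeholder for the
    beginning and end of a line; it is not the product's internal representation.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if bookends:
        # Mark the beginning and end of the line with one marker each.
        text = bookend + text + bookend
    # Move the window one character at a time; a text shorter than n yields no N-Grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(ngrams("Hello Bob"))
    print(ngrams("Hello Bob", bookends=True))
```

Without bookends, the call produces the seven 3-Grams listed above. With bookends, it produces nine, adding one at the start of the line and one at the end.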
Used in:
Select an N-Gram Scheme to use. By default, only files from the locale and its ancestors appear in the drop-down. If you do not see the desired library, click Tools > Options, click Display, and then select Show files for all locales under the Library file selection drop-down lists. This displays QKB files from all locales.
Be sure to select an N-Gram Scheme that contains N-Grams that have the same Window Size as the value of the Window Size property in the Language Guess Definition Head Node. For more information on Window Sizes, see Using N-Grams in your Language Guess Definition.
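The Window Size requirement amounts to a simple length check, sketched below. The sketch assumes a scheme can be represented as a plain collection of strings. The function name `mismatched_ngrams` and that representation are hypothetical and do not reflect the QKB file format or any Data Management Studio interface.

```python
def mismatched_ngrams(scheme_ngrams, window_size):
    """Return, in sorted order, the N-Grams whose length differs from window_size.

    scheme_ngrams is a hypothetical in-memory stand-in for a scheme's contents.
    """
    return sorted(g for g in scheme_ngrams if len(g) != window_size)


if __name__ == "__main__":
    scheme = {"Hel", "ell", "llo", "lo ", "o B", " Bo", "Bob"}
    # An empty result means every N-Gram matches the Window Size.
    print(mismatched_ngrams(scheme, 3))
    print(mismatched_ngrams(scheme | {"Hell"}, 3))
```

An empty result indicates that the scheme and the Window Size agree. Any N-Gram that is returned was built with a different window and would not match segments cut from the input at the configured size.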
Individual N-Gram Scheme Nodes have no output, because all the schemes are combined to produce the output.