DataFlux Data Management Studio 2.6: User Guide

Using N-Grams in Language Guess Definitions

Overview

If you are using Locale Guess definitions to determine which QKB locale settings should be used when processing your data, you might want to use a Language Guess definition with n-gram analysis to improve the accuracy of your locale guess results.

A Language Guess definition is a QKB definition that analyzes an input string and computes a score indicating the confidence that the string is rendered in the language of the current QKB locale. If you embed a Language Guess definition in your Locale Guess definition, your Locale Guess definition will use the score returned by the Language Guess definition as a factor when determining the confidence that the input string belongs to the current QKB locale.

The next display shows a Locale Guess definition, Address, that contains a Language Guess definition, LangGuessDef_01.


Locale Guess Definition That Includes a Language Guess Definition

A Language Guess definition uses one or more n-gram schemes to analyze an input string. For example, if you were to click Open language definition in the right panel of the previous display, the editing window for the LangGuessDef_01 definition would open. The next display shows the editing window for the LangGuessDef_01 definition. This definition includes the n-gram scheme, NGram_Address_USA.


Language Guess Definition with an N-Gram Scheme

An n-gram scheme is a QKB scheme that contains patterns called n-grams that are derived from a body of text that is known to be in the language of the current QKB locale. The n-grams in the scheme are generated from a text file that is imported with the Import N-Grams dialog in the Scheme Builder.

For example, if you were to click Open map in the right panel of the previous display, the Scheme Builder would open and display the contents of the n-gram scheme, NGram_Address_USA.


N-Gram Scheme in the Scheme Builder

At run-time, the Language Guess definition generates a set of n-grams when it analyzes an input string. Each n-gram is then looked up in the n-gram scheme. If the n-gram exists in the scheme, the Language Guess definition adds points to the confidence score for the input string. This means that the confidence that the string belongs to the current QKB locale’s language is increased.
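
The run-time lookup described above can be sketched in a few lines of Python. This is an illustration only, not the QKB implementation: the function names, the one-point-per-match rule, and the sample scheme contents are all assumptions, and the guide does not state how the product weights matches or handles case and whitespace.

```python
def char_ngrams(text, window_size):
    """Return every contiguous substring of length window_size."""
    return [text[i:i + window_size] for i in range(len(text) - window_size + 1)]


def language_guess_score(text, scheme, window_size, points_per_match=1):
    """Add points for each n-gram of the input that exists in the scheme.

    Assumption: a flat number of points per match. The real weighting
    used by a Language Guess definition is not described in this guide.
    """
    score = 0
    for gram in char_ngrams(text, window_size):
        if gram in scheme:
            score += points_per_match
    return score


# Hypothetical scheme contents; a real scheme is built by importing a text file.
scheme = {"MAI", "AIN", "IN ", "N S", " ST"}

english_score = language_guess_score("12 MAIN ST", scheme, window_size=3)
french_score = language_guess_score("12 RUE DE", scheme, window_size=3)
print(english_score, french_score)
```

In this sketch, a string whose n-grams appear in the scheme scores higher than one whose n-grams do not, which mirrors the increase in confidence described above.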

Create an N-Gram Scheme

To create an n-gram scheme, use the Import N-Grams dialog to import an appropriate text file, as shown in the next display.


Import N-Grams Dialog in the Scheme Builder

The text file must be in the language of the current locale. You will get the best results if the file contains the same type of data that will be analyzed by the Language Guess definition. In this example, assume that you want to create an n-gram scheme that will be used to analyze address data in the English (United States) locale. The following steps describe one way to create such a scheme.

  1. Select Tools > Other QKB Editors > Scheme Builder from the main menu in DataFlux Data Management Studio. The Scheme Builder displays. You are prompted to select a locale.
  2. Select a locale that matches the data that you want to analyze. For this example, you would select the English (United States) locale.
  3. Select File > Import N-Grams in the Scheme Builder. The Import N-Grams dialog displays.
  4. Select an appropriate text file in the Import file name field. For this example, you might select a file that contained address data from the English (United States) locale.
  5. Specify the Encoding of the data or accept the default.
  6. Specify the Input Format for the data. Select Free Text if each record in the file is not limited to one line. Select Line by Line if each record in the file is limited to one line.
  7. Specify Window Sizes. A window size determines the length of an n-gram. For example, if you select a window size of 3, the Scheme Builder will generate three-character n-grams, or 3-grams, from the text in the input file.

    If you select more than one window size, the Scheme Builder will generate n-grams for each selected window size. While you can use only one window size at a time in your Language Guess definition, it might be useful to generate an n-gram scheme with multiple window sizes so that you can experiment with different window sizes in your Language Guess definition without needing to import the same text file multiple times.

    Experiment with different window sizes before deciding which one is best for your Language Guess definition. In principle, a larger window size (such as 3 or 4) might produce fewer false positives, while a smaller window size (such as 2 or 3) will produce more false positives but fewer false negatives.
  8. In the Retain field, specify whether you want to retain all generated n-grams in your scheme or set a limit on the number to retain. If you want to retain all n-grams, select All. If you want to retain only a certain number of n-grams, enter the number in the edit field. The Scheme Builder will then retain only the most frequently occurring n-grams, up to the number of n-grams that you have specified. This option might be useful if you are importing a very large text file and you want to limit the size of your scheme.
  9. Review your selections. When ready, click OK. The Scheme Builder will scan the import text file and generate n-grams from the text that is stored in the file. The n-grams will appear in the Data column in the Scheme Builder. The number of times each n-gram appeared in the import text file (its frequency) will appear in the Standard column.
  10. After importing, save the scheme file.
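
To make the effect of the Input Format, Window Sizes, and Retain options more concrete, the following Python sketch imitates what the import step produces: a list of n-grams paired with their frequencies. It is a rough approximation under stated assumptions, not the Scheme Builder's actual algorithm. The function name and parameters are hypothetical, Free Text is assumed to fold line breaks into single spaces, and the retain limit is assumed to apply across all window sizes together.

```python
from collections import Counter


def import_ngrams(text, window_sizes, line_by_line=False, retain=None):
    """Count character n-grams for each window size.

    Returns (ngram, frequency) pairs, most frequent first. If retain is
    a number, only that many of the most frequent n-grams are kept.
    """
    if line_by_line:
        # Line by Line: each line is its own record, so no n-gram spans two lines.
        records = text.splitlines()
    else:
        # Free Text (assumption): treat the file as one record, folding
        # line breaks and repeated whitespace into single spaces.
        records = [" ".join(text.split())]

    counts = Counter()
    for record in records:
        for n in window_sizes:
            for i in range(len(record) - n + 1):
                counts[record[i:i + n]] += 1

    # most_common(None) returns every n-gram, ordered by descending frequency.
    return counts.most_common(retain)


sample = "MAIN ST\nMAIN AVE"

by_line = dict(import_ngrams(sample, [3], line_by_line=True))
free_text = dict(import_ngrams(sample, [3], line_by_line=False))
top_two = import_ngrams(sample, [3], line_by_line=True, retain=2)
two_sizes = dict(import_ngrams(sample, [2, 3], line_by_line=True))
```

Here `by_line` pairs each n-gram with its frequency, much like the Data and Standard columns. `free_text` also contains n-grams that cross the line break, such as `"T M"`, and `two_sizes` holds both 2-grams and 3-grams in a single result.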

Use an N-Gram Scheme

After an n-gram scheme has been saved, you can select the scheme in an n-gram scheme node in your Language Guess definition. In your Language Guess definition, be sure to select a window size that was used when you imported your n-grams. When analyzing strings, the Language Guess definition will use only n-grams of the specified window size, regardless of what window sizes were used to generate the n-grams contained in the n-gram scheme.
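
The following Python sketch shows why the window size you select must match one used at import time. The scheme contents and the helper name are hypothetical. The sketch assumes that entries of other lengths are simply ignored during lookup, which is consistent with the behavior described above but is not a statement about the product's internals.

```python
def ngrams_for_window(scheme, window_size):
    """Keep only the scheme entries whose length equals the selected window size."""
    return {gram: freq for gram, freq in scheme.items() if len(gram) == window_size}


# Hypothetical scheme that was imported with window sizes 2 and 3.
scheme = {"MA": 4, "AI": 3, "MAI": 2, "AIN": 2}

active = ngrams_for_window(scheme, 3)   # only the 3-grams take part in lookups
unused = ngrams_for_window(scheme, 4)   # size 4 was never imported, so nothing can match
print(active, unused)
```

With a window size that was not used during import, no entries remain to match against, so no input string could gain points from the scheme.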
