DataFlux Data Management Studio 2.6: User Guide
If you are using Locale Guess definitions to determine which QKB locale settings should be used when processing your data, you might want to use a Language Guess definition with n-gram analysis to improve the accuracy of your locale guess results.
A Language Guess definition is a QKB definition that analyzes an input string and computes a score indicating the confidence that the string is rendered in the language of the current QKB locale. If you embed a Language Guess definition in your Locale Guess definition, your Locale Guess definition will use the score returned by the Language Guess definition as a factor when determining the confidence that the input string belongs to the current QKB locale.
The next display shows a Locale Guess definition, Address, that contains a Language Guess definition, LangGuessDef_01.
Locale Guess Definition That Includes a Language Guess Definition
A Language Guess definition uses one or more n-gram schemes to analyze an input string. For example, if you were to click Open language definition in the right panel of the previous display, the editing window for the LangGuessDef_01 definition would open. The next display shows the editing window for the LangGuessDef_01 definition. This definition includes the n-gram scheme, NGram_Address_USA.
Language Guess Definition with an N-Gram Scheme
An n-gram scheme is a QKB scheme that contains patterns called n-grams that are derived from a body of text that is known to be in the language of the current QKB locale. The n-grams in the scheme are generated from a text file that is imported with the Import N-Grams dialog in the Scheme Builder.
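As a rough illustration of what deriving n-grams from a body of text can look like, here is a minimal Python sketch. It is not the Scheme Builder's actual import logic. It assumes character-level n-grams taken with a sliding window over uppercased text, and the names `extract_ngrams` and `build_ngram_scheme` are hypothetical.

```python
from collections import Counter


def extract_ngrams(text, window_size):
    """Return every character n-gram of length window_size (sliding window).

    Assumption: text is uppercased first; the real product may normalize differently.
    """
    normalized = text.upper()
    return [
        normalized[i:i + window_size]
        for i in range(len(normalized) - window_size + 1)
    ]


def build_ngram_scheme(lines, window_sizes=(2, 3)):
    """Collect n-grams of each requested window size from the sample lines.

    The result maps each n-gram to the number of times it was seen.
    """
    scheme = Counter()
    for line in lines:
        for size in window_sizes:
            scheme.update(extract_ngrams(line, size))
    return scheme


# Hypothetical sample of address data in the language of the locale.
sample_lines = ["100 MAIN ST", "42 OAK AVE"]
scheme = build_ngram_scheme(sample_lines)
```

In this sketch a single scheme can hold n-grams of several window sizes at once, which is why the window size matters again when the scheme is used.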
For example, if you were to click Open map in the right panel of the previous display, the Scheme Builder would open and display the contents of the n-gram scheme, NGram_Address_USA.
N-Gram Scheme in the Scheme Builder
At run time, the Language Guess definition generates a set of n-grams when it analyzes an input string. Each n-gram is then looked up in the n-gram scheme. If the n-gram exists in the scheme, the Language Guess definition adds points to the confidence score for the input string. In other words, each match increases the confidence that the string is in the language of the current QKB locale.
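The lookup-and-score behavior described above can be sketched as follows. This is only an illustration under stated assumptions: character-level n-grams, a fixed number of points per match, and a scheme held as a Python set. The function names and the `points_per_hit` parameter are hypothetical and do not correspond to any QKB setting.

```python
def extract_ngrams(text, window_size):
    """Return every character n-gram of length window_size (sliding window)."""
    normalized = text.upper()  # assumption: case-insensitive matching
    return [
        normalized[i:i + window_size]
        for i in range(len(normalized) - window_size + 1)
    ]


def score_string(text, scheme, window_size=3, points_per_hit=1):
    """Add points for each n-gram of the input that exists in the scheme."""
    score = 0
    for gram in extract_ngrams(text, window_size):
        if gram in scheme:
            score += points_per_hit
    return score


# Hypothetical scheme containing a few trigrams typical of US address data.
scheme = {"MAI", "AIN", " ST", "ST "}

english_score = score_string("12 Main St", scheme)
french_score = score_string("12 Rue du Parc", scheme)
```

With this toy scheme, the English-style address collects points for three of its trigrams, while the French-style address matches none, so its score stays at zero.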
To create an n-gram scheme, use the Import N-Grams dialog to import an appropriate text file, as shown in the next display.
Import N-Grams Dialog in the Scheme Builder
The text file must be in the language of the current locale. You will get the best results if the file contains the same type of data that will be analyzed by the Language Guess definition. In this example, assume that you want to create an n-gram scheme that will be used to analyze address data in the English (United States) locale. The following steps describe one way to create such a scheme.
After an n-gram scheme has been saved, you can select the scheme in an n-gram scheme node in your Language Guess definition. In your Language Guess definition, be sure to select a window size that was used when you imported your n-grams. When analyzing strings, the Language Guess definition uses only n-grams of the specified window size, regardless of which window sizes were used to generate the n-grams contained in the n-gram scheme.
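The effect of the window size can be illustrated with a small Python sketch. It assumes character-level n-grams and a scheme that was imported with window sizes 2 and 3. The helper names are hypothetical, and the product's real matching rules may differ.

```python
def extract_ngrams(text, window_size):
    """Return every character n-gram of length window_size (sliding window)."""
    normalized = text.upper()
    return [
        normalized[i:i + window_size]
        for i in range(len(normalized) - window_size + 1)
    ]


def matching_ngrams(text, scheme, window_size):
    """Return the input's n-grams of the selected size that exist in the scheme."""
    return [g for g in extract_ngrams(text, window_size) if g in scheme]


# Hypothetical scheme imported with window sizes 2 and 3.
scheme = {"MA", "AI", "IN", "MAI", "AIN"}

hits_size_3 = matching_ngrams("MAIN", scheme, 3)  # only trigrams are considered
hits_size_2 = matching_ngrams("MAIN", scheme, 2)  # only bigrams are considered
hits_size_4 = matching_ngrams("MAIN", scheme, 4)  # size 4 was never imported
```

Under these assumptions, selecting window size 3 ignores the bigrams stored in the scheme, and selecting a size that was not used during the import finds no matches at all.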