DataFlux Data Management Studio 2.7: User Guide
The Vocabulary Editor allows you to build a vocabulary. When the parsing system needs to categorize a word, it can then easily search a single Vocabulary rather than multiple text files. We recommend you build one Vocabulary per parse definition. Before using the Customize Vocabulary Editor, you should define all of your basic categories and implement them in a Grammar, create your parse definitions, and create a text file for each basic category.
Each word in the Vocabulary is defined as belonging to one or more categories, which are defined in an associated Grammar. Each word is also assigned a likelihood, which is a score indicating a level of confidence (i.e. VERY HIGH, HIGH, MEDIUM, LOW, VERY LOW) that a word belongs to a certain category. For example, the next display shows the likelihood associated with each category for the word "Kim," a word in the EN Gender Analysis Vocabulary.
In the example above, the word "Kim" has two categories: FGW (Female Given Name Word) and MGW (Male Given Name Word). Both of these categories have a likelihood of MEDIUM. In some contexts, both the category and likelihood values are used to make a determination about a word, such as the gender associated with the word. In other contexts, only one of these values might be used to make a determination.
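The relationship between words, categories, and likelihoods can be pictured as a small lookup table. The following is only an illustrative sketch in Python, not the Vocabulary's actual storage format (which is proprietary). The `Likelihood` enumeration, its numeric ranking, and the `most_likely_categories` helper are assumptions introduced for this example.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """The five likelihood levels; the numeric ranking is an assumption."""
    VERY_LOW = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    VERY_HIGH = 5


# Hypothetical in-memory model: word -> {category: likelihood}.
vocabulary = {
    "Kim": {"FGW": Likelihood.MEDIUM, "MGW": Likelihood.MEDIUM},
}


def most_likely_categories(vocab, word):
    """Return the categories that share the highest likelihood for a word."""
    entries = vocab.get(word, {})
    if not entries:
        return []
    best = max(entries.values())
    return sorted(cat for cat, lik in entries.items() if lik == best)


print(most_likely_categories(vocabulary, "Kim"))
```

Because both categories for "Kim" carry the same likelihood, the helper returns both, which mirrors the ambiguity described above: the likelihood alone cannot settle the gender for this word.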
Within the Vocabulary Editor, you must specify which input text sources you want to combine to develop a Vocabulary, and indicate which category's data each library represents. Note that a Vocabulary is stored in a proprietary format. To help ensure Vocabulary integrity, you should not attempt to create or edit a Vocabulary directly, but rather through the Vocabulary Editor.
You can use the Vocabulary Editor to build Vocabularies. We recommend that one Vocabulary be created for each parse definition. The following steps assume that you have registered a QKB as described in Registering (Adding) a QKB.
Perform these steps to add a new vocabulary file:
You can import categories and likelihoods so that you can associate those values with the words that you add to your vocabulary. Each word must be associated with one or more categories from the Grammar.
After you import categories, you can import the words to create a vocabulary that fits those categories.
Before you import a vocabulary file, review its contents. Make sure that the file meets the criteria described in this section.
You can import vocabularies in the following formats:
For the Delimited text file format, the file to be imported must have the following format:
<word><delimiter><category><delimiter><likelihood>
The following example shows a delimited text file that uses commas as the delimiter and contains a header row.
Delimited Text File in CSV Format
The delimiter character is specified as part of the import action.
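When reviewing a file before import, it can help to read it outside the product and confirm that every row has the three expected fields. The following is a minimal sketch using Python's standard `csv` module. The sample rows, the header names, and the `read_vocabulary_rows` function are hypothetical and are not part of Data Management Studio.

```python
import csv
import io

# Hypothetical sample data; the header names and rows are illustrative only.
SAMPLE = """word,category,likelihood
Kim,FGW,Medium
Kim,MGW,Medium
Scott,GNW,High
"""


def read_vocabulary_rows(text, delimiter=",", rows_to_skip=1):
    """Parse <word><delimiter><category><delimiter><likelihood> rows.

    rows_to_skip drops leading rows such as a header row.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for index, row in enumerate(reader):
        if index < rows_to_skip or not row:
            continue  # skip header rows and blank lines
        if len(row) != 3:
            raise ValueError(f"row {index + 1}: expected 3 fields, got {len(row)}")
        word, category, likelihood = row
        rows.append((word, category, likelihood))
    return rows


rows = read_vocabulary_rows(SAMPLE)
for row in rows:
    print(row)
```

To check a real file, pass its contents as `text` and set `delimiter` and `rows_to_skip` to the same values you plan to use in the import action.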
If the chosen delimiter character occurs as part of a word, category, or likelihood value, you must encapsulate the word, category, and likelihood values with a text qualifier in the import action. For example, if the delimiter is a space value and the word, category, and likelihood values contain space values, you might use a double-quotation mark to demarcate the values as follows:
"This is the word" "This is the category" "Very High"
If the text qualifier itself occurs within a word or category value, it must be escaped by doubling it. For example:
"This is an escaped ""quotation"" in a word" "Quoted Words category" "Very Low"
The valid likelihood values are: Very Low, Low, Medium, High, and Very High. All other values are considered an error and will cause the import to fail.
All spaces before and after the delimiter will be removed during the import.
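The rules above (text qualifier, doubled qualifier as an escape, the five valid likelihood values, and space trimming) can be checked before you run the import. The following is a rough pre-import check in Python, not the product's import logic. The function name `check_vocabulary_text` and the third sample row are hypothetical, and the exact-spelling comparison of likelihood values is an assumption, since this section does not state how the import treats letter case.

```python
import csv
import io

# Valid likelihood values as listed in this section (exact spelling assumed).
VALID_LIKELIHOODS = {"Very Low", "Low", "Medium", "High", "Very High"}


def check_vocabulary_text(text, delimiter=" ", qualifier='"'):
    """Return (cleaned_rows, problems) for delimited vocabulary text."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=qualifier,
        doublequote=True,  # a doubled qualifier stands for one literal qualifier
    )
    cleaned, problems = [], []
    for number, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            problems.append((number, f"expected 3 fields, got {len(row)}"))
            continue
        # Simplification: trims spaces around every value, qualified or not.
        word, category, likelihood = (value.strip(" ") for value in row)
        if likelihood not in VALID_LIKELIHOODS:
            problems.append((number, f"invalid likelihood {likelihood!r}"))
            continue
        cleaned.append((word, category, likelihood))
    return cleaned, problems


SAMPLE = (
    '"This is the word" "This is the category" "Very High"\n'
    '"This is an escaped ""quotation"" in a word" "Quoted Words category" "Very Low"\n'
    '"Scott" "GNW" "Certain"\n'  # hypothetical row with an invalid likelihood
)
cleaned, problems = check_vocabulary_text(SAMPLE)
print(cleaned)
print(problems)
```

Because an invalid likelihood causes the whole import to fail, a non-empty `problems` list is a signal to correct the file before importing it.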
The following describes the parameters on the Import Words dialog.
Note: Some parameters are enabled or disabled depending on choices made on other parameters.
Type - Specify the input file type.
Set QKB to Import From - Specify which Quality Knowledge Base to import the Vocabulary or Scheme from if it is not the current QKB.
File name - Specify the name of the input file to import. Tip: Select the proper file extension filter in the file chooser dialog to find the desired input file (for example, *.csv for comma-separated files or *.tab for tab-delimited files).
Encoding - Specify the encoding of the input file.
Delimiter - Specify the delimiter used in the delimited text file.
Text qualifier - Specify the text qualifier used in the delimited text file.
Rows to skip - Specify the number of rows to skip at the beginning of the delimited text file.
Filter Word List - Specify a category filter so that only words that have a category which matches the filter will be imported.
Use categories from imported vocabulary words - Specify whether the categories of an imported vocabulary word should be added when that word already exists in the vocabulary being imported into.
In case of likelihood conflict during merge - When the imported word already exists in the vocabulary being imported into, and if both words share categories, and if the likelihood values differ, you need to decide how to resolve the likelihood conflict. If this situation is encountered during the import, use this option to specify whether the existing likelihood should be preserved, whether the likelihood from the imported word should be used, or whether a prompt should be displayed to reconcile the conflict.
Categories to add - Specify the categories and likelihoods to assign to every word imported. For each category that you select, you can select Overwrite likelihood. Choosing this option adds your specified likelihood in place of a different value that may already exist for that word and category in the vocabulary being imported into.
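The interaction between the "Use categories from imported vocabulary words" and "In case of likelihood conflict during merge" options can be easier to follow as code. The sketch below models one possible reading of those descriptions in Python. It is not the product's implementation, and the function `merge_word`, the policy names `keep_existing`, `use_imported`, and `prompt`, and the `ask` callback are all hypothetical.

```python
def merge_word(existing, imported, use_imported_categories=True,
               on_conflict="keep_existing", ask=None):
    """Merge the categories of an imported word into an existing word.

    existing, imported: dicts mapping category -> likelihood.
    on_conflict: "keep_existing", "use_imported", or "prompt" (hypothetical names).
    ask: callback(category, existing_value, imported_value) used for "prompt".
    """
    if on_conflict not in ("keep_existing", "use_imported", "prompt"):
        raise ValueError(f"unknown conflict policy: {on_conflict!r}")
    merged = dict(existing)
    for category, likelihood in imported.items():
        if category not in merged:
            # New category for an existing word: add it only if requested.
            if use_imported_categories:
                merged[category] = likelihood
        elif merged[category] != likelihood:
            # Shared category with differing likelihoods: apply the policy.
            if on_conflict == "use_imported":
                merged[category] = likelihood
            elif on_conflict == "prompt":
                merged[category] = ask(category, merged[category], likelihood)
            # "keep_existing" leaves the current value untouched.
    return merged


existing = {"GNW": "Medium", "FNW": "High"}
imported = {"GNW": "High", "MGW": "Low"}
print(merge_word(existing, imported))
print(merge_word(existing, imported, on_conflict="use_imported"))
```

With the default policy the existing GNW likelihood is preserved and only the new MGW category is added. With `use_imported` the GNW likelihood changes to the imported value.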
When the import is complete, a status message about the import displays. Click OK to dismiss the status message.
Now that you are looking at the imported words and categories in your new Vocabulary, you can add or delete individual categories, or change likelihood values.
Here is an example of when you might want to update a likelihood value. If your vocabulary contains the name Scott, then that word might have the categories Family Name Word (FNW) and Given Name Word (GNW). You might determine that Scott is more likely to be a Given Name Word than a Family Name Word. You could then increase the likelihood value of the Given Name category for the word Scott.
To change a likelihood value, select the word and click Edit.
Although there may be some adjustments that you want to make to the likelihoods at this point, later testing with the Parse Test Tool will probably reveal other necessary adjustments to give the desired result.
Now that your Vocabulary is built, you need to save it. Select File > Save. If this is a newly built Vocabulary, the Vocabulary Editor will prompt you for a name.
Other than altering the likelihood for specific words in a Vocabulary, we recommend you not make many other modifications. However, certain situations may warrant it, so the Vocabulary Editor does allow these operations. This may provide a good way to temporarily make changes for testing purposes.
Note: The Vocabulary Editor will alert you if you try to add a word that already exists in the Vocabulary.