DataFlux Data Management Studio 2.7: User Guide

Vocabulary Editor

The Vocabulary Editor allows you to build a Vocabulary. When the parsing system needs to categorize a word, it can then search a single Vocabulary rather than multiple text files. We recommend that you build one Vocabulary per parse definition. Before using the Vocabulary Editor, you should define all of your basic categories and implement them in a Grammar, create your parse definitions, and create a text file for each basic category.

Each word in the Vocabulary is defined as belonging to one or more categories, which are defined in an associated Grammar. Each word is also assigned a likelihood, which is a score indicating a level of confidence (i.e. VERY HIGH, HIGH, MEDIUM, LOW, VERY LOW) that a word belongs to a certain category. For example, the next display shows the likelihood associated with each category for the word "Kim," a word in the EN Gender Analysis Vocabulary.

In the example above, the word "Kim" has two categories: FGW (Female Given Name Word) and MGW (Male Given Name Word). Both of these categories have a likelihood of MEDIUM. In some contexts, both the category and likelihood values are used to make a determination about a word, such as the gender associated with the word. In other contexts, only one of these values might be used to make a determination.
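
The relationship between words, categories, and likelihoods can be pictured with a small data structure. The following Python sketch is purely illustrative: a Vocabulary is stored in a proprietary format, so the dictionary layout, the Likelihood enumeration, and the most_likely_categories helper are all assumptions made for this example and are not part of the product.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """The five likelihood levels named in this guide, ordered low to high."""
    VERY_LOW = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    VERY_HIGH = 5


# Hypothetical in-memory model: word -> {category abbreviation -> likelihood}.
# The real Vocabulary file format is proprietary and is not shown here.
vocabulary = {
    "Kim": {"FGW": Likelihood.MEDIUM, "MGW": Likelihood.MEDIUM},
}


def most_likely_categories(vocab, word):
    """Return the categories that share the highest likelihood for a word."""
    categories = vocab.get(word, {})
    if not categories:
        return []
    best = max(categories.values())
    return sorted(cat for cat, level in categories.items() if level == best)


print(most_likely_categories(vocabulary, "Kim"))  # ['FGW', 'MGW']
```

Because both categories for "Kim" carry the same likelihood, the helper returns both, which mirrors the ambiguity described above.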

Within the Vocabulary Editor, you must specify which input text sources you want to combine to develop a Vocabulary, and indicate which category's data each source represents. Note that a Vocabulary is stored in a proprietary format. To help ensure Vocabulary integrity, do not attempt to create or edit a Vocabulary directly; use the Vocabulary Editor instead.

Building Vocabularies

Create a New Vocabulary File

You can use the Vocabulary Editor to build Vocabularies. We recommend that one Vocabulary be created for each parse definition. The following steps assume that you have registered a QKB as described in Registering (Adding) a QKB.

Perform these steps to add a new vocabulary file:

  1. Select Tools > Other QKB Editors > Vocabulary Editor from the main Data Management Studio window. The Vocabulary Editor displays, and you are prompted to specify a locale.
  2. Specify a locale in the Select Locale(s) dialog and click OK. Typically, this locale would be one with vocabularies that you want to view or maintain in the Vocabulary Editor.
  3. In the Vocabulary Editor, select File > New. You are prompted to specify a locale where the new vocabulary file should be stored.
  4. Specify a locale in the Select Locale(s) dialog and click OK. A new, empty vocabulary file opens in the Vocabulary Editor.

Import Categories from a Grammar

You can import categories and likelihoods so that you can associate those values with the words that you add to your vocabulary. Each word must be associated with one or more categories from the Grammar.

  1. To import categories into the Vocabulary Editor, choose Options > Categories. The Categories dialog appears.
  2. Click Import to display the Select Grammar dialog.
  3. Select the Grammar that you want to associate with your new Vocabulary, and then click OK.
  4. On the Categories screen, the Grammar's standard category abbreviations and descriptions appear. (Derived categories are not imported.)
  5. Use the Delete button to delete any unwanted categories, and then click Close.

Import Words Into the Vocabulary

After you import categories, you can import the words to create a vocabulary that fits those categories.

Planning Your Import

Before you import a vocabulary file, review its contents. Make sure that the file meets the criteria described in this section.

You can import vocabularies in the following formats: 

For the Delimited text file format, the file to be imported must have the following format: 

<word><delimiter><category><delimiter><likelihood>

For example, the following delimited text file uses commas as the delimiter and contains a header row.


Delimited Text File in CSV Format

The delimiter character is specified as part of the import action.

If the chosen delimiter character occurs as part of a word, category, or likelihood value, you must encapsulate the word, category, and likelihood values with a text qualifier in the import action. For example, if the delimiter is a space value and the word, category, and likelihood values contain space values, you might use a double-quotation mark to demarcate the values as follows:

"This is the word" "This is the category" "Very High"

If the text qualifier occurs naturally within a word or category value, it must be escaped by doubling it. For example:

"This is an escaped ""quotation"" in a word" "Quoted Words category" "Very Low"
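
If you generate the import file with a script, a standard CSV library can apply the text qualifier and the doubled-quote escaping for you. The following is a minimal sketch that uses Python's csv module to reproduce the two example lines above. The space delimiter and double-quotation mark qualifier come from the examples; writing to an in-memory buffer rather than to a file is a choice made only for illustration.

```python
import csv
import io

# Word, category, likelihood triples taken from the examples above.
rows = [
    ["This is the word", "This is the category", "Very High"],
    ['This is an escaped "quotation" in a word', "Quoted Words category", "Very Low"],
]

buffer = io.StringIO()
writer = csv.writer(
    buffer,
    delimiter=" ",          # space delimiter, as in the examples
    quotechar='"',          # text qualifier
    quoting=csv.QUOTE_ALL,  # qualify every value, not only those with spaces
    lineterminator="\n",
)
writer.writerows(rows)  # embedded qualifiers are doubled automatically

text = buffer.getvalue()
print(text, end="")
```

Qualifying every value keeps the file consistent even when only some values contain the delimiter.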

The valid likelihood values are: Very Low, Low, Medium, High, and Very High. All other values are considered an error and will cause the import to fail.

All spaces before and after the delimiter will be removed during the import.
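
Because an invalid likelihood value causes the import to fail, it can be useful to check a delimited file before you import it. The following Python sketch applies the rules in this section to a block of text. The function name check_import_rows, its parameters, and the sample data are hypothetical, and treating likelihood values as case-insensitive is an assumption that this guide does not confirm.

```python
import csv
import io

# Valid likelihood values from this guide, lowercased for comparison.
# Case-insensitive matching is an assumption of this sketch.
VALID_LIKELIHOODS = {"very low", "low", "medium", "high", "very high"}


def check_import_rows(text, delimiter=",", qualifier='"', rows_to_skip=0):
    """Return (rows, errors) for text laid out as word, category, likelihood."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=qualifier,
        skipinitialspace=True,  # drop spaces that follow the delimiter
    )
    rows, errors = [], []
    for line_no, fields in enumerate(reader, start=1):
        if line_no <= rows_to_skip or not fields:
            continue  # header rows and blank lines
        fields = [field.strip() for field in fields]  # drop remaining padding
        if len(fields) != 3:
            errors.append(f"line {line_no}: expected 3 fields, found {len(fields)}")
            continue
        word, category, likelihood = fields
        if likelihood.lower() not in VALID_LIKELIHOODS:
            errors.append(f"line {line_no}: invalid likelihood {likelihood!r}")
            continue
        rows.append((word, category, likelihood))
    return rows, errors


# Hypothetical sample file with one header row and one invalid likelihood.
sample = (
    "Word,Category,Likelihood\n"
    "Kim , FGW , Medium\n"
    "Kim,MGW,Medium\n"
    "Scott,GNW,Certain\n"
)
rows, errors = check_import_rows(sample, rows_to_skip=1)
print(rows)
print(errors)
```

Unlike the import itself, this check collects every problem it finds rather than stopping at the first one, so you can correct the whole file in one pass.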

Performing the Import

  1. In the Vocabulary Editor, open the Import Words dialog. A window similar to the following displays:

The following describes the parameters on the Import Words dialog.

Note: Some parameters are enabled or disabled depending on choices made on other parameters.

Type - Specify the input file type.

Set QKB to Import From - Specify which Quality Knowledge Base to import the Vocabulary or Scheme from if it is not the current QKB.

File name - Specify the name of the input file to import. Tip: Select the proper file extension filter in the file chooser dialog to find the desired input file (for example, *.csv for comma-separated files or *.tab for tab-delimited files).

Encoding - Specify the encoding of the input file.

Delimiter - Specify the delimiter used in the delimited text file.

Text qualifier - Specify the text qualifier used in the delimited text file.

Rows to skip - Specify the number of rows to skip at the beginning of the delimited text file.

Filter Word List - Specify a category filter so that only words that have a category which matches the filter will be imported.

Use categories from imported vocabulary words - Specify if the categories for a given imported vocabulary word should be added when the word already exists in the vocabulary being imported into.

In case of likelihood conflict during merge - When the imported word already exists in the vocabulary being imported into, and if both words share categories, and if the likelihood values differ, you need to decide how to resolve the likelihood conflict. If this situation is encountered during the import, use this option to specify whether the existing likelihood should be preserved, whether the likelihood from the imported word should be used, or whether a prompt should be displayed to reconcile the conflict.

Categories to add - Specify the categories and likelihoods to assign to every word imported. For each category that you select, you can select Overwrite likelihood. Choosing this option adds your specified likelihood in place of a different value that may already exist for that word and category in the vocabulary being imported into.
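
The merge options described above can be summarized as a small decision procedure. The following Python sketch is a conceptual model only, not the product's implementation: the function merge_word, the policy names keep_existing, use_imported, and prompt, and the add_new_categories flag are all hypothetical names chosen to mirror the dialog options.

```python
def merge_word(vocab, word, category, likelihood,
               on_conflict="keep_existing", add_new_categories=True, prompt=None):
    """Merge one imported (word, category, likelihood) triple into vocab.

    vocab maps each word to a dict of {category: likelihood}.
    """
    is_new_word = word not in vocab
    categories = vocab.setdefault(word, {})
    existing = categories.get(category)

    if existing is None:
        # No shared category: add it for new words, or for existing words
        # when categories from imported words are to be used.
        if is_new_word or add_new_categories:
            categories[category] = likelihood
    elif existing != likelihood:
        # Shared category with a different likelihood: apply the policy.
        if on_conflict == "use_imported":
            categories[category] = likelihood
        elif on_conflict == "prompt":
            if prompt is None:
                raise ValueError("the 'prompt' policy needs a prompt callback")
            categories[category] = prompt(word, category, existing, likelihood)
        elif on_conflict != "keep_existing":
            raise ValueError(f"unknown conflict policy: {on_conflict}")
    return vocab


vocab = {"Scott": {"FNW": "Medium"}}
merge_word(vocab, "Scott", "FNW", "Low")                             # existing value kept
merge_word(vocab, "Scott", "GNW", "High")                            # new category added
merge_word(vocab, "Scott", "FNW", "Low", on_conflict="use_imported")  # imported value used
print(vocab)  # {'Scott': {'FNW': 'Low', 'GNW': 'High'}}
```

A conflict arises only when the word exists, the category is shared, and the likelihood values differ. In every other case the imported value is either added or ignored without a decision.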

  2. Once all inputs have been provided, click Import to import words from the specified file. If you are importing categories that do not already exist in the vocabulary being imported into, you are prompted to enter a description of each new category being imported. Enter a description for each new category and click OK to continue.
  3. When the import is complete, a status message about the import displays. Click OK to dismiss the status message.
  4. Click Close to close the Import Words dialog. The imported words display as shown below.


Review Categories and Likelihoods

Now that you are looking at the imported words and categories in your new Vocabulary, you can add or delete individual categories, or change likelihood values.

Here is an example of when you might want to update a likelihood value. If your vocabulary contains the name Scott, then that word might have the categories Family Name Word (FNW) and Given Name Word (GNW). You might determine that Scott is more likely to be a Given Name Word than a Family Name Word. You could then increase the likelihood value of the Given Name category for the word Scott.

To change a likelihood value, select the word and click Edit.

Although there may be some adjustments that you want to make to the likelihoods at this point, later testing with the Parse Test Tool will probably reveal other necessary adjustments to give the desired result.

Save the Vocabulary

Now that your Vocabulary is built, you need to save it. Select File > Save. If this is a newly built Vocabulary, the Vocabulary Editor will prompt you for a name.

Modifying Vocabularies

Other than altering the likelihood for specific words in a Vocabulary, we recommend you not make many other modifications. However, certain situations may warrant it, so the Vocabulary Editor does allow these operations. This may provide a good way to temporarily make changes for testing purposes.

Add a word to a Vocabulary

  1. On the Vocabulary Editor dialog, select File > Open. The Open dialog appears.
  2. Select the Vocabulary to which you want to add a word, and then click Open. The Vocabulary's details appear on the Vocabulary Editor dialog.
  3. Select Edit > Add Word. The Add Word dialog appears.
  4. Enter your new word, and then click OK. The word now appears selected under Word on the Vocabulary Editor dialog.
  5. On the right side of the screen, add at least one category with a likelihood value.

Note: The Vocabulary Editor will alert you if you try to add a word that already exists in the Vocabulary.

Modify a word in a Vocabulary

  1. On the Vocabulary Editor dialog, select File > Open. The Open dialog appears.
  2. Select the Vocabulary that contains the word you want to modify, and then click Open. The Vocabulary's details appear on the Vocabulary Editor dialog.
  3. Under Word, select the word or words that you want to modify. The word's categories appear on the right side of the dialog.
  4. Click Add to display the Add Word Category dialog. If you change the likelihood of a category that already belongs to the selected word or words, the Overwrite Category dialog appears. The Overwrite Category dialog enables you to accept or refuse one or more changed likelihood values.
  5. Select a category and click Edit to change category settings and likelihood values.
  6. Select a category and click Delete to remove that category from the selected word or words.

Delete a word from a Vocabulary

  1. On the Vocabulary Editor dialog, select File > Open. The Open dialog appears.
  2. Select the Vocabulary that contains the word you want to delete, and then click Open. The Vocabulary's details appear on the Vocabulary Editor dialog.
  3. Under Word, select the word you want to delete.
  4. Select Edit > Delete Word.

Documentation Feedback: yourturn@sas.com
Note: Always include the Doc ID when providing documentation feedback.

Doc ID: dfU_Cstm_Vocab_14000.html