DataFlux Data Management Studio 2.7: User Guide
The Vocabulary Node performs a lookup in the vocabulary file to retrieve the categories and likelihoods for each input substring.
Used in:
Select a Vocabulary to use.
Click Open Vocabulary to open the selected vocabulary in the Vocabulary Editor.
If this check box is selected and the input substring is found in the vocabulary of the current node, subsequent Vocabulary Nodes will not process that word. This lets you decide whether two or more vocabularies that contain the same word should all contribute their categories, or whether only the first vocabulary's categories should be considered.
Perform fuzzy lookups for all categories (if no selection) or specified categories
By default, a word must exactly match an entry in the vocabulary to count as a match. When fuzzy lookups are activated, words that are sufficiently similar to entries in the specified vocabulary also match. This option lets you restrict fuzzy matching to a specific category or categories of words from the vocabulary instead of using all categories.
Threshold
The threshold specifies the degree of similarity that must exist for the word to match. The higher the number, the more similarity must exist for the match. If you want an exact match, select 100. This turns off fuzzy lookups.
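The interaction between the threshold and the category selection can be sketched in Python. This is only an illustration under stated assumptions: the guide does not say which similarity measure the product uses, so `difflib.SequenceMatcher` stands in for it here, and the names `similarity` and `fuzzy_lookup`, the case-insensitive comparison, and the likelihood label "High" are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 score. Stand-in metric, not the product's own."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(word, vocabulary, threshold=100, categories=None):
    """Collect {category: likelihood} assignments for a word.

    vocabulary: dict mapping each entry to {category: likelihood}.
    threshold:  100 means exact matches only (fuzzy lookups off).
    categories: optional set limiting which categories may match fuzzily;
                None means all categories (assumed reading of the option).
    """
    matches = {}
    for entry, assignments in vocabulary.items():
        if entry.lower() == word.lower():
            # Exact matches always contribute every category.
            matches.update(assignments)
        elif threshold < 100 and similarity(word, entry) >= threshold:
            # Fuzzy matches contribute only the selected categories.
            for category, likelihood in assignments.items():
                if categories is None or category in categories:
                    matches.setdefault(category, likelihood)
    return matches


names = {"John": {"GNW": "High"}, "Smith": {"FNW": "High"}}
exact_only = fuzzy_lookup("Jon", names, threshold=100)
relaxed = fuzzy_lookup("Jon", names, threshold=80)
restricted = fuzzy_lookup("Jon", names, threshold=80, categories={"FNW"})
```

With this stand-in metric, "Jon" scores about 86 against "John", so it matches at a threshold of 80 but not at 100, and it is excluded again when fuzzy matching is limited to the FNW category.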
If any word in the vocabulary matches the input, the message is "Changes applied". Otherwise, the message is "No changes applied".
A table with four columns:
Vocabulary outputs are cumulative, so a single node's output includes the outputs from previous vocabularies (if applicable).
A vocabulary is a table containing a list of words. For each word, one or more categories are assigned, and a likelihood is attached to each assignment.
A category indicates the function of the word in the context for which the vocabulary is intended.
For example:
In the context of people's names, some possible categories might be:
- Prefix Word (PW), for example, "Mr"
- Given Name Word (GNW), for example, "John"
- Family Name Word (FNW), for example, "Smith"
A likelihood indicates the presumptive probability of that word belonging to that category.
For example:
In the context of people's names, assuming the English language, you could say that, given no other information than our general knowledge of English, "Judy" has:
- a very high likelihood of being a GNW
- a very low likelihood of being an FNW
- no possibility of being a PW
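One way to picture the word, category, and likelihood relationship is as a nested mapping. The Python sketch below is an illustration only, not the product's file format: the labels "Very High" and "Very Low" are hypothetical stand-ins for whatever likelihood values a real vocabulary stores, and `lookup` is an invented helper name.

```python
# Hypothetical likelihood labels; the guide describes them only informally.
VOCABULARY = {
    # "Judy" has no PW entry at all, which models "no possibility".
    "Judy": {"GNW": "Very High", "FNW": "Very Low"},
}


def lookup(word, vocabulary=VOCABULARY):
    """Return a copy of the {category: likelihood} assignments for a word."""
    return dict(vocabulary.get(word, {}))


judy = lookup("Judy")
print(judy.get("GNW"))  # Very High
print(judy.get("FNW"))  # Very Low
print(judy.get("PW"))   # None
```

A word that is not in the vocabulary returns an empty mapping, which corresponds to no match.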
In many definitions, vocabularies (using Vocabulary Nodes) are used in conjunction with Grammars (through various Pattern Nodes). For this reason, the categories of a vocabulary that is intended for use in morphological analysis generally correspond to the categories defined in a related grammar.
If you have two vocabularies and the same word is in each vocabulary with different categories, are both categories assigned?
Yes, assuming the "stop if found" flag is not set on the first vocabulary.
What if the word appears in two vocabularies with the same category but different likelihoods?
That category will appear one time and the last likelihood encountered will be used (that is, the duplicate overwrites the original).
Note: This situation tends to cause confusion; it should be avoided, if possible.
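The two answers above can be sketched as a short Python function that runs a word through a sequence of vocabularies. It is a minimal model of the described behavior, not the product's implementation: `run_vocabulary_nodes`, the `(vocabulary, stop_if_found)` pair layout, the sample word, and the likelihood labels are all hypothetical.

```python
def run_vocabulary_nodes(word, nodes):
    """Accumulate {category: likelihood} across vocabulary nodes in order.

    nodes: list of (vocabulary, stop_if_found) pairs, where each
           vocabulary maps a word to {category: likelihood}.
    """
    result = {}
    for vocabulary, stop_if_found in nodes:
        assignments = vocabulary.get(word)
        if assignments is None:
            continue  # word not in this vocabulary; try the next node
        # Same category seen again: the later likelihood overwrites.
        result.update(assignments)
        if stop_if_found:
            break  # later nodes do not process this word
    return result


first = {"Jordan": {"GNW": "High"}}
second = {"Jordan": {"FNW": "Medium", "GNW": "Low"}}

both = run_vocabulary_nodes("Jordan", [(first, False), (second, False)])
first_only = run_vocabulary_nodes("Jordan", [(first, True), (second, False)])
```

Here `both` holds GNW and FNW, with GNW carrying the likelihood from the second vocabulary, while `first_only` holds only the GNW assignment from the first vocabulary.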