SAS Quality Knowledge Base for Contact Information 26
Extraction definitions specify data and logic that are used to extract information from a data string. The output of an extraction definition is a set of tokens. A token is a semantically atomic component of a data value. For example, the set of tokens defined for the Contact Info extraction definition might be:
NAME
E-MAIL
PHONE
When an extraction definition is applied to a data string, the string is analyzed and, if appropriate substrings are found, they are assigned to the output tokens. As an example, consider the results of applying the Contact Info extraction definition to the following string:
C/O Mr John Smith, john.smith@dataflux.com
When the extraction definition is applied, any names, phone numbers, and email addresses are placed in tokens as follows:
Token Name | Token Value |
---|---|
NAME | Mr John Smith |
E-MAIL | john.smith@dataflux.com |
PHONE | |
In this example, notice that not all of the output tokens are populated with values: the PHONE token is empty because the string contains no phone number. Also, not every word in the string is placed into a token. Only the relevant words are extracted; "C/O", for example, is not assigned to any token.
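The behavior described above can be approximated with a minimal Python sketch. This is not how the Contact Info extraction definition is implemented; the regular expressions and the `extract_contact_info` function are hypothetical stand-ins, chosen only to show how substrings are assigned to a fixed set of output tokens while unmatched text is left out.

```python
import re

# Hypothetical patterns for illustration only. The actual data and logic
# inside the Contact Info extraction definition are not described here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")
NAME_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+(?: [A-Z][a-z]+)+")


def extract_contact_info(text):
    """Return a dict with one entry per output token.

    Tokens with no matching substring stay empty, and text that matches
    no pattern (such as "C/O") is not placed in any token.
    """
    tokens = {"NAME": "", "E-MAIL": "", "PHONE": ""}
    patterns = (("NAME", NAME_RE), ("E-MAIL", EMAIL_RE), ("PHONE", PHONE_RE))
    for token, pattern in patterns:
        match = pattern.search(text)
        if match:
            tokens[token] = match.group(0)
    return tokens


result = extract_contact_info("C/O Mr John Smith, john.smith@dataflux.com")
for token, value in result.items():
    print(f"{token}: {value}")
```

With the sample string, the sketch fills NAME and E-MAIL and leaves PHONE empty, mirroring the table above. A real extraction definition relies on the knowledge base's own vocabularies and rules rather than on a handful of regular expressions.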
Extraction definitions are useful when you want to determine the attributes of an item in order to group, compare, or analyze items with similar features. For instance, the Contact Info extraction definition can be useful if you have a field containing mixed or unknown data.