Extraction Definitions

SAS Quality Knowledge Base for Contact Information 26

Extraction definitions specify data and logic that are used to extract information from a data string. The output of an extraction definition is a set of tokens. A token is a semantically atomic component of a data value. For example, the set of tokens defined for the Contact Info extraction definition might be:

NAME
E-MAIL
PHONE

When an extraction definition is applied to a data string, the string is analyzed and, if appropriate substrings are found, they are assigned to the output tokens. As an example, consider the results of applying the Contact Info extraction definition to the following string:

C/O Mr John Smith, john.smith@dataflux.com

When the extraction definition is applied, any names, phone numbers, and email addresses are placed in tokens as follows:

Token Name	Token Value
NAME:	Mr John Smith
E-MAIL:	john.smith@dataflux.com
PHONE:

In this example, notice that not all of the output tokens are populated with values: PHONE is empty because the string contains no phone number. Likewise, not every word in the string is placed into a token. Only the relevant words are extracted and placed in a token; "C/O", for example, is not assigned to any token.
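
The behavior described above can be illustrated with a minimal Python sketch. This is not the logic of the Contact Info extraction definition, which is not shown here. The regular expressions and the function name extract_contact_info are hypothetical, chosen only to reproduce the example: every token appears in the output, a token with no matching substring stays empty, and unmatched words are discarded.

```python
import re

# Hypothetical patterns for illustration only; the actual extraction
# definition uses its own data and logic, which are not documented here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
NAME_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def extract_contact_info(text):
    """Return a dict with one entry per token; unmatched tokens stay empty."""
    tokens = {"NAME": "", "E-MAIL": "", "PHONE": ""}

    # Take the e-mail address first and blank it out of the working string
    # so that its characters cannot be mistaken for a name or phone number.
    email = EMAIL_RE.search(text)
    if email:
        tokens["E-MAIL"] = email.group(0)
        text = text[:email.start()] + " " + text[email.end():]

    phone = PHONE_RE.search(text)
    if phone:
        tokens["PHONE"] = phone.group(0)

    name = NAME_RE.search(text)
    if name:
        tokens["NAME"] = name.group(0)

    # Anything not matched above (such as "C/O") is simply not returned.
    return tokens


result = extract_contact_info("C/O Mr John Smith, john.smith@dataflux.com")
for token_name, token_value in result.items():
    print(f"{token_name}:\t{token_value}")
```

For the sample string, this sketch fills NAME with "Mr John Smith" and E-MAIL with "john.smith@dataflux.com", and leaves PHONE empty, matching the table above. The name pattern here relies on a title such as "Mr", a simplification that a real definition would not be limited to.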

Extraction definitions are useful when you want to determine the attributes of an item in order to group, compare, or analyze items with similar features. For instance, the Contact Info extraction definition can be useful if you have a field containing mixed or unknown data.