Document Extraction Node

DataFlux Data Management Studio 2.6: User Guide

You can add a Document Extraction node to identify textual entities and their usage within a given text input. The node identifies the terms (words) found in the input text and each term's usage category, such as vehicle, person, title, or company. For an example of how this node can be used, see Converting and Extracting a Document.

Once you have added the node, you can double-click it to open its properties dialog. The properties dialog includes the following elements:

Name - Specifies a name for the node.

Notes - Enables you to open the Notes dialog. You use the dialog to enter optional details or any other relevant information for the input.

Source Field - Specifies the name of the input field from the parent node.

Language - Specifies the language of the input text. The default is EN.

Output null rows - Specifies whether an output row is generated when the input field value contains no terms.

Number of rows to read - Specifies the maximum number of rows to read.

Exclude source field from output - Specifies whether to exclude the source field contents from the node's output.
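
The effect of the "Output null rows" and "Exclude source field from output" options can be pictured with a small Python model. This is only a conceptual sketch and not DataFlux code: the toy term dictionary, the function names, the output column names (term, category), and the assumption that the node produces one output row per extracted term are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Dict[str, Optional[str]]

# Hypothetical toy dictionary standing in for the node's extraction engine.
TOY_TERMS = {"toyota": "vehicle", "acme": "company"}


def toy_extract(text: str) -> List[Tuple[str, str]]:
    """Return (term, category) pairs for words found in TOY_TERMS."""
    pairs = []
    for word in text.split():
        cleaned = word.strip(".,;:!?")
        category = TOY_TERMS.get(cleaned.lower())
        if category is not None:
            pairs.append((cleaned, category))
    return pairs


def extract_rows(
    rows: Iterable[Row],
    source_field: str,
    output_null_rows: bool = False,
    exclude_source_field: bool = False,
) -> Iterator[Row]:
    """Model of the node's output, assuming one output row per term."""
    for row in rows:
        found = toy_extract(row.get(source_field) or "")
        if not found and output_null_rows:
            # Emit a single row with empty term columns for this input row.
            found = [(None, None)]
        for term, category in found:
            out = dict(row)
            if exclude_source_field:
                # Drop the source text; other input fields pass through.
                out.pop(source_field, None)
            out["term"] = term          # hypothetical output column name
            out["category"] = category  # hypothetical output column name
            yield out
```

In this model, an input row whose text contains no known terms disappears from the output unless output_null_rows is set, and the source text column is carried through unless exclude_source_field is set.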

You can access the following advanced properties by right-clicking the Document Extraction node and selecting Advanced Properties:

SOURCE_FIELD
SOURCE_LANGUAGE
DICTIONARY_ENTRIES (see Usage Notes below)
MAX_OUTPUT_ROWS
OUTPUT_NULL_ROWS
EXCLUDE_SOURCE_FIELD

Usage Notes

Using a Custom LITI File from SAS Content Categorization Studio

Some sites have team members who specialize in text analytics. These analysts might have access to SAS Content Categorization Studio. This application can be used to create custom LITI files. LITI files can be referenced by a Document Extraction node in a job, if the LITI file has been saved to a location accessible to the job. In the Document Extraction node, you can use the Advanced property DICTIONARY_ENTRIES to specify a full path to the LITI file.
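
Because DICTIONARY_ENTRIES expects a full path to a file that the job can reach, it can help to check the path before entering it. The following Python sketch is one possible pre-flight check and is not part of DataFlux Data Management Studio; the function name check_liti_path is hypothetical. For the result to be meaningful, run it on the machine and under the account that will run the job.

```python
import os
from typing import List


def check_liti_path(path: str) -> List[str]:
    """Return a list of problems found with the given path (empty if none)."""
    problems = []
    if not os.path.isabs(path):
        problems.append("path is not a full (absolute) path")
    if not os.path.isfile(path):
        problems.append("file does not exist or is not a regular file")
    elif not os.access(path, os.R_OK):
        problems.append("file is not readable by the current account")
    return problems
```

An empty list means that the path is absolute and points to a file that the current account can read. It does not confirm that the file is a valid LITI file.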

If you are using a release of DataFlux Data Management Studio prior to 2.6, and you want to specify a custom LITI file in a job, tell the analyst who uses SAS Content Categorization Studio that the Compatibility Date in the project file must be set to 2009/09/21. For more information about LITI files, see the documentation page on support.sas.com for the enterprise version of SAS Content Categorization Studio.

Using the "Exclude source field from output" Checkbox

If you create a job flow in which a Document Extraction node is followed by an Expression node, deselect the "Exclude source field from output" checkbox in the Document Extraction node. Otherwise, you might get an output error.
