The Text Mining Process

Whether you intend to use textual data for descriptive purposes, predictive purposes, or both, the same processing steps take place, as shown in the following table:

Action	Result	Tool
File preprocessing	Creates a single SAS data set from your document collection. The SAS data set is used as input for the Text Miner node or the Text Parsing node, and might contain the actual text or paths to the actual text.	%TMFILTER macro — a SAS macro for extracting text from documents and creating a predefined SAS data set with a text variable
Text parsing	Decomposes textual data and generates a quantitative representation suitable for data mining purposes.	Text Miner node, Text Parsing node
Transformation (dimension reduction)	Transforms the quantitative representation into a compact and informative format.	Text Miner node, Text Filter node
Document analysis	Performs classification, prediction, or concept linking of the document collection. Creates clusters or topics from the data.	Text Miner node, Text Topic node, or SAS Enterprise Miner predictive modeling nodes

Finally, the rules derived for clustering or prediction can be used to score a new collection of documents at any time.
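
The flow of these steps can be illustrated outside of SAS with a minimal Python sketch. It is not how the Text Miner nodes are implemented: the sample documents, the stop list, the helper names (parse, build_vocabulary, vectorize, cluster, score), and the greedy cosine-similarity clustering are all assumptions chosen only to show how parsing, dimension reduction, document analysis, and scoring fit together.

```python
import math
import re
from collections import Counter

# Hypothetical toy collection, standing in for the data set that
# file preprocessing would produce.
documents = [
    "The engine failed after the oil leak",
    "Oil leak caused engine damage",
    "The invoice payment was late",
    "Late payment fee on the invoice",
]
STOP_WORDS = {"the", "after", "was", "on"}  # assumed stop list

def parse(text):
    """Text parsing: lowercase, split into alphabetic terms, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_vocabulary(docs, min_docs=2):
    """Transformation: keep only terms found in at least min_docs documents."""
    doc_freq = Counter(term for doc in docs for term in set(parse(doc)))
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)

def vectorize(text, vocabulary):
    """Represent one document as term counts over the reduced vocabulary."""
    counts = Counter(parse(text))
    return [counts[t] for t in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.5):
    """Document analysis: greedy single-pass clustering.

    A document joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster. Centroids are
    kept as summed vectors, which is enough because cosine ignores scale.
    """
    centroids, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
        else:
            centroids.append([0] * len(vec))
            k = len(centroids) - 1
        centroids[k] = [c + v for c, v in zip(centroids[k], vec)]
        labels.append(k)
    return centroids, labels

def score(text, vocabulary, centroids):
    """Scoring: assign a new document to the most similar stored centroid.

    A document sharing no terms with the vocabulary falls back to cluster 0.
    """
    vec = vectorize(text, vocabulary)
    sims = [cosine(vec, c) for c in centroids]
    return sims.index(max(sims))

vocabulary = build_vocabulary(documents)
centroids, labels = cluster([vectorize(d, vocabulary) for d in documents])
new_label = score("Engine oil pressure warning", vocabulary, centroids)
```

Once vocabulary and centroids have been computed, they play the role of the stored rules: score can be called on any new document without repeating the analysis of the original collection.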

You might not need to include all of these steps in your analysis. Also, you might need to try several combinations of text-parsing options before you are satisfied with the results.
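
As a rough illustration of why parsing options are worth experimenting with, the following Python sketch compares how two assumed options change the number of distinct terms extracted from a toy collection. The options shown (a stop list and a minimum term length), the sample documents, and the function names are hypothetical and do not correspond to specific Text Parsing node properties.

```python
import re
from itertools import product

# Hypothetical sample documents and stop list.
documents = [
    "The engine failed after the oil leak",
    "Oil leak caused engine damage",
    "The invoice payment was late",
    "Late payment fee on the invoice",
]
STOP_WORDS = {"the", "after", "was", "on"}

def parse(text, use_stop_list, min_length):
    """Split text into lowercase terms, applying the chosen parsing options."""
    terms = re.findall(r"[a-z]+", text.lower())
    if use_stop_list:
        terms = [t for t in terms if t not in STOP_WORDS]
    return [t for t in terms if len(t) >= min_length]

def vocabulary_size(docs, use_stop_list, min_length):
    """Count the distinct terms that survive parsing across all documents."""
    return len({t for d in docs for t in parse(d, use_stop_list, min_length)})

# Try every combination of the two options and record the vocabulary size.
results = {
    (use_stop_list, min_length): vocabulary_size(documents, use_stop_list, min_length)
    for use_stop_list, min_length in product([False, True], [1, 4])
}

for (use_stop_list, min_length), size in results.items():
    print(f"stop list={use_stop_list}, min length={min_length}: {size} terms")
```

A smaller vocabulary is not automatically better. The point of the comparison is that each combination yields a different quantitative representation, so the downstream clusters or predictions should be reviewed before settling on one.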