Using the Text Parsing Node

This example shows you how to use the Text Parsing node to identify terms and their instances in a data set that contains text. This example assumes that SAS Enterprise Miner is running, and that a diagram workspace has been opened in a project. For information about creating a project and a diagram, see Setting Up Your Project.

The SAS data set SAMPSIO.ABSTRACT contains the titles and text of abstracts from conferences. Create the ABSTRACT data source and add it to your diagram workspace. Set the Role value of the TEXT and TITLE variables to Text.
Select the Text Mining tab on the toolbar, and drag a Text Parsing node into the diagram workspace.
Connect the ABSTRACT data source to the Text Parsing node.
In the diagram workspace, right-click the Text Parsing node and select Run. Click Yes in the Confirmation dialog box that appears.
Click Results in the Run Status dialog box when the node finishes running. The Results window displays a variety of tabular and graphical output to help you analyze the terms and their instances in the ABSTRACT data source.
Sort the terms in the Terms table by frequency, and then select the term “software.” As the Terms table shows, “software” is a noun that occurs in 494 documents in the ABSTRACT data source and appears a total of 881 times.
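The difference between the two counts reported for a term (total occurrences versus number of documents) can be illustrated with a minimal Python sketch. This is not how the Text Parsing node is implemented: the three documents are hypothetical, the tokenizer is a plain whitespace split, and the names `term_statistics`, `freq`, and `num_docs` are chosen only for this illustration.

```python
from collections import Counter

# Hypothetical mini-corpus; these strings are NOT taken from SAMPSIO.ABSTRACT.
documents = [
    "software tools help analysts build software faster",
    "the conference covered statistics and software",
    "abstracts describe methods and results",
]

def term_statistics(docs):
    """Return (freq, num_docs) Counters for lowercased, whitespace-split terms."""
    freq = Counter()      # total occurrences of each term
    num_docs = Counter()  # number of documents that contain each term
    for text in docs:
        tokens = text.lower().split()
        freq.update(tokens)           # every occurrence counts
        num_docs.update(set(tokens))  # each document counts at most once
    return freq, num_docs

freq, num_docs = term_statistics(documents)
print("software:", freq["software"], "occurrences in", num_docs["software"], "documents")
```

In this toy corpus, "software" occurs three times but in only two documents, which mirrors the relationship between the 881 occurrences and 494 documents shown in the Terms table.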

When you select a term in the Terms table, the point corresponding to that term in the Text Parsing Results plots is highlighted.
Select the Number of Documents by Frequency plot, and position the cursor over the highlighted point for information about the term “software.”

Similar information is also presented in a ZIPF plot.
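As a rough illustration of what a Zipf-style plot conveys, the sketch below ranks terms by document count and prints the rank and count on a log scale. It assumes that the plot orders terms by number of documents; the exact definition used by the node may differ. The counts are hypothetical and `zipf_points` is a name invented for this example.

```python
import math

# Hypothetical document counts per term; not taken from the ABSTRACT results.
doc_counts = {"software": 120, "data": 60, "model": 40, "analysis": 30}

def zipf_points(counts):
    """Return (rank, term, count) tuples, where rank 1 is the most common term."""
    # Sort by descending count, breaking ties alphabetically for stable output.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, count) for rank, (term, count) in enumerate(ordered, start=1)]

points = zipf_points(doc_counts)
for rank, term, count in points:
    # On log-log axes, a Zipf-like distribution falls close to a straight line.
    print(rank, term, count, round(math.log10(rank), 3), round(math.log10(count), 3))
```

The hypothetical counts were chosen so that rank times count is constant, which is the idealized Zipf relationship; real term distributions follow it only approximately.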

The Attribute by Frequency chart shows that Alpha has the highest frequency among attributes in the document collection.

The Role by Freq chart illustrates that Noun has the highest frequency among roles in the document collection.
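The aggregation behind the two charts above can be sketched as a simple group-and-sum over the term table. This sketch assumes that each chart totals term frequencies within a category; the rows, role labels, and the helper `total_by` are hypothetical and stand in for the node's actual Terms table.

```python
from collections import defaultdict

# Hypothetical rows in the spirit of the Terms table; values are illustrative only.
terms = [
    {"term": "software", "role": "Noun", "attribute": "Alpha", "freq": 9},
    {"term": "use",      "role": "Verb", "attribute": "Alpha", "freq": 5},
    {"term": "new",      "role": "Adj",  "attribute": "Alpha", "freq": 3},
    {"term": "2010",     "role": "Num",  "attribute": "Num",   "freq": 2},
    {"term": "data",     "role": "Noun", "attribute": "Alpha", "freq": 6},
]

def total_by(rows, key):
    """Sum the 'freq' field of rows, grouped by the given key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["freq"]
    return dict(totals)

by_role = total_by(terms, "role")
by_attribute = total_by(terms, "attribute")
print("Top role:", max(by_role, key=by_role.get))
print("Top attribute:", max(by_attribute, key=by_attribute.get))
```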
Return to the Terms table, and notice that the term “software” is kept in the text parsing analysis. This is illustrated by the value of Y in the Keep column. Notice that not all terms are kept when you run the Text Parsing node with default settings.

The Text Parsing node does more than gather statistical data about the terms in a document collection. It also enables you to modify your output set of parsed terms by dropping terms that are a certain part of speech, type of entity, or attribute. Scroll down the list of terms in the Terms table and notice that many of the terms with a role other than Noun are kept. Assume that you want to limit your text parsing results to terms with a role of Noun.
Close the Results window.
Select the Text Parsing node, and then select the ellipsis button for the Ignore Parts of Speech property.
In the Ignore Parts of Speech dialog box, select all parts of speech except for Noun by holding down CTRL and clicking on each option. Click OK. Notice that the value for the Ignore Parts of Speech property is updated with your selection.
In addition to nouns, also keep noun groups. Set the Noun Groups property to Yes.
Right-click the Text Parsing node and select Run. Click Yes in the Confirmation dialog box that appears. Select Results in the Run Status dialog box when the node has finished running. Notice that the term “software” ranks higher now that only terms with a role of Noun or Noun Group are included than it did when other roles were included. If you scroll down in the Terms table, you can see that only terms with a Noun or Noun Group role are included.
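The effect of ignoring parts of speech on a term's rank can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the node's implementation: the term list, the role labels in `IGNORED_ROLES`, and the helpers `drop_ignored` and `ranked` are all hypothetical.

```python
# Hypothetical set of roles to ignore; everything except Noun and Noun Group.
IGNORED_ROLES = {"Verb", "Adj", "Adv", "Prep"}

# Hypothetical (term, role, frequency) rows; values are illustrative only.
terms = [
    ("be", "Verb", 40),
    ("software", "Noun", 30),
    ("new", "Adj", 25),
    ("data", "Noun", 20),
    ("data set", "Noun Group", 10),
]

def drop_ignored(rows, ignored):
    """Keep only rows whose role is not in the ignored set."""
    return [row for row in rows if row[1] not in ignored]

def ranked(rows):
    """Map each term to its rank by descending frequency (1 = most frequent)."""
    ordered = sorted(rows, key=lambda row: -row[2])
    return {term: rank for rank, (term, _, _) in enumerate(ordered, start=1)}

before = ranked(terms)
after = ranked(drop_ignored(terms, IGNORED_ROLES))
print("Rank of 'software' before:", before["software"], "after:", after["software"])
```

Because the higher-frequency verb is dropped, "software" moves up in rank even though its own frequency is unchanged, and the filtered table contains fewer terms overall.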

As expected, fewer terms are plotted in the Number of Documents by Frequency plot.

Similarly, the total number of terms in the output results with an attribute of Alpha has decreased, as the Attribute by Frequency chart shows.