Using the Text Cluster Node

This example uses the Text Cluster node to cluster SAS Users Group International (SUGI) abstracts. This example assumes that SAS Enterprise Miner is running, and that a diagram workspace has been opened in a project. For information about creating a project and a diagram, see Setting Up Your Project.

Note: SAS Users Group International is now SAS Global Forum.

Perform the following steps:

Create a data source for SAMPSIO.ABSTRACT. Change the Role of the variable TITLE to ID.

Note: The SAMPSIO.ABSTRACT data set contains information about 1,238 papers that were prepared for meetings of SUGI from 1998 through 2001 (SUGI 23 through 26). The variable TITLE is the title of the SUGI paper. The variable TEXT contains the abstract of the SUGI paper.
Add the SAMPSIO.ABSTRACT data source to the diagram workspace.
Select the Text Mining tab on the Toolbar, and drag a Text Parsing node into the diagram workspace.
Connect the Input Data node to the Text Parsing node.
Select the Text Parsing node, and then click the ellipsis button (...) for the Stop List property.
Click the Import button, browse to select SAMPSIO.SUGISTOP as the stop list, and then click OK. Click OK to exit the dialog box for the Stop List property.
Set the Find Entities property to Standard.
Click the ellipsis button (...) for the Ignore Types of Entities property to open the Ignore Types of Entities dialog box.
Select all entity types except Location, Organization, Person, and Product, so that only these four entity types are kept. Click OK.
Select the Text Mining tab, and drag a Text Filter node into the diagram workspace.
Connect the Text Parsing node to the Text Filter node.
Select the Text Mining tab, and drag a Text Cluster node into the diagram workspace.
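Connect the Text Filter node to the Text Cluster node. Your process flow diagram should now link four nodes in sequence: the SAMPSIO.ABSTRACT input data node, the Text Parsing node, the Text Filter node, and the Text Cluster node.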
Right-click the Text Cluster node and select Run. Click Yes in the Confirmation dialog box.
Click Results in the Run Status dialog box when the node has finished running.
Select the Clusters table.

The Clusters table contains an ID for each cluster, the descriptive terms that make up that cluster, and statistics for each cluster.
Select the first cluster in the Clusters table.
Select the Cluster Frequencies window to see a pie chart of the clusters by frequency. Position the mouse pointer over a section to see the frequency for that cluster in a tooltip.
Select the Cluster Frequency by RMS window, and then position the mouse pointer over the highlighted cluster.

The frequency of the first cluster is the highest, but how does it compare to the other clusters in terms of distance?
Select the Distance Between Clusters window. Then position the mouse pointer over the highlighted cluster to see the position of the first cluster in an X and Y coordinate grid.

Position the mouse pointer over other clusters to compare distances.
Close the Results window.
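
The three results windows above report a frequency, an RMS value, and a relative position for each cluster. The following Python sketch shows one way such statistics can be computed from document coordinates. It is an illustration only: the data points are invented, the names summarize and centroid_distance are hypothetical, and the RMS definition used here (root-mean-square distance of members to their centroid) is an assumption, not necessarily the formula that the Text Cluster node applies.

```python
from collections import defaultdict
from math import dist, sqrt

# Hypothetical 2-D document coordinates, each paired with a cluster ID.
docs = [
    (1, (0.0, 0.0)), (1, (0.0, 2.0)), (1, (2.0, 0.0)), (1, (2.0, 2.0)),
    (2, (10.0, 1.0)), (2, (12.0, 1.0)),
    (3, (1.0, 9.0)),
]

def summarize(docs):
    """Return frequency, centroid, and RMS spread for each cluster ID."""
    members = defaultdict(list)
    for cluster_id, point in docs:
        members[cluster_id].append(point)

    summary = {}
    for cluster_id, points in members.items():
        n = len(points)
        # Centroid: the per-axis mean of the member points.
        centroid = tuple(sum(axis) / n for axis in zip(*points))
        # Assumed RMS definition: root-mean-square distance to the centroid.
        rms = sqrt(sum(dist(p, centroid) ** 2 for p in points) / n)
        summary[cluster_id] = {"freq": n, "centroid": centroid, "rms": rms}
    return summary

summary = summarize(docs)

def centroid_distance(a, b):
    """Euclidean distance between the centroids of clusters a and b."""
    return dist(summary[a]["centroid"], summary[b]["centroid"])

for cluster_id, row in sorted(summary.items()):
    print(cluster_id, row["freq"], round(row["rms"], 3))
```

In this toy data the cluster with the highest frequency also has the largest spread, which is why it is worth reading frequency, RMS, and distance together before judging a cluster.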

Now compare the clustering results that were obtained with the Expectation-Maximization clustering algorithm to the results of a Hierarchical clustering algorithm.
Select the Text Cluster node.
Select Exact for the Exact or Maximum Number property.
Specify 10 for the Number of Clusters property.
Select Hierarchical for the Cluster Algorithm property.
Right-click the Text Cluster node and select Run. Click Yes in the Confirmation dialog box.
Click Results in the Run Status dialog box when the node has finished running.
Select the Clusters table.

Notice that while there are 10 clusters in the table, the Cluster IDs do not range from 1 to 10.
Select the Hierarchy Data table for more information about the clusters that appear in the Clusters table.
Select the Cluster Hierarchy window for a hierarchical graphical representation of the clusters.
Close the Results window.
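
The hierarchical run above produces exactly 10 clusters whose IDs do not range from 1 to 10. The following Python sketch shows one way that can happen: if every node of the merge tree receives its own number, the clusters that remain after cutting the tree carry node numbers rather than consecutive labels. This is an assumption made for illustration. The source does not state how the Text Cluster node assigns IDs, and the function agglomerate, the sample points, and the centroid linkage rule are all hypothetical.

```python
from itertools import combinations

def agglomerate(points, k):
    """Merge the two closest clusters until exactly k clusters remain.

    Every cluster, including each newly merged one, receives the next free
    ID, so the surviving IDs are tree-node numbers rather than 1..k.
    """
    clusters = {i + 1: [p] for i, p in enumerate(points)}  # leaf IDs 1..n
    parent = {}  # child ID -> ID of the cluster it was merged into
    next_id = len(points) + 1

    def mean(values):
        return sum(values) / len(values)

    while len(clusters) > k:
        # Centroid linkage on 1-D data: choose the pair with the closest means.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: abs(mean(clusters[pair[0]]) - mean(clusters[pair[1]])),
        )
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        parent[a] = parent[b] = next_id
        next_id += 1
    return clusters, parent

# Hypothetical 1-D values standing in for document positions.
points = [1.0, 1.2, 5.0, 5.1, 8.0, 20.0]
clusters, parent = agglomerate(points, k=3)

print(sorted(clusters))  # IDs of the clusters that remain after the cut
print(parent)            # which child IDs were merged into which parent ID
```

The parent mapping plays a role loosely comparable to the Hierarchy Data table: it records how the remaining clusters relate to the nodes that were merged to form them.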