Using the Text Rule Builder Node

This example uses the SAMPSIO.NEWS data set to show you how to predict a categorical target variable with the Text Rule Builder node. The results also show that the model is highly interpretable and useful for explanatory and summary purposes. This example assumes that SAS Enterprise Miner is running, and that a diagram workspace has been opened in a project. For information about creating a project and a diagram, see Setting Up Your Project.

The SAMPSIO.NEWS data set consists of 600 brief news articles. Most of the news articles fall into one of these categories: computer graphics, hockey, and medical issues.

The SAMPSIO.NEWS data set contains 600 observations and the following variables:

TEXT is a nominal variable that contains the text of the news article.
graphics is a binary variable that indicates whether the document belongs to the computer graphics category (1-yes, 0-no).
hockey is a binary variable that indicates whether the document belongs to the hockey category (1-yes, 0-no).
medical is a binary variable that indicates whether the document is related to medical issues (1-yes, 0-no).
newsgroup is a nominal variable that contains the group that a news article fits into.

To use the Text Rule Builder node to predict the categorical target variable, newsgroup, in the SAMPSIO.NEWS data set:

Use the Data Source Wizard to define a data source for the data set SAMPSIO.NEWS.
1. Set the measurement levels of the variables graphics, hockey, and medical to Binary.
2. Set the model role of the variable newsgroup to Target and leave the roles of graphics, hockey, and medical as Input.
3. Set the variable text to have a role of Text.
4. Select No in the Data Source Wizard — Decision Configuration dialog box.
5. Use the default target profile for the target newsgroup.
After you create the NEWS data source, drag it to the diagram workspace.

The Text Rule Builder node must be preceded by Text Parsing and Text Filter nodes.
Select the Text Mining tab on the toolbar, and drag a Text Parsing node into the diagram workspace.
Connect the NEWS data source to the Text Parsing node.
Select the Text Mining tab on the toolbar, and drag a Text Filter node into the diagram workspace.
Connect the Text Parsing node to the Text Filter node.
Select the Text Mining tab on the toolbar, and drag a Text Rule Builder node into the diagram workspace.
Connect the Text Filter node to the Text Rule Builder node.

Your process flow diagram should connect the nodes in this order: NEWS data source, Text Parsing, Text Filter, Text Rule Builder.
Select the Text Rule Builder node in the process flow diagram.
Click the value for the Generalization Error property, and select Very Low.
Click the value for the Purity of Rules property, and select Very Low.
Click the value for the Exhaustiveness property, and select Very Low.
In the diagram workspace, right-click the Text Rule Builder node and select Run. Click Yes in the Confirmation dialog box that appears.
Click Results in the Run Status dialog box when the node finishes running.
Select the Rules Obtained table to see information about the rules that were obtained.

Each rule in the Rule column is listed with its estimated precision for predicting the target, newsgroup.

In the second column above, the True Positive (the first number) is the number of documents that were correctly assigned to the rule. The Total (the second number) is the total number of documents that were assigned to the rule, whether correctly or not.

In the third column above, the Remaining Positive (the first number) is the total number of remaining documents in the category. The Total (the second number) is the total number of documents remaining.

In the above example, in the first row, 200 documents have been assigned to the MEDICAL newsgroup, and 600 total documents exist in the data set. Fifty-eight of the documents were assigned to the rule “gordon” (58 were correctly assigned). This means that if a document contains the word “gordon,” and you assign all those documents to the MEDICAL newsgroup, 58 out of 58 will be assigned correctly. In the next row, there are 200 – 58 = 142 MEDICAL newsgroup documents left that can be evaluated for rule assignment, out of a total of 600 – 58 = 542 documents. In this second row, 17 documents are correctly assigned to the rule “msg.” This means that if a document contains the term “msg,” and you assign all those documents to the MEDICAL newsgroup, 17 out of 17 will be assigned correctly.
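The bookkeeping behind the Remaining Positive/Total column can be reproduced with simple arithmetic. The following Python sketch is illustrative only and is not output of the node; the function name `remaining_after` and the tuple layout are assumptions made for this example, and the counts are the ones quoted above.

```python
def remaining_after(rules, positives, total):
    """Return the Remaining Positive/Total pair seen by each rule, plus the final pair.

    rules: list of (rule, true_positive, matched) tuples, in the order obtained.
    """
    rows = []
    for rule, true_positive, matched in rules:
        rows.append((rule, positives, total))
        positives -= true_positive  # correctly assigned documents leave the category pool
        total -= matched            # every matched document leaves the overall pool
    rows.append((None, positives, total))  # what is left after the last rule
    return rows


medical_rules = [("gordon", 58, 58), ("msg", 17, 17)]
for rule, pos, tot in remaining_after(medical_rules, positives=200, total=600):
    print(rule, f"{pos}/{tot}")
```

The first two rows reproduce the 200/600 and 142/542 values described above. By the same arithmetic, 125 of 525 documents would remain after the "msg" rule.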

Most of the rules are single-term rules because the NEWS data set is limited in size. However, there is one multiple-term rule above. In the 16th row, the rule “amount & ~team” means that if a document contains the word “amount” and does not contain the word “team,” then 4 of the remaining documents will be correctly assigned to the MEDICAL newsgroup.

Note: ~ means logical not.
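To make the rule syntax concrete, here is a minimal Python sketch of how a rule such as “amount & ~team” could be checked against a document. It assumes the document has already been reduced to a set of terms; the function name `matches` is hypothetical, and the sketch does not claim to reflect how the node matches terms internally.

```python
def matches(rule, document_terms):
    """Evaluate a rule such as "amount & ~team" against a set of terms.

    Terms are joined with "&"; a leading "~" negates a term.
    """
    for part in rule.split("&"):
        term = part.strip()
        if term.startswith("~"):
            # Negated term: the document must NOT contain it.
            if term[1:].strip() in document_terms:
                return False
        elif term not in document_terms:
            # Plain term: the document must contain it.
            return False
    return True


print(matches("amount & ~team", {"amount", "dose"}))  # True
print(matches("amount & ~team", {"amount", "team"}))  # False
print(matches("amount & ~team", {"dose"}))            # False
```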
Select the Score Rankings Overlay graph to view the following types of information about the target variable:
- Cumulative Lift
- Lift
- Gain
- % Response
- Cumulative % Response
- % Captured Response
- Cumulative % Captured Response
Note: To change the statistic, select one of the above options from the drop-down menu.
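For readers unfamiliar with these statistics, the Python sketch below computes most of them from binned counts by using their conventional definitions. It assumes documents are sorted by predicted score and grouped into bins, and that one target level is treated as the event. Whether Enterprise Miner computes each statistic in exactly this way is an assumption, and Gain is omitted because its definition is not given here. The function name `ranking_stats` and the example counts are made up for illustration.

```python
def ranking_stats(bins):
    """bins: list of (n_documents, n_events) per bin, ordered from highest to lowest score."""
    total_n = sum(n for n, _ in bins)
    total_events = sum(e for _, e in bins)
    base_rate = total_events / total_n  # overall event rate across all bins
    rows, cum_n, cum_e = [], 0, 0
    for n, e in bins:
        cum_n += n
        cum_e += e
        rows.append({
            "pct_response": 100 * e / n,
            "cum_pct_response": 100 * cum_e / cum_n,
            "pct_captured": 100 * e / total_events,
            "cum_pct_captured": 100 * cum_e / total_events,
            "lift": (e / n) / base_rate,
            "cum_lift": (cum_e / cum_n) / base_rate,
        })
    return rows


# Made-up counts for illustration; not taken from SAMPSIO.NEWS.
example = [(100, 80), (100, 60), (100, 40), (100, 20)]
for row in ranking_stats(example):
    print(row)
```

Under these definitions, the cumulative lift in the last bin is always 1, because by then every document has been included.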
Select the Fit Statistics window for statistical information about the target variable, newsgroup.
Close the Results window.
Click the value for the Generalization Error property, and select Medium.
Click the value for the Purity of Rules property, and select Medium.
Click the value for the Exhaustiveness property, and select Medium.
Select the NEWS data source.
Click the ellipsis button for the Variables property.
Change the role of the HOCKEY variable to Target, and change the role of the NEWSGROUP variable to Input.
Click OK.
In the diagram workspace, right-click the Text Rule Builder node and select Run. Click Yes in the Confirmation dialog box that appears.
Click Results in the Run Status dialog box when the node finishes running.
Select the Rules Obtained table to see information about the rules that predicted the target — the HOCKEY newsgroup.

Each rule in the Rule column is listed with its estimated precision for predicting the hockey target.

In the above example, in the first row, 200 documents have been assigned to the HOCKEY newsgroup, and 600 total documents exist in the data set. The target value is 1, instead of “HOCKEY,” because you set the hockey variable to be the target instead of the newsgroup variable. Seventy of the documents were assigned to the rule “team” (69 were correctly assigned). This means that if a document contains the word “team,” and you assign all those documents to the HOCKEY newsgroup, 69 out of 70 will be assigned correctly. In the next row, there are 200 – 69 = 131 HOCKEY documents left that can be evaluated for rule assignment, out of a total of 600 – 70 = 530 documents. In this second row, 23 documents are correctly assigned to the rule “hockey.” This means that if a document contains the word “hockey,” and you assign all those documents to the HOCKEY newsgroup, 23 out of 23 will be assigned correctly.
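The “team” rule is a useful case because it is not perfectly precise. The short Python sketch below works through the numbers quoted above. The raw ratio that it computes is only the proportion of matched documents that were correct; the estimated precision reported in the table might be calculated differently, so treat this as an illustration and not as the node's formula.

```python
matched, true_positive = 70, 69  # rule "team", first row
positives, total = 200, 600      # HOCKEY documents / all documents

raw_precision = true_positive / matched
false_positive = matched - true_positive

# Correctly assigned documents reduce the remaining positives,
# but every matched document reduces the remaining total.
next_positives = positives - true_positive
next_total = total - matched

print(f"raw precision: {raw_precision:.3f}")
print(f"false positives: {false_positive}")
print(f"next row: {next_positives}/{next_total}")
```

The one incorrectly matched document is why the remaining positives drop by 69 while the remaining total drops by 70.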
Select the Score Rankings Overlay graph to view the following types of information about the target variable:
- Cumulative Lift
- Lift
- Gain
- % Response
- Cumulative % Response
- % Captured Response
- Cumulative % Captured Response
Note: To change the statistic, select one of the above options from the drop-down menu.
Select the Fit Statistics table for statistical information about the hockey target variable.
Close the Results window.
Click the ellipsis button for the Content Categorization Code property.

The Content Categorization Code window appears. This window shows the code that is output for SAS Content Categorization. The code is ready for compilation.
Click Cancel.
Click the ellipsis button for the Change Target Values property.

The Change Target Values window appears.

You can use the Change Target Values window to improve the model.
Select one or more cells in the Assigned Target column, and select a new target value.
Click OK.
Rerun the Text Rule Builder node, and then check whether your model has been improved.