This example uses the SAMPSIO.NEWS data set to show you how to predict a
categorical target variable with the Text Rule Builder node.
The results also show that the model is highly interpretable
and useful for explanatory and summary purposes. This example
assumes that SAS Enterprise Miner is running, and that a diagram workspace
has been opened in a project. For information about creating a project
and a diagram, see Setting Up Your Project.
The SAMPSIO.NEWS data
set consists of 600 brief news articles. Most of the news articles
fall into one of these categories: computer graphics, hockey, and
medical issues.
The SAMPSIO.NEWS data
set contains 600 observations and the following variables:
-
TEXT is a nominal variable that contains the text of the news article.
-
graphics is a binary variable that indicates whether the document belongs to the
computer graphics category (1 = yes, 0 = no).
-
hockey is a binary variable that indicates whether the document belongs to the
hockey category (1 = yes, 0 = no).
-
medical is a binary variable that indicates whether the document is related to
medical issues (1 = yes, 0 = no).
-
newsgroup is a nominal variable that contains the group that a news article fits
into.
To use the Text Rule Builder node to predict the categorical target variable,
newsgroup, in the SAMPSIO.NEWS data set:
-
Use the Data Source Wizard to define a data source for the data set SAMPSIO.NEWS.
-
Set the measurement levels of the variables graphics, hockey, and medical to Binary.
-
Set the model role of the variable newsgroup to Target and leave the roles of
graphics, hockey, and medical as Input.
-
Set the variable text to have a role of Text.
-
Select
No
in
the
Data Source Wizard — Decision Configuration dialog
box.
-
Use the default target profile for the target newsgroup.
-
After you create the NEWS data source, drag it to the diagram workspace.
The Text Rule Builder node must be preceded by Text Parsing and Text Filter nodes.
-
Select the Text Mining tab on the toolbar, and drag a Text Parsing node into the
diagram workspace.
-
Connect the NEWS data source to the Text Parsing node.
-
Select the Text Mining tab on the toolbar, and drag a Text Filter node into the
diagram workspace.
-
Connect the Text Parsing node to the Text Filter node.
-
Select the Text Mining tab on the toolbar, and drag a Text Rule Builder node into
the diagram workspace.
-
Connect the Text Filter node to the Text Rule Builder node.
Your process flow diagram
should resemble the following:
-
Select the Text Rule Builder node in the process flow diagram.
-
Click the value for the Generalization Error property, and select Very Low.
-
Click the value for the Purity of Rules property, and select Very Low.
-
Click the value for the Exhaustiveness property, and select Very Low.
-
In the diagram workspace, right-click the Text Rule Builder node and select Run.
Click Yes in the Confirmation dialog box that appears.
-
Click Results in the Run Status dialog box when the node finishes running.
-
Select the Rules Obtained table to see information about the rules that
were obtained.
Each rule in the Rule column is listed with its estimated precision in
predicting the target, newsgroup.
In the second column of the Rules Obtained table, the True Positive (the first
number) is the number of documents that were correctly assigned to the rule.
The Total (the second number) is the total number of documents that were
assigned to the rule.
In the third column, the Remaining Positive (the first number) is the total number
of remaining documents in the category. The Total (the second number)
is the total number of documents remaining.
In the above example,
in the first row, 200 documents have been assigned to the MEDICAL
newsgroup, and 600 total documents exist in the data set. Fifty-eight
of the documents were assigned to the rule “gordon”
(58 were correctly assigned). This means that if a document contains
the word “gordon,” and you assign all those documents
to the MEDICAL newsgroup, 58 out of 58 will be assigned correctly.
In the next row, there are 200 – 58 = 142 MEDICAL newsgroup
documents left that can be evaluated for rule assignment, out of a
total of 600 – 58 = 542 documents. In this second row, 17 documents
are correctly assigned to the rule “msg.” This means
that if a document contains the term “msg,” and you
assign all those documents to the MEDICAL newsgroup, 17 out of 17
will be assigned correctly.
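The row-to-row arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not code produced by SAS Enterprise Miner; the names `Remaining` and `after_rule` are hypothetical.

```python
from typing import NamedTuple


class Remaining(NamedTuple):
    """Counts shown in the third column of the Rules Obtained table."""
    positive: int  # documents of the target category not yet covered by a rule
    total: int     # all documents not yet covered by a rule


def after_rule(before: Remaining, true_positive: int, matched: int) -> Remaining:
    """Return the counts that are left once a rule has claimed its documents.

    true_positive: documents the rule assigned correctly
    matched:       all documents the rule was assigned to
    """
    if not 0 <= true_positive <= matched:
        raise ValueError("true_positive must be between 0 and matched")
    if true_positive > before.positive or matched > before.total:
        raise ValueError("a rule cannot claim more documents than remain")
    return Remaining(before.positive - true_positive, before.total - matched)


# Counts taken from the MEDICAL example in the text.
start = Remaining(positive=200, total=600)
row2 = after_rule(start, true_positive=58, matched=58)  # rule "gordon"
row3 = after_rule(row2, true_positive=17, matched=17)   # rule "msg"
print(row2)
print(row3)
```

The second row reproduces the 142 and 542 given in the text. The third row is simply the same subtraction applied once more.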
Most of the rules are single-term rules because the NEWS data set is limited in
size. However, there is one multiple-term rule above. In the 16th row, the rule
“amount & ~team” means that if a document contains the word “amount”
and does not contain the word “team,” then 4 of the
remaining documents will be correctly assigned to the MEDICAL newsgroup.
Note: ~ means logical NOT.
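To make the rule syntax concrete, the following Python sketch evaluates a rule such as “amount & ~team” against a piece of text. It is an illustration only: the function name `matches_rule` is hypothetical, and the sketch assumes simple lowercase word matching, whereas the Text Rule Builder node works on the terms produced by the Text Parsing and Text Filter nodes.

```python
import re


def matches_rule(rule: str, text: str) -> bool:
    """Evaluate a conjunction of terms, where ~ negates a term.

    Assumption: a "term" is a lowercase word found by a simple regular
    expression; no stemming or synonym handling is performed.
    """
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    for literal in rule.split("&"):
        literal = literal.strip().lower()
        if not literal:
            continue
        if literal.startswith("~"):
            # Negated term: the document must NOT contain it.
            if literal[1:].strip() in words:
                return False
        elif literal not in words:
            # Plain term: the document must contain it.
            return False
    return True


print(matches_rule("amount & ~team", "The amount of the dose matters"))
print(matches_rule("amount & ~team", "The team paid a large amount"))
```

The first call prints True because the text contains “amount” and not “team”. The second prints False because both words are present.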
-
Select the Score Rankings Overlay graph to view the following types of
information about the target variable:
-
Cumulative % Captured Response
Note: To change the statistic,
select one of the above options from the drop-down menu.
-
Select the Fit Statistics window for statistical information about the
target variable, newsgroup.
-
Close the Results window.
-
Click the value for the Generalization Error property, and select Medium.
-
Click the value for the Purity of Rules property, and select Medium.
-
Click the value for the Exhaustiveness property, and select Medium.
-
Select the NEWS data source.
-
Click the button for the Variables property.
-
Change the role of the HOCKEY variable to Target, and change
the role of the NEWSGROUP variable to Input.
-
In the diagram workspace, right-click the Text Rule Builder node and select Run.
Click Yes in the Confirmation dialog box that appears.
-
Click Results in the Run Status dialog box when the node finishes running.
-
Select the Rules Obtained table to see information about the rules that
predicted the target, the HOCKEY newsgroup.
Each rule in the Rule column is listed with its estimated precision in
predicting the hockey target.
In the above example,
in the first row, 200 documents have been assigned to the HOCKEY newsgroup,
and 600 total documents exist in the data set. The target value is 1,
instead of “HOCKEY,” because you set the hockey variable
to be the target instead of the newsgroup variable.
Seventy of the documents were assigned to the rule “team”
(69 were correctly assigned). This means that if a document contains
the word “team,” and you assign all those documents
to the HOCKEY newsgroup, 69 out of 70 will be assigned correctly.
In the next row, there are 200 – 69 = 131 HOCKEY documents
left that can be evaluated for rule assignment, out of a total of
600 – 70 = 530 documents. In this second row, 23 documents
are correctly assigned to the rule “hockey.” This means
that if a document contains the word “hockey,” and you
assign all those documents to the HOCKEY newsgroup, 23 out of 23 will
be assigned correctly.
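The HOCKEY example differs from a perfectly precise rule in one respect: the rule “team” is assigned to one document that does not belong to the category. The Python sketch below works through that case. It uses the raw ratio of correct to assigned documents, which is an assumption made for illustration; the estimated precision that the node reports may be computed differently. The name `rule_precision` is hypothetical.

```python
def rule_precision(true_positive: int, matched: int) -> float:
    """Raw precision of a rule: correctly assigned / all assigned documents."""
    if matched <= 0:
        raise ValueError("a rule must be assigned to at least one document")
    if not 0 <= true_positive <= matched:
        raise ValueError("true_positive must be between 0 and matched")
    return true_positive / matched


# Counts taken from the first HOCKEY row in the text.
team_true_positive, team_matched = 69, 70
precision = rule_precision(team_true_positive, team_matched)

# Positives shrink by the correct assignments (69), but the pool of
# remaining documents shrinks by every document the rule claimed (70).
remaining_positive = 200 - team_true_positive
remaining_total = 600 - team_matched

print(f"{precision:.4f}", remaining_positive, remaining_total)
```

This prints 0.9857 together with 131 and 530, which match the counts given for the second row.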
-
Select the Score Rankings Overlay graph to view the following types of
information about the target variable:
-
Cumulative % Captured Response
Note: To change the statistic,
select one of the above options from the drop-down menu.
-
Select the Fit Statistics table for statistical information about the
hockey target variable.
-
Close the Results window.
-
Click the button for the Content Categorization Code property.
The Content Categorization Code window appears. The code provided
in this window is the code that is output for SAS Content Categorization
and is ready for compilation.
-
Click the button for the Change Target Values property.
The Change Target Values window appears.
You can use the Change Target Values window to improve the model.
-
Select one or more cells in the Assigned Target column, and select
a new target value.
-
Rerun the Text Rule Builder node, and then check whether your model
has been improved.