This example shows you how to use the Text Parsing node to identify
terms and their instances in a data set that contains text. This example
assumes that SAS Enterprise Miner is running, and that a diagram workspace
has been opened in a project. For information about creating a project
and a diagram, see Setting Up Your Project.
-
The SAS data set SAMPSIO.ABSTRACT contains the titles and text of
abstracts from conferences. Create the ABSTRACT data source and add it
to your diagram workspace. Set the Role value of the TEXT and TITLE
variables to Text.
-
Select the Text Mining tab on the toolbar, and drag a Text Parsing node
into the diagram workspace.
-
Connect the ABSTRACT data source to the Text Parsing node.
-
In the diagram workspace, right-click the Text Parsing node and select
Run. Click Yes in the Confirmation dialog box that appears.
-
Click Results in the Run Status dialog box when the node finishes
running. The Results window displays a variety of tabular and graphical
output to help you analyze the terms and their instances in the ABSTRACT
data source.
-
Sort the terms in the Terms table by frequency, and then select the term
“software.” As the Terms table illustrates, the term “software” is a noun
that occurs in 494 documents in the ABSTRACT data source, and appears a
total of 881 times.
When you select a term in the Terms table, the point corresponding to
that term in the Text Parsing Results plots is highlighted.
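The Terms table reports two different counts for each term: the number of documents that contain it and the total number of times it appears. The following Python sketch illustrates the difference on a toy collection. The documents are invented for illustration and are not taken from SAMPSIO.ABSTRACT, and the whitespace tokenization is an assumption that is much simpler than what the Text Parsing node does (it performs no stemming, part-of-speech tagging, or entity detection).

```python
from collections import Counter

# Toy documents, invented for illustration (not from SAMPSIO.ABSTRACT).
docs = [
    "software tools help software teams",
    "statistical software for data analysis",
    "data analysis of survey data",
]

freq = Counter()      # total occurrences of each term across all documents
num_docs = Counter()  # number of documents that contain each term

for doc in docs:
    # Assumption: naive lowercase + whitespace tokenization.
    tokens = doc.lower().split()
    freq.update(tokens)
    # A set counts each term at most once per document.
    num_docs.update(set(tokens))

# Print term, total frequency, and document count for the most frequent terms.
for term, count in freq.most_common(3):
    print(term, count, num_docs[term])
```

In this toy collection, a term that is repeated within a single document has a frequency larger than its document count, which is the same relationship that the table shows for “software” (881 occurrences in 494 documents).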
-
Select the Number of Documents by Frequency plot, and position the
cursor over the highlighted point for information about the term
“software.”
Similar information is also presented in a ZIPF plot.
The Attribute by Frequency chart shows that Alpha has the highest
frequency among attributes in the document collection.
The Role by Freq chart illustrates that Noun has the highest frequency
among roles in the document collection.
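The two charts summarize term frequency by group. As a rough illustration of that kind of aggregation, the following Python sketch totals frequency by role and by attribute. The rows, role labels, and counts are hypothetical, and the helper `total_by` is invented for this example; how the node computes its charts internally is not described here.

```python
from collections import defaultdict

# Hypothetical rows of a terms table: (term, role, attribute, frequency).
# All values are invented for illustration.
terms = [
    ("software", "Noun", "Alpha", 12),
    ("analyze", "Verb", "Alpha", 5),
    ("data", "Noun", "Alpha", 9),
    ("2010", "Num", "Num", 3),
    ("new", "Adj", "Alpha", 4),
]

def total_by(rows, index):
    """Sum the frequency column (position 3), grouped by the column at `index`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[index]] += row[3]
    return dict(totals)

freq_by_role = total_by(terms, 1)
freq_by_attribute = total_by(terms, 2)

# The group with the largest total frequency in each breakdown.
top_role = max(freq_by_role, key=freq_by_role.get)
top_attribute = max(freq_by_attribute, key=freq_by_attribute.get)
print(top_role, freq_by_role[top_role])
print(top_attribute, freq_by_attribute[top_attribute])
```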
-
Return to the Terms table, and notice that the term “software” is kept
in the text parsing analysis. This is illustrated by the value of Y in
the Keep column. Notice that not all terms are kept when you run the
Text Parsing node with default settings.
The Text Parsing node enables you to gather statistical data about the
terms in a document collection. It also enables you to modify your
output set of parsed terms by dropping terms that are a certain part of
speech, type of entity, or attribute. Scroll down the list of terms in
the Terms table and notice that many of the terms with a role other than
Noun are kept. Assume that you want to limit your text parsing results
to terms with a role of Noun.
-
Close the Results window.
-
Select the Text Parsing node, and then click the button for the
Ignore Parts of Speech property.
-
In the Ignore Parts of Speech dialog box, select all parts of speech
except for Noun by holding down CTRL and clicking each option. Click OK.
Notice that the value for the Ignore Parts of Speech property is updated
with your selection.
-
In addition to nouns, also keep noun groups. Set the Noun Groups
property to Yes.
-
Right-click the Text Parsing node and select Run. Click Yes in the
Confirmation dialog box that appears. Select Results in the Run Status
dialog box when the node has finished running.
Notice that the term “software” has a higher rank among terms with a
role of just “noun” or “noun group” than it did when other roles were
included. If you scroll down in the Terms table, you can see that only
terms with a Noun or Noun Group role are included.
As you would expect, there are fewer terms plotted in the Number of
Documents by Frequency plot.
Similarly, the total number of terms in the output results with an
attribute of Alpha has decreased, as can be seen in the Attribute by
Frequency chart.
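The effect of restricting the results to nouns and noun groups can be sketched in a few lines of Python. The rows and counts below are hypothetical, and `KEPT_ROLES` and `rank_terms` are names invented for this example. Note that the sketch lists the roles to keep, whereas the Ignore Parts of Speech property lists the roles to drop; the outcome is the same, but this is an illustration of the idea rather than the node's implementation.

```python
# Hypothetical terms table rows: (term, role, frequency).
# All values are invented for illustration.
terms = [
    ("use", "Verb", 15),
    ("software", "Noun", 12),
    ("new", "Adj", 10),
    ("data", "Noun", 9),
    ("data analysis", "Noun Group", 6),
]

# Assumption: keep only these roles, mirroring the settings in this example.
KEPT_ROLES = {"Noun", "Noun Group"}

def rank_terms(rows):
    """Map each term to its rank (1 = highest frequency)."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    return {term: rank for rank, (term, _role, _freq) in enumerate(ordered, start=1)}

rank_all = rank_terms(terms)

# Drop every term whose role is not in the kept set, then rank again.
kept = [row for row in terms if row[1] in KEPT_ROLES]
rank_kept = rank_terms(kept)

print(len(terms), "terms before filtering,", len(kept), "after")
print("rank of 'software' before:", rank_all["software"], "after:", rank_kept["software"])
```

Because terms with other roles no longer compete for position, a noun such as “software” can only keep or improve its rank after filtering, and the number of plotted terms can only stay the same or decrease.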