You can use the SAS Text Miner TMSPELL procedure to create
a new synonym data set. The TMSPELL procedure evaluates all the terms,
automatically identifies which terms are misspellings, and creates
synonyms that map misspelled terms to their correctly spelled parent terms.
To create a new synonym data set:
- Select the Utility tab and drag a SAS Code node into the diagram workspace. Connect the Text Miner — Symptom Text node to the SAS Code node. Right-click the SAS Code node, and select Rename. Type SAS Code — TMSPELL in the Node Name box. Click OK.
- Select the arrow that connects the Text Miner — Symptom Text node to the SAS Code — TMSPELL node. Note the value of the Terms export Table property. You will use this value for the DATA= option when you enter the TMSPELL code in a later step.
Note: The libref EMWS in the Terms Table property depends on the diagram number within your SAS Enterprise Miner project. If your diagram is the first one created, then the libref is EMWS. For the second diagram it is EMWS1, for the third it is EMWS2, and so on.
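If you are unsure which libref and table name apply to your diagram, one way to check is to query the SAS dictionary tables from a SAS Code node or a SAS session attached to the project. The following is a minimal sketch. It assumes that the diagram is the first one in the project (libref EMWS) and that the terms table name ends in TERMS, as in the EMWS.TEXT2_TERMS value used in this example. Adjust the libref if your diagram is not the first one.

```sas
/* Sketch: list candidate terms tables in the diagram library.       */
/* Assumption: the libref is EMWS (first diagram in the project).    */
/* Dictionary tables store LIBNAME and MEMNAME values in uppercase.  */
proc sql;
   select libname, memname
      from dictionary.tables
      where libname = 'EMWS'
        and memname like '%TERMS';
quit;
```

The MEMNAME value that is returned, prefixed by the libref, is the value to supply for the DATA= option.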
- Select the SAS Code — TMSPELL node, and click the button for the Code Editor property in the Properties panel.
- Enter the following code in the Code Editor:
    proc tmspell
       data=emws.text2_terms
       out=mylib.vaerextsyns
       dict=mylib.engdict
       maxchildren=10
       minparents=8
       maxspedis=15
    ;
    run;
- Click the button to save the changes.
- Click the button to run the SAS Code — TMSPELL node. Click Yes in the Confirmation dialog box.
- Click OK in the dialog box that indicates that the node has finished running.
- Close the Training Code — Code Node window.
- From the SAS Enterprise Miner window, select View > Explorer. The Explorer window opens.
- Click Mylib, and then select Vaerextsyns.
Note: If the Mylib library is already selected and you do not see the Vaerextsyns data set, you might need to click Get Details or refresh the Explorer window to see the Vaerextsyns data set.
- Double-click the Mylib.Vaerextsyns table to examine it.
Here is a list of what the Vaerextsyns columns provide:
  - Term is the misspelled word.
  - Parent is a guess at the word that was meant.
  - Childndocs is the number of documents that contained the term.
  - # Documents is the number of documents that contained the parent.
  - Minsped indicates how close the term is to its parent in spelling.
  - Dict indicates whether the term is a legitimate English word. Legitimate words can still be deemed misspellings, but only if they occur rarely and are very close in spelling to a frequent target term.
For example, Observation 52 shows abdomin to be a misspelling of abdominal. Three documents contain abdomin, while 77 documents contain the parent, abdominal (this is not shown in the image). The term abdomin is not a legitimate English word, and an example text that contains that misspelling is "20 mins later, upper !!abdomin!!". Note that double exclamation marks (!!) precede and follow the child term in the example text so that you can see the term in context.
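If you prefer to review the results in SAS output rather than in the table viewer, a short program such as the following might help. This is a sketch only. It assumes that the variable names in Mylib.Vaerextsyns match the column names described above (Term, Parent, Childndocs, Minsped, and Dict); run PROC CONTENTS first to confirm the actual names in your table. WORK.SYN_REVIEW is a hypothetical scratch table name.

```sas
/* Confirm the actual variable names before relying on them. */
proc contents data=mylib.vaerextsyns;
run;

/* Sketch: sort a copy by Minsped and print the first 25 rows.       */
/* Assumption: the variables are named as in the column list above.  */
proc sort data=mylib.vaerextsyns out=work.syn_review;
   by minsped;
run;

proc print data=work.syn_review(obs=25);
   var term parent childndocs minsped dict;
run;
```

Sorting into a WORK copy leaves the original synonym data set untouched.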
- Examine the Vaerextsyns table to see whether you disagree with any of the choices made. For this example, however, assume that the TMSPELL procedure has done a good enough job of detecting misspellings.
Note: You can edit the Vaerextsyns table with any SAS table editor, but you cannot edit it in the SAS Enterprise Miner GUI. You can change the parent for any misspelling that appears incorrect, or delete a row if the Term column contains a valid term.
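As an alternative to an interactive table editor, the same corrections can be made with a DATA step. The following is a minimal sketch, not a prescribed method. The term values ('validterm', 'misspelledterm', 'intendedterm') are hypothetical placeholders, the variable names Term and Parent are assumed to match the column names, and WORK.VAEREXTSYNS_BACKUP is a hypothetical backup table name.

```sas
/* Keep a backup copy before changing the synonym data set. */
data work.vaerextsyns_backup;
   set mylib.vaerextsyns;
run;

/* Sketch: the quoted term values are hypothetical placeholders. */
data mylib.vaerextsyns;
   set mylib.vaerextsyns;
   /* Remove a row whose Term is actually a valid word. */
   if term = 'validterm' then delete;
   /* Replace a parent that was guessed incorrectly. */
   if term = 'misspelledterm' then parent = 'intendedterm';
run;
```

Because the length of Parent is fixed by the existing table, a replacement value that is longer than that length is truncated, so check the result after editing.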
- Close the Mylib.Vaerextsyns table and the Explorer window.