You can use the SAS Text Miner %TEXTSYN macro to create a new synonym data set. The %TEXTSYN macro evaluates all the terms, automatically identifies which terms are misspellings, and creates synonyms that map each misspelled term to its correctly spelled parent term.
To create a new synonym data set:
-
Select the Utility tab on the node toolbar and drag a SAS Code node into the diagram workspace.
-
Right-click the SAS Code node, and select Rename.
-
Enter SAS Code — %TEXTSYN in the Node Name field, and then click OK.
-
Connect the Text Filter — Symptom Text node to the SAS Code — %TEXTSYN node.
-
Select the SAS Code — %TEXTSYN node, and then click the button for the Code Editor property in the Properties Panel. The Code Editor window appears.
-
Enter the following code in the Code Editor:
%textsyn( termds=<libref>.<nodeID>_terms
, docds=&em_import_data
, outds=&em_import_transaction
, textvar=symptom_text
, mnpardoc=8
, mxchddoc=10
, synds=mylib.vaerextsyns
, dict=mylib.engdict
, maxsped=15
) ;
Note: You will need to replace <libref> and <nodeID> in the first line in the above code with the correct library name and node ID. To determine what these values are, close the Code Editor window, and then select the arrow that connects the Text Filter — Symptom Text node to the SAS Code — %TEXTSYN node. The value for <libref> will be the first part of the table name that appears in the Properties panel, such as emws, emws2, and so on. The node ID will appear after the value for <libref>, and will be TextFilter, TextFilter2, and so on. After you determine the value for <libref> and <nodeID>, a possible first line might be termds=emws2.textfilter2_terms. Your libref and node ID values could differ depending on how many Text Filter nodes and diagrams have been created in your workspace.
For details about the %TEXTSYN macro, see SAS Text Miner Help documentation.
-
After you have added the %TEXTSYN macro code to the Code Editor window and replaced <libref> and <nodeID> with your own values, click the save button to save the changes.
-
Click the run button to run the SAS Code — %TEXTSYN node.
-
Click Yes in the Confirmation dialog box.
-
Click OK in the dialog box that indicates that the node has finished running.
-
Close the Code Editor window.
-
Select View > Explorer from the main menu. The Explorer window appears.
-
Click Mylib in the SAS Libraries tree, and then select Vaerextsyns.
Note: If the Mylib library is already selected and you do not see the Vaerextsyns data set, you might need to click Show Project Data or refresh the Explorer window to see the Vaerextsyns data set.
-
Double-click Vaerextsyns to see its contents.
Here is a list of what the Vaerextsyns columns provide:
-
Term is the misspelled word.
-
parent is a guess at the word that was meant.
-
example1 and example2 are two examples of the term in a document.
-
childndocs is the number of documents that contained that term.
-
numdocs is the number of documents that contained the parent.
-
minsped is an indication of how close the terms are.
-
dict indicates whether the term is a legitimate English word. Legitimate words can still be deemed misspellings, but only if they occur rarely and are very close in spelling to a frequent target term.
For example, Observation 117 shows antibotics to be a misspelling of antibiotics. Four documents contain antibotics, and 745 documents contain the parent. Note that double exclamation marks (!!) both precede and follow the child term in the example text so that you can see the term in context.
-
Examine the Vaerextsyns
table to see whether you disagree with some of the choices made. For
this example, however, assume that the %TEXTSYN macro has done a good
enough job of detecting misspellings.
Note: The Vaerextsyns table can
be edited using any SAS table editor. You cannot edit this table in
the SAS Enterprise Miner GUI. You can change a parent for any misspellings
that appear incorrect or delete a row if the Term column contains
a valid term.
-
Close the Mylib.Vaerextsyns table and the Explorer window.
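If you prefer to review or edit the synonym table in code rather than in a table editor, the following is a minimal sketch of one way to do it. It assumes that the Mylib libref is assigned in your SAS session and that Mylib.Vaerextsyns contains the columns listed above. The output table name vaerextsyns_edited and the term values in the DATA step are hypothetical placeholders, not values from this example data.

```sas
/* Sketch only: assumes the MYLIB libref is assigned and that      */
/* MYLIB.VAEREXTSYNS has the columns described above.              */

/* 1. Make a working copy ordered by minsped for review. */
proc sort data=mylib.vaerextsyns out=work.syn_review;
   by minsped;
run;

/* Print the first 25 rows with the columns that matter most. */
proc print data=work.syn_review (obs=25);
   var term parent childndocs numdocs minsped dict;
run;

/* 2. Write an edited copy so that the original table is kept.     */
/*    VAEREXTSYNS_EDITED and the quoted terms are hypothetical.    */
data mylib.vaerextsyns_edited;
   set mylib.vaerextsyns;

   /* Delete a row whose Term is actually a valid word. */
   if lowcase(term) = 'validterm' then delete;

   /* Replace a parent that looks incorrect. A new value longer    */
   /* than the existing length of PARENT would be truncated.       */
   if lowcase(term) = 'misspeltterm' then parent = 'correctparent';
run;
```

Writing the changes to a separate table keeps the original %TEXTSYN output available for comparison. If you want later steps to use the edited synonyms, point them at the edited table instead of Mylib.Vaerextsyns.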