As demonstrated in the previous chapter, SAS Text Miner
does a good job of finding themes that are clear in the data. But
when the data needs cleaning, SAS Text Miner can be less effective
at uncovering useful themes. In this chapter, you will encounter manually
edited data that contains many misspellings and abbreviations, and
you will clean the data to get better results.
The README.TXT file provided on the VAERS site contains a list of
abbreviations commonly used in the adverse event reports. SAS Text
Miner enables you to specify a synonym list so that such abbreviations
are treated as equivalent to their full terms. A VAER_ABBREV synonym
list is provided for you in the Getting Started with SAS Text Miner 4.2
zip file. To create this synonym list, the abbreviations list from
README.TXT was copied into a Microsoft Excel file, manually edited
there, and then imported into a SAS data set. For example, CT/CAT was
marked as equivalent to computerized axial tomography. For more
information about the preprocessing steps, see Vaccine Adverse Event
Reporting System Data Preprocessing.
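To make the idea of a synonym list concrete, the following is a minimal Python sketch of how term-to-parent pairs can be applied to a list of tokens. It is an illustration only: it is not SAS code and does not describe how SAS Text Miner works internally. The names SYNONYMS and apply_synonyms are hypothetical, and the only mapping used is the CT/CAT example mentioned above.

```python
# Illustrative only: a synonym list represented as term -> parent pairs.
# The single mapping below comes from the CT/CAT example in the text.
SYNONYMS = {
    "ct": "computerized axial tomography",
    "cat": "computerized axial tomography",
}


def apply_synonyms(tokens, synonyms=SYNONYMS):
    """Replace each token that has a parent term with that parent.

    Lookup is case-insensitive; tokens without an entry are returned unchanged.
    """
    return [synonyms.get(token.lower(), token) for token in tokens]


if __name__ == "__main__":
    print(apply_synonyms(["CT", "scan", "was", "normal"]))
```

After the mapping is applied, both abbreviations count as occurrences of the same parent term, so they no longer split one concept across several terms.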
For more information about importing data into a SAS data set, see the
documentation at http://support.sas.com/documentation/.
You will perform the following tasks to clean the text and examine the
results:
- Use a synonym data set from the Getting Started with SAS Text Miner
  4.2 zip file.
- Create a new synonym data set using the SAS Code node and the TMSPELL
  procedure. The TMSPELL procedure will make a pass through all the
  terms, automatically identify which ones are misspellings, and create
  synonyms that map the misspelled terms to their correctly spelled
  forms.
- Examine results using merged synonym data sets.
- Create a stop list to define which words are removed from the
  analysis. A stop list is a collection of low-information or
  extraneous words (previously saved as a SAS data set) that you want
  to remove from the text.
- Explore whether cleaning the text improved the clustering results.
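The cleaning steps in this list can be sketched in a few lines of Python. This is a simplified heuristic under stated assumptions, not the TMSPELL procedure and not SAS code: it treats a rare term as a misspelling when it closely resembles a frequent term, and then removes stop-list words. The function names and the thresholds min_parent_count and cutoff are hypothetical choices made for this illustration.

```python
from collections import Counter
from difflib import get_close_matches


def build_spelling_synonyms(tokens, min_parent_count=3, cutoff=0.85):
    """Map each rare term to a similar frequent term, if one exists.

    A heuristic stand-in for automatic misspelling detection: terms seen
    fewer than min_parent_count times are compared against the frequent
    terms, and the closest match above cutoff becomes the parent.
    """
    counts = Counter(tokens)
    frequent = [term for term, n in counts.items() if n >= min_parent_count]
    synonyms = {}
    for term, n in counts.items():
        if n >= min_parent_count:
            continue  # frequent terms are assumed to be spelled correctly
        match = get_close_matches(term, frequent, n=1, cutoff=cutoff)
        if match:
            synonyms[term] = match[0]
    return synonyms


def clean(tokens, synonyms, stop_list):
    """Apply the synonym mapping, then drop every term in the stop list."""
    mapped = [synonyms.get(token, token) for token in tokens]
    return [token for token in mapped if token not in stop_list]


if __name__ == "__main__":
    tokens = ["fever"] * 3 + ["fevr", "the", "rash"]
    synonyms = build_spelling_synonyms(tokens)
    print(synonyms)
    print(clean(tokens, synonyms, stop_list={"the"}))
```

Applying the synonym mapping before the stop list means that a misspelled stop word is also removed once it has been mapped to its parent.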