As demonstrated in the previous chapter, SAS Text Miner
does a good job of finding themes that are clear in the data. However,
when the data needs cleaning, SAS Text Miner can be less effective
at uncovering useful themes. In this chapter, you will encounter manually
edited data that contains many misspellings and abbreviations, and
you will clean the data to get better results.
The README.TXT file provided on the VAERS site contains a list of
abbreviations commonly used in the adverse event reports. SAS Text Miner
enables you to specify a synonym list, and a VAER_ABBREV synonym list is
provided for you in the Getting Started with SAS Text Miner 4.1 zip file.
To create this synonym list, the abbreviations list from README.TXT
was copied into a Microsoft Excel file, manually edited there, and then
imported into a SAS data set. For example, CT/CAT was marked as
equivalent to computerized axial tomography. For more information about
the preprocessing steps, see
Vaccine Adverse Event Reporting System Data Preprocessing.
For more information about importing data into a SAS
data set, see the following documentation resource:
http://support.sas.com/documentation/
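
As a rough illustration of what a synonym list does, the following Python sketch maps abbreviations to a parent term before analysis. It is not SAS code and does not reflect how SAS Text Miner stores or applies the VAER_ABBREV data set. Only the CT/CAT entry comes from the text above; the dictionary layout, the function name `apply_synonyms`, and the sample report are assumptions made for the example.

```python
# Hypothetical synonym list: each term (abbreviation) maps to a parent term.
# Only the CT/CAT equivalence is taken from the text; the rest is illustrative.
SYNONYMS = {
    "ct": "computerized axial tomography",
    "cat": "computerized axial tomography",
}


def apply_synonyms(tokens, synonyms=SYNONYMS):
    """Replace each token that has a synonym entry with its parent term."""
    return [synonyms.get(token.lower(), token) for token in tokens]


# Made-up report text, split on whitespace for simplicity.
report = "patient had CT scan"
normalized = apply_synonyms(report.split())
print(normalized)
```

A case-insensitive lookup is used here so that "CT" and "ct" resolve to the same parent. A real synonym list would also need care with ambiguous abbreviations, which this sketch ignores.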
You will perform the following tasks to clean the text and examine
the results:
- Use a synonym data set from the Getting Started with SAS Text Miner 4.1
zip file.
- Create a new synonym data set using the SAS Code node and the %TEXTSYN macro.
The %TEXTSYN macro runs through all the terms, automatically identifies
which ones are misspellings, and creates synonyms that map each misspelled
term to its correctly spelled form.
- Examine results using merged synonym data sets.
- Create a stop list to define which words are removed from the analysis.
A stop list is a collection of low-information or extraneous words, saved
as a SAS data set, that you want to remove from the text.
- Explore whether cleaning the text improved the clustering results.
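
The misspelling and stop list tasks in this list can be sketched conceptually in Python. This is a stand-in heuristic, not the algorithm that the %TEXTSYN macro uses: it assumes that a frequent term is correctly spelled, and that a rare term closely resembling a frequent one is a misspelling of it. The function names, the thresholds (`min_parent_count`, `max_child_count`, `cutoff`), and the sample tokens are all hypothetical.

```python
from collections import Counter
from difflib import get_close_matches


def build_synonyms(tokens, min_parent_count=3, max_child_count=1, cutoff=0.8):
    """Map rare terms to a similar frequent term (assumed correct spelling)."""
    counts = Counter(tokens)
    # Assumption: terms seen at least min_parent_count times are spelled correctly.
    parents = [term for term, n in counts.items() if n >= min_parent_count]
    synonyms = {}
    for term, n in counts.items():
        if n <= max_child_count:
            # Closest frequent term by difflib similarity, if any passes the cutoff.
            match = get_close_matches(term, parents, n=1, cutoff=cutoff)
            if match:
                synonyms[term] = match[0]
    return synonyms


def clean(tokens, synonyms, stop_list):
    """Apply the synonym mapping, then drop every term on the stop list."""
    mapped = [synonyms.get(token, token) for token in tokens]
    return [token for token in mapped if token not in stop_list]


# Made-up tokens containing two misspellings and one low-information word.
tokens = "fever fever fever feever the rash rash rash rassh the swelling".split()
synonyms = build_synonyms(tokens)
cleaned = clean(tokens, synonyms, stop_list={"the"})
print(synonyms)
print(cleaned)
```

In this sketch the synonym mapping is applied before the stop list, so a misspelled stop word would still be removed once it is mapped to its parent. Whether that ordering matches a given SAS Text Miner configuration is not something the sketch establishes.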