About the Tasks That You Will Perform

As demonstrated in the previous chapter, SAS Text Miner does a good job of finding themes that are clear in the data. But when the data needs cleaning, SAS Text Miner can be less effective at uncovering useful themes. In this chapter, you will encounter manually edited data that contains many misspellings and abbreviations, and you will clean the data to get better results.

The README.TXT file provided on the VAERS site contains a list of abbreviations commonly used in the adverse event reports. SAS Text Miner enables you to specify a synonym list, and a VAER_ABBREV synonym list is provided for you in the Getting Started with SAS Text Miner 4.2 zip file. To create this synonym list, the abbreviations list from README.TXT was copied into a Microsoft Excel file, manually edited there, and then imported into a SAS data set. For example, CT/CAT was marked as equivalent to computerized axial tomography. For more information about the preprocessing steps, see Vaccine Adverse Event Reporting System Data Preprocessing.
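To make the idea of a synonym list concrete, the following is a minimal sketch in Python, not SAS code and not the actual contents of VAER_ABBREV. It assumes a table in which each row pairs a term with the parent term that should replace it; the column names `term` and `parent`, the helper functions, and the sample tokens are all hypothetical and are used only for illustration.

```python
# Hypothetical synonym table: each row maps a term to its parent (canonical) term.
# Only the CT/CAT example comes from the text; the layout is an assumption.
SYNONYMS = [
    {"term": "ct", "parent": "computerized axial tomography"},
    {"term": "cat", "parent": "computerized axial tomography"},
]


def build_lookup(rows):
    """Turn the synonym rows into a case-insensitive term -> parent dictionary."""
    return {row["term"].lower(): row["parent"].lower() for row in rows}


def apply_synonyms(tokens, lookup):
    """Replace each token by its parent term when one is defined."""
    return [lookup.get(tok.lower(), tok.lower()) for tok in tokens]


LOOKUP = build_lookup(SYNONYMS)
normalized = apply_synonyms(["CT", "scan", "was", "normal"], LOOKUP)
print(normalized)
```

With a mapping like this, reports that use the abbreviation and reports that spell out the full phrase contribute to the same term, which is the effect the synonym list is meant to achieve.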

For more information about importing data into a SAS data set, see the following documentation resource: http://support.sas.com/documentation/.

You will perform the following tasks to clean the text and examine the results:

Use a synonym data set from the Getting Started with SAS Text Miner 4.2 zip file.
Create a new synonym data set using the SAS Code node and the TMSPELL procedure. The TMSPELL procedure makes a pass through all the terms, automatically identifies likely misspellings, and creates synonyms that map each misspelled term to its correctly spelled form.
Examine results using merged synonym data sets.
Create a stop list to define which words are removed from the analysis. A stop list is a collection of low-information or extraneous words—previously saved as a SAS data set—that you want to remove from the text.
Explore whether cleaning the text improved the clustering results.
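The cleaning tasks listed above can be sketched outside SAS as follows. This Python example is illustrative only: it does not reproduce the TMSPELL procedure's algorithm. It assumes that a rare term is a misspelling when it closely resembles a frequent term, and it uses the standard-library `difflib.get_close_matches` as a stand-in similarity measure. The thresholds, term counts, and function names are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches


def propose_spelling_synonyms(term_counts, min_parent_count=5,
                              max_child_count=2, cutoff=0.8):
    """Map rare terms to a similar frequent term (assumed heuristic, not TMSPELL)."""
    parents = [t for t, c in term_counts.items() if c >= min_parent_count]
    synonyms = {}
    for term, count in term_counts.items():
        if count > max_child_count:
            continue  # frequent enough to be treated as correctly spelled
        match = get_close_matches(term, parents, n=1, cutoff=cutoff)
        if match:
            synonyms[term] = match[0]
    return synonyms


def clean(tokens, synonyms, stop_words):
    """Apply the synonym mapping, then drop terms that are on the stop list."""
    mapped = [synonyms.get(tok, tok) for tok in tokens]
    return [tok for tok in mapped if tok not in stop_words]


# Made-up term frequencies for illustration.
term_counts = Counter({"fever": 40, "vaccine": 55, "feever": 1,
                       "vacine": 2, "rash": 12, "the": 90})

abbreviation_synonyms = {"ct": "computerized axial tomography"}
spelling_synonyms = propose_spelling_synonyms(term_counts)

# Merge the two synonym sets; spelling entries win if a term appears in both.
merged = {**abbreviation_synonyms, **spelling_synonyms}

stop_list = {"the", "and"}
cleaned = clean(["the", "vacine", "caused", "feever", "and", "ct", "ordered"],
                merged, stop_list)
print(spelling_synonyms)
print(cleaned)
```

Applying the synonym mapping before the stop list is a design choice in this sketch: it lets a stop list written with canonical spellings also catch misspelled variants.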