This book
describes an extended example that is intended to familiarize you
with the many features of SAS Text Miner. Each topic in this book
builds on the previous topic, so you must work through the chapters
in sequence. Several key components of the SAS Text Miner process
flow diagram are covered. In this step-by-step example, you learn
to do basic tasks in SAS Text Miner, such as how to create a project
and build a process flow diagram. In your diagram, you perform tasks
such as accessing data, preparing the data, building multiple predictive
models using text variables, and comparing the models. The extended
example in this book is designed to be used in conjunction with SAS
Text Miner software.
The Vaccine Adverse Event Reporting System (VAERS) data
is publicly available from the U.S. Department of Health and Human
Services (HHS). Anyone can download this data in comma-separated value
(CSV) format from http://vaers.hhs.gov. There
are separate CSV files for every year since the U.S. started collecting
the data in 1990. Anyone can submit a report, but most reports
come from vaccine manufacturers (42%) and health care providers (30%).
Providers are required to report any contraindicating events for a
vaccine and any very serious complications. In the context of a vaccine,
a contraindicating event is a condition or a factor that increases
the risk of using the vaccine. For more information, see the “Guide to
Interpreting Case Report Information Obtained from the Vaccine Adverse
Event Reporting System (VAERS)”, available from HHS
(http://vaers.hhs.gov/info.htm).
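Because the data arrives as one CSV file per year, it can help to read
several years in a single pass before importing them into a project. The
following Python sketch is one way to do that outside of SAS Text Miner.
The file-name pattern, the latin-1 encoding, and the function name
read_yearly_reports are assumptions made for illustration; adjust them to
match the files that you actually downloaded.

```python
import csv
from pathlib import Path


def read_yearly_reports(data_dir, years, pattern="{year}VAERSDATA.csv"):
    """Yield (year, row) pairs from each yearly CSV file found in data_dir.

    The default file-name pattern is an assumption; pass a different
    pattern if your downloaded files are named differently.
    """
    for year in years:
        path = Path(data_dir) / pattern.format(year=year)
        if not path.exists():
            continue  # skip years that have not been downloaded
        # latin-1 is assumed here so that unexpected bytes do not stop the read
        with path.open(newline="", encoding="latin-1") as handle:
            for row in csv.DictReader(handle):
                yield year, row
```

Each row is returned as a dictionary keyed by the column names in the
file's header line, so the reader does not depend on column order.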
See the following in the Getting Started Examples zip file:
- ReportableEventsTable.pdf for a complete list of reportable events
  for each vaccine
- VAERS README file for a data dictionary and list of abbreviations used
The following figure shows the first 8 columns in the first 10 rows
of the table of VAERS data for 2005. Included are a unique identifier,
the state of residence, and the recipient's age. Additional columns (not
in the following figure) include an unstructured text string, SYMPTOM_TEXT,
that contains the reported problem, as well as specific symptoms and a
symptom counter.
In analyzing adverse reactions to medications,
both in clinical trials and in post-release monitoring of reactions,
keyword or word-spotting techniques combined with a thesaurus are
most often used to characterize the symptoms. The Coding Symbols for
a Thesaurus of Adverse Reaction Terms (COSTART) has traditionally been
the categorization technique of choice, but it has been largely replaced
by the Medical Dictionary for Regulatory Activities (MedDRA). COSTART
is a terminology developed by the U.S. Food and Drug Administration (FDA)
for the coding, filing, and retrieving of post-marketing adverse reaction
reports. It provides a keyword-spotting technique that deals with the
variations in terms used by those who submit adverse event reports to
the FDA.
In the case of vaccinations, the COSTART system has been used. The FDA has
used a program to extract COSTART categories from the SYMPTOM_TEXT
column. Here are some of the variables used by the program:
- SYMPTOM_TEXT: reported symptom text
- SYM01-SYM20: extracted COSTART categories
- SYM_CNT: number of SYM fields that are populated for a particular
  vaccination
- VAERS_ID: VAERS identification number
If you open the VAERS data for 2005, you can see that VAERS_ID 231844
has a SYMPTOM_TEXT of “101 fever, stiff neck, cold”. The program has
automatically extracted the COSTART terms that appear in columns SYM01
through SYM20 in the data file.
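To make the keyword-spotting idea concrete, here is a minimal Python
sketch of how a thesaurus lookup could fill the SYM01-SYM20 and SYM_CNT
fields from SYMPTOM_TEXT. It is not the FDA's program. The thesaurus
entries and category labels are hypothetical and are not taken from
COSTART, and the function names are invented for this illustration.

```python
import re

# Hypothetical thesaurus: surface phrase -> category label.
# These entries are illustrative only and are NOT real COSTART terms.
THESAURUS = {
    "fever": "FEVER",
    "temperature": "FEVER",
    "stiff neck": "NECK RIGID",
    "cold": "RHINITIS",
}


def extract_categories(symptom_text, thesaurus=THESAURUS, max_fields=20):
    """Return the category labels spotted in the text, in order of appearance."""
    text = symptom_text.lower()
    hits = []
    for phrase, category in thesaurus.items():
        # Whole-word match so that "cold" does not match inside "scolded"
        match = re.search(r"\b" + re.escape(phrase) + r"\b", text)
        if match:
            hits.append((match.start(), category))
    ordered = []
    for _, category in sorted(hits):
        if category not in ordered:  # two phrases can map to one category
            ordered.append(category)
    return ordered[:max_fields]


def to_record(vaers_id, symptom_text):
    """Build a row with SYM01..SYM20 and SYM_CNT filled from the text."""
    categories = extract_categories(symptom_text)
    record = {"VAERS_ID": vaers_id, "SYMPTOM_TEXT": symptom_text}
    for i in range(1, 21):
        record[f"SYM{i:02d}"] = categories[i - 1] if i <= len(categories) else ""
    record["SYM_CNT"] = len(categories)  # number of populated SYM fields
    return record


record = to_record(231844, "101 fever, stiff neck, cold")
print(record["SYM_CNT"], record["SYM01"], record["SYM02"], record["SYM03"])
```

A lookup like this only finds phrases that the thesaurus anticipates,
which is why the example later asks whether the free text holds
information that the extracted terms miss.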
The VAERS table contains other columns, including a variety of flags
that indicate the seriousness of the event (life-threatening illness,
emergency room or doctor visit, hospitalized, disability, recovered), the
number of days after the vaccination that the event occurred, how many
different vaccinations were given, and a list of codes (VAX1-VAX8) for each
of the shots given. There are also columns indicating where the shots
were given, who funded them, what medications the patient was taking,
and so on.
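The seriousness flags can be combined into a single yes/no indicator
for reports that involved a death, a life-threatening illness, a
hospital stay, or a disability. The Python sketch below shows one way
to do this. The flag column names and the “Y” coding are assumptions
that you should check against the VAERS README file, and the sample
rows are made up for illustration.

```python
import csv
import io

# Assumed flag column names; verify them against the VAERS README file.
SERIOUS_FLAGS = ("DIED", "L_THREAT", "HOSPITAL", "DISABLE")


def is_serious(row, flags=SERIOUS_FLAGS):
    """Return True when any seriousness flag is set to 'Y' for this report."""
    return any((row.get(flag) or "").strip().upper() == "Y" for flag in flags)


# Made-up rows for illustration only; these are not real VAERS reports.
SAMPLE_CSV = """VAERS_ID,DIED,L_THREAT,HOSPITAL,DISABLE
1,,,,
2,,,Y,
3,,Y,,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
serious_ids = [row["VAERS_ID"] for row in rows if is_serious(row)]
print(serious_ids)
```

Which flags count as serious is a modeling choice; the tuple of flag
names is kept as a parameter so that a different definition can be
tried without changing the function.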
The README
file taken from the VAERS Web site decodes the vaccine abbreviations.
Note that some vaccinations contain multiple vaccines (for example,
DTP contains diphtheria, tetanus, and pertussis). Here is a portion
of the README file:
As you
go through this example, imagine you are a researcher
trying to discover what information is contained within this data
set and how you can use it to better understand the adverse reactions
that children and adults are experiencing from their vaccination shots.
These adverse reactions might be caused by one or more of the vaccinations
they are given, or they might be induced by an improper procedure
from the administering lab (for example, a non-sanitized needle).
Some of them will be totally unrelated. For example, perhaps someone
happened to get a cold just after receiving a flu vaccine and reported
it. You might want to investigate serious reactions that required
a hospital stay or caused a lifetime disability or death, and find
answers to the following questions:
- What are some categories of reactions that people are experiencing?
- How do these relate to the vaccination that was given, the age of
  the recipient, the place they received the vaccine, or other pertinent
  information?
- What factors influence whether a reaction becomes serious?
- How well are these factors captured by the automatically extracted
  COSTART terms?
- Is there any important information contained in the adverse reaction
  text that is not represented by the COSTART terms?
When you
are finished with this example, your process flow diagram should resemble
the one shown here: