Text mining uncovers the underlying themes or concepts contained
in large document collections. Text mining applications have two phases:
exploring the textual data for its content, and then using the discovered
information to improve existing processes. Both phases are important,
and they are referred to as descriptive mining and predictive mining,
respectively.

Descriptive mining involves discovering the themes and concepts
that exist in a textual collection. For example,
many companies collect customers' comments from sources that include
the Web, e-mail, and contact centers. Mining the textual comments
includes providing detailed information about the terms, phrases,
and other entities in the textual collection; clustering the documents
into meaningful groups; and reporting the concepts that are discovered
in the clusters. Results from descriptive mining enable you to better
understand the textual collection.
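
The descriptive steps above (listing terms, clustering documents, and reporting the concepts in each cluster) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by any particular text mining product: the stop-word list, the Jaccard similarity measure, the greedy single-pass clustering, the 0.2 threshold, and the sample comments are all assumptions chosen to keep the example small.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real application would use a fuller one.
STOPWORDS = {"the", "a", "is", "was", "my", "and", "to", "for", "i"}

def terms(text):
    """Return the set of lowercase word tokens in text, minus stop words."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def jaccard(a, b):
    """Share of terms that two term sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(comments, min_similarity=0.2):
    """Greedy single-pass clustering.

    Each comment joins the first cluster whose seed (first member) is
    similar enough; otherwise it starts a new cluster.
    """
    term_sets = [terms(c) for c in comments]
    clusters = []  # each cluster is a list of comment indexes
    for i, current in enumerate(term_sets):
        for members in clusters:
            if jaccard(current, term_sets[members[0]]) >= min_similarity:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters, term_sets

def top_terms(members, term_sets, n=2):
    """Most frequent terms in a cluster; ties are broken alphabetically."""
    counts = Counter(t for i in members for t in term_sets[i])
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]

comments = [
    "The delivery was late",
    "Late delivery and damaged box",
    "Refund for my order is missing",
    "I asked for a refund",
]
clusters, term_sets = cluster(comments)
report = [top_terms(members, term_sets) for members in clusters]
for members, concepts in zip(clusters, report):
    print(members, concepts)
```

With these four sample comments, the sketch groups the two delivery complaints together and the two refund requests together, and it reports the most frequent terms of each group as a rough label for the concept behind it.
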

Predictive mining involves classifying the documents into categories
and using the information that is implicit
in the text for decision making. For example, you might want to identify
the customers who ask standard questions so that they receive an automated
answer. Additionally, you might want to predict whether a customer
is likely to buy again, or even whether you should spend more effort to
keep the customer.

Predictive modeling involves examining past data to predict future
results. Suppose that you have a customer data set that contains
information about past buying behaviors, along with customer comments.
You could build a predictive model that can be used to score new
customers, that is, to analyze new customers based on the data from past
customers. For example, if you are a researcher for a pharmaceutical
company, you know that hand-coding adverse reactions from doctors'
reports in a clinical study is a laborious, error-prone job. Instead,
you could create a model by using all your historical textual data,
noting which doctors' reports correspond to which adverse reactions.
After the model is constructed, the textual data can be processed
automatically by scoring new records as they come in. You would
only have to examine the "hard-to-classify" examples, and let the
computer handle the rest.
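
The scoring workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any product's algorithm: it uses a small multinomial naive Bayes classifier with add-one smoothing, and the function names, the 0.8 confidence threshold, and the sample reports are all hypothetical. Records whose best label falls below the threshold are set aside as the "hard-to-classify" examples for manual review.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_reports):
    """Build a model from historical (text, label) pairs."""
    term_counts = defaultdict(Counter)  # label -> term -> count
    label_counts = Counter()            # label -> number of reports
    for text, label in labeled_reports:
        label_counts[label] += 1
        term_counts[label].update(tokenize(text))
    vocabulary = {t for counts in term_counts.values() for t in counts}
    return term_counts, label_counts, vocabulary

def score(model, text):
    """Return (best_label, probability) for one new record."""
    term_counts, label_counts, vocabulary = model
    total_reports = sum(label_counts.values())
    log_scores = {}
    for label, n_reports in label_counts.items():
        log_p = math.log(n_reports / total_reports)
        denominator = sum(term_counts[label].values()) + len(vocabulary)
        for term in tokenize(text):
            if term in vocabulary:  # terms never seen in training are ignored
                log_p += math.log((term_counts[label][term] + 1) / denominator)
        log_scores[label] = log_p
    # Convert log scores to probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def triage(model, records, threshold=0.8):
    """Split records into automatically coded ones and ones needing review."""
    automatic, review = [], []
    for text in records:
        label, probability = score(model, text)
        target = automatic if probability >= threshold else review
        target.append((text, label))
    return automatic, review

history = [
    ("patient reported severe headache and dizziness", "headache"),
    ("mild headache after second dose", "headache"),
    ("skin rash on both arms", "rash"),
    ("itchy rash and redness on skin", "rash"),
]
model = train(history)
automatic, review = triage(
    model,
    ["patient reported headache after dose", "patient felt unwell"],
)
print("coded automatically:", automatic)
print("needs manual review:", review)
```

In this sketch, the first new report shares several terms with the historical headache reports and is coded automatically, whereas the second contains almost no known terms, so its score stays below the threshold and it is routed to a person.
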

Both of these aspects of text mining share some of the same requirements.
Namely, textual documents that human beings can easily understand
must first be represented in a form that can be mined by the software.
The raw documents need processing before the patterns and relationships
that they contain can be discovered. Although the human mind comprehends
chapters, paragraphs, and sentences, computers require structured
(quantitative or qualitative) data. As a result, an unstructured document
must be converted into a structured form before it can be mined.
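
One common structured form is a document-term matrix, in which each row is a document and each column counts how often a term occurs in it. The sketch below shows the idea in Python; it is an assumption made for illustration (simple lowercase word tokens, raw counts, hypothetical sample documents), and a real system would typically add steps such as stemming, stop-word removal, and term weighting.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, rows), with one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

documents = [
    "The delivery was late.",
    "Late delivery, late refund.",
]
vocabulary, rows = document_term_matrix(documents)
print(vocabulary)  # ['delivery', 'late', 'refund', 'the', 'was']
print(rows)        # [[1, 1, 0, 1, 1], [1, 2, 1, 0, 0]]
```

Each unstructured sentence becomes a row of numbers over a shared vocabulary, which is the kind of quantitative input that clustering and classification methods can work with.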