Text mining uncovers the underlying themes or concepts
that are contained in large document collections. Text mining applications
have two phases: exploring the textual data for its content and then
using the discovered information to improve existing processes. Both
phases are important; they are referred to as descriptive mining and
predictive mining, respectively.
Descriptive mining
involves discovering the themes and concepts that exist in a textual
collection. For example, many companies collect customers' comments
from sources that include the Web, e-mail, and contact centers. Mining
the textual comments includes providing detailed information about
the terms, phrases, and other entities in the textual collection;
clustering the documents into meaningful groups; and reporting the
concepts that are discovered in the clusters. Results from descriptive
mining enable you to better understand the textual collection.
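The descriptive steps above (listing terms, grouping documents, and reporting what each group is about) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample comments, the stopword list, the similarity threshold, and the function names `terms`, `group_comments`, and `describe` are all hypothetical, and the greedy grouping shown here stands in for whatever clustering method a real text mining product uses.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real application would use a fuller one.
STOPWORDS = {"the", "a", "is", "was", "my", "and", "to", "it", "i"}

def terms(text):
    """Return the set of lowercase terms in a comment, minus stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def jaccard(a, b):
    """Share of terms two comments have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_comments(comments, threshold=0.3):
    """Greedy grouping: a comment joins the first group whose first
    comment is similar enough; otherwise it starts a new group."""
    groups = []  # each group is a list of (comment, term_set) pairs
    for comment in comments:
        term_set = terms(comment)
        for group in groups:
            if jaccard(term_set, group[0][1]) >= threshold:
                group.append((comment, term_set))
                break
        else:
            groups.append([(comment, term_set)])
    return groups

def describe(group, top=3):
    """Report the terms that occur in the most comments of a group."""
    counts = Counter(term for _, term_set in group for term in term_set)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top]]

# Made-up customer comments, for illustration only.
comments = [
    "The battery died after a week",
    "My battery died too",
    "Delivery was late",
    "Late delivery and damaged box",
]
groups = group_comments(comments)
for group in groups:
    print(describe(group), [comment for comment, _ in group])
```

With these sample comments, the sketch separates the battery complaints from the delivery complaints and labels each group with its most common terms.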
Predictive mining
involves classifying the documents into categories and using the information
that is implicit in the text for decision making. For example, you
might want to identify the customers who ask standard questions so
that they receive an automated answer. In addition, you might want
to predict whether a customer is likely to buy again, or whether you
should spend more effort to keep the customer.
Predictive modeling involves examining
past data to predict future results. Suppose that you have a customer data
set that contains information about past buying behaviors, along with
customer comments. You could build a predictive model that can be
used to score new customers, that is, to analyze new customers
based on the data from past customers. As another example, if you are a researcher
for a pharmaceutical company, you know that hand-coding adverse reactions
from doctors' reports in a clinical study is a laborious, error-prone
job. Instead, you could create a model by using all your historical
textual data, noting which doctors' reports correspond to which adverse
reactions. When the model is constructed, processing the textual data
can be done automatically by scoring new records that come in. You
would just have to examine the "hard-to-classify" examples, and let
the computer handle the rest.
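The adverse-reaction workflow described above can be sketched as a small text classifier that scores new reports and sets aside the ones it is unsure about. The sketch below assumes a multinomial naive Bayes model with add-one smoothing, which is a common choice but not necessarily the method any particular product uses. The training reports, the labels, the 0.75 review threshold, and the names `NaiveBayesScorer` and `triage` are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesScorer:
    """Minimal multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.term_counts[label].update(tokenize(text))
        self.vocabulary = {t for c in self.term_counts.values() for t in c}
        return self

    def score(self, text):
        """Return (best_label, probability) for one new record."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            log_p = math.log(n / total)
            for term in tokenize(text):
                if term in self.vocabulary:  # ignore terms never seen in training
                    log_p += math.log((counts[term] + 1) / denom)
            log_scores[label] = log_p
        # Convert log scores to probabilities that sum to 1.
        peak = max(log_scores.values())
        weights = {lab: math.exp(s - peak) for lab, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())

def triage(model, text, threshold=0.75):
    """Code confident records automatically; flag the rest for a person."""
    label, probability = model.score(text)
    action = "auto" if probability >= threshold else "review"
    return action, label, probability

# Made-up historical reports that were hand-coded in the past.
history = [
    ("patient reported severe headache after dose", "headache"),
    ("headache and dizziness on day two", "headache"),
    ("mild skin rash on both arms", "rash"),
    ("itchy rash appeared after dose", "rash"),
]
model = NaiveBayesScorer().fit([t for t, _ in history], [l for _, l in history])

clear_case = triage(model, "severe headache this morning")
hard_case = triage(model, "reaction after dose")
print(clear_case)
print(hard_case)
```

In this sketch the first report is coded automatically because the model is confident, and the second is routed to a reviewer because its terms appear about equally often under both labels.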
Both of these aspects
of text mining share a common requirement: textual
documents that human beings can easily understand must first be represented
in a form that can be mined by the software. The raw documents need
processing before the patterns and relationships that they contain
can be discovered. Although the human mind comprehends chapters, paragraphs,
and sentences, computers require structured (quantitative or qualitative)
data. As a result, an unstructured document must be converted into
a structured form before it can be mined.
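One common structured form is a term-by-document table of counts, in which each document becomes a row of numbers. The following minimal sketch shows that conversion. It is not the representation used by any specific product, and the sample documents and the function names `tokenize` and `term_document_matrix` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into terms."""
    return re.findall(r"[a-z']+", text.lower())

def term_document_matrix(documents):
    """Return (vocabulary, rows), with one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Made-up documents, for illustration only.
docs = [
    "The battery died after one week.",
    "Great battery, great screen.",
]
vocabulary, rows = term_document_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row has one column per term in the vocabulary, so documents of different lengths end up as vectors of equal length that the software can compare, cluster, or feed to a predictive model.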