Predictive modeling
tries to find good rules (models) for guessing (predicting) the values
of one or more variables in a data set from the values of other variables
in the data set. After a good rule has been found, it can be applied
to new data sets (scoring) that might or might not contain the variable(s)
that are being predicted. The various methods that find prediction
rules go by different names in different areas of research, such as
regression, function mapping, classification, discriminant analysis,
pattern recognition, concept learning, supervised learning, and so
on.
In the present context,
prediction does not mean forecasting time series. In time series analysis,
an entity is observed repeatedly over time, and past values are used
to forecast future values. For the predictive modeling methods in
Enterprise Miner, each case in a data set represents a different entity,
independent of the other cases in the data set. If the entities in
question are, for example, customers, then all of the information
pertaining to any one customer must be contained in a single case
in the data set. If you have a data set in which each customer is
described by multiple cases, you must first rearrange the data to
place all of the information about any one customer into the same
case. It is possible to fit some simple autoregressive models by preprocessing
the data using the LAG and DIF functions in the SAS Code node, but
Enterprise Miner has no convenient interface for making forecasts.
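As a rough illustration of the two preprocessing steps described above, the following Python sketch first collapses a table in which each customer appears in several rows into one case per customer, and then imitates the basic behavior of the SAS LAG and DIF functions (previous value, and current value minus previous value) for a single ordered series. The field names `customer_id` and `amount`, the chosen summaries, and the helper names are hypothetical; none of this is Enterprise Miner code.

```python
from collections import defaultdict

def one_case_per_customer(rows):
    """Collapse many rows per customer into a single case per customer."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customer_id"]].append(row["amount"])
    cases = []
    for customer_id, amounts in sorted(grouped.items()):
        # Hypothetical summaries; any per-customer aggregates could be used.
        cases.append({
            "customer_id": customer_id,
            "n_purchases": len(amounts),
            "total_amount": sum(amounts),
            "max_amount": max(amounts),
        })
    return cases

def lag(values):
    """Previous value of the series; the first element has no predecessor."""
    return ([None] + list(values[:-1])) if values else []

def dif(values):
    """Current value minus previous value; missing where no predecessor exists."""
    return [None if prev is None else cur - prev
            for cur, prev in zip(values, lag(values))]

rows = [
    {"customer_id": "A", "amount": 20.0},
    {"customer_id": "B", "amount": 5.0},
    {"customer_id": "A", "amount": 35.0},
    {"customer_id": "B", "amount": 7.5},
    {"customer_id": "A", "amount": 10.0},
]
cases = one_case_per_customer(rows)
```

After this step each customer occupies exactly one case, which is the layout the predictive modeling methods assume.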
Enterprise Miner provides
a number of tools for predictive modeling. Three of these tools are
the Regression node, the Decision Tree node, and the Neural Network
node. The methods used in these nodes come from several areas of research,
including statistics, pattern recognition, and machine learning. These
different areas use different terminology, so before discussing predictive
modeling methods, it will be helpful to clarify the terms used in
Enterprise Miner. The following list of terms is in logical, not alphabetical,
order. A more extensive alphabetical list of terms can be found in the
Glossary.
Synonym
A word having a meaning
similar to but not necessarily identical to that of another word in
at least one sense.
Case
A collection of information
about one of numerous entities represented in a data set. Synonyms:
observation, record, example, pattern, sample, instance, row, vector,
pair, tuple, fact.
Variable
One of the items of
information represented in numeric or character form for each case
in a data set. Synonyms: column, feature, attribute, coordinate, measurement.
Target
A variable whose value
is known in some currently available data, but will be unknown in
some future/fresh/operational data set. You want to be able to predict
or guess the values of the target variable(s) from other known variables.
Synonyms: dependent variable, response, observed values, training
values, desired output, correct output, outcome.
Input
A variable used to
predict or guess the value of the target variable(s). Synonyms: independent
variable, predictor, regressor, explanatory variable, carrier, factor,
covariate.
Output
A variable computed
from the inputs as a prediction or guess of the value of the target
variable(s). Synonyms: predicted value, estimate, y-hat.
Model
A class of formulas
or algorithms used to compute outputs from inputs. A statistical model
also includes information about the conditional distribution of the
targets given the inputs. See also trained model below. Synonyms:
architecture (for neural nets), classifier, expert, equation, function.
Weights
Numeric values used
in a model that are usually unknown or unspecified prior to the analysis.
Synonyms: estimated parameters, estimates, regression coefficients,
standardized regression coefficients, betas.
Case weight
A nonnegative numeric
variable that indicates the importance of each case. There are three
types of case weights: frequencies, sampling weights, and variance
weights. Enterprise Miner supports only frequencies.
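To make the idea of a frequency concrete: a frequency of k is conventionally read as "this case stands for k identical cases". The Python sketch below, which uses made-up values and hypothetical helper names, shows that a mean computed with frequencies matches the mean of the physically expanded data.

```python
def weighted_mean(values, freqs):
    """Mean in which each value counts as many times as its frequency."""
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

def expand(values, freqs):
    """Repeat each value according to its (integer) frequency."""
    expanded = []
    for v, f in zip(values, freqs):
        expanded.extend([v] * f)
    return expanded

values = [10.0, 20.0, 40.0]   # made-up data
freqs = [3, 1, 2]             # made-up frequencies

expanded = expand(values, freqs)
mean_from_freqs = weighted_mean(values, freqs)
mean_from_expanded = sum(expanded) / len(expanded)
```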
Parameters
The true or optimal
values of the weights or other quantities (such as standard deviations)
in a model.
Training
The process of computing
good values for the weights in a model, or, for tree-based models,
choosing good split variables and split values. Synonyms: estimation,
fitting, learning, adaptation, induction, growing (for trees).
Trained model
A specific formula
or algorithm for computing outputs from inputs, with all weights or
parameter estimates in the model chosen via a training algorithm from
a class of such formulas or algorithms designated by the model. Synonyms:
fitted model.
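The distinction between a model (a class of formulas), its weights, training, and a trained model can be sketched with a deliberately small example. Here the model is assumed to be the class of straight lines, output = w0 + w1 * input, and training is assumed to be ordinary least squares; both choices, the data, and all names are illustrative and are not taken from any Enterprise Miner node.

```python
def train_linear_model(inputs, targets):
    """Choose weights (w0, w1) for the model class w0 + w1 * x by least squares.

    Returns the weights and the trained model, a specific function with
    those weights fixed.
    """
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(targets) / n
    sxx = sum((x - mean_x) ** 2 for x in inputs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, targets))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x

    def trained_model(x):
        # Output: the prediction of the target for a given input value.
        return w0 + w1 * x

    return (w0, w1), trained_model

inputs = [1.0, 2.0, 3.0, 4.0]    # made-up training inputs
targets = [3.1, 4.9, 7.2, 8.8]   # made-up training targets

weights, trained_model = train_linear_model(inputs, targets)
outputs = [trained_model(x) for x in inputs]
```

Before training, the model is only a family of possible lines; after training, `trained_model` is one specific member of that family.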
Generalization
The ability of a model
to compute good outputs from input data not used during training.
Synonyms: interpolation and extrapolation, prediction.
Population
The set of all cases
that you want to be able to generalize to. The data to be analyzed
in data mining are usually a subset of the population.
Sample
A subset of the population
that is available for analysis.
Noise
Unpredictable variation,
usually in a target variable. For example, if two cases have identical
input values but different target values, the variation in those different
target values is not predictable from any model using only those inputs.
Hence that variation is noise. Noise is often assumed to be random,
in which case it is inherently unpredictable. Although noise prevents
target values from being predicted exactly, the distribution of
the noise can be estimated statistically given enough data. Synonym:
error.
Signal
Predictable variation
in a target variable. It is often assumed that target values are the
sum of signal and noise, where the signal is a function of the input
variables. Synonyms: function, systematic component.
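The additive view of signal and noise can be illustrated with simulated data. In the Python sketch below, the signal function, the noise standard deviation, and the sample sizes are all invented for the example. Cases that share identical input values differ only through noise, so the spread within each group of identical inputs gives an estimate of the noise distribution's standard deviation.

```python
import random
from collections import defaultdict

def signal(x):
    # Hypothetical signal: the predictable part of the target.
    return 3.0 + 2.0 * x

def pooled_noise_sd(cases):
    """Estimate the noise standard deviation from cases with identical inputs."""
    groups = defaultdict(list)
    for x, y in cases:
        groups[x].append(y)
    sum_sq = 0.0
    degrees_of_freedom = 0
    for ys in groups.values():
        mean_y = sum(ys) / len(ys)
        sum_sq += sum((y - mean_y) ** 2 for y in ys)
        degrees_of_freedom += len(ys) - 1
    return (sum_sq / degrees_of_freedom) ** 0.5

rng = random.Random(0)
TRUE_NOISE_SD = 2.0  # assumed for the simulation

# target = signal + noise, with 400 replicate cases at each input value
cases = [(x, signal(x) + rng.gauss(0.0, TRUE_NOISE_SD))
         for x in (0.0, 1.0, 2.0, 3.0, 4.0)
         for _ in range(400)]

estimated_sd = pooled_noise_sd(cases)
```

No model that uses only `x` can predict the within-group variation, yet with enough replicate cases `estimated_sd` should come close to the value used in the simulation.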
Training data
Data containing input
and target values, used for training to estimate weights or other
parameters. Synonyms: training set, design set.
Test data
Data containing input
and target values, not used during training in any way, but instead
used to estimate generalization error. Synonyms: test set (often confused
with validation data).
Validation data
Data containing input
and target values, used indirectly during training for model selection
or early stopping. Synonyms: validation set (often confused with test
data).
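The three roles defined above can be sketched as a random partition of the available cases. The Python example below is only an illustration: the 60/20/20 proportions, the seed, and the function name are assumptions, not Enterprise Miner defaults.

```python
import random

def partition(cases, train_frac=0.6, valid_frac=0.2, seed=12345):
    """Randomly split cases into training, validation, and test data."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    n_valid = int(round(valid_frac * len(shuffled)))
    train = shuffled[:n_train]                   # used to estimate weights
    valid = shuffled[n_train:n_train + n_valid]  # used for model selection or early stopping
    test = shuffled[n_train + n_valid:]          # used only to estimate generalization error
    return train, valid, test

# Made-up cases: (input, target) pairs.
cases = [(x, 2 * x + 1) for x in range(100)]
train, valid, test = partition(cases)
```

Because the test cases play no part in training or model selection, the error measured on them is not biased by either step.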
Scoring
Applying a trained
model to data to compute outputs. Synonyms: running (for neural nets),
simulating (for neural nets), filtering (for trees), interpolating
or extrapolating.
Interpolation
Scoring or generalization
for cases on or within the convex hull of the training set in the
space of the input variables.
Extrapolation
Scoring or generalization
for cases outside the convex hull of the training set in the space
of the input variables.
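For two input variables, the convex-hull test that separates interpolation from extrapolation can be sketched directly. The Python example below builds the hull with the standard monotone-chain algorithm and then checks whether a new case lies on or inside it. It assumes exactly two inputs and training cases that are not all collinear; the data and function names are invented, and spaces with more input variables need a different method.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def is_interpolation(hull, case):
    """True if the case lies on or within the hull, False if outside."""
    if len(hull) < 3:
        raise ValueError("training inputs must not all be collinear")
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], case) >= 0 for i in range(n))

# Made-up training inputs (x1, x2); the hull is a triangle.
training_inputs = [(0, 0), (4, 0), (0, 4), (1, 1)]
hull = convex_hull(training_inputs)

inside = is_interpolation(hull, (1, 1))     # within the hull
on_edge = is_interpolation(hull, (2, 2))    # on the hull boundary
outside = is_interpolation(hull, (3, 3))    # outside the hull
```

The case (3, 3) falls within the range of each input taken separately, yet it is outside the convex hull, so scoring it counts as extrapolation under the definition above.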
Operational data
Data to be scored in
a practical application, containing inputs but not target values.
Scoring operational data is the main purpose of training models in
data mining. Synonyms: scoring data.
Categorical variable
A variable that, for
all practical purposes, has only a limited number of possible values.
Synonyms: class variable, label.
Category
One of the possible
values of a categorical variable. Synonyms: class, level, label.
Class variable
In data mining, pattern
recognition, knowledge discovery, neural networks, and so on, a class
variable means a categorical target variable, and classification means
assigning cases to categories of a target variable. In traditional
SAS procedures, class variable means simply categorical variable,
either an input or a target.
Measurement
The process of assigning
numbers to things such that the properties of the numbers reflect
some attribute of the things.
Measurement level
One of several ways
in which properties of numbers can reflect attributes of things. The
most common measurement levels are nominal, ordinal, interval, log-interval,
ratio, and absolute. For details, see the Measurement Theory FAQ at
ftp://ftp.sas.com/pub/neural/measurement.html.
Nominal variable
A numeric or character
categorical variable in which the categories are unordered, and the
category values convey no additional information beyond category membership.
Ordinal variable
A numeric or character
categorical variable in which the categories are ordered, but the
category values convey no additional information beyond membership
and order. In particular, the number of levels between two categories
is not informative, and for numeric variables, the difference between
category values is not informative. The results of an analysis that
includes ordinal variables will typically be unchanged if you replace
all the values of an ordinal variable by different numeric or character
values as long as the order is maintained, although some algorithms
might use the numeric values for initialization. Enterprise Miner
provides no explicit support for continuous ordinal variables, although
some procedures in other SAS products do so, such as TRANSREG and
PRINQUAL.
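One way to see why order-preserving recodings leave an ordinal analysis unchanged is to look at ranks, which depend only on order. The Python sketch below uses an invented five-level rating and two arbitrary recodings; it shows that the ranks survive the recoding, while a quantity that relies on differences between values (the mean) does not.

```python
def dense_ranks(values):
    """Rank each value by its position among the distinct sorted values."""
    position = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [position[v] for v in values]

# Hypothetical ordinal input coded 1 (lowest) to 5 (highest).
rating = [1, 3, 2, 2, 5, 4, 1]

# Two order-preserving recodings: unevenly spaced numbers, and characters.
to_numbers = {1: 10, 2: 11, 3: 50, 4: 700, 5: 701}
to_letters = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
recoded_numbers = [to_numbers[v] for v in rating]
recoded_letters = [to_letters[v] for v in rating]

# Order-based information is identical under all three codings.
ranks_original = dense_ranks(rating)
ranks_numbers = dense_ranks(recoded_numbers)
ranks_letters = dense_ranks(recoded_letters)

# Difference-based information is not: the means disagree.
mean_original = sum(rating) / len(rating)
mean_numbers = sum(recoded_numbers) / len(recoded_numbers)
```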
Interval variable
A numeric variable
for which differences of values are informative.
Ratio variable
A numeric variable
for which ratios of values are informative. In Enterprise Miner, ratio
and higher-level variables are not generally distinguished from interval
variables, since the analytical methods are the same. However, ratio
measurements are required for some computations in model assessment,
such as profit and ROI measures.
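Temperature is a familiar way to see the difference between interval and ratio measurement; the example is not from Enterprise Miner and the function names are invented. Celsius and Fahrenheit are interval scales, so comparisons of differences survive a change of units but ratios of values do not. Kelvin and Rankine both start at absolute zero, so ratios of values are preserved between them.

```python
def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0

def celsius_to_kelvin(c):
    return c + 273.15

def kelvin_to_rankine(k):
    return k * 9.0 / 5.0

a_c, b_c, c_c = 10.0, 20.0, 40.0
a_f, b_f, c_f = (celsius_to_fahrenheit(t) for t in (a_c, b_c, c_c))

# Interval property: the ratio of two differences does not depend on units.
diff_ratio_c = (c_c - b_c) / (b_c - a_c)
diff_ratio_f = (c_f - b_f) / (b_f - a_f)

# Ratios of values are NOT meaningful on an interval scale: they change with units.
value_ratio_c = b_c / a_c
value_ratio_f = b_f / a_f

# On ratio scales the ratio of values is preserved under a change of units.
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)
value_ratio_k = b_k / a_k
value_ratio_r = kelvin_to_rankine(b_k) / kelvin_to_rankine(a_k)
```

The same reasoning explains why measures such as profit or ROI need ratio-level measurements: statements like "twice the profit" only make sense when zero is not arbitrary.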
Binary variable
A variable that takes
only two distinct values. A binary variable can be legitimately treated
as nominal, ordinal, interval, or sometimes ratio.