Predictive modeling
tries to find good rules (models) for guessing (predicting) the values
of one or more variables in a data set from the values of other variables
in the data set. After a good rule has been found, it can be applied
to new data sets (scoring) that might or might not contain the variable(s)
that are being predicted. The various methods that find prediction
rules go by different names in different areas of research, such as
regression, function mapping, classification, discriminant analysis,
pattern recognition, concept learning, supervised learning, and so
on.
In the present context,
prediction does not mean forecasting time series. In time series analysis,
an entity is observed repeatedly over time, and past values are used
to forecast future values. For the predictive modeling methods in
SAS Enterprise Miner, each case in a data set represents a different
entity, independent of the other cases in the data set. If the entities
in question are, for example, customers, then all of the information
pertaining to any one customer must be contained in a single case
in the data set. If you have a data set in which each customer is
described by multiple cases, you must first rearrange the data to
place all of the information about any one customer into the same
case. It is possible to fit some simple autoregressive models by preprocessing
the data using the LAG and DIF functions in the SAS Code node, but
SAS Enterprise Miner has no convenient interface for making forecasts.
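The rearrangement described above, and the idea behind lagged and differenced inputs, can be sketched as follows. This is a minimal illustration in plain Python, not SAS Enterprise Miner code. The transaction records, the field names, and the helper functions one_case_per_customer, lag, and dif are all hypothetical. The lag and dif helpers only imitate the general idea of the LAG and DIF functions (previous value, and current value minus previous value) on a simple list.

```python
from collections import defaultdict

# Hypothetical transaction-level data: several rows per customer.
transactions = [
    {"customer": "A", "month": 1, "amount": 20.0},
    {"customer": "A", "month": 2, "amount": 35.0},
    {"customer": "B", "month": 1, "amount": 10.0},
    {"customer": "A", "month": 3, "amount": 15.0},
    {"customer": "B", "month": 2, "amount": 40.0},
]

def one_case_per_customer(rows):
    """Collapse many rows per customer into one case (row) per customer."""
    amounts = defaultdict(list)
    # Sort so that each customer's amounts are collected in month order.
    for row in sorted(rows, key=lambda r: (r["customer"], r["month"])):
        amounts[row["customer"]].append(row["amount"])
    cases = []
    for customer, values in amounts.items():
        cases.append({
            "customer": customer,
            "n_purchases": len(values),
            "total_amount": sum(values),
            "last_amount": values[-1],
        })
    return cases

def lag(values):
    """Previous value of a series; the first entry is missing (None)."""
    return [None] + list(values[:-1])

def dif(values):
    """Each value minus its predecessor; None where the predecessor is missing."""
    return [None if prev is None else cur - prev
            for cur, prev in zip(values, lag(values))]

cases = one_case_per_customer(transactions)
for case in cases:
    print(case)
print(lag([1, 4, 9]), dif([1, 4, 9]))
```

Which summaries to keep for each customer (count, total, most recent value, and so on) depends on the application; the three shown here are only examples.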
SAS Enterprise Miner
provides a number of tools for predictive modeling. Three of these
tools are the Regression node, the Decision Tree node, and the Neural
Network node. The methods used in these nodes come from several areas
of research, including statistics, pattern recognition, and machine
learning. These different areas use different terminology, so before
discussing predictive modeling methods, it will be helpful to clarify
the terms used in SAS Enterprise Miner. The following list of terms
is in logical, not alphabetical, order. A more extensive alphabetical
list of terms can be found in the Glossary.
Synonym
A word having a meaning
similar to but not necessarily identical to that of another word in
at least one sense.
Case
A collection of information
about one of numerous entities represented in a data set. Synonyms:
observation, record, example, pattern, sample, instance, row, vector,
pair, tuple, fact.
Variable
One of the items of
information represented in numeric or character form for each case
in a data set. Synonyms: column, feature, attribute, coordinate, measurement.
Target
A variable whose value
is known in some currently available data, but will be unknown in
some future/fresh/operational data set. You want to be able to predict
or guess the values of the target variable(s) from other known variables.
Synonyms: dependent variable, response, observed values, training
values, desired output, correct output, outcome.
Input
A variable used to
predict or guess the value of the target variable(s). Synonyms: independent
variable, predictor, regressor, explanatory variable, carrier, factor,
covariate.
Output
A variable computed
from the inputs as a prediction or guess of the value of the target
variable(s). Synonyms: predicted value, estimate, y-hat.
Model
A class of formulas
or algorithms used to compute outputs from inputs. A statistical model
also includes information about the conditional distribution of the
targets given the inputs. See also trained model below. Synonyms:
architecture (for neural nets), classifier, expert, equation, function.
Weights
Numeric values used
in a model that are usually unknown or unspecified prior to the analysis.
Synonyms: estimated parameters, estimates, regression coefficients,
standardized regression coefficients, betas.
Case Weight
A nonnegative numeric
variable that indicates the importance of each case. There are three
types of case weights: frequencies, sampling weights, and variance
weights. SAS Enterprise Miner supports only frequencies.
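As a rough illustration of what a frequency means, a case with frequency k can be thought of as standing for k identical cases. The sketch below, in plain Python with hypothetical values and a hypothetical weighted_mean helper, shows that a frequency-weighted mean matches the ordinary mean of the expanded data.

```python
def weighted_mean(values, freqs):
    """Mean of values where each value counts freqs[i] times."""
    total = sum(f * v for v, f in zip(values, freqs))
    return total / sum(freqs)

values = [10.0, 20.0, 30.0]   # one variable, three distinct cases
freqs = [1, 3, 2]             # frequency (case weight) of each case

# Expanding by frequency: repeat each case as many times as its frequency.
expanded = [v for v, f in zip(values, freqs) for _ in range(f)]

mean_from_freqs = weighted_mean(values, freqs)
mean_from_expanded = sum(expanded) / len(expanded)
print(mean_from_freqs, mean_from_expanded)
```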
Parameters
The true or optimal
values of the weights or other quantities (such as standard deviations)
in a model.
Training
The process of computing
good values for the weights in a model, or, for tree-based models,
choosing good split variables and split values. Synonyms: estimation,
fitting, learning, adaptation, induction, growing (for trees).
Trained Model
A specific formula
or algorithm for computing outputs from inputs, with all weights or
parameter estimates in the model chosen via a training algorithm from
a class of such formulas or algorithms designated by the model. Synonym:
fitted model.
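The terms input, target, output, model, weights, training, and trained model can be tied together with a small sketch. The example below is plain Python, not SAS Enterprise Miner code. It assumes the simplest possible model class, output = w0 + w1 * x with a single input, and uses ordinary least squares as the training algorithm. The data and the function names train_linear and score are hypothetical.

```python
# Training data: one input (x) and one target (y) per case.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

def train_linear(xs, ys):
    """Training: compute least-squares weights for output = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1

def score(weights, xs):
    """Scoring: apply the trained model to inputs to compute outputs."""
    w0, w1 = weights
    return [w0 + w1 * x for x in xs]

# The model is the class of formulas w0 + w1 * x; the trained model is
# that formula with the specific weights found by training.
weights = train_linear(train_x, train_y)
outputs = score(weights, [5.0, 6.0])
print(weights, outputs)
```

Scoring inputs that were not in the training data, as in the last step, is an instance of generalization.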
Generalization
The ability of a model
to compute good outputs from input data not used during training.
Synonyms: interpolation and extrapolation, prediction.
Population
The set of all cases
that you want to be able to generalize to. The data to be analyzed
in data mining are usually a subset of the population.
Sample
A subset of the population
that is available for analysis.
Noise
Unpredictable variation,
usually in a target variable. For example, if two cases have identical
input values but different target values, the variation in those different
target values is not predictable from any model using only those inputs.
Hence, that variation is noise. Noise is often assumed to be random.
In that case, it is inherently unpredictable. Although noise prevents
target values from being predicted exactly, the distribution of
the noise can be estimated statistically given enough data. Synonym:
error.
Signal
Predictable variation
in a target variable. It is often assumed that target values are the
sum of signal and noise, where the signal is a function of the input
variables. Synonyms: Function, systematic component.
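The additive assumption mentioned above can be written compactly. The symbols are notation introduced here for illustration and are not taken from SAS Enterprise Miner: y is the target, x is the vector of inputs, f is the signal, and ε is the noise.

```latex
y = f(x) + \varepsilon
```

Under this assumption, two cases with identical inputs x share the same signal f(x), so any difference between their target values comes entirely from the noise term.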
Training Data
Data containing input
and target values, used for training to estimate weights or other
parameters. Synonyms: Training set, design set.
Test Data
Data containing input
and target values, not used during training in any way, but instead
used to estimate generalization error. Synonyms: Test set (often confused
with validation data).
Validation Data
Data containing input
and target values, used indirectly during training for model selection
or early stopping. Synonyms: Validation set (often confused with test
data).
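One common way to obtain training, validation, and test data is to split the available cases at random. The sketch below is plain Python and only an illustration. The function name split_cases, the 60/20/20 proportions, and the fixed seed are assumptions, not settings taken from SAS Enterprise Miner.

```python
import random

def split_cases(cases, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle cases and split them into training, validation, and test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)   # fixed seed for a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains is test data
    return train, valid, test

cases = list(range(10))                     # stand-ins for ten cases
train, valid, test = split_cases(cases)
print(len(train), len(valid), len(test))
```

Each case lands in exactly one of the three sets, so the test data play no part in training or in model selection.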
Scoring
Applying a trained
model to data to compute outputs. Synonyms: running (for neural nets),
simulating (for neural nets), filtering (for trees), interpolating
or extrapolating.
Interpolation
Scoring or generalization
for cases on or within the convex hull of the training set in the
space of the input variables.
Extrapolation
Scoring or generalization
for cases outside the convex hull of the training set in the space
of the input variables.
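For two input variables, the distinction between interpolation and extrapolation can be checked directly. The sketch below is plain Python under stated assumptions: it handles only two inputs, it uses Andrew's monotone chain algorithm to find the convex hull, and the names convex_hull and is_interpolation, as well as the training points, are hypothetical. With more inputs, a different method (for example, linear programming) would be needed.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

def is_interpolation(hull, p):
    """True if p lies on or within the hull (needs at least 3 hull vertices)."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

# Input values (two input variables) of the training cases.
training_inputs = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull = convex_hull(training_inputs)

print(is_interpolation(hull, (2, 1)))   # inside the hull
print(is_interpolation(hull, (4, 2)))   # on the boundary
print(is_interpolation(hull, (5, 5)))   # outside the hull: extrapolation
```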
Operational Data
Data to be scored in
a practical application, containing inputs but not target values.
Scoring operational data is the main purpose of training models in
data mining. Synonym: scoring data.
Categorical Variable
A variable that, for
all practical purposes, has only a limited number of possible values.
Synonyms: class variable, label.
Category
One of the possible
values of a categorical variable. Synonyms: class, level, label.
Class Variable
In data mining, pattern
recognition, knowledge discovery, neural networks, and so on, a class
variable means a categorical target variable, and classification means
assigning cases to categories of a target variable. In traditional
SAS procedures, class variable means simply categorical variable,
either an input or a target.
Measurement
The process of assigning
numbers to things such that the properties of the numbers reflect
some attribute of the things.
Measurement Level
One of several ways
in which properties of numbers can reflect attributes of things. The
most common measurement levels are nominal, ordinal, interval, log-interval,
ratio, and absolute. For details, see the Measurement Theory FAQ (ftp://ftp.sas.com/pub/neural/measurement.html).
Nominal Variable
A numeric or character
categorical variable in which the categories are unordered, and the
category values convey no additional information beyond category membership.
Ordinal Variable
A numeric or character
categorical variable in which the categories are ordered, but the
category values convey no additional information beyond membership
and order. In particular, the number of levels between two categories
is not informative, and for numeric variables, the difference between
category values is not informative. The results of an analysis that
includes ordinal variables will typically be unchanged if you replace
all the values of an ordinal variable by different numeric or character
values as long as the order is maintained, although some algorithms
might use the numeric values for initialization. SAS Enterprise Miner
provides no explicit support for continuous ordinal variables, although
some procedures in other SAS products do so, such as TRANSREG and
PRINQUAL.
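The invariance described above can be illustrated with ranks, which depend only on the order of the values. The sketch below is plain Python with hypothetical data and a hypothetical ranks helper. It is meant only to show why an order-preserving recoding leaves an order-based analysis unchanged, not to describe how any particular SAS Enterprise Miner node treats ordinal variables.

```python
def ranks(values):
    """Rank of each value (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of tied values that starts at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

# The same ordinal variable coded three ways, all preserving the order.
severity_codes = [1, 3, 2, 3, 1]
recoded = [10, 500, 75, 500, 10]
labels = ["a_low", "c_high", "b_mid", "c_high", "a_low"]

print(ranks(severity_codes))
print(ranks(recoded))
print(ranks(labels))
```

All three codings produce the same ranks, so any computation that uses only the ranks gives the same result for each of them.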
Interval Variable
A numeric variable
for which differences of values are informative.
Ratio Variable
A numeric variable
for which ratios of values are informative. In SAS Enterprise Miner,
ratio and higher-level variables are not generally distinguished from
interval variables, since the analytical methods are the same. However,
ratio measurements are required for some computations in model assessment,
such as profit and ROI measures.
Binary Variable
A variable that takes
only two distinct values. A binary variable can be legitimately treated
as nominal, ordinal, interval, or sometimes ratio.