Output data sets that contain
scores include new variables whose names are usually formed by adding a prefix
to the name of the target variable (or variables) and, in some situations,
to the names of the input variables or the decision data set.
Prefixes Commonly Used in Scored Data Sets
| Prefix | Description |
| | Best possible loss of any of the decisions, -B(i) |
| | Best possible profit of any of the decisions, B(i) |
| | Loss computed from the target value, -C(i) |
| | Profit computed from the target value, C(i) |
| D_ | Label of the decision chosen by the model |
| | Expected loss for the decision chosen by the model, -E(i) |
| | Expected profit for the decision chosen by the model, E(i) |
| F_ | Normalized category that the case comes from |
| I_ | Normalized category that the case is classified into |
| | Missing indicator dummy variable |
| P_ | Outputs (predicted values and posterior probabilities) |
| R_ | Plain residuals: target minus output |
| | Standardized Anscombe residuals |
| | Studentized Anscombe residuals |
| | Standardized deviance residuals |
| | Studentized deviance residuals |
| | Return on investment, ROI(i) |
| | Unformatted category that the case is classified into |
Usually, for categorical
targets, the actual target values are dummy 0/1 variables. Hence,
the outputs (P_) are estimates of posterior probabilities. Some modeling
nodes might allow other ways of fitting categorical targets. For example,
when the Regression node fits an ordinal target by linear least squares,
it uses the index of the category as the actual target value. Hence,
it does not produce posterior probabilities.
Outputs (P_) are always
predictions of the actual target variable, even if the target variable
is standardized or otherwise rescaled during modeling computations.
Similarly, plain residuals (R_) are always the actual target value
minus the output. Plain residuals are not multiplied by error weights
or by frequencies.
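As a rough illustration of the two statements above, the following Python sketch assumes a hypothetical model that works internally on a standardized target. The data, the standardized predictions, and the helper name `plain_residuals` are invented for the example; the point is only that P_ is reported on the original target scale and that R_ is the unweighted difference.

```python
from statistics import mean, pstdev

def plain_residuals(targets, outputs):
    """R_ = actual target minus output; no error weights or frequencies are applied."""
    return [t - p for t, p in zip(targets, outputs)]

# Hypothetical interval target in its original units.
y = [10.0, 20.0, 30.0]
mu, sd = mean(y), pstdev(y)

# Assumed standardized predictions produced inside the modeling computations.
z_pred = [-1.0, 0.0, 1.0]

# Outputs (P_) are mapped back to the scale of the actual target.
p = [mu + sd * z for z in z_pred]

# Plain residuals (R_) use the actual target values, not the standardized ones.
r = plain_residuals(y, p)
```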
For least squares estimation,
the error function variable (E_) contains the squared error for each
case. For generalized linear models or other methods based on minimizing
deviance, the E_ variable is the deviance. For other types of maximum
likelihood estimation, the E_ variable is the negative log likelihood.
In other words, the E_ variable is the per-case quantity whose sum the
training method tries to minimize.
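The per-case error functions described above can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function names are hypothetical, and the categorical negative log likelihood is shown only as one assumed example of a maximum likelihood criterion.

```python
import math

def squared_error(target, output):
    """E_ for least squares estimation: squared error for one case."""
    return (target - output) ** 2

def neg_log_likelihood(dummies, posteriors):
    """E_ for an assumed categorical likelihood: minus the log of the
    posterior probability assigned to the category that actually occurred."""
    return -sum(math.log(p) for d, p in zip(dummies, posteriors) if d == 1)

def total_error(e_values, freqs):
    """Value minimized in training: frequency-weighted sum of per-case E_ values."""
    return sum(f * e for e, f in zip(e_values, freqs))

# Two hypothetical cases with frequencies 2 and 1.
e = [squared_error(3.0, 1.0), squared_error(5.0, 4.0)]
total = total_error(e, [2, 1])
```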
The deviance residual
is the signed square root of the value of the error function for a
given case. In other words, if you square the deviance residuals,
multiply them by the frequency values, and add them up, you get the
value of the error function for the entire data set. Hence, if the
target variable is rescaled, the deviance residuals are based on the
rescaled target values, not on the actual target values. However,
deviance residuals cannot be computed for categorical target variables.
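The relationship between deviance residuals and the error function can be checked numerically. The sketch below assumes a Poisson deviance as the error function and takes the sign from the plain residual; both choices are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def poisson_deviance(y, mu):
    """Per-case Poisson deviance, used here as an assumed example of E_."""
    term = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (term - (y - mu))

def deviance_residual(y, mu):
    """Signed square root of the per-case error function.
    The sign is taken from (target - output), which is an assumption."""
    e = max(poisson_deviance(y, mu), 0.0)  # guard against tiny negative rounding
    return math.copysign(math.sqrt(e), y - mu)

# Hypothetical targets, outputs, and frequencies.
ys = [0.0, 2.0, 5.0]
mus = [1.0, 2.0, 3.0]
freqs = [1, 2, 1]

resid = [deviance_residual(y, m) for y, m in zip(ys, mus)]

# Squaring the residuals, weighting by frequency, and summing
# recovers the error function for the whole data set.
from_residuals = sum(f * r * r for r, f in zip(resid, freqs))
from_errors = sum(f * poisson_deviance(y, m) for y, m, f in zip(ys, mus, freqs))
```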
For categorical target
variables, names for dummy target variables are created by concatenating
the target name with the formatted target values, with invalid characters
replaced by underscores. Output and residual names are created by
adding the appropriate prefix (P_, R_, and so on) to the dummy target
variable names. The F_ variable is the formatted value of the target
variable. The I_ variable is the category that the case is classified
into, which is also a formatted value. The I_ value is the category with the
highest posterior probability. If a decision matrix is used, the D_
value is the decision with the largest estimated profit or smallest
estimated loss. The D_ value might differ from the I_ value for two
reasons:
- The decision alternatives do not necessarily correspond to the target categories, and
- The I_ value depends directly on the posterior probabilities, not on estimated profit or loss.
However, the I_ value
can depend indirectly on the decision matrix when the decision matrix
is used in model estimation or selection.
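The naming rule and the difference between I_ and D_ can be illustrated with a small Python sketch. Everything here is hypothetical: the target name RESP, the profit matrix, the helper names, and the assumption that valid name characters are letters, digits, and underscores.

```python
import re

def dummy_name(target, formatted_value):
    """Concatenate the target name with the formatted value, replacing
    characters assumed to be invalid in a variable name with underscores."""
    return target + re.sub(r"[^A-Za-z0-9_]", "_", formatted_value)

def classify(posteriors):
    """I_: the category with the highest posterior probability."""
    return max(posteriors, key=posteriors.get)

def decide(posteriors, profit):
    """D_: the decision with the largest expected profit.
    profit[category][decision] is a hypothetical decision matrix."""
    decisions = list(next(iter(profit.values())))
    expected = {
        d: sum(posteriors[c] * profit[c][d] for c in posteriors)
        for d in decisions
    }
    best = max(expected, key=expected.get)
    return best, expected[best]

# Posterior probabilities for one case, keyed by formatted target value.
posteriors = {"yes": 0.3, "no": 0.7}
output_names = ["P_" + dummy_name("RESP", v) for v in posteriors]

# Decisions ("mail", "skip") are not the same as the target categories.
profit = {"yes": {"mail": 10.0, "skip": 0.0},
          "no": {"mail": -1.0, "skip": 0.0}}

i_value = classify(posteriors)
d_value, d_profit = decide(posteriors, profit)
```

In this made-up case the most probable category is "no", yet the decision with the largest expected profit is "mail", which shows both reasons at once: the decisions are not target categories, and the choice is driven by profit rather than by the posterior probabilities alone.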
Predicted values are
computed for all cases. The model is used to compute predicted values
whenever possible, regardless of whether the target variable is missing,
inputs excluded from the model (for example, by stepwise selection)
are missing, the frequency variable is missing, and so on. When predicted
values cannot be computed using the model (for example, when
required inputs are missing), the P_ variables are set according
to an intercept-only model:
- For an interval target, the P_ variable is the unconditional mean of the target variable.
- For categorical targets, the P_ variables are set to the prior probabilities.
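The fallback rule above can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are represented by `None`, the list of required inputs is supplied by the caller, and the model functions, input names, and numbers are all hypothetical.

```python
def has_missing(inputs, required):
    """True if any input the model actually requires is missing.
    Inputs excluded from the model are ignored."""
    return any(inputs.get(name) is None for name in required)

def score_interval(inputs, required, predict, target_mean):
    """P_ for an interval target: model prediction when possible,
    otherwise the unconditional mean of the target."""
    if has_missing(inputs, required):
        return target_mean
    return predict(inputs)

def score_categorical(inputs, required, predict_posteriors, priors):
    """P_ values for a categorical target: model posteriors when possible,
    otherwise the prior probabilities."""
    if has_missing(inputs, required):
        return dict(priors)
    return predict_posteriors(inputs)

# Hypothetical model that uses only "age"; "income" was excluded from the model.
model = lambda x: 1.0 + 2.0 * x["age"]

p_ok = score_interval({"age": 3.0, "income": None}, ["age"], model, 4.2)
p_fallback = score_interval({"age": None, "income": 50.0}, ["age"], model, 4.2)
```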
Scored output data sets
also contain a variable named _WARN_ that indicates problems computing
predicted values or making decisions. _WARN_ is a character variable
that is either blank, indicating that there were no problems, or contains
one or more of the following character codes:
_WARN_ Codes
| Code | Description |
| | Invalid posterior probability (for example, <0 or >1) |
| | Unrecognized input category |
Regardless of how the
P_ variables are computed, the I_ variables as well as the residuals
and errors are computed exactly the same way given the values of the
P_ variables. All cases with nonmissing targets and positive frequencies
contribute to the fit statistics. It is important that all such cases
be included in the computation of fit statistics because model comparisons
must be based on exactly the same sets of cases for every model under
consideration, regardless of which modeling nodes are used.
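The case-selection rule for fit statistics can be sketched in Python. Average squared error is used here only as an assumed example of a fit statistic, and the function name and data are hypothetical; what matters is that the filter depends only on the target and the frequency, never on how the P_ value was obtained.

```python
def average_squared_error(cases):
    """Frequency-weighted average squared error over every case with a
    nonmissing target and a positive frequency.
    cases: iterable of (target, output, frequency); None marks a missing value."""
    used = [(y, p, f) for y, p, f in cases
            if y is not None and f is not None and f > 0]
    total_freq = sum(f for _, _, f in used)
    return sum(f * (y - p) ** 2 for y, p, f in used) / total_freq

# Hypothetical scored cases. The last one might have been scored with the
# intercept-only fallback; it still counts toward the statistic.
cases = [
    (3.0, 2.0, 1),   # included
    (None, 2.5, 1),  # excluded: target is missing
    (5.0, 5.0, 0),   # excluded: frequency is not positive
    (1.0, 3.0, 2),   # included
]
ase = average_squared_error(cases)
```

Because the same filter applies to every model, two models scored on the same data set are compared on identical sets of cases.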