Input and Output Data Sets

Train, Validate, and Test Data Sets

Since SAS Enterprise Miner is intended especially for the analysis of large data sets, all of the predictive modeling nodes are designed to work with separate training, validation, and test sets. The Data Partition node provides a convenient way to split a single data set into these three subsets by simple random sampling, stratified random sampling, or user-defined sampling. Each predictive modeling node also enables you to specify a fourth, scoring data set that is not required to contain the target variable. These four uses for data sets are called the roles of the data sets.

For the training, validation, and test sets, the predictive modeling nodes can produce two output data sets: one containing the original data plus scores (predicted values, residuals, classification results, and so on), and the other containing various statistics that pertain to the fit of the model (the error function, misclassification rate, and so on). For scoring sets, only the output data set containing scores can be produced.
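For example, the kind of simple random 60/20/20 split that the Data Partition node performs can be sketched in a DATA step as follows; the input data set name and the seed are hypothetical:

   data train valid test;
      set work.customers;                    /* hypothetical input data set */
      _r_ = ranuni(12345);                   /* uniform random draw, fixed seed */
      if _r_ < 0.6 then output train;        /* 60% training */
      else if _r_ < 0.8 then output valid;   /* 20% validation */
      else output test;                      /* 20% test */
      drop _r_;
   run;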

Scored Data Sets

Output data sets that contain scores have new variables whose names are usually formed by adding a prefix to the name of the target variable (or variables) and, in some situations, to the names of the input variables or of variables from the decision data set.
Prefixes Commonly Used in Scored Data Sets

| Prefix | Root | Description | Target Needed? |
|--------|------|-------------|----------------|
| BL_  | Decision data set | Best possible loss of any of the decisions, –B(i) | Yes |
| BP_  | Decision data set | Best possible profit of any of the decisions, B(i) | Yes |
| CL_  | Decision data set | Loss computed from the target value, –C(i) | Yes |
| CP_  | Decision data set | Profit computed from the target value, C(i) | Yes |
| D_   | Decision data set | Label of the decision chosen by the model | No |
| E_   | Target | Error function | Yes |
| EL_  | Decision data set | Expected loss for the decision chosen by the model, –E(i) | No |
| EP_  | Decision data set | Expected profit for the decision chosen by the model, E(i) | No |
| F_   | Target | Normalized category that the case comes from | Yes |
| I_   | Target | Normalized category that the case is classified into | No |
| IC_  | Decision data set | Investment cost, IC(i) | No |
| M_   | Variable | Missing indicator dummy variable | |
| P_   | Target or dummy | Outputs (predicted values and posterior probabilities) | No |
| R_   | Target or dummy | Plain residuals: target minus output | Yes |
| RA_  | Target | Anscombe residuals | Yes |
| RAS_ | Target | Standardized Anscombe residuals | Yes |
| RAT_ | Target | Studentized Anscombe residuals | Yes |
| RD_  | Target | Deviance residuals | Yes |
| RDS_ | Target | Standardized deviance residuals | Yes |
| RDT_ | Target | Studentized deviance residuals | Yes |
| ROI_ | Decision data set | Return on investment, ROI(i) | Yes |
| RS_  | Target | Standardized residuals | Yes |
| RT_  | Target | Studentized residuals | Yes |
| S_   | Variable | Standardized variable | |
| T_   | Variable | Studentized variable | |
| U_   | Target | Unformatted category that the case is classified into | No |
Usually, for categorical targets, the actual target values are dummy 0/1 variables. Hence, the outputs (P_) are estimates of posterior probabilities. Some modeling nodes might allow other ways of fitting categorical targets. For example, when the Regression node fits an ordinal target by linear least squares, it uses the index of the category as the actual target value. Hence, it does not produce posterior probabilities.
Outputs (P_) are always predictions of the actual target variable, even if the target variable is standardized or otherwise rescaled during modeling computations. Similarly, plain residuals (R_) are always the actual target value minus the output. Plain residuals are not multiplied by error weights or by frequencies.
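As a sketch of these conventions, for a hypothetical interval target named AMOUNT, the plain residual in a scored data set can be reproduced directly from the score variables:

   data check;
      set scored;                    /* hypothetical scored output data set */
      r_check = amount - P_AMOUNT;   /* should match R_AMOUNT exactly */
   run;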
For least squares estimation, the error function variable (E_) contains the squared error for each case. For generalized linear models or other methods based on minimizing deviance, the E_ variable is the deviance. For other types of maximum likelihood estimation, the E_ variable is the negative log likelihood. In other words, the E_ variable is whatever the training method is trying to minimize the sum of.
The deviance residual is the signed square root of the value of the error function for a given case. In other words, if you square the deviance residuals, multiply them by the frequency values, and add them up, you get the value of the error function for the entire data set. Hence, if the target variable is rescaled, the deviance residuals are based on the rescaled target values, not on the actual target values. However, deviance residuals cannot be computed for categorical target variables.
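This identity can be checked on a scored training set, as in the following sketch; the data set name, the target name AMOUNT, and the frequency variable FREQ are hypothetical:

   proc sql;
      select sum(freq * RD_AMOUNT**2) as err_check   /* should match the _ERR_ fit statistic */
      from scored_train;
   quit;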
For categorical target variables, names for dummy target variables are created by concatenating the target name with the formatted target values, with invalid characters replaced by underscores. Output and residual names are created by adding the appropriate prefix (P_, R_, and so on) to the dummy target variable names. The F_ variable is the formatted value of the target variable. The I_ variable is the category that the case is classified into, which is also a formatted value: the category with the highest posterior probability. If a decision matrix is used, the D_ value is the decision with the largest estimated profit or smallest estimated loss. The D_ value might differ from the I_ value for two reasons:
  • The decision alternatives do not necessarily correspond to the target categories, and
  • The I_ value depends directly on the posterior probabilities, not on estimated profit or loss.
However, the I_ value can depend indirectly on the decision matrix when the decision matrix is used in model estimation or selection.
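The following is a minimal sketch of the I_ rule, assuming a hypothetical categorical target RISK with formatted values LOW, MED, and HIGH (so the dummy-based outputs are P_RISKLOW, P_RISKMED, and P_RISKHIGH):

   data classify;
      set scored;                    /* hypothetical scored data set */
      length i_check $32;
      /* I_RISK is the category with the highest posterior probability */
      if P_RISKLOW >= max(P_RISKMED, P_RISKHIGH) then i_check = 'LOW';
      else if P_RISKMED >= P_RISKHIGH then i_check = 'MED';
      else i_check = 'HIGH';
   run;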
Predicted values are computed for all cases. The model is used to compute predicted values whenever possible, regardless of whether the target variable is missing, whether inputs excluded from the model (for example, by stepwise selection) are missing, whether the frequency variable is missing, and so on. When predicted values cannot be computed using the model (for example, when required inputs are missing), the P_ variables are set according to an intercept-only model, as sketched after this list:
  • For an interval target, the P_ variable is the unconditional mean of the target variable.
  • For categorical targets, the P_ variables are set to the prior probabilities.
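The following sketch imitates that fallback for a hypothetical interval target AMOUNT scored by a hypothetical one-input model; the coefficients and the unconditional mean are made-up numbers for illustration:

   data score_fallback;
      set newdata;                                /* hypothetical data set to score */
      if missing(income) then P_AMOUNT = 52.3;    /* fall back to the unconditional mean */
      else P_AMOUNT = 12.7 + 0.004 * income;      /* hypothetical fitted model */
   run;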
Scored output data sets also contain a variable named _WARN_ that indicates problems in computing predicted values or making decisions. _WARN_ is a character variable that is either blank, indicating that there were no problems, or contains one or more of the following character codes:
_WARN_ Codes

| Code | Meaning |
|------|---------|
| C | Missing cost variable |
| M | Missing inputs |
| P | Invalid posterior probability (for example, <0 or >1) |
| U | Unrecognized input category |
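Because _WARN_ packs its codes into a single character value, a simple scan isolates problem cases; for example, the following sketch (with a hypothetical scored data set) keeps the cases whose predictions fell back to the intercept-only model because of missing inputs:

   data fallback_cases;
      set scored;                   /* hypothetical scored data set */
      if index(_WARN_, 'M') > 0;    /* keep cases flagged for missing inputs */
   run;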
Regardless of how the P_ variables are computed, the I_ variables as well as the residuals and errors are computed exactly the same way given the values of the P_ variables. All cases with nonmissing targets and positive frequencies contribute to the fit statistics. It is important that all such cases be included in the computation of fit statistics because model comparisons must be based on exactly the same sets of cases for every model under consideration, regardless of which modeling nodes are used.

Fit Statistics

The output data sets containing fit statistics produced by the Regression node and the Decision Tree node have only one record. Since the Neural Network node can analyze multiple target variables, it produces one record for each target variable and one record for the overall fit; the variable called _NAME_ indicates which target variable the statistics are for.
The fit statistics for training data generally include the following variables, computed from the sum of frequencies and ordinary residuals:
Variables Included in Fit Statistics for Training Data

| Name | Label |
|------|-------|
| _NOBS_ | Sum of Frequencies |
| _DFT_ | Total Degrees of Freedom |
| _DIV_ | Divisor for ASE |
| _ASE_ | Train: Average Squared Error |
| _MAX_ | Train: Maximum Absolute Error |
| _RASE_ | Train: Root Average Squared Error |
| _SSE_ | Train: Sum of Squared Errors |
Note that _DFT_, _DIV_, and _NOBS_ can all differ from one another when the target variable is categorical.
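The squared-error statistics are related in the usual way: ASE is the sum of squared errors divided by its divisor, and RASE is the square root of ASE. A sketch, assuming a hypothetical fit-statistics data set named FITSTATS:

   data _null_;
      set fitstats;                    /* hypothetical fit-statistics data set */
      ase_check  = _SSE_ / _DIV_;      /* should match _ASE_ */
      rase_check = sqrt(ase_check);    /* should match _RASE_ */
      put ase_check= rase_check=;
   run;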
The following fit statistics are computed according to the error function (such as squared error, deviance, or negative log likelihood) that was minimized:
Fit Statistics Computed According to the Error Function

| Name | Label |
|------|-------|
| _AIC_ | Train: Akaike's Information Criterion |
| _AVERR_ | Train: Average Error Function |
| _ERR_ | Train: Error Function |
| _SBC_ | Train: Schwarz's Bayesian Criterion |
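In their usual deviance-form definitions, AIC adds twice the number of estimated parameters to the error function, and SBC adds the number of parameters times the natural log of the number of cases. The following is a hedged sketch only; the exact penalty a particular node uses may differ, and the parameter count here is hypothetical:

   data _null_;
      set fitstats;                          /* hypothetical fit-statistics data set */
      p = 10;                                /* hypothetical number of estimated parameters */
      aic_check = _ERR_ + 2 * p;             /* AIC in deviance form */
      sbc_check = _ERR_ + p * log(_NOBS_);   /* SBC in deviance form */
      put aic_check= sbc_check=;
   run;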
For a categorical target variable, the following statistics are also computed:
Additional Statistics Computed for a Categorical Target Variable

| Name | Label |
|------|-------|
| _MISC_ | Train: Misclassification Rate |
| _WRONG_ | Train: Number of Wrong Classifications |
When decision processing is done, the statistics in the following table are also computed for the training set. The profit variables are computed for a profit or revenue matrix, and the loss variables are computed for a loss matrix:
Additional Statistics Computed for Decision Processing

| Name | Label |
|------|-------|
| _PROF_ | Train: Total Profit |
| _APROF_ | Train: Average Profit |
| _LOSS_ | Train: Total Loss |
| _ALOSS_ | Train: Average Loss |
For a validation data set, the variable names contain a V following the first underscore; for a test data set, they contain a T following the first underscore. Not all of the fit statistics are appropriate for validation and test sets, and adjustments for model degrees of freedom are not applicable, so the ASE and the MSE are identical. For a validation set, the following fit statistics are computed:
Fit Statistics Computed for a Validation Data Set

| Name | Label |
|------|-------|
| _VASE_ | Valid: Average Squared Error |
| _VAVERR_ | Valid: Average Error Function |
| _VDIV_ | Valid: Divisor for ASE |
| _VERR_ | Valid: Error Function |
| _VMAX_ | Valid: Maximum Absolute Error |
| _VMSE_ | Valid: Mean Squared Error |
| _VNOBS_ | Valid: Sum of Frequencies |
| _VRASE_ | Valid: Root Average Squared Error |
| _VRMSE_ | Valid: Root Mean Square Error |
| _VSSE_ | Valid: Sum of Squared Errors |
For a validation set and a categorical target variable, the following fit statistics are computed:
Fit Statistics Computed for a Validation Data Set with a Categorical Target Variable

| Name | Label |
|------|-------|
| _VMISC_ | Valid: Misclassification Rate |
| _VWRONG_ | Valid: Number of Wrong Classifications |
When decision processing is done, the following statistics are also computed for the validation set:
Additional Statistics Computed for Decision Processing

| Name | Label |
|------|-------|
| _VPROF_ | Valid: Total Profit |
| _VAPROF_ | Valid: Average Profit |
| _VLOSS_ | Valid: Total Loss |
| _VALOSS_ | Valid: Average Loss |
Cross-validation statistics are named in the same way except that the letter X appears in place of the V. These statistics appear in the same data set or data sets as the fit statistics for the training data. For a test set, the following fit statistics are computed:
Fit Statistics Computed for a Test Data Set

| Name | Label |
|------|-------|
| _TASE_ | Test: Average Squared Error |
| _TAVERR_ | Test: Average Error Function |
| _TDIV_ | Test: Divisor for ASE |
| _TERR_ | Test: Error Function |
| _TMAX_ | Test: Maximum Absolute Error |
| _TMSE_ | Test: Mean Squared Error |
| _TNOBS_ | Test: Sum of Frequencies |
| _TRASE_ | Test: Root Average Squared Error |
| _TRMSE_ | Test: Root Mean Square Error |
| _TSSE_ | Test: Sum of Squared Errors |
For a test data set and a categorical target variable, the following fit statistics are computed:
Fit Statistics Computed for a Test Data Set with a Categorical Target Variable

| Name | Label |
|------|-------|
| _TMISC_ | Test: Misclassification Rate |
| _TMISL_ | Test: Lower 95% Confidence Limit for TMISC |
| _TMISU_ | Test: Upper 95% Confidence Limit for TMISC |
| _TWRONG_ | Test: Number of Wrong Classifications |
When decision processing is done, the following statistics are also computed for the test set:
Fit Statistics Computed for Test Data Sets Using Decision Processing

| Name | Label |
|------|-------|
| _TPROF_ | Test: Total Profit |
| _TAPROF_ | Test: Average Profit |
| _TLOSS_ | Test: Total Loss |
| _TALOSS_ | Test: Average Loss |