Input and Output Data Sets

Train, Validate, and Test Data Sets

Since SAS Enterprise Miner is intended especially for the analysis of large data sets, all of the predictive modeling nodes are designed to work with separate training, validation, and test sets. The Data Partition node provides a convenient way to split a single data set into these three subsets by simple random sampling, stratified random sampling, or user-defined sampling. Each predictive modeling node also enables you to specify a fourth, scoring data set that is not required to contain the target variable. These four uses for data sets are called the roles of the data sets.

For the training, validation, and test sets, the predictive modeling nodes can produce two output data sets: one containing the original data plus scores (predicted values, residuals, classification results, and so on), and the other containing various statistics that pertain to the fit of the model (the error function, misclassification rate, and so on). For scoring sets, only the output data set containing scores can be produced.
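For example, the kind of simple random 60/20/20 split that the Data Partition node performs can be sketched in a DATA step as follows; the input data set name and the seed are hypothetical:

   data train valid test;
      set work.customers;                    /* hypothetical input data set */
      _r_ = ranuni(12345);                   /* uniform random draw, fixed seed */
      if _r_ < 0.6 then output train;        /* 60% training */
      else if _r_ < 0.8 then output valid;   /* 20% validation */
      else output test;                      /* 20% test */
      drop _r_;
   run;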

Scored Data Sets

Output data sets that contain scores have new variables whose names are usually formed by adding a prefix to the name of the target variable (or variables) and, in some situations, to the names of the input variables or of variables from the decision data set.
Prefixes Commonly Used in Scored Data Sets

| Prefix | Root | Description | Target Needed? |
|--------|------|-------------|----------------|
| BL_  | Decision data set | Best possible loss of any of the decisions, –B(i) | Yes |
| BP_  | Decision data set | Best possible profit of any of the decisions, B(i) | Yes |
| CL_  | Decision data set | Loss computed from the target value, –C(i) | Yes |
| CP_  | Decision data set | Profit computed from the target value, C(i) | Yes |
| D_   | Decision data set | Label of the decision chosen by the model | No |
| E_   | Target | Error function | Yes |
| EL_  | Decision data set | Expected loss for the decision chosen by the model, –E(i) | No |
| EP_  | Decision data set | Expected profit for the decision chosen by the model, E(i) | No |
| F_   | Target | Normalized category that the case comes from | Yes |
| I_   | Target | Normalized category that the case is classified into | No |
| IC_  | Decision data set | Investment cost, IC(i) | No |
| M_   | Variable | Missing indicator dummy variable | |
| P_   | Target or dummy | Outputs (predicted values and posterior probabilities) | No |
| R_   | Target or dummy | Plain residuals: target minus output | Yes |
| RA_  | Target | Anscombe residuals | Yes |
| RAS_ | Target | Standardized Anscombe residuals | Yes |
| RAT_ | Target | Studentized Anscombe residuals | Yes |
| RD_  | Target | Deviance residuals | Yes |
| RDS_ | Target | Standardized deviance residuals | Yes |
| RDT_ | Target | Studentized deviance residuals | Yes |
| ROI_ | Decision data set | Return on investment, ROI(i) | Yes |
| RS_  | Target | Standardized residuals | Yes |
| RT_  | Target | Studentized residuals | Yes |
| S_   | Variable | Standardized variable | |
| T_   | Variable | Studentized variable | |
| U_   | Target | Unformatted category that the case is classified into | No |
Usually, for categorical targets, the actual target values are dummy 0/1 variables. Hence, the outputs (P_) are estimates of posterior probabilities. Some modeling nodes might allow other ways of fitting categorical targets. For example, when the Regression node fits an ordinal target by linear least squares, it uses the index of the category as the actual target value. Hence, it does not produce posterior probabilities.
Outputs (P_) are always predictions of the actual target variable, even if the target variable is standardized or otherwise rescaled during modeling computations. Similarly, plain residuals (R_) are always the actual target value minus the output. Plain residuals are not multiplied by error weights or by frequencies.
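As a sketch of these conventions, for a hypothetical interval target named AMOUNT, the plain residual in a scored data set can be reproduced directly from the score variables:

   data check;
      set scored;                    /* hypothetical scored output data set */
      r_check = amount - P_AMOUNT;   /* should match R_AMOUNT exactly */
   run;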
For least squares estimation, the error function variable (E_) contains the squared error for each case. For generalized linear models or other methods based on minimizing deviance, the E_ variable is the deviance. For other types of maximum likelihood estimation, the E_ variable is the negative log likelihood. In other words, the E_ variable is whatever the training method is trying to minimize the sum of.
The deviance residual is the signed square root of the value of the error function for a given case. In other words, if you square the deviance residuals, multiply them by the frequency values, and add them up, you get the value of the error function for the entire data set. Hence, if the target variable is rescaled, the deviance residuals are based on the rescaled target values, not on the actual target values. However, deviance residuals cannot be computed for categorical target variables.
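This identity can be checked on a scored training set, as in the following sketch; the data set name, the target name AMOUNT, and the frequency variable FREQ are hypothetical:

   proc sql;
      select sum(freq * RD_AMOUNT**2) as err_check   /* should match the _ERR_ fit statistic */
      from scored_train;
   quit;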
For categorical target variables, names for dummy target variables are created by concatenating the target name with the formatted target values, with invalid characters replaced by underscores. Output and residual names are created by adding the appropriate prefix (P_, R_, and so on) to the dummy target variable names. The F_ variable is the formatted value of the target variable. The I_ variable is the category that the case is classified into, which is also a formatted value: the category with the highest posterior probability. If a decision matrix is used, the D_ value is the decision with the largest estimated profit or smallest estimated loss. The D_ value might differ from the I_ value for two reasons:
  • The decision alternatives do not necessarily correspond to the target categories, and
  • The I_ value depends directly on the posterior probabilities, not on estimated profit or loss.
However, the I_ value can depend indirectly on the decision matrix when the decision matrix is used in model estimation or selection.
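The following is a minimal sketch of the I_ rule, assuming a hypothetical categorical target RISK with formatted values LOW, MED, and HIGH (so the dummy-based outputs are P_RISKLOW, P_RISKMED, and P_RISKHIGH):

   data classify;
      set scored;                    /* hypothetical scored data set */
      length i_check $32;
      /* I_RISK is the category with the highest posterior probability */
      if P_RISKLOW >= max(P_RISKMED, P_RISKHIGH) then i_check = 'LOW';
      else if P_RISKMED >= P_RISKHIGH then i_check = 'MED';
      else i_check = 'HIGH';
   run;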
Predicted values are computed for all cases. The model is used to compute predicted values whenever possible, regardless of whether the target variable is missing, whether inputs excluded from the model (for example, by stepwise selection) are missing, whether the frequency variable is missing, and so on. When predicted values cannot be computed using the model (for example, when required inputs are missing), the P_ variables are set according to an intercept-only model, as sketched after this list:
  • For an interval target, the P_ variable is the unconditional mean of the target variable.
  • For categorical targets, the P_ variables are set to the prior probabilities.
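The following sketch imitates that fallback for a hypothetical interval target AMOUNT scored by a hypothetical one-input model; the coefficients and the unconditional mean are made-up numbers for illustration:

   data score_fallback;
      set newdata;                                /* hypothetical data set to score */
      if missing(income) then P_AMOUNT = 52.3;    /* fall back to the unconditional mean */
      else P_AMOUNT = 12.7 + 0.004 * income;      /* hypothetical fitted model */
   run;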
Scored output data sets also contain a variable named _WARN_ that indicates problems in computing predicted values or making decisions. _WARN_ is a character variable that is either blank, indicating that there were no problems, or contains one or more of the following character codes:
_WARN_ Codes

| Code | Meaning |
|------|---------|
| C | Missing cost variable |
| M | Missing inputs |
| P | Invalid posterior probability (for example, <0 or >1) |
| U | Unrecognized input category |
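Because _WARN_ packs its codes into a single character value, a simple scan isolates problem cases; for example, the following sketch (with a hypothetical scored data set) keeps the cases whose predictions fell back to the intercept-only model because of missing inputs:

   data fallback_cases;
      set scored;                   /* hypothetical scored data set */
      if index(_WARN_, 'M') > 0;    /* keep cases flagged for missing inputs */
   run;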
Regardless of how the P_ variables are computed, the I_ variables as well as the residuals and errors are computed exactly the same way given the values of the P_ variables. All cases with nonmissing targets and positive frequencies contribute to the fit statistics. It is important that all such cases be included in the computation of fit statistics because model comparisons must be based on exactly the same sets of cases for every model under consideration, regardless of which modeling nodes are used.

Fit Statistics

The output data sets containing fit statistics produced by the Regression node and the Decision Tree node have only one record. Since the Neural Network node can analyze multiple target variables, it produces one record for each target variable and one record for the overall fit; the variable called _NAME_ indicates which target variable the statistics are for.
The fit statistics for training data generally include the following variables, computed from the sum of frequencies and ordinary residuals:
Variables Included in Fit Statistics for Training Data

| Name | Label |
|------|-------|
| _NOBS_ | Sum of Frequencies |
| _DFT_ | Total Degrees of Freedom |
| _DIV_ | Divisor for ASE |
| _ASE_ | Train: Average Squared Error |
| _MAX_ | Train: Maximum Absolute Error |
| _RASE_ | Train: Root Average Squared Error |
| _SSE_ | Train: Sum of Squared Errors |
Note that _DFT_, _DIV_, and _NOBS_ can all differ from one another when the target variable is categorical.
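The squared-error statistics are related in the usual way: ASE is the sum of squared errors divided by its divisor, and RASE is the square root of ASE. A sketch, assuming a hypothetical fit-statistics data set named FITSTATS:

   data _null_;
      set fitstats;                    /* hypothetical fit-statistics data set */
      ase_check  = _SSE_ / _DIV_;      /* should match _ASE_ */
      rase_check = sqrt(ase_check);    /* should match _RASE_ */
      put ase_check= rase_check=;
   run;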
The following fit statistics are computed according to the error function (such as squared error, deviance, or negative log likelihood) that was minimized:
Fit Statistics Computed According to the Error Function

| Name | Label |
|------|-------|
| _AIC_ | Train: Akaike's Information Criterion |
| _AVERR_ | Train: Average Error Function |
| _ERR_ | Train: Error Function |
| _SBC_ | Train: Schwarz's Bayesian Criterion |
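In their usual deviance-form definitions, AIC adds twice the number of estimated parameters to the error function, and SBC adds the number of parameters times the natural log of the number of cases. The following is a hedged sketch only; the exact penalty a particular node uses may differ, and the parameter count here is hypothetical:

   data _null_;
      set fitstats;                          /* hypothetical fit-statistics data set */
      p = 10;                                /* hypothetical number of estimated parameters */
      aic_check = _ERR_ + 2 * p;             /* AIC in deviance form */
      sbc_check = _ERR_ + p * log(_NOBS_);   /* SBC in deviance form */
      put aic_check= sbc_check=;
   run;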
For a categorical target variable, the following statistics are also computed:
Additional Statistics Computed for a Categorical Target Variable

| Name | Label |
|------|-------|
| _MISC_ | Train: Misclassification Rate |
| _WRONG_ | Train: Number of Wrong Classifications |
When decision processing is done, the statistics in the following table are also computed for the training set. The profit variables are computed for a profit or revenue matrix, and the loss variables are computed for a loss matrix:
Additional Statistics Computed for Decision Processing

| Name | Label |
|------|-------|
| _PROF_ | Train: Total Profit |
| _APROF_ | Train: Average Profit |
| _LOSS_ | Train: Total Loss |
| _ALOSS_ | Train: Average Loss |
For a validation data set, the variable names contain a V following the first underscore; for a test data set, they contain a T following the first underscore. Not all of the fit statistics are appropriate for validation and test sets, and adjustments for model degrees of freedom are not applicable, so the ASE and the MSE are identical. For a validation set, the following fit statistics are computed:
Fit Statistics Computed for a Validation Data Set

| Name | Label |
|------|-------|
| _VASE_ | Valid: Average Squared Error |
| _VAVERR_ | Valid: Average Error Function |
| _VDIV_ | Valid: Divisor for ASE |
| _VERR_ | Valid: Error Function |
| _VMAX_ | Valid: Maximum Absolute Error |
| _VMSE_ | Valid: Mean Squared Error |
| _VNOBS_ | Valid: Sum of Frequencies |
| _VRASE_ | Valid: Root Average Squared Error |
| _VRMSE_ | Valid: Root Mean Square Error |
| _VSSE_ | Valid: Sum of Squared Errors |
For a validation set and a categorical target variable, the following fit statistics are computed:
Fit Statistics Computed for a Validation Data Set with a Categorical Target Variable

| Name | Label |
|------|-------|
| _VMISC_ | Valid: Misclassification Rate |
| _VWRONG_ | Valid: Number of Wrong Classifications |
When decision processing is done, the following statistics are also computed for the validation set:
Additional Statistics Computed for Decision Processing

| Name | Label |
|------|-------|
| _VPROF_ | Valid: Total Profit |
| _VAPROF_ | Valid: Average Profit |
| _VLOSS_ | Valid: Total Loss |
| _VALOSS_ | Valid: Average Loss |
Cross-validation statistics are named in the same way except that the letter X appears in place of the V. These statistics appear in the same data set or data sets as the fit statistics for the training data. For a test set, the following fit statistics are computed:
Fit Statistics Computed for a Test Data Set

| Name | Label |
|------|-------|
| _TASE_ | Test: Average Squared Error |
| _TAVERR_ | Test: Average Error Function |
| _TDIV_ | Test: Divisor for ASE |
| _TERR_ | Test: Error Function |
| _TMAX_ | Test: Maximum Absolute Error |
| _TMSE_ | Test: Mean Squared Error |
| _TNOBS_ | Test: Sum of Frequencies |
| _TRASE_ | Test: Root Average Squared Error |
| _TRMSE_ | Test: Root Mean Square Error |
| _TSSE_ | Test: Sum of Squared Errors |
For a test data set and a categorical target variable, the following fit statistics are computed:
Fit Statistics Computed for a Test Data Set with a Categorical Target Variable

| Name | Label |
|------|-------|
| _TMISC_ | Test: Misclassification Rate |
| _TMISL_ | Test: Lower 95% Confidence Limit for TMISC |
| _TMISU_ | Test: Upper 95% Confidence Limit for TMISC |
| _TWRONG_ | Test: Number of Wrong Classifications |
When decision processing is done, the following statistics are also computed for the test set:
Fit Statistics Computed for Test Data Sets Using Decision Processing

| Name | Label |
|------|-------|
| _TPROF_ | Test: Total Profit |
| _TAPROF_ | Test: Average Profit |
| _TLOSS_ | Test: Total Loss |
| _TALOSS_ | Test: Average Loss |