When you have sufficient data, you can subdivide your data into three parts, called the training, validation, and test data. During the selection process, models are fit on the training data, and the prediction error for the models is found by using the validation data. This prediction error on the validation data can be used to decide when to terminate the selection process or what effects to include as the selection process proceeds. Finally, after you have obtained a selected model, you can use the test set to assess how the selected model generalizes on data that played no role in selecting the model.
In some cases, you might want to use only training and test data. For example, you might decide to use an information criterion to decide what effects to include and when to terminate the selection process. In this case, no validation data are required, but test data can still be useful in assessing the predictive performance of the selected model. In other cases, you might decide to use validation data during the selection process but forgo assessing the selected model on test data. Hastie, Tibshirani, and Friedman (2001) state that it is difficult to give a general rule for how many observations you should assign to each role. They note that a typical split might be 50% for training and 25% each for validation and testing.
You use a PARTITION statement to logically subdivide the DATA= data set into separate roles. You can name the fractions of the data that you want to reserve as test data and validation data. For example, the following statements randomly subdivide the inData data set, reserving 50% for training and 25% each for validation and testing:
proc hpquantselect data=inData;
   partition fraction(test=0.25 validate=0.25);
   ...
run;
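If, as discussed earlier, you use an information criterion to guide the selection and want to reserve only training and test data, one possibility is to omit the VALIDATE= fraction from the PARTITION statement. The following sketch (which again assumes the inData data set) reserves 25% of the observations for testing and leaves the remaining 75% for training:

proc hpquantselect data=inData;
   partition fraction(test=0.25);   /* no validation data are reserved */
   ...
run;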
In some cases, you might need to exercise more control over the partitioning of the input data set. You can do this by naming both a variable in the input data set and a formatted value of that variable that correspond to each role. For example, the following statements assign roles to the observations in the inData data set based on the value of the variable Group in that data set. Observations in which the value of Group is 'group 1' are assigned to testing, and those in which the value is 'group 2' are assigned to training. All other observations are ignored.
proc hpquantselect data=inData;
   partition roleVar=Group(test='group 1' train='group 2');
   ...
run;
After you reserve observations for training, validation, and testing, a model fit on the training data is scored on the validation and test data, and the average check loss (ACL) is computed separately for each of these subsets. The ACL for each data role is the sum of the check losses for observations in that role divided by the number of observations in that role.
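The check loss that underlies the ACL depends on the quantile level: for a residual r, it equals r*tau when r >= 0 and r*(tau - 1) when r < 0. The following DATA step is a minimal sketch of how the ACL could be computed by hand for quantile level tau = 0.5. It assumes a scored data set named scored that contains the response y, a predicted value Pred, and a role indicator Role; these names are placeholders for illustration, not output that the procedure produces automatically.

data checkloss;
   set scored;
   tau  = 0.5;                      /* quantile level of the fitted model */
   r    = y - Pred;                 /* residual                           */
   loss = r * (tau - (r < 0));      /* check loss: r*tau or r*(tau - 1)   */
run;

proc means data=checkloss mean;     /* mean check loss = ACL              */
   class Role;                      /* computed separately for each role  */
   var loss;
run;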