When you have sufficient data, you can subdivide your data into three parts, which are called the training, validation, and test data. For a single model fit or during the model selection process, models are fit and selected based on the training data. After a model has been fit, the validation and test sets can be used to assess how the selected model generalizes to data that played no role in selecting the model. Validation data can also play a role in selection itself: Hastie, Tibshirani, and Friedman (2001) advocate using validation data in the model selection process to determine which effects to include in each step and when to terminate the selection process. PROC HPGENSELECT does not currently use validation data in this way, so the validation and test data subsets are equivalent.
You can use validation and test data to score data that were not used in fitting the model. Statistics in an output data set that is created by an OUTPUT statement are computed for validation and test data, using the model fit based on the training data. You can use the ROLE option in an OUTPUT statement to add a variable (named Role by default) to an output data set to indicate the role played by each observation.
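As a minimal sketch of how such a request might look, the following statements ask for predicted values and the role variable in an output data set. The data set name Scored, the response y, the regressors x1 and x2, and the variable name Pred are hypothetical. The PREDICTED= keyword, the ID statement, and the DISTRIBUTION= option are assumed here to be available as in other high-performance procedures, so check the syntax reference for your release before relying on them.

```sas
proc hpgenselect data=inData;
   partition fraction(test=0.25 validate=0.25);
   model y = x1 x2 / distribution=normal;   /* hypothetical response and regressors */
   id y;                                    /* assumed: copies y to the output data set */
   output out=Scored predicted=Pred role;   /* ROLE adds the role variable under its default name */
run;
```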
You use a PARTITION statement to logically subdivide the DATA= data set into separate roles. You can specify the fractions of the data that you want to reserve as test data and validation data. For example, the following statements randomly subdivide the inData data set, reserving 50% for training and 25% each for validation and testing:
proc hpgenselect data=inData;
   partition fraction(test=0.25 validate=0.25);
   ...
run;
In some cases you might need to exercise more control over the partitioning of the input data set. You can do this by naming both a variable in the input data set and a formatted value of that variable that corresponds to each role. For example, the following statements assign roles to the observations in the inData data set based on the value of the variable Group in that data set. Observations whose value of Group is 'group 1' are assigned to testing, and those whose value is 'group 2' are assigned to training. All other observations are ignored.
proc hpgenselect data=inData;
   partition roleVar=Group(test='group 1' train='group 2');
   ...
run;
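If the input data set does not already contain a suitable variable, one way to create it is with a DATA step before the procedure is called. The following sketch assumes a hypothetical source data set named rawData and a hypothetical seed value. It assigns roughly 25% of the observations to 'group 1' and roughly 50% to 'group 2', so that the PARTITION statement above can be used unchanged. The remaining observations receive a value that matches no role and are therefore ignored.

```sas
data inData;
   set rawData;                   /* rawData is a hypothetical source data set */
   length Group $ 7;              /* long enough to hold 'group 1' */
   call streaminit(12345);        /* hypothetical seed, for a reproducible split */
   u = rand('uniform');           /* uniform random number between 0 and 1 */
   if u < 0.25 then Group = 'group 1';        /* assigned to testing */
   else if u < 0.75 then Group = 'group 2';   /* assigned to training */
   else Group = 'group 3';                    /* matches no role, so ignored */
   drop u;
run;
```

Because the split is driven by a stored variable rather than by a random draw inside the procedure, the same observations play the same roles each time the model is refit.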
When you have reserved observations for training, validation, and testing, a model that is fit on the training data is scored on the validation and test data, and fit statistics, including the average squared error (ASE), are computed separately for each of these subsets. The ASE for each data role is the sum of the squared differences between the responses and the predictions for observations in that role divided by the number of observations in that role.
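The ASE definition can be checked by hand from a scored data set. The following is a sketch only, not a description of how the procedure computes its fit statistics. It assumes a hypothetical data set named Scored that contains the response y, a predicted value Pred, and a role variable Role.

```sas
/* Scored, y, Pred, and Role are hypothetical names for a scored data set */
data SquaredErrors;
   set Scored;
   SqErr = (y - Pred)**2;    /* squared difference between response and prediction */
run;

proc means data=SquaredErrors n mean;
   class Role;               /* one row of statistics for each data role */
   var SqErr;                /* the mean of SqErr within a role is that role's ASE */
run;
```

The N column gives the number of observations in each role, and the Mean column gives the sum of squared differences divided by that count.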