When you have sufficient data, you can partition it into three parts: training data,
validation data, and test data. During the selection process, models
are fit on the training data, and the prediction error for the model
is determined using the validation data. This prediction error can
be used to decide when to terminate the selection process or which
effects to include as the selection process proceeds. Finally, after
a model is selected, the test data can be used to assess how well the
selected model generalizes to data that played no role in selecting it.
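
The roles of the three partitions can be illustrated with a minimal Python sketch. It is a deliberately simplified stand-in for a selection process, not the method that any particular procedure uses: each candidate effect is fit alone as a straight line on the training data, the candidate with the lowest average squared error on the validation data is selected, and the test data is used only once at the end. The function names (`fit_line`, `average_squared_error`, `select_effect`) and the row layout (a list of dictionaries) are assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b).

    Assumes xs is not constant (otherwise the slope is undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def average_squared_error(model, xs, ys):
    """Mean of squared prediction errors for a fitted (a, b) model."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_effect(train, validate, test, target, candidates):
    """Pick one effect by validation error, then score it on test data."""
    best = None
    for name in candidates:
        # Fit on training data only.
        model = fit_line([r[name] for r in train], [r[target] for r in train])
        # Compare candidates on validation data only.
        err = average_squared_error(
            model, [r[name] for r in validate], [r[target] for r in validate]
        )
        if best is None or err < best[1]:
            best = (name, err, model)
    name, validation_error, model = best
    # The test data plays no role in the choice above.
    test_error = average_squared_error(
        model, [r[name] for r in test], [r[target] for r in test]
    )
    return name, validation_error, test_error
```

Because the test rows are never consulted while choosing among candidates, the returned test error is an estimate of how the chosen model behaves on data it has not influenced.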
You can partition your data in either of these ways:
- You can specify the proportion of the input data to use for
validation and for testing. The input data is divided by random
sampling according to these proportions. You can also specify a random
seed, which determines the starting point of the random sampling.
- If the input data set contains a variable whose values indicate
whether an observation is a validation or test case, you can specify
that variable to use when partitioning the data. When you specify the
variable, you also select the values that identify validation and
test cases. The input data set is divided into partitions by using
these values.
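
The two partitioning approaches can be sketched in Python as follows. This is an illustration under stated assumptions rather than the product's implementation: the function and parameter names (`partition_by_fraction`, `partition_by_variable`, `validate`, `test`, `seed`, `role_var`) are hypothetical, rows are represented as dictionaries, and any observation that is not assigned to validation or test is assumed to be a training case.

```python
import random


def partition_by_fraction(rows, validate=0.2, test=0.2, seed=12345):
    """Randomly assign each row to a partition using the given proportions."""
    if validate < 0 or test < 0 or validate + test >= 1:
        raise ValueError("proportions must be non-negative and sum to less than 1")
    rng = random.Random(seed)  # the seed fixes where the random stream starts
    parts = {"train": [], "validate": [], "test": []}
    for row in rows:
        u = rng.random()
        if u < validate:
            parts["validate"].append(row)
        elif u < validate + test:
            parts["test"].append(row)
        else:
            parts["train"].append(row)  # assumption: the remainder is training
    return parts


def partition_by_variable(rows, role_var, validate_value, test_value):
    """Assign rows to partitions from the values of an existing variable."""
    parts = {"train": [], "validate": [], "test": []}
    for row in rows:
        value = row[role_var]
        if value == validate_value:
            parts["validate"].append(row)
        elif value == test_value:
            parts["test"].append(row)
        else:
            parts["train"].append(row)  # assumption: other values mean training
    return parts
```

With sampling of this kind, the partition sizes only approximate the requested proportions, and reusing the same seed on the same input reproduces the same split. Partitioning by a variable is deterministic and is useful when the roles were decided before the analysis.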