Partitioning Data

When you have sufficient data, you can partition it into three parts: training data, validation data, and test data. During the selection process, models are fit on the training data, and the prediction error of each model is computed on the validation data. This prediction error can be used to decide when to terminate the selection process or which effects to include as the selection process proceeds. Finally, after a model is selected, the test data can be used to assess how well the selected model generalizes to data that played no role in selecting it.
You can partition your data in either of these ways:
  • You can specify the proportion of the input data to use for validation or testing. The input data is divided by random sampling according to these proportions. You can also specify whether to use a random seed to initialize the sampling.
  • If the input data set contains a variable whose values indicate whether an observation is a validation or test case, you can specify the variable to use when partitioning the data. When you specify the variable, you also select the appropriate values for validation or test cases. The input data set is divided into partitions by using these values.
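The two approaches above can be illustrated with a minimal Python sketch. It is not the software's own implementation: the function names (partition_by_proportion, partition_by_variable), the per-observation sampling scheme, and the sample "role" variable with values "V" and "T" are all assumptions made for illustration. Because each observation is assigned independently in this sketch, the partition sizes only approximate the requested proportions.

```python
import random


def partition_by_proportion(rows, validate=0.0, test=0.0, seed=None):
    """Randomly assign each row to 'train', 'validate', or 'test'.

    Hypothetical helper: each row is assigned independently, so the
    resulting sizes are approximate, not exact.
    """
    if validate < 0 or test < 0 or validate + test >= 1:
        raise ValueError("proportions must be non-negative and sum to less than 1")
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    parts = {"train": [], "validate": [], "test": []}
    for row in rows:
        u = rng.random()
        if u < validate:
            parts["validate"].append(row)
        elif u < validate + test:
            parts["test"].append(row)
        else:
            parts["train"].append(row)
    return parts


def partition_by_variable(rows, role_var, validate_value=None, test_value=None):
    """Assign rows by the value of an existing variable (hypothetical helper).

    Rows whose value matches neither the validation value nor the test
    value are treated as training data.
    """
    parts = {"train": [], "validate": [], "test": []}
    for row in rows:
        value = row[role_var]
        if validate_value is not None and value == validate_value:
            parts["validate"].append(row)
        elif test_value is not None and value == test_value:
            parts["test"].append(row)
        else:
            parts["train"].append(row)
    return parts


# Illustrative data: 100 observations with an assumed "role" variable.
rows = [
    {"id": i, "role": "V" if i % 5 == 0 else "T" if i % 5 == 1 else "R"}
    for i in range(100)
]

by_prop = partition_by_proportion(rows, validate=0.3, test=0.2, seed=12345)
by_var = partition_by_variable(rows, "role", validate_value="V", test_value="T")
```

In this sketch, calling partition_by_proportion twice with the same seed yields the same partitions, while omitting the seed gives a different split on each run.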