Partition the Data

In data mining, a strategy for assessing the quality of model generalization is to partition the data source. A portion of the data, called the training data set, is used for preliminary model fitting. The rest is reserved for empirical validation and is often split into two parts: validation data and test data. The validation data set is used to prevent a modeling node from overfitting the training data and to compare models. The test data set is used for a final assessment of the model.

Note: In SAS Enterprise Miner, the default data partitioning method for class target variables is to stratify on the target variable or variables. This method is appropriate for this sample data because non-donors greatly outnumber donors in the input data. Stratifying ensures that both non-donors and donors are well represented in each data partition.

To use the Data Partition node to partition the input data into training and validation sets:

Select the Sample tab on the Toolbar.
Select the Data Partition node icon. Drag the node into the Diagram Workspace.
Connect the StatExplore node to the Data Partition node.
Select the Data Partition node. In the Properties Panel, scroll down to view the data set allocations in the Train properties.
- Click the value of Training, and enter 55.0.
- Click the value of Validation, and enter 45.0.
- Click the value of Test, and enter 0.0.
These properties define the percentage of input data that is used in each type of mining data set. In this example, you use a training data set and a validation data set, but you do not use a test data set.
In the Diagram Workspace, right-click the Data Partition node, and select Run from the resulting menu. Click Yes in the confirmation window that opens.
In the window that appears when processing completes, click OK.
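
To make the effect of these settings concrete, the following Python sketch imitates a stratified 55/45/0 partition outside of SAS Enterprise Miner. It is an illustration only, not the algorithm that the Data Partition node uses internally. The function stratified_partition, the donor field, the seed value, and the synthetic 2,000-record input (5% donors) are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_partition(records, target_of, fractions, seed=12345):
    """Split records into named partitions, stratifying on the target level.

    fractions maps a partition name to its share of the data (shares sum to 1).
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("partition fractions must sum to 1")

    rng = random.Random(seed)  # fixed seed so the split is repeatable

    # Group the records by target level (for example, donor vs. non-donor).
    strata = defaultdict(list)
    for rec in records:
        strata[target_of(rec)].append(rec)

    names = list(fractions)
    partitions = {name: [] for name in names}

    # Shuffle each stratum, then cut it at the cumulative fractions so that
    # every partition receives the same share of each target level.
    for level_records in strata.values():
        rng.shuffle(level_records)
        n = len(level_records)
        start = 0
        cumulative = 0.0
        for i, name in enumerate(names):
            cumulative += fractions[name]
            is_last = i == len(names) - 1
            end = n if is_last else min(n, round(cumulative * n))
            partitions[name].extend(level_records[start:end])
            start = end
    return partitions


# Synthetic input (hypothetical): 100 donors and 1,900 non-donors.
records = [{"id": i, "donor": 1 if i < 100 else 0} for i in range(2000)]

# Same allocations as entered in the Properties Panel: 55 / 45 / 0.
parts = stratified_partition(
    records,
    target_of=lambda rec: rec["donor"],
    fractions={"train": 0.55, "validation": 0.45, "test": 0.0},
)

for name, rows in parts.items():
    donors = sum(rec["donor"] for rec in rows)
    print(name, len(rows), "records,", donors, "donors")
```

Because each target level is split separately, the donor rate in the training and validation partitions matches the donor rate in the input data. A simple random split without stratification could, by chance, leave one partition with too few donors to fit or validate a model reliably.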