Partition the Data

In data mining, a strategy for assessing the quality of model generalization is to partition the data source. A portion of the data, called the training data set, is used for preliminary model fitting. The rest is reserved for empirical validation and is often split into two parts: validation data and test data. The validation data set is used to prevent a modeling node from overfitting the training data and to compare models. The test data set is used for a final assessment of the model.
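Note: In SAS Enterprise Miner, the default data partitioning method for class target variables is to stratify on the target variable or variables, so that each partition keeps approximately the same proportion of target levels as the input data.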
To use the Data Partition node to partition the input data into training and validation sets:
  1. Select the Sample tab on the Toolbar.
  2. Select the Data Partition node icon. Drag the node into the Diagram Workspace.
  3. Connect the CS_ACCEPTS node to the Data Partition node.
    [Figure: initial process flow diagram]
  4. Select the Data Partition node. In the Properties Panel, scroll down to view the data set allocations in the Train properties.
    • Click the value of Training, and enter 70.0.
    • Click the value of Validation, and enter 30.0.
    • Click the value of Test, and enter 0.0.
    These properties define the percentage of input data that is used in each type of mining data set. In this example, you use a training data set and a validation data set, but you do not use a test data set.
  5. In the Diagram Workspace, right-click the Data Partition node, and select Run from the resulting menu. Click Yes in the confirmation window that opens.
  6. In the Run Status window, click OK.
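
For readers who want to see what a stratified 70/30 partition does to the data, the following Python sketch illustrates the idea outside of SAS Enterprise Miner. It is not the Data Partition node's actual algorithm. The function name stratified_partition, the column name "target", the seed value, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_partition(rows, target, train_pct=70.0, validation_pct=30.0,
                         test_pct=0.0, seed=12345):
    """Split rows into train/validation/test sets, stratifying on a class target.

    Illustrative sketch only; not the SAS Enterprise Miner implementation.
    """
    if abs(train_pct + validation_pct + test_pct - 100.0) > 1e-9:
        raise ValueError("allocations must sum to 100")

    # Group rows by target level so that each level is split separately.
    strata = defaultdict(list)
    for row in rows:
        strata[row[target]].append(row)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    parts = {"train": [], "validation": [], "test": []}
    for level_rows in strata.values():
        shuffled = level_rows[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        n_train = round(n * train_pct / 100.0)
        if test_pct > 0:
            n_valid = min(round(n * validation_pct / 100.0), n - n_train)
        else:
            # With no test allocation, validation takes everything left over.
            n_valid = n - n_train
        parts["train"].extend(shuffled[:n_train])
        parts["validation"].extend(shuffled[n_train:n_train + n_valid])
        parts["test"].extend(shuffled[n_train + n_valid:])
    return parts


# Hypothetical toy data: 100 rows, 20 with target level 1 and 80 with level 0.
rows = [{"id": i, "target": 1 if i < 20 else 0} for i in range(100)]
parts = stratified_partition(rows, "target")

for name, subset in parts.items():
    events = sum(r["target"] for r in subset)
    print(name, len(subset), "rows,", events, "with target = 1")
```

Because each target level is split separately, the training and validation sets in this toy example both keep the 20% share of target level 1 found in the input data, and the test set is empty, matching the 70.0 / 30.0 / 0.0 allocations entered in step 4.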