In data mining, a strategy for assessing the quality of model
generalization is to partition the data source. A portion of the data,
called the training data set, is used for preliminary model fitting.
The rest is reserved for empirical validation and is often split into
two parts: validation data and test data. The validation data set is
used to prevent a modeling node from overfitting the training data and
to compare models. The test data set is used for a final assessment of
the model.
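The roles of the three data sets can be illustrated outside SAS Enterprise Miner with a minimal Python sketch. Everything in it is an assumption made for illustration: the synthetic data, the two candidate models (`fit_mean` and `fit_line`), and the 50/25/25 split. Candidates are fitted on the training data, compared on the validation data, and only the selected model is scored on the test data.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate 2: least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def mse(model, xs, ys):
    """Mean squared error of a fitted model on one partition."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def columns(part):
    """Split a list of (x, y) pairs into separate x and y lists."""
    return [x for x, _ in part], [y for _, y in part]

# Hypothetical data: y is roughly 2x + 1 with a little noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))

# The rows are already in random order, so slicing gives a random partition.
train, valid, test = data[:150], data[150:225], data[225:]

candidates = {"mean": fit_mean, "line": fit_line}
fitted = {name: fit(*columns(train)) for name, fit in candidates.items()}

# Compare models on validation data, never on the data they were fitted to.
valid_mse = {name: mse(model, *columns(valid)) for name, model in fitted.items()}
best = min(valid_mse, key=valid_mse.get)

# The test data is touched once, for the final assessment of the winner.
test_mse = mse(fitted[best], *columns(test))
print(best, test_mse)
```

Because the test data plays no part in fitting or in model selection, its error is a fairer estimate of how the chosen model generalizes.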
Note: In SAS Enterprise Miner, the default data partitioning method for
class target variables is to stratify on the target variable or
variables. This method is appropriate for this sample data because
non-donors greatly outnumber donors in the input data. Stratifying
ensures that both non-donors and donors are well represented in each
data partition.
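To see why stratifying helps with an unbalanced target, consider a rough Python sketch of a stratified split. It is not the algorithm that SAS Enterprise Miner uses internally. The function `stratified_split`, the `donor` field, the seed, and the 960/40 class counts are all hypothetical. Each class is shuffled and split separately, so every partition keeps the same donor proportion as the input.

```python
import random
from collections import defaultdict

def stratified_split(rows, target, train_pct, seed=12345):
    """Split rows into (train, validation), stratifying on row[target]."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, valid = [], []
    for members in by_class.values():
        rng.shuffle(members)  # random order within each class
        cut = round(len(members) * train_pct / 100.0)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Hypothetical input: 40 donors (1) among 1,000 rows, a 4% donor rate.
rows = [{"id": i, "donor": 1 if i < 40 else 0} for i in range(1000)]
train, valid = stratified_split(rows, "donor", 55.0)

donors_train = sum(r["donor"] for r in train)
donors_valid = sum(r["donor"] for r in valid)
print(len(train), donors_train)  # 550 22
print(len(valid), donors_valid)  # 450 18
```

Both partitions keep the 4% donor rate of the input. A simple random split would match it only approximately, and with a rare class the deviation can be large.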
To use the Data Partition node to partition the input data into
training and validation sets:
- Select the Sample tab on the Toolbar.
- Select the Data Partition node icon. Drag the node into the Diagram
  Workspace.
- Connect the StatExplore node to the Data Partition node.
- Select the Data Partition node. In the Properties Panel, scroll down
  to view the data set allocations in the Train properties.
- Click on the value of Training, and enter 55.0.
- Click on the value of Validation, and enter 45.0.
- Click on the value of Test, and enter 0.0.
  These properties define the percentage of input data that is used in
  each type of mining data set. In this example, you use a training
  data set and a validation data set, but you do not use a test data
  set.
- In the Diagram Workspace, right-click the Data Partition node, and
  select Run from the resulting menu. Click Yes in the confirmation
  window that opens.
- In the window that appears when processing completes, click OK.
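The effect of the 55.0 / 45.0 / 0.0 allocation entered above can be sketched in Python as a simple random split. This is an illustration only and does not reproduce the Data Partition node, which stratifies on class targets by default. The function name `partition_rows`, the seed, and the 1,000 row IDs are assumptions.

```python
import random

def partition_rows(rows, train_pct, valid_pct, test_pct, seed=12345):
    """Randomly split rows into (train, validation, test) lists by percentage."""
    if min(train_pct, valid_pct, test_pct) < 0:
        raise ValueError("allocations must be non-negative")
    if abs(train_pct + valid_pct + test_pct - 100.0) > 1e-9:
        raise ValueError("allocations must sum to 100")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(n * train_pct / 100.0)
    # Cap the test count so rounding can never overrun the data.
    n_test = min(round(n * test_pct / 100.0), n - n_train)
    n_valid = n - n_train - n_test  # validation absorbs any rounding remainder

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Hypothetical input: 1,000 row IDs standing in for the input data.
train, valid, test = partition_rows(range(1000), 55.0, 45.0, 0.0)
print(len(train), len(valid), len(test))  # 550 450 0
```

With the Test allocation set to 0.0, every row lands in either the training or the validation partition, which matches the example's choice not to use a test data set.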