In data mining, a strategy for assessing the quality of model generalization is to partition the data source. A portion of the data, called the training data set, is used for preliminary model fitting. The rest is reserved for empirical validation and is often split into two parts: validation data and test data. The validation data set is used to prevent a modeling node from overfitting the training data and to compare models. The test data set is used for a final assessment of the model.
Note: In SAS Enterprise Miner, the default data partitioning method for class target variables is to stratify on the target variable or variables. This method is appropriate for this sample data because the input data contains far more non-donors than donors. Stratifying ensures that both non-donors and donors are well represented in each data partition.
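The idea behind stratifying can be sketched outside of SAS Enterprise Miner. The following Python example is only an illustration, not the algorithm that the Data Partition node uses: the function name `stratified_partition`, the record layout, and the sample of 100 donors among 2,000 records are all hypothetical. It shuffles and splits each target class separately, so each partition keeps roughly the same donor rate as the input data.

```python
import random
from collections import defaultdict


def stratified_partition(records, target_of, fractions, seed=0):
    """Split records into named partitions, sampling within each target class.

    records   : list of arbitrary items
    target_of : function that returns the class label of a record
    fractions : dict such as {"train": 0.55, "validate": 0.45}
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("partition fractions must sum to 1")

    rng = random.Random(seed)

    # Group the records by class label (for example, donor vs. non-donor).
    by_class = defaultdict(list)
    for rec in records:
        by_class[target_of(rec)].append(rec)

    names = list(fractions)
    partitions = {name: [] for name in names}

    # Shuffle and slice each class independently.
    for members in by_class.values():
        rng.shuffle(members)
        start = 0
        cumulative = 0.0
        for i, name in enumerate(names):
            cumulative += fractions[name]
            # The last partition takes the remainder so no record is dropped.
            if i == len(names) - 1:
                end = len(members)
            else:
                end = round(cumulative * len(members))
            partitions[name].extend(members[start:end])
            start = end
    return partitions


# Hypothetical data: 100 donors among 2,000 records (a 5% donor rate).
records = [{"id": i, "donor": i < 100} for i in range(2000)]
parts = stratified_partition(
    records, lambda r: r["donor"], {"train": 0.55, "validate": 0.45}
)
for name, rows in parts.items():
    donors = sum(r["donor"] for r in rows)
    print(name, len(rows), donors)
```

A simple random split without stratification could, by chance, leave one partition with noticeably fewer donors than the other. Splitting class by class avoids that.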
To use the Data Partition node to partition the input data into training and validation sets:
- Select the Sample tab on the Toolbar.
- Select the Data Partition node icon. Drag the node into the Diagram Workspace.
- Connect the StatExplore node to the Data Partition node.
- Select the Data Partition node. In the Properties Panel, scroll down to view the data set allocations in the Train properties.
  - Click on the value of Training, and enter 55.0.
  - Click on the value of Validation, and enter 45.0.
  - Click on the value of Test, and enter 0.0.
  These properties define the percentage of input data that is used in each type of mining data set. In this example, you use a training data set and a validation data set, but you do not use a test data set.
- In the Diagram Workspace, right-click the Data Partition node, and select Run from the resulting menu. Click Yes in the Confirmation window that opens.
- In the window that appears when processing completes, click OK.
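To see what the allocation values entered in the steps above mean in terms of rows, the following Python sketch converts percentages into row counts. It is an illustration only and does not call any SAS interface. The function name `allocation_counts` and the input sizes are hypothetical, and the check that the three percentages total 100 is an assumption about how the settings are meant to be used.

```python
import math


def allocation_counts(n_rows, percentages):
    """Turn percentage allocations into whole-number row counts."""
    total = sum(percentages.values())
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"allocations sum to {total}, expected 100")

    # Exact (possibly fractional) share of rows for each partition.
    exact = {name: n_rows * pct / 100.0 for name, pct in percentages.items()}
    counts = {name: math.floor(share) for name, share in exact.items()}

    # Hand out rows lost to rounding, largest fractional part first,
    # so a partition set to 0.0 never receives a row.
    leftover = n_rows - sum(counts.values())
    by_fraction = sorted(
        exact, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


# The values entered in the Train properties in this example.
settings = {"Training": 55.0, "Validation": 45.0, "Test": 0.0}

# Hypothetical input sizes; the actual row count depends on your data source.
even_split = allocation_counts(2000, settings)
odd_split = allocation_counts(1001, settings)
print(even_split)
print(odd_split)
```

With Test set to 0.0, every row lands in either the training or the validation partition, which matches the two-partition setup used in this example.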