In data mining, a strategy for assessing the quality of model generalization is to partition the data source. A portion of the data, called the training data set, is used for preliminary model fitting. The rest is reserved for empirical validation and is often split into two parts: validation data and test data. The validation data set is used to prevent a modeling node from overfitting the training data and to compare models. The test data set is used for a final assessment of the model.
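The three-way partition described above can be sketched outside of SAS Enterprise Miner in a few lines of Python. This is only an illustration of the idea, not how the Data Partition node is implemented. The function name partition, the seed value, and the 40/30/30 percentages are assumptions chosen for the example.

```python
import random

def partition(rows, train_pct, validation_pct, test_pct, seed=12345):
    """Randomly split rows into training, validation, and test lists.

    Hypothetical helper for illustration; percentages must sum to 100.
    """
    if abs(train_pct + validation_pct + test_pct - 100.0) > 1e-9:
        raise ValueError("percentages must sum to 100")
    shuffled = list(rows)
    # A fixed seed makes the random partition repeatable.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * train_pct / 100.0)
    n_valid = round(n * validation_pct / 100.0)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    # Whatever remains after training and validation becomes test data.
    test = shuffled[n_train + n_valid:]
    return train, validation, test

# Arbitrary example: 1000 row identifiers split 40/30/30.
train, validation, test = partition(range(1000), 40.0, 30.0, 30.0)
```

Each row lands in exactly one of the three lists, so a model fitted on train never sees the rows that are later used to compare models or to make the final assessment.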
Note: In SAS Enterprise Miner, the default data partitioning method for class target variables is to stratify on the target variable or variables. This method is appropriate for this sample data because the input data contains a large number of non-donors relative to the number of donors. Stratifying ensures that both non-donors and donors are well represented in each data partition.
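To see why stratifying helps with an imbalanced target, the following Python sketch splits each target class separately so that every partition keeps the same class proportions. It is a minimal illustration under stated assumptions, not the algorithm that SAS Enterprise Miner uses. The function stratified_split, the field name donor, and the 50 donors in 1000 records are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(records, target, train_fraction, seed=12345):
    """Split records into training and validation lists, class by class.

    Hypothetical helper: each target class is shuffled and cut
    separately, so both lists keep the original class proportions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[target]].append(rec)
    train, validation = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        validation.extend(members[cut:])
    return train, validation

# Made-up imbalanced data: 50 donors and 950 non-donors.
records = [{"id": i, "donor": 1 if i < 50 else 0} for i in range(1000)]
train, validation = stratified_split(records, "donor", 0.5)

train_donors = sum(rec["donor"] for rec in train)
validation_donors = sum(rec["donor"] for rec in validation)
```

Because the rare class is divided on its own, neither partition can end up with too few donors by chance, which is the risk with a purely random split of imbalanced data.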
To use the Data Partition node to partition the input data into training and validation sets:
- Select the Sample tab on the Toolbar.
- Select the Data Partition node icon. Drag the node into the Diagram Workspace.
- Connect the StatExplore node to the Data Partition node.
- Select the Data Partition node. In the Properties Panel, scroll down to view the data set allocations in the Train properties.
  - Click the value of Training, and enter 55.0.
  - Click the value of Validation, and enter 45.0.
  - Click the value of Test, and enter 0.0.
  These properties define the percentage of input data that is used in each type of mining data set. In this example, you use a training data set and a validation data set, but you do not use a test data set.
- In the Diagram Workspace, right-click the Data Partition node, and select Run from the resulting menu. Click Yes in the Confirmation window that opens.
- In the window that appears when processing completes, click OK.
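As a rough check on the allocations entered in the steps above, the following Python sketch converts the Training, Validation, and Test percentages into row counts. The function allocation_counts and the 20000-row input size are hypothetical, and assigning any leftover rows to the training set is an assumption of this sketch. The source does not say how SAS Enterprise Miner handles rounding.

```python
def allocation_counts(n_rows, training=55.0, validation=45.0, test=0.0):
    """Convert partition percentages into row counts.

    Hypothetical helper; defaults mirror the values entered in the
    Train properties in this example.
    """
    percentages = {"Training": training, "Validation": validation, "Test": test}
    if any(pct < 0 for pct in percentages.values()):
        raise ValueError("percentages must not be negative")
    if abs(sum(percentages.values()) - 100.0) > 1e-9:
        raise ValueError("percentages must sum to 100")
    # Round each count down, then hand any leftover rows to Training
    # (an assumption of this sketch, not documented SAS behavior).
    counts = {name: int(n_rows * pct // 100) for name, pct in percentages.items()}
    counts["Training"] += n_rows - sum(counts.values())
    return counts

# Made-up input size of 20000 rows.
counts = allocation_counts(20000)
```

With these settings the Test count is zero, which matches the example: the model is fitted on the training data and compared on the validation data, and no rows are held back for a final test.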