In data mining, a strategy for assessing the quality of model
generalization is to partition the data source. A portion of the data,
called the training data set, is used for preliminary model fitting.
The rest is reserved for empirical validation and is often split into
two parts: validation data and test data. The validation data set is
used to prevent a modeling node from overfitting the training data and
to compare models. The test data set is used for a final assessment of
the model.
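
Outside of Enterprise Miner, the same idea can be illustrated with a minimal Python sketch. It is not how the Data Partition node is implemented; the function name `partition`, the 60/20/20 percentages, and the seed are hypothetical values chosen only for illustration.

```python
import random

def partition(rows, train_pct=60, valid_pct=20, test_pct=20, seed=42):
    """Randomly split rows into training, validation, and test lists."""
    if train_pct + valid_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    shuffled = list(rows)
    # A fixed seed makes the (hypothetical) split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    return (
        shuffled[:n_train],                   # used to fit the model
        shuffled[n_train:n_train + n_valid],  # used to tune and compare models
        shuffled[n_train + n_valid:],         # held back for final assessment
    )

train_rows, valid_rows, test_rows = partition(range(100))
print(len(train_rows), len(valid_rows), len(test_rows))
```

Every row lands in exactly one of the three lists, so no observation used for fitting is reused for validation or final assessment.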
Note: In SAS Enterprise Miner,
the default data partitioning method for class target variables is
to stratify the target variable or variables.
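
To show what stratifying on a class target means, here is a minimal Python sketch under stated assumptions. It is not the algorithm Enterprise Miner uses; the function name `stratified_split`, the column name `target`, and the 20/80 class mix are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(rows, target, train_pct=70, seed=42):
    """Split rows so each target class keeps about the same share in both parts."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Cut each class separately, so class proportions are preserved.
        cut = len(members) * train_pct // 100
        train.extend(members[:cut])
        validation.extend(members[cut:])
    return train, validation

# Hypothetical data: 20 "bad" and 80 "good" cases.
rows = [{"id": i, "target": "bad" if i < 20 else "good"} for i in range(100)]
train, validation = stratified_split(rows, "target")
bad_in_train = sum(1 for r in train if r["target"] == "bad")
print(len(train), len(validation), bad_in_train)
```

Because each class is split on its own, the rare class cannot end up concentrated in one partition by chance, which is the usual motivation for stratifying a class target.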
To use the Data Partition node to partition the input data into
training and validation sets:
- Select the Sample tab on the Toolbar.
- Select the Data Partition node icon. Drag the node to the Diagram Workspace.
- Connect the CS_ACCEPTS node to the Data Partition node.
- Select the Data Partition node. In the Properties Panel, scroll down to view the data set allocations in the Train properties.
- Click on the value of Training, and enter 70.0.
- Click on the value of Validation, and enter 30.0.
- Click on the value of Test, and enter 0.0.
  These properties define the percentage of input data that is used in each type of mining data set. In this example, you use a training data set and a validation data set, but you do not use a test data set.
- In the Diagram Workspace, right-click the Data Partition node, and select Run from the resulting menu. Click Yes in the confirmation window that opens.
- In the Run Status window, click OK.
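
The allocation percentages entered in the steps above translate into approximate row counts. The following Python sketch shows the arithmetic only. It assumes the three percentages are meant to total 100, and the function name `allocation_counts` and the row count of 3000 are hypothetical; how Enterprise Miner itself rounds the counts is not stated here.

```python
def allocation_counts(n_rows, training=70.0, validation=30.0, test=0.0):
    """Convert partition percentages into approximate row counts."""
    percentages = (training, validation, test)
    if any(p < 0 for p in percentages):
        raise ValueError("percentages must be non-negative")
    # Assumption: the three allocations are expected to total 100.
    if abs(sum(percentages) - 100.0) > 1e-9:
        raise ValueError("percentages must sum to 100")
    n_train = round(n_rows * training / 100)
    n_test = round(n_rows * test / 100)
    # Give validation the remainder so the counts always add up to n_rows.
    n_valid = n_rows - n_train - n_test
    return {"training": n_train, "validation": n_valid, "test": n_test}

# Hypothetical input size; the defaults mirror the 70.0 / 30.0 / 0.0 settings.
counts = allocation_counts(3000)
print(counts)
```

With Test set to 0.0, no rows are held back for a final assessment, so all model comparison in this example relies on the validation data.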