In data mining, a strategy for assessing the quality of model generalization
is to partition the data source. A portion of the data, called the
training data set, is used for preliminary model fitting.
The rest is reserved for empirical validation and is often split into
two parts: validation data and test data. The validation data set is used
to prevent a modeling node from overfitting the training data and to
compare models. The test data set is used for a final assessment of the model.
Note: In SAS Enterprise
Miner, the default data partitioning method for class target variables
is to stratify on the target variable or variables. This method is
appropriate for this sample data because there is a large number of
non-donors in the input data relative to the number of donors. Stratifying
ensures that both non-donors and donors are well-represented in the
data partitions.
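As a rough illustration of what stratifying on the target means, the following Python sketch splits each target class separately so that both partitions keep the same donor rate. It is not the algorithm that SAS Enterprise Miner uses internally. The function name `stratified_split`, the seed, and the 900/100 class counts are assumptions made up for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, valid_idx), splitting each class separately."""
    rng = random.Random(seed)

    # Group row indices by their target value.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Apply the same training fraction inside every class.
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        valid_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical imbalanced target: 900 non-donors (0) and 100 donors (1).
labels = [0] * 900 + [1] * 100
train_idx, valid_idx = stratified_split(labels, train_fraction=0.55)

train_donor_rate = sum(labels[i] for i in train_idx) / len(train_idx)
valid_donor_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), train_donor_rate, valid_donor_rate)
```

With a purely random split, the smaller class could end up over- or under-represented in one partition by chance. Splitting within each class avoids that.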
To use the Data Partition node to partition the input data into training and
validation sets, complete the following steps:
- Select the Sample tab on the Toolbar.
- Select the Data Partition node icon. Drag the node into the Diagram Workspace.
- Connect the DONOR_RAW_DATA input data source node to the Data Partition node.
Note: A single node can be connected to several other nodes. In this example,
the DONOR_RAW_DATA input data source node is connected to both the
StatExplore node and the Data Partition node.
- Select the Data Partition node. In the Properties Panel, scroll down to view
the data set allocations in the Train properties.
- Click the value of Training, and enter 55.0.
- Click the value of Validation, and enter 45.0.
- Click the value of Test, and enter 0.0.
These properties define the percentage of input data that is used in each
type of mining data set. In this example, you use a training data
set and a validation data set, but you do not use a test data set.
- In the Diagram Workspace, right-click the Data Partition node, and select
Run from the resulting menu. Click Yes in the confirmation window that opens.
- In the window that appears when processing completes, click OK.
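To make the allocation percentages from the steps above concrete, the following Python sketch converts the 55.0 / 45.0 / 0.0 settings into approximate row counts. It is only an arithmetic illustration, not SAS Enterprise Miner code. The helper `partition_sizes`, the rule that the allocations must sum to 100, and the 10,000-row input size are assumptions for this example and are not taken from the DONOR_RAW_DATA data source.

```python
def partition_sizes(n_rows, training, validation, test):
    """Convert percentage allocations into row counts that sum to n_rows."""
    percentages = {"training": training, "validation": validation, "test": test}

    # Assumption for this sketch: allocations are non-negative and sum to 100.
    if any(p < 0 for p in percentages.values()):
        raise ValueError("allocations must be non-negative")
    if abs(sum(percentages.values()) - 100.0) > 1e-9:
        raise ValueError("allocations must sum to 100")

    sizes = {name: int(n_rows * p / 100.0) for name, p in percentages.items()}
    # Give any rows lost to truncation to the training partition.
    sizes["training"] += n_rows - sum(sizes.values())
    return sizes

# Hypothetical input size; the allocations match the values entered above.
sizes = partition_sizes(10_000, training=55.0, validation=45.0, test=0.0)
print(sizes)
```

A test allocation of 0.0 yields an empty test partition, which matches the choice in this example to use only training and validation data.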