Working with Nodes That Sample, Explore, and Modify

Partition the Raw Data

In data mining, one strategy for assessing model generalization is to partition the data source. A portion of the data, called the training data, is used for preliminary model fitting. The rest, called the hold-out sample, is reserved for empirical validation. The hold-out sample is often split into two parts: validation data and test data. The validation data is used to fine-tune the model, which prevents a modeling node from overfitting the training data, and to compare prediction models. The test data is used for a final assessment of the chosen model.

Enterprise Miner can partition your data in several ways. Choose one of the following methods.

By default, SAS Enterprise Miner uses either simple random sampling or stratified sampling, depending on your target. If your target is a class variable, SAS Enterprise Miner stratifies the sample on the class target. Otherwise, it uses simple random sampling.
If you specify simple random sampling, every observation in the data set has the same probability of being included in the sample.
If you specify simple cluster sampling, SAS Enterprise Miner samples from a cluster of observations that are similar in some way.
If you specify stratified sampling, you identify variables in your data set to form strata of the total population. SAS Enterprise Miner samples from each stratum so that the strata proportions of the total population are preserved in each sample.
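
The difference between simple random sampling and stratified sampling can be illustrated outside of SAS Enterprise Miner. The following Python sketch is only an illustration of the two ideas described above, not the algorithm that the Data Partition node uses. The function names, the `target` field, the seed, and the generated data (5% donors) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, fraction, seed=12345):
    """Every row has the same probability of being included in the sample."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)


def stratified_sample(rows, key, fraction, seed=12345):
    """Draw the same fraction from each stratum to preserve strata proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)  # group rows by stratum value
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 50 donors (target = 1) and 950 non-donors (target = 0).
rows = [{"id": i, "target": 1 if i < 50 else 0} for i in range(1000)]

random_sample = simple_random_sample(rows, fraction=0.2)
strat_sample = stratified_sample(rows, key=lambda r: r["target"], fraction=0.2)

# The stratified sample keeps the 5% donor share of the full data set.
donor_share = sum(r["target"] for r in strat_sample) / len(strat_sample)
print(len(random_sample), len(strat_sample), donor_share)
```

In the simple random sample, the donor share can drift away from 5% by chance. In the stratified sample, it is fixed by construction, which matters most when one class is rare.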

In this task, you use the Data Partition node to partition your data.

Select the Sample tab from the node toolbar at the top left of the application. Drag a Data Partition node from the toolbar into the Diagram Workspace.
Connect the DONOR_RAW_DATA Data Source node to the Data Partition node.
Select the Data Partition node in the Diagram Workspace. Details about data partitioning appear in the Properties panel.

Note: If the target variable is a class variable, the default partitioning method that SAS Enterprise Miner uses is stratified sampling. Otherwise, the default partitioning method is simple random sampling.
In the Data Set Percentages section of the Properties panel, set the following values:
- Set Training to 55.
- Set Validation to 45.
- Set Test to 0.
In the Data Set Percentages section of the Properties panel, the values for the Training, Validation, and Test properties specify how the original data set is allocated proportionally among the three partitions. Each percentage can be any real number between 0 and 100, as long as the three percentages sum to 100.

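Note: Because the target in this example is a class variable, the Data Partition node partitions the data by stratifying on the target variable by default. Stratifying is appropriate here because there are few donors relative to non-donors, and stratification preserves the donor proportion in each partition.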
Run the Data Partition node.
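
For readers who want to see what such a partition amounts to, the following Python sketch splits a data set 55/45/0 while stratifying on a class target. It is a minimal illustration under stated assumptions, not the implementation of the Data Partition node. The `partition` function, the `target` field, the seed, and the generated data (10% donors) are hypothetical.

```python
import random
from collections import defaultdict


def partition(rows, target, train_pct, valid_pct, test_pct, seed=12345):
    """Split rows into training, validation, and test lists, stratified on `target`."""
    if min(train_pct, valid_pct, test_pct) < 0:
        raise ValueError("percentages must be between 0 and 100")
    if abs(train_pct + valid_pct + test_pct - 100) > 1e-9:
        raise ValueError("percentages must sum to 100")

    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[target]].append(row)  # one stratum per target level

    train, valid, test = [], [], []
    for members in strata.values():
        members = list(members)  # copy so the caller's data is not reordered
        rng.shuffle(members)
        n = len(members)
        # Cumulative cut points guarantee that every row lands in exactly one partition.
        cut1 = round(n * train_pct / 100)
        cut2 = round(n * (train_pct + valid_pct) / 100)
        train.extend(members[:cut1])
        valid.extend(members[cut1:cut2])
        test.extend(members[cut2:])
    return train, valid, test


# Hypothetical data: 100 donors (target = 1) and 900 non-donors (target = 0).
rows = [{"id": i, "target": 1 if i < 100 else 0} for i in range(1000)]

train, valid, test = partition(rows, "target", 55, 45, 0)
print(len(train), len(valid), len(test))
print(sum(r["target"] for r in train) / len(train))
```

Because the split is made within each target level, the training and validation partitions both keep the donor share of the full data set. Setting Test to 0 leaves the test partition empty.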
