Working with Nodes That Sample, Explore, and Modify

Partition the Raw Data

In data mining, one strategy for assessing model generalization is to partition the data source. A portion of the data, called the training data, is used for preliminary model fitting. The rest, called the hold-out sample, is reserved for empirical validation. The hold-out sample is often split into two parts: validation data and test data. The validation data is used to fine-tune the model, which prevents a modeling node from overfitting the training data, and to compare prediction models. The test data is used for a final assessment of the chosen model.
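The three-way split described above can be sketched outside Enterprise Miner in a few lines of Python. This is only an illustration of the idea, not the Data Partition node's implementation: the function name `simple_random_partition`, the seed, and the 40/30/30 percentages are assumptions chosen for the example.

```python
import random


def simple_random_partition(rows, train_pct, valid_pct, test_pct, seed=12345):
    """Shuffle rows and cut them into training, validation, and test lists.

    Hypothetical helper for illustration; the percentages must sum to 100.
    """
    if abs(train_pct + valid_pct + test_pct - 100) > 1e-9:
        raise ValueError("percentages must sum to 100")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split

    n = len(shuffled)
    n_train = round(n * train_pct / 100)
    n_valid = round(n * valid_pct / 100)

    train = shuffled[:n_train]                   # used to fit the model
    valid = shuffled[n_train:n_train + n_valid]  # used to tune and compare models
    test = shuffled[n_train + n_valid:]          # used once, for final assessment
    return train, valid, test


# Made-up data: 1000 row identifiers split 40/30/30.
train, valid, test = simple_random_partition(range(1000), 40, 30, 30)
```

Every row lands in exactly one partition, so no observation used to fit the model is also used to assess it.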

Enterprise Miner can partition your data in several ways. In this task, you use the Data Partition node to partition your data.

  1. Select the Sample tab from the node toolbar at the top left of the application. Drag a Data Partition node from the toolbar into the Diagram Workspace.

  2. Connect the DONOR_RAW_DATA Data Source node to the Data Partition node.

  3. Select the Data Partition node in the Diagram Workspace. Details about data partitioning appear in the Properties panel.

    Note: If the target variable is a class variable, the default partitioning method that Enterprise Miner uses is stratification. Otherwise, the default partitioning method is simple random.

  4. In the Properties panel under the Data Set Percentages section, set the following values:

    • set Training to 55

    • set Validation to 45

    • set Test to 0

    In the Data Set Percentages section of the Properties panel, the values for the Training, Validation, and Test properties specify how you want to proportionally allocate the original data set into the three partitions. You can allocate the percentages for each partition by using any real number between 0 and 100, as long as the sum of the three partitions equals 100.

    Note: By default, the Data Partition node partitions the data by stratifying on the target variable. Stratification is a good idea in this case because there are few donors relative to non-donors, and it keeps the proportion of donors roughly the same in each partition.

  5. Run the Data Partition node.
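To make the effect of the settings above concrete, here is a minimal Python sketch of a stratified 55/45/0 partition. It is not Enterprise Miner code and does not claim to reproduce the node's algorithm: the function `stratified_partition`, the `target_of` accessor, and the synthetic data (100 donors among 2000 rows) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_partition(rows, target_of, train_pct, valid_pct, test_pct, seed=12345):
    """Split rows into partitions while preserving the target-level proportions.

    Hypothetical helper: target_of(row) returns the class level of a row.
    """
    pcts = (train_pct, valid_pct, test_pct)
    if min(pcts) < 0 or abs(sum(pcts) - 100) > 1e-9:
        raise ValueError("percentages must be non-negative and sum to 100")

    # Group the rows by target level (one stratum per level).
    strata = defaultdict(list)
    for row in rows:
        strata[target_of(row)].append(row)

    rng = random.Random(seed)
    parts = {"train": [], "validate": [], "test": []}
    for members in strata.values():
        rng.shuffle(members)
        n = len(members)
        n_train = round(n * train_pct / 100)
        # With no test partition, everything left after training goes to validation.
        n_valid = n - n_train if test_pct == 0 else round(n * valid_pct / 100)
        parts["train"] += members[:n_train]
        parts["validate"] += members[n_train:n_train + n_valid]
        parts["test"] += members[n_train + n_valid:]
    return parts


# Synthetic data (an assumption): (row_id, target) pairs with 5% donors.
rows = [(i, 1 if i < 100 else 0) for i in range(2000)]
parts = stratified_partition(rows, lambda r: r[1], 55, 45, 0)

donors_in_train = sum(target for _, target in parts["train"])
donors_in_validate = sum(target for _, target in parts["validate"])
```

Because each target level is split separately, the rare donor class appears in the training and validation partitions in the same 55:45 ratio as the non-donors, which a simple random split does not guarantee.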
