Example Process Flow Diagram

Task 6. Creating Training and Validation Data Sets

In data mining, one strategy for assessing model generalization is to partition the input data source. A portion of the data, called the training data, is used for model fitting. The rest is held out for empirical validation. The hold-out sample itself is often split into two parts: validation data and test data. The validation data is used to help prevent a modeling node from overfitting the training data (model fine-tuning), and to compare prediction models. The test data set is used for a final assessment of the chosen model.
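The partitioning idea described above can be sketched outside the tool in a few lines of Python. This is an illustration only, not how the Data Partition node is implemented: the function name `partition`, the 60/20/20 proportions, and the seed value are assumptions chosen for the example.

```python
import random

def partition(rows, train, validation, test, seed=12345):
    """Shuffle rows and cut them into training, validation, and test lists.

    Hypothetical helper for illustration; the three fractions must sum to 1.
    """
    if abs(train + validation + test - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)      # private generator so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n = len(shuffled)
    # Cumulative cut points, so every row lands in exactly one partition.
    cut_train = round(n * train)
    cut_valid = round(n * (train + validation))
    return (shuffled[:cut_train],
            shuffled[cut_train:cut_valid],
            shuffled[cut_valid:])

# Illustrative proportions: fit on 60%, tune and compare on 20%, assess on 20%.
train_rows, valid_rows, test_rows = partition(range(1000), 0.6, 0.2, 0.2)
```

Setting the test fraction to 0 yields only training and validation lists, which is the case when the input is too small to spare a separate test sample.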

Because the input data source contains only 1,000 customer cases, only training and validation data sets will be created. The validation data set will be used to choose the champion model for screening new credit applicants: the champion is the model that minimizes loss on the validation data. Ideally, you would also create a test data set to obtain a final, unbiased estimate of the generalization error of each model.

To create the training and validation data sets:

  1. Add a Data Partition node to the Diagram Workspace.

  2. Connect the Input Data Source node to the Data Partition node.

  3. Open the configuration interface to the Data Partition node. When the node opens, the Partition tab is displayed.

  4. Allocate 60% of the input data to the training data set and 40% of the input data to the validation data set (type 0 in the Test text box). To replicate the results of this example flow, use the default Random Seed value of 12345.

    [Partition tab of the Data Partition configuration window showing Simple Random method and data allocated 60% Train and 40% Validation.]

  5. Create the partitioned data sets using a sample that is stratified by the target GOOD_BAD. A stratified sample helps to preserve the initial ratio of good to bad credit applicants in both the training and validation data sets. Stratification is often important when one of the target event levels is rare.

    1. Select Stratified under Method in the Partition tab.

    2. Select the Stratification tab.

    3. Scroll down to the bottom of the window to display the GOOD_BAD variable.

    4. Right-click in the Status cell of the GOOD_BAD variable row and select Set Status. Another pop-up menu appears.

    5. Select use.

    [Stratification tab of the Data Partition configuration window showing the Status of target GOOD_BAD set to USE.]

  6. Close the Data Partition node. Select Yes in the Message window to save the node settings.
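The effect of the stratified 60/40 partition configured above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the node's actual algorithm: `stratified_partition` is a hypothetical helper, the 1,000 sample cases and their 70/30 good-to-bad mix are made up for the example, and Python's random number generator differs from the one in the tool, so a seed of 12345 here will not reproduce the tool's partition.

```python
import random
from collections import defaultdict

def stratified_partition(rows, target, train_fraction=0.6, seed=12345):
    """Split rows into training and validation lists, stratified by `target`.

    Hypothetical helper: each target level is shuffled and split separately,
    so both partitions keep the same mix of levels as the input.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[target]].append(row)      # group cases by target level
    train, validation = [], []
    for level in sorted(strata):             # fixed order keeps runs repeatable
        members = strata[level]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        validation.extend(members[cut:])
    return train, validation

# Made-up input: 1,000 cases, 30% labeled "bad" (illustrative mix only).
cases = [{"id": i, "GOOD_BAD": "bad" if i % 10 < 3 else "good"}
         for i in range(1000)]
train, validation = stratified_partition(cases, "GOOD_BAD")
```

Because every level of the target is split 60/40 on its own, the proportion of bad applicants is the same in the training and validation lists as in the input. A simple random split only approximates that proportion, which matters most when one level is rare.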
