Example Process Flow Diagram

Task 5. Examining Summary Statistics for the Interval and Class Variables

Examine the summary statistics for the interval and class variables before modeling. Interval variables that are heavily skewed and that have large kurtosis values may need to be filtered or transformed (or both) before modeling. It is also important to identify the class and interval variables that have missing values, because the regression and neural network analyses exclude a customer's entire case when a variable value for that customer is missing. Although a great deal of information is usually collected about customers, missing values are common in most data mining data marts. If variables have missing values, consider imputing them with the Replacement node.
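To make the effect of missing values concrete, the following is a minimal Python sketch, outside Enterprise Miner, of per-variable missing counts and complete-case exclusion. The records, the variable names AGE and PURPOSE, and the helper functions are hypothetical and serve only as an illustration; only AMOUNT appears in this example's data.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"AMOUNT": 1200.0, "AGE": 34, "PURPOSE": "car"},
    {"AMOUNT": None, "AGE": 51, "PURPOSE": "education"},
    {"AMOUNT": 560.0, "AGE": 27, "PURPOSE": None},
    {"AMOUNT": 8900.0, "AGE": 45, "PURPOSE": "business"},
]


def missing_counts(rows):
    """Count the missing (None) values for each variable."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def complete_cases(rows):
    """Keep only the rows in which no variable is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


counts = missing_counts(records)
usable = complete_cases(records)

print(counts)
print(len(usable), "of", len(records), "cases would remain for modeling")
```

In this sketch, one missing value in a single variable is enough to drop the whole record, which is why a few sparsely populated variables can remove a large share of the cases.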

  1. To view summary statistics for the interval variables, select the Interval Variables tab. The variable AMOUNT has the largest skewness statistic. The skewness statistic is 0 for a symmetric distribution. Note that there are no missing values for the interval variables.

    [Interval Variables tab of the Input Data Source window showing Interval Variable summary statistics]

  2. To view summary statistics for the class variables, select the Class Variables tab. Note that there are also no missing values for the class variables.

  3. Close the Input Data Source node and save all changes when prompted.
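The skewness and kurtosis statistics discussed in this task can be sketched in Python as follows. This is an illustration only: it uses the simple moment-based formulas, which may differ from the bias-adjusted values that the Interval Variables tab reports, and the sample values and function names are hypothetical.

```python
import math


def central_moment(values, k):
    """Return the k-th central moment (dividing by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution is near 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0]

# A log transformation is one common way to reduce right skewness.
logged = [math.log(v) for v in right_skewed]

print("symmetric skewness:   ", skewness(symmetric))
print("right-skewed skewness:", skewness(right_skewed))
print("after log transform:  ", skewness(logged))
print("right-skewed kurtosis:", excess_kurtosis(right_skewed))
```

The symmetric sample yields a skewness of 0, while the long right tail yields a large positive value that the log transformation reduces. The log transformation applies only to strictly positive values.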

Note:   You can view the distribution of any variable by right-clicking in a cell in the variable row and selecting View Distribution of <variable name>.
