Splitting Criteria :: SAS/STAT(R) 12.3 User's Guide: High-Performance Procedures

Splitting Criteria

Entropy Splitting Criterion
Gini Splitting Criterion
Kolmogorov-Smirnov (FastCHAID) Splitting Criterion

When you specify entropy or the Gini statistic as the splitting criterion, the value of a split is judged by the decrease in the specified criterion. The criterion is computed for the original, unsplit leaf and again for the set of leaves that the split produces; the difference between the two is the gain. The best split for each variable is chosen first, and then the variable on which to split is chosen, both based on the gain.

When you specify FastCHAID as the splitting criterion, splitting is based on the Kolmogorov-Smirnov distance of the variables.

Entropy Splitting Criterion

The entropy is related to the amount of information that a split contains. The entropy of a single leaf $\lambda$ is given by the equation

$\begin{equation*} \mathrm{Entropy}_\lambda = -\sum _ t { \frac{N_ t^\lambda }{N_\lambda }\log _2\left(\frac{N_ t^\lambda }{N_\lambda }\right) } \end{equation*}$

where $N_ t^\lambda$ is the number of observations with the target level t on leaf $\lambda$ and $N_\lambda$ is the number of observations on the leaf (Hastie, Tibshirani, and Friedman, 2001; Quinlan, 1993).

When a leaf is split, the total entropy is then

$\begin{equation*} \mathrm{Entropy} = -\sum _\lambda { \frac{N_\lambda }{N_0} \sum _ t { \frac{N_ t^\lambda }{N_\lambda }\log _2\left(\frac{N_ t^\lambda }{N_\lambda }\right) } } \end{equation*}$

where $N_0$ is the number of observations on the original unsplit leaf.
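The two entropy formulas above can be illustrated with a minimal Python sketch. The function names `leaf_entropy` and `split_entropy` and the example counts are hypothetical and chosen only for illustration; they are not part of PROC HPSPLIT. Each leaf is represented as a list of per-target-level counts $N_t^\lambda$, and empty target levels are assumed to contribute zero.

```python
import math

def leaf_entropy(counts):
    """Entropy of one leaf, given its per-target-level counts N_t^lambda."""
    n_leaf = sum(counts)
    # Levels with no observations are skipped (0 * log2(0) is treated as 0).
    return -sum((c / n_leaf) * math.log2(c / n_leaf) for c in counts if c > 0)

def split_entropy(leaves):
    """Total entropy after a split: leaf entropies weighted by N_lambda / N_0."""
    n_0 = sum(sum(counts) for counts in leaves)
    return sum((sum(counts) / n_0) * leaf_entropy(counts) for counts in leaves)

# Hypothetical example: a two-level target, 100 observations on the parent leaf.
parent = [50, 50]
children = [[40, 10], [10, 40]]

# The gain is the decrease in entropy from the unsplit leaf to the split leaves.
gain = leaf_entropy(parent) - split_entropy(children)
print(f"parent={leaf_entropy(parent):.4f} "
      f"split={split_entropy(children):.4f} gain={gain:.4f}")
```

A larger gain indicates a split whose leaves are purer with respect to the target.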

Gini Splitting Criterion

Split Gini is similar to split entropy. First, the per-leaf Gini statistic or index is given by Hastie, Tibshirani, and Friedman (2001) as

$\begin{equation*} \mathrm{Gini}_\lambda = \sum _ t { \frac{N_ t^\lambda }{N_\lambda }\left(1 - \frac{N_ t^\lambda }{N_\lambda }\right) } \end{equation*}$

When a leaf is split, the total Gini statistic is then

$\begin{equation*} \mathrm{Gini} = \sum _\lambda { \frac{N_\lambda }{N_0} \sum _ t {\frac{N_ t^\lambda }{N_\lambda }\left(1 - \frac{N_ t^\lambda }{N_\lambda }\right) } } \end{equation*}$
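The Gini formulas can be sketched in the same way. The function names `leaf_gini` and `split_gini` and the example counts below are hypothetical; each leaf is again a list of per-target-level counts $N_t^\lambda$, and the outer sum is taken over the leaves $\lambda$ that the split produces.

```python
def leaf_gini(counts):
    """Gini statistic of one leaf, given its per-target-level counts."""
    n_leaf = sum(counts)
    if n_leaf == 0:
        return 0.0  # assumption: an empty leaf contributes no impurity
    return sum((c / n_leaf) * (1 - c / n_leaf) for c in counts)

def split_gini(leaves):
    """Total Gini after a split: leaf Gini values weighted by N_lambda / N_0."""
    n_0 = sum(sum(counts) for counts in leaves)
    return sum((sum(counts) / n_0) * leaf_gini(counts) for counts in leaves)

# Hypothetical example: a two-level target, 100 observations on the parent leaf.
parent = [50, 50]
children = [[40, 10], [10, 40]]

# As with entropy, the split is judged by the decrease in the criterion.
decrease = leaf_gini(parent) - split_gini(children)
print(f"parent={leaf_gini(parent):.2f} "
      f"split={split_gini(children):.2f} decrease={decrease:.2f}")
```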

Kolmogorov-Smirnov (FastCHAID) Splitting Criterion

The Kolmogorov-Smirnov (K-S) distance is the maximum distance between the cumulative distribution functions (CDFs) of two or more target levels (Friedman, 1977; Rokach and Maimon, 2008; Utgoff and Clouse, 1996). To create a meaningful CDF for nominal inputs, nominal target levels are ordered first by the level that is specified in the EVENT= option in the PROC HPSPLIT statement (if specified) and then by the other levels in internal order.

After the CDFs have been created, the maximum K-S distance is given by

$\begin{equation*} \mathrm{MAX} \mathrm{KS} = \text {MAX}_{i j k}\left| CDF_ i^{\tau _ j} - CDF_ i^{\tau _ k} \right| \end{equation*}$

where i is an interval variable bin or an explanatory variable level, $\tau _ j$ is the jth target level, and $\tau _ k$ is the kth target level.
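As an illustration of the maximum K-S distance, the following Python sketch builds one empirical CDF per target level from a table of counts and searches all bins and all pairs of target levels. The function name `max_ks_distance`, the layout `counts[i][j]` (observations in bin or level i that have target level j), and the example data are assumptions made for this sketch; the sketch also assumes that the bins are already in the intended order and that every target level has at least one observation.

```python
from itertools import accumulate, combinations

def max_ks_distance(counts):
    """Return (max distance, bin index i) over all bins and target-level pairs.

    counts[i][j] is the number of observations in bin i with target level j.
    """
    n_bins = len(counts)
    n_levels = len(counts[0])

    # One empirical CDF per target level: cumulative share of that level's
    # observations found in bins 0..i.
    cdfs = []
    for j in range(n_levels):
        column = [row[j] for row in counts]
        total = sum(column)
        cdfs.append([running / total for running in accumulate(column)])

    best_distance, best_bin = 0.0, None
    for j, k in combinations(range(n_levels), 2):
        for i in range(n_bins):
            distance = abs(cdfs[j][i] - cdfs[k][i])
            if distance > best_distance:
                best_distance, best_bin = distance, i
    return best_distance, best_bin

# Hypothetical example: three bins of one input, a two-level target.
counts = [[30, 5], [15, 10], [5, 35]]
ks, bin_index = max_ks_distance(counts)
print(f"max K-S distance {ks:.2f} at bin {bin_index}")
```

In this sketch the returned bin index marks where the CDFs of two target levels are farthest apart, which can be read as a candidate boundary for a split.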

At each step of determining each variable’s split, the maximum K-S distance is computed, resulting in a single split. The splitting continues recursively until the value specified in the MAXBRANCH= option has been reached.

After each variable’s split has been determined, the variable that has the lowest p-value is chosen as the variable on which to split. Because this operation is similar to another established tree algorithm (Kass, 1980; Soman, Diwakar, and Ajay, 2010), this overall criterion is called “FastCHAID.”
