The HPSPLIT Procedure

Splitting Criteria

When you specify entropy or the Gini statistic as the splitting criterion, the value of a split is judged by the decrease in the specified criterion. Thus, the criterion is computed for the original leaf and for the leaves that the candidate split produces. The best split for each variable, and then the variable on which to split, are chosen based on this gain.

When you specify FastCHAID as the splitting criterion, splitting is based on the Kolmogorov-Smirnov distance, which is computed for each candidate variable.

Entropy Splitting Criterion

The entropy is related to the amount of information that a split contains. The entropy of a single leaf $\lambda $ is given by the equation

\begin{equation*}  \mathrm{Entropy}_\lambda = -\sum _ t { \frac{N_ t^\lambda }{N_\lambda }\log _2\left(\frac{N_ t^\lambda }{N_\lambda }\right) } \end{equation*}

where $N_ t^\lambda $ is the number of observations with the target level t on leaf $\lambda $ and $N_\lambda $ is the number of observations on the leaf (Hastie, Tibshirani, and Friedman, 2001; Quinlan, 1993).

When a leaf is split, the total entropy is then

\begin{equation*}  \mathrm{Entropy} = -\sum _\lambda { \frac{N_\lambda }{N_0} \sum _ t { \frac{N_ t^\lambda }{N_\lambda }\log _2\left(\frac{N_ t^\lambda }{N_\lambda }\right) } } \end{equation*}

where $N_0$ is the number of observations on the original unsplit leaf.
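
The following Python sketch illustrates these two formulas (it is not PROC HPSPLIT code; the function names and the target-level counts are hypothetical). It computes the entropy of a parent leaf, the weighted entropy after a candidate split, and the resulting gain:

```python
import math

def leaf_entropy(counts):
    """Entropy of a single leaf, given its per-target-level observation counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_entropy(children):
    """Total entropy of a split: weighted sum of the child leaves' entropies."""
    n0 = sum(sum(c) for c in children)
    return sum((sum(c) / n0) * leaf_entropy(c) for c in children)

# Hypothetical counts: a parent leaf with target-level counts 60/40,
# split into two child leaves
parent = [60, 40]
children = [[45, 5], [15, 35]]
gain = leaf_entropy(parent) - split_entropy(children)
print(round(gain, 4))  # decrease in entropy; a larger decrease indicates a better split
```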

Gini Splitting Criterion

The Gini splitting criterion is similar to the entropy criterion. First, the per-leaf Gini statistic (or Gini index) is given by Hastie, Tibshirani, and Friedman (2001) as

\begin{equation*}  \mathrm{Gini}_\lambda = \sum _ t { \frac{N_ t^\lambda }{N_\lambda }\left(1 - \frac{N_ t^\lambda }{N_\lambda }\right) } \end{equation*}

When a leaf is split, the total Gini statistic is then

\begin{equation*}  \mathrm{Gini} = \sum _\lambda { \frac{N_\lambda }{N_0} \sum _ t {\frac{N_ t^\lambda }{N_\lambda }\left(1 - \frac{N_ t^\lambda }{N_\lambda }\right) } } \end{equation*}
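
The same kind of sketch applies to the Gini criterion; this Python snippet (again illustrative only, with hypothetical counts) computes the decrease in the Gini index for a candidate split:

```python
def leaf_gini(counts):
    """Gini index of a single leaf, given its per-target-level observation counts."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def split_gini(children):
    """Total Gini index of a split: weighted sum of the child leaves' indices."""
    n0 = sum(sum(c) for c in children)
    return sum((sum(c) / n0) * leaf_gini(c) for c in children)

# Hypothetical counts for a parent leaf and its two children after a split
gain = leaf_gini([60, 40]) - split_gini([[45, 5], [15, 35]])
print(round(gain, 4))  # decrease in the Gini index
```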

Kolmogorov-Smirnov (FastCHAID) Splitting Criterion

The Kolmogorov-Smirnov (K-S) distance is the maximum distance between the cumulative distribution functions (CDFs) of two or more target levels (Friedman, 1977; Rokach and Maimon, 2008; Utgoff and Clouse, 1996). To create a meaningful CDF for nominal inputs, nominal target levels are ordered first by the level that is specified in the EVENT= option in the PROC HPSPLIT statement (if specified) and then by the other levels in internal order.

After the CDFs have been created, the maximum K-S distance is given by

\begin{equation*}  \mathrm{MAX~KS} = \max _{i,j,k}\left| \mathrm{CDF}_ i^{\tau _ j} - \mathrm{CDF}_ i^{\tau _ k} \right| \end{equation*}

where i is an interval variable bin or an explanatory variable level, $\tau _ j$ is the jth target level, and $\tau _ k$ is the kth target level.
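
The following Python sketch illustrates this computation (it is not PROC HPSPLIT code; the bin-by-target-level counts are hypothetical). Rows of the count matrix are the ordered bins or levels of an input variable, and columns are target levels:

```python
import numpy as np

def max_ks_distance(counts):
    """Maximum K-S distance between the per-target-level CDFs.

    counts[i][j] = number of observations in bin/level i that have target level j,
    with the bins listed in their ordered sequence."""
    counts = np.asarray(counts, dtype=float)
    cdfs = np.cumsum(counts, axis=0) / counts.sum(axis=0)  # one CDF per target level
    # Largest CDF gap over all bins i and target-level pairs (j, k)
    return max(
        np.abs(cdfs[:, j] - cdfs[:, k]).max()
        for j in range(cdfs.shape[1])
        for k in range(j + 1, cdfs.shape[1])
    )

# Hypothetical four-bin input with a binary target
print(round(max_ks_distance([[10, 2], [8, 4], [3, 9], [1, 13]]), 4))
```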

At each step of determining a variable’s split, the maximum K-S distance is computed, resulting in a single split. Splitting continues recursively until the number of branches reaches the value that is specified in the MAXBRANCH= option.

After each variable’s split has been determined, the variable that has the lowest p-value is chosen as the variable on which to split. Because this operation is similar to another established tree algorithm (Kass, 1980; Soman, Diwakar, and Ajay, 2010), this overall criterion is called "FastCHAID."

Chi-Square Splitting Criterion

The chi-square criterion uses the Pearson chi-square statistic of the two-way contingency table that is formed by an input variable’s levels and the target levels.

For C children of the split, the sum is equal to

\begin{equation*}  \chi ^2 = \frac{1}{N} \sum ^ T_\tau \sum ^ C_ c \frac{\left(N_\tau N_ c - N N_{\tau c}\right)^2}{N_\tau N_ c} \end{equation*}

where N is the number of observations under consideration, T is the number of target levels, $N_\tau $ is the number of observations that have target level $\tau $, $N_ c$ is the number of observations assigned to child c, and $N_{\tau c}$ is the number of observations that have target level $\tau $ and are assigned to child c.

After splitting each variable, the variable that has the lowest p-value is chosen as the variable on which to split.
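
A minimal Python sketch of this criterion follows (it is not PROC HPSPLIT code; the contingency table is hypothetical). The p-value is computed with the usual $(T-1)(C-1)$ contingency-table degrees of freedom, which is an assumption because the text above does not state the degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2

def chisq_split(table):
    """Chi-square statistic and p-value for a target-level-by-child contingency table.

    table[t][c] = number of observations that have target level t and are
    assigned to child c."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    n_tau = table.sum(axis=1)        # per-target-level totals
    n_c = table.sum(axis=0)          # per-child totals
    prod = np.outer(n_tau, n_c)      # N_tau * N_c
    stat = (((prod - n * table) ** 2) / prod).sum() / n
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)  # assumed (T-1)(C-1)
    return stat, chi2.sf(stat, dof)

# Hypothetical two-level target split into two children
print(chisq_split([[45, 15], [5, 35]]))
```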

Variance Splitting Criterion

The variance criterion chooses the split that has the maximum reduction in variance. For some leaf $\lambda $, variance is calculated as

\begin{equation*}  V = \sum _ i^\lambda \left(y_ i - \hat y_\lambda \right)^2 \end{equation*}

where the sum is over the observations $i$ on leaf $\lambda $ and $\hat y_\lambda $ is the mean of the response variable on that leaf. The reduction in variance from a split is then

\begin{equation*}  \Delta V = \sum _ i^{\lambda _0} \left(y_ i - \hat y_{\lambda _0}\right)^2 - \sum _\lambda ^{\lambda _0} \sum _ i^\lambda \left(y_ i - \hat y_\lambda \right)^2 \end{equation*}

where $\lambda _0$ is the leaf being split and $\lambda $ are the leaves after the split.

This is a type of change in purity criterion.
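
The following Python sketch (illustrative only, with hypothetical response values) computes the reduction in the sum of squared deviations for a candidate split:

```python
def variance_reduction(parent_y, child_ys):
    """Reduction in the sum of squared deviations when a leaf is split.

    parent_y: response values on the unsplit leaf.
    child_ys: one list of response values per leaf after the split."""
    def sse(ys):
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)
    return sse(parent_y) - sum(sse(ys) for ys in child_ys)

# Hypothetical responses: the split separates low values from high values
parent = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]
children = [[1.0, 1.2, 0.9], [3.8, 4.1, 4.0]]
print(round(variance_reduction(parent, children), 4))
```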

F Test Splitting Criterion

The F test criterion uses the F statistic to choose the best split for a variable:

\begin{equation*}  F = \frac{\sum _\lambda ^{\Lambda } N_\lambda \left(\hat y_\lambda - \hat y_{\lambda _0} \right)^2\left(N_{\lambda _0} - \left|\Lambda \right|\right)}{\sum _\lambda ^{\Lambda } \sum _ i^\lambda \left(y_ i - \hat y_\lambda \right)^2 \left(\left|\Lambda \right| - 1\right)} \end{equation*}

where $\Lambda $ is the set of leaves that the split creates, $N_\lambda $ and $\hat y_\lambda $ are the number of observations and the mean response on leaf $\lambda $, and $N_{\lambda _0}$ and $\hat y_{\lambda _0}$ are the corresponding quantities on the unsplit leaf $\lambda _0$.

After splitting each variable, the variable that has the lowest F test p-value is chosen as the variable on which to split. The p-value is the probability that $z \ge F$, where z is a random variable from an F distribution that has $\left|\Lambda \right| - 1$ numerator and $N_{\lambda _0} - \left|\Lambda \right|$ denominator degrees of freedom. Because the value $N_{\lambda _0}$ is the sum of weights on the unsplit leaf, these are not strictly correct degrees of freedom.
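
The following Python sketch illustrates the F statistic and its p-value (it is not PROC HPSPLIT code; the response values are hypothetical, and the numerator/denominator ordering of the degrees of freedom follows the description above):

```python
from scipy.stats import f as f_dist

def f_split(child_ys):
    """F statistic and p-value for a candidate split.

    child_ys: one list of response values per leaf created by the split."""
    all_y = [y for ys in child_ys for y in ys]
    n0, k = len(all_y), len(child_ys)               # N_{lambda_0} and |Lambda|
    grand_mean = sum(all_y) / n0                    # mean on the unsplit leaf
    means = [sum(ys) / len(ys) for ys in child_ys]  # per-leaf means
    between = sum(len(ys) * (m - grand_mean) ** 2 for ys, m in zip(child_ys, means))
    within = sum(sum((y - m) ** 2 for y in ys) for ys, m in zip(child_ys, means))
    stat = (between * (n0 - k)) / (within * (k - 1))
    # Pr(z >= F) with |Lambda| - 1 numerator and N_{lambda_0} - |Lambda|
    # denominator degrees of freedom
    return stat, f_dist.sf(stat, k - 1, n0 - k)

# Hypothetical two-way split of six response values
print(f_split([[1.0, 1.2, 0.9], [3.8, 4.1, 4.0]]))
```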