The HPSPLIT Procedure

Splitting Criteria

The goal of recursive partitioning, as described in the section Building a Decision Tree, is to subdivide the predictor space in such a way that the response values for the observations in the terminal nodes are as similar as possible. The HPSPLIT procedure provides two types of criteria for splitting a parent node $\tau $: criteria that maximize a decrease in node impurity, as defined by an impurity function, and criteria that are defined by a statistical test. You select the criterion by specifying an option in the GROW statement.
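
For example, the following statements give a minimal sketch of requesting a criterion (here the Gini index, assuming GINI is the corresponding GROW keyword); the Sashelp.Iris data set and the variable roles are illustrative only:

   proc hpsplit data=sashelp.iris;
      class Species;
      model Species = SepalLength SepalWidth PetalLength PetalWidth;
      grow gini;   /* entropy is the default criterion for a categorical response */
   run;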

Criteria Based on Impurity

The entropy, Gini index, and RSS criteria select splits that decrease node impurity. The impurity of a parent node $\tau $ is defined as $i(\tau )$, a nonnegative number that is equal to zero for a pure node—in other words, a node for which all the observations have the same value of the response variable. Parent nodes for which the observations have very different values of the response variable have a large impurity.

The HPSPLIT procedure selects the splitting variable and the cutoff value that produce the largest reduction in impurity,

\begin{equation*} \Delta i (s,\tau )=i(\tau )-\sum _{b=1}^{B}p(\tau _ b|\tau )i(\tau _ b) \end{equation*}

where s denotes the candidate split, $\tau _ b$ denotes the bth child node, $p(\tau _ b | \tau )$ is the proportion of observations in $\tau $ that are assigned to $\tau _ b$, and B is the number of branches after splitting $\tau $.
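
As a minimal numeric sketch (the counts are hypothetical and not part of this documentation), suppose a parent node with 10 observations (6 of class A, 4 of class B) is split into a pure left child (5 A) and a mixed right child (1 A, 4 B). The following DATA step evaluates $\Delta i(s,\tau )$ under the entropy criterion that is defined below:

   data _null_;
      /* entropy impurities of the parent node and its two children */
      i_parent = -(0.6*log2(0.6) + 0.4*log2(0.4));  /* i(tau)   ~ 0.9710 */
      i_left   = 0;                                 /* pure node: i = 0  */
      i_right  = -(0.2*log2(0.2) + 0.8*log2(0.8));  /* i(tau_2) ~ 0.7219 */
      /* Delta i(s,tau) = i(tau) - sum over b of p(tau_b|tau)*i(tau_b) */
      delta_i  = i_parent - (5/10)*i_left - (5/10)*i_right;
      put delta_i=;                                 /* ~ 0.6100 */
   run;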

The impurity reduction criteria available for classification trees are based on different impurity functions $i(\tau )$ as follows:

  • Entropy criterion (default)

    The entropy impurity of node $\tau $ is defined as

    \begin{equation*} i(\tau )=-\sum _{j=1}^{J}p_ j \log _2 p_ j \end{equation*}

    where $p_ j$ is the proportion of observations in $\tau $ that have the jth response value. (A worked computation of the entropy and Gini impurities follows this list.)

  • Gini index criterion

    The Gini index criterion defines $i(\tau )$ as the Gini index, which corresponds to the ASE of a class response and is given by

    \begin{equation*} i(\tau )=1-\sum _{j=1}^{J}p_ j^2 \end{equation*}

    For more information, see Hastie, Tibshirani, and Friedman (2009).
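
As a check on these two definitions, the following DATA step (a sketch that uses a hypothetical three-level response with class proportions 0.5, 0.25, and 0.25) computes both impurities:

   data _null_;
      array p[3] (0.5 0.25 0.25);        /* hypothetical class proportions */
      entropy = 0;
      gini    = 1;
      do j = 1 to dim(p);
         entropy = entropy - p[j]*log2(p[j]);
         gini    = gini    - p[j]**2;
      end;
      put entropy= gini=;                /* entropy = 1.5, gini = 0.625 */
   run;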

The impurity reduction criterion available for regression trees is as follows:

  • RSS criterion (default)

    The RSS criterion, also referred to as the ANOVA criterion, defines $i(\tau )$ as the residual sum of squares averaged over the observations in the node,

    \begin{equation*} i(\tau )=\frac{1}{N(\tau )} \sum _{i=1}^{N(\tau )}{(Y_ i - \overline{Y})^2} \end{equation*}

    where $N(\tau )$ is the number of observations in $\tau $, $Y_ i$ is the response value of observation i, and $\overline{Y}$ is the average response of the observations in $\tau $.
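
The following DATA step sketches this computation for a hypothetical node that contains five observations with response values 3, 5, 7, 9, and 11:

   data _null_;
      array y[5] (3 5 7 9 11);          /* hypothetical responses in tau */
      n    = dim(y);
      ybar = 0;
      do i = 1 to n;  ybar + y[i]/n;  end;           /* ybar = 7  */
      rss  = 0;
      do i = 1 to n;  rss + (y[i] - ybar)**2;  end;  /* rss  = 40 */
      impurity = rss / n;               /* i(tau) = 40/5 = 8 */
      put impurity=;
   run;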

Criteria Based on Statistical Tests

The chi-square, F test, CHAID, and FastCHAID criteria are defined by statistical tests. These criteria calculate the worth of a split by testing for a significant difference in the response variable across the branches defined by a split. The worth is defined as $-\log (p)$, where p is the p-value of the test. You can adjust the p-values for these criteria by specifying the BONFERRONI option in the GROW statement.
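
The following DATA step sketches the worth computation for a hypothetical p-value. Two assumptions are made for illustration: the natural logarithm is used (the formula above does not specify a base), and the simple multiply-by-m correction stands in for the procedure's internal Bonferroni adjustment:

   data _null_;
      p         = 0.001;          /* hypothetical p-value of a split's test   */
      worth     = -log(p);        /* 6.9078, assuming the natural logarithm   */
      m         = 5;              /* hypothetical number of independent tests */
      p_adj     = min(1, m*p);    /* assumed Bonferroni correction: 0.005     */
      worth_adj = -log(p_adj);    /* 5.2983; the adjustment lowers the worth  */
      put worth= worth_adj=;
   run;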

The criteria based on statistical tests compute the worth of a split as follows:

  • Chi-square criterion

    For categorical response variables, the worth is based on the p-value of the Pearson chi-square test that compares the frequencies of the levels of the response across the child nodes (a worked computation follows this list).

  • F-test criterion

    For continuous response variables, the worth is based on the F test for the null hypothesis that the means of the response values are identical across the child nodes. The test statistic is

    \begin{equation*} F=\frac{\mbox{SS}_{\mbox{between}}/(B-1)}{\mbox{SS}_{\mbox{within}}/(N(\tau )-B)} \end{equation*}

    where

    \begin{equation*} \mbox{SS}_{\mbox{between}}= \sum _{b=1}^{B}N(\tau _ b)(\overline{Y}(\tau _ b)-\overline{Y}(\tau ))^2 \end{equation*}
    \begin{equation*} \mbox{SS}_{\mbox{within}}= \sum _{b=1}^{B}\sum _{i=1}^{N(\tau _ b)}{(Y_{bi} - \overline{Y}(\tau _ b))^2} \end{equation*}

    In these expressions, $N(\tau _ b)$ is the number of observations in child node $\tau _ b$, $Y_{bi}$ is the response value of the ith observation in $\tau _ b$, and $\overline{Y}(\tau _ b)$ and $\overline{Y}(\tau )$ are the average responses in $\tau _ b$ and in $\tau $, respectively.
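
The following DATA step sketches both computations on hypothetical summaries: a 2×2 table of response level by child node for the chi-square criterion, and two child nodes of four observations each for the F test. The natural logarithm is again assumed for the worth:

   data _null_;
      /* Pearson chi-square for a hypothetical 2x2 table of
         response level (rows) by child node (columns)      */
      array o[2,2] _temporary_ (30 10 10 30);   /* observed counts */
      array rt[2]  _temporary_ (0 0);           /* row totals      */
      array ct[2]  _temporary_ (0 0);           /* column totals   */
      n = 0;
      do r = 1 to 2;  do c = 1 to 2;
         rt[r] + o[r,c];  ct[c] + o[r,c];  n + o[r,c];
      end;  end;
      chisq = 0;
      do r = 1 to 2;  do c = 1 to 2;
         e = rt[r]*ct[c] / n;                   /* expected count = 20 */
         chisq + (o[r,c] - e)**2 / e;
      end;  end;
      p_chi     = 1 - probchi(chisq, (2-1)*(2-1));   /* chisq = 20, p ~ 7.7E-6 */
      worth_chi = -log(p_chi);

      /* F test on hypothetical child-node summaries: B = 2 branches of
         4 observations each, child means 10 and 14, overall mean 12,
         and a within-node sum of squares assumed to be 12             */
      ss_between = 4*(10-12)**2 + 4*(14-12)**2;       /* = 32 */
      ss_within  = 12;
      b = 2;
      ntau = 8;
      f = (ss_between/(b-1)) / (ss_within/(ntau-b));  /* = 16 */
      p_f     = 1 - probf(f, b-1, ntau-b);            /* p ~ 0.007 */
      worth_f = -log(p_f);
      put chisq= worth_chi= f= worth_f=;
   run;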

The following criterion is available for both categorical and continuous response variables:

  • CHAID criterion

    For categorical and continuous response variables, CHAID is an approach, first described by Kass (1980), that regards every possible split as representing a statistical test. CHAID tests the hypothesis of no association between the values of the response (target) variable and the branches of a node. The Bonferroni adjusted probability is defined as $m\alpha $, where $\alpha $ is the significance level of an individual test and m is the number of independent tests. For example, a split that corresponds to m = 10 independent tests, each performed at significance level $\alpha = 0.005$, has a Bonferroni adjusted probability of $m\alpha = 0.05$.

    For categorical response variables, the HPSPLIT procedure also provides the FastCHAID criterion, which is a special case of CHAID. FastCHAID is faster to compute because it prioritizes the possible splits by sorting them according to the response variable.