Input Variable Splitting and Selection

Subsections:

You can use the following criteria to determine a split for nominal targets:

• entropy

• FastCHAID

• Gini

• chi-square

You can use the following criteria to determine a split for interval targets.

• F test

• variance

PROC HPSPLIT determines the best split in two stages. First, the splitter uses a splitting algorithm category to find the best split for each variable according to the criterion. Next, the variable that has the best split determines the split of the leaf.

The splitter uses different algorithms called splitting categories to find the best split for a variable. Three categories are available: exhaustive, a C4.5-like greedy algorithm that groups levels together using the criterion that is specified in the CRITERION statement until the value specified in the MAXBRANCH= option is reached, and a fast sort–based greedy algorithm. The splitter switches between the different algorithms as the number of levels increases because each splitting category has a different computational complexity that depends on the number of levels.

Splitting Categories and Types

The number of available levels in the variable to be split determines the splitting category. A variable’s level is available if the variable has not yet been used in the path from the root of the tree to the leaf that is being split, or if a given level has not been switched to a different branch along that path. This definition of available allows a variable to be split multiple times. Adjusting the splitting category based on the number of available levels obtains the best split possible according to the statistical criterion while still enabling the splitter to perform quickly despite dealing with an arbitrary number of variable levels or bins.

An exhaustive split search of an interval variable has a much different computational complexity than an exhaustive split search of a nominal variable does. Because of this difference, only two splitter categories are used for interval variables: the exhaustive search and a fast, greedy search. The exhaustive search examines every possible arrangement of splits, up to one less than the value specified in the MAXBRANCH option. The best one is chosen as that variable’s split.

If an exhaustive search is computationally infeasible—that is, it requires more operations to perform than the value specified in the LEVTHRESH2= option—the splitter falls back to a faster, greedy algorithm. The greedy algorithm finds the best single split. It then finds the best split of the resulting two regions, choosing the best region and the best split of that region. This process continues until the number of regions equals the value specified in the MAXBRANCH= option or until no further splits are possible.

An exhaustive search of nominal variable splits requires checking every possible assignment of levels to resulting regions. Therefore, the number of operations that are required to perform this search is exponential as a function of the number of variable levels. If the number of operations that are required to perform the exhaustive search is greater than the value specified in the LEVTHRESH1= option, then the splitter uses a faster, greedy search.

The fast greedy algorithm examines each possible pairwise combination of levels. The splitter looks at the best pairwise combination of levels and merges the best pair. This process continues until the number of splits is below one less than the value specified in the MAXBRANCH= option. However, if the number of levels is huge, even this method is infeasible, and the splitter falls back to an even faster method.

After ordering the nominal variable levels based on the EVENT= option, the splitter finds the best splits iteratively. At each iteration, the best split is chosen using the statistical metric for each previously split range of bins or levels. In effect, this combines a number of binary-split nodes into one ensemble of one less than the number of splits specified in the MAXBRANCH= option.

For FastCHAID, the splitter uses only the fastest algorithm regardless of the number of levels. The statistic that is used for choosing split goodness is the Kolmogorov-Smirnov (K-S) distance for the empirical cumulative distribution function. The K-S splitter follows Friedman’s (1977) proposal, splitting once at the point that has the maximum K-S distance between all the levels. The splitter then finds the maximum K-S distance of the resulting regions and splits there. The splitter continues until the number of splits is equal to the value specified in the MAXBRANCH= option minus 1.

The chi-square criterion uses the p-value of the two-way contingency table of possible split assignments to find the best split of each variable. This criterion can use any splitting category—exhaustive, intermediate, and fastest.

The variance criterion uses the change in variance due to a split to determine each variable’s split. Negative changes are not permitted. This criterion can use the exhaustive, intermediate, and fastest splitting categories.

The F test criterion uses the p-value of the F test of possible splits to determine how to split each variable. This criterion can use the exhaustive, intermediate, and fastest splitting categories.

Selecting the Split Variable

After it finds the split for each variable, the splitter chooses the best split variable to use for the final tree node according to the criterion that is specified in the CRITERION statement:

• The entropy and Gini criteria use the named metric to guide the decision.

• The FastCHAID and chi-square criteria use the p-value of the two-way table of target-child counts of the proposed split. The ALPHA= option in the PROC HPSPLIT statement specifies the value below which the p-value must fall in order to be accepted as a candidate split. In addition, the BONFERRONI keyword in the PROC HPSPLIT statement adjusts the p-value of the split by the Bonferroni adjustment.

• The variance criterion uses the change in target variance to choose the best variable on which to split.

• The F test criterion uses the p-value of an F test to determine the best variable on which to split.

The splitting metrics are based on the population that lands at the node, not the whole tree. For example, the change in entropy when you split the leaf is determined by the number of observations at that leaf. Although subtle, this distinction makes it potentially useful to grow and prune according to the entropy, even when no validation data set is present. This is because the metric that is used in pruning is based on the partition of the entire data set.