The HPSPLIT Procedure

Splitting Strategy

When you are building a tree, it is computationally intensive to calculate the node purity or the node worth for all possible split points of every predictor variable. For this reason, the HPSPLIT procedure implements a strategy that combines three different methods of generating candidate splits. The exhaustive method computes the split criterion for all the levels of a predictor variable. The greedy method, which is based on the CHAID algorithm, finds split candidates by recursively halving the data. The fast-sort method prioritizes splits according to the proportions of levels of a categorical response or the average of a continuous response.

By default, PROC HPSPLIT first tries to find candidates for splits by using the exhaustive method. If the number of computations exceeds the number that you specify in the LEVTHRESH1= or LEVTHRESH2= option, the procedure switches to the greedy algorithm. If the number of computations for a specified predictor runs out with the greedy algorithm, the procedure switches to the fast-sort method.

For categorical predictor variables, the exhaustive method tries to group levels in every possible combination. If the number of computations exceeds the threshold before this method finishes, the HPSPLIT procedure switches to the greedy algorithm, which groups levels by adding levels one at a time to the number of groups that you specify in the MAXBRANCH= option. If the number of computations exceeds the threshold before the greedy algorithm finishes, the procedure prioritizes split points by sorting the predictor levels by their proportion of a categorical response or by the average of a continuous response. Use the LEVTHRESH1= option to specify the number of computations for nominal predictors.

Continuous predictor variables are first binned in equidistant intervals as specified by the INTERVALBINS= option. Binning reduces the number of computations when you are searching for candidate splits. Use the option LEVTHRESH2= to specify the threshold for the number of computations before switching from exhaustive to fast-sort method. The greedy method is not available for interval predictor variables.

By default, a variable can be used more than once in a branch of a decision tree as long as the split is on a different value of that variable. You can specify the SPLITONCE option in the PROC HPSPLIT statement to request that variables be used only once per branch.