The HPSPLIT Procedure

CRITERION Statement

  • CRITERION criterion </ options>;

The CRITERION statement specifies the criterion by which to grow the tree.

For nominal targets, you can set the criterion to one of the following:

CHAID

uses the value of a chi-square statistic (for a classification tree) or an F statistic (for a regression tree) to merge similar levels of nominal inputs until the number of children in the proposed split reaches the value of the MAXBRANCH= option. It then uses the p-values of the final split to determine the variable on which to split. For interval inputs, CHAID repeatedly chooses the best single split until the number of children in the proposed split reaches the value of the MAXBRANCH= option.

This criterion is available for both interval and nominal targets.

CHISQUARE

uses a chi-square statistic to split each variable and then uses the p-values of the resulting splits to determine the variable on which to split.

ENTROPY

uses the gain in information (decrease in entropy) to split each variable and then to determine the split.

FASTCHAID

uses a Kolmogorov-Smirnov splitter to determine splits for each variable, following a recursive method similar to that of Friedman (1977) (after ordering the levels of nominal variables by the level specified in the EVENT= option), and then uses the lowest of each variable’s resulting p-values to determine the variable on which to split.

GINI

uses the decrease in the Gini index to split each variable and then to determine the split.

IGR

uses the entropy metric to split each variable and then uses the information gain ratio to determine the split.

This criterion is available only for nominal targets.
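
As an illustration of the quantities named in the ENTROPY, GINI, and IGR descriptions above, the following Python sketch computes the information gain, the decrease in the Gini index, and the information gain ratio for one candidate split. It uses the common textbook definitions and hypothetical data. The function names are made up for this example, and the procedure's internal computation may differ in detail.

```python
import math
from collections import Counter

def proportions(labels):
    """Class proportions of a list of target labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in proportions(labels))

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions(labels))

def decrease(impurity, parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    return impurity(parent) - sum(len(c) / n * impurity(c) for c in children)

def gain_ratio(parent, children):
    """Information gain divided by the entropy of the branch sizes."""
    branch_ids = [i for i, c in enumerate(children) for _ in c]
    return decrease(entropy, parent, children) / entropy(branch_ids)

# Hypothetical node with 4 events and 4 non-events, split into two children.
left = ["yes", "yes"]
right = ["yes", "yes", "no", "no", "no", "no"]
parent = left + right

info_gain = decrease(entropy, parent, [left, right])
gini_drop = decrease(gini, parent, [left, right])
igr = gain_ratio(parent, [left, right])
print(f"gain={info_gain:.4f}  gini decrease={gini_drop:.4f}  IGR={igr:.4f}")
```

A split that sends every observation to a single branch has zero branch-size entropy, so a more complete version would need to guard that division.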

The default criterion for nominal targets is ENTROPY.
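
For the p-value-based criteria listed above, a smaller p-value indicates a stronger association between the candidate split and the target. The following Python sketch assumes a binary target and a two-branch split (so the test has one degree of freedom) and computes a Pearson chi-square statistic and its p-value from a table of counts. The function name and the counts are hypothetical, and the sketch is not the procedure's own implementation.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts.

    Rows are the two branches of the split; columns are the two target levels.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: the first branch is mostly events, the second mostly non-events.
strong = chi_square_2x2([[30, 10], [10, 30]])
# Hypothetical counts: both branches have the same event rate.
weak = chi_square_2x2([[20, 20], [20, 20]])
print(strong, weak)
```

The first table yields a large statistic and a very small p-value, whereas the second yields a statistic of zero and a p-value of 1, so a p-value-based criterion would prefer the first split.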

For interval targets, you can set the criterion to one of the following:

FTEST

uses an F test to split each variable and then to determine the split.

VARIANCE

uses the change in target variance to split each variable and then to determine the split.

The default criterion for interval targets is VARIANCE.
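
To make the two interval-target criteria concrete, the following Python sketch computes, for one hypothetical two-branch split, the reduction in target variance and a one-way ANOVA F statistic. Both use standard textbook formulas. The function names and data are illustrative, and how the procedure computes these quantities internally is not specified here.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance (divides by n)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_reduction(children):
    """Parent variance minus the size-weighted variance of the children."""
    parent = [v for child in children for v in child]
    weighted = sum(len(c) / len(parent) * variance(c) for c in children)
    return variance(parent) - weighted

def f_statistic(children):
    """One-way ANOVA F: between-branch mean square over within-branch mean square."""
    parent = [v for child in children for v in child]
    grand = mean(parent)
    ss_between = sum(len(c) * (mean(c) - grand) ** 2 for c in children)
    ss_within = sum((v - mean(c)) ** 2 for c in children for v in c)
    df_between = len(children) - 1
    df_within = len(parent) - len(children)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical target values in the two branches of a candidate split.
branches = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
reduction = variance_reduction(branches)
f_value = f_statistic(branches)
print(reduction, f_value)
```

Turning the F statistic into a p-value requires the F distribution with the corresponding degrees of freedom, which is omitted from this sketch.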

You can also specify the following options after a slash (/):

LEVTHRESH1=number

applies only to nominal inputs and specifies the maximum number of computations to perform in an exhaustive search. If an exhaustive search of a nominal input would require more than number computations, the splitter tries to fall back to the fast algorithm. If that is not possible, it falls back to a greedy algorithm. The LEVTHRESH1= option does not affect interval inputs.

By default, LEVTHRESH1=500,000.

LEVTHRESH2=number

specifies the maximum number of computations to perform in a greedy search for nominal input variables. If the input variable that is being examined is an interval variable, the LEVTHRESH2= option instead specifies the maximum number of computations to perform in an exhaustive search of all possible split points.

If the number of computations in either case is greater than number, the splitter uses a much faster greedy algorithm.

Although this option is similar to the LEVTHRESH1= option, for nominal inputs it limits the fallback algorithm that finds the best splits rather than the exhaustive search. The two calculations have very different computational complexity, which is why they have separate thresholds.

By default, LEVTHRESH2=1,000,000.
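
The need for these thresholds is easiest to see by counting candidate splits. The following Python sketch assumes binary splits and counts one computation per candidate split. Both are assumptions made for illustration, because this section does not state how the procedure counts computations. Under those assumptions, a nominal input with k levels has 2^(k-1) - 1 candidate splits, whereas an interval input with n distinct values has n - 1.

```python
def nominal_binary_splits(num_levels):
    """Ways to divide k nominal levels into two non-empty groups."""
    return 2 ** (num_levels - 1) - 1

def interval_binary_splits(num_distinct_values):
    """Cut points between adjacent distinct values of an interval input."""
    return num_distinct_values - 1

def smallest_level_count_exceeding(limit):
    """Smallest number of nominal levels whose split count is above the limit."""
    k = 2
    while nominal_binary_splits(k) <= limit:
        k += 1
    return k

LEVTHRESH1_DEFAULT = 500_000  # default value stated in the text

levels_needed = smallest_level_count_exceeding(LEVTHRESH1_DEFAULT)
print(levels_needed, nominal_binary_splits(levels_needed))
```

Under these assumptions, an exhaustive search of a nominal input first exceeds the default LEVTHRESH1= value at 20 levels, whereas an interval input would need more than a million distinct values to exceed the default LEVTHRESH2= value.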