The HPSPLIT Procedure

PRUNE Statement

PRUNE prune-method <(prune-options)>;

The PRUNE statement specifies the pruning method and related options. You can specify the following prune-methods:

C45 <(prune-option)>

requests C4.5 pruning (Quinlan 1993), which is based on the upper confidence limit for the error rate. For more information, see the section Pruning. This pruning method is available only for classification trees (categorical responses). PROC HPSPLIT uses the error rate from the validation set, if one is available.

PROC HPSPLIT generates a subtree selection plot during pruning. The subtree selection plot shows the minimum change in prediction error as a function of the number of leaves in the subtree.

You can specify the following prune-option:

CONFIDENCE=confidence-level

specifies the pruning confidence level, which must be a positive number no greater than 1. The default confidence level is 0.25.
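As an illustration of how the C45 method and its CONFIDENCE= option fit into a complete procedure call, the following sketch fits a classification tree and requests C4.5 pruning. The data set (Sampsio.Hmeq), the choice of variables, the validation fraction, and the seed are assumptions made for this example only; they are not part of the PRUNE statement syntax.

```sas
/* Illustrative sketch only: data set, variables, and option values are assumed. */
proc hpsplit data=sampsio.hmeq;
   /* The response (Bad) is listed in CLASS, so a classification tree is grown. */
   class bad reason job;
   model bad = loan mortdue value reason job yoj derog delinq clage ninq clno debtinc;
   /* Hold out 30% of the observations as a validation set. */
   partition fraction(validate=0.3 seed=123);
   /* C4.5 pruning with the confidence level stated explicitly (0.25 is the default). */
   prune c45(confidence=0.25);
run;
```

Because a PARTITION statement is present in this sketch, the error rate that C4.5 pruning uses would come from the validation set, as described above.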

COSTCOMPLEXITY <(prune-options)>
CC <(prune-options)>

requests cost-complexity pruning (Breiman et al. 1984; Quinlan 1987; Zhang and Singer 2010). You can specify this pruning method for both classification trees and regression trees (continuous response). This is the default pruning method.

If you specify a validation set by using a PARTITION statement, PROC HPSPLIT uses the validation set for subtree selection. If you specify the number of leaves by using the LEAVES= option, the procedure selects the subtree that has the specified number of leaves; if no subtree has exactly that number of leaves, it selects a subtree that has fewer leaves. If you specify neither a validation set nor the LEAVES= option, the procedure uses k-fold cross validation for subtree selection by default.

For cost-complexity pruning, PROC HPSPLIT generates either a cost-complexity analysis plot that is based on cross validation or, when a validation set exists, a cost-complexity pruning plot that shows the error metric for the training and validation sets. The error metric is misclassification rate for classification trees and average square error (ASE) for regression trees.

You can specify the following prune-option:

LEAVES=number | ALL

specifies the subtree that has the requested number of leaves, or if no subtree with exactly that number of leaves is available, specifies a subtree with fewer leaves. When LEAVES=ALL, the largest tree is selected.
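The two ways of selecting a subtree under cost-complexity pruning can be sketched as follows. The first step relies on a validation set; the second requests a fixed number of leaves. The data set (Sampsio.Hmeq), the variables, the validation fraction, the seed, and the value LEAVES=6 are all assumptions chosen for illustration.

```sas
/* Illustrative sketch only: data set, variables, and option values are assumed. */

/* Subtree selection that uses a validation set */
proc hpsplit data=sampsio.hmeq;
   class bad reason job;
   model bad = loan mortdue value reason job yoj derog delinq clage ninq clno debtinc;
   partition fraction(validate=0.3 seed=123);
   prune costcomplexity;
run;

/* Subtree selection by number of leaves: request the 6-leaf subtree.
   If no subtree has exactly 6 leaves, a subtree with fewer leaves is selected. */
proc hpsplit data=sampsio.hmeq;
   class bad reason job;
   model bad = loan mortdue value reason job yoj derog delinq clage ninq clno debtinc;
   prune costcomplexity(leaves=6);
run;
```

Removing both the PARTITION statement and the LEAVES= option from this sketch would leave the procedure to fall back on k-fold cross validation for subtree selection.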

OFF

turns off pruning completely. No pruning is performed, and no pruning plots are generated.

REDUCEDERROR <(prune-options)>
REP <(prune-options)>

requests reduced-error pruning (Quinlan 1986). Reduced-error pruning has two stages: subtree sequence generation and subtree selection. For reduced-error pruning, if you specify a validation set by using a PARTITION statement, the validation set is used for subtree sequence generation and for subtree selection. Otherwise, the training set is used. For more information, see the section Pruning.

For reduced-error pruning, PROC HPSPLIT generates a pruning plot that shows the requested error metric as a function of the number of leaves on the subtree.

You can specify the following prune-options:

LEAVES=number | ALL

specifies the subtree that has the requested number of leaves, or if no subtree with exactly that number of leaves is available, specifies a subtree with fewer leaves. When LEAVES=ALL, the largest tree is selected.

METRIC=ASE | MISC

specifies the metric for reduced-error pruning. Average square error (ASE) is the default metric. You can specify ASE for both classification trees and regression trees (continuous response). You can specify MISC only for classification trees (categorical response).
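The following sketch shows one way to request reduced-error pruning with the misclassification rate as the metric. It assumes a classification tree, because MISC is valid only for categorical responses. The data set (Sampsio.Hmeq), the variables, the validation fraction, and the seed are assumptions made for this example.

```sas
/* Illustrative sketch only: data set, variables, and option values are assumed. */
proc hpsplit data=sampsio.hmeq;
   /* Bad is categorical, so METRIC=MISC is permitted. */
   class bad reason job;
   model bad = loan mortdue value reason job yoj derog delinq clage ninq clno debtinc;
   /* With a validation set, both pruning stages use the validation data. */
   partition fraction(validate=0.3 seed=123);
   /* Omitting METRIC= would use the default metric, ASE. */
   prune reducederror(metric=misc);
run;
```

For a regression tree (a continuous response that is not listed in the CLASS statement), only METRIC=ASE would apply.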