The HPSPLIT Procedure

PROC HPSPLIT Features

The main features of the HPSPLIT procedure are as follows:

  • provides a variety of methods of splitting nodes, including criteria based on impurity (entropy, Gini index, residual sum of squares) and criteria based on statistical tests (chi-square, F test, CHAID, FastCHAID)

  • provides a computationally efficient strategy for generating candidate splits

  • provides the cost-complexity, C4.5, and reduced-error methods of pruning trees

  • supports the use of cross validation and validation data for selecting the best subtree

  • provides various methods of handling missing values, including surrogate rules

  • creates tree diagrams, plots for cost-complexity analysis, and plots of ROC curves

  • computes statistics for assessing model fit, including model-based (resubstitution) statistics and cross validation statistics

  • computes measures of variable importance

  • produces a file that contains SAS DATA step code for scoring new data

  • produces a file that contains node rules

  • provides an output data set with leaf assignments and predicted values for observations
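The impurity-based splitting criteria named in the first feature above (entropy, Gini index, residual sum of squares) can be illustrated with a minimal sketch. This is not PROC HPSPLIT's implementation; the function names are hypothetical, the entropy uses base-2 logarithms by assumption, and the procedure's exact formulas and weighting may differ. The sketch only shows the general idea that a candidate split is scored by how much it reduces impurity relative to the parent node.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2, an assumption) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of the class labels in a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def rss(values):
    """Residual sum of squares of a numeric target around the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes.

    Intended for per-observation measures such as entropy and Gini.
    """
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


def rss_reduction(parent, children):
    """Parent RSS minus the total RSS of the child nodes (RSS is already a sum)."""
    return rss(parent) - sum(rss(child) for child in children)


# Hypothetical node with two classes and two candidate splits.
parent = ["a"] * 4 + ["b"] * 4
mixed_split = [["a", "a", "a", "b"], ["a", "b", "b", "b"]]
pure_split = [["a"] * 4, ["b"] * 4]

for name, measure in (("entropy", entropy), ("gini", gini)):
    print(name,
          impurity_reduction(parent, mixed_split, measure),
          impurity_reduction(parent, pure_split, measure))

# Hypothetical numeric target for a regression-style split.
print("rss", rss_reduction([1, 2, 3, 4], [[1, 2], [3, 4]]))
```

Under either classification measure, the split that separates the classes completely yields the larger reduction, so a tree grower that maximizes impurity reduction would prefer it. Criteria based on statistical tests (chi-square, F test, CHAID, FastCHAID) rank candidate splits differently and are not covered by this sketch.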

The HPSPLIT procedure uses ODS Graphics to create plots as part of its output. For general information about ODS Graphics, see Chapter 21, "Statistical Graphics Using ODS," in the SAS/STAT 14.1 User's Guide. For specific information about the statistical graphics available with the HPSPLIT procedure, see the PLOTS options in the PROC HPSPLIT statement and the section "ODS Graphics."

Because the HPSPLIT procedure is a high-performance analytical procedure, it also does the following:

  • enables you to run in distributed mode on a cluster of machines that distribute the data and the computations

  • enables you to run in single-machine mode on the server where SAS is installed

  • exploits all available cores and concurrent threads, regardless of execution mode

For more information, see the section "Processing Modes."