The main features of the HPSPLIT procedure are as follows:
provides a variety of methods of splitting nodes, including criteria based on impurity (entropy, Gini index, residual sum of squares) and criteria based on statistical tests (chi-square, F test, CHAID, FastCHAID)
provides a computationally efficient strategy for generating candidate splits
provides the cost-complexity, C4.5, and reduced-error methods of pruning trees
supports the use of cross validation and validation data for selecting the best subtree
provides various methods of handling missing values, including surrogate rules
creates tree diagrams, plots for cost-complexity analysis, and plots of ROC curves
computes statistics for assessing model fit, including model-based (resubstitution) statistics and cross validation statistics
computes measures of variable importance
produces a file that contains SAS DATA step code for scoring new data
produces a file that contains node rules
provides an output data set with leaf assignments and predicted values for observations
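To make the impurity-based splitting criteria in the feature list concrete, the following Python sketch computes entropy and the Gini index for a set of class labels and the impurity reduction that a candidate split achieves. It illustrates the textbook definitions only; it is not the HPSPLIT implementation, and the function names (`entropy`, `gini`, `impurity_reduction`) and the sample labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical node with 10 observations and one candidate binary split.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

entropy_gain = impurity_reduction(parent, [left, right], entropy)
gini_gain = impurity_reduction(parent, [left, right], gini)
```

Under an impurity criterion, a tree-growing algorithm generally prefers the candidate split with the largest reduction. The statistical-test criteria (chi-square, F test, CHAID, FastCHAID) rank candidate splits differently and are not covered by this sketch.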
The HPSPLIT procedure uses ODS Graphics to create plots as part of its output. For general information about ODS Graphics, see Chapter 21: Statistical Graphics Using ODS. For specific information about the statistical graphics available with the HPSPLIT procedure, see the PLOTS options in the PROC HPSPLIT statement and the section ODS Graphics.
Because the HPSPLIT procedure is a high-performance analytical procedure, it also does the following:
enables you to run in distributed mode on a cluster of machines that distribute the data and the computations
enables you to run in single-machine mode on the server where SAS is installed
exploits all available cores and concurrent threads, regardless of execution mode
For more information, see the section Processing Modes in SAS/STAT 14.1 User's Guide: High-Performance Procedures.