The HPSPLIT Procedure

PROC HPSPLIT Statement

  • PROC HPSPLIT <options>;

The PROC HPSPLIT statement invokes the procedure. Table 61.1 summarizes the options in the PROC HPSPLIT statement.

Table 61.1: PROC HPSPLIT Statement Options

Option            Description

Basic Options
CVCC              Requests a table of the results of cost-complexity pruning based on cross validation
CVMETHOD=         Specifies the cross validation method to use
CVMODELFIT        Requests model assessment and a confusion matrix with cross validation
DATA=             Specifies the predictor data set
INTERVALBINS=     Specifies the number of bins for continuous variables
MINVARIANCE=      Sets the minimum variance for a regression tree leaf to be split
NODES             Requests a table that describes the nodes of the final tree
NOPRINT           Suppresses ODS output
NSURROGATES=      Specifies the number of surrogate rules to create
PLOTS=            Specifies options for plots
SEED=             Specifies the random number seed to use for cross validation
SPLITONCE         Specifies that variables should be split only once per branch

Splitting Options
ASSIGNMISSING=    Specifies how to handle missing values in a predictor variable
LEVTHRESH1=       Specifies the maximum number of computations to perform in an exhaustive search for a categorical predictor
LEVTHRESH2=       Specifies the number of computations to perform before the splitter uses the fastest greedy search
MAXBRANCH=        Specifies the maximum number of children per node
MAXDEPTH=         Specifies the maximum tree depth
MINCATSIZE=       Specifies the minimum number of observations that a level must have to be considered for splitting
MINLEAFSIZE=      Specifies the minimum number of observations per leaf


You can specify the following options.

ASSIGNMISSING=BRANCH | NONE | POPULAR | SIMILAR

specifies how PROC HPSPLIT creates a default splitting rule to handle missing values, unknown levels, and levels that have fewer observations than you specify in the MINCATSIZE= option. An unknown level is a level of a categorical predictor that does not exist in the training data but is encountered during scoring.

Both the ASSIGNMISSING= and NSURROGATES= options affect training and scoring. See the NSURROGATES= option for the definition of surrogate rules.

During training, the primary splitting rule is created first, along with the default splitting rule (controlled by the ASSIGNMISSING= option). If you request surrogate rules (by using the NSURROGATES= option), they are created after the primary and default splits are made. After the splitting rules have been created, they are used to assign the training observations to the new children as described in the following list, and splitting rule creation then continues on those children.

Observation assignment during the training phase and during scoring proceeds as follows:

  1. The primary splitting rule is applied if the rule’s variable is not missing. Otherwise:

  2. The first surrogate rule (with the largest agreement, described in the section Primary and Surrogate Splitting Rules) is applied if the first surrogate rule’s variable is not missing. Otherwise:

  3. Each subsequent surrogate rule, in order of decreasing agreement, is applied in the same way: the rule is used if its variable is not missing. If all of the surrogate rules’ variables are missing, then:

  4. The default splitting rule is used.

Because there is always a default splitting rule, all data can be scored, even if the primary rule and all surrogate rules cannot be used on a particular observation.
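The assignment order in the preceding list can be sketched in Python. This is only an illustration of the cascade, not PROC HPSPLIT code: the Rule class, the assign_child function, the variable names, and the use of None to mark a missing value are all assumptions made for the example, and unknown levels are not modeled.

```python
from typing import Callable, Dict, List, Optional

Observation = Dict[str, Optional[float]]  # None marks a missing value (assumption)


class Rule:
    """A splitting rule on one variable (hypothetical structure for illustration)."""

    def __init__(self, variable: str, assign: Callable[[float], int]):
        self.variable = variable
        self.assign = assign  # maps a non-missing value to a child index


def assign_child(obs: Observation, primary: Rule,
                 surrogates: List[Rule], default_child: int) -> int:
    """Apply the primary rule, then each surrogate, then the default rule."""
    # Surrogates are assumed to be sorted by decreasing agreement already.
    for rule in [primary, *surrogates]:
        value = obs.get(rule.variable)
        if value is not None:
            return rule.assign(value)
    # Every rule's variable was missing, so the default rule decides.
    return default_child


primary = Rule("age", lambda v: 0 if v < 40 else 1)
surrogates = [Rule("income", lambda v: 0 if v < 50000 else 1)]

print(assign_child({"age": 35, "income": 80000}, primary, surrogates, 1))    # primary rule
print(assign_child({"age": None, "income": 80000}, primary, surrogates, 0))  # first surrogate
print(assign_child({"age": None, "income": None}, primary, surrogates, 0))   # default rule
```

In this sketch the default child is passed in as a plain index; in the procedure it is determined by the ASSIGNMISSING= option.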

You can specify one of the following:

BRANCH

specifies that PROC HPSPLIT create a special child (branch) for the default rule and assign to that child missing values, unknown levels, and levels that have fewer observations than you specify in the MINCATSIZE= option. You need to have missing values in the training data set for this to work. If no missing values or levels that have fewer observations than you specify in the MINCATSIZE= option are available at the split in the training data set, then missing values are assigned using ASSIGNMISSING=POPULAR.

NONE

specifies that observations that have a missing value in any variable be excluded from the analysis. Because missing values are excluded from the analysis, they are treated as unknown levels. The default rule, which handles unknown levels and levels that have fewer observations than you specify in the MINCATSIZE= option, is created by using ASSIGNMISSING=POPULAR.

POPULAR

specifies that missing values be assigned to the most popular (largest) child.

SIMILAR

specifies that missing values be assigned to the child that they are most similar to (using the chi-square for categorical responses or F-test criterion for continuous responses). For more information about this option and the similarity measurement, see the section Handling Missing Values. If no missing values or levels that have fewer observations than you specify in the MINCATSIZE= option are available at the split in the training data set, then missing values are assigned using ASSIGNMISSING=POPULAR.

By default, ASSIGNMISSING=NONE.

CVCC
CVCOSTCOMPLEXITY

requests a table of the results of cost-complexity pruning based on cross validation. For each fold in the cross validation, the table provides the penalty parameter, the number of leaves, and the average ASE. In addition, the table provides the minimum and maximum ASE, and the minimum, median, and maximum number of leaves. The number of leaves is the floor of the median. You can use the PLOTS=CVCC option to request a plot of the information in this table.

CVMETHOD=NONE | RANDOM <(k)>

specifies the cross validation method to use.

You can specify one of the following:

NONE

suppresses cross validation.

RANDOM <(k)>

assigns each training observation randomly to one of the k folds (with a probability of 1/k for any given fold) for cross validation. The default value of k is 10.

Cross validation applies to the CVMODELFIT option and cost-complexity pruning. In k-fold cross validation, the training set is divided into k folds, and k trees are then built, each using all but one of the folds. For example, the first tree uses all of the training set except for the observations in the first fold. The second uses all of the training set except for the observations in the second fold, and so on.

The holdout fold for each tree is used to calculate the ASE of that tree. The average ASE across the k trees is the cross validation error for that set of trees. This process is repeated for each value of the parameter that is cross validated. The parameter that has the minimum cross validated error is used as the best parameter value. A final tree is then grown using this final parameter value.
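The following Python sketch illustrates the k-fold procedure described above: random fold assignment with probability 1/k, one model per held-out fold, the average holdout ASE as the cross validation error, and selection of the parameter value with the smallest error. It is not PROC HPSPLIT code. The function names are hypothetical, and fit_scaled_mean is a deliberately trivial stand-in for growing and pruning a tree with a given penalty parameter.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

Row = Tuple[float, float]          # (predictor value, response value)
Model = Callable[[float], float]   # maps a predictor value to a prediction


def assign_folds(n_obs: int, k: int, seed: int) -> List[int]:
    """Assign each observation to one of k folds with probability 1/k."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(n_obs)]


def cv_error(data: List[Row], fit: Callable[[List[Row], float], Model],
             param: float, k: int = 10, seed: int = 1) -> float:
    """Average holdout ASE across the k models for one parameter value."""
    folds = assign_folds(len(data), k, seed)
    fold_ase = []
    for held_out in range(k):
        train = [row for row, f in zip(data, folds) if f != held_out]
        test = [row for row, f in zip(data, folds) if f == held_out]
        if not train or not test:
            continue  # random assignment can leave a fold empty on small data
        model = fit(train, param)
        fold_ase.append(mean((y - model(x)) ** 2 for x, y in test))
    return mean(fold_ase)


def fit_scaled_mean(train: List[Row], scale: float) -> Model:
    """Stand-in model (assumption): predicts scale times the training mean."""
    center = scale * mean(y for _, y in train)
    return lambda x: center


data = [(float(i), 10.0 + (i % 3) - 1.0) for i in range(60)]
candidates = [0.0, 0.5, 1.0]
best = min(candidates, key=lambda p: cv_error(data, fit_scaled_mean, p))
print(best)
```

The same seed is reused for every candidate value so that all candidates are compared on identical folds; whether the procedure does the same internally is not stated in this section.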

Note: Attempting to use a PARTITION statement along with cross validation results in an error.

By default, CVMETHOD=RANDOM(10) when you perform cost-complexity pruning with no PARTITION statement.

CVMODELFIT

requests model assessment with cross validation. When you specify this option, the procedure does a cross validation of the final model parameters and produces a table that describes the cross validation error measures of the parameters. The table contains various summary statistics of the ASE, the number of leaves, and, for a categorical response, the misclassification rate across the k trees grown. This option also requests a table that contains the cross validation confusion matrix.

If cost complexity is cross validated (the default if you use cost-complexity pruning without a validation set), the assessment is a completely separate cross validation of the final tree penalty parameter, using a different seed for fold assignment.

Note: The assessment is not on a holdout sample but instead on k-fold cross validation. This is not a second run of the procedure but instead is done automatically and internally.

DATA=SAS-data-set

names the predictor SAS data set to be used by PROC HPSPLIT. The default is the most recently created data set.

If the procedure executes in distributed mode, the predictor data are distributed to memory on the appliance nodes and analyzed in parallel, unless the data are already distributed in the appliance database. In that case, the procedure reads the data alongside the distributed database. For more information, see the section Processing Modes in SAS/STAT 14.1 User's Guide: High-Performance Procedures about the various execution modes and the section Alongside-the-Database Execution in SAS/STAT 14.1 User's Guide: High-Performance Procedures about the alongside-the-database model.

INTERVALBINS=number

specifies the number of bins for continuous variables. PROC HPSPLIT bins continuous predictors to a fixed bin size. This option controls the number of bins and thereby also the size of the bins. For more information about interval variable binning, see the section Details: HPSPLIT Procedure.

By default, INTERVALBINS=100.
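As a rough illustration of fixed-size binning, the following Python sketch maps a value to one of INTERVALBINS= equal-width bins. It assumes that the bins evenly divide the range between a known minimum and maximum of the predictor; the exact binning that PROC HPSPLIT uses is described in the section Details: HPSPLIT Procedure and might differ. The function name bin_index is hypothetical.

```python
def bin_index(value: float, lo: float, hi: float, n_bins: int = 100) -> int:
    """Return the 0-based index of the equal-width bin that contains value."""
    if hi <= lo:
        return 0  # degenerate range: everything falls into a single bin
    idx = int((value - lo) * n_bins / (hi - lo))
    # Clamp so that value == hi (or out-of-range values) lands in a valid bin.
    return min(max(idx, 0), n_bins - 1)


print(bin_index(5.0, 0.0, 10.0))      # a value in the middle of the range
print(bin_index(10.0, 0.0, 10.0))     # the maximum falls into the last bin
print(bin_index(7.5, 0.0, 10.0, 4))   # fewer bins means wider bins
```

Increasing the number of bins narrows each bin, which allows finer split points at the cost of more candidate splits to evaluate.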

LEVTHRESH1=number

applies only to categorical predictor variables and specifies the limit for the number of computations in an exhaustive search for the optimal partition of the levels of a particular variable. The splitter first evaluates the number of computations that are needed for an exhaustive search. If this number exceeds the limit, then the splitter falls back to a faster heuristic algorithm. You can use the LEVTHRESH2= option to specify the limit for the number of computations in this faster algorithm.

By default, LEVTHRESH1=500000.

LEVTHRESH2=number

applies to categorical predictor variables and continuous predictor variables with multiway splits. This option does not apply to continuous predictor variables with binary splits.

For a categorical predictor variable, the splitter first evaluates the number of computations that are needed for an exhaustive search. If this number exceeds the limit that you specify in the LEVTHRESH1= option, then the splitter falls back to a faster heuristic algorithm. You can use the LEVTHRESH2= option to specify the limit for the number of computations in this faster algorithm. If this number of computations exceeds the limit that you specify in the LEVTHRESH2= option, then the splitter falls back to an even faster algorithm.

For a continuous predictor variable, the splitter attempts an exhaustive search for the optimal split values. It first evaluates the number of computations that are needed for the exhaustive search. If this number exceeds the limit that you specify in the LEVTHRESH2= option, the splitter falls back to a faster heuristic algorithm.

By default, LEVTHRESH2=1000000.
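The fallback chain for a categorical predictor can be sketched in Python as follows. This is an illustration only. How PROC HPSPLIT counts computations is not specified in this section, so the sketch assumes, for a binary split, that the exhaustive cost is the number of distinct two-way partitions of the levels, and it takes the cost of the heuristic search as a caller-supplied number. The function names and the strategy labels are hypothetical.

```python
def exhaustive_cost(n_levels: int) -> int:
    """Number of distinct two-way partitions of n_levels unordered levels (assumed cost)."""
    return 2 ** (n_levels - 1) - 1 if n_levels >= 2 else 0


def choose_search(cost_exhaustive: int, cost_heuristic: int,
                  levthresh1: int = 500_000, levthresh2: int = 1_000_000) -> str:
    """Pick a search strategy by comparing estimated costs to the two limits."""
    if cost_exhaustive <= levthresh1:
        return "exhaustive"
    if cost_heuristic <= levthresh2:
        return "heuristic"
    return "fastest"


print(exhaustive_cost(19), choose_search(exhaustive_cost(19), 0))
print(exhaustive_cost(20), choose_search(exhaustive_cost(20), 10_000))
print(exhaustive_cost(20), choose_search(exhaustive_cost(20), 2_000_000))
```

Under this assumed cost model, a 19-level predictor stays within the default LEVTHRESH1= limit and a 20-level predictor exceeds it, which shows how quickly an exhaustive search becomes impractical as the number of levels grows.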

MAXBRANCH=b

specifies the maximum number of children per node in the tree. PROC HPSPLIT tries to create this number of children unless it is impossible (for example, if a split variable does not have enough levels). By default, MAXBRANCH=2.

MAXDEPTH=number

specifies the maximum depth of the tree to be grown. The default is set using the following equation, where b is the value for the MAXBRANCH= option:

\begin{equation*} \mathrm{MaxDepth} = \left\lceil \frac{10}{\log_2\left(b\right)} \right\rceil \end{equation*}

For example, with the default MAXBRANCH=2, the default maximum depth is 10; with MAXBRANCH=3, it is 7.

MINCATSIZE=number

specifies the number of observations that a categorical variable level must have in order to be considered in the split. Predictor variable levels that have fewer observations than number receive the nonsurrogate missing value assignment for that split. See the NSURROGATES= option for the definition of surrogate rules.

By default, MINCATSIZE=1.

MINLEAFSIZE=number

specifies the minimum number of observations that each child of a split must contain in the training data set in order for the split to be considered.

By default, MINLEAFSIZE=1.

MINVARIANCE=value

specifies the minimum variance for a regression tree leaf to be eligible for splitting. That is, leaves whose variance is less than value are not split any further.

By default, MINVARIANCE=1E-8.

The variance at some leaf $\lambda$ with weight $N_\lambda$ (number of observations) is calculated using

\begin{equation*} \mathrm{Var} = \frac{\sum_{i \in \lambda} y_i^2}{N_\lambda} - \bar{y}_\lambda^2 \end{equation*}

where $y_i$ is the response value at observation i in leaf $\lambda$ and $\bar{y}_\lambda$ is the average value of the response within leaf $\lambda$.
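A minimal Python sketch of this calculation follows, assuming unweighted observations so that the weight of the leaf is simply the number of response values in it. The function names are hypothetical, and the comparison mirrors the statement that leaves whose variance is less than the MINVARIANCE= value are not split.

```python
from typing import Sequence


def leaf_variance(y: Sequence[float]) -> float:
    """Mean of the squared responses minus the squared mean response."""
    n = len(y)
    y_bar = sum(y) / n
    return sum(v * v for v in y) / n - y_bar * y_bar


def eligible_for_split(y: Sequence[float], minvariance: float = 1e-8) -> bool:
    """A leaf is split further only if its variance is at least minvariance."""
    return leaf_variance(y) >= minvariance


print(leaf_variance([1.0, 2.0, 3.0]), eligible_for_split([1.0, 2.0, 3.0]))
print(leaf_variance([5.0, 5.0, 5.0]), eligible_for_split([5.0, 5.0, 5.0]))
```

A leaf whose responses are all equal has zero variance, so it is never split regardless of how many observations it contains.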

NODES=DETAIL | SUMMARY

requests a table that describes the nodes and leaves of the final tree, optionally including the path from each node or leaf to the root.

You can specify the following values:

DETAIL

prints all nodes and leaves, including the path from each node or leaf to the root of the tree.

SUMMARY

prints all nodes and leaves, without including the path to the root of the tree.

NOPRINT

suppresses the generation of ODS output.

NSURROGATES=number

specifies the number of surrogate rules to create for each splitting rule. Surrogate rules are backup splitting rules that are used when the variable that corresponds to the primary splitting rule is missing. By default, NSURROGATES=0.

Both the ASSIGNMISSING= and NSURROGATES= options affect training and scoring.

During training, the primary splitting rule is created first, along with the default splitting rule (controlled by the ASSIGNMISSING= option). If you request surrogate rules (by specifying the NSURROGATES= option), they are created after the primary and default splits are created. After the splitting rules have been created, they are used to assign the training observations to the new children as described in the following list, and splitting rule creation then continues on those children.

Observation assignment during the training phase and during scoring proceeds as follows:

  1. The primary splitting rule is applied if the primary rule’s variable is not missing. Otherwise:

  2. The first surrogate rule (with the largest agreement, described in the section Primary and Surrogate Splitting Rules) is applied if the first surrogate rule’s variable is not missing. Otherwise:

  3. Each subsequent surrogate rule, in order of decreasing agreement, is applied in the same way: the rule is used if its variable is not missing. If all of the surrogate rules’ variables are missing, then:

  4. The default splitting rule is used.

Because there is always a default splitting rule, all data can be scored, even if the primary rule and all surrogate rules cannot be used on a particular observation.

As auxiliary rules, surrogate rules are used in the following ways:

  • Surrogate rules are used in generating the tree.

  • Surrogate rules therefore also affect the final tree metrics.

  • Surrogate rules are used in the SAS code output from the CODE statement.

  • Surrogate rules are used in scoring the predictor data set by using the OUTPUT statement.

  • Surrogate rules are not written out in the “Node Rules” file, output from the RULES statement.

PLOTS <(global-plot-option)> <= plot-request <(options)>>
PLOTS <(global-plot-option)> <= (plot-request <(options)> <... plot-request <(options)>>)>

controls the plots that are produced through ODS Graphics. When you specify only one plot-request, you can omit the parentheses around it.

You can specify the following global-plot-option:

ONLY

suppresses the default plots. Only plots that you specifically request are displayed.

You can specify the following plot-requests:

ALL

produces all appropriate plots.

CVCC

produces the cross validation plot. This option is enabled by default if you request cross validation of cost-complexity pruning.

NONE

suppresses all plots.

PRUNEUNTIL

produces a plot of the metric that is used to select the final subtree. This option is enabled by default except in the following cases:

  • You request cross validation of cost-complexity pruning.

  • You specify the LEAVES= option in the PRUNE statement when you use the cost-complexity pruning method.

  • You specify the OFF option in the PRUNE statement. Note that specifying the PRUNEUNTIL option has no effect in this case.

ROC

produces the receiver operating characteristic (ROC) curve. This option is enabled by default.

WHOLETREE <(whole-tree-options)>

produces a plot to visualize the entire finished (grown and pruned) tree. This option is enabled by default.

You can specify the following whole-tree-options:

LINKSTYLE=link-style

specifies the style of links between nodes and leaves in the tree. You can specify the following link-styles:

CURVED

requests curved links between the nodes and their children.

ORTHOGONAL

requests that links go straight down partway from a node to its children, create a horizontal line at the base of the vertical line, and then go straight down from that line to each child.

STRAIGHT

requests that links go straight from the nodes to their children.

By default, LINKSTYLE=CURVED.

LINKWIDTH=link-width

specifies the width of links between nodes and leaves in the tree. You can specify the following link-widths:

CONSTANT

requests that all links have the same thickness.

PROPORTIONAL

requests that links have a thickness proportional to the total number of observations that go between the node and each child.

By default, LINKWIDTH=PROPORTIONAL.

NOLEGEND

turns off the legend.

ZOOMEDTREE <(zoomed-tree-options)>

produces a plot to visualize a portion of the finished (grown and pruned) tree. This option is enabled by default.

You can specify the following zoomed-tree-options:

DEPTH=depth

requests that PROC HPSPLIT create plots specified by the ZOOMEDTREE option for each node-id down to depth.

LINKSTYLE=link-style

specifies the style of links between nodes and leaves in the tree. You can specify the following link-styles:

CURVED

requests curved links between the nodes and their children.

ORTHOGONAL

requests that links go straight down partway from a node to its children, create a horizontal line at the base of the vertical line, and then go straight down from that line to each child.

STRAIGHT

requests that links go straight from the nodes to their children.

By default, LINKSTYLE=CURVED.

LINKWIDTH=link-width

specifies the width of links between nodes and leaves in the tree. You can specify the following link-widths:

CONSTANT

requests that all links have the same thickness.

PROPORTIONAL

requests that links have a thickness proportional to the total number of observations that go between the node and each child.

By default, LINKWIDTH=PROPORTIONAL.

NODES=(node-id <node-id <...>>)

requests a plot of the subtree that is rooted at each node-id that you specify. The default node ID is “0,” the root of the entire tree. The values of node-id are alphanumeric strings that are displayed within the nodes in the plot that is created by the WHOLETREE option. PROC HPSPLIT creates one plot specified by the ZOOMEDTREE option for each node-id that you specify.

NOLEGEND

turns off the legend.

SEED=number

specifies the initial seed for random number generation for cross validation. The value of number must be an integer. The default seed is based on the date and time.

SPLITONCE

specifies that variables be split only once on a branch. However, a variable can be used more than once across branches. That is, a variable cannot be split more than once on the path from any leaf to the root node.
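The constraint can be illustrated with a short Python sketch that filters the candidate split variables for a node, given the variables already used on the path from the root to that node. The function name and arguments are hypothetical, and this is not how PROC HPSPLIT is implemented internally.

```python
from typing import List, Sequence


def candidate_variables(all_vars: Sequence[str], path_vars: Sequence[str],
                        split_once: bool) -> List[str]:
    """Return the variables that may still be split at the current node."""
    if not split_once:
        return list(all_vars)
    used = set(path_vars)  # variables already split between the root and this node
    return [v for v in all_vars if v not in used]


predictors = ["age", "income", "region"]

# On a branch where "age" was already split, "age" is no longer a candidate.
print(candidate_variables(predictors, ["age"], split_once=True))

# On a different branch that split on "income", "age" is still available.
print(candidate_variables(predictors, ["income"], split_once=True))

# Without SPLITONCE, every variable remains a candidate.
print(candidate_variables(predictors, ["age"], split_once=False))
```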