The HPSPLIT procedure calculates primary and surrogate splitting rules for assigning the observations in a node to a branch. Both types of splitting rules use the value of a single predictor variable to assign an observation to a branch.
A primary splitting rule is always calculated, and it provides for the assignment of observations whose value of the predictor variable is missing, even when the training data contain no missing values. This allows for the possibility of missing values when you score new data.
In addition, you can request one or more surrogate splitting rules by using the NSURROGATES= option. By default, NSURROGATES=0. The purpose of a surrogate rule is to handle the assignment of observations by using an alternative variable that has similar predictive ability and has nonmissing values in observations where the primary predictor is missing. Surrogate rules enable you to make better use of the data, because an observation whose primary predictor is missing can still be assigned on the basis of a related variable. The HPSPLIT procedure uses the method of Breiman et al. (1984) to determine surrogate rules. If you request one or more surrogate rules, the last column of the "Variable Importance" table shows the number of times that a variable is used as a surrogate. If a variable is used as a surrogate, you can see exactly how it is used in the SAS DATA step code that is generated when you specify the CODE statement.
When you request more than one surrogate rule, the rule that is applied to an observation is the first rule for which the surrogate variable is nonmissing. For example, consider the set of rules in Table 16.2 for a particular node in a decision tree where X, Y, and Z are three continuous predictors.
Table 16.2: Example of Splitting Rules
Rule | Description |
---|---|
Primary | If (NOT MISSING( |
Surrogate 1 | Else if (MISSING( |
Surrogate 2 | Else if (MISSING( |
Default | Else assign to branch 2 |
Note that the default rule ensures that observations that have missing values are always assigned to a branch regardless of whether you request surrogate rules.
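The order in which the primary, surrogate, and default rules are tried can be illustrated with a short Python sketch. This is not the code that PROC HPSPLIT generates. The function names, the thresholds, and the convention that every rule sends values below its threshold to branch 1 are hypothetical and are chosen only to show the fall-through behavior.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def assign_branch(obs, rules, default_branch):
    """Return the branch chosen by the first rule whose variable is nonmissing.

    rules is an ordered list of (variable, threshold) pairs: the primary
    rule first, followed by the surrogate rules. As a simplifying
    assumption, each rule sends values below its threshold to branch 1
    and all other values to branch 2.
    """
    for variable, threshold in rules:
        value = obs.get(variable)
        if not is_missing(value):
            return 1 if value < threshold else 2
    # Every rule variable is missing, so the default rule applies.
    return default_branch


# Hypothetical thresholds for the predictors X, Y, and Z.
rules = [("X", 0.5), ("Y", 10.0), ("Z", 3.0)]

print(assign_branch({"X": 0.2, "Y": 50.0, "Z": 9.0}, rules, default_branch=2))    # primary rule
print(assign_branch({"X": None, "Y": 50.0, "Z": 1.0}, rules, default_branch=2))   # surrogate 1
print(assign_branch({"X": None, "Y": None, "Z": 1.0}, rules, default_branch=2))   # surrogate 2
print(assign_branch({"X": None, "Y": None, "Z": None}, rules, default_branch=2))  # default rule
```

With no surrogate rules, the list contains only the primary rule, and any observation whose primary predictor is missing falls through to the default branch.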
PROC HPSPLIT selects a surrogate rule whose alternative predictor variable has the largest agreement with the predictor variable for the primary split. The measure of agreement between primary and surrogate splitting rules is the proportion of nonmissing observations (that is, observations without a missing value in the predictor variables) in the within-node training sample that the two rules assign to the same branch.
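The agreement measure can be illustrated with a small Python sketch. It is not the procedure's implementation: the data, the thresholds, and the function names are hypothetical, and the sketch assumes that "nonmissing" means nonmissing in both the primary predictor and the candidate surrogate predictor.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def agreement(rows, primary, surrogate):
    """Proportion of rows, nonmissing in both predictors, that the two
    rules assign to the same branch.

    Each rule is a (variable, threshold) pair that sends values below the
    threshold to branch 1 and all other values to branch 2 (an assumption
    made for this sketch).
    """
    p_var, p_cut = primary
    s_var, s_cut = surrogate
    same = total = 0
    for row in rows:
        p_val, s_val = row.get(p_var), row.get(s_var)
        if is_missing(p_val) or is_missing(s_val):
            continue  # skip rows that either rule cannot evaluate
        total += 1
        same += (p_val < p_cut) == (s_val < s_cut)
    return same / total if total else 0.0


# Hypothetical within-node training sample.
node_rows = [
    {"X": 0.1, "Y": 4.0, "Z": 1.0},
    {"X": 0.3, "Y": 8.0, "Z": 5.0},
    {"X": 0.7, "Y": 12.0, "Z": 6.0},
    {"X": 0.9, "Y": 9.0, "Z": 2.0},
    {"X": 0.4, "Y": None, "Z": 2.0},
    {"X": None, "Y": 3.0, "Z": 1.0},
]

primary = ("X", 0.5)
candidates = [("Y", 10.0), ("Z", 3.0)]

scores = {var: agreement(node_rows, primary, (var, cut)) for var, cut in candidates}
best = max(scores, key=scores.get)
print(scores)  # {'Y': 0.75, 'Z': 0.6}
print(best)    # Y
```

In this sample the rule on Y agrees with the primary rule in three of the four rows where both X and Y are nonmissing, whereas the rule on Z agrees in three of five, so Y would be preferred as the first surrogate.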