When building and pruning a tree, PROC HPSPLIT ignores observations that have a missing value in the target. However, it scores these observations when you use the SCORE statement, and the SAS DATA step score code handles them as well.
PROC HPSPLIT always includes observations that have missing values in input variables. It uses a special level or bin for them that is not used in per-variable split determination.
Each split handles missing values by assigning them to one of the children. This ensures that data scored by the SAS DATA step score code can always assign a target to any record.
If you request surrogate rules, they provide a set of backup splitting rules for assigning targets. If the variable for a given rule (primary or surrogate) is missing, the next backup rule is used. Thus, the surrogate rules form a progressive sequence of backups for the preceding splitting rules. If no splitting rule (primary or any surrogate) is usable, the MISSING= option provides the default rule.
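The fallback order described above can be pictured with a short Python sketch. It is an illustration only, not the score code that PROC HPSPLIT generates: the function names, the variable names (income, age, tenure), the cutoffs, and the child labels are all hypothetical, and the default child stands in for whatever the MISSING= option would select.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption made for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def route(observation, rules, default_child):
    """Return the child node for one observation.

    rules is an ordered list of (variable_name, rule_function) pairs:
    the primary rule first, then the surrogate rules in order.
    default_child stands in for the MISSING= default assignment.
    """
    for variable, rule in rules:
        value = observation.get(variable)
        if not is_missing(value):
            # The first rule whose variable is present decides the child.
            return rule(value)
    # No rule was usable, so fall back to the default assignment.
    return default_child


# Hypothetical rules for a single split.
rules = [
    ("income", lambda v: "left" if v < 50000 else "right"),  # primary
    ("age", lambda v: "left" if v < 30 else "right"),        # surrogate 1
    ("tenure", lambda v: "left" if v < 2 else "right"),      # surrogate 2
]
```

Under these assumptions, an observation that is missing income but has age is routed by the first surrogate, and an observation that is missing all three variables goes to the default child.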
The MISSING= option in the PROC HPSPLIT statement controls missing value assignment as follows:
The BRANCH option requests that missing values be assigned to their own branch.
The POPULARITY option requests that missing values be assigned to the most popular leaf (default).
The SIMILARITY option requests that missing values be assigned to the branch they are most similar to (using the chi-square or F test criterion).
Observations that correspond to unknown levels are assigned using the default missing value assignment. Nominal variable levels whose counts are less than the value of the MINCATSIZE= option are also assigned using the default missing value assignment.
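One way to picture the treatment of unknown and rare levels is as a recoding step that maps them to the missing value before the split rules are applied. The Python sketch below makes that assumption explicit. The function names and the use of None as the missing marker are hypothetical, and the procedure's internal handling might differ.

```python
from collections import Counter

MISSING = None  # hypothetical marker for a missing value


def build_level_recoder(training_values, min_cat_size):
    """Return a function that maps rare or unknown levels to MISSING.

    A level is kept only if its training count is at least min_cat_size,
    which mirrors "counts less than MINCATSIZE= are treated as missing".
    """
    counts = Counter(v for v in training_values if v is not MISSING)
    kept = {level for level, n in counts.items() if n >= min_cat_size}

    def recode(value):
        # Levels never seen in training and rare levels both become MISSING.
        return value if value in kept else MISSING

    return recode


recode = build_level_recoder(["a", "a", "a", "b", "b", "c"], min_cat_size=2)
```

Here the level "c" occurs only once, so it is recoded to missing, as is any level such as "z" that never appeared in the training data.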
Surrogate rules are based on the description in Breiman et al. (1984). After splitting a node, PROC HPSPLIT tries to create a surrogate rule for that split from every input variable that is not involved in the split. Only variables that have enough bins or levels to assign one to each new leaf are eligible to be used as surrogates. PROC HPSPLIT first restricts the observations to those that have nonmissing values in both the primary split variable and the candidate surrogate variable. Then it ranks the candidate surrogate splits by agreement and selects them in decreasing order of agreement. That is, the surrogate split that has the largest agreement is first, followed by the one that has the second-largest agreement, and so on, until the procedure finds the number of surrogate rules that are requested in the NSURROGATES= option or until it runs out of candidates. The agreement is the fraction of eligible (that is, nonmissing) observations that the primary rule and the candidate surrogate rule assign to the same new leaf.
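The agreement measure and the ranking of candidates can be sketched in a few lines of Python. This is an illustration of the definition above, not the procedure's implementation. It assumes that each rule's leaf assignments are already available as lists, with None marking observations whose variable is missing. It does not model the eligibility check on the number of bins or levels, and the order of candidates that tie is an arbitrary choice of this sketch. All names are hypothetical.

```python
def agreement(primary, surrogate):
    """Fraction of eligible observations that both rules send to the same leaf.

    primary and surrogate are equal-length lists of leaf labels, with None
    where the corresponding variable is missing. Only observations that are
    nonmissing for both rules are eligible.
    """
    eligible = [(p, s) for p, s in zip(primary, surrogate)
                if p is not None and s is not None]
    if not eligible:
        return 0.0
    return sum(p == s for p, s in eligible) / len(eligible)


def pick_surrogates(primary, candidates, n_surrogates):
    """Return up to n_surrogates candidate names, largest agreement first."""
    scored = [(agreement(primary, assigned), name)
              for name, assigned in candidates.items()]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [name for _, name in scored[:n_surrogates]]


# Hypothetical leaf assignments for six observations.
primary = ["L", "L", "R", "R", None, "L"]
candidates = {
    "age":    ["L", "L", "R", "L", "R", None],
    "tenure": ["L", "R", "R", "R", "L", "L"],
}
```

In this example, age agrees with the primary rule on 3 of 4 eligible observations and tenure agrees on 4 of 5, so tenure would be chosen as the first surrogate and age as the second.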