The HPSPLIT Procedure

Unknown Values of Categorical Predictors

Unseen and infrequent levels of categorical predictors are referred to as unknown values, and they present problems for building decision trees in two situations. In the first situation, categorical predictors in validation or scoring data contain levels that do not occur in the training data. Splitting rules treat unseen levels of this type as missing values. In the second situation, categorical predictors in training data contain levels that occur very rarely and might be considered outliers.
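The first situation can be illustrated with a minimal Python sketch. It is a conceptual illustration only, not the implementation that PROC HPSPLIT uses; the function name mask_unseen_levels and the sample levels are hypothetical.

```python
def mask_unseen_levels(values, training_levels):
    """Return a copy of values in which any level that was not
    observed in the training data is replaced by None (missing)."""
    return [v if v in training_levels else None for v in values]


# Hypothetical levels observed in the training data.
training_levels = {"red", "green", "blue"}

# Hypothetical scoring data; "purple" never occurred in training.
scoring_values = ["red", "purple", "blue"]

masked = mask_unseen_levels(scoring_values, training_levels)
```

In this sketch, the unseen level "purple" becomes a missing value, so it would follow whatever branch the tree assigns to missing values.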

You can use the MINCATSIZE= option to specify the minimum number of occurrences that are required for a level to be used in splitting. By default, MINCATSIZE=1. By specifying a number greater than 1, you can filter out rarely occurring levels.
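The effect of a minimum-occurrence threshold can be sketched in Python as follows. This is a conceptual illustration that assumes a simple count-based filter; the function usable_levels and its min_cat_size parameter are hypothetical names that mirror the MINCATSIZE= option, and the sketch does not show how PROC HPSPLIT handles the levels that are filtered out.

```python
from collections import Counter


def usable_levels(values, min_cat_size=1):
    """Return the set of levels that occur at least min_cat_size
    times among the nonmissing values."""
    counts = Counter(v for v in values if v is not None)
    return {level for level, n in counts.items() if n >= min_cat_size}


# Hypothetical training column: "a" occurs 3 times, "b" twice, "c" once.
column = ["a", "a", "a", "b", "b", "c"]

all_levels = usable_levels(column)                      # threshold of 1 keeps every level
frequent_levels = usable_levels(column, min_cat_size=2)  # drops the level that occurs once
```

With the default threshold of 1, every observed level qualifies. Raising the threshold to 2 excludes the level "c", which occurs only once.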

Note that for continuous predictors, the splitting rules always assign a value to a branch, regardless of whether the value is unseen or infrequent. For example, a decision tree might have a branch based on the range "(Age > 20) AND (Age < 60)". Here "Age" is a continuous predictor. Any value of "Age" that is not in this range is assigned to another branch of that decision tree.
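The following Python sketch illustrates why a range rule covers every nonmissing value of a continuous predictor. It is an illustration only; the function assign_branch and the branch labels "in_range" and "other" are hypothetical, and a real tree could divide the values outside the range among several branches.

```python
def assign_branch(age):
    """Assign a nonmissing Age value to a branch by using the rule
    (Age > 20) AND (Age < 60). Values outside the range fall into
    another branch, whether or not they occurred in training."""
    if age > 20 and age < 60:
        return "in_range"
    return "other"


# Hypothetical values, including one far outside typical training data.
branches = [assign_branch(a) for a in (35, 20, 75, 110.5)]
```

Because the comparison is defined for any number, a value such as 110.5 that never occurred in the training data still receives a branch.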