The HPSPLIT Procedure

Example 13.1 Creating a Node Rules Description of a Tree

This example creates a tree model and saves a node rules representation of the model in a file. It uses the mortgage application data set `HMEQ` in the Sample Library, which is described in the Getting Started example in the section Getting Started: HPSPLIT Procedure.

The following statements create the tree model:

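```sas
proc hpsplit data=sampsio.hmeq maxdepth=7 maxbranch=2;
   target BAD;                          /* target, as in Getting Started */
   input DELINQ DEROG JOB NINQ REASON / level=nom;                 /* nominal inputs  */
   input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;    /* interval inputs */
   criterion entropy;                   /* criterion for growing the tree */
   prune misc / N <= 6;                 /* stop pruning at 6 or fewer leaves */
   partition fraction(validate=0.2);    /* hold out 20% for validation */
   rules file='hpsplhme2-rules.txt';    /* save the node rules description */
   score out=scored2;                   /* save the scored observations */
run;
```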

The target variable (`BAD`) and input variables that are specified for the tree model are the same as in the Getting Started example. The criteria for growing and pruning the tree are also the same, except that pruning is stopped when the number of leaves is less than or equal to 6.

The RULES statement specifies a file named `hpsplhme2-rules.txt`, to which the node rules description of the model is saved. The following listing of this file shows that each leaf of the tree (labeled as a NODE) is numbered and described:

```
*------------------------------------------------------------*
NODE = 2
*------------------------------------------------------------*
(DELINQ IS ONE OF 5, 6, 7, 8, 10, 11, 12, 13)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
PREDICTED VALUE IS 1
PREDICTED 1 = 0.9403( 63/67)
PREDICTED 0 = 0.0597( 4/67)
*------------------------------------------------------------*
NODE = 7
*------------------------------------------------------------*
MISSING(DEROG) OR (DEROG IS ONE OF 0, 1, 2, 7, 10)
AND MISSING(NINQ) OR (NINQ IS ONE OF 0, 1, 2, 3, 7, 9, 11, 12, 13, 14)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 1, 2, 3, 4)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
PREDICTED VALUE IS 0
PREDICTED 1 = 0.3283( 263/801)
PREDICTED 0 = 0.6717( 538/801)
*------------------------------------------------------------*
NODE = 6
*------------------------------------------------------------*
(DEROG IS ONE OF 3, 4, 5, 6, 8, 9)
AND MISSING(NINQ) OR (NINQ IS ONE OF 0, 1, 2, 3, 7, 9, 11, 12, 13, 14)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 1, 2, 3, 4)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
PREDICTED VALUE IS 1
PREDICTED 1 = 0.9032( 28/31)
PREDICTED 0 = 0.09677( 3/31)
*------------------------------------------------------------*
NODE = 9
*------------------------------------------------------------*
MISSING(DEBTINC) OR (DEBTINC <45.137782)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 0, 15)
PREDICTED VALUE IS 0
PREDICTED 1 = 0.1286( 488/3795)
PREDICTED 0 = 0.8714( 3307/3795)
```

The preceding listing includes each leaf of the tree, along with the proportion of training set observations that are in the region that is represented by the respective leaf. The predicted value is shown for each leaf, along with the fraction of that leaf’s observations that is in each of the target levels. The nodes are not numbered consecutively because the non-leaf nodes are not included.
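To inspect the rules file without leaving the SAS session, a DATA step can echo it to the log. The following is a minimal sketch. It assumes that the relative path `hpsplhme2-rules.txt` resolves to the directory where PROC HPSPLIT wrote the file, which depends on how the SAS session was started.

```sas
/* Echo each line of the rules file to the SAS log.            */
/* Assumption: the relative path resolves to the directory     */
/* where the RULES statement wrote the file.                   */
data _null_;
   infile 'hpsplhme2-rules.txt';
   input;             /* read one line into the input buffer */
   put _infile_;      /* write that line to the log          */
run;
```

If the file is not found, an absolute path in both the RULES statement and the INFILE statement removes the dependence on the working directory.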

The splits that lead to each leaf are shown above the predicted value and fractions. The same variable can be involved in more than one split. For example, the leaf labeled NODE = 2 covers the region where `DELINQ` is between 1 and 8 or between 10 and 13, and also the region where `DELINQ` is between 5 and 8 or between 10 and 13. In other words, the variable `DELINQ` is split twice in succession. By preserving multiple splits of the same variable rather than merging them, the rules description makes it possible to traverse the splits from the bottom of the tree to the top. At this leaf (node 2), the predicted value for `BAD` is 1 because the majority of the observations (94%, or 63 of 67) have value 1.

The SCORE statement saves scores for the observations in a SAS data set named `SCORED2`. Output 13.1.1 lists the first 10 observations of `SCORED2`.
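A listing of this kind can be produced with PROC PRINT. The following sketch assumes that `SCORED2` is still available in the WORK library from the preceding PROC HPSPLIT step. The OBS= data set option limits the listing to the first 10 observations.

```sas
/* List the first 10 observations of the scored data set. */
proc print data=scored2(obs=10);
run;
```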

Output 13.1.1: Scored Input Data Set (HMEQ)

The variables `_LEAF_` and `_NODE_` show the leaf to which the observation was assigned. The variables `P_BAD0` and `P_BAD1` are the proportions of observations in the training set that have `BAD`=0 and `BAD`=1, respectively, for that leaf. The variables `V_BAD0` and `V_BAD1` are the proportions of observations in the validation set that have `BAD`=0 and `BAD`=1, respectively, for that leaf. For information about the variables in the scored data set, see the section Outputs.
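One way to compare the training and validation proportions is to summarize them by leaf. The following PROC MEANS step is a sketch rather than part of the original example. It relies on the proportions being constant within a leaf, so the mean simply reproduces the per-leaf value, and the N column counts the observations in `SCORED2` that were assigned to each leaf.

```sas
/* Summarize the training (P_) and validation (V_) proportions */
/* of BAD=1 for each leaf. The values are constant within a    */
/* leaf, so MEAN returns the per-leaf proportion.              */
proc means data=scored2 n mean maxdec=4;
   class _LEAF_;
   var P_BAD1 V_BAD1;
run;
```

A leaf whose `V_BAD1` differs markedly from its `P_BAD1` might merit a closer look, because the training proportion for that leaf does not carry over well to the validation data.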