The HPSPLIT Procedure

Example 9.1 Creating an English Rules Description of a Tree

This example creates a tree model and saves an English rules representation of the model in a file. It uses the mortgage application data set HMEQ in the Sample Library, which is described in the Getting Started example in the section Getting Started: HPSPLIT Procedure.

The following statements create the tree model.

proc hpsplit data=sashelp.hmeq maxdepth=7 maxbranch=2;
  target BAD;
  input DELINQ DEROG JOB NINQ REASON / level=nom;
  input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;
  criterion entropy;
  prune misc / N <= 6;
  partition fraction(validate=0.2);
  rules file='hpsplhme2-rules.txt';
  score out=scored2;
run;

The target variable (BAD) and input variables that are specified for the tree model are the same as in the Getting Started example. The criteria for growing and pruning the tree are also the same, except that pruning is stopped when the number of leaves is less than or equal to 6.

The RULES statement specifies a file named hpsplhme2-rules.txt, to which the English rules description of the model is saved. A listing of this file shows that each leaf of the tree (labeled as a NODE) is numbered and described:

*------------------------------------------------------------*
NODE = 2
*------------------------------------------------------------*
DELINQ IS ONE OF 5, 6, 7, 8, 10, 11, 12, 13, 15
AND DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15
    PREDICTED VALUE IS 1
    PREDICTED 1 = 0.9342( 71/76)
    PREDICTED 0 = 0.06579( 5/76)
*------------------------------------------------------------*
NODE = 4
*------------------------------------------------------------*
NINQ IS ONE OF 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 17
AND DELINQ IS ONE OF MISSING, 1, 2, 3, 4
AND DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15
    PREDICTED VALUE IS 1
    PREDICTED 1 = 0.8714( 61/70)
    PREDICTED 0 = 0.1286( 9/70)
*------------------------------------------------------------*
NODE = 5
*------------------------------------------------------------*
NINQ IS ONE OF MISSING, 0, 1, 2, 3, 7
AND DELINQ IS ONE OF MISSING, 1, 2, 3, 4
AND DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.3682( 306/831)
    PREDICTED 0 = 0.6318( 525/831)
*------------------------------------------------------------*
NODE = 8
*------------------------------------------------------------*
DEBTINC IS MISSING OR DEBTINC <45.137782
AND CLAGE IS MISSING OR CLAGE <186.91737
AND DELINQ IS ONE OF MISSING, 0
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.1739( 392/2254)
    PREDICTED 0 = 0.8261( 1862/2254)
*------------------------------------------------------------*
NODE = 9
*------------------------------------------------------------*
DEBTINC >=45.137782
AND CLAGE IS MISSING OR CLAGE <186.91737
AND DELINQ IS ONE OF MISSING, 0
    PREDICTED VALUE IS 1
    PREDICTED 1 = 1( 37/37)
    PREDICTED 0 = 0( 0/37)
*------------------------------------------------------------*
NODE = 10
*------------------------------------------------------------*
CLAGE >=186.91737
AND DELINQ IS ONE OF MISSING, 0
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.0589( 90/1528)
    PREDICTED 0 = 0.9411( 1438/1528)

The listing includes each leaf of the tree, along with the number of training set observations that fall in the region represented by that leaf (the denominator of each fraction). The predicted value is shown for each leaf, along with the fraction of that leaf’s observations that is in each of the target levels. The node numbers are not consecutive (1, 2, 3, and so on) because the non-leaf nodes are not included in the listing.
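
One way to inspect the rules file without leaving the SAS session is to echo it to the log. The following DATA step is a minimal sketch rather than part of the original example; it assumes that hpsplhme2-rules.txt is in the current working directory of the SAS session, so adjust the path if the file was written elsewhere.

```sas
/* Sketch: echo the rules file to the SAS log.            */
/* Assumes the file is in the current working directory.  */
data _null_;
  infile 'hpsplhme2-rules.txt';
  input;          /* read one line into the input buffer */
  put _infile_;   /* write that line to the log          */
run;
```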

The splits that lead to each leaf are shown above the predicted value and fractions. The same variable can be involved in more than one split. For instance, the leaf labeled NODE = 2 covers the observations for which DELINQ is between 1 and 8, between 10 and 13, or equal to 15 and, within that region, for which DELINQ is also between 5 and 8, between 10 and 13, or equal to 15. In other words, the variable DELINQ is split twice in succession.
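
Because both conditions for NODE = 2 apply to the same variable, the region for that leaf is the intersection of the two value lists, which here is simply the shorter list. The following DATA step is an illustrative check of that observation and is not part of the original example; the variable names first_split, second_split, and in_node2 are hypothetical.

```sas
/* Sketch: confirm that the two DELINQ conditions for NODE = 2 */
/* reduce to the shorter value list.                           */
data _null_;
  do DELINQ = 0 to 15;
    /* Split nearer the root (listed last in the rules file) */
    first_split  = DELINQ in (1,2,3,4,5,6,7,8,10,11,12,13,15);
    /* Split nearer the leaf (listed first in the rules file) */
    second_split = DELINQ in (5,6,7,8,10,11,12,13,15);
    in_node2 = first_split and second_split;
    /* Report any value for which the intersection differs */
    /* from the shorter list.                              */
    if in_node2 ne second_split then put 'Mismatch at ' DELINQ=;
  end;
run;
```

Every value in the shorter list also appears in the longer list, so no mismatch message should be written to the log.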

By preserving multiple splits of the same variable rather than merging them, the rules description makes it possible to traverse the splits from the bottom of the tree to the top. For the leaf labeled NODE = 10, the data set was first split on DELINQ, and the subset in which DELINQ was 0 or missing was split again on CLAGE, with the observations that have CLAGE >=186.91737 going to node 10. At this leaf (node 10), the predicted value for BAD is 0 because the majority of the observations (94%) have the value 0.
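
The rule for NODE = 10 can be applied directly to the input data to see which observations fall in that region. The following statements are a sketch under stated assumptions: NODE10_REGION is a hypothetical data set name, and the input data set is the one named in the PROC HPSPLIT statement above. The subset is drawn from all observations rather than from the training partition alone, so its counts are not expected to match the 1528 training observations reported in the rules file.

```sas
/* Sketch: select the observations that satisfy the rule for NODE = 10. */
/* NODE10_REGION is a hypothetical output data set name.                */
data node10_region;
  set sashelp.hmeq;
  /* First split:  DELINQ is missing or 0.                     */
  /* Second split: CLAGE >= 186.91737. A missing CLAGE fails   */
  /* this comparison because SAS orders missing numeric values */
  /* below every number.                                       */
  if (missing(DELINQ) or DELINQ = 0) and CLAGE >= 186.91737;
run;

/* Tabulate the target within the selected region. */
proc freq data=node10_region;
  tables BAD;
run;
```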

The SCORE statement saves scores for the observations in a SAS data set named SCORED2. Output 9.1.1 lists the first ten observations of SCORED2.
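
The statements that produce the listing are not shown in this section. A PROC PRINT step such as the following sketch would display the first ten observations; the VAR list is an assumption chosen to match the column order of the listing.

```sas
/* Sketch: list the first ten scored observations. */
proc print data=scored2(obs=10);
  var BAD _NODE_ _LEAF_ P_BAD1 P_BAD0 V_BAD1 V_BAD0;
run;
```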

Output 9.1.1: Scored Input Data Set (HMEQ)

Obs  BAD  _NODE_  _LEAF_  P_BAD1   P_BAD0   V_BAD1   V_BAD0
  1    1       8       3  0.17391  0.82609  0.18808  0.81192
  2    1       5       2  0.36823  0.63177  0.36126  0.63874
  3    1       8       3  0.17391  0.82609  0.18808  0.81192
  4    1       8       3  0.17391  0.82609  0.18808  0.81192
  5    0       8       3  0.17391  0.82609  0.18808  0.81192
  6    1       8       3  0.17391  0.82609  0.18808  0.81192
  7    1       5       2  0.36823  0.63177  0.36126  0.63874
  8    1       8       3  0.17391  0.82609  0.18808  0.81192
  9    1       5       2  0.36823  0.63177  0.36126  0.63874
 10    1       8       3  0.17391  0.82609  0.18808  0.81192


The variables _LEAF_ and _NODE_ show the leaf to which the observation was assigned; the value of _NODE_ corresponds to the NODE number in the rules file. The variables P_BAD0 and P_BAD1 are the proportions of observations in the training set that have BAD=0 and BAD=1, respectively, for that leaf. The variables V_BAD0 and V_BAD1 are the proportions of observations in the validation set that have BAD=0 and BAD=1, respectively, for that leaf. For information about the variables in the scored data set, see the section Outputs.
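
To relate the leaf assignments to the observed target values, the scored data set can be cross-tabulated. The following PROC FREQ step is a sketch, not part of the original example. It assumes that SCORED2 contains both the training and the validation observations; if so, the row percentages combine the two partitions and are not expected to equal the P_ or V_ values exactly.

```sas
/* Sketch: cross-tabulate leaf assignment against the observed target. */
/* NOCOL and NOPERCENT suppress column and overall percentages, so     */
/* each cell shows the count and the row percentage for its node.      */
proc freq data=scored2;
  tables _NODE_ * BAD / nocol nopercent;
run;
```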