The HPSPLIT Procedure

Example 13.1 Creating a Node Rules Description of a Tree

This example creates a tree model and saves a node rules representation of the model in a file. It uses the mortgage application data set HMEQ in the Sample Library, which is described in the Getting Started example in the section Getting Started: HPSPLIT Procedure.

The following statements create the tree model:

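/* Grow a binary tree (maxbranch=2) to a depth of at most 7 */
proc hpsplit data=sampsio.hmeq maxdepth=7 maxbranch=2;
   target BAD;
   input DELINQ DEROG JOB NINQ REASON / level=nom;
   input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;
   criterion entropy;                  /* grow the tree by using entropy */
   prune misc / N <= 6;                /* stop pruning at 6 or fewer leaves */
   partition fraction(validate=0.2);   /* hold out 20% for validation */
   rules file='hpsplhme2-rules.txt';   /* save the node rules description */
   score out=scored2;                  /* save the scored observations */
run;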

The target variable (BAD) and input variables that are specified for the tree model are the same as in the Getting Started example. The criteria for growing and pruning the tree are also the same, except that pruning is stopped when the number of leaves is less than or equal to 6.

The RULES statement specifies a file named hpsplhme2-rules.txt, to which the node rules description of the model is saved. The following listing of this file shows that each leaf of the tree (labeled as a NODE) is numbered and described:

*------------------------------------------------------------*
NODE = 2
*------------------------------------------------------------*
(DELINQ IS ONE OF 5, 6, 7, 8, 10, 11, 12, 13)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
    PREDICTED VALUE IS 1
    PREDICTED 1 = 0.9403( 63/67)
    PREDICTED 0 = 0.0597( 4/67)
*------------------------------------------------------------*
NODE = 7
*------------------------------------------------------------*
MISSING(DEROG) OR (DEROG IS ONE OF 0, 1, 2, 7, 10)
AND MISSING(NINQ) OR (NINQ IS ONE OF 0, 1, 2, 3, 7, 9, 11, 12, 13, 14)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 1, 2, 3, 4)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.3283( 263/801)
    PREDICTED 0 = 0.6717( 538/801)
*------------------------------------------------------------*
NODE = 6
*------------------------------------------------------------*
(DEROG IS ONE OF 3, 4, 5, 6, 8, 9)
AND MISSING(NINQ) OR (NINQ IS ONE OF 0, 1, 2, 3, 7, 9, 11, 12, 13, 14)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 1, 2, 3, 4)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
    PREDICTED VALUE IS 1
    PREDICTED 1 = 0.9032( 28/31)
    PREDICTED 0 = 0.09677( 3/31)
*------------------------------------------------------------*
NODE = 9
*------------------------------------------------------------*
MISSING(DEBTINC) OR (DEBTINC <45.137782)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 0, 15)
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.1286( 488/3795)
    PREDICTED 0 = 0.8714( 3307/3795)

The preceding listing includes each leaf of the tree, along with the proportion of training set observations that are in the region that is represented by the respective leaf. The predicted value is shown for each leaf, along with the fraction of that leaf’s observations that is in each of the target levels. The nodes are not numbered consecutively because the non-leaf nodes are not included.
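As a quick check, the saved rules file can be echoed to the SAS log with a DATA step. The following is a minimal sketch that is not part of the original example; it assumes that the relative path resolves to the same working directory to which PROC HPSPLIT wrote the file.

```sas
/* Echo each line of the rules file to the SAS log.             */
/* Assumption: the file is in the current working directory.    */
data _null_;
   infile 'hpsplhme2-rules.txt';
   input;           /* read one line into the input buffer      */
   put _infile_;    /* write the buffer contents to the log     */
run;
```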

The splits that lead to each leaf are shown above the predicted value and fractions. The same variable can be involved in more than one split. For example, the leaf labeled NODE = 2 covers the region where DELINQ is between 1 and 8 or between 10 and 13, and also the region where DELINQ is between 5 and 8 or between 10 and 13. In other words, the variable DELINQ is split twice in succession. By preserving multiple splits of the same variable rather than merging them, the rules description makes it possible to traverse the splits from the bottom of the tree to the top. At this leaf (node 2), the predicted value for BAD is 1 because the majority of the observations (94%, or 63 of 67) have value 1.
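To make the region for NODE = 2 concrete, the two DELINQ conditions from the listing can be applied directly in a DATA step. The following is an illustrative sketch only; the data set name NODE2_REGION is hypothetical. Because the sketch filters all of HMEQ rather than only the training partition, the resulting counts are not expected to match the 63/67 that is shown in the listing.

```sas
/* Keep the observations that satisfy both splits leading to NODE = 2. */
/* NODE2_REGION is a hypothetical name used only for illustration.     */
data node2_region;
   set sampsio.hmeq;
   if DELINQ in (1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
      and DELINQ in (5, 6, 7, 8, 10, 11, 12, 13);
run;

/* Tabulate the target levels within that region. */
proc freq data=node2_region;
   tables BAD;
run;
```

Because the shorter value list is a subset of the longer one, the combined condition reduces to the shorter list alone. Observations that have a missing value of DELINQ are excluded, because neither rule for this leaf includes a MISSING clause.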

The SCORE statement saves scores for the observations in a SAS data set named SCORED2. Output 13.1.1 lists the first 10 observations of SCORED2.

Output 13.1.1: Scored Input Data Set (HMEQ)

Obs  BAD  _NODE_  _LEAF_   P_BAD1   P_BAD0   V_BAD1   V_BAD0
  1    1       8       3  0.18694  0.81306  0.17913  0.82087
  2    1       5       2  0.38040  0.61960  0.38191  0.61809
  3    1       8       3  0.18694  0.81306  0.17913  0.82087
  4    1       8       3  0.18694  0.81306  0.17913  0.82087
  5    0       8       3  0.18694  0.81306  0.17913  0.82087
  6    1       8       3  0.18694  0.81306  0.17913  0.82087
  7    1       5       2  0.38040  0.61960  0.38191  0.61809
  8    1       8       3  0.18694  0.81306  0.17913  0.82087
  9    1       5       2  0.38040  0.61960  0.38191  0.61809
 10    1       8       3  0.18694  0.81306  0.17913  0.82087


The variables _LEAF_ and _NODE_ show the leaf to which the observation was assigned. The variables P_BAD0 and P_BAD1 are the proportions of observations in the training set that have BAD=0 and BAD=1, respectively, for that leaf. The variables V_BAD0 and V_BAD1 are the proportions of observations in the validation set that have BAD=0 and BAD=1, respectively, for that leaf. For information about the variables in the scored data set, see the section Outputs.
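A listing of the first 10 scored observations could be produced with PROC PRINT, as in the following sketch. It assumes that SCORED2 is still available in the WORK library. The PROC FREQ step is an added illustration that is not part of the original example; it cross-tabulates the leaf assignment against the observed target.

```sas
/* List the first 10 observations of the scored data set. */
proc print data=scored2(obs=10);
   var BAD _NODE_ _LEAF_ P_BAD1 P_BAD0 V_BAD1 V_BAD0;
run;

/* Added illustration: observed BAD levels within each leaf. */
proc freq data=scored2;
   tables _LEAF_*BAD / nocol nopercent;
run;
```

The row percentages from PROC FREQ are computed over all scored observations, so they need not equal the training (P_) or validation (V_) proportions for a leaf.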