This example creates a tree model and saves an English rules representation of the model in a file. It uses the mortgage application data set HMEQ in the Sample Library, which is described in the Getting Started example in section Getting Started: HPSPLIT Procedure.
The following statements create the tree model.
    proc hpsplit data=sashelp.hmeq maxdepth=7 maxbranch=2;
       target BAD;
       input DELINQ DEROG JOB NINQ REASON / level=nom;
       input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;
       criterion entropy;
       prune misc / N <= 6;
       partition fraction(validate=0.2);
       rules file='hpsplhme2-rules.txt';
       score out=scored2;
    run;
The target variable (BAD) and input variables that are specified for the tree model are the same as in the Getting Started example. The criteria for growing and pruning the tree are also the same, except that pruning is stopped when the number of leaves is less than or equal to 6.
The RULES statement specifies a file named hpsplhme2-rules.txt, to which the English rules description of the model is saved. A listing of this file shows that each leaf of the tree (labeled as a “NODE”) is numbered and described:
    *------------------------------------------------------------*
    NODE = 2
    *------------------------------------------------------------*
    DELINQ IS ONE OF 5, 6, 7, 8, 10, 11, 12, 13, 15
    AND DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15
    PREDICTED VALUE IS 1
    PREDICTED 1 = 0.9342( 71/76)
    PREDICTED 0 = 0.06579( 5/76)
    *------------------------------------------------------------*
    NODE = 4
    *------------------------------------------------------------*
    NINQ IS ONE OF 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 17
    AND DELINQ IS ONE OF MISSING, 1, 2, 3, 4
    AND DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15
    PREDICTED VALUE IS 1
    PREDICTED 1 = 0.8714( 61/70)
    PREDICTED 0 = 0.1286( 9/70)
    *------------------------------------------------------------*
    NODE = 5
    *------------------------------------------------------------*
    NINQ IS ONE OF MISSING, 0, 1, 2, 3, 7
    AND DELINQ IS ONE OF MISSING, 1, 2, 3, 4
    AND DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.3682( 306/831)
    PREDICTED 0 = 0.6318( 525/831)
    *------------------------------------------------------------*
    NODE = 8
    *------------------------------------------------------------*
    DEBTINC IS MISSING OR DEBTINC <45.137782
    AND CLAGE IS MISSING OR CLAGE <186.91737
    AND DELINQ IS ONE OF MISSING, 0
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.1739( 392/2254)
    PREDICTED 0 = 0.8261( 1862/2254)
    *------------------------------------------------------------*
    NODE = 9
    *------------------------------------------------------------*
    DEBTINC >=45.137782
    AND CLAGE IS MISSING OR CLAGE <186.91737
    AND DELINQ IS ONE OF MISSING, 0
    PREDICTED VALUE IS 1
    PREDICTED 1 = 1( 37/37)
    PREDICTED 0 = 0( 0/37)
    *------------------------------------------------------------*
    NODE = 10
    *------------------------------------------------------------*
    CLAGE >=186.91737
    AND DELINQ IS ONE OF MISSING, 0
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.0589( 90/1528)
    PREDICTED 0 = 0.9411( 1438/1528)
The listing includes each leaf of the tree, along with the proportion of training set observations that are in the region that is represented by the respective leaf. The predicted value is shown for each leaf, along with the fraction of that leaf’s observations that is in each of the target levels. The nodes are not numbered consecutively in the order 1, 2, 3, and so on, because the non-leaf nodes are not included.
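One way to view the rules file from within a SAS session is to echo it to the log with a DATA step. This is a minimal sketch. It assumes that the file was written to the current working directory under the name given in the RULES statement; adjust the path if your session writes files elsewhere.

```sas
/* Sketch: copy each line of the rules file to the SAS log.          */
/* Assumes hpsplhme2-rules.txt is in the current working directory.  */
data _null_;
   infile 'hpsplhme2-rules.txt';
   input;           /* read one record into the input buffer */
   put _infile_;    /* write that record to the log unchanged */
run;
```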
The splits that lead to each leaf are shown above the predicted value and fractions. The same variable can be involved in more than one split. For instance, the leaf labeled “NODE = 2” covers the region where DELINQ is between 1 and 8, between 10 and 13, or is equal to 15, and the region where DELINQ is also between 5 and 8, between 10 and 13, or is equal to 15. In other words, the variable DELINQ is split twice in succession.
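The two successive DELINQ conditions for the leaf labeled “NODE = 2” can be written as a WHERE expression whose clauses are joined by AND, which makes it easy to see that the second split is a subset of the first. The following sketch counts the levels of BAD in that region. It runs against the full input data set rather than only the training partition, so the counts are not expected to match the 71/76 and 5/76 shown in the listing.

```sas
/* Sketch: observations that satisfy both DELINQ splits on the path  */
/* to NODE = 2. Counts include training and validation observations. */
proc freq data=sashelp.hmeq;
   where DELINQ in (1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15)   /* first split  */
     and DELINQ in (5, 6, 7, 8, 10, 11, 12, 13, 15);              /* second split */
   tables BAD;
run;
```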
By preserving multiple splits of the same variable rather than merging them, the rules description makes it possible to traverse the splits from the bottom of the tree to the top. For the leaf labeled “NODE = 10”, the data set was first split on DELINQ, and the subset with DELINQ=0 or with a missing value for DELINQ was split again on CLAGE, with those observations having CLAGE>=186.91737 going to node 10. At this leaf (node 10), the predicted value for BAD is 0 because the majority of the observations (94%) have value 0.
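The majority rule described above can be reproduced from the scored data set by comparing the two training proportions for each observation. This is an illustrative sketch only. The data set name CLASSIFIED2 and the variable PREDICTED_BAD are hypothetical names introduced here, and sending ties to 0 is an assumption, not documented behavior of the procedure.

```sas
/* Sketch: derive a predicted class from the leaf proportions.       */
/* CLASSIFIED2 and PREDICTED_BAD are hypothetical names.             */
data classified2;
   set scored2;
   /* 1 if BAD=1 is the majority level in the leaf, otherwise 0;     */
   /* ties go to 0 (an assumption)                                   */
   PREDICTED_BAD = (P_BAD1 > P_BAD0);
run;

/* One row per leaf, showing the predicted class for each node */
proc freq data=classified2;
   tables _NODE_*PREDICTED_BAD / norow nocol nopercent;
run;
```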
The SCORE statement saves scores for the observations in a SAS data set named SCORED2. Output 9.1.1 lists the first ten observations of SCORED2.
Output 9.1.1: Scored Input Data Set (HMEQ)
Obs | BAD | _NODE_ | _LEAF_ | P_BAD1 | P_BAD0 | V_BAD1 | V_BAD0 |
---|---|---|---|---|---|---|---|
1 | 1 | 8 | 3 | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
2 | 1 | 5 | 2 | 0.36823 | 0.63177 | 0.36126 | 0.63874 |
3 | 1 | 8 | 3 | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
4 | 1 | 8 | 3 | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
5 | 0 | 8 | 3 | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
6 | 1 | 8 | 3 | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
7 | 1 | 5 | 2 | 0.36823 | 0.63177 | 0.36126 | 0.63874 |
8 | 1 | 8 | 3 | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
9 | 1 | 5 | 2 | 0.36823 | 0.63177 | 0.36126 | 0.63874 |
10 | 1 | 8 | 3 | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
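The source does not show the statements that produced this listing. A PROC PRINT step along the following lines is one plausible way to obtain it. The VAR statement fixes the column order to match the table.

```sas
/* Sketch: print the first ten scored observations */
proc print data=scored2(obs=10);
   var BAD _NODE_ _LEAF_ P_BAD1 P_BAD0 V_BAD1 V_BAD0;
run;
```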
The variables _LEAF_ and _NODE_ show the leaf to which the observation was assigned. The variables P_BAD0 and P_BAD1 are the proportions of observations in the training set that have BAD=0 and BAD=1, respectively, for that leaf. The variables V_BAD0 and V_BAD1 are the proportions of observations in the validation set that have BAD=0 and BAD=1, respectively, for that leaf. For information about the variables in the scored data set, see the section Outputs.
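Because the P_ and V_ variables are constant within a leaf, keeping one observation per node gives a compact comparison of training and validation proportions. The following sketch does this with the NODUPKEY option of PROC SORT. The output data set name LEAF_SUMMARY is a hypothetical name chosen for illustration.

```sas
/* Sketch: one row per leaf, comparing training (P_) and validation  */
/* (V_) proportions. LEAF_SUMMARY is a hypothetical data set name.   */
proc sort data=scored2 out=leaf_summary nodupkey;
   by _NODE_;       /* keep the first observation for each node */
run;

proc print data=leaf_summary noobs;
   var _NODE_ _LEAF_ P_BAD1 P_BAD0 V_BAD1 V_BAD0;
run;
```

A large gap between a leaf's P_ and V_ values suggests that the leaf's training proportions might not generalize well.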