This example creates a tree model and saves a node rules representation of the model in a file. It uses the mortgage application data set HMEQ in the Sample Library, which is described in the Getting Started example in section Getting Started: HPSPLIT Procedure.
The following statements create the tree model:
proc hpsplit data=sampsio.hmeq maxdepth=7 maxbranch=2;
   target BAD;
   input DELINQ DEROG JOB NINQ REASON / level=nom;
   input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;
   criterion entropy;
   prune misc / N <= 6;
   partition fraction(validate=0.2);
   rules file='hpsplhme2-rules.txt';
   score out=scored2;
run;
The target variable (BAD) and input variables that are specified for the tree model are the same as in the Getting Started example. The criteria for growing and pruning the tree are also the same, except that pruning is stopped when the number of leaves is less than or equal to 6.
The RULES statement specifies a file named hpsplhme2-rules.txt, to which the node rules description of the model is saved. The following listing of this file shows that each leaf of the tree (labeled as a NODE) is numbered and described:
*------------------------------------------------------------*
NODE = 2
*------------------------------------------------------------*
(DELINQ IS ONE OF 5, 6, 7, 8, 10, 11, 12, 13)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
    PREDICTED VALUE IS 1
    PREDICTED 1 = 0.9403( 63/67)
    PREDICTED 0 = 0.0597( 4/67)
*------------------------------------------------------------*
NODE = 7
*------------------------------------------------------------*
MISSING(DEROG) OR (DEROG IS ONE OF 0, 1, 2, 7, 10)
AND MISSING(NINQ) OR (NINQ IS ONE OF 0, 1, 2, 3, 7, 9, 11, 12, 13, 14)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 1, 2, 3, 4)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.3283( 263/801)
    PREDICTED 0 = 0.6717( 538/801)
*------------------------------------------------------------*
NODE = 6
*------------------------------------------------------------*
(DEROG IS ONE OF 3, 4, 5, 6, 8, 9)
AND MISSING(NINQ) OR (NINQ IS ONE OF 0, 1, 2, 3, 7, 9, 11, 12, 13, 14)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 1, 2, 3, 4)
AND (DELINQ IS ONE OF 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13)
    PREDICTED VALUE IS 1
    PREDICTED 1 = 0.9032( 28/31)
    PREDICTED 0 = 0.09677( 3/31)
*------------------------------------------------------------*
NODE = 9
*------------------------------------------------------------*
MISSING(DEBTINC) OR (DEBTINC <45.137782)
AND MISSING(DELINQ) OR (DELINQ IS ONE OF 0, 15)
    PREDICTED VALUE IS 0
    PREDICTED 1 = 0.1286( 488/3795)
    PREDICTED 0 = 0.8714( 3307/3795)
The preceding listing includes each leaf of the tree, along with the proportion of training set observations that are in the region that is represented by the respective leaf. The predicted value is shown for each leaf, along with the fraction of that leaf’s observations that is in each of the target levels. The nodes are not numbered consecutively because the non-leaf nodes are not included.
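The rules file is plain text, so it can also be inspected from within the SAS session. The following DATA step is a minimal sketch of one way to do this, not a step from the original example. It assumes that the file is in the session's current working directory, which is the same relative path that the RULES statement uses.

```sas
/* Sketch: echo the rules file to the SAS log.                    */
/* Assumes hpsplhme2-rules.txt is in the current working directory. */
data _null_;
   infile 'hpsplhme2-rules.txt';
   input;            /* read one line into the input buffer */
   put _infile_;     /* write that line to the log          */
run;
```

If the file was written to a different location, the INFILE path would need to be adjusted to match.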
The splits that lead to each leaf are shown above the predicted value and fractions. The same variable can be involved in more than one split. For example, the leaf labeled "NODE = 2" is reached by a split that keeps the DELINQ values between 1 and 8 or between 10 and 13, followed by a split that keeps the DELINQ values between 5 and 8 or between 10 and 13. In other words, the variable DELINQ is split twice in succession. By preserving multiple splits of the same variable rather than merging them, the rules description makes it possible to traverse the splits from the bottom of the tree to the top. At this leaf (node 2), the predicted value for BAD is 1 because the majority of the observations (94%) have value 1.
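The rule for node 2 can be checked informally by applying both DELINQ conditions to the input data and tabulating BAD. The following step is a sketch under stated assumptions rather than part of the original example. It runs against the full SAMPSIO.HMEQ data set, which includes the observations that the PARTITION statement holds out for validation, so the counts are not expected to match the training counts (63/67) in the rules file exactly.

```sas
/* Sketch: apply both DELINQ splits that lead to node 2.           */
/* The second condition is implied by the first; it is kept here    */
/* only to mirror the two successive splits in the rules file.      */
proc freq data=sampsio.hmeq;
   where DELINQ in (5, 6, 7, 8, 10, 11, 12, 13)
     and DELINQ in (1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13);
   tables BAD;
run;
```

Because the first set of values is a subset of the second, the result is the same as filtering on the first condition alone. This is why the two splits could be merged, although the rules file keeps them separate.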
The SCORE statement saves scores for the observations in a SAS data set named SCORED2. Output 15.1.1 lists the first 10 observations of SCORED2.
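The statements that produced this output are not shown in this section. A PROC PRINT step along the following lines would produce a comparable listing. The variable list is an assumption based on the columns that appear in the output.

```sas
/* Sketch: list the first 10 scored observations.                 */
/* The VAR list is assumed from the columns shown in the output.   */
proc print data=scored2(obs=10);
   var BAD _NODE_ _LEAF_ P_BAD1 P_BAD0 V_BAD1 V_BAD0;
run;
```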
Output 15.1.1: Scored Input Data Set (HMEQ)
Obs | BAD | _NODE_ | _LEAF_ | P_BAD1 | P_BAD0 | V_BAD1 | V_BAD0 |
---|---|---|---|---|---|---|---|
1 | 1 | 9 | 4 | 0.50962 | 0.49038 | 0.44828 | 0.55172 |
2 | 1 | 4 | 1 | 0.36740 | 0.63260 | 0.34973 | 0.65027 |
3 | 1 | 9 | 4 | 0.50962 | 0.49038 | 0.44828 | 0.55172 |
4 | 1 | 9 | 4 | 0.50962 | 0.49038 | 0.44828 | 0.55172 |
5 | 0 | 9 | 4 | 0.50962 | 0.49038 | 0.44828 | 0.55172 |
6 | 1 | 9 | 4 | 0.50962 | 0.49038 | 0.44828 | 0.55172 |
7 | 1 | 3 | 0 | 0.70485 | 0.29515 | 0.65672 | 0.34328 |
8 | 1 | 9 | 4 | 0.50962 | 0.49038 | 0.44828 | 0.55172 |
9 | 1 | 4 | 1 | 0.36740 | 0.63260 | 0.34973 | 0.65027 |
10 | 1 | 9 | 4 | 0.50962 | 0.49038 | 0.44828 | 0.55172 |
The variables _LEAF_ and _NODE_ show the leaf to which the observation was assigned. The variables P_BAD0 and P_BAD1 are the proportions of observations in the training set that have BAD=0 and BAD=1 for that leaf. The variables V_BAD0 and V_BAD1 are the proportions of observations in the validation set that have BAD=0 and BAD=1 for that leaf. For information about the variables in the scored data set, see the section Outputs.
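Because these proportions are constant within a leaf, the scored data set can be summarized with one row per leaf. The following PROC SQL step is a minimal sketch, assuming that SCORED2 contains the variables described above and that BAD is a numeric 0/1 variable. The column names n_obs and observed_bad_rate are hypothetical and are introduced only for illustration.

```sas
/* Sketch: one summary row per leaf in the scored data set.        */
/* observed_bad_rate pools all scored observations, so it need not  */
/* equal either P_BAD1 (training) or V_BAD1 (validation).           */
proc sql;
   select _NODE_,
          _LEAF_,
          count(*)    as n_obs,
          mean(BAD)   as observed_bad_rate,
          min(P_BAD1) as P_BAD1,
          min(V_BAD1) as V_BAD1
      from scored2
      group by _NODE_, _LEAF_
      order by _LEAF_;
quit;
```

MIN is used only to select the single value that P_BAD1 and V_BAD1 take within each leaf. MAX would return the same result.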