Decision trees are commonly used in banking to predict default in mortgage applications. The data set HMEQ, which is in the sample library, contains observations for 5,960 mortgage applicants. A variable named BAD indicates whether the applicant paid or defaulted on the loan.
This example uses HMEQ to build a tree model that is used to score the data and can be used to score data on new applicants. Table 9.1 describes the variables in HMEQ.
Table 9.1: Variables in the Home Equity (HMEQ) Data Set
Variable | Role | Level | Description |
---|---|---|---|
BAD | Target | Binary | 1 = applicant defaulted on the loan or is seriously delinquent; 0 = applicant paid the loan |
CLAGE | Input | Interval | Age of oldest credit line in months |
CLNO | Input | Interval | Number of credit lines |
DEBTINC | Input | Interval | Debt-to-income ratio |
DELINQ | Input | Interval | Number of delinquent credit lines |
DEROG | Input | Interval | Number of major derogatory reports |
JOB | Input | Nominal | Occupational category |
LOAN | Input | Interval | Requested loan amount |
MORTDUE | Input | Interval | Amount due on existing mortgage |
NINQ | Input | Interval | Number of recent credit inquiries |
REASON | Input | Binary | DebtCon = debt consolidation; HomeImp = home improvement |
VALUE | Input | Interval | Value of current property |
YOJ | Input | Interval | Years at present job |
Figure 9.1 shows a partial listing of HMEQ.
Figure 9.1: Partial Listing of the HMEQ Data
Obs | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1100 | 25860 | 39025 | HomeImp | Other | 10.5 | 0 | 0 | 94.367 | 1 | 9 | . |
2 | 1 | 1300 | 70053 | 68400 | HomeImp | Other | 7.0 | 0 | 2 | 121.833 | 0 | 14 | . |
3 | 1 | 1500 | 13500 | 16700 | HomeImp | Other | 4.0 | 0 | 0 | 149.467 | 1 | 10 | . |
4 | 1 | 1500 | . | . | | | . | . | . | . | . | . | . |
5 | 0 | 1700 | 97800 | 112000 | HomeImp | Office | 3.0 | 0 | 0 | 93.333 | 0 | 14 | . |
6 | 1 | 1700 | 30548 | 40320 | HomeImp | Other | 9.0 | 0 | 0 | 101.466 | 1 | 8 | 37.1136 |
7 | 1 | 1800 | 48649 | 57037 | HomeImp | Other | 5.0 | 3 | 2 | 77.100 | 1 | 17 | . |
8 | 1 | 1800 | 28502 | 43034 | HomeImp | Other | 11.0 | 0 | 0 | 88.766 | 0 | 8 | 36.8849 |
9 | 1 | 2000 | 32700 | 46740 | HomeImp | Other | 3.0 | 0 | 2 | 216.933 | 1 | 12 | . |
10 | 1 | 2000 | . | 62250 | HomeImp | Sales | 16.0 | 0 | 0 | 115.800 | 0 | 13 | . |
The target variable for the tree model is BAD, a nominal variable that has two values (0 indicates payment, and 1 indicates default). The other variables are input variables for the model.
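A quick look at the data before fitting can confirm the listing and the balance of the target. The following is a minimal sketch that assumes the data set is available as sashelp.hmeq, the name used in the PROC HPSPLIT call in this example; the OBS=10 limit is an illustrative choice that mirrors the partial listing.

```sas
/* List the first 10 observations (OBS=10 is an illustrative choice). */
proc print data=sashelp.hmeq(obs=10);
run;

/* Tabulate the binary target; MISSING also counts any missing BAD values. */
proc freq data=sashelp.hmeq;
   tables BAD / missing;
run;
```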
The following statements use the HPSPLIT procedure to create a decision tree and an output file that contains SAS DATA step code for predicting the probability of default:

    proc hpsplit data=sashelp.hmeq maxdepth=7 maxbranch=2;
       target BAD;
       input DELINQ DEROG JOB NINQ REASON / level=nom;
       input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;
       prune misc / N <= 10;
       partition fraction(validate=0.2);
       code file='hpsplhme-code.sas';
    run;

The TARGET statement specifies the target variable, and the INPUT statements specify the input variables and their levels. The MAXDEPTH= option specifies the maximum depth of the tree to be grown, and the MAXBRANCH= option specifies the maximum number of children per node.
By default, the entropy metric is used to grow the tree. The PRUNE statement requests the misclassification rate metric for choosing a node to prune back to a leaf. The option N<=10 stops the pruning when the number of leaves is less than or equal to 10.
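To make the entropy criterion concrete, the following DATA step computes the entropy of a single node for a binary target, using the standard definition: the negative sum of p*log2(p) over the target levels. It is a sketch of that definition only, not the procedure's internal code, and the proportion p1 = 0.2 is a hypothetical value rather than a figure taken from HMEQ.

```sas
data _null_;
   p1 = 0.2;        /* hypothetical proportion of BAD=1 in a node */
   p0 = 1 - p1;     /* proportion of BAD=0 in the same node */
   /* Entropy in bits; both proportions must be strictly between 0 and 1 here */
   entropy = -(p1*log2(p1) + p0*log2(p0));
   put entropy=;    /* writes the value to the SAS log */
run;
```

For these values the entropy is roughly 0.72. Under the usual convention that 0*log2(0) is 0, a pure node has entropy 0 and an evenly split binary node has entropy 1, so splits that reduce entropy produce purer children.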
The PARTITION statement specifies the probability (0.2) of randomly selecting a given observation in HMEQ for validation; the remaining observations are used for training.
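The idea behind the PARTITION statement can be pictured with a DATA step that assigns each observation to validation with probability 0.2. This is only an illustration of the concept, not the algorithm that PROC HPSPLIT uses; the seed, the output data set name hmeq_roles, and the variable name part_role are hypothetical.

```sas
data hmeq_roles;
   set sashelp.hmeq;
   length part_role $ 8;
   /* Hypothetical seed, set once so that the assignment is reproducible */
   if _n_ = 1 then call streaminit(12345);
   /* Each observation is selected for validation independently with probability 0.2 */
   if rand('uniform') < 0.2 then part_role = 'VALIDATE';
   else part_role = 'TRAIN';
run;

/* Check the resulting split */
proc freq data=hmeq_roles;
   tables part_role;
run;
```

Because each observation is selected independently, the validation share is typically close to, but not exactly, 20%.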
The CODE statement specifies a file named hpsplhme-code.sas, to which SAS DATA step code for scoring is saved.
The following statements score the data in HMEQ and save the results in a SAS data set named SCORED.

    data scored;
       set sashelp.hmeq;
       %include 'hpsplhme-code.sas';
    run;

A partial listing of SCORED is shown in Figure 9.2.
Figure 9.2: Partial Listing of the Scored HMEQ Data
Obs | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | _NODE_ | _LEAF_ | _WARN_ | P_BAD1 | P_BAD0 | V_BAD1 | V_BAD0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1100 | 25860 | 39025 | HomeImp | Other | 10.5 | 0 | 0 | 94.367 | 1 | 9 | . | 16 | 7 | | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
2 | 1 | 1300 | 70053 | 68400 | HomeImp | Other | 7.0 | 0 | 2 | 121.833 | 0 | 14 | . | 13 | 6 | | 0.29969 | 0.70031 | 0.32450 | 0.67550 |
3 | 1 | 1500 | 13500 | 16700 | HomeImp | Other | 4.0 | 0 | 0 | 149.467 | 1 | 10 | . | 16 | 7 | | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
4 | 1 | 1500 | . | . | | | . | . | . | . | . | . | . | 16 | 7 | | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
5 | 0 | 1700 | 97800 | 112000 | HomeImp | Office | 3.0 | 0 | 0 | 93.333 | 0 | 14 | . | 16 | 7 | | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
6 | 1 | 1700 | 30548 | 40320 | HomeImp | Other | 9.0 | 0 | 0 | 101.466 | 1 | 8 | 37.1136 | 16 | 7 | | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
7 | 1 | 1800 | 48649 | 57037 | HomeImp | Other | 5.0 | 3 | 2 | 77.100 | 1 | 17 | . | 6 | 2 | | 0.93939 | 0.06061 | 0.87500 | 0.12500 |
8 | 1 | 1800 | 28502 | 43034 | HomeImp | Other | 11.0 | 0 | 0 | 88.766 | 0 | 8 | 36.8849 | 16 | 7 | | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
9 | 1 | 2000 | 32700 | 46740 | HomeImp | Other | 3.0 | 0 | 2 | 216.933 | 1 | 12 | . | 13 | 6 | | 0.29969 | 0.70031 | 0.32450 | 0.67550 |
10 | 1 | 2000 | . | 62250 | HomeImp | Sales | 16.0 | 0 | 0 | 115.800 | 0 | 13 | . | 16 | 7 | | 0.17391 | 0.82609 | 0.18808 | 0.81192 |
The data set contains the original variables and new variables that are created by the scoring statements. The variable P_BAD1 is the proportion of training observations at this leaf that have BAD=1, and this variable can be interpreted as the probability of default. The variable V_BAD1 is the proportion of validation observations at this leaf that have BAD=1. The other new variables are described in the section Outputs.
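One way to use P_BAD1 is to turn it into a predicted class and compare it with the observed target. The following sketch assumes the SCORED data set created earlier in this example; the 0.5 cutoff and the names CLASSIFIED and pred_bad are hypothetical choices, and a different cutoff might suit a given application better.

```sas
data classified;
   set scored;
   /* 1 if the predicted probability of default meets the cutoff, 0 otherwise */
   pred_bad = (P_BAD1 >= 0.5);
run;

/* Cross-tabulate observed and predicted classes */
proc freq data=classified;
   tables BAD*pred_bad / norow nocol;
run;
```

The off-diagonal cells of the resulting table count the misclassified applicants. Because SCORED contains both training and validation observations, these counts are not an independent estimate of error.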
The preceding statements can be used to score new data by specifying the new data set in the SET statement in place of HMEQ. The new data set must contain the same variables as the data that are used to build the tree model.
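As a sketch of that substitution, the following DATA step scores a new data set with the saved scoring code. The names work.new_applicants and scored_new are hypothetical; the input data set is assumed to contain the same input variables, with the same names and types, as HMEQ.

```sas
data scored_new;
   set work.new_applicants;          /* hypothetical data set of new applicants */
   %include 'hpsplhme-code.sas';     /* scoring code saved by the CODE statement */
run;
```

The new data set does not need to contain the target BAD, which is unknown for new applicants; P_BAD1 in the output serves as the estimated probability of default.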