Decision trees are commonly used in banking to predict default in mortgage applications. The data set HMEQ, which is in the sample library, contains observations for 5,960 mortgage applicants. A variable named BAD indicates whether the applicant paid or defaulted on the loan.
This example uses HMEQ to build a tree model that is used to score the data and can be used to score data on new applicants. Table 15.1 describes the variables in HMEQ.
Table 15.1: Variables in the Home Equity (HMEQ) Data Set
| Variable | Role | Level | Description |
|---|---|---|---|
| BAD | Target | Binary | 1 = applicant defaulted on the loan or is seriously delinquent; 0 = applicant paid the loan |
| CLAGE | Input | Interval | Age of oldest credit line in months |
| CLNO | Input | Interval | Number of credit lines |
| DEBTINC | Input | Interval | Debt-to-income ratio |
| DELINQ | Input | Interval | Number of delinquent credit lines |
| DEROG | Input | Interval | Number of major derogatory reports |
| JOB | Input | Nominal | Occupational category |
| LOAN | Input | Interval | Requested loan amount |
| MORTDUE | Input | Interval | Amount due on existing mortgage |
| NINQ | Input | Interval | Number of recent credit inquiries |
| REASON | Input | Binary | DebtCon = debt consolidation; HomeImp = home improvement |
| VALUE | Input | Interval | Value of current property |
| YOJ | Input | Interval | Years at present job |
Figure 15.1 shows a partial listing of HMEQ.
Figure 15.1: Partial Listing of the HMEQ Data
| Obs | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1100 | 25860 | 39025 | HomeImp | Other | 10.5 | 0 | 0 | 94.367 | 1 | 9 | . |
| 2 | 1 | 1300 | 70053 | 68400 | HomeImp | Other | 7.0 | 0 | 2 | 121.833 | 0 | 14 | . |
| 3 | 1 | 1500 | 13500 | 16700 | HomeImp | Other | 4.0 | 0 | 0 | 149.467 | 1 | 10 | . |
| 4 | 1 | 1500 | . | . | | | . | . | . | . | . | . | . |
| 5 | 0 | 1700 | 97800 | 112000 | HomeImp | Office | 3.0 | 0 | 0 | 93.333 | 0 | 14 | . |
| 6 | 1 | 1700 | 30548 | 40320 | HomeImp | Other | 9.0 | 0 | 0 | 101.466 | 1 | 8 | 37.1136 |
| 7 | 1 | 1800 | 48649 | 57037 | HomeImp | Other | 5.0 | 3 | 2 | 77.100 | 1 | 17 | . |
| 8 | 1 | 1800 | 28502 | 43034 | HomeImp | Other | 11.0 | 0 | 0 | 88.766 | 0 | 8 | 36.8849 |
| 9 | 1 | 2000 | 32700 | 46740 | HomeImp | Other | 3.0 | 0 | 2 | 216.933 | 1 | 12 | . |
| 10 | 1 | 2000 | . | 62250 | HomeImp | Sales | 16.0 | 0 | 0 | 115.800 | 0 | 13 | . |
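A listing like the one in Figure 15.1 can be produced with PROC PRINT. This is a minimal sketch: the OBS=10 data set option is chosen here to match the ten rows shown and is not taken from the source, and it assumes that the SAMPSIO library is available in your session.

```sas
/* Print the first 10 observations of HMEQ (row count is an assumption) */
proc print data=sampsio.hmeq(obs=10);
run;
```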
The target variable for the tree model is BAD, a nominal variable that has two values (0 indicates payment of loan, and 1 indicates default). The other variables are input variables for the model.
The following statements use the HPSPLIT procedure to create a decision tree and an output file that contains SAS DATA step code for predicting the probability of default:
    proc hpsplit data=sampsio.hmeq maxdepth=7 maxbranch=2;
       target BAD;
       input DELINQ DEROG JOB NINQ REASON / level=nom;
       input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;
       prune misc / N <= 10;
       partition fraction(validate=0.2);
       code file='hpsplhme-code.sas';
    run;
The TARGET statement specifies the target variable, and the INPUT statements specify the input variables and their levels. The MAXDEPTH= option specifies the maximum depth of the tree to be grown, and the MAXBRANCH= option specifies the maximum number of children per node.
By default, the entropy metric is used to grow the tree. The PRUNE statement requests the misclassification rate metric for choosing a node to prune back to a leaf. The option N<=10 stops the pruning when the number of leaves is less than or equal to 10.
The PARTITION statement specifies the probability (0.2) of randomly selecting a given observation in HMEQ for validation; the remaining observations are used for training.
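Because the validation sample is drawn at random, repeated runs can produce different partitions and therefore different trees. The following sketch repeats the PROC HPSPLIT step with a SEED= suboption in the FRACTION option so that the split is repeatable. The seed value 12345 is arbitrary, and the availability of SEED= is an assumption that you should check against the PARTITION statement documentation for your release. A tree built this way will not necessarily match the results shown in this example.

```sas
/* Sketch: same model specification, with an assumed SEED= value
   so that the training/validation split is repeatable */
proc hpsplit data=sampsio.hmeq maxdepth=7 maxbranch=2;
   target BAD;
   input DELINQ DEROG JOB NINQ REASON / level=nom;
   input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;
   prune misc / N <= 10;
   partition fraction(validate=0.2 seed=12345);  /* seed value is arbitrary */
   code file='hpsplhme-code.sas';
run;
```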
The CODE statement specifies a file named hpsplhme-code.sas, to which SAS DATA step code for scoring is saved.
The following statements score the data in HMEQ and save the results in a SAS data set named SCORED:

    data scored;
       set sampsio.hmeq;
       %include 'hpsplhme-code.sas';
    run;
A partial listing of SCORED is shown in Figure 15.2.
Figure 15.2: Partial Listing of the Scored HMEQ Data
| Obs | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | _NODE_ | _LEAF_ | _WARN_ | P_BAD1 | P_BAD0 | V_BAD1 | V_BAD0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1100 | 25860 | 39025 | HomeImp | Other | 10.5 | 0 | 0 | 94.367 | 1 | 9 | . | 9 | 2 | | 0.18923 | 0.81077 | 0.16996 | 0.83004 |
| 2 | 1 | 1300 | 70053 | 68400 | HomeImp | Other | 7.0 | 0 | 2 | 121.833 | 0 | 14 | . | 14 | 6 | | 0.30818 | 0.69182 | 0.28750 | 0.71250 |
| 3 | 1 | 1500 | 13500 | 16700 | HomeImp | Other | 4.0 | 0 | 0 | 149.467 | 1 | 10 | . | 9 | 2 | | 0.18923 | 0.81077 | 0.16996 | 0.83004 |
| 4 | 1 | 1500 | . | . | | | . | . | . | . | . | . | . | 9 | 2 | | 0.18923 | 0.81077 | 0.16996 | 0.83004 |
| 5 | 0 | 1700 | 97800 | 112000 | HomeImp | Office | 3.0 | 0 | 0 | 93.333 | 0 | 14 | . | 9 | 2 | | 0.18923 | 0.81077 | 0.16996 | 0.83004 |
| 6 | 1 | 1700 | 30548 | 40320 | HomeImp | Other | 9.0 | 0 | 0 | 101.466 | 1 | 8 | 37.1136 | 9 | 2 | | 0.18923 | 0.81077 | 0.16996 | 0.83004 |
| 7 | 1 | 1800 | 48649 | 57037 | HomeImp | Other | 5.0 | 3 | 2 | 77.100 | 1 | 17 | . | 13 | 5 | | 0.58125 | 0.41875 | 0.60784 | 0.39216 |
| 8 | 1 | 1800 | 28502 | 43034 | HomeImp | Other | 11.0 | 0 | 0 | 88.766 | 0 | 8 | 36.8849 | 9 | 2 | | 0.18923 | 0.81077 | 0.16996 | 0.83004 |
| 9 | 1 | 2000 | 32700 | 46740 | HomeImp | Other | 3.0 | 0 | 2 | 216.933 | 1 | 12 | . | 14 | 6 | | 0.30818 | 0.69182 | 0.28750 | 0.71250 |
| 10 | 1 | 2000 | . | 62250 | HomeImp | Sales | 16.0 | 0 | 0 | 115.800 | 0 | 13 | . | 9 | 2 | | 0.18923 | 0.81077 | 0.16996 | 0.83004 |
The data set contains the original variables and new variables that are created by the scoring code. The variable P_BAD1 is the proportion of training observations that have BAD=1 in the leaf to which the observation is assigned, and it can be interpreted as the predicted probability of default. The variable V_BAD1 is the proportion of validation observations in the same leaf that have BAD=1. The other new variables are described in the section Outputs.
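One way to turn the predicted probability into a decision is to apply a cutoff to P_BAD1 and cross-tabulate the result against BAD. The following is a minimal sketch, not part of the original example: the 0.5 cutoff and the names SCORED_CLASS and PRED_BAD are assumptions introduced for illustration. Because SCORED contains both training and validation observations, the resulting table is likely to be optimistic about performance on new data.

```sas
data scored_class;
   set scored;
   /* Assumed cutoff: predict default when P_BAD1 is at least 0.5 */
   PRED_BAD = (P_BAD1 >= 0.5);
run;

/* Counts of actual (BAD) versus predicted (PRED_BAD) outcomes */
proc freq data=scored_class;
   tables BAD*PRED_BAD / norow nocol nopercent;
run;
```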
The scoring DATA step can also be used to score new data by specifying the new data set in place of SAMPSIO.HMEQ in the SET statement. The new data set must contain the same variables as the data that are used to build the tree model.
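As a sketch of scoring new applicants, the following DATA step reads a hypothetical data set named WORK.NEW_APPLICANTS and writes the results to NEW_SCORED. Both names are placeholders and do not come from the original example. The step assumes that hpsplhme-code.sas was created by the earlier PROC HPSPLIT step and is accessible from the current working directory.

```sas
/* WORK.NEW_APPLICANTS is a hypothetical data set of new applications
   that has the same variables as HMEQ */
data new_scored;
   set work.new_applicants;
   %include 'hpsplhme-code.sas';
run;
```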