The HPSPLIT Procedure

Getting Started: HPSPLIT Procedure

Decision trees are commonly used in banking to predict default in mortgage applications. The data set HMEQ, which is in the sample library, contains observations for 5,960 mortgage applicants. A variable named BAD indicates whether the applicant paid or defaulted on the loan.

This example uses HMEQ to build a tree model that is used to score the data and can be used to score data on new applicants. Table 9.1 describes the variables in HMEQ.

Table 9.1: Variables in the Home Equity (HMEQ) Data Set

Variable  Role    Level     Description
BAD       Target  Binary    1 = applicant defaulted on the loan or is seriously delinquent; 0 = applicant paid the loan
CLAGE     Input   Interval  Age of oldest credit line in months
CLNO      Input   Interval  Number of credit lines
DEBTINC   Input   Interval  Debt-to-income ratio
DELINQ    Input   Interval  Number of delinquent credit lines
DEROG     Input   Interval  Number of major derogatory reports
JOB       Input   Nominal   Occupational category
LOAN      Input   Interval  Requested loan amount
MORTDUE   Input   Interval  Amount due on existing mortgage
NINQ      Input   Interval  Number of recent credit inquiries
REASON    Input   Binary    DebtCon = debt consolidation; HomeImp = home improvement
VALUE     Input   Interval  Value of current property
YOJ       Input   Interval  Years at present job

Figure 9.1 shows a partial listing of HMEQ.

Figure 9.1: Partial Listing of the HMEQ Data

Obs BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
1 1 1100 25860 39025 HomeImp Other 10.5 0 0 94.367 1 9 .
2 1 1300 70053 68400 HomeImp Other 7.0 0 2 121.833 0 14 .
3 1 1500 13500 16700 HomeImp Other 4.0 0 0 149.467 1 10 .
4 1 1500 . .     . . . . . . .
5 0 1700 97800 112000 HomeImp Office 3.0 0 0 93.333 0 14 .
6 1 1700 30548 40320 HomeImp Other 9.0 0 0 101.466 1 8 37.1136
7 1 1800 48649 57037 HomeImp Other 5.0 3 2 77.100 1 17 .
8 1 1800 28502 43034 HomeImp Other 11.0 0 0 88.766 0 8 36.8849
9 1 2000 32700 46740 HomeImp Other 3.0 0 2 216.933 1 12 .
10 1 2000 . 62250 HomeImp Sales 16.0 0 0 115.800 0 13 .
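
A listing like the one in Figure 9.1 can be produced with PROC PRINT. The following is a minimal sketch, assuming that HMEQ is available as sashelp.hmeq (the libref used in the statements later in this section); the OBS=10 data set option limits the listing to the first 10 observations:

```sas
/* Sketch: list the first 10 observations of HMEQ.        */
/* Assumes the data set is available as sashelp.hmeq.     */
proc print data=sashelp.hmeq(obs=10);
run;
```

Missing numeric values are displayed as a period, and missing character values (such as REASON and JOB in observation 4) are displayed as blanks.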


The target variable for the tree model is BAD, a nominal variable that has two values (0 indicates payment, and 1 indicates default). The other variables are input variables for the model.

The following statements use the HPSPLIT procedure to create a decision tree and an output file that contains SAS DATA step code for predicting the probability of default:

proc hpsplit data=sashelp.hmeq maxdepth=7 maxbranch=2;
  target BAD;
  input DELINQ DEROG JOB NINQ REASON / level=nom;
  input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ  / level=int;
  prune misc / N <= 10;
  partition fraction(validate=0.2);
  code file='hpsplhme-code.sas';
run;

The TARGET statement specifies the target variable, and the INPUT statements specify the input variables and their levels. The MAXDEPTH= option specifies the maximum depth of the tree to be grown, and the MAXBRANCH= option specifies the maximum number of children per node.

By default, the entropy metric is used to grow the tree. The PRUNE statement requests the misclassification rate metric for choosing a node to prune back to a leaf. The option N<=10 stops the pruning when the number of leaves is less than or equal to 10.
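
For intuition, the two metrics can be written out for a single node of a binary target. The following uses the common textbook definitions, where p is the proportion of observations at the node that have BAD=1; the base-2 logarithm is an assumption made for this illustration, and the exact forms that the procedure uses are defined in its own documentation:

```latex
\[
  H(p) = -\,p \log_2 p \;-\; (1-p)\log_2(1-p),
  \qquad
  M(p) = \min(p,\; 1-p)
\]
% Worked example for a node with p = 0.17391:
%   H(p) is approximately 0.667
%   M(p) = 0.17391
```

Both quantities are 0 for a pure node (p = 0 or p = 1) and reach their maximum at p = 0.5.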

The PARTITION statement specifies the probability (0.2) of randomly selecting a given observation in HMEQ for validation; the remaining observations are used for training.
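
Because the partition is random, repeated runs can assign different observations to validation and therefore produce different trees. The following sketch repeats the preceding statements with a SEED= value added to the FRACTION option so that the split is reproducible. The availability of SEED= should be confirmed against the PARTITION statement syntax for your release, and the value 12345 is arbitrary:

```sas
/* Sketch: same model as above, with a fixed seed for the partition. */
/* SEED=12345 is an arbitrary value chosen for illustration.         */
proc hpsplit data=sashelp.hmeq maxdepth=7 maxbranch=2;
  target BAD;
  input DELINQ DEROG JOB NINQ REASON / level=nom;
  input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ / level=int;
  prune misc / N <= 10;
  partition fraction(validate=0.2 seed=12345);
  code file='hpsplhme-code.sas';
run;
```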

The CODE statement specifies a file named hpsplhme-code.sas, to which SAS DATA step code for scoring is saved.

The following statements score the data in HMEQ and save the results in a SAS data set named SCORED:

data scored;
  set sashelp.hmeq;
  %include 'hpsplhme-code.sas';
run;

A partial listing of SCORED is shown in Figure 9.2.

Figure 9.2: Partial Listing of the Scored HMEQ Data

Obs BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC _NODE_ _LEAF_ _WARN_ P_BAD1 P_BAD0 V_BAD1 V_BAD0
1 1 1100 25860 39025 HomeImp Other 10.5 0 0 94.367 1 9 . 16 7   0.17391 0.82609 0.18808 0.81192
2 1 1300 70053 68400 HomeImp Other 7.0 0 2 121.833 0 14 . 13 6   0.29969 0.70031 0.32450 0.67550
3 1 1500 13500 16700 HomeImp Other 4.0 0 0 149.467 1 10 . 16 7   0.17391 0.82609 0.18808 0.81192
4 1 1500 . .     . . . . . . . 16 7   0.17391 0.82609 0.18808 0.81192
5 0 1700 97800 112000 HomeImp Office 3.0 0 0 93.333 0 14 . 16 7   0.17391 0.82609 0.18808 0.81192
6 1 1700 30548 40320 HomeImp Other 9.0 0 0 101.466 1 8 37.1136 16 7   0.17391 0.82609 0.18808 0.81192
7 1 1800 48649 57037 HomeImp Other 5.0 3 2 77.100 1 17 . 6 2   0.93939 0.06061 0.87500 0.12500
8 1 1800 28502 43034 HomeImp Other 11.0 0 0 88.766 0 8 36.8849 16 7   0.17391 0.82609 0.18808 0.81192
9 1 2000 32700 46740 HomeImp Other 3.0 0 2 216.933 1 12 . 13 6   0.29969 0.70031 0.32450 0.67550
10 1 2000 . 62250 HomeImp Sales 16.0 0 0 115.800 0 13 . 16 7   0.17391 0.82609 0.18808 0.81192


The data set contains the original variables and new variables that are created by the score statements. The variable P_BAD1 is the proportion of training observations at this leaf that have BAD=1, and this variable can be interpreted as the probability of default. The variable V_BAD1 is the proportion of validation observations at this leaf that have BAD=1. The other new variables are described in the section Outputs.
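
One way to turn the predicted probability into a predicted class is to apply a cutoff to P_BAD1 and cross-tabulate the result against BAD. The following is a minimal sketch, not part of the original example: the data set name CLASSIFIED, the variable name predicted_bad, and the cutoff of 0.5 are assumptions made for illustration, and a different cutoff might suit a particular application better:

```sas
/* Sketch: classify each observation by using a hypothetical 0.5 cutoff. */
data classified;
  set scored;
  /* The comparison evaluates to 1 (true) or 0 (false). */
  predicted_bad = (P_BAD1 >= 0.5);
run;

/* Cross-tabulate the actual target against the predicted class. */
proc freq data=classified;
  tables BAD*predicted_bad / norow nocol nopercent;
run;
```

The off-diagonal cells of the resulting table count the misclassified observations. Because SCORED includes both training and validation observations, these counts do not estimate performance on new data.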

The preceding statements can be used to score new data by including the new data set in place of HMEQ. The new data set must contain the same variables as the data that are used to build the tree model.
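
For example, the following sketch scores a new data set by using the same score code file. The names work.new_applicants and scored_new are hypothetical and stand in for your own data sets:

```sas
/* Sketch: score new applicants by using the saved DATA step code. */
/* work.new_applicants is a hypothetical data set that contains    */
/* the same variables as HMEQ.                                     */
data scored_new;
  set work.new_applicants;
  %include 'hpsplhme-code.sas';
run;
```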