The HPSPLIT Procedure

Getting Started: HPSPLIT Procedure

Decision trees are commonly used in banking to predict default in mortgage applications. The data set HMEQ, which is in the sample library, contains observations for 5,960 mortgage applicants. A variable named BAD indicates whether the applicant paid or defaulted on the loan.

This example uses HMEQ to build a tree model that is used to score the data and can be used to score data on new applicants. Table 15.1 describes the variables in HMEQ.

Table 15.1: Variables in the Home Equity (HMEQ) Data Set

Variable	Role	Level	Description
BAD	Target	Binary	1 = applicant defaulted on the loan or is seriously delinquent
			0 = applicant paid the loan
CLAGE	Input	Interval	Age of oldest credit line in months
CLNO	Input	Interval	Number of credit lines
DEBTINC	Input	Interval	Debt-to-income ratio
DELINQ	Input	Interval	Number of delinquent credit lines
DEROG	Input	Interval	Number of major derogatory reports
JOB	Input	Nominal	Occupational category
LOAN	Input	Interval	Requested loan amount
MORTDUE	Input	Interval	Amount due on existing mortgage
NINQ	Input	Interval	Number of recent credit inquiries
REASON	Input	Binary	DebtCon = debt consolidation
			HomeImp = home improvement
VALUE	Input	Interval	Value of current property
YOJ	Input	Interval	Years at present job

Figure 15.1 shows a partial listing of HMEQ.

Figure 15.1: Partial Listing of the HMEQ Data

Obs	BAD	LOAN	MORTDUE	VALUE	REASON	JOB	YOJ	DEROG	DELINQ	CLAGE	NINQ	CLNO	DEBTINC
1	1	1100	25860	39025	HomeImp	Other	10.5	0	0	94.367	1	9	.
2	1	1300	70053	68400	HomeImp	Other	7.0	0	2	121.833	0	14	.
3	1	1500	13500	16700	HomeImp	Other	4.0	0	0	149.467	1	10	.
4	1	1500	.	.			.	.	.	.	.	.	.
5	0	1700	97800	112000	HomeImp	Office	3.0	0	0	93.333	0	14	.
6	1	1700	30548	40320	HomeImp	Other	9.0	0	0	101.466	1	8	37.1136
7	1	1800	48649	57037	HomeImp	Other	5.0	3	2	77.100	1	17	.
8	1	1800	28502	43034	HomeImp	Other	11.0	0	0	88.766	0	8	36.8849
9	1	2000	32700	46740	HomeImp	Other	3.0	0	2	216.933	1	12	.
10	1	2000	.	62250	HomeImp	Sales	16.0	0	0	115.800	0	13	.

The target variable for the tree model is BAD, a nominal variable that has two values (0 indicates payment of loan, and 1 indicates default). The other variables are input variables for the model.

The following statements use the HPSPLIT procedure to create a decision tree and an output file that contains SAS DATA step code for predicting the probability of default:

proc hpsplit data=sampsio.hmeq maxdepth=7 maxbranch=2;
   target BAD;
   input DELINQ DEROG JOB NINQ REASON / level=nom;
   input CLAGE CLNO DEBTINC LOAN MORTDUE VALUE YOJ  / level=int;
   prune misc / N <= 10;
   partition fraction(validate=0.2);
   code file='hpsplhme-code.sas';
run;

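The TARGET statement specifies the target variable, and the INPUT statements specify the input variables and their levels. DELINQ, DEROG, and NINQ are listed as interval variables in Table 15.1 but are specified with LEVEL=NOM here, so the procedure treats them as nominal inputs. The MAXDEPTH= option specifies the maximum depth of the tree to be grown, and the MAXBRANCH= option specifies the maximum number of children per node.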

By default, the entropy metric is used to grow the tree. The PRUNE statement requests the misclassification rate metric for choosing a node to prune back to a leaf. The option N<=10 stops the pruning when the number of leaves is less than or equal to 10.
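To make the two metrics concrete, the following DATA step is a minimal sketch that computes both for a single node. It assumes the standard Shannon definition of entropy with base-2 logarithms and defines the misclassification rate as the proportion of the minority class; the procedure's internal calculations might differ in detail. The proportion 0.18923 is the P_BAD1 value that appears for node 9 in Figure 15.2.

```sas
/* Illustrative only: impurity of one node from its class proportions. */
data _null_;
   p1 = 0.18923;   /* proportion with BAD=1 (value shown in Figure 15.2) */
   p0 = 1 - p1;    /* proportion with BAD=0 */

   /* Shannon entropy in bits; assumes 0 < p1 < 1 so both logs are defined */
   entropy = -(p1*log2(p1) + p0*log2(p0));

   /* Misclassification rate when the node predicts its majority class */
   misc = min(p1, p0);

   put entropy= misc=;   /* writes both values to the SAS log */
run;
```

Both quantities are 0 for a pure node and largest when the two classes are equally represented.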

The PARTITION statement specifies the probability (0.2) of randomly selecting a given observation in HMEQ for validation; the remaining observations are used for training.
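The following DATA step is a sketch of what selecting each observation with probability 0.2 means. It is not the HPSPLIT implementation; the data set name PARTITION_DEMO, the variable VALIDATE, and the seed 12345 are hypothetical and chosen only for illustration.

```sas
/* Illustration of per-observation random selection; not HPSPLIT internals. */
data partition_demo;
   set sampsio.hmeq;
   if _n_ = 1 then call streaminit(12345);   /* hypothetical seed for repeatability */
   validate = (rand('uniform') < 0.2);       /* 1 = validation, 0 = training */
run;

/* Count how many observations landed in each role */
proc freq data=partition_demo;
   tables validate;
run;
```

Because each observation is selected independently, the validation count is random: it is about 0.2 x 5,960 = 1,192 on average, but it is not fixed at that number.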

The CODE statement specifies a file named hpsplhme-code.sas, to which SAS DATA step code for scoring is saved.

The following statements score the data in HMEQ and save the results in a SAS data set named SCORED:

data scored;
   set sampsio.hmeq;
   %include 'hpsplhme-code.sas';
run;

A partial listing of SCORED is shown in Figure 15.2.

Figure 15.2: Partial Listing of the Scored HMEQ Data

Obs	BAD	LOAN	MORTDUE	VALUE	REASON	JOB	YOJ	DEROG	DELINQ	CLAGE	NINQ	CLNO	DEBTINC	_NODE_	_LEAF_	P_BAD1	P_BAD0	V_BAD1	V_BAD0
1	1	1100	25860	39025	HomeImp	Other	10.5	0	0	94.367	1	9	.	9	2	0.18923	0.81077	0.16996	0.83004
2	1	1300	70053	68400	HomeImp	Other	7.0	0	2	121.833	0	14	.	14	6	0.30818	0.69182	0.28750	0.71250
3	1	1500	13500	16700	HomeImp	Other	4.0	0	0	149.467	1	10	.	9	2	0.18923	0.81077	0.16996	0.83004
4	1	1500	.	.			.	.	.	.	.	.	.	9	2	0.18923	0.81077	0.16996	0.83004
5	0	1700	97800	112000	HomeImp	Office	3.0	0	0	93.333	0	14	.	9	2	0.18923	0.81077	0.16996	0.83004
6	1	1700	30548	40320	HomeImp	Other	9.0	0	0	101.466	1	8	37.1136	9	2	0.18923	0.81077	0.16996	0.83004
7	1	1800	48649	57037	HomeImp	Other	5.0	3	2	77.100	1	17	.	13	5	0.58125	0.41875	0.60784	0.39216
8	1	1800	28502	43034	HomeImp	Other	11.0	0	0	88.766	0	8	36.8849	9	2	0.18923	0.81077	0.16996	0.83004
9	1	2000	32700	46740	HomeImp	Other	3.0	0	2	216.933	1	12	.	14	6	0.30818	0.69182	0.28750	0.71250
10	1	2000	.	62250	HomeImp	Sales	16.0	0	0	115.800	0	13	.	9	2	0.18923	0.81077	0.16996	0.83004

The data set contains the original variables and new variables that are created by the scoring code. The variable P_BAD1 is the proportion of training observations that have BAD=1 in the leaf to which the observation is assigned, and this variable can be interpreted as the probability of default. The variable V_BAD1 is the proportion of validation observations in that leaf that have BAD=1. The other new variables are described in the section Outputs.
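One way to use P_BAD1 is to convert it to a predicted class and compare that class with the observed value of BAD. The following statements are a sketch under stated assumptions: the 0.5 cutoff, the data set name CLASSIFIED, and the variable PREDICTED_BAD are hypothetical choices, not part of the HPSPLIT output.

```sas
/* Turn the predicted probability into a 0/1 prediction. */
data classified;
   set scored;
   predicted_bad = (P_BAD1 >= 0.5);   /* 0.5 cutoff is an assumption */
run;

/* Cross-tabulate observed BAD against the prediction (cell counts only) */
proc freq data=classified;
   tables BAD*predicted_bad / norow nocol nopercent;
run;
```

A different cutoff might be more appropriate when the cost of missing a default differs from the cost of rejecting a good applicant.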

The preceding statements can be used to score new data by including the new data set in place of HMEQ. The new data set must contain the same variables as the data that are used to build the tree model.
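For example, the scoring step might look like the following sketch, in which WORK.NEW_APPLICANTS and SCORED_NEW are hypothetical data set names. It assumes that the file hpsplhme-code.sas is in a location where the %INCLUDE statement can find it.

```sas
/* Score a new data set by using the saved DATA step code. */
data scored_new;
   set work.new_applicants;        /* hypothetical data set of new applicants */
   %include 'hpsplhme-code.sas';   /* file written by the CODE statement */
run;
```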