The HPSPLIT Procedure

Example 61.3 Creating a Regression Tree

This example performs an analysis similar to the one in the "Getting Started" section of Chapter 60: The HPREG Procedure, where a linear regression model is fit. You can alternatively fit a regression tree to predict the salaries of Major League Baseball players based on their performance measures from the previous season by using almost identical code. Regression trees are piecewise constant models that, for relatively small data sets such as this, provide succinct summaries of how the predictors determine the predictions. These models are usually easier to interpret than linear regression models.

The Sashelp.Baseball data set contains salary and performance information for Major League Baseball players (excluding pitchers) who played at least one game in both the 1986 and 1987 seasons (Time Inc. 1987).

The following statements create a regression tree model:

ods graphics on;

proc hpsplit data=sashelp.baseball seed=123;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi
                     crBB league division nOuts nAssts nError;
   output out=hpsplout;
run;

By default, the tree is grown using the RSS criterion, and cost-complexity pruning with 10-fold cross validation is performed. The OUTPUT statement requests generation of the data set hpsplout, which contains the predicted salary from the tree model for each observation.
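If you prefer to state the defaults explicitly, the following is a minimal sketch of the same model with the growth and pruning methods written out. It assumes that GROW RSS and PRUNE COSTCOMPLEXITY are the keywords that correspond to these defaults in your release; check the GROW and PRUNE statement syntax before relying on it. The number of cross validation folds is left at its default.

```sas
/* Sketch only: the GROW and PRUNE keywords are assumed to match the defaults */
proc hpsplit data=sashelp.baseball seed=123;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi
                     crBB league division nOuts nAssts nError;
   grow rss;                 /* assumed keyword for the RSS criterion */
   prune costcomplexity;     /* cost-complexity pruning               */
   output out=hpsplout;
run;
```

Writing out the defaults documents the modeling choices in the program itself, which can be helpful if the defaults differ between releases.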

Much of the output for a regression tree is identical to the output for a classification tree. The tables and plots that differ are displayed and discussed below.

Output 61.3.1 displays the full regression tree.

Output 61.3.1: Overview Diagram of Regression Tree


You can see from this diagram that the final selected tree has eight leaves. For a regression tree, the shade of each leaf represents the predicted response value, which is the average observed logSalary for the observations in that leaf. Node 3 has the lowest predicted response value, indicated by the lightest shade of blue, and Node A has the highest, indicated by the darkest shade.

Output 61.3.2 shows details of the tree.

Output 61.3.2: Detailed Diagram of Regression Tree


As with a classification tree, you can see split variables and values for a portion of the tree in this view. You can also see the exact predicted response value, which is the average observed response, in each node.
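Because each predicted value is the average observed response in a leaf, you can check the predictions against the scored data. The following sketch assumes that the hpsplout data set from the OUTPUT statement exists and contains the _Leaf_, logSalary, and P_logSalary variables; it is an illustration and not part of the original example.

```sas
/* Compare the mean observed logSalary with the prediction in each leaf */
proc means data=hpsplout n mean maxdec=5;
   class _Leaf_;                  /* one row group per leaf               */
   var logSalary P_logSalary;     /* observed and predicted response      */
run;
```

Within each leaf, the mean of logSalary should agree with the single value of P_logSalary. The counts can differ between the two variables because players whose logSalary is missing are still scored.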

The confusion matrix is omitted from the output when you are modeling a regression tree because it is relevant only for a categorical response.

Output 61.3.3 displays fit statistics for the final regression tree.

Output 61.3.3: Regression Tree Performance

The HPSPLIT Procedure

Model-Based Fit Statistics for Selected Tree

N Leaves   ASE      RSS
8          0.1443   37.9587

This table contains different statistics from those included for a classification tree. The average squared error (ASE) and the residual sum of squares (RSS) are reported here to help you assess the model fit. You can also use the CVMODELFIT option in the PROC HPSPLIT statement to obtain the cross-validated fit statistics, as with a classification tree.
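As a minimal sketch, the CVMODELFIT option can be added to the PROC HPSPLIT statement of this example as follows. The rest of the step is unchanged; the OUTPUT statement is omitted here only to keep the sketch short.

```sas
/* Request cross validated fit statistics for the regression tree */
proc hpsplit data=sashelp.baseball seed=123 cvmodelfit;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi
                     crBB league division nOuts nAssts nError;
run;
```

The model-based statistics are computed on the same data that were used to build the tree, so the cross validated statistics are likely to give a less optimistic assessment of fit.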

Output 61.3.4 shows the first 10 observations of the hpsplout data set, which is created by the OUTPUT statement and contains the predicted log-transformed salary for each player in Sashelp.Baseball based on the regression tree model.

Output 61.3.4: Scored Predictor Data Set

Obs logSalary _Node_ _Leaf_ P_logSalary
1 . 3 0 4.56042
2 6.16331 11 5 6.08107
3 6.17379 9 3 6.41796
4 6.21461 13 6 6.60314
5 4.51634 3 0 4.56042
6 6.62007 13 6 6.60314
7 4.24850 3 0 4.56042
8 4.60517 3 0 4.56042
9 4.31749 3 0 4.56042
10 7.00307 10 4 7.02886
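A listing like this one can be produced with PROC PRINT. The following is a sketch that assumes the hpsplout data set exists; the exact statements that were used to generate the output are not shown in this example.

```sas
/* List the first 10 scored observations */
proc print data=hpsplout(obs=10);
   var logSalary _Node_ _Leaf_ P_logSalary;
run;
```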

The variable P_logSalary contains the predicted salaries on the log scale. All observations in the same leaf have the same predicted response; for example, observations 1, 5, 7, 8, and 9 all fall in leaf 0 and share the prediction 4.56042. The OUT= data set can contain additional variables from the DATA= data set if you specify them in the ID statement.
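The following sketch shows one way to use the ID statement and then return the predictions to the original salary scale. The choice of Name and Team as ID variables, the data set names hpsplout_id and salary_pred, and the variable predSalary are illustrative assumptions. The back-transformation also assumes that logSalary is the natural logarithm of Salary.

```sas
/* Carry identifying variables into the scored data set */
proc hpsplit data=sashelp.baseball seed=123;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi
                     crBB league division nOuts nAssts nError;
   id name team;                    /* illustrative choice of ID variables */
   output out=hpsplout_id;
run;

/* Back-transform the predictions, assuming a natural log response */
data salary_pred;
   set hpsplout_id;
   predSalary = exp(P_logSalary);   /* same units as Salary */
run;
```

Exponentiating the mean of the log salaries in a leaf gives the geometric mean of the salaries in that leaf, not the arithmetic mean, so predSalary tends to be lower than the average salary of the players in the leaf.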