The HPREG Procedure

Getting Started: HPREG Procedure

The following example is closely modeled on the example in the section "Getting Started: GLMSELECT Procedure" in the SAS/STAT User's Guide.

The Sashelp.Baseball data set contains salary and performance information for Major League Baseball players who played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries (Sports Illustrated, April 20, 1987) are for the 1987 season and the performance measures are from 1986 (Collier Books, The 1987 Baseball Encyclopedia Update). The following step lists the variables in the data set; the output is shown in Figure 15.1:

proc contents varnum data=sashelp.baseball;
   ods select position;
run;

Figure 15.1: Sashelp.Baseball Data Set

The CONTENTS Procedure

Variables in Creation Order
# Variable Type Len Label
1 Name Char 18 Player's Name
2 Team Char 14 Team at the End of 1986
3 nAtBat Num 8 Times at Bat in 1986
4 nHits Num 8 Hits in 1986
5 nHome Num 8 Home Runs in 1986
6 nRuns Num 8 Runs in 1986
7 nRBI Num 8 RBIs in 1986
8 nBB Num 8 Walks in 1986
9 YrMajor Num 8 Years in the Major Leagues
10 CrAtBat Num 8 Career Times at Bat
11 CrHits Num 8 Career Hits
12 CrHome Num 8 Career Home Runs
13 CrRuns Num 8 Career Runs
14 CrRbi Num 8 Career RBIs
15 CrBB Num 8 Career Walks
16 League Char 8 League at the End of 1986
17 Division Char 8 Division at the End of 1986
18 Position Char 8 Position(s) in 1986
19 nOuts Num 8 Put Outs in 1986
20 nAssts Num 8 Assists in 1986
21 nError Num 8 Errors in 1986
22 Salary Num 8 1987 Salary in $ Thousands
23 Div Char 16 League and Division
24 logSalary Num 8 Log Salary



Suppose you want to investigate whether you can model the players’ salaries for the 1987 season based on performance measures for the previous season. The aim is to obtain a parsimonious model that does not overfit these particular data, so that it is useful for prediction. This example shows how you can use PROC HPREG as a starting point for such an analysis. Because the variation of salaries is much greater for the higher salaries, it is appropriate to apply a log transformation to the salaries before doing the model selection. The data set already provides the transformed response as the variable logSalary (variable 24 in Figure 15.1).
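
Because Sashelp.Baseball already contains logSalary, no transformation step is needed for this example. For a data set that lacks such a variable, a minimal sketch of the transformation might look like the following. The names work.baseballLog and logSalaryCheck are hypothetical, and the natural logarithm is assumed:

```sas
/* Sketch only: work.baseballLog and logSalaryCheck are illustrative names. */
data work.baseballLog;
  set sashelp.baseball;
  /* LOG is the natural logarithm. Rows with a missing or nonpositive */
  /* salary are left with a missing logSalaryCheck.                   */
  if salary > 0 then logSalaryCheck = log(salary);
run;
```

Comparing logSalaryCheck with logSalary in the resulting data set is one way to confirm which transformation the supplied variable uses.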

The following statements select a model with the default settings for stepwise selection:

proc hpreg data=sashelp.baseball;
  class league division;
  model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                    yrMajor crAtBat crHits crHome crRuns crRbi
                    crBB league division nOuts nAssts nError;
  selection method=stepwise;
run;

The default output from this analysis is presented in Figure 15.2 through Figure 15.6.

Figure 15.2: Performance, Data Access, Model, and Selection Information

The HPREG Procedure

Performance Information
Execution Mode Single-Machine
Number of Threads 4

Data Access Information
Data Engine Role Path
SASHELP.BASEBALL V9 Input On Client

Model Information
Data Source SASHELP.BASEBALL
Dependent Variable logSalary
Class Parameterization GLM

Selection Information
Selection Method Stepwise
Select Criterion SBC
Stop Criterion SBC
Effect Hierarchy Enforced None
Stop Horizon 3



Figure 15.2 displays the "Performance Information," "Data Access Information," "Model Information," and "Selection Information" tables. The "Performance Information" table shows that the procedure executes in single-machine mode; that is, the model is fit on the machine where the SAS session executes. This run of the HPREG procedure was performed on a multicore machine with four CPUs, and one computational thread was spawned per CPU.
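
If you prefer to set the thread count yourself rather than rely on the default, the PERFORMANCE statement is the usual place to do so. The following is a sketch, assuming that the NTHREADS= and DETAILS options are available in your release; the value 2 is arbitrary:

```sas
proc hpreg data=sashelp.baseball;
  /* Assumption: request two threads and a table of timing details */
  performance nthreads=2 details;
  class league division;
  model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                    yrMajor crAtBat crHits crHome crRuns crRbi
                    crBB league division nOuts nAssts nError;
  selection method=stepwise;
run;
```

With this request, the "Performance Information" table should report two threads instead of four; the selected model is not expected to change.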

The "Data Access Information" table shows that the input data set is accessed with the V9 (base) engine on the client machine.

The "Model Information" table identifies the data source and response and shows that the CLASS variables are parameterized in the GLM parameterization, which is the default.

The "Selection Information" table provides details about the method and criteria used to perform the model selection. The requested selection method is a variant of the traditional stepwise selection in which the decisions about which effects to add or drop at any step and when to terminate the selection are both based on the Schwarz Bayesian information criterion (SBC). The effect in the current model whose removal yields the maximal decrease in the SBC statistic is dropped, provided this lowers the SBC value. When no further decrease in the SBC value can be obtained by dropping an effect in the model, the effect whose addition to the model yields the lowest SBC statistic is added and the whole process is repeated. The method terminates when dropping or adding any effect increases the SBC statistic.
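
The criteria reported in the "Selection Information" table can also be written out explicitly. The following sketch assumes that the SELECT= and STOP= suboptions of METHOD=STEPWISE accept SBC, which is what the table reports as the default; it should therefore reproduce the same selection:

```sas
proc hpreg data=sashelp.baseball;
  class league division;
  model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                    yrMajor crAtBat crHits crHome crRuns crRbi
                    crBB league division nOuts nAssts nError;
  /* SELECT= governs which effect is added or dropped at each step; */
  /* STOP= governs when the selection terminates.                   */
  selection method=stepwise(select=sbc stop=sbc);
run;
```

Spelling out the defaults makes the program easier to read later and makes it simpler to experiment with a different criterion for one of the two decisions.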

Figure 15.3 displays the "Number of Observations," "Class Levels," and "Dimensions" tables. The "Number of Observations" table shows that of the 322 observations in the input data, only 263 are used in the analysis; the remaining 59 observations have incomplete data. The "Class Level Information" table lists the levels of the classification variables League and Division. When you specify effects that contain classification variables, the number of parameters is usually larger than the number of effects. The "Dimensions" table shows the number of effects and the number of parameters considered.

Figure 15.3: Number of Observations, Class Levels, and Dimensions

Number of Observations Read 322
Number of Observations Used 263

Class Level Information
Class Levels Values
League 2 American National
Division 2 East West

Dimensions
Number of Effects 19
Number of Parameters 21
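
To see where the difference between observations read and observations used comes from, you can count the complete cases yourself. This is a sketch, assuming that an observation is dropped whenever the response or any variable in the MODEL statement is missing; the counter name nComplete is hypothetical:

```sas
data _null_;
  set sashelp.baseball end=last;
  /* CMISS counts missing values across numeric and character arguments */
  if cmiss(logSalary, nAtBat, nHits, nHome, nRuns, nRBI, nBB,
           yrMajor, crAtBat, crHits, crHome, crRuns, crRbi,
           crBB, league, division, nOuts, nAssts, nError) = 0
    then nComplete + 1;
  /* Write the count of complete cases to the SAS log */
  if last then put nComplete=;
run;
```

Under that assumption, the value written to the log should agree with the "Number of Observations Used" entry in Figure 15.3.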



The "Selection Summary" table in Figure 15.4 shows the effect that was added or dropped at each step of the selection process together with fit statistics for the model at each step. In this case, both selection and stopping are based on the SBC statistic. Note that CrRuns, the first effect to enter, is removed again at step 4, and that the asterisk marks the step with the optimal SBC value.

Figure 15.4: Selection Summary Table

The HPREG Procedure

Selection Summary
Step  Effect Entered  Effect Removed  Number Effects In         SBC
   0  Intercept                                       1    -57.2041
   1  CrRuns                                          2   -194.3166
   2  nHits                                           3   -252.5794
   3  YrMajor                                         4   -262.7322
   4                  CrRuns                          3   -262.8353
   5  nBB                                             4   -269.7804*

* Optimal Value of Criterion




Figure 15.5 displays the "Stop Reason," "Selection Reason," and "Selected Effects" tables. Note that these tables are displayed without any titles. The "Stop Reason" table indicates that selection stopped because adding or removing any effect would worsen the SBC value that is used as the selection criterion. In this case, because no CHOOSE= criterion is specified in the SELECTION statement, the final model is the selected model; this is indicated in the "Selection Reason" table. The "Selected Effects" table lists the effects in the selected model.

Figure 15.5: Stopping and Selection Reasons

Stepwise selection stopped because adding or removing an effect does not improve the SBC criterion.

The model at step 5 is selected.

Selected Effects: Intercept nHits nBB YrMajor
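
For comparison, the following sketch shows how a CHOOSE= criterion might be requested. It assumes that AICC is an accepted value for CHOOSE=. With such a request, the selected model is the one with the best value of the chosen criterion among the steps visited, which need not be the final step:

```sas
proc hpreg data=sashelp.baseball;
  class league division;
  model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                    yrMajor crAtBat crHits crHome crRuns crRbi
                    crBB league division nOuts nAssts nError;
  /* Steps are still driven by the default SBC criteria; CHOOSE= only */
  /* decides which of the visited models is reported as selected.     */
  selection method=stepwise(choose=aicc);
run;
```

Whether this changes the selected model for these data is not shown here; the "Selection Reason" table reports which step was chosen and why.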



The "Analysis of Variance," "Fit Statistics," and "Parameter Estimates" tables shown in Figure 15.6 give details of the selected model.

Figure 15.6: Details of the Selected Model

Analysis of Variance
Source            DF  Sum of Squares  Mean Square  F Value  Pr > F
Model              3       120.52553     40.17518   120.12  <.0001
Error            259        86.62820      0.33447
Corrected Total  262       207.15373

Root MSE 0.57834
R-Square 0.58182
Adj R-Sq 0.57697
AIC -19.06903
AICC -18.83557
SBC -269.78041
ASE 0.32938

Parameter Estimates
Parameter  DF  Estimate  Standard Error  t Value  Pr > |t|
Intercept   1  4.013911        0.111290    36.07    <.0001
nHits       1  0.007929        0.000994     7.98    <.0001
nBB         1  0.007280        0.002049     3.55    0.0005
YrMajor     1  0.100663        0.007551    13.33    <.0001
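
The fit statistics in Figure 15.6 can be reproduced from the "Analysis of Variance" table. The following worked equations are a check rather than part of the procedure output. They use n = 263 observations and p = 4 parameters (including the intercept), and they assume the least squares form of the SBC, which agrees with the reported value:

```latex
\begin{align*}
R^2 &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
     = 1 - \frac{86.62820}{207.15373} \approx 0.5818 \\
\text{Adj } R^2 &= 1 - \frac{n-1}{n-p}\,\bigl(1 - R^2\bigr)
     = 1 - \frac{262}{259}\,(0.41818) \approx 0.5770 \\
F &= \frac{\mathrm{MSM}}{\mathrm{MSE}}
     = \frac{40.17518}{0.33447} \approx 120.12 \\
\text{Root MSE} &= \sqrt{\mathrm{MSE}} = \sqrt{0.33447} \approx 0.5783 \\
\mathrm{ASE} &= \frac{\mathrm{SSE}}{n} = \frac{86.62820}{263} \approx 0.3294 \\
\mathrm{SBC} &= n \ln\!\left(\frac{\mathrm{SSE}}{n}\right) + p \ln n
     = 263 \ln(0.32938) + 4 \ln(263) \approx -269.78
\end{align*}
```

The SBC value matches the optimal value marked in the selection summary, which is expected because the final step is the selected model.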



You might want to examine regression diagnostics for the selected model to investigate whether collinearity among the selected regressors or the presence of outlying or high-leverage observations might be affecting the fit. The following statements include some options and statements to obtain these diagnostics:

proc hpreg data=sashelp.baseball;
  id name;
  class league division;
  model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                    yrMajor crAtBat crHits crHome crRuns crRbi
                    crBB league division nOuts nAssts nError / vif clb;
  selection method=stepwise;
  output out=baseballOut p=predictedLogSalary r h cookd rstudent;
run;

The VIF and CLB options in the MODEL statement request variance inflation factors and 95% confidence limits for the parameter estimates. Figure 15.7 shows the "Parameter Estimates" with these requested statistics. The variance inflation factors (VIF) measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. Although there are no formal criteria for deciding whether a VIF is large enough to affect the predicted values, the VIF values for the selected effects in this example are small enough to indicate that there are no collinearity issues among the selected regressors.

Figure 15.7: Parameter Estimates with Additional Statistics

The HPREG Procedure
Selected Model

Parameter Estimates
Parameter  DF  Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation  95% Confidence Limits
Intercept   1  4.013911        0.111290    36.07    <.0001                   0     3.79476   4.23306
nHits       1  0.007929        0.000994     7.98    <.0001             1.49642     0.00597   0.00989
nBB         1  0.007280        0.002049     3.55    0.0005             1.52109     0.00325   0.01131
YrMajor     1  0.100663        0.007551    13.33    <.0001             1.02488     0.08579   0.11553
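
The additional columns can be checked by hand. The sketch below uses the YrMajor row of Figure 15.7 and assumes the usual definitions: confidence limits of the form estimate plus or minus a t quantile times the standard error, with the 259 error degrees of freedom, and VIF = 1/(1 - R2), where R2 comes from regressing that variable on the other selected regressors. The variable names in the DATA step are illustrative:

```sas
data _null_;
  /* Values copied from the YrMajor row of Figure 15.7 */
  estimate = 0.100663;
  se       = 0.007551;
  vif      = 1.02488;
  df       = 259;                       /* error degrees of freedom  */
  t        = quantile('T', 0.975, df);  /* two-sided 95% t quantile  */
  lower    = estimate - t*se;
  upper    = estimate + t*se;
  /* Invert VIF = 1/(1 - R2) to recover the implied R2 */
  r2       = 1 - 1/vif;
  put t= lower= upper= r2=;
run;
```

The limits written to the log should agree with the table up to rounding, and the small implied R2 is another way of seeing that YrMajor is nearly uncorrelated with the other two regressors.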



By default, high-performance statistical procedures do not include all variables from the input data set in output data sets. The ID statement specifies that the variable Name in the input data set be added as an identification variable in the baseballOut data set that is produced by the OUTPUT statement. In addition to this variable, the OUTPUT statement requests that predicted values, raw residuals, leverage values, Cook’s D statistics, and studentized residuals with the current observation deleted (RSTUDENT) be added to the output data set. Note that default names are used for these statistics except for the predicted values, for which a specified name, predictedLogSalary, is supplied. The following statements use PROC PRINT to display the first five observations of this output data set:


proc print data=baseballOut(obs=5);
run;

Figure 15.8: First 5 Observations of the baseballOut Data Set

Obs Name predictedLogSalary Residual H COOKD RSTUDENT
1 Allanson, Andy 4.73980 . 0.016087 . .
2 Ashby, Alan 6.34935 -0.18603 0.012645 .000335535 -0.32316
3 Davis, Alan 5.89993 0.27385 0.019909 .001161794 0.47759
4 Dawson, Andre 6.50852 -0.29392 0.011060 .000730178 -0.51031
5 Galarraga, Andres 5.12344 -0.60711 0.009684 .002720358 -1.05510
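
In Figure 15.8, the first observation has a predicted value and a leverage value but missing residual-based statistics, which is what you would expect for a player whose salary is missing. To screen the remaining observations, you might flag those with unusual diagnostics. The following is a sketch only: the cutoffs (2 for the studentized residual, 4/n for Cook's D, and 2p/n for leverage) are common rules of thumb rather than values taken from this example, and work.flagged is a hypothetical data set name:

```sas
data work.flagged;
  set baseballOut;
  n = 263;  /* observations used in the analysis     */
  p = 4;    /* parameters in the selected model      */
  /* Keep observations that exceed any rule-of-thumb cutoff. */
  /* Missing statistics compare as false and are not flagged. */
  if abs(rstudent) > 2 or cookd > 4/n or h > 2*p/n;
  drop n p;
run;

proc print data=work.flagged;
run;
```

Observations that appear in this listing are candidates for closer inspection, not automatic exclusions.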