The following example is closely modeled on the example in the section "Getting Started: GLMSELECT Procedure" in the SAS/STAT User's Guide.
The Sashelp.Baseball data set contains salary and performance information for Major League Baseball players, excluding pitchers, who played at least one game in both the 1986 and 1987 seasons. The salaries (Sports Illustrated, April 20, 1987) are for the 1987 season, and the performance measures are from the 1986 season (Collier Books, The 1987 Baseball Encyclopedia Update). The following step displays the variables in the data set, as shown in Figure 15.1:
proc contents varnum data=sashelp.baseball;
   ods select position;
run;
Figure 15.1: Sashelp.Baseball Data Set

Variables in Creation Order

\# | Variable | Type | Len | Label |
---|---|---|---|---|
1 | Name | Char | 18 | Player's Name |
2 | Team | Char | 14 | Team at the End of 1986 |
3 | nAtBat | Num | 8 | Times at Bat in 1986 |
4 | nHits | Num | 8 | Hits in 1986 |
5 | nHome | Num | 8 | Home Runs in 1986 |
6 | nRuns | Num | 8 | Runs in 1986 |
7 | nRBI | Num | 8 | RBIs in 1986 |
8 | nBB | Num | 8 | Walks in 1986 |
9 | YrMajor | Num | 8 | Years in the Major Leagues |
10 | CrAtBat | Num | 8 | Career Times at Bat |
11 | CrHits | Num | 8 | Career Hits |
12 | CrHome | Num | 8 | Career Home Runs |
13 | CrRuns | Num | 8 | Career Runs |
14 | CrRbi | Num | 8 | Career RBIs |
15 | CrBB | Num | 8 | Career Walks |
16 | League | Char | 8 | League at the End of 1986 |
17 | Division | Char | 8 | Division at the End of 1986 |
18 | Position | Char | 8 | Position(s) in 1986 |
19 | nOuts | Num | 8 | Put Outs in 1986 |
20 | nAssts | Num | 8 | Assists in 1986 |
21 | nError | Num | 8 | Errors in 1986 |
22 | Salary | Num | 8 | 1987 Salary in $ Thousands |
23 | Div | Char | 16 | League and Division |
24 | logSalary | Num | 8 | Log Salary |
Suppose you want to investigate whether you can model the players' salaries for the 1987 season based on performance measures for the previous season. The aim is to obtain a parsimonious model that does not overfit these particular data, so that it is useful for prediction. This example shows how you can use PROC HPREG as a starting point for such an analysis. Because the variation in salaries is much greater for the higher salaries, it is appropriate to apply a log transformation to the salaries before doing the model selection.
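The logSalary variable is already supplied in Sashelp.Baseball, so this example needs no transformation step. For your own data, the transformation might be sketched as follows. The sketch assumes that the natural logarithm is intended; the names work.baseballCheck and logSalaryCheck are hypothetical and are used only for illustration.

```sas
/* Sketch: derive a log-transformed response in a working copy of the data. */
data work.baseballCheck;
   set sashelp.baseball;
   /* LOG is the natural logarithm. Missing or nonpositive salaries */
   /* fail the comparison, so logSalaryCheck stays missing for them. */
   if Salary > 0 then logSalaryCheck = log(Salary);
run;
```

Comparing logSalaryCheck with the supplied logSalary variable is one way to confirm which transformation was applied.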
The following statements select a model with the default settings for stepwise selection:
proc hpreg data=sashelp.baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi
                     crBB league division nOuts nAssts nError;
   selection method=stepwise;
run;
The default output from this analysis is presented in Figure 15.2 through Figure 15.6.
Figure 15.2: Performance, Data Access, Model, and Selection Information
Figure 15.2 displays the "Performance Information," "Data Access Information," "Model Information," and "Selection Information" tables. The "Performance Information" table shows that the procedure executes in single-machine mode; that is, the model is fit on the machine where the SAS session executes. This run of the HPREG procedure was performed on a multicore machine with four CPUs; one computational thread was spawned per CPU.
The "Data Access Information" table shows that the input data set is accessed with the V9 (base) engine on the client machine.
The "Model Information" table identifies the data source and response and shows that the CLASS variables are parameterized in the GLM parameterization, which is the default.
The "Selection Information" table provides details about the method and criteria used to perform the model selection. The requested selection method is a variant of the traditional stepwise selection in which the decisions about which effects to add or drop at any step and when to terminate the selection are both based on the Schwarz Bayesian information criterion (SBC). The effect in the current model whose removal yields the maximal decrease in the SBC statistic is dropped, provided that this lowers the SBC value. When no further decrease in the SBC value can be obtained by dropping an effect in the model, the effect whose addition to the model yields the lowest SBC statistic is added, and the whole process is repeated. The method terminates when dropping or adding any effect increases the SBC statistic.
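For reference, the SBC statistic for a least squares model is commonly written in the following form. The symbols are introduced here for illustration and do not appear in the text above: n is the number of observations used, p is the number of parameters including the intercept, and SSE is the error sum of squares.

```latex
\mathrm{SBC} = n \ln\!\left(\frac{\mathrm{SSE}}{n}\right) + p \ln(n)
```

Under this form, adding one parameter lowers the SBC value only if it reduces the first term by more than ln(n), which is about 5.57 for the 263 observations used in this example.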
Figure 15.3 displays the "Number of Observations," "Class Level Information," and "Dimensions" tables. The "Number of Observations" table shows that of the 322 observations in the input data, only 263 are used in the analysis; the remaining 59 observations have incomplete data. The "Class Level Information" table lists the levels of the classification variables division and league. When you specify effects that contain classification variables, the number of parameters is usually larger than the number of effects. The "Dimensions" table shows the number of effects and the number of parameters considered.
Figure 15.3: Number of Observations, Class Levels, and Dimensions
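To see where the incomplete data occur, you might count the missing values for the response and the numeric regressors. The following step is a minimal sketch that uses PROC MEANS; it is not part of the original example.

```sas
/* Sketch: report nonmissing (N) and missing (NMISS) counts */
/* for the response and the numeric regressors in the model. */
proc means data=sashelp.baseball n nmiss;
   var logSalary nAtBat nHits nHome nRuns nRBI nBB yrMajor
       crAtBat crHits crHome crRuns crRbi crBB
       nOuts nAssts nError;
run;
```

An observation is excluded from the analysis if any of these variables, or either classification variable, has a missing value.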
The "Stepwise Selection Summary" table in Figure 15.4 shows the effect that was added or dropped at each step of the selection process together with fit statistics for the model at each step. In this case, both selection and stopping are based on the SBC statistic.
Figure 15.4: Selection Summary Table
Figure 15.5 displays the "Stop Reason," "Selection Reason," and "Selected Effects" tables. Note that these tables are displayed without any titles. The "Stop Reason" table indicates that selection stopped because adding or removing any effect would worsen the SBC value that is used as the selection criterion. In this case, because no CHOOSE= criterion is specified in the SELECTION statement, the final model is the selected model; this is indicated in the "Selection Reason" table. The "Selected Effects" table lists the effects in the selected model.
Figure 15.5: Stopping and Selection Reasons
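If you want the final model to be chosen by a criterion other than the one that drives the steps, you can specify a CHOOSE= criterion. The following statements are a sketch only: the choice of AIC is arbitrary, and the sketch assumes that your release accepts CHOOSE=AIC as a method option for stepwise selection.

```sas
/* Sketch: same candidate effects as before, but the final model is */
/* the one in the step sequence with the best AIC value.            */
/* CHOOSE=AIC is an illustrative choice, not a recommendation.      */
proc hpreg data=sashelp.baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi
                     crBB league division nOuts nAssts nError;
   selection method=stepwise(choose=aic);
run;
```

With a CHOOSE= criterion, the selected model need not be the model at the final step, and the "Selection Reason" table reports which criterion determined the choice.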
The "Analysis of Variance," "Fit Statistics," and "Parameter Estimates" tables shown in Figure 15.6 give details of the selected model.
Figure 15.6: Details of the Selected Model
You might want to examine regression diagnostics for the selected model to investigate whether collinearity among the selected parameters or the presence of outlying or high-leverage observations might be affecting the fit. The following statements include some options and statements to obtain these diagnostics:
proc hpreg data=sashelp.baseball;
   id name;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi
                     crBB league division nOuts nAssts nError / vif clb;
   selection method=stepwise;
   output out=baseballOut
          p=predictedLogSalary r h cookd rstudent;
run;
The VIF and CLB options in the MODEL statement request variance inflation factors and 95% confidence limits for the parameter estimates. Figure 15.7 shows the "Parameter Estimates" table with these requested statistics. The variance inflation factors (VIF) measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. Although there are no formal criteria for deciding whether a VIF is large enough to affect the predicted values, the VIF values for the selected effects in this example are small enough to indicate that there are no collinearity issues among the selected regressors.
Figure 15.7: Parameter Estimates with Additional Statistics
Parameter Estimates

Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| | Variance Inflation | Lower 95% Confidence Limit | Upper 95% Confidence Limit |
---|---|---|---|---|---|---|---|---|
Intercept | 1 | 4.013911 | 0.111290 | 36.07 | <.0001 | 0 | 3.79476 | 4.23306 |
nHits | 1 | 0.007929 | 0.000994 | 7.98 | <.0001 | 1.49642 | 0.00597 | 0.00989 |
nBB | 1 | 0.007280 | 0.002049 | 3.55 | 0.0005 | 1.52109 | 0.00325 | 0.01131 |
YrMajor | 1 | 0.100663 | 0.007551 | 13.33 | <.0001 | 1.02488 | 0.08579 | 0.11553 |
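The variance inflation factors in Figure 15.7 can be read through the standard definition below. The notation is an assumption introduced here: R_j^2 denotes the R-square value obtained by regressing the jth regressor on the other regressors in the selected model.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

For example, the value 1.49642 for nHits corresponds to an R_j^2 of about 0.33, so roughly one third of the variation in nHits is explained by the other selected regressors.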
By default, high-performance statistical procedures do not include all variables from the input data set in output data sets. The ID statement specifies that the variable name in the input data set be added as an identification variable in the baseballOut data set that is produced by the OUTPUT statement. In addition to this variable, the OUTPUT statement requests that predicted values, raw residuals, leverage values, Cook’s D statistics, and studentized residuals be added in the output data set. Note that default names are used for these statistics except for the predicted values, for which a specified name, predictedLogSalary, is supplied. The following statements use PROC PRINT to display the first five observations of this output data set:
proc print data=baseballOut(obs=5); run;
Figure 15.8: First 5 Observations of the baseballOut Data Set
Obs | Name | predictedLogSalary | Residual | H | COOKD | RSTUDENT |
---|---|---|---|---|---|---|
1 | Allanson, Andy | 4.73980 | . | 0.016087 | . | . |
2 | Ashby, Alan | 6.34935 | -0.18603 | 0.012645 | .000335535 | -0.32316 |
3 | Davis, Alan | 5.89993 | 0.27385 | 0.019909 | .001161794 | 0.47759 |
4 | Dawson, Andre | 6.50852 | -0.29392 | 0.011060 | .000730178 | -0.51031 |
5 | Galarraga, Andres | 5.12344 | -0.60711 | 0.009684 | .002720358 | -1.05510 |
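To look for influential observations rather than the first five, you might sort the output data set by Cook's D. The following steps are a minimal sketch; the data set name baseballInfluence is hypothetical, and the variable names are the ones shown in Figure 15.8.

```sas
/* Sketch: list the five observations with the largest Cook's D.  */
/* Missing values sort lowest, so DESCENDING places them last.    */
proc sort data=baseballOut out=baseballInfluence;
   by descending cookd;
run;

proc print data=baseballInfluence(obs=5);
   var name predictedLogSalary residual h cookd rstudent;
run;
```

Sorting by the absolute value of RSTUDENT or by H instead would highlight outlying or high-leverage observations in the same way.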