The following example is closely modeled on the example in the section “Getting Started: GLMSELECT Procedure” in the SAS/STAT User's Guide. The data set contains salary and performance information for Major League Baseball players (excluding pitchers) who played at least one game in both the 1986 and 1987 seasons. The salaries are for the 1987 season (Sports Illustrated, April 20, 1987) and the performance measures are from 1986 (Collier Books, The 1987 Baseball Encyclopedia Update).

    data baseball;
       length name $ 18;
       length team $ 12;
       input name $ 1-18 nAtBat nHits nHome nRuns nRBI nBB
             yrMajor crAtBat crHits crHome crRuns crRbi crBB
             league $ division $ team $ position $
             nOuts nAssts nError salary;
       label name="Player's Name"
             nAtBat="Times at Bat in 1986"
             nHits="Hits in 1986"
             nHome="Home Runs in 1986"
             nRuns="Runs in 1986"
             nRBI="RBIs in 1986"
             nBB="Walks in 1986"
             yrMajor="Years in the Major Leagues"
             crAtBat="Career times at bat"
             crHits="Career Hits"
             crHome="Career Home Runs"
             crRuns="Career Runs"
             crRbi="Career RBIs"
             crBB="Career Walks"
             league="League at the end of 1986"
             division="Division at the end of 1986"
             team="Team at the end of 1986"
             position="Position(s) in 1986"
             nOuts="Put Outs in 1986"
             nAssts="Assists in 1986"
             nError="Errors in 1986"
             salary="1987 Salary in $ Thousands";
       logSalary = log(Salary);
       datalines;
    Allanson, Andy       293    66     1    30    29    14
                           1   293    66     1    30    29    14
                         American East Cleveland C 446 33 20 .
    Ashby, Alan          315    81     7    24    38    39
                          14  3449   835    69   321   414   375
                         National West Houston C 632 43 10 475
    Davis, Alan          479   130    18    66    72    76
    ... more lines ...
    Wilson, Willie       631   170     9    77    44    31
                          11  4908  1457    30   775   357   249
                         American West KansasCity CF 408 4 3 1000
    ;

Suppose you want to investigate whether you can model the players’ salaries for the 1987 season based on performance measures for the previous season. The aim is to obtain a parsimonious model that does not overfit this particular data set, so that it is useful for prediction. This example shows how you can use PROC HPREG as a starting point for such an analysis. Because the variation of salaries is much greater for the higher salaries, it is appropriate to apply a log transformation to the salaries before doing the model selection.
The following statements select a model with the default settings for stepwise selection:

    proc hpreg data=baseball;
       class league division;
       model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                         yrMajor crAtBat crHits crHome crRuns crRbi crBB
                         league division nOuts nAssts nError;
       selection method=stepwise;
    run;

The default output from this analysis is presented in Figure 8.1 through Figure 8.5.
Figure 8.1: Performance, Model, and Selection Information
Performance Information | |
---|---|
Execution Mode | Single-Machine |
Number of Threads | 4 |
Model Information | |
---|---|
Data Source | WORK.BASEBALL |
Dependent Variable | logSalary |
Class Parameterization | GLM |
Selection Information | |
---|---|
Selection Method | Stepwise |
Select Criterion | SBC |
Stop Criterion | SBC |
Effect Hierarchy Enforced | None |
Stop Horizon | 3 |
Figure 8.1 displays the “Performance Information,” “Model Information,” and “Selection Information” tables. The “Performance Information” table shows that the procedure executes in single-machine mode; that is, the model is fit on the machine where the SAS session executes. This run of the HPREG procedure was performed on a multicore machine with four CPUs; one computational thread was spawned per CPU.
The “Model Information” table identifies the data source and response and shows that the CLASS variables are parameterized in the GLM parameterization, which is the default.
The “Selection Information” table provides details about the method and criteria used to perform the model selection. The requested selection method is a variant of the traditional stepwise selection in which the decisions about which effects to add or drop at each step, and about when to terminate the selection, are both based on the Schwarz Bayesian information criterion (SBC). At each step, the effect in the current model whose removal yields the maximal decrease in the SBC statistic is dropped, provided that this lowers the SBC value. When no further decrease in the SBC value can be obtained by dropping an effect, the effect whose addition to the model yields the lowest SBC statistic is added, and the whole process is repeated. The method terminates when dropping or adding any effect increases the SBC statistic.
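The criteria reported in the “Selection Information” table can also be written out explicitly. The following is a minimal sketch, assuming that the SELECT= and STOP= suboptions of METHOD=STEPWISE accept SBC as a value; it is intended to request the same analysis as the default run rather than a different one:

```sas
/* Sketch: spell out the criteria that the "Selection Information" */
/* table reports for the default stepwise run (assumed to be       */
/* equivalent to specifying METHOD=STEPWISE with no suboptions).   */
proc hpreg data=baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi crBB
                     league division nOuts nAssts nError;
   selection method=stepwise(select=sbc stop=sbc);
run;
```

Writing the criteria explicitly makes the program self-documenting and guards against a reader assuming significance-level-based stepwise selection.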
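Figure 8.2 displays the “Number of Observations,” “Class Level Information,” and “Dimensions” tables. The “Number of Observations” table shows that of the 322 observations in the input data, only 263 are used in the analysis; the remaining 59 observations have incomplete data (for example, the salary for Andy Allanson is missing, so logSalary is missing for that observation). The “Class Level Information” table lists the levels of the classification variables “division” and “league.” When you specify effects that contain classification variables, the number of parameters is usually larger than the number of effects. The “Dimensions” table shows the number of effects and the number of parameters considered. Here the 19 effects are the intercept, 16 continuous regressors, and 2 classification effects; because each classification effect has two levels in the GLM parameterization, the number of parameters is 1 + 16 + 2 × 2 = 21.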
Figure 8.2: Number of Observations, Class Levels, and Dimensions
Number of Observations Read | 322 |
---|---|
Number of Observations Used | 263 |
Class Level Information | ||
---|---|---|
Class | Levels | Values |
league | 2 | American National |
division | 2 | East West |
Dimensions | |
---|---|
Number of Effects | 19 |
Number of Parameters | 21 |
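
One way to see which variables account for the unused observations is to count missing values before fitting the model. The following sketch uses PROC MEANS with the N and NMISS statistics; this step is not part of the original example and is shown only as an illustration. An observation is excluded from the analysis when the response or any variable in the model is missing.

```sas
/* Sketch: count nonmissing (N) and missing (NMISS) values for the */
/* response and the numeric regressors in the baseball data set.   */
proc means data=baseball n nmiss;
   var logSalary nAtBat nHits nHome nRuns nRBI nBB
       yrMajor crAtBat crHits crHome crRuns crRbi crBB
       nOuts nAssts nError;
run;
```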
The “Selection Summary” table in Figure 8.3 shows the effect that was added or dropped at each step of the selection process, together with the SBC statistic for the model at each step. In this case, both selection and stopping are based on the SBC statistic.
Figure 8.3: Selection Summary Table
Selection Summary | ||||
---|---|---|---|---|
Step | Effect Entered | Effect Removed | Number Effects In | SBC |
0 | Intercept | | 1 | -57.2041 |
1 | crRuns | | 2 | -194.3166 |
2 | nHits | | 3 | -252.5794 |
3 | yrMajor | | 4 | -262.7322 |
4 | | crRuns | 3 | -262.8353 |
5 | nBB | | 4 | -269.7804* |
* Optimal Value of Criterion |
Figure 8.4 displays the “Stop Reason,” “Selection Reason,” and “Selected Effects” tables. Note that these tables are displayed without any titles. The “Stop Reason” table indicates that selection stopped because adding or removing any effect would worsen the SBC value that is used as the selection criterion. In this case, because no CHOOSE= criterion is specified in the SELECTION statement, the final model is the selected model; this is indicated in the “Selection Reason” table. The “Selected Effects” table lists the effects in the selected model.
Figure 8.4: Stopping and Selection Reasons
Stepwise selection stopped because adding or removing an effect does not improve the SBC criterion. |
The model at step 5 is selected. |
Selected Effects: | Intercept nHits nBB yrMajor |
---|---|
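
If you want the final model to be chosen by a criterion other than the one that drives the individual steps, you can add the CHOOSE= suboption mentioned above. The following is a hedged sketch; AICC is used purely as an illustration and is assumed to be an accepted CHOOSE= value:

```sas
/* Sketch: the steps and the stopping rule still use SBC (the      */
/* defaults); the final model is the one in the step sequence      */
/* that has the best value of the CHOOSE= criterion.               */
proc hpreg data=baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor crAtBat crHits crHome crRuns crRbi crBB
                     league division nOuts nAssts nError;
   selection method=stepwise(choose=aicc);
run;
```

When CHOOSE= is specified, the selected model is not necessarily the model at the final step, so the “Selection Reason” table is the place to check which step was chosen.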
The “Analysis of Variance,” “Fit Statistics,” and “Parameter Estimates” tables shown in Figure 8.5 give details of the selected model.
Figure 8.5: Details of the Selected Model
Analysis of Variance | |||||
---|---|---|---|---|---|
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 3 | 120.52553 | 40.17518 | 120.12 | <.0001 |
Error | 259 | 86.62820 | 0.33447 | ||
Corrected Total | 262 | 207.15373 |
Root MSE | 0.57834 |
---|---|
R-Square | 0.58182 |
Adj R-Sq | 0.57697 |
AIC | -19.06903 |
AICC | -18.83557 |
SBC | -269.78041 |
ASE | 0.32938 |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | t Value | Pr > |t| |
Intercept | 1 | 4.013911 | 0.111290 | 36.07 | <.0001 |
nHits | 1 | 0.007929 | 0.000994 | 7.98 | <.0001 |
nBB | 1 | 0.007280 | 0.002049 | 3.55 | 0.0005 |
yrMajor | 1 | 0.100663 | 0.007551 | 13.33 | <.0001 |
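
The fit statistics in Figure 8.5 can be checked by hand from the “Analysis of Variance” table. The following worked computation is a sketch: the symbols $n$ (observations used), $p$ (parameters in the selected model, including the intercept), $\mathrm{SSE}$, and $\mathrm{SST}$ are introduced here for illustration, and the formulas are the forms that reproduce the displayed values rather than definitions quoted from the procedure documentation.

$$
\begin{aligned}
n &= 263, \qquad p = 4, \qquad \mathrm{SSE} = 86.62820, \qquad \mathrm{SST} = 207.15373 \\
R^2 &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{86.62820}{207.15373} \approx 0.58182 \\
\mathrm{ASE} &= \frac{\mathrm{SSE}}{n} = \frac{86.62820}{263} \approx 0.32938 \\
\mathrm{SBC} &= n \ln\!\left(\frac{\mathrm{SSE}}{n}\right) + p \ln n \approx -292.069 + 22.289 = -269.780 \\
\mathrm{AIC} &= n \ln\!\left(\frac{\mathrm{SSE}}{n}\right) + 2p + n + 2 \approx -292.069 + 8 + 265 = -19.069
\end{aligned}
$$

The SBC value agrees with the starred entry at step 5 of the “Selection Summary” table in Figure 8.3.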
You might want to examine regression diagnostics for the selected model to investigate whether collinearity among the selected regressors, or the presence of outlying or high-leverage observations, is affecting the fit. The following statements include some options and statements to obtain these diagnostics:

    proc hpreg data=baseball;
       id name;
       class league division;
       model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                         yrMajor crAtBat crHits crHome crRuns crRbi crBB
                         league division nOuts nAssts nError / vif clb;
       selection method=stepwise;
       output out=baseballOut
              p=predictedLogSalary r h cookd rstudent;
    run;

The VIF and CLB options in the MODEL statement request variance inflation factors and 95% confidence limits for the parameter estimates. Figure 8.6 shows the “Parameter Estimates” with these requested statistics. The variance inflation factors (VIF) measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. Although there are no formal criteria for deciding whether a VIF is large enough to affect the predicted values, the VIF values for the selected effects in this example are small enough to indicate that there are no collinearity issues among the selected regressors.
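For reference, the variance inflation factor of the $j$th regressor is conventionally defined from the R-square obtained by regressing that regressor on the other regressors in the model. The notation $R_j^2$ is introduced here for illustration; the numerical values are taken from Figure 8.6.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\qquad\Longleftrightarrow\qquad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j}
$$

$$
R_{\mathrm{nHits}}^2 = 1 - \frac{1}{1.49642} \approx 0.332, \qquad
R_{\mathrm{nBB}}^2 = 1 - \frac{1}{1.52109} \approx 0.343, \qquad
R_{\mathrm{yrMajor}}^2 = 1 - \frac{1}{1.02488} \approx 0.024
$$

Read this way, only about a third of the variation in nHits or nBB, and very little of the variation in yrMajor, is explained by the other selected regressors, which is consistent with the conclusion that collinearity is not a concern here.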
Figure 8.6: Parameter Estimates with Additional Statistics
Parameter Estimates | ||||||||
---|---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | t Value | Pr > |t| | Variance Inflation | 95% Confidence Limits | |
Intercept | 1 | 4.013911 | 0.111290 | 36.07 | <.0001 | 0 | 3.79476 | 4.23306 |
nHits | 1 | 0.007929 | 0.000994 | 7.98 | <.0001 | 1.49642 | 0.00597 | 0.00989 |
nBB | 1 | 0.007280 | 0.002049 | 3.55 | 0.0005 | 1.52109 | 0.00325 | 0.01131 |
yrMajor | 1 | 0.100663 | 0.007551 | 13.33 | <.0001 | 1.02488 | 0.08579 | 0.11553 |
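
The confidence limits in Figure 8.6 have the usual form of the estimate plus or minus a $t$ quantile times the standard error. As a sketch, with 259 error degrees of freedom the 0.975 quantile is approximately 1.969 (this value is computed here and does not appear in the output), so for nHits:

$$
\hat{\beta}_{\mathrm{nHits}} \pm t_{0.975,\,259}\,\mathrm{SE}\bigl(\hat{\beta}_{\mathrm{nHits}}\bigr)
= 0.007929 \pm 1.969 \times 0.000994
\approx 0.007929 \pm 0.001957
= (0.00597,\; 0.00989)
$$
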
By default, High-Performance Analytics procedures do not include all variables from the input data set in output data sets. The ID statement specifies that the variable name in the input data set be added as an identification variable in the baseballOut data set that is produced by the OUTPUT statement. In addition to this variable, the OUTPUT statement requests that predicted values, raw residuals, leverage values, Cook’s D statistics, and studentized residuals be added to the output data set. Default names are used for all of these statistics except the predicted values, for which the name predictedLogSalary is specified. The following statements use PROC PRINT to display the first five observations of this output data set:

    proc print data=baseballOut(obs=5);
    run;

Figure 8.7: First 5 Observations of the baseballOut Data Set
Obs | name | predictedLogSalary | Residual | H | COOKD | RSTUDENT |
---|---|---|---|---|---|---|
1 | Allanson, Andy | 4.73980 | . | 0.016087 | . | . |
2 | Ashby, Alan | 6.34935 | -0.18603 | 0.012645 | .000335535 | -0.32316 |
3 | Davis, Alan | 5.89993 | 0.27385 | 0.019909 | .001161794 | 0.47759 |
4 | Dawson, Andre | 6.50852 | -0.29392 | 0.011060 | .000730178 | -0.51031 |
5 | Galarraga, Andres | 5.12344 | -0.60711 | 0.009684 | .002720358 | -1.05510 |
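
The diagnostics in the output data set can be screened with a short DATA step. The following is a sketch under stated assumptions: the cutoffs 2p/n for leverage, 4/n for Cook’s D, and 2 for the absolute studentized residual are common rules of thumb rather than anything prescribed by this example; the data set name flagged and the three indicator variables are hypothetical; and p = 4 and n = 263 are taken from the selected model and from Figure 8.2.

```sas
/* Sketch: keep observations that exceed rule-of-thumb cutoffs.    */
/* Missing diagnostics compare as false in SAS, so an observation  */
/* with a missing response is kept only if its leverage is high.   */
data flagged;                          /* hypothetical name        */
   set baseballOut;
   p = 4;                              /* parameters in the model  */
   n = 263;                            /* observations used        */
   highLeverage = (H > 2*p/n);         /* leverage cutoff 2p/n     */
   highCookD    = (COOKD > 4/n);       /* Cook's D cutoff 4/n      */
   largeResid   = (abs(RSTUDENT) > 2); /* studentized residual     */
   if highLeverage or highCookD or largeResid;
   drop p n;
run;

proc print data=flagged;
run;
```

None of the five observations in Figure 8.7 would be flagged by these cutoffs: the largest leverage shown is 0.019909, which is below 2 × 4 / 263 ≈ 0.0304, and the largest absolute studentized residual is 1.05510.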