Statistical procedures use ODS Graphics to create graphs as part of their output. ODS Graphics is described in detail in Chapter 21: Statistical Graphics Using ODS.
Before you create graphs, ODS Graphics must be enabled (for example, by specifying the ODS GRAPHICS ON statement). For more information about enabling and disabling ODS Graphics, see the section Enabling and Disabling ODS Graphics in Chapter 21: Statistical Graphics Using ODS.
The overall appearance of graphs is controlled by ODS styles. Styles and other aspects of using ODS Graphics are discussed in the section A Primer on ODS Statistical Graphics in Chapter 21: Statistical Graphics Using ODS.
You must also specify the PLOTS= option in the PROC GLMSELECT statement.
The following sections describe the ODS graphical displays produced by PROC GLMSELECT. The examples use the Baseball data set that is described in the section Getting Started: GLMSELECT Procedure.
PROC GLMSELECT assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 45.12.
Table 45.12: Graphs Produced by PROC GLMSELECT
ODS Graph Name | Plot Description | PLOTS Option |
---|---|---|
AdjRSqPlot | Adjusted R-square by step | CRITERIA(UNPACK) |
AICCPlot | Corrected Akaike’s information criterion by step | CRITERIA(UNPACK) |
AICPlot | Akaike’s information criterion by step | CRITERIA(UNPACK) |
ASEPlot | Average square errors by step | ASE |
BICPlot | Sawa’s Bayesian information criterion by step | CRITERIA(UNPACK) |
CandidatesPlot | SELECT= criterion by effect | CANDIDATES |
ChooseCriterionPlot | CHOOSE= criterion by step | COEFFICIENTS(UNPACK) |
CoefficientPanel | Coefficients and CHOOSE= criterion by step | COEFFICIENTS |
CoefficientPlot | Coefficients by step | COEFFICIENTS(UNPACK) |
CPPlot | Mallows’ C(p) by step | CRITERIA(UNPACK) |
CriterionPanel | Fit criteria by step | CRITERIA |
CVPRESSPlot | Cross validation predicted RSS by step | CRITERIA(UNPACK) |
EffectSelectPctPlot | Resampling effect selection percentages | EFFECTSELECTPCT |
ParmDistPanel | Resampling parameter estimate distributions | PARMDIST |
PRESSPlot | Predicted RSS by step | CRITERIA(UNPACK) |
SBCPlot | Schwarz Bayesian information criterion by step | CRITERIA(UNPACK) |
ValidateASEPlot | Average square error on validation data by step | CRITERIA(UNPACK) |
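The graph names in Table 45.12 can be used in ODS statements such as ODS SELECT. The following statements are a minimal sketch, not taken from the original examples: they assume that the Baseball data set has already been created, they use a shortened effect list purely for illustration, and they assume that UNPACK can be given as a suboption of CRITERIA, as the table's CRITERIA(UNPACK) entries suggest.

```sas
ods graphics on;

/* Keep only two of the unpacked criterion plots;      */
/* all other output from the next step is suppressed.  */
ods select AICCPlot SBCPlot;

proc glmselect data=baseball plots=criteria(unpack);
   class league division;
   /* Shortened effect list, chosen only for illustration */
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     league division nOuts nAssts nError
                   / selection=forward(stop=AICC choose=SBC);
run;
```

Because the ODS SELECT statement names graphs rather than tables, the selection summary and parameter estimates tables are not displayed for this step.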
You request the “Candidates Plot” by specifying the PLOTS=CANDIDATES option in the PROC GLMSELECT statement and the DETAILS=STEPS option in the MODEL statement. This plot shows the values of the selection criterion for the candidate effects for entry or removal, sorted from best to worst from left to right across the plot. The leftmost candidate displayed is the effect selected for entry or removal at that step. You can use this plot to see at what steps the decision about which effect to add or drop is clear-cut. See Figure 45.5 for an example.
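The two options that are needed for the candidates plot can be combined as in the following sketch. The shortened effect list and the STOP=AICC stopping rule are illustrative assumptions, not the settings that produce Figure 45.5.

```sas
ods graphics on;

proc glmselect data=baseball plots=candidates;
   class league division;
   /* DETAILS=STEPS requests step-by-step details, which the  */
   /* candidates plot requires. The effect list is shortened  */
   /* for illustration only.                                  */
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     league division nOuts nAssts nError
                   / selection=forward(stop=AICC) details=steps;
run;
```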
When you specify the PLOTS=COEFFICIENTS option in the PROC GLMSELECT statement, PROC GLMSELECT produces a panel of two plots showing how the standardized coefficients and the criterion used to choose the final model evolve as the selection progresses. The following statements provide an example:
ods graphics on;

proc glmselect data=baseball plots=coefficients;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor|yrMajor crAtBat|crAtBat crHits|crHits
                     crHome|crHome crRuns|crRuns crRbi|crRbi
                     crBB|crBB league division nOuts nAssts nError
                   / selection=forward(stop=AICC CHOOSE=SBC);
run;
Figure 45.16 shows the requested graphic. The upper plot in the panel displays the standardized coefficients as a function of the step number. You can request standardized coefficients in the parameter estimates tables by specifying the STB option in the MODEL statement, but this option is not required to produce this plot. To help in tracing the changes in a parameter, the standardized coefficients for each parameter are connected by lines. Coefficients corresponding to effects that are not in the selected model at a step are zero and hence not observable. For example, consider the parameter crAtBat*crAtBat in Figure 45.16. Because crAtBat*crAtBat enters the model at step 2, the line that represents this parameter starts rising from zero at step 1 when crRuns enters the model. Parameters that are nonzero at the final step of the selection are labeled if their magnitudes are greater than 1% of the range of the magnitudes of all the nonzero parameters at this step. To avoid collision, labels corresponding to parameters with similar values at the final step might get suppressed. You can control when this label collision avoidance occurs by using the LABELGAP= suboption of the PLOTS=COEFFICIENTS option. Planned enhancements to the automatic label collision avoidance algorithm will obviate the need for this option in future releases of the GLMSELECT procedure.
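The LABELGAP= suboption might be used as in the following sketch. The value 10 is arbitrary. The sketch assumes that the value is interpreted as a percentage of the vertical axis range, so that a larger value suppresses more closely spaced labels. Check the PLOTS= option syntax for your release before relying on this.

```sas
ods graphics on;

/* LABELGAP=10 is an illustrative value, assumed to be a percentage */
/* of the vertical axis range; the default may differ by release.   */
proc glmselect data=baseball plots=coefficients(labelgap=10);
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor|yrMajor crAtBat|crAtBat crHits|crHits
                     crHome|crHome crRuns|crRuns crRbi|crRbi
                     crBB|crBB league division nOuts nAssts nError
                   / selection=forward(stop=AICC choose=SBC);
run;
```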
The lower plot in the panel shows how the criterion used to choose among the examined models progresses. The selected step occurs at the optimal value of this criterion. In this example, this criterion is the SBC criterion and it achieves its minimal value at step 9 of the forward selection.
In some cases, particularly when the final step contains a large number of parameters, you might be interested in using this plot only to discern if and when the parameters in the model are essentially unchanged beyond a certain step. In such cases, you might want to suppress the labeling of the parameters and use a numeric horizontal axis. You can do this by using the STEPAXIS= and MAXPARMLABEL= suboptions of the PLOTS= option. The following statements provide an example:
proc glmselect data=baseball
               plots(unpack maxparmlabel=0 stepaxis=number)=coefficients;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor|yrMajor crAtBat|crAtBat crHits|crHits
                     crHome|crHome crRuns|crRuns crRbi|crRbi
                     crBB|crBB league division nOuts nAssts nError
                   / selection=forward(stop=none);
run;
The UNPACK option requests that the coefficients and the CHOOSE= criterion be shown in separate plots. The STEPAXIS=NUMBER option requests a numeric horizontal axis showing step number, and the MAXPARMLABEL=0 option suppresses the labels for the parameters. The “Coefficient Plot” is shown in Figure 45.17. You can see that the standardized coefficients do not vary greatly after step 16.
You request the criterion panel by specifying the PLOTS=CRITERIA option in the PROC GLMSELECT statement. This panel displays the progression of the ADJRSQ, AIC, AICC, and SBC criteria, as well as any other criteria that are named in the CHOOSE=, SELECT=, STOP=, or STATS= option in the MODEL statement.
The following statements provide an example:
proc glmselect data=baseball plots=criteria;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor|yrMajor crAtBat|crAtBat crHits|crHits
                     crHome|crHome crRuns|crRuns crRbi|crRbi
                     crBB|crBB league division nOuts nAssts nError
                   / selection=forward(steps=15 choose=AICC) stats=PRESS;
run;
Figure 45.18 shows the requested criterion panel. Note that the PRESS criterion is included in the panel because it is named in the STATS= option in the MODEL statement. The selected step is displayed as a vertical reference line on the plot of each criterion, and the legend indicates which of these criteria is used to make the selection. If the selection terminates for a reason other than optimizing a criterion displayed on this plot, then the legend does not report a reason for the selected step. The optimal value of each criterion is indicated with the “Star” marker. Note that a better value of a criterion might have been reached if more steps of the selection process had been performed.
You request the average square error plot by specifying the PLOTS=ASE option in the PROC GLMSELECT statement. This plot shows the progression of the average square error (ASE) evaluated separately on the training data, and the test and validation data whenever these data are provided with the TESTDATA= and VALDATA= options or are produced by using a PARTITION statement. You use the plot to detect when overfitting the training data occurs. The ASE decreases monotonically on the training data as parameters are added to a model. However, the average square error on test and validation data typically starts increasing when overfitting occurs. See Output 45.1.9 and Output 45.2.6 for examples.
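The following sketch shows one way to obtain test and validation curves on the average square error plot by using a PARTITION statement. The fractions, the seed, and the choice of CHOOSE=VALIDATE are illustrative assumptions and are not necessarily the settings that produce Output 45.1.9 or Output 45.2.6.

```sas
ods graphics on;

/* SEED= makes the random partition reproducible */
proc glmselect data=baseball plots=ase seed=1;
   /* Reserve 30% of the observations for validation and 20% for testing; */
   /* the remaining observations are used for training.                   */
   partition fraction(validate=0.3 test=0.2);
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor|yrMajor crAtBat|crAtBat crHits|crHits
                     crHome|crHome crRuns|crRuns crRbi|crRbi
                     crBB|crBB league division nOuts nAssts nError
                   / selection=forward(choose=validate stop=none);
run;
```

With STOP=NONE the selection continues until all effects are in the model, so the plot covers the full range of steps and any upturn in the validation and test curves is visible.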
The coefficient panel, criterion panel, and average square error plot display information for all the steps examined in the selection process. In some cases, you might want to focus attention on just a particular step range. For example, it is hard to discern the variation in the criteria displayed in Figure 45.18 near the selected step because the variation in these criteria in the steps close to the selected step is small relative to the variation across all steps. You can request a range of steps to display using the STARTSTEP= and ENDSTEP= suboptions of the PLOTS= option. You can specify these options as both global and specific plot options, with the specific options taking precedence if both are specified. The following statements provide an example:
proc glmselect data=baseball plots=criteria(startstep=10 endstep=16);
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor|yrMajor crAtBat|crAtBat crHits|crHits
                     crHome|crHome crRuns|crRuns crRbi|crRbi
                     crBB|crBB league division nOuts nAssts nError
                   / selection=forward(stop=none choose=AICC);
run;

ods graphics off;
Figure 45.19 shows the progression of the fit criteria between steps 10 and 16. Note that if the optimal value of a criterion does not occur in this specified step range, then no optimal marker appears for that criterion. The plot of the SBC criterion in Figure 45.19 is one such case.
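The precedence of specific over global step-range options can be sketched as follows. The step values are arbitrary and chosen only for illustration. In this sketch the global range (steps 5 through 20) is expected to apply to the coefficient panel, and the criterion panel is expected to use its own range (steps 10 through 16).

```sas
ods graphics on;

proc glmselect data=baseball
               plots(startstep=5 endstep=20)=
                    (coefficients criteria(startstep=10 endstep=16));
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                     yrMajor|yrMajor crAtBat|crAtBat crHits|crHits
                     crHome|crHome crRuns|crRuns crRbi|crRbi
                     crBB|crBB league division nOuts nAssts nError
                   / selection=forward(stop=none choose=AICC);
run;

ods graphics off;
```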