This example is based on the example “Using Validation and Cross Validation” in the documentation for the GLMSELECT procedure in the SAS/STAT User's Guide. This example shows how you can use validation data to monitor and control variable selection. It also demonstrates the use of split classification variables.
The following DATA step produces analysis data that contains a variable that you can use to assign observations to the training, validation, and testing roles. In this case, each role has 5,000 observations.
```
data analysisData;
   drop i j c3Num;
   length c3 $ 7;
   array x{20} x1-x20;

   do i=1 to 15000;
      do j=1 to 20;
         x{j} = ranuni(1);
      end;

      c1 = 1 + mod(i,8);
      c2 = ranbin(1,3,.6);

      if      i < 50   then do; c3 = 'tiny';    c3Num=1; end;
      else if i < 250  then do; c3 = 'small';   c3Num=1; end;
      else if i < 600  then do; c3 = 'average'; c3Num=2; end;
      else if i < 1200 then do; c3 = 'big';     c3Num=3; end;
      else                  do; c3 = 'huge';    c3Num=5; end;

      yTrue = 10 + x1 + 2*x5 + 3*x10 + 4*x20 + 3*x1*x7
                 + 8*x6*x7 + 5*(c1=3)*c3Num + 8*(c1=7);

      error = 5*rannor(1);
      y     = yTrue + error;

      if      mod(i,3)=1 then Role = 'TRAIN';
      else if mod(i,3)=2 then Role = 'VAL';
      else                    Role = 'TEST';

      output;
   end;
run;
```
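As a quick sanity check (this step is not part of the original example), you can tabulate the role variable to confirm the 5,000/5,000/5,000 split:

```
proc freq data=analysisData;
   tables Role / nocum nopercent;   /* one count for each of TEST, TRAIN, and VAL */
run;
```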
By construction, the true model consists of the main effects x1, x5, x10, x20, and c1 and the interaction effects x1*x7, x6*x7, and c1*c3. Furthermore, you can see that only levels 3 and 7 of the classification variable c1 are systematically related to the response.
Because the error term for each observation is five times a value drawn from a standard normal distribution, the expected error variance is 25. For the data in each role, you can compute an estimate of this error variance by forming the average square error (ASE) for the observations in that role. Output 8.1.1 shows the ASE for each role, which you can compute with the following statements:
```
proc summary data=analysisData;
   class role;
   ways 1;
   var error;
   output out=ASE uss=uss n=n;
run;

data ASE;
   set ASE;
   OracleASE = uss/n;
   label OracleASE = 'Oracle ASE';
   keep Role OracleASE;
run;

proc print data=ASE label noobs;
run;
```
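As a quick check of the arithmetic behind these oracle values: the error term is $5Z$ with $Z \sim N(0,1)$, so

$$\operatorname{Var}(5Z) = 5^2 \operatorname{Var}(Z) = 25.$$

Because the errors have mean 0, the uncorrected sum of squares divided by $n$ that the preceding steps compute is exactly the ASE, $\tfrac{1}{n}\sum_{i=1}^{n} \text{error}_i^2$, which is an unbiased estimate of this variance.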
The ASE values shown in Output 8.1.1 are labeled “Oracle ASE” because you need to know the true underlying model to compute them from the response and the underlying regressors. In a modeling context, a good predictive model produces values that are close to these oracle values. An overfit model produces a smaller ASE on the training data but larger values on the validation and test data, whereas an underfit model exhibits larger values for all data roles.
Suppose you suspect that the dependent variable depends on both main effects and two-way interactions. You can use the following statements to select a model:
```
proc hpreg data=analysisData;
   partition roleVar=role(train='TRAIN' validate='VAL' test='TEST');
   class c1 c2 c3(order=data);
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2 /stb;
   selection method=stepwise(select=sl sle=0.1 sls=0.15 choose=validate)
             hierarchy=single details=steps;
run;
```
The PARTITION statement assigns observations to the training, validation, and testing roles based on the values of the input variable named role. The SELECTION statement requests stepwise selection based on significance levels, with the SLE and SLS values set to match the defaults of PROC REG. The CHOOSE=VALIDATE option selects the model that yields the smallest ASE value on the validation data.
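In this example the roles are pre-assigned by the variable role. If your data do not contain such a variable, the PARTITION statement can instead assign roles pseudorandomly. The following is a minimal sketch, which assumes that the FRACTION option of the PARTITION statement and the SEED= option of the PROC HPREG statement behave as they do in other SAS high-performance modeling procedures:

```
proc hpreg data=analysisData seed=12345;
   /* randomly reserve about 25% of the observations for validation and
      25% for testing; the remaining observations are used for training */
   partition fraction(validate=0.25 test=0.25);
   class c1 c2 c3(order=data);
   model y = x1-x20 / stb;
   selection method=stepwise(select=sl choose=validate);
run;
```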
The “Number of Observations” table in Output 8.1.2 confirms that there are 5,000 observations for each data role. The “Dimensions” table shows that the selection is from 278 effects with a total of 661 parameters.
Output 8.1.2: Number of Observations, Class Levels, and Dimensions
| Number of Observations Read | 15000 |
|---|---|
| Number of Observations Used | 15000 |
| Number of Observations Used for Training | 5000 |
| Number of Observations Used for Validation | 5000 |
| Number of Observations Used for Testing | 5000 |
Class Level Information

| Class | Levels | Values |
|---|---|---|
| c1 | 8 | 1 2 3 4 5 6 7 8 |
| c2 | 4 | 0 1 2 3 |
| c3 | 5 | tiny small average big huge |
Dimensions

| Number of Effects | 278 |
|---|---|
| Number of Parameters | 661 |
Output 8.1.3 shows the “Selection Summary” table. You see that the selection takes 18 steps, at which point all effects in the model are significant at the SLS value of 0.15 and none of the remaining effects, if added individually, would be significant at the SLE value of 0.1. However, because you specified the CHOOSE=VALIDATE option, the model at step 18 is not the selected model. Instead, the model at step 10, where the validation ASE achieves its minimum value, is selected. The “Stop Reason,” “Selection Reason,” and “Selected Effects” tables in Output 8.1.4 provide this information.
Output 8.1.3: Selection Summary
Selection Summary

| Step | Effect Entered | Number Effects In | Validation ASE | p Value |
|---|---|---|---|---|
| 0 | Intercept | 1 | 98.3895 | 1.0000 |
| 1 | c1 | 2 | 34.8572 | <.0001 |
| 2 | x7 | 3 | 32.5531 | <.0001 |
| 3 | x6 | 4 | 31.0646 | <.0001 |
| 4 | x20 | 5 | 29.7078 | <.0001 |
| 5 | x6*x7 | 6 | 29.2210 | <.0001 |
| 6 | x10 | 7 | 28.6683 | <.0001 |
| 7 | x1 | 8 | 28.3250 | <.0001 |
| 8 | x5 | 9 | 27.9766 | <.0001 |
| 9 | c3 | 10 | 27.8288 | <.0001 |
| 10 | c1*c3 | 11 | 25.9701* | <.0001 |
| 11 | x10*c1 | 12 | 26.0696 | 0.0109 |
| 12 | x4 | 13 | 26.1594 | 0.0128 |
| 13 | x4*x10 | 14 | 26.1814 | 0.0035 |
| 14 | x20*c1 | 15 | 26.3294 | 0.0156 |
| 15 | x1*c3 | 16 | 26.3945 | 0.0244 |
| 16 | x1*x7 | 17 | 26.3632 | 0.0270 |
| 17 | x7*x10 | 18 | 26.4120 | 0.0313 |
| 18 | x1*x20 | 19 | 26.4330 | 0.0871 |

* Optimal Value of Criterion
Output 8.1.4: Stopping and Selection Reasons
Selection stopped because all candidates for removal are significant at the 0.15 level and no candidate for entry is significant at the 0.1 level.

The model at step 10 is selected where Validation ASE is 25.9701.

Selected Effects: Intercept c1 c3 c1*c3 x1 x5 x6 x7 x6*x7 x10 x20
You can see that the selected effects include all the main effects in the true model and two of the three true interaction terms. Furthermore, the selected model does not include any variables that are not in the true model. Note that these statements are not true of the larger model at the final step of the selection process.
Output 8.1.5 shows the fit statistics for the selected model. You can see that the ASE values on the training, validation, and test data are all similar, which is indicative of a reasonable predictive model. In this case, where the true model is known, you can also see that all three ASE values are close to the oracle values shown in Output 8.1.1.
Output 8.1.5: Fit Statistics for the Selected Model
| Root MSE | 5.03976 |
|---|---|
| R-Square | 0.74483 |
| Adj R-Sq | 0.74246 |
| AIC | 21222 |
| AICC | 21223 |
| SBC | 16527 |
| ASE (Train) | 25.16041 |
| ASE (Validate) | 25.97010 |
| ASE (Test) | 25.83436 |
Because you specified the DETAILS=STEPS option in the SELECTION statement, the “Fit Statistics” table is produced for the model at each step of the selection process. Output 8.1.6 shows these fit statistics for the final model at step 18. You see that for this model, the ASE value on the training data is smaller than the ASE values on the validation and test data, which is indicative of an overfit model that might not generalize well to new data. You also see that the ASE values on the validation and test data are farther from the oracle values than the corresponding values for the selected model at step 10.
Output 8.1.6: Fit Statistics for the Model at Step 18
| Root MSE | 5.01386 |
|---|---|
| R-Square | 0.74862 |
| Adj R-Sq | 0.74510 |
| AIC | 21194 |
| AICC | 21196 |
| SBC | 16648 |
| ASE (Train) | 24.78688 |
| ASE (Validate) | 26.43304 |
| ASE (Test) | 26.07078 |
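If you want to reproduce these role-wise ASE values yourself, you can write per-observation predictions from the selected model to a data set and average the squared errors by role. The following sketch assumes that the ID and OUTPUT statements of PROC HPREG work as they do in other SAS high-performance procedures; in particular, the PRED= keyword and the use of an ID statement to copy input variables into the output data set are assumptions:

```
proc hpreg data=analysisData;
   partition roleVar=role(train='TRAIN' validate='VAL' test='TEST');
   class c1 c2 c3(order=data);
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2 /stb;
   selection method=stepwise(select=sl sle=0.1 sls=0.15 choose=validate)
             hierarchy=single;
   id y Role;                    /* copy these input variables to the output data set */
   output out=pred pred=yHat;    /* predictions from the selected model */
run;

data pred;
   set pred;
   sqErr = (y - yHat)**2;        /* per-observation squared error */
run;

proc means data=pred mean maxdec=5;
   class Role;
   var sqErr;                    /* the mean of sqErr within each role is the ASE */
run;
```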
Output 8.1.7 shows the part of the “Parameter Estimates” table for the selected model at step 10 that includes the estimates for the main effect c1. Because the STB option is specified in the MODEL statement, this table includes standardized estimates.
Output 8.1.7: Part of the Parameter Estimates Table for the Selected Model
Parameter Estimates

| Parameter | DF | Estimate | Standardized Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|---|
| Intercept | 1 | 9.479114 | 0 | 0.422843 | 22.42 | <.0001 |
| c1 1 | 1 | 0.279417 | 0.009306 | 0.297405 | 0.94 | 0.3475 |
| c1 2 | 1 | 0.615589 | 0.020502 | 0.297332 | 2.07 | 0.0385 |
| c1 3 | 1 | 25.678601 | 0.855233 | 0.297280 | 86.38 | <.0001 |
| c1 4 | 1 | 0.420360 | 0.014000 | 0.297283 | 1.41 | 0.1574 |
| c1 5 | 1 | 0.473986 | 0.015786 | 0.297265 | 1.59 | 0.1109 |
| c1 6 | 1 | 0.394044 | 0.013124 | 0.297299 | 1.33 | 0.1851 |
| c1 7 | 1 | 8.469793 | 0.282089 | 0.297345 | 28.48 | <.0001 |
| c1 8 | 0 | 0 | 0 | . | . | . |
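For reference, a standardized estimate of the kind the STB option reports is, by the convention PROC REG uses, the raw estimate rescaled by the sample standard deviations of the regressor column and the response (the assumption here is that HPREG follows the same convention):

$$\hat{\beta}_j^{\text{std}} = \hat{\beta}_j \,\frac{s_{x_j}}{s_y}.$$

This rescaling makes the estimates comparable across regressors that are measured on different scales, which is what lets you compare the contributions of the individual levels of c1 directly.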
The magnitudes of the standardized estimates and the t statistics of the parameters of the effect c1 reveal that only levels 3 and 7 of this effect contribute appreciably to the model. This suggests that a more parsimonious model with similar or better predictive power might be obtained if parameters that correspond to the levels of c1 can enter or leave the model independently. You request this with the SPLIT option in the CLASS statement, as shown in the following statements:
```
proc hpreg data=analysisData;
   partition roleVar=role(train='TRAIN' validate='VAL' test='TEST');
   class c1(split) c2 c3(order=data);
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2 /stb;
   selection method=stepwise(select=sl sle=0.1 sls=0.15 choose=validate)
             hierarchy=single details=steps;
run;
```
Output 8.1.8 shows the “Dimensions” table. You can see that because the columns in the design matrix that correspond to levels of c1 are now treated as separate effects, the selection is from 439 effects, even though the number of parameters is unchanged. (Twenty-three of the original 278 effects involve c1: the main effect, c1*c2, c1*c3, and the 20 interactions with x1 through x20. Splitting replaces each of these with 8 single-level effects, adding 23 × 7 = 161 effects.)
Output 8.1.8: Dimensions with c1 Split
Dimensions

| Number of Effects | 278 |
|---|---|
| Number of Effects after Splits | 439 |
| Number of Parameters | 661 |
Output 8.1.9 shows the selected effects. You can see that, as anticipated, the selected model now depends on only levels 3 and 7 of c1.
Output 8.1.9: Selected Effects with c1 Split
Selected Effects: Intercept c1_3 c1_7 c3 c1_3*c3 x1 x5 x6 x7 x6*x7 x10 x20
Finally, the fit statistics for the selected model are shown in Output 8.1.10.
Output 8.1.10: Fit Statistics for the Selected Model with c1 Split
| Root MSE | 5.04060 |
|---|---|
| R-Square | 0.74325 |
| Adj R-Sq | 0.74238 |
| AIC | 21195 |
| AICC | 21195 |
| SBC | 16311 |
| ASE (Train) | 25.31622 |
| ASE (Validate) | 25.98055 |
| ASE (Test) | 25.76059 |
If you compare the ASE values for this model in Output 8.1.10 with the oracle values in Output 8.1.1 and the values for the model without splitting c1 in Output 8.1.5, you see that this more parsimonious model produces the best predictive performance on the test data of all the models considered in this example.