The HPREG Procedure

Example 12.1 Model Selection with Validation

This example is based on the example "Using Validation and Cross Validation" in the documentation for the GLMSELECT procedure in the SAS/STAT User's Guide. It shows how you can use validation data to monitor and control variable selection, and it demonstrates the use of split classification variables.

The following DATA step produces analysis data that contains a variable that you can use to assign observations to the training, validation, and testing roles. In this case, each role has 5,000 observations.

  
 data analysisData;
    drop i j c3Num;
    length c3$ 7;

    array x{20} x1-x20;

    do i=1 to 15000;
       do j=1 to 20;
          x{j} = ranuni(1);
       end;

       c1 = 1 + mod(i,8);
       c2 = ranbin(1,3,.6); 
  
       if      i < 50   then do; c3 = 'tiny';     c3Num=1;end;
       else if i < 250  then do; c3 = 'small';    c3Num=1;end;
       else if i < 600  then do; c3 = 'average';  c3Num=2;end;
       else if i < 1200 then do; c3 = 'big';      c3Num=3;end;
       else                  do; c3 = 'huge';     c3Num=5;end;  

       yTrue = 10 + x1 + 2*x5 + 3*x10 + 4*x20  + 3*x1*x7 + 8*x6*x7
                  + 5*(c1=3)*c3Num + 8*(c1=7);

       error = 5*rannor(1);
      
       y = yTrue + error;
     
            if mod(i,3)=1 then Role = 'TRAIN';
       else if mod(i,3)=2 then Role = 'VAL';
       else                    Role = 'TEST';
       
       output;
   end;
 run;  

By construction, the true model consists of main effects x1, x5, x10, x20, and c1 and interaction effects x1*x7, x6*x7, and c1*c3. Furthermore, you can see that only levels 3 and 7 of the classification variable c1 are systematically related to the response.

Because the error term for each observation is five times a draw from a standard normal distribution, the expected error variance is 25. For the data in each role, you can estimate this error variance by forming the average square error (ASE) for the observations in that role. Output 12.1.1 shows the ASE for each role, which you can compute with the following statements:


 proc summary data=analysisData;
    class role;
    ways 1;
    var error;
    output out=ASE uss=uss n=n;
 run;

 data ASE; set ASE;
    OracleASE = uss / n;
    label OracleASE = 'Oracle ASE';
    keep Role OracleASE;
 run;

 proc print data=ASE label noobs;
 run;

Output 12.1.1: Oracle ASE Values by Role

Role Oracle ASE
TEST 25.5784
TRAIN 25.4008
VAL 25.8993


The ASE values shown in Output 12.1.1 are labeled Oracle ASE because computing them from the response and the underlying regressors requires knowing the true underlying model. In a modeling context, a good predictive model produces ASE values that are close to these oracle values. An overfit model produces a smaller ASE on the training data but higher values on the validation and test data; an underfit model exhibits higher values for all data roles.
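
You can check this reasoning outside SAS as well. The following Python sketch (illustrative only, not part of the SAS example; the seed is arbitrary) draws 5,000 errors as five times a standard normal variate and computes their average square error, which is close to the error variance of 25:

```python
import random
import statistics

# Each observation's error is 5 * rannor, i.e. a draw from N(0, 25)
random.seed(12345)  # arbitrary seed, unrelated to the SAS example
errors = [5 * random.gauss(0, 1) for _ in range(5000)]

# For the true ("oracle") model, the ASE is just the mean squared error
oracle_ase = statistics.fmean(e * e for e in errors)
print(oracle_ase)  # close to 25, like the values in Output 12.1.1
```

With 5,000 observations per role, the sampling standard deviation of this estimate is about 0.5, which is why the three oracle ASE values in Output 12.1.1 all fall within a point of 25.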

Suppose you suspect that the dependent variable depends on both main effects and two-way interactions. You can use the following statements to select a model:

 proc hpreg data=analysisData;
    partition roleVar=role(train='TRAIN' validate='VAL' test='TEST');
    class c1 c2 c3(order=data);
    model y =  c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
              |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2 /stb;
    selection method = stepwise(select=sl sle=0.1 sls=0.15 choose=validate)
                       hierarchy=single details=steps;
 run;

A PARTITION statement assigns observations to the training, validation, and testing roles based on the value of the input variable named role. The SELECTION statement requests stepwise selection based on significance levels, where the SLE and SLS values are set to the defaults of PROC REG. The CHOOSE=VALIDATE option selects the model that yields the smallest ASE value on the validation data.

The Number of Observations table in Output 12.1.2 confirms that there are 5,000 observations for each data role. The Dimensions table shows that the selection is from 278 effects with a total of 661 parameters.

Output 12.1.2: Number of Observations, Class Levels, and Dimensions

The HPREG Procedure

Number of Observations Read 15000
Number of Observations Used 15000
Number of Observations Used for Training 5000
Number of Observations Used for Validation 5000
Number of Observations Used for Testing 5000

Class Level Information
Class Levels Values
c1 8 1 2 3 4 5 6 7 8
c2 4 0 1 2 3
c3 5 tiny small average big huge

Dimensions
Number of Effects 278
Number of Parameters 661


Output 12.1.3 shows the Selection Summary table. You see that 18 steps are performed, at which point all effects in the model are significant at the SLS value of 0.15 and none of the remaining effects, if added individually, would be significant at the SLE value of 0.1. However, because you specified the CHOOSE=VALIDATE option, the model at step 18 is not used as the selected model. Instead, the model at step 10, where the validation ASE achieves a local minimum value, is selected. The Stop Reason, Selection Reason, and Selected Effects in Output 12.1.4 provide this information.

Output 12.1.3: Selection Summary

The HPREG Procedure

Selection Summary
Step  Effect Entered  Number Effects In  Validation ASE  p Value
0 Intercept 1 98.3895 1.0000
1 c1 2 34.8572 <.0001
2 x7 3 32.5531 <.0001
3 x6 4 31.0646 <.0001
4 x20 5 29.7078 <.0001
5 x6*x7 6 29.2210 <.0001
6 x10 7 28.6683 <.0001
7 x1 8 28.3250 <.0001
8 x5 9 27.9766 <.0001
9 c3 10 27.8288 <.0001
10 c1*c3 11 25.9701* <.0001
11 x10*c1 12 26.0696 0.0109
12 x4 13 26.1594 0.0128
13 x4*x10 14 26.1814 0.0035
14 x20*c1 15 26.3294 0.0156
15 x1*c3 16 26.3945 0.0244
16 x1*x7 17 26.3632 0.0270
17 x7*x10 18 26.4120 0.0313
18 x1*x20 19 26.4330 0.0871

* Optimal Value of Criterion



Output 12.1.4: Stopping and Selection Reasons

Selection stopped because all candidates for removal are significant at the 0.15 level and no candidate for entry is significant at the 0.1 level.

The model at step 10 is selected where Validation ASE is 25.9701.

Selected Effects: Intercept c1 c3 c1*c3 x1 x5 x6 x7 x6*x7 x10 x20


You can see that the selected effects include all the main effects in the true model and two of the three true interaction terms. Furthermore, the selected model does not include any variables that are not in the true model. Note that these statements are not true of the larger model at the final step of the selection process.
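
The CHOOSE=VALIDATE rule amounts to picking the step with the smallest validation ASE, regardless of where the stepwise search itself stops. A short sketch (in Python, for illustration only) applied to the validation ASE column of Output 12.1.3 reproduces the choice of step 10:

```python
# Validation ASE at each step, copied from Output 12.1.3 (steps 0-18)
validation_ase = [
    98.3895, 34.8572, 32.5531, 31.0646, 29.7078, 29.2210, 28.6683,
    28.3250, 27.9766, 27.8288, 25.9701, 26.0696, 26.1594, 26.1814,
    26.3294, 26.3945, 26.3632, 26.4120, 26.4330,
]

# CHOOSE=VALIDATE: select the step whose validation ASE is smallest,
# even though the stepwise search continues through step 18.
chosen_step = min(range(len(validation_ase)), key=validation_ase.__getitem__)
print(chosen_step)  # 10
```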

Output 12.1.5 shows the fit statistics of the selected model. You can see that the ASE values on the training, validation, and test data are all similar, which is indicative of a reasonable predictive model. In this case where the true model is known, you can see that all three ASE values are close to oracle values for the true model, as shown in Output 12.1.1.

Output 12.1.5: Fit Statistics for the Selected Model

Root MSE 5.03976
R-Square 0.74483
Adj R-Sq 0.74246
AIC 21222
AICC 21223
SBC 16527
ASE (Train) 25.16041
ASE (Validate) 25.97010
ASE (Test) 25.83436


Because you specified the DETAILS=STEPS option in the SELECTION statement, you can see the fit statistics for the model at each step of the selection process. Output 12.1.6 shows these fit statistics for the final model at step 18. For this model, the ASE value on the training data is smaller than the ASE values on the validation and test data, which is indicative of an overfit model that might not generalize well to new data. The ASE values on the validation and test data are also farther from the oracle values than those of the selected model at step 10.

Output 12.1.6: Fit Statistics for the Model at Step 18

Root MSE 5.01386
R-Square 0.74862
Adj R-Sq 0.74510
AIC 21194
AICC 21196
SBC 16648
ASE (Train) 24.78688
ASE (Validate) 26.43304
ASE (Test) 26.07078
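
The overfitting pattern is visible directly in the reported numbers: relative to the selected model at step 10 (Output 12.1.5), the step 18 model fits the training data better but the validation and test data worse. A quick Python check (illustration only) using the values from Output 12.1.5 and Output 12.1.6:

```python
# Fit statistics copied from Output 12.1.5 (step 10) and Output 12.1.6 (step 18)
step10 = {"train": 25.16041, "validate": 25.97010, "test": 25.83436}
step18 = {"train": 24.78688, "validate": 26.43304, "test": 26.07078}

# Hallmark of overfitting: training ASE improves while holdout ASE degrades
print(step18["train"] < step10["train"])        # True
print(step18["validate"] > step10["validate"])  # True
print(step18["test"] > step10["test"])          # True
```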


Output 12.1.7 shows part of the Parameter Estimates table for the selected model at step 10 that includes the estimates for the main effect c1. Because the STB option is specified in the MODEL statement, this table includes standardized estimates.

Output 12.1.7: Part of the Parameter Estimates Table for the Selected Model

Parameter Estimates
Parameter     DF  Estimate   Standardized Estimate  Standard Error  t Value  Pr > |t|
Intercept 1 9.479114 0 0.422843 22.42 <.0001
c1 1 1 0.279417 0.009306 0.297405 0.94 0.3475
c1 2 1 0.615589 0.020502 0.297332 2.07 0.0385
c1 3 1 25.678601 0.855233 0.297280 86.38 <.0001
c1 4 1 0.420360 0.014000 0.297283 1.41 0.1574
c1 5 1 0.473986 0.015786 0.297265 1.59 0.1109
c1 6 1 0.394044 0.013124 0.297299 1.33 0.1851
c1 7 1 8.469793 0.282089 0.297345 28.48 <.0001
c1 8 0 0 0 . . .


The magnitudes of the standardized estimates and the t statistics of the parameters of the effect c1 reveal that only levels 3 and 7 of this effect contribute appreciably to the model. This suggests that a more parsimonious model with similar or better predictive power might be obtained if parameters that correspond to the levels of c1 can enter or leave the model independently. You request this with the SPLIT option in the CLASS statement as shown in the following statements:

 proc hpreg data=analysisData;
    partition roleVar=role(train='TRAIN' validate='VAL' test='TEST');
    class c1(split) c2 c3(order=data);
    model y =  c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
              |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2 /stb;
    selection method = stepwise(select=sl sle=0.1 sls=0.15 choose=validate)
                       hierarchy=single details=steps;
 run;
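
Conceptually, the SPLIT option replaces the single effect c1 with one separate candidate effect per design column, so that each level's dummy variable can enter or leave the model on its own. A rough Python illustration of this idea (the names c1_1 through c1_8 mirror how the split effects appear in the output; PROC HPREG performs this splitting internally):

```python
levels = list(range(1, 9))             # the 8 levels of c1
c1 = [1 + (i % 8) for i in range(24)]  # sample c1 values, as in the DATA step

# Without SPLIT: c1 is one effect, a block of 8 dummy columns that enters
# or leaves the model as a unit.
# With SPLIT: each dummy column becomes its own candidate effect, so the
# selection can keep c1_3 and c1_7 while dropping the other level dummies.
split_effects = {
    f"c1_{lev}": [1.0 if v == lev else 0.0 for v in c1] for lev in levels
}
print(len(split_effects))  # 8 separate effects instead of 1
```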

Output 12.1.8 shows the Dimensions table. You can see that because the columns in the design matrix that correspond to levels of c1 are treated as separate effects, the selection is now from 439 effects, even though the number of parameters is unchanged.

Output 12.1.8: Dimensions with c1 Split

The HPREG Procedure

Dimensions
Number of Effects 278
Number of Effects after Splits 439
Number of Parameters 661


Output 12.1.9 shows the selected effects. You can see that, as anticipated, the selected model now depends on only levels 3 and 7 of c1.

Output 12.1.9: Selected Effects with c1 Split

Selected Effects: Intercept c1_3 c1_7 c3 c1_3*c3 x1 x5 x6 x7 x6*x7 x10 x20


Finally, the fit statistics for the selected model are shown in Output 12.1.10.

Output 12.1.10: Fit Statistics for the Selected Model with c1 Split

Root MSE 5.04060
R-Square 0.74325
Adj R-Sq 0.74238
AIC 21195
AICC 21195
SBC 16311
ASE (Train) 25.31622
ASE (Validate) 25.98055
ASE (Test) 25.76059


If you compare the ASE values for this model in Output 12.1.10 with the oracle values in Output 12.1.1 and the values for the model without splitting c1 in Output 12.1.5, you see that this more parsimonious model produces the best predictive performance on the test data of all the models considered in this example.