This example shows how you can use the SCREEN option in the SELECTION statement to greatly speed up model selection from a large number of regressors. In order to demonstrate the efficacy of model selection with screening, this example uses simulated data in which the response y depends systematically on a relatively small subset of a much larger set of regressors, which is described in Table 14.8.
Table 14.8: Complete Set of Regressors
Regressor Name | Type | Number of Levels | In True Model |
---|---|---|---|
xIn1–xIn25 | Continuous | | Yes |
xWeakIn1–xWeakIn2 | Continuous | | Yes |
xOut1–xOut500 | Continuous | | No |
cIn1–cIn5 | Classification | From two to five | Yes |
cOut1–cOut500 | Classification | From two to five | No |
The labels In and Out, which are part of the variable names, make it easy to identify whether the selected model succeeds or fails in capturing the true underlying model. The regressors that are labeled xWeakIn1 and xWeakIn2 are predictive, but their influence is substantially smaller than the influence of the other regressors in the true model.
The following DATA step generates the data:
%let nObs = 50000;
%let nContIn = 25;
%let nContOut = 500;
%let nClassIn = 5;
%let nClassOut = 500;
%let maxLevs = 5;
%let noiseScale= 1;

data ex4Data;
   array xIn{&nContIn};
   array xOut{&nContOut};
   array cIn{&nClassIn};
   array cOut{&nClassOut};
   drop i j sign nLevs xBeta;
   do i=1 to &nObs;
      sign = -1;
      xBeta = 0;
      do j=1 to dim(xIn);
         xIn{j} = ranuni(1);
         xBeta = xBeta + j*sign*xIn{j};
         sign = -sign;
      end;
      do j=1 to dim(xOut);
         xOut{j} = ranuni(1);
      end;
      xWeakIn1 = ranuni(1);
      xWeakin2 = ranuni(1);
      xBeta = xBeta + 0.1*xWeakIn1+ 0.1*xWeakIn2;
      do j=1 to dim(cIn);
         nLevs = 2 + mod(j,&maxlevs-1);
         cIn{j} = 1+int(ranuni(1)*nLevs);
         xBeta = xBeta + j*sign*(cIn{j}-nLevs/2);
         sign = -sign;
      end;
      do j=1 to dim(cOut);
         nLevs = 2 + mod(j,&maxlevs-1);
         cOut{j} = 1+int(ranuni(1)*nLevs);
      end;
      y = xBeta + &noiseScale*rannor(1);
      output;
   end;
run;
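The coefficients of the true model are not listed explicitly, but they can be read off the DATA step. The following Python sketch is an illustration only and is not part of the SAS example. It replays the sign-alternation logic to list the implied coefficient of each true regressor and the number of levels of each cIn variable. The dictionary names are hypothetical.

```python
# Replays the coefficient logic of the DATA step (an illustrative reading
# of the SAS code, not output produced by SAS).
N_CONT_IN = 25    # &nContIn
N_CLASS_IN = 5    # &nClassIn
MAX_LEVS = 5      # &maxLevs

sign = -1         # reset to -1 at the start of every observation
cont_coef = {}
for j in range(1, N_CONT_IN + 1):
    cont_coef[f"xIn{j}"] = j * sign   # xBeta = xBeta + j*sign*xIn{j}
    sign = -sign

# The two weak regressors enter with a fixed coefficient of 0.1.
weak_coef = {"xWeakIn1": 0.1, "xWeakIn2": 0.1}

# The sign keeps alternating from where the continuous loop left it.
class_coef = {}
class_levels = {}
for j in range(1, N_CLASS_IN + 1):
    class_levels[f"cIn{j}"] = 2 + j % (MAX_LEVS - 1)  # nLevs
    class_coef[f"cIn{j}"] = j * sign  # multiplies (cIn{j} - nLevs/2)
    sign = -sign

print(cont_coef["xIn1"], cont_coef["xIn25"], class_coef, class_levels)
```

Under this reading, xIn1 has the smallest coefficient magnitude among the xIn regressors (1, compared with 25 for xIn25), which is consistent with its weak marginal correlation with y later in the example.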
When you have insufficient prior knowledge of what effects need to be included in a parsimonious predictive model, a reasonable starting point is to use model selection to build such a model. You might want to consider a large number of possible model effects, even though you know that a successful model that generalizes well for predicting unseen data depends on a relatively small number of effects. In such cases, you can dramatically reduce the computational task by including screening in the model selection process. The following statements show how you do this:
proc hpreg data=ex4Data;
   class c: ;
   model y = x: c: ;
   selection method=forward screen(details=all)=100 20;
   performance details;
run;
The ordered pair of integers that is specified in the SCREEN option in the SELECTION statement requests that screening be used to reduce the set of regressors to 100 regressors at the first screening stage and to 20 regressors at the second screening stage. This information is reflected in the "Screening Information" table shown in Output 14.4.1.
The "Number Of Observations" table in Output 14.4.2 confirms that the data contain 50,000 observations and the "Dimensions" table shows that the selection is from 1,033 effects that have a total of 2,295 parameters.
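The counts in the "Dimensions" table can be checked against the data-generating code. The following Python sketch is an illustration and is not part of the SAS example. It assumes that the intercept is counted as one effect and one parameter, that each continuous regressor contributes one parameter, and that each classification effect contributes one parameter per level. The variable names are hypothetical.

```python
# Recompute the effect and parameter counts from the macro-variable values.
# Assumptions: the intercept counts as one effect and one parameter, and
# each classification effect contributes one parameter per level.
N_CONT_IN, N_CONT_OUT, N_WEAK = 25, 500, 2
N_CLASS_IN, N_CLASS_OUT, MAX_LEVS = 5, 500, 5


def n_levels(j):
    """Number of levels of the j-th classification variable (nLevs)."""
    return 2 + j % (MAX_LEVS - 1)


n_continuous = N_CONT_IN + N_CONT_OUT + N_WEAK
n_class = N_CLASS_IN + N_CLASS_OUT

class_params = (
    sum(n_levels(j) for j in range(1, N_CLASS_IN + 1))
    + sum(n_levels(j) for j in range(1, N_CLASS_OUT + 1))
)

n_effects = 1 + n_continuous + n_class
n_parameters = 1 + n_continuous + class_params

print(n_effects, n_parameters)  # 1033 2295
```

Under these assumptions the totals agree with the 1,033 effects and 2,295 parameters that are reported in Output 14.4.2.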
Because you specified the DETAILS=ALL suboption of the SCREEN option, you obtain the "Screening" table in Output 14.4.3, which shows how the screened subset of 100 effects is obtained at the first screening stage. For display purposes, some ranks in this table have been suppressed.
The "Screened Effects" table shown in Output 14.4.4 lists the effects from which a model is selected at the first screening stage.
Output 14.4.4: First Stage Screened Effects
Screened Effects: | xIn25 xIn24 xIn23 xIn22 xIn21 xIn20 xIn19 xIn18 xIn17 xIn16 xIn15 xIn14 xIn13 cIn5 xIn12 cIn3 xIn11 xIn10 xIn8 xIn9 xIn7 cIn4 cIn2 xIn6 xIn5 xIn4 xIn3 cIn1 xIn2 cOut498 cOut110 cOut450 cOut441 cOut272 xOut82 cOut45 cOut6 cOut281 cOut134 cOut15 xOut310 xOut252 xOut485 xOut365 cOut138 cOut123 cOut337 cOut195 cOut423 cOut283 cOut62 cOut114 xOut489 cOut14 cOut158 cOut437 xOut64 cOut301 cOut311 cOut187 cOut431 cOut464 cOut388 cOut213 cOut46 xOut329 cOut403 cOut305 cOut171 cOut85 cOut99 cOut249 xOut267 cOut455 cOut457 cOut271 cOut78 xOut93 cOut259 cOut417 cOut258 cOut326 cOut291 cOut263 cOut107 cOut402 cOut17 cOut237 cOut129 cOut198 cOut58 cOut428 cOut135 cOut206 cOut139 cOut113 cOut486 xOut338 cOut363 cOut194 |
---|
You see that the magnitudes of the pairwise correlations of the effects xIn1, xWeakIn1, and xWeakIn2 with the response are too small for those effects to be included as candidates for selection at the first screening stage.
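A back-of-the-envelope calculation suggests why these three effects are screened out. The following Python sketch is an illustration and is not part of the SAS example. It computes population (not sample) correlations with y from the data-generating code, assuming that all regressors are mutually independent, that each continuous regressor is Uniform(0,1), and that each cIn variable is uniform on the integers 1 through nLevs. The function and variable names are hypothetical.

```python
import math

# Population correlations with y implied by the data-generating code.
# Assumptions: independent regressors, Uniform(0,1) continuous regressors,
# and class variables that are uniform on the integers 1..nLevs.
VAR_U = 1.0 / 12.0      # variance of a Uniform(0,1) variable
NOISE_SCALE = 1.0       # &noiseScale
N_OBS = 50000           # &nObs


def discrete_uniform_var(n):
    """Variance of a variable that is uniform on the integers 1..n."""
    return (n * n - 1) / 12.0


cont_coefs = range(1, 26)                             # |coefficient| of xIn1..xIn25
class_terms = [(j, 2 + j % 4) for j in range(1, 6)]   # (|coefficient|, nLevs)

var_y = (
    sum(c * c for c in cont_coefs) * VAR_U
    + 2 * 0.1 ** 2 * VAR_U                            # xWeakIn1 and xWeakIn2
    + sum(c * c * discrete_uniform_var(n) for c, n in class_terms)
    + NOISE_SCALE ** 2
)


def corr_with_y(coef):
    """|corr(x, y)| for a Uniform(0,1) regressor x that has coefficient coef."""
    return abs(coef) * math.sqrt(VAR_U) / math.sqrt(var_y)


corr_xin1 = corr_with_y(1)      # about 0.013
corr_weak = corr_with_y(0.1)    # about 0.0013
corr_xin25 = corr_with_y(25)    # about 0.32
noise_sd = 1 / math.sqrt(N_OBS) # about 0.0045

print(round(corr_xin1, 4), round(corr_weak, 4), round(corr_xin25, 4), round(noise_sd, 4))
```

Here noise_sd is the approximate standard deviation of a sample correlation between y and an unrelated regressor at this sample size. Under these assumptions, the correlation of xIn1 with y is only about three times that value, and the correlations of the weak regressors are smaller than it, so these effects need not stand out from the strongest spurious correlations among roughly 1,000 noise effects.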
The first stage continues with forward selection from the screened effects that are shown in Output 14.4.4. The effects in the selected model at this stage are shown in Output 14.4.5.
You see that the selected model at this stage includes only effects that are systematically related to the response. If you had requested that only a single-stage screening method be used by specifying the SINGLESTAGE suboption of the SCREEN option, then the selected model at this stage would have been the final selected model. However, multistage screening is used in this example. The second stage repeats the steps of the first stage except that the modeled response is the residuals from the selected model at the first stage.
Output 14.4.6 shows the screening details at the second stage. You see that 20 effects are chosen by screening at this stage as specified.
Because the selected effects from the first stage are orthogonal to the residuals at the first stage, none of these effects are in the screened subset. Furthermore, you see that although the effects xIn1, xWeakIn1, and xWeakIn2 are weakly correlated with y, they are the most strongly correlated effects with the residuals from the first stage.
Output 14.4.6: Second Stage Screening Details
Effect Screening for Stage 1 Residuals
Rank | Effect | Maximum Absolute Correlation |
---|---|---|
1 | xIn1 | 0.27373 |
2 | xWeakIn1 | 0.02352 |
3 | xWeakin2 | 0.02132 |
4 | cOut295 | 0.01524 |
5 | cOut35 | 0.01443 |
6 | cOut323 | 0.01417 |
7 | cOut202 | 0.01406 |
8 | xOut6 | 0.01401 |
9 | cOut154 | 0.01263 |
10 | cOut54 | 0.01160 |
11 | cOut181 | 0.01159 |
12 | cOut115 | 0.01150 |
13 | cOut403 | 0.01144 |
14 | xOut332 | 0.01142 |
15 | xOut409 | 0.01141 |
16 | cOut267 | 0.01137 |
17 | cOut374 | 0.01132 |
18 | cOut254 | 0.01128 |
19 | xOut204 | 0.01121 |
20 | cOut147 | 0.01120 |
21 | xOut113 | 0.01116* |
22 | xOut427 | 0.01115* |
23 | cOut259 | 0.01111* |
24 | cOut170 | 0.01106* |
25 | cOut107 | 0.01102* |
* Screened Out |
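The size of the top three correlations in this table can be anticipated from the data-generating code. The following Python sketch is an illustration and is not part of the SAS example. It assumes an idealized first stage that removes the contribution of every selected effect exactly, so that the residuals contain only the terms in xIn1, xWeakIn1, and xWeakIn2 plus the noise term. It also assumes independent Uniform(0,1) regressors. The variable names are hypothetical.

```python
import math

# Population correlations with idealized stage 1 residuals.
# Assumption: stage 1 removes every selected effect exactly, so that
#   residual = -1*xIn1 + 0.1*xWeakIn1 + 0.1*xWeakIn2 + noise (+ constant).
VAR_U = 1.0 / 12.0   # variance of a Uniform(0,1) regressor
NOISE_VAR = 1.0      # (&noiseScale)**2

var_resid = (1.0 ** 2 + 2 * 0.1 ** 2) * VAR_U + NOISE_VAR

corr_xin1_resid = 1.0 * math.sqrt(VAR_U) / math.sqrt(var_resid)
corr_weak_resid = 0.1 * math.sqrt(VAR_U) / math.sqrt(var_resid)

# Sample values reported in Output 14.4.6, for comparison.
observed = {"xIn1": 0.27373, "xWeakIn1": 0.02352, "xWeakin2": 0.02132}

print(round(corr_xin1_resid, 4), round(corr_weak_resid, 4))
```

Under these assumptions the population values are about 0.277 for xIn1 and about 0.028 for each weak regressor, which is close to the sample values in the table. Removing the large effects shrinks the residual variance from roughly 506 to roughly 1.1, and that is why the same regressors that were nearly invisible against y now rank first.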
Output 14.4.7 shows the selected effects at the second screening stage. You see that the selected effects are precisely the remaining effects that are systematically predictive of y but that were not in the screened subset at the first screening stage.
In the third and final screening stage, model selection is performed from the union of the screened effects from the first stage (which are shown in Output 14.4.4) and the selected effects from the second stage (which are shown in Output 14.4.7). The selected effects from this final stage are shown in Output 14.4.8.
You see that the final selected model contains all the true underlying model effects and just one noise effect (xOut6). Because you specified the DETAILS option in the PERFORMANCE statement, the "Timing" table shown in Output 14.4.9 is displayed.
You see that even though the selected model was obtained by selecting from more than 1,000 effects (which have over 2,000 parameters), screening enabled the entire modeling task to be completed in about 10 seconds. You can perform the same model selection without screening as shown in the following statements:
proc hpreg data=ex4Data;
   class c: ;
   model y = x: c: ;
   selection method=forward;
   performance details;
run;
In this case, the model that is selected without screening is identical to the model that is obtained with screening. However, there is no guarantee that you will get identical selected models. Output 14.4.10 shows the "Timing" table for the model selection without screening.
You see that the model selection without screening took about 83 seconds, which is substantially slower than the approximately 10 seconds it took when screening was included in the selection process.