This example shows how you can use the SCREEN= option to speed up model selection when you have a large number of regressors. To demonstrate how screening improves the efficiency of model selection, this example uses simulated data in which the response y depends systematically on a relatively small subset of a much larger set of regressors, which are described in Table 49.13.
Table 49.13: Complete Set of Regressors
Regressor Name | Type | Number of Levels | In True Model |
---|---|---|---|
xIn1–xIn5 | Continuous | | Yes |
xOut1–xOut1000 | Continuous | | No |
cIn1–cIn2 | Classification | 2–5 | Yes |
cOut1–cOut1000 | Classification | 2–5 | No |
The labels In and Out, which are part of the regressor names, make it easy to identify whether the selected model succeeds or fails in capturing the true underlying model.
The following DATA step generates the data:
%let nObs      = 5000;
%let nContIn   = 5;
%let nContOut  = 1000;
%let nClassIn  = 2;
%let nClassOut = 1000;
%let maxLevs   = 5;
%let noiseScale= 1;

data ex7Data;
   array xIn{&nContIn};
   array xOut{&nContOut};
   array cIn{&nClassIn};
   array cOut{&nClassOut};

   drop i j sign nLevs xBeta;

   do i=1 to &nObs;
      sign  = -1;
      xBeta = 0;
      do j=1 to dim(xIn);
         xIn{j} = ranuni(1);
         xBeta  = xBeta + sqrt(j)*sign*xIn{j};
         sign   = -sign;
      end;
      do j=1 to dim(xOut);
         xOut{j} = ranuni(1);
      end;
      do j=1 to dim(cIn);
         nLevs  = 2 + mod(j,&maxlevs-1);
         cIn{j} = 1+int(ranuni(1)*nLevs);
         xBeta  = xBeta + j*sign*(cIn{j}-nLevs/2);
         sign   = -sign;
      end;
      do j=1 to dim(cOut);
         nLevs   = 2 + mod(j,&maxlevs-1);
         cOut{j} = 1+int(ranuni(1)*nLevs);
      end;
      y = xBeta + &noiseScale*rannor(1);
      output;
   end;
run;
The following statements apply the LASSO method for model selection but do not specify any screening technique:
proc glmselect data=ex7Data;
   class c:;
   model y = x: c: / selection=lasso;
run;
Output 49.7.1 and Output 49.7.2 show the number of observations and the number of effects, respectively. You can see that 5,000 observations are used for training and the number of effects after splits is 4,513.
Output 49.7.1: Number of Training Observations Table
Output 49.7.2: Dimensions
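The count of 4,513 effects is consistent with the level formula in the DATA step if you assume that LASSO splits each classification variable into one effect per level and that the intercept is counted as an effect. The following DATA step is a small sketch of that arithmetic. It reuses the macro variables defined earlier; the variable names nCont, nClassLevels, and nEffects are illustrative only.

```sas
data _null_;
   /* Continuous regressors: one effect each */
   nCont = &nContIn + &nContOut;

   /* Classification regressors: one split effect per level,  */
   /* using the same level formula as the data-generating step */
   nClassLevels = 0;
   do j = 1 to &nClassIn;
      nClassLevels + (2 + mod(j, &maxLevs - 1));
   end;
   do j = 1 to &nClassOut;
      nClassLevels + (2 + mod(j, &maxLevs - 1));
   end;

   /* Add 1 for the intercept (an assumption about how the table counts) */
   nEffects = 1 + nCont + nClassLevels;
   put nEffects=;   /* writes nEffects=4513 to the log */
run;
```

Here the 1,005 continuous regressors, the 7 levels of cIn1 and cIn2, the 3,500 levels of the cOut variables, and the intercept add up to 4,513.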
Output 49.7.3 shows the "LASSO Selection Summary" table. You can see that the selected model contains all the predictive effects and successfully excludes all the noise effects.
Output 49.7.3: LASSO Selection Summary Table
The GLMSELECT Procedure
Without the SCREEN= option

Step | Effect Entered | Effect Removed | Number Effects In | SBC |
---|---|---|---|---|
0 | Intercept | | 1 | 10413.6418 |
1 | cIn2_1 | | 2 | 10335.5423 |
2 | cIn2_4 | | 3 | 7237.0701 |
3 | cIn1_1 | | 4 | 7186.9642 |
4 | cIn1_3 | | 5 | 6753.6076 |
5 | xIn5 | | 6 | 6609.0322 |
6 | cIn2_3 | | 7 | 6510.2639 |
7 | xIn4 | | 8 | 5797.1607 |
8 | xIn3 | | 9 | 5487.4432 |
9 | xIn2 | | 10 | 2894.0166 |
10 | xIn1 | | 11 | 310.9340* |

* Optimal Value of Criterion
You can improve the computational performance of the LASSO method by using the SCREEN=SASVI option as follows:
proc glmselect data=ex7Data;
   class c:;
   model y = x: c: / selection=lasso(screen=sasvi);
run;
Output 49.7.4 shows that the safe screening approach SASVI is used.
Output 49.7.4: Screening Information
Output 49.7.5 shows the "LASSO Selection Summary" table, which is exactly the same as Output 49.7.3. If you compare the "Parameter Estimates" tables that correspond to Output 49.7.3 and Output 49.7.5, you can see that the parameter estimates are also exactly the same, because SASVI is a safe screening approach: it guarantees the same solution as the unscreened fit, usually in less time. For this example, PROC GLMSELECT runs about twice as fast with SCREEN=SASVI as it does when no screening is specified.
Output 49.7.5: LASSO Selection Summary Table
The GLMSELECT Procedure
SCREEN=SASVI

Step | Effect Entered | Effect Removed | Number Effects In | SBC |
---|---|---|---|---|
0 | Intercept | | 1 | 10413.6418 |
1 | cIn2_1 | | 2 | 10335.5423 |
2 | cIn2_4 | | 3 | 7237.0701 |
3 | cIn1_1 | | 4 | 7186.9642 |
4 | cIn1_3 | | 5 | 6753.6076 |
5 | xIn5 | | 6 | 6609.0322 |
6 | cIn2_3 | | 7 | 6510.2639 |
7 | xIn4 | | 8 | 5797.1607 |
8 | xIn3 | | 9 | 5487.4432 |
9 | xIn2 | | 10 | 2894.0166 |
10 | xIn1 | | 11 | 310.9340* |

* Optimal Value of Criterion
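One way to check that the parameter estimates agree with and without SASVI screening is to capture the "Parameter Estimates" table from each run and compare the two data sets. The following statements are a minimal sketch; the data set names peNone and peSasvi are illustrative choices and are not part of the original example.

```sas
/* Capture the parameter estimates from the run without screening */
ods output ParameterEstimates=peNone;
proc glmselect data=ex7Data;
   class c:;
   model y = x: c: / selection=lasso;
run;

/* Capture the parameter estimates from the run that uses SASVI screening */
ods output ParameterEstimates=peSasvi;
proc glmselect data=ex7Data;
   class c:;
   model y = x: c: / selection=lasso(screen=sasvi);
run;

/* Report any differences between the two captured tables */
proc compare base=peNone compare=peSasvi;
run;
```

If PROC COMPARE reports differences, check whether they are only at the level of floating-point rounding before concluding that the two fits disagree.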
As an alternative to screening by SASVI, you can specify SCREEN=SIS for sure independence screening as follows:
proc glmselect data=ex7Data;
   class c:;
   model y = x: c: / selection=lasso(screen=sis(keepnum=15));
run;
Output 49.7.6 shows that sure independence screening is used and that 15 effects are requested to be kept before LASSO is applied for model selection. The last two rows of Output 49.7.6 show the actual number of kept effects and the ratio of that number to the total number of effects.
Output 49.7.6: Screening Information
Output 49.7.7 shows the maximum absolute correlation with the response for the 20 highest-ranked effects. Because you specified KEEPNUM=15, only the leading 15 effects are kept for model selection by LASSO. In this example, you can see that all the predictive effects are among the top 15. Note that if the screening stage does not keep an effect, that effect has no chance of being selected in the final model.
Output 49.7.7: Effect Screening by Correlation
The GLMSELECT Procedure
SCREEN=SIS

Effect Screening by Correlation

Rank | Effect | Maximum Absolute Correlation |
---|---|---|
1 | cIn2_1 | 0.621286 |
2 | cIn2_4 | 0.611998 |
3 | cIn1_1 | 0.261449 |
4 | cIn1_3 | 0.259374 |
5 | cIn2_3 | 0.215103 |
6 | xIn5 | 0.212433 |
7 | xIn4 | 0.201704 |
8 | cIn2_2 | 0.195087 |
9 | xIn3 | 0.174133 |
10 | xIn2 | 0.170705 |
11 | xIn1 | 0.083742 |
12 | cOut447_4 | 0.061958 |
13 | cOut238_1 | 0.052981 |
14 | cOut538_4 | 0.052165 |
15 | cOut747_5 | 0.051390 |
16 | xOut345 | 0.050114* |
17 | cOut511_5 | 0.047524* |
18 | cOut630_2 | 0.047440* |
19 | cOut672_2 | 0.045937* |
20 | cOut672_1 | 0.045937* |

* Screened Out Effect
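To get a feel for ranking effects by their correlation with the response, you can rank the continuous regressors by their absolute Pearson correlation with y. The following statements are a rough sketch for illustration only: they cover only the continuous regressors, they are not necessarily how PROC GLMSELECT computes its screening statistic, and the data set and variable names (corrOut, corrLong, regressor, absCorr) are illustrative.

```sas
/* Correlation of y with every continuous regressor */
proc corr data=ex7Data noprint outp=corrOut;
   var xIn: xOut:;
   with y;
run;

/* Turn the single CORR row into one row per regressor.        */
/* The ID statement names the transposed column y, after the    */
/* WITH variable stored in _NAME_.                              */
proc transpose data=corrOut(where=(_type_='CORR'))
               out=corrLong name=regressor;
   id _name_;
   var xIn: xOut:;
run;

/* Rank the regressors by absolute correlation */
data corrLong;
   set corrLong;
   absCorr = abs(y);
run;

proc sort data=corrLong;
   by descending absCorr;
run;

/* Show the ten highest-ranked continuous regressors */
proc print data=corrLong(obs=10);
   var regressor absCorr;
run;
```

A ranking of this kind looks at each regressor in isolation, which is why a noise regressor that happens to correlate with the response can outrank a weak but genuine predictor.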
Output 49.7.8 shows the "LASSO Selection Summary" table. As in Output 49.7.3, all the predictive effects enter the model. However, one noise effect (cOut747_5) also enters the model, because SCREEN=SIS does not guarantee the same solution that you obtain when SCREEN=NONE. For this example, PROC GLMSELECT runs only slightly faster when SCREEN=SIS than it does when SCREEN=SASVI, and about twice as fast as it does when SCREEN=NONE. However, for problems that have more predictors or that use a much more computationally intensive CHOOSE= criterion, sure independence screening (SIS) can run faster by orders of magnitude.
Output 49.7.8: LASSO Selection Summary Table
The GLMSELECT Procedure
SCREEN=SIS

Step | Effect Entered | Effect Removed | Number Effects In | SBC |
---|---|---|---|---|
0 | Intercept | | 1 | 10413.6418 |
1 | cIn2_1 | | 2 | 10335.5423 |
2 | cIn2_4 | | 3 | 7237.0701 |
3 | cIn1_1 | | 4 | 7186.9642 |
4 | cIn1_3 | | 5 | 6753.6076 |
5 | xIn5 | | 6 | 6609.0322 |
6 | cIn2_3 | | 7 | 6510.2639 |
7 | xIn4 | | 8 | 5797.1607 |
8 | xIn3 | | 9 | 5487.4432 |
9 | xIn2 | | 10 | 2894.0166 |
10 | xIn1 | | 11 | 245.9693 |
11 | cOut747_5 | | 12 | 238.4427* |

* Optimal Value of Criterion
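Run times depend on your hardware and SAS release, so you might want to measure the three screening settings on your own system. The following macro is a minimal sketch that times each run with the DATETIME function; the macro name timeLasso and its parameter are hypothetical helpers introduced only for this illustration.

```sas
%macro timeLasso(screen);
   %local t0 elapsed;

   /* Record the start time in seconds */
   %let t0 = %sysfunc(datetime());

   proc glmselect data=ex7Data;
      class c:;
      model y = x: c: / selection=lasso(screen=%unquote(&screen));
   run;

   /* Elapsed wall-clock time for this run */
   %let elapsed = %sysevalf(%sysfunc(datetime()) - &t0);
   %put NOTE: SCREEN=&screen took &elapsed seconds.;
%mend timeLasso;

%timeLasso(none)
%timeLasso(sasvi)
%timeLasso(%str(sis(keepnum=15)))
```

The elapsed times are written to the SAS log. Because each call measures a single run, repeat the calls a few times before you draw conclusions about relative speed.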