This example shows how you can use the SCREEN= option to speed up model selection when you have a large number of regressors. To demonstrate the efficiency that screening brings to model selection, this example uses simulated data in which the response y depends systematically on a relatively small subset of a much larger set of regressors, which is described in Table 48.13.
Table 48.13: Complete Set of Regressors
Regressor Name    Type            Number of Levels  In True Model
xIn1–xIn5         Continuous                        Yes
xOut1–xOut1000    Continuous                        No
cIn1–cIn2         Classification  2–5               Yes
cOut1–cOut1000    Classification  2–5               No
The labels In and Out, which are part of the regressor names, make it easy to identify whether the selected model succeeds or fails in capturing the true underlying model.
The following DATA step generates the data:
   %let nObs      = 5000;
   %let nContIn   = 5;
   %let nContOut  = 1000;
   %let nClassIn  = 2;
   %let nClassOut = 1000;
   %let maxLevs   = 5;
   %let noiseScale= 1;

   data ex7Data;
      array xIn{&nContIn};
      array xOut{&nContOut};
      array cIn{&nClassIn};
      array cOut{&nClassOut};

      drop i j sign nLevs xBeta;

      do i=1 to &nObs;
         sign  = 1;
         xBeta = 0;

         /* predictive continuous regressors: coefficients alternate in sign */
         do j=1 to dim(xIn);
            xIn{j} = ranuni(1);
            xBeta  = xBeta + sqrt(j)*sign*xIn{j};
            sign   = -sign;
         end;

         /* noise continuous regressors */
         do j=1 to dim(xOut);
            xOut{j} = ranuni(1);
         end;

         /* predictive classification regressors that have 2 to &maxLevs levels */
         do j=1 to dim(cIn);
            nLevs  = 2 + mod(j,&maxLevs-1);
            cIn{j} = 1+int(ranuni(1)*nLevs);
            xBeta  = xBeta + j*sign*(cIn{j}-nLevs/2);
            sign   = -sign;
         end;

         /* noise classification regressors */
         do j=1 to dim(cOut);
            nLevs   = 2 + mod(j,&maxLevs-1);
            cOut{j} = 1+int(ranuni(1)*nLevs);
         end;

         y = xBeta + &noiseScale*rannor(1);
         output;
      end;
   run;
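As a quick sanity check on the simulated data, the following sketch counts the distinct levels of the two predictive classification variables. It assumes that the DATA step assigns 2 + mod(j, &maxLevs-1) levels to the jth classification variable, in which case cIn1 would have three levels and cIn2 would have four. The use of PROC FREQ with the NLEVELS option is an illustration and is not part of the original example.

```sas
/* Sketch: display the number of distinct levels of cIn1 and cIn2.     */
/* NOPRINT in the TABLES statement suppresses the frequency tables so */
/* that only the "Number of Variable Levels" table is displayed.      */
proc freq data=ex7Data nlevels;
   tables cIn1 cIn2 / noprint;
run;
```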
The following statements apply the LASSO method for model selection but do not specify any screening technique:
   proc glmselect data=ex7Data;
      class c:;
      model y = x: c: / selection=lasso;
   run;
Output 48.7.1 and Output 48.7.2 show the number of observations and the number of effects, respectively. You can see that 5,000 observations are used for training and the number of effects after splits is 4,513.
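One way to account for the 4,513 effects is the following tally. It is a sketch that assumes each level of a classification variable becomes a separate effect after splitting, that the intercept is included in the count, and that the jth classification variable has 2 + mod(j, 4) levels (which follows from maxLevs = 5):

```latex
\underbrace{1}_{\text{intercept}}
+ \underbrace{5 + 1000}_{\text{xIn, xOut}}
+ \underbrace{3 + 4}_{\text{cIn1, cIn2}}
+ \underbrace{250\,(3 + 4 + 5 + 2)}_{\text{cOut1 to cOut1000}}
= 1 + 1005 + 7 + 3500 = 4513
```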
Output 48.7.3 shows the "LASSO Selection Summary" table. You can see that the selected model contains all the predictive effects and successfully excludes all the noise effects.
Output 48.7.3: LASSO Selection Summary Table
The GLMSELECT Procedure 
Without the SCREEN= option 
Step  Effect Entered  Effect Removed  Number Effects In         SBC
   0  Intercept                                       1  10413.6418
   1  cIn2_1                                          2  10335.5423
   2  cIn2_4                                          3   7237.0701
   3  cIn1_1                                          4   7186.9642
   4  cIn1_3                                          5   6753.6076
   5  xIn5                                            6   6609.0322
   6  cIn2_3                                          7   6510.2639
   7  xIn4                                            8   5797.1607
   8  xIn3                                            9   5487.4432
   9  xIn2                                           10   2894.0166
  10  xIn1                                           11    310.9340*
* Optimal Value of Criterion 
You can improve the computational performance of the LASSO method by using the SCREEN=SASVI option as follows:
   proc glmselect data=ex7Data;
      class c:;
      model y = x: c: / selection=lasso(screen=sasvi);
   run;
Output 48.7.4 shows that the safe screening approach SASVI is used.
Output 48.7.5 shows the "LASSO Selection Summary" table, which is exactly the same as Output 48.7.3. If you compare the "Parameter Estimates" tables that correspond to Output 48.7.3 and Output 48.7.5, you can see that the parameter estimates are also exactly the same, because SASVI is a safe screening approach that guarantees the same solution, usually in less time. For this example, PROC GLMSELECT runs about twice as fast when you specify SCREEN=SASVI as it does when you specify no screening.
Output 48.7.5: LASSO Selection Summary Table
The GLMSELECT Procedure 
SCREEN=SASVI 
Step  Effect Entered  Effect Removed  Number Effects In         SBC
   0  Intercept                                       1  10413.6418
   1  cIn2_1                                          2  10335.5423
   2  cIn2_4                                          3   7237.0701
   3  cIn1_1                                          4   7186.9642
   4  cIn1_3                                          5   6753.6076
   5  xIn5                                            6   6609.0322
   6  cIn2_3                                          7   6510.2639
   7  xIn4                                            8   5797.1607
   8  xIn3                                            9   5487.4432
   9  xIn2                                           10   2894.0166
  10  xIn1                                           11    310.9340*
* Optimal Value of Criterion 
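To check the timing difference on your own system, you can compare the real time that the SAS log reports for each PROC GLMSELECT step. The following sketch puts the two runs side by side. The macro name timeLasso and its screen= parameter are hypothetical and are introduced only for illustration; the elapsed time that it reports is coarse (roughly a tenth of a second, because of how %SYSFUNC formats the datetime value) and includes the time spent producing output.

```sas
/* Hypothetical helper macro (not part of SAS): runs the LASSO selection */
/* with a given SCREEN= setting and writes the elapsed wall-clock time   */
/* to the log.                                                           */
%macro timeLasso(screen=none);
   %local t0 elapsed;
   %let t0 = %sysfunc(datetime());          /* start time in seconds */

   proc glmselect data=ex7Data;
      class c:;
      model y = x: c: / selection=lasso(screen=&screen);
   run;

   /* %SYSEVALF is needed because the datetime values are not integers */
   %let elapsed = %sysevalf(%sysfunc(datetime()) - &t0);
   %put NOTE: SCREEN=&screen took about &elapsed seconds.;
%mend timeLasso;

%timeLasso(screen=none)
%timeLasso(screen=sasvi)
```

Actual timings depend on hardware and system load, so the ratio that you observe might differ from the one reported in this example.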
As an alternative to screening by SASVI, you can specify SCREEN=SIS for sure independence screening as follows:
   proc glmselect data=ex7Data;
      class c:;
      model y = x: c: / selection=lasso(screen=sis(keepnum=15));
   run;
Output 48.7.6 shows that sure independence screening is used and that the number of effects to be kept before applying LASSO for model selection is 15. The last two rows of Output 48.7.6 show the actual number of kept effects and the ratio of that number to the total number of effects.
Output 48.7.7 shows the absolute correlation between the effects and the response that corresponds to the leading 20 effects. Because you specified KEEPNUM=15, only the leading 15 effects are kept for model selection by LASSO. In this example, you can see that all the predictive effects are among the top 15. Note that if the screening stage does not keep an effect, that effect has no chance of being selected in the final model.
Output 48.7.7: Effect Screening by Correlation
The GLMSELECT Procedure 
SCREEN=SIS 
Effect Screening by Correlation  

Rank  Effect  Maximum Absolute Correlation 
1  cIn2_1  0.621286 
2  cIn2_4  0.611998 
3  cIn1_1  0.261449 
4  cIn1_3  0.259374 
5  cIn2_3  0.215103 
6  xIn5  0.212433 
7  xIn4  0.201704 
8  cIn2_2  0.195087 
9  xIn3  0.174133 
10  xIn2  0.170705 
11  xIn1  0.083742 
12  cOut447_4  0.061958 
13  cOut238_1  0.052981 
14  cOut538_4  0.052165 
15  cOut747_5  0.051390 
16  xOut345  0.050114* 
17  cOut511_5  0.047524* 
18  cOut630_2  0.047440* 
19  cOut672_2  0.045937* 
20  cOut672_1  0.045937* 
* Screened Out Effect 
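The ranking idea behind the preceding table can be sketched outside PROC GLMSELECT for the continuous regressors. The following steps compute the Pearson correlation of each continuous regressor with y, take absolute values, and list the 15 largest. This is an illustration under stated assumptions, not the procedure's internal implementation: it ignores the classification effects, and the data set names corrOut and sisRank and the variables effect, absCorr, and k are hypothetical names chosen for this sketch.

```sas
/* Correlate y with every continuous regressor; OUTP= stores the results */
proc corr data=ex7Data noprint outp=corrOut;
   var  xIn: xOut:;
   with y;
run;

/* Turn the single row of correlations into one observation per regressor */
data sisRank;
   set corrOut(where=(_TYPE_='CORR'));
   array r{*} xIn: xOut:;
   length effect $32;
   do k = 1 to dim(r);
      effect  = vname(r{k});     /* name of the regressor          */
      absCorr = abs(r{k});       /* absolute correlation with y    */
      output;
   end;
   keep effect absCorr;
run;

/* Rank the regressors from strongest to weakest marginal correlation */
proc sort data=sisRank;
   by descending absCorr;
run;

/* Display the top 15, mirroring KEEPNUM=15 */
proc print data=sisRank(obs=15);
run;
```

Because the ranking uses only the marginal correlation of each regressor with the response, a regressor that matters only jointly with others could rank low, which is why a screened-out effect has no chance of entering the final model.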
Output 48.7.8 shows the "LASSO Selection Summary" table. As in Output 48.7.3, all the predictive effects enter the model. However, one noise effect (cOut747_5, which ranked 15th in the screening stage) also enters the model, because SCREEN=SIS does not guarantee the same solution as the one that results when SCREEN=NONE. For this example, PROC GLMSELECT runs only slightly faster when you specify SCREEN=SIS than it does when you specify SCREEN=SASVI, although it runs about twice as fast as it does when SCREEN=NONE. However, for problems that have more predictors or that use a much more computationally intensive CHOOSE= criterion, sure independence screening (SIS) can run faster by orders of magnitude.
Output 48.7.8: LASSO Selection Summary Table
The GLMSELECT Procedure 
SCREEN=SIS 
Step  Effect Entered  Effect Removed  Number Effects In         SBC
   0  Intercept                                       1  10413.6418
   1  cIn2_1                                          2  10335.5423
   2  cIn2_4                                          3   7237.0701
   3  cIn1_1                                          4   7186.9642
   4  cIn1_3                                          5   6753.6076
   5  xIn5                                            6   6609.0322
   6  cIn2_3                                          7   6510.2639
   7  xIn4                                            8   5797.1607
   8  xIn3                                            9   5487.4432
   9  xIn2                                           10   2894.0166
  10  xIn1                                           11    245.9693
  11  cOut747_5                                      12    238.4427*
* Optimal Value of Criterion 