The GLMSELECT Procedure

Example 49.7 LASSO with Screening

This example shows how you can use the SCREEN= option to speed up model selection when you have a large number of regressors. To demonstrate the efficiency that screening brings to model selection, this example uses simulated data in which the response y depends systematically on a relatively small subset of a much larger set of regressors. The complete set of regressors is described in Table 49.13.

Table 49.13: Complete Set of Regressors

Regressor Name    Type             Number of Levels   In True Model
xIn1–xIn5         Continuous                          Yes
xOut1–xOut1000    Continuous                          No
cIn1–cIn2         Classification   2–5                Yes
cOut1–cOut1000    Classification   2–5                No


The labels In and Out, which are part of the regressor names, make it easy to identify whether the selected model succeeds or fails in capturing the true underlying model.

The following DATA step generates the data:

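%let nObs      = 5000;   /* number of observations                      */
%let nContIn   = 5;      /* continuous regressors in the true model     */
%let nContOut  = 1000;   /* continuous noise regressors                 */
%let nClassIn  = 2;      /* classification regressors in the true model */
%let nClassOut = 1000;   /* classification noise regressors             */
%let maxLevs   = 5;      /* maximum number of classification levels     */
%let noiseScale= 1;      /* scale of the normal error term              */

data ex7Data;
   array xIn{&nContIn};
   array xOut{&nContOut};
   array cIn{&nClassIn};
   array cOut{&nClassOut};

   drop i j sign nLevs xBeta;

   do i=1 to &nObs;
      sign  = -1;
      xBeta = 0;

      /* Continuous regressors that contribute to the response */
      do j=1 to dim(xIn);
         xIn{j} = ranuni(1);
         xBeta  = xBeta + sqrt(j)*sign*xIn{j};
         sign   = -sign;
      end;

      /* Continuous noise regressors */
      do j=1 to dim(xOut);
         xOut{j} = ranuni(1);
      end;

      /* Classification regressors that contribute to the response */
      do j=1 to dim(cIn);
         nLevs  = 2 + mod(j,&maxlevs-1);
         cIn{j} = 1+int(ranuni(1)*nLevs);
         xBeta  = xBeta + j*sign*(cIn{j}-nLevs/2);
         sign   = -sign;
      end;

      /* Classification noise regressors */
      do j=1 to dim(cOut);
         nLevs  = 2 + mod(j,&maxlevs-1);
         cOut{j} = 1+int(ranuni(1)*nLevs);
      end;

      /* Response = linear predictor + scaled standard normal noise */
      y = xBeta + &noiseScale*rannor(1);

      output;
   end;
run;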
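
To make the true model explicit, the coefficient logic in the DATA step can be replayed outside SAS. The following Python sketch is an illustration only and is not part of the example; the names `cont_coefs` and `class_terms` are hypothetical. It mirrors the `sign` and `nLevs` updates in the DATA step and makes no assumption about PROC GLMSELECT itself.

```python
import math

# Values of the macro variables in the DATA step (Python names are illustrative).
n_cont_in, n_class_in, max_levs = 5, 2, 5

sign = -1
cont_coefs = []                       # coefficient applied to xIn{j}
for j in range(1, n_cont_in + 1):
    cont_coefs.append(sign * math.sqrt(j))
    sign = -sign                      # sign alternates and carries into the next loop

class_terms = []                      # (number of levels, slope) for cIn{j}
for j in range(1, n_class_in + 1):
    n_levs = 2 + j % (max_levs - 1)   # same rule as nLevs = 2 + mod(j, maxLevs-1)
    class_terms.append((n_levs, sign * j))
    sign = -sign

print([round(c, 4) for c in cont_coefs])
print(class_terms)
```

Read this way, the linear predictor is -xIn1 + sqrt(2)*xIn2 - sqrt(3)*xIn3 + 2*xIn4 - sqrt(5)*xIn5 + (cIn1 - 1.5) - 2*(cIn2 - 2), where cIn1 has three levels and cIn2 has four.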

The following statements apply the LASSO method for model selection but do not specify any screening technique:

proc glmselect data=ex7Data;
   class c:;
   model y = x: c:/
         selection=lasso;
run;

Output 49.7.1 and Output 49.7.2 show the number of observations and the number of effects, respectively. You can see that 5,000 observations are used for training and the number of effects after splits is 4,513.

Output 49.7.1: Number of Training Observations Table

The GLMSELECT Procedure

Number of Observations Read 5000
Number of Observations Used 5000



Output 49.7.2: Dimensions

Dimensions
Number of Effects 2008
Number of Effects after Splits 4513
Number of Parameters 4513
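
The counts in Output 49.7.2 can be checked by hand from the DATA step. The Python sketch below is an illustration only. It assumes that the intercept counts as one effect and that, after splitting, each level of a classification variable becomes its own column, which is consistent with effect names such as cIn2_1 in the selection summary.

```python
n_cont_in, n_cont_out = 5, 1000
n_class_in, n_class_out = 2, 1000
max_levs = 5

def n_levels(j):
    # Same rule as the DATA step: nLevs = 2 + mod(j, maxLevs-1)
    return 2 + j % (max_levs - 1)

# Effects before splitting: intercept + one effect per regressor
n_effects = 1 + n_cont_in + n_cont_out + n_class_in + n_class_out

# After splitting (assumption): one column per level of each classification variable
class_columns = (sum(n_levels(j) for j in range(1, n_class_in + 1))
                 + sum(n_levels(j) for j in range(1, n_class_out + 1)))
n_after_splits = 1 + n_cont_in + n_cont_out + class_columns

print(n_effects, n_after_splits)
```

Under these assumptions the two totals agree with the "Number of Effects" and "Number of Effects after Splits" rows of the table.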



Output 49.7.3 shows the "LASSO Selection Summary" table. You can see that the selected model contains all the predictive effects and successfully excludes all the noise effects.

Output 49.7.3: LASSO Selection Summary Table

The GLMSELECT Procedure
 
Without the SCREEN= option
 

Step  Effect Entered  Effect Removed  Number Effects In  SBC
0     Intercept                       1                  10413.6418
1     cIn2_1                          2                  10335.5423
2     cIn2_4                          3                  7237.0701
3     cIn1_1                          4                  7186.9642
4     cIn1_3                          5                  6753.6076
5     xIn5                            6                  6609.0322
6     cIn2_3                          7                  6510.2639
7     xIn4                            8                  5797.1607
8     xIn3                            9                  5487.4432
9     xIn2                            10                 2894.0166
10    xIn1                            11                 310.9340*
* Optimal Value of Criterion



You can improve the computational performance of the LASSO method by using the SCREEN=SASVI option as follows:

proc glmselect data=ex7Data;
   class c:;
   model y = x: c:/
         selection=lasso(screen=sasvi);
run;

Output 49.7.4 shows that the safe screening approach SASVI is used.

Output 49.7.4: Screening Information

The GLMSELECT Procedure

Screening Information
Screening Method SASVI (safe screening)



Output 49.7.5 shows the "LASSO Selection Summary" table, which is exactly the same as Output 49.7.3. If you compare the "Parameter Estimates" tables that correspond to Output 49.7.3 and Output 49.7.5, you can see that the parameter estimates are exactly the same, because SASVI is a safe screening approach that guarantees the same solution, usually in less time. PROC GLMSELECT runs about twice as fast when SCREEN=SASVI as it runs when no screening is specified.

Output 49.7.5: LASSO Selection Summary Table

The GLMSELECT Procedure
 
SCREEN=SASVI
 

Step  Effect Entered  Effect Removed  Number Effects In  SBC
0     Intercept                       1                  10413.6418
1     cIn2_1                          2                  10335.5423
2     cIn2_4                          3                  7237.0701
3     cIn1_1                          4                  7186.9642
4     cIn1_3                          5                  6753.6076
5     xIn5                            6                  6609.0322
6     cIn2_3                          7                  6510.2639
7     xIn4                            8                  5797.1607
8     xIn3                            9                  5487.4432
9     xIn2                            10                 2894.0166
10    xIn1                            11                 310.9340*
* Optimal Value of Criterion



As an alternative to screening by SASVI, you can specify SCREEN=SIS for sure independence screening as follows:

proc glmselect data=ex7Data;
   class c:;
   model y = x: c:/
         selection=lasso(screen=sis(keepnum=15));
run;

Output 49.7.6 shows that sure independence screening is used and that 15 effects are to be kept before LASSO is applied for model selection. The last two rows of Output 49.7.6 show the number of effects that are actually kept and the ratio of that number to the total number of effects.

Output 49.7.6: Screening Information

The GLMSELECT Procedure

Screening Information
Screening Method SIS (sure independence screening)
Screening Keep Number 15
Actual Kept Number 15
Actual Kept Ratio 0.003
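
The kept ratio in the last row can be reproduced with a one-line calculation. The Python sketch below is an illustration only; it assumes that the denominator is the number of effects after splits (4,513) reported for this data set, which the output itself does not state.

```python
kept = 15              # KEEPNUM= value, also the "Actual Kept Number"
total_columns = 4513   # assumption: "Number of Effects after Splits"

ratio = kept / total_columns
print(round(ratio, 3))
```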



Output 49.7.7 shows the absolute correlations with the response for the 20 leading effects. Because you specified KEEPNUM=15, only the leading 15 effects are kept for model selection by LASSO. In this example, you can see that all the predictive effects are among the top 15. Note that if the screening stage does not keep an effect, that effect has no chance of being selected in the final model.

Output 49.7.7: Effect Screening by Correlation Table

The GLMSELECT Procedure
 
SCREEN=SIS
 

Effect Screening by Correlation
Rank Effect Maximum Absolute Correlation
1 cIn2_1 0.621286
2 cIn2_4 0.611998
3 cIn1_1 0.261449
4 cIn1_3 0.259374
5 cIn2_3 0.215103
6 xIn5 0.212433
7 xIn4 0.201704
8 cIn2_2 0.195087
9 xIn3 0.174133
10 xIn2 0.170705
11 xIn1 0.083742
12 cOut447_4 0.061958
13 cOut238_1 0.052981
14 cOut538_4 0.052165
15 cOut747_5 0.051390
16 xOut345 0.050114*
17 cOut511_5 0.047524*
18 cOut630_2 0.047440*
19 cOut672_2 0.045937*
20 cOut672_1 0.045937*
* Screened Out Effect
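
The ranking idea behind a table like Output 49.7.7 can be sketched in a few lines. The Python code below is a minimal illustration of correlation-based screening on a small simulated data set, not the implementation that PROC GLMSELECT uses. The helper names `abs_corr` and `sis_keep`, the data sizes, and the coefficients are all assumptions made for the illustration.

```python
import math
import random

def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)

def sis_keep(columns, y, keep_num):
    """Rank columns by |correlation| with y; return the names of the top keep_num."""
    ranked = sorted(columns, key=lambda name: abs_corr(columns[name], y),
                    reverse=True)
    return ranked[:keep_num]

random.seed(1)
n = 500

# 30 noise columns and 2 signal columns (hypothetical, much smaller than ex7Data)
columns = {f"xOut{j}": [random.random() for _ in range(n)] for j in range(1, 31)}
columns["xIn1"] = [random.random() for _ in range(n)]
columns["xIn2"] = [random.random() for _ in range(n)]
y = [3 * a - 3 * b + random.gauss(0, 0.3)
     for a, b in zip(columns["xIn1"], columns["xIn2"])]

kept = sis_keep(columns, y, keep_num=5)
print(kept)
```

Only the columns in `kept` would be passed on to the selection step. A column that is ranked below the cutoff is dropped for good, which is why this kind of screening cannot guarantee the same model as an unscreened run.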



Output 49.7.8 shows the "LASSO Selection Summary" table. As in Output 49.7.3, all the predictive effects enter the model. However, one noise effect (cOut747_5) also enters the model, because SCREEN=SIS does not guarantee the same solution as the one that you obtain when SCREEN=NONE. For this example, PROC GLMSELECT runs only slightly faster when SCREEN=SIS than it does when SCREEN=SASVI, and about twice as fast as it does when SCREEN=NONE. However, for problems that have more predictors or that use a much more computationally intensive CHOOSE= criterion, sure independence screening (SIS) can run faster by orders of magnitude.

Output 49.7.8: LASSO Selection Summary Table

The GLMSELECT Procedure
 
SCREEN=SIS
 

Step  Effect Entered  Effect Removed  Number Effects In  SBC
0     Intercept                       1                  10413.6418
1     cIn2_1                          2                  10335.5423
2     cIn2_4                          3                  7237.0701
3     cIn1_1                          4                  7186.9642
4     cIn1_3                          5                  6753.6076
5     xIn5                            6                  6609.0322
6     cIn2_3                          7                  6510.2639
7     xIn4                            8                  5797.1607
8     xIn3                            9                  5487.4432
9     xIn2                            10                 2894.0166
10    xIn1                            11                 245.9693
11    cOut747_5                       12                 238.4427*
* Optimal Value of Criterion