The GLMSELECT Procedure

Example 49.7 LASSO with Screening

This example shows how you can use the SCREEN= option to speed up model selection when you have a large number of regressors. To demonstrate how screening makes model selection more efficient, this example uses simulated data in which the response y depends systematically on a relatively small subset of a much larger set of regressors. Table 49.13 describes the complete set of regressors.

Table 49.13: Complete Set of Regressors

Regressor Name	Type	Number of Levels	In True Model
`xIn1–xIn5`	Continuous		Yes
`xOut1–xOut1000`	Continuous		No
`cIn1–cIn2`	Classification	2–5	Yes
`cOut1–cOut1000`	Classification	2–5	No

The labels In and Out, which are part of the regressor names, make it easy to identify whether the selected model succeeds or fails in capturing the true underlying model.

The following DATA step generates the data:

%let nObs      = 5000;
%let nContIn   = 5;
%let nContOut  = 1000;
%let nClassIn  = 2;
%let nClassOut = 1000;
%let maxLevs   = 5;
%let noiseScale= 1;  

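data ex7Data;
   array xIn{&nContIn};
   array xOut{&nContOut};
   array cIn{&nClassIn};
   array cOut{&nClassOut};

   drop i j sign nLevs xBeta;

   do i=1 to &nObs;
      sign  = -1;
      xBeta = 0;

      /* continuous regressors in the true model, with alternating signs */
      do j=1 to dim(xIn);
         xIn{j} = ranuni(1);
         xBeta  = xBeta + sqrt(j)*sign*xIn{j};
         sign   = -sign;
      end;

      /* continuous noise regressors: no contribution to xBeta */
      do j=1 to dim(xOut);
         xOut{j} = ranuni(1);
      end;

      /* classification regressors in the true model, each with 2 to &maxLevs levels */
      do j=1 to dim(cIn);
         nLevs  = 2 + mod(j,&maxLevs-1);
         cIn{j} = 1+int(ranuni(1)*nLevs);
         xBeta  = xBeta + j*sign*(cIn{j}-nLevs/2);
         sign   = -sign;
      end;

      /* classification noise regressors: no contribution to xBeta */
      do j=1 to dim(cOut);
         nLevs  = 2 + mod(j,&maxLevs-1);
         cOut{j} = 1+int(ranuni(1)*nLevs);
      end;

      /* response = linear predictor + scaled standard normal noise */
      y = xBeta + &noiseScale*rannor(1);

      output;
   end;
run;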
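Reading the DATA step term by term gives the true model below. This is a derivation from the code above, not a formula stated in the original example. The symbol \varepsilon is introduced here for the noise term, and the values shown use the settings nContIn=5, nClassIn=2, maxLevs=5, and noiseScale=1.

```latex
y \;=\; \sum_{j=1}^{5} (-1)^{j}\sqrt{j}\,\mathrm{xIn}_j
      \;+\; \Bigl(\mathrm{cIn}_1 - \tfrac{3}{2}\Bigr)
      \;-\; 2\,\bigl(\mathrm{cIn}_2 - 2\bigr)
      \;+\; \varepsilon,
\qquad \varepsilon \sim N(0,1)
```

Here each xIn_j is uniform on (0,1), cIn_1 takes the values 1 to 3, and cIn_2 takes the values 1 to 4. The sign variable flips five times in the continuous loop, so it equals +1 when the classification loop begins.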

The following statements apply the LASSO method for model selection but do not specify any screening technique:

proc glmselect data=ex7Data;
   class c:;
   model y = x: c:/
         selection=lasso;
run;

Output 49.7.1 and Output 49.7.2 show the number of observations and the number of effects, respectively. You can see that 5,000 observations are used for training and the number of effects after splits is 4,513.

Output 49.7.1: Number of Training Observations Table

The GLMSELECT Procedure

Number of Observations Read	5000
Number of Observations Used	5000

Output 49.7.2: Dimensions

Dimensions
Number of Effects	2008
Number of Effects after Splits	4513
Number of Parameters	4513
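The counts in Output 49.7.2 can be reproduced from the simulation settings. The following sketch assumes that the intercept is counted as an effect and that each level of a classification variable becomes a separate effect after splitting. The variable names nEffects and nSplit are illustrative only.

```sas
/* Same settings as in the data-generating step */
%let nContIn   = 5;
%let nContOut  = 1000;
%let nClassIn  = 2;
%let nClassOut = 1000;
%let maxLevs   = 5;

data _null_;
   /* Before splitting: intercept plus one effect per regressor */
   nEffects = 1 + &nContIn + &nContOut + &nClassIn + &nClassOut;

   /* After splitting: intercept, continuous regressors, and one
      effect for every level of every classification variable   */
   nSplit = 1 + &nContIn + &nContOut;
   do j = 1 to &nClassIn;
      nSplit = nSplit + 2 + mod(j, &maxLevs-1);
   end;
   do j = 1 to &nClassOut;
      nSplit = nSplit + 2 + mod(j, &maxLevs-1);
   end;

   put nEffects= nSplit=;   /* writes nEffects=2008 nSplit=4513 to the log */
run;
```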

Output 49.7.3 shows the "LASSO Selection Summary" table. You can see that the selected model contains all the predictive effects and successfully excludes all the noise effects.

Output 49.7.3: LASSO Selection Summary Table

The GLMSELECT Procedure

Without the SCREEN= option

Step	Effect Entered	Number Effects In	SBC
0	Intercept	1	10413.6418
1	cIn2_1	2	10335.5423
2	cIn2_4	3	7237.0701
3	cIn1_1	4	7186.9642
4	cIn1_3	5	6753.6076
5	xIn5	6	6609.0322
6	cIn2_3	7	6510.2639
7	xIn4	8	5797.1607
8	xIn3	9	5487.4432
9	xIn2	10	2894.0166
10	xIn1	11	310.9340*
* Optimal Value of Criterion

You can improve the computational performance of the LASSO method by using the SCREEN=SASVI option as follows:

proc glmselect data=ex7Data;
   class c:;
   model y = x: c:/
         selection=lasso(screen=sasvi);
run;

Output 49.7.4 shows that the safe screening approach SASVI is used.

Output 49.7.4: Screening Information

The GLMSELECT Procedure

Screening Information
Screening Method	SASVI (safe screening)

Output 49.7.5 shows the "LASSO Selection Summary" table, which is identical to Output 49.7.3. If you compare the "Parameter Estimates" tables that correspond to Output 49.7.3 and Output 49.7.5, you can see that the parameter estimates are also identical, because SASVI is a safe screening approach: it guarantees the same solution as no screening, usually in less time. For this example, PROC GLMSELECT runs about twice as fast when you specify SCREEN=SASVI as it does when you specify no screening.

Output 49.7.5: LASSO Selection Summary Table

The GLMSELECT Procedure

SCREEN=SASVI

Step	Effect Entered	Number Effects In	SBC
0	Intercept	1	10413.6418
1	cIn2_1	2	10335.5423
2	cIn2_4	3	7237.0701
3	cIn1_1	4	7186.9642
4	cIn1_3	5	6753.6076
5	xIn5	6	6609.0322
6	cIn2_3	7	6510.2639
7	xIn4	8	5797.1607
8	xIn3	9	5487.4432
9	xIn2	10	2894.0166
10	xIn1	11	310.9340*
* Optimal Value of Criterion
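One way to carry out the comparison of parameter estimates is to capture the "Parameter Estimates" table from each run with the ODS OUTPUT statement and pass the two data sets to PROC COMPARE. This is a minimal sketch; the data set names peNone and peSasvi are hypothetical.

```sas
/* Capture the parameter estimates of the selected model, no screening */
ods output ParameterEstimates=peNone;
proc glmselect data=ex7Data;
   class c:;
   model y = x: c: / selection=lasso;
run;

/* Capture the parameter estimates of the selected model, SASVI screening */
ods output ParameterEstimates=peSasvi;
proc glmselect data=ex7Data;
   class c:;
   model y = x: c: / selection=lasso(screen=sasvi);
run;

/* Report any differences between the two tables */
proc compare base=peNone compare=peSasvi;
run;
```

If the two runs produce the same solution, PROC COMPARE should report no unequal values.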

As an alternative to screening by SASVI, you can specify SCREEN=SIS for sure independence screening as follows:

proc glmselect data=ex7Data;
   class c:;
   model y = x: c:/
         selection=lasso(screen=sis(keepnum=15));
run;

Output 49.7.6 shows that sure independence screening is used and that 15 effects are requested to be kept before LASSO is applied for model selection. The last two rows of Output 49.7.6 show the number of effects that are actually kept and the ratio of that number to the total number of effects after splits (15/4513, or about 0.003).

Output 49.7.6: Screening Information

The GLMSELECT Procedure

Screening Information
Screening Method	SIS (sure independence screening)
Screening Keep Number	15
Actual Kept Number	15
Actual Kept Ratio	0.003

Output 49.7.7 shows the maximum absolute correlation with the response for each of the 20 leading effects. Because you specified KEEPNUM=15, only the leading 15 effects are kept for model selection by LASSO; the remaining effects are marked as screened out. In this example, you can see that all the predictive effects are among the top 15. Note that if the screening stage does not keep an effect, that effect has no chance of being selected in the final model.

Output 49.7.7: Effect Screening by Correlation Table

The GLMSELECT Procedure

SCREEN=SIS

Effect Screening by Correlation
Rank	Effect	Maximum Absolute Correlation
1	cIn2_1	0.621286
2	cIn2_4	0.611998
3	cIn1_1	0.261449
4	cIn1_3	0.259374
5	cIn2_3	0.215103
6	xIn5	0.212433
7	xIn4	0.201704
8	cIn2_2	0.195087
9	xIn3	0.174133
10	xIn2	0.170705
11	xIn1	0.083742
12	cOut447_4	0.061958
13	cOut238_1	0.052981
14	cOut538_4	0.052165
15	cOut747_5	0.051390
16	xOut345	0.050114*
17	cOut511_5	0.047524*
18	cOut630_2	0.047440*
19	cOut672_2	0.045937*
20	cOut672_1	0.045937*
* Screened Out Effect
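The ranking idea behind the screening table can be illustrated for the continuous regressors alone by using PROC CORR. This is a rough sketch of marginal-correlation ranking, not the procedure's internal implementation; it ignores the classification effects, and the data set names corrOut and corrLong and the variables effect and absCorr are hypothetical.

```sas
/* Pearson correlation of y with every continuous regressor */
proc corr data=ex7Data noprint outp=corrOut;
   var xIn: xOut:;
   with y;
run;

/* Reshape the single CORR row into one row per regressor */
data corrLong;
   set corrOut(where=(_type_='CORR'));
   array xr{*} xIn: xOut:;
   length effect $32;
   do k = 1 to dim(xr);
      effect  = vname(xr{k});   /* name of the regressor   */
      absCorr = abs(xr{k});     /* absolute correlation    */
      output;
   end;
   keep effect absCorr;
run;

/* Rank regressors from strongest to weakest marginal correlation */
proc sort data=corrLong;
   by descending absCorr;
run;

proc print data=corrLong(obs=10);
run;
```

Keeping only the first few rows of the sorted data set mimics what KEEPNUM= does for these regressors.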

Output 49.7.8 shows the "LASSO Selection Summary" table. As in Output 49.7.3, all the predictive effects enter the model. However, one noise effect (cOut747_5) also enters the model, because SCREEN=SIS does not guarantee the same solution that you obtain when SCREEN=NONE. For this example, PROC GLMSELECT runs only slightly faster when SCREEN=SIS than it does when SCREEN=SASVI, although it runs about twice as fast as it does when SCREEN=NONE. However, for problems that have more predictors or that use a much more computationally intensive CHOOSE= criterion, sure independence screening (SIS) can run faster by orders of magnitude.

Output 49.7.8: LASSO Selection Summary Table

The GLMSELECT Procedure

SCREEN=SIS

Step	Effect Entered	Number Effects In	SBC
0	Intercept	1	10413.6418
1	cIn2_1	2	10335.5423
2	cIn2_4	3	7237.0701
3	cIn1_1	4	7186.9642
4	cIn1_3	5	6753.6076
5	xIn5	6	6609.0322
6	cIn2_3	7	6510.2639
7	xIn4	8	5797.1607
8	xIn3	9	5487.4432
9	xIn2	10	2894.0166
10	xIn1	11	245.9693
11	cOut747_5	12	238.4427*
* Optimal Value of Criterion
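To check the relative run times on your own system, you can wrap each PROC GLMSELECT call in a small timing macro. The following is a sketch under stated assumptions: the macro name timeLasso and its parameters screenOpt and label are hypothetical, and elapsed times depend on hardware and system load, so the ratios that are quoted in this example might not be reproduced exactly.

```sas
%macro timeLasso(screenOpt=, label=);
   %local t0 elapsed;
   %let t0 = %sysfunc(datetime());      /* start time in seconds */

   ods exclude all;                     /* suppress displayed output */
   proc glmselect data=ex7Data;
      class c:;
      model y = x: c: / selection=lasso&screenOpt;
   run;
   ods exclude none;                    /* restore displayed output */

   %let elapsed = %sysevalf(%sysfunc(datetime()) - &t0);
   %put NOTE: &label elapsed time: &elapsed seconds;
%mend timeLasso;

%timeLasso(screenOpt=,                         label=NONE);
%timeLasso(screenOpt=(screen=sasvi),           label=SASVI);
%timeLasso(screenOpt=(screen=sis(keepnum=15)), label=SIS);
```

The SAS log also reports real time and CPU time for each procedure step, which gives the same information without a macro.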