The HPGENSELECT Procedure

Example 52.4 Model Selection by the LASSO Method

This example shows how you can use PROC HPGENSELECT to perform model selection among Poisson regression models by using the LASSO method in single-machine and distributed modes. For more information about the execution modes of SAS High-Performance Statistics procedures, see the section Processing Modes in SAS/STAT 14.1 User's Guide: High-Performance Procedures. The focus of this example is to show how to use the LASSO method and how to switch the execution mode of PROC HPGENSELECT.

The following DATA step generates the data for this example. The data set contains 1,000,000 observations. The response yPoisson is a Poisson variable whose mean depends on 20 of the 100 regressors (xIn1 through xIn20, but not xOut1 through xOut80). Two additional regressors, xSubtle and xTiny, have very small effects on the mean.

%let nObs      = 1000000;
%let nContIn   = 20;
%let nContOut  = 80;
%let Seed      = 12345;

data ex4Data;
   array xIn{&nContIn};     /* regressors that drive the Poisson mean */
   array xOut{&nContOut};   /* regressors unrelated to the response   */

   drop i j sign xBeta expXbeta;

   seed = &Seed;
   do i=1 to &nObs;
      sign  = -1;
      xBeta = 0;
      /* linear predictor with coefficients -1, +2, -3, ..., +20 */
      do j=1 to dim(xIn);
         call ranuni(seed,xIn{j});
         xBeta = xBeta + j*sign*xIn{j};
         sign  = -sign;
      end;

      do j=1 to dim(xOut);
         call ranuni(seed,xOut{j});
      end;

      /* two regressors with very small true effects */
      call ranuni(seed,xSubtle);
      call ranuni(seed,xTiny);

      xBeta = xBeta + 0.1*xSubtle + 0.05*xTiny;
      expXbeta = exp(xBeta/20);              /* Poisson mean */
      call ranpoi(seed,expXbeta,yPoisson);
      output;
   end;
run;
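
Read directly off the DATA step, the data-generating model can be summarized as follows. The symbols \eta_i (linear predictor) and \mu_i (Poisson mean) are notation introduced here for illustration; they do not appear in the original example.

```latex
\[
\eta_i = \sum_{j=1}^{20} (-1)^{j}\, j\, \mathrm{xIn}_{ij}
         + 0.1\,\mathrm{xSubtle}_i + 0.05\,\mathrm{xTiny}_i,
\qquad
\mu_i = \exp\!\left(\frac{\eta_i}{20}\right),
\qquad
\mathrm{yPoisson}_i \sim \mathrm{Poisson}(\mu_i)
\]
```

All regressors are uniform on (0,1). The xOut variables do not appear in \eta_i, so their true coefficients are zero.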

The following statements use PROC HPGENSELECT to select a model by using the LASSO method and only the first 10,000 observations:

proc hpgenselect data=ex4Data(Obs=10000);
   model yPoisson = x: / dist=Poisson;
   selection method=Lasso(choose=SBC) details=all;
   performance details;
run;

Output 52.4.1 shows the "Performance Information" table. The table shows that the HPGENSELECT procedure executed in single-machine mode on four threads because the client machine has four CPUs. You can specify the number of threads to use on each machine involved in the computations by using the NTHREADS= option in the PERFORMANCE statement.

Output 52.4.1: Performance Information

The HPGENSELECT Procedure

Performance Information
Execution Mode Single-Machine
Number of Threads 4
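
As a minimal sketch of the NTHREADS= option mentioned above, the following statements repeat the same analysis but request two threads. The value 2 is an arbitrary illustration, not a recommendation.

```sas
proc hpgenselect data=ex4Data(obs=10000);
   model yPoisson = x: / dist=Poisson;
   selection method=Lasso(choose=SBC) details=all;
   /* nthreads=2 is an illustrative value; choose one that suits your machine */
   performance details nthreads=2;
run;
```

With this setting, the Number of Threads row in the "Performance Information" table should report the requested value instead of the number of CPUs.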



Output 52.4.2 shows the models fit by maximizing the penalized log likelihood for each value in a sequence of regularization parameters (the Lambda column). For each step in the sequence, Output 52.4.2 shows the effects that are added or removed, the number of effects in the model (including the intercept), and the fit statistics AIC, AICC, and BIC (SBC). Unlike methods such as forward selection, which add one effect per step, LASSO selection can add or remove any number of effects, including none, in a step.

Output 52.4.2: Selection Details

Selection Details
Step  Description  Effects In Model  Lambda  AIC  AICC  BIC
0 Initial Model 1 1 39592.399 39592.399 39599.609
1 xIn17 entered 5 0.8 38533.060 38533.066 38569.111
  xIn18 entered 5 0.8 38533.060 38533.066 38569.111
  xIn19 entered 5 0.8 38533.060 38533.066 38569.111
  xIn20 entered 5 0.8 38533.060 38533.066 38569.111
2 xIn14 entered 8 0.64 36731.498 36731.513 36789.181
  xIn15 entered 8 0.64 36731.498 36731.513 36789.181
  xIn16 entered 8 0.64 36731.498 36731.513 36789.181
3 xIn11 entered 11 0.512 34887.979 34888.005 34967.292
  xIn12 entered 11 0.512 34887.979 34888.005 34967.292
  xIn13 entered 11 0.512 34887.979 34888.005 34967.292
4 xIn8 entered 12 0.4096 33321.737 33321.769 33408.262
5 xIn7 entered 15 0.3277 32043.428 32043.476 32151.583
  xIn9 entered 15 0.3277 32043.428 32043.476 32151.583
  xIn10 entered 15 0.3277 32043.428 32043.476 32151.583
6 xIn6 entered 16 0.2621 31135.411 31135.465 31250.776
7 xIn5 entered 17 0.2097 30494.512 30494.573 30617.088
8 xIn4 entered 18 0.1678 30062.397 30062.465 30192.183
9 xIn3 entered 19 0.1342 29761.384 29761.460 29898.380
10 xIn2 entered 20 0.1074 29563.603 29563.687 29707.810
11   20 0.0859 29432.716 29432.800 29576.922
12   20 0.0687 29348.856 29348.940 29493.063
13 xOut8 entered 22 0.055 29297.961 29298.062 29456.588
  xOut66 entered 22 0.055 29297.961 29298.062 29456.588
14 xIn1 entered 26 0.044 29265.882 29266.022 29453.350*
  xOut34 entered 26 0.044 29265.882 29266.022 29453.350*
  xOut37 entered 26 0.044 29265.882 29266.022 29453.350*
  xOut65 entered 26 0.044 29265.882 29266.022 29453.350*
15 xOut7 entered 32 0.0352 29247.136 29247.348 29477.867
  xOut22 entered 32 0.0352 29247.136 29247.348 29477.867
  xOut29 entered 32 0.0352 29247.136 29247.348 29477.867
  xOut51 entered 32 0.0352 29247.136 29247.348 29477.867
  xOut57 entered 32 0.0352 29247.136 29247.348 29477.867
  xOut60 entered 32 0.0352 29247.136 29247.348 29477.867
16 xOut1 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut6 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut14 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut18 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut27 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut30 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut33 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut53 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut56 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut59 entered 44 0.0281 29244.461 29244.859 29561.716
  xOut62 entered 44 0.0281 29244.461 29244.859 29561.716
  xTiny entered 44 0.0281 29244.461 29244.859 29561.716
17 xOut11 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut24 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut35 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut36 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut43 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut45 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut49 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut54 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut69 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut75 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut76 entered 56 0.0225 29245.665 29246.307 29649.444
  xOut79 entered 56 0.0225 29245.665 29246.307 29649.444
18 xOut12 entered 63 0.018 29241.974 29242.786 29696.225
  xOut19 entered 63 0.018 29241.974 29242.786 29696.225
  xOut28 entered 63 0.018 29241.974 29242.786 29696.225
  xOut48 entered 63 0.018 29241.974 29242.786 29696.225
  xOut61 entered 63 0.018 29241.974 29242.786 29696.225
  xOut78 entered 63 0.018 29241.974 29242.786 29696.225
  xSubtle entered 63 0.018 29241.974 29242.786 29696.225
19 xOut4 entered 69 0.0144 29241.231 29242.203 29738.744
  xOut39 entered 69 0.0144 29241.231 29242.203 29738.744
  xOut40 entered 69 0.0144 29241.231 29242.203 29738.744
  xOut44 entered 69 0.0144 29241.231 29242.203 29738.744
  xOut67 entered 69 0.0144 29241.231 29242.203 29738.744
  xOut71 entered 69 0.0144 29241.231 29242.203 29738.744
20 xOut2 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut3 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut16 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut21 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut31 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut41 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut46 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut50 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut72 entered 79 0.0115 29251.884 29253.158 29821.500
  xOut77 entered 79 0.0115 29251.884 29253.158 29821.500

* Optimal Value of Criterion
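
The Lambda column in Output 52.4.2 is consistent with a geometric sequence that starts at 1 and shrinks by a factor of 0.8 at each step. This pattern is inferred from the printed values only. The index k below denotes the step number.

```latex
\[
\lambda_k = 0.8^{k}, \quad k = 0, 1, \dots, 20:
\qquad
\lambda_1 = 0.8,\;
\lambda_2 = 0.64,\;
\lambda_3 = 0.512,\;
\lambda_{14} = 0.8^{14} \approx 0.0440,\;
\lambda_{20} = 0.8^{20} \approx 0.0115
\]
```

Smaller values of \lambda_k penalize the coefficients less, which is why more effects enter the model in later steps.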




The model in step 14 had the smallest Schwarz Bayesian criterion (BIC in Output 52.4.2), and it was chosen as the final model because the CHOOSE=SBC option was specified in the SELECTION statement. Output 52.4.3 shows the parameter estimates for the selected model. The selected model retains all 20 of the true effects (xIn1 through xIn20) but also keeps five extraneous effects (xOut8, xOut34, xOut37, xOut65, and xOut66), whose estimates are all close to zero.

Output 52.4.3: Parameter Estimates for the Selected Model

Parameter Estimates
Parameter DF Estimate
Intercept 1 -0.008605
xIn1 1 -0.004242
xIn2 1 0.076690
xIn3 1 -0.141401
xIn4 1 0.160843
xIn5 1 -0.219377
xIn6 1 0.268402
xIn7 1 -0.336859
xIn8 1 0.384260
xIn9 1 -0.407838
xIn10 1 0.397287
xIn11 1 -0.516073
xIn12 1 0.559396
xIn13 1 -0.547200
xIn14 1 0.631820
xIn15 1 -0.692216
xIn16 1 0.783101
xIn17 1 -0.790773
xIn18 1 0.875372
xIn19 1 -0.837447
xIn20 1 0.954752
xOut8 1 0.013643
xOut34 1 0.002167
xOut37 1 0.001344
xOut65 1 -0.007790
xOut66 1 0.015540



Output 52.4.4 shows timing information for the PROC HPGENSELECT run. This table is produced when you specify the DETAILS option in the PERFORMANCE statement. In this case, almost all of the time (99.88%) is spent performing model selection.

Output 52.4.4: Timing

Procedure Task Timing
Task Seconds Percent
Reading and Levelizing Data 0.09 0.09%
Candidate model fit 0.03 0.03%
Performing Model Selection 100.67 99.88%



You can switch to running PROC HPGENSELECT in distributed mode by specifying valid values for the NODES=, INSTALL=, and HOST= options in the PERFORMANCE statement. An alternative to specifying the INSTALL= and HOST= options in the PERFORMANCE statement is to set appropriate values for the GRIDHOST and GRIDINSTALLLOC environment variables by using OPTIONS SET commands. For more information about setting these options or environment variables, see the section Processing Modes in SAS/STAT 14.1 User's Guide: High-Performance Procedures.
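
As a sketch of the environment-variable alternative, the following statements set GRIDHOST and GRIDINSTALLLOC before the procedure runs, so that the PERFORMANCE statement needs only the NODES= option. The host name and install path are placeholders, not real values.

```sas
/* Placeholder values: substitute your own grid host and install location */
options set=GRIDHOST="your-grid-host";
options set=GRIDINSTALLLOC="/your/grid/install/location";

proc hpgenselect data=ex4Data;
   model yPoisson = x: / dist=Poisson;
   selection method=Lasso(choose=SBC) details=all;
   performance details nodes=10;   /* host and install location come from the environment */
run;
```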

The following statements provide an example. All 1,000,000 observations are used in this example of running PROC HPGENSELECT in distributed mode. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to macro variables with appropriate values.

proc hpgenselect data=ex4Data;
   model yPoisson = x: / dist=Poisson;
   selection method=Lasso(choose=SBC) details=all;
   performance details nodes = 10
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
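
The macro variables referenced in the PERFORMANCE statement must be defined before the PROC step runs. One way to do this, shown here with placeholder values that you would replace with your own, is a pair of %LET statements:

```sas
/* Placeholder values: substitute your own grid host and install location */
%let GRIDHOST       = your-grid-host;
%let GRIDINSTALLLOC = /your/grid/install/location;
```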

The "Performance Information" table shown in Output 52.4.5 indicates that the calculations were performed in distributed mode on 10 compute nodes, each of which used 32 threads.

Output 52.4.5: Performance Information in Distributed Mode

Performance Information
Host Node << your grid host >>
Install Location << your grid install location >>
Execution Mode Distributed
Number of Compute Nodes 10
Number of Threads per Node 32



Another indication of distributed execution is the following message in the SAS log, which is issued by all high-performance statistical procedures:

NOTE: The HPGENSELECT procedure is executing in the distributed
      computing environment with 10 worker nodes.

Output 52.4.6 shows timing information for this distributed run of PROC HPGENSELECT. As in single-machine mode, most of the time (94.27%) is spent performing model selection; distributing the data accounts for a further 5.58%.

Output 52.4.6: Timing

Procedure Task Timing
Task Seconds Percent
Distributing Data 7.43 5.58%
Reading and Levelizing Data 0.13 0.10%
Candidate model fit 0.07 0.05%
Performing Model Selection 125.62 94.27%



Output 52.4.7 shows the models that were fit by maximizing the penalized log likelihood for each value in a sequence of regularization parameters. In this case, the model in the last step (step 20) had the smallest SBC statistic and was the selected model.

Output 52.4.7: Selection Details

Selection Details
Step  Description  Effects In Model  Lambda  AIC  AICC  BIC
0 Initial Model 1 1 3931380.08 3931380.08 3931391.90
1 xIn16 entered 6 0.8 3811013.61 3811013.61 3811084.50
  xIn17 entered 6 0.8 3811013.61 3811013.61 3811084.50
  xIn18 entered 6 0.8 3811013.61 3811013.61 3811084.50
  xIn19 entered 6 0.8 3811013.61 3811013.61 3811084.50
  xIn20 entered 6 0.8 3811013.61 3811013.61 3811084.50
2 xIn13 entered 9 0.64 3611604.92 3611604.92 3611711.26
  xIn14 entered 9 0.64 3611604.92 3611604.92 3611711.26
  xIn15 entered 9 0.64 3611604.92 3611604.92 3611711.26
3 xIn11 entered 11 0.512 3424886.61 3424886.61 3425016.58
  xIn12 entered 11 0.512 3424886.61 3424886.61 3425016.58
4 xIn9 entered 13 0.4096 3273653.05 3273653.05 3273806.65
  xIn10 entered 13 0.4096 3273653.05 3273653.05 3273806.65
5 xIn7 entered 15 0.3277 3160535.72 3160535.72 3160712.95
  xIn8 entered 15 0.3277 3160535.72 3160535.72 3160712.95
6 xIn6 entered 16 0.2621 3080686.19 3080686.19 3080875.24
7 xIn5 entered 17 0.2097 3025122.51 3025122.51 3025323.38
8 xIn4 entered 18 0.1678 2987411.53 2987411.53 2987624.21
9 xIn3 entered 19 0.1342 2962303.01 2962303.01 2962527.50
10   19 0.1074 2945736.43 2945736.43 2945960.92
11 xIn2 entered 20 0.0859 2934645.35 2934645.36 2934881.66
12   20 0.0687 2927476.58 2927476.58 2927712.89
13   20 0.055 2922886.50 2922886.50 2923122.81
14 xIn1 entered 21 0.044 2919875.95 2919875.95 2920124.08
15   21 0.0352 2917896.15 2917896.15 2918144.28
16   21 0.0281 2916627.57 2916627.57 2916875.69
17   21 0.0225 2915816.54 2915816.54 2916064.66
18   21 0.018 2915297.37 2915297.37 2915545.49
19   21 0.0144 2914965.03 2914965.03 2915213.16
20   21 0.0115 2914752.31 2914752.31 2915000.44*

* Optimal Value of Criterion
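
The gap between the AIC and BIC columns can be checked by hand. The check below assumes the usual definitions of the two criteria, which this example does not state explicitly: \ell is the maximized log likelihood, p is the number of effects in the model (including the intercept), and n is the number of observations.

```latex
\[
\mathrm{AIC} = -2\ell + 2p, \qquad
\mathrm{BIC} = -2\ell + p \ln n
\quad\Longrightarrow\quad
\mathrm{BIC} - \mathrm{AIC} = p\,(\ln n - 2)
\]
\[
\text{Step 20 above } (p = 21,\ n = 10^{6}):\quad 21\,(13.8155 - 2) \approx 248.13
  = 2915000.44 - 2914752.31
\]
\[
\text{Step 14 of Output 52.4.2 } (p = 26,\ n = 10^{4}):\quad 26\,(9.2103 - 2) \approx 187.47
  \approx 29453.350 - 29265.882
\]
```

Under these assumed definitions, each additional effect costs ln n in the SBC, about 9.2 with 10,000 observations and about 13.8 with 1,000,000 observations.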




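Output 52.4.8 shows the parameter estimates for the selected model. With all 1,000,000 observations, selecting the final model based on SBC retains all 20 of the true effects and none of the extraneous effects. The two regressors with very small effects, xSubtle and xTiny, are not selected.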

Output 52.4.8: Parameter Estimates

Parameter Estimates
Parameter DF Estimate
Intercept 1 0.011850
xIn1 1 -0.037295
xIn2 1 0.090866
xIn3 1 -0.137335
xIn4 1 0.188475
xIn5 1 -0.238960
xIn6 1 0.287461
xIn7 1 -0.336226
xIn8 1 0.389728
xIn9 1 -0.438611
xIn10 1 0.486188
xIn11 1 -0.539309
xIn12 1 0.587680
xIn13 1 -0.637463
xIn14 1 0.684980
xIn15 1 -0.735036
xIn16 1 0.790630
xIn17 1 -0.836793
xIn18 1 0.885972
xIn19 1 -0.934294
xIn20 1 0.985428
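
For comparison, the true coefficients implied by the DATA step can be written down directly, because the linear predictor is divided by 20 before exponentiation. The symbol \beta_j is notation introduced here; the estimates are copied from Output 52.4.8.

```latex
\[
\beta_j = \frac{(-1)^{j}\, j}{20}, \quad j = 1, \dots, 20:
\qquad
\begin{array}{lrr}
\text{Effect} & \text{True } \beta_j & \text{Estimate} \\
\mathrm{xIn1}  & -0.05 & -0.037295 \\
\mathrm{xIn10} &  0.50 &  0.486188 \\
\mathrm{xIn19} & -0.95 & -0.934294 \\
\mathrm{xIn20} &  1.00 &  0.985428
\end{array}
\]
```

In these four rows the estimates have the correct sign and lie slightly closer to zero than the true values, which is consistent with the shrinkage that a LASSO penalty would be expected to introduce.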