This example shows how you can use PROC HPGENSELECT to perform model selection among Poisson regression models by using the LASSO method in single-machine and distributed modes. For more information about the execution modes of SAS High-Performance Statistics procedures, see the section Processing Modes in SAS/STAT 14.1 User's Guide: High-Performance Procedures. The focus of this example is to show how you use the LASSO method and how you can switch the execution mode of PROC HPGENSELECT.
The following DATA step generates the data for this example. There are 1,000,000 observations in the data set, and the response yPoisson is a Poisson variable with a mean that depends on 20 of the 100 regressors (xIn1 through xIn20, but none of xOut1 through xOut80). Two additional regressors, xSubtle and xTiny, have only very small effects on the mean.

    %let nObs     = 1000000;
    %let nContIn  = 20;
    %let nContOut = 80;
    %let Seed     = 12345;

    data ex4Data;
       array xIn{&nContIn};
       array xOut{&nContOut};
       drop i j sign xBeta expXbeta;

       seed = &Seed;
       do i=1 to &nObs;
          sign  = -1;
          xBeta = 0;
          /* regressors that affect the mean; coefficients alternate in sign */
          do j=1 to dim(xIn);
             call ranuni(seed,xIn[j]);
             xBeta = xBeta + j*sign*xIn{j};
             sign  = -sign;
          end;
          /* regressors that have no effect on the mean */
          do j=1 to dim(xOut);
             call ranuni(seed,xOut[j]);
          end;
          /* two regressors that have very small effects on the mean */
          call ranuni(seed,xSubtle);
          call ranuni(seed,xTiny);
          xBeta    = xBeta + 0.1*xSubtle + 0.05*xTiny;
          expXbeta = exp(xBeta/20);
          call ranpoi(seed,expXbeta,yPoisson);
          output;
       end;
    run;

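Reading the DATA step as a statistical model can help when you interpret the parameter estimates later in this example. The following is a sketch derived directly from the code above; the symbols $\mu_i$ and $y_i$ are notation introduced here and do not appear in the program. Each regressor is drawn from a uniform distribution on (0,1), and the response is Poisson with a log-linear mean:

```latex
\log \mu_i = \frac{1}{20}\left( \sum_{j=1}^{20} (-1)^{j}\, j\, \mathrm{xIn}_{ij}
             + 0.1\,\mathrm{xSubtle}_i + 0.05\,\mathrm{xTiny}_i \right),
\qquad y_i \sim \mathrm{Poisson}(\mu_i)
```

Under this reading, the true coefficient of xIn$j$ is $(-1)^j j/20$ (for example, $-0.05$ for xIn1 and $1.0$ for xIn20), the coefficients of xSubtle and xTiny are $0.005$ and $0.0025$, the intercept is 0, and every xOut regressor has a coefficient of 0.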
The following statements use PROC HPGENSELECT to select a model by using the LASSO method and only the first 10,000 observations:

    proc hpgenselect data=ex4Data(Obs=10000);
       model yPoisson = x: / dist=Poisson;
       selection method=Lasso(choose=SBC) details=all;
       performance details;
    run;

Output 52.4.1 shows the "Performance Information" table. The table shows that the HPGENSELECT procedure executed in single-machine mode on four threads because the client machine has four CPUs. You can request a specific number of threads on any machine involved in the computations by using the NTHREADS= option in the PERFORMANCE statement.
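As a sketch of the NTHREADS= option, the following variation of the same run requests two threads. The value 2 is an arbitrary choice for illustration and is not part of the original example; the other statements are unchanged.

```sas
/* Same single-machine run, but with an explicit thread count.   */
/* nthreads=2 is an illustrative value, not a recommendation.    */
proc hpgenselect data=ex4Data(Obs=10000);
   model yPoisson = x: / dist=Poisson;
   selection method=Lasso(choose=SBC) details=all;
   performance details nthreads=2;
run;
```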
Output 52.4.1: Performance Information
Output 52.4.2 shows the models that were fit by maximizing the penalized log likelihoods for a sequence of regularization parameters. For each step in the sequence, Output 52.4.2 shows the effects that are added or removed and the fit statistics AIC, AICC, and BIC (SBC) for each model. Unlike other methods, such as forward selection, which adds one effect per step, LASSO selection can add zero or more effects in each step; for example, step 1 adds four effects, whereas steps 11 and 12 add none.
Output 52.4.2: Selection Details
| Step | Description | Effects In Model | Lambda | AIC | AICC | BIC |
|---|---|---|---|---|---|---|
| 0 | Initial Model | 1 | 1 | 39592.399 | 39592.399 | 39599.609 |
| 1 | xIn17 entered | 5 | 0.8 | 38533.060 | 38533.066 | 38569.111 |
| | xIn18 entered | 5 | 0.8 | 38533.060 | 38533.066 | 38569.111 |
| | xIn19 entered | 5 | 0.8 | 38533.060 | 38533.066 | 38569.111 |
| | xIn20 entered | 5 | 0.8 | 38533.060 | 38533.066 | 38569.111 |
| 2 | xIn14 entered | 8 | 0.64 | 36731.498 | 36731.513 | 36789.181 |
| | xIn15 entered | 8 | 0.64 | 36731.498 | 36731.513 | 36789.181 |
| | xIn16 entered | 8 | 0.64 | 36731.498 | 36731.513 | 36789.181 |
| 3 | xIn11 entered | 11 | 0.512 | 34887.979 | 34888.005 | 34967.292 |
| | xIn12 entered | 11 | 0.512 | 34887.979 | 34888.005 | 34967.292 |
| | xIn13 entered | 11 | 0.512 | 34887.979 | 34888.005 | 34967.292 |
| 4 | xIn8 entered | 12 | 0.4096 | 33321.737 | 33321.769 | 33408.262 |
| 5 | xIn7 entered | 15 | 0.3277 | 32043.428 | 32043.476 | 32151.583 |
| | xIn9 entered | 15 | 0.3277 | 32043.428 | 32043.476 | 32151.583 |
| | xIn10 entered | 15 | 0.3277 | 32043.428 | 32043.476 | 32151.583 |
| 6 | xIn6 entered | 16 | 0.2621 | 31135.411 | 31135.465 | 31250.776 |
| 7 | xIn5 entered | 17 | 0.2097 | 30494.512 | 30494.573 | 30617.088 |
| 8 | xIn4 entered | 18 | 0.1678 | 30062.397 | 30062.465 | 30192.183 |
| 9 | xIn3 entered | 19 | 0.1342 | 29761.384 | 29761.460 | 29898.380 |
| 10 | xIn2 entered | 20 | 0.1074 | 29563.603 | 29563.687 | 29707.810 |
| 11 | | 20 | 0.0859 | 29432.716 | 29432.800 | 29576.922 |
| 12 | | 20 | 0.0687 | 29348.856 | 29348.940 | 29493.063 |
| 13 | xOut8 entered | 22 | 0.055 | 29297.961 | 29298.062 | 29456.588 |
| | xOut66 entered | 22 | 0.055 | 29297.961 | 29298.062 | 29456.588 |
| 14 | xIn1 entered | 26 | 0.044 | 29265.882 | 29266.022 | 29453.350* |
| | xOut34 entered | 26 | 0.044 | 29265.882 | 29266.022 | 29453.350* |
| | xOut37 entered | 26 | 0.044 | 29265.882 | 29266.022 | 29453.350* |
| | xOut65 entered | 26 | 0.044 | 29265.882 | 29266.022 | 29453.350* |
| 15 | xOut7 entered | 32 | 0.0352 | 29247.136 | 29247.348 | 29477.867 |
| | xOut22 entered | 32 | 0.0352 | 29247.136 | 29247.348 | 29477.867 |
| | xOut29 entered | 32 | 0.0352 | 29247.136 | 29247.348 | 29477.867 |
| | xOut51 entered | 32 | 0.0352 | 29247.136 | 29247.348 | 29477.867 |
| | xOut57 entered | 32 | 0.0352 | 29247.136 | 29247.348 | 29477.867 |
| | xOut60 entered | 32 | 0.0352 | 29247.136 | 29247.348 | 29477.867 |
| 16 | xOut1 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut6 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut14 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut18 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut27 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut30 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut33 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut53 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut56 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut59 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xOut62 entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| | xTiny entered | 44 | 0.0281 | 29244.461 | 29244.859 | 29561.716 |
| 17 | xOut11 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut24 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut35 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut36 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut43 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut45 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut49 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut54 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut69 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut75 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut76 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| | xOut79 entered | 56 | 0.0225 | 29245.665 | 29246.307 | 29649.444 |
| 18 | xOut12 entered | 63 | 0.018 | 29241.974 | 29242.786 | 29696.225 |
| | xOut19 entered | 63 | 0.018 | 29241.974 | 29242.786 | 29696.225 |
| | xOut28 entered | 63 | 0.018 | 29241.974 | 29242.786 | 29696.225 |
| | xOut48 entered | 63 | 0.018 | 29241.974 | 29242.786 | 29696.225 |
| | xOut61 entered | 63 | 0.018 | 29241.974 | 29242.786 | 29696.225 |
| | xOut78 entered | 63 | 0.018 | 29241.974 | 29242.786 | 29696.225 |
| | xSubtle entered | 63 | 0.018 | 29241.974 | 29242.786 | 29696.225 |
| 19 | xOut4 entered | 69 | 0.0144 | 29241.231 | 29242.203 | 29738.744 |
| | xOut39 entered | 69 | 0.0144 | 29241.231 | 29242.203 | 29738.744 |
| | xOut40 entered | 69 | 0.0144 | 29241.231 | 29242.203 | 29738.744 |
| | xOut44 entered | 69 | 0.0144 | 29241.231 | 29242.203 | 29738.744 |
| | xOut67 entered | 69 | 0.0144 | 29241.231 | 29242.203 | 29738.744 |
| | xOut71 entered | 69 | 0.0144 | 29241.231 | 29242.203 | 29738.744 |
| 20 | xOut2 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut3 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut16 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut21 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut31 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut41 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut46 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut50 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut72 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |
| | xOut77 entered | 79 | 0.0115 | 29251.884 | 29253.158 | 29821.500 |

\* Optimal Value of Criterion
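The Lambda column in Output 52.4.2 appears to follow a geometric sequence. This is an observation from the displayed values rather than a statement taken from the procedure documentation; $\lambda_k$ is notation introduced here for the regularization parameter at step $k$:

```latex
\lambda_k = 0.8^{k}, \quad k = 0, 1, \ldots, 20,
\qquad \text{for example } \lambda_2 = 0.64,\quad
\lambda_5 = 0.32768 \approx 0.3277,\quad
\lambda_{20} \approx 0.0115
```

Each step therefore shrinks the penalty by 20%, so later steps can admit effects that have progressively weaker signals.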
The model in step 14 had the smallest Schwarz Bayesian criterion (BIC in Output 52.4.2), and it was chosen as the final model because the CHOOSE=SBC option was specified in the SELECTION statement. Output 52.4.3 shows the parameter estimates for the selected model. The selected model retains all 20 of the true effects but also keeps five extraneous effects (xOut8, xOut34, xOut37, xOut65, and xOut66), whose estimates are all small in magnitude.
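The BIC column can be checked by hand. The sketch below assumes the usual definitions $\mathrm{AIC} = -2\log L + 2p$ and $\mathrm{SBC} = -2\log L + p\log n$, where $p$ is the number of parameters (equal to the Effects In Model count here, because every effect has one degree of freedom) and $n = 10{,}000$ is the number of observations used. The symbols $L$, $p$, and $n$ are notation introduced here for illustration. For step 14, $p = 26$:

```latex
\begin{aligned}
-2\log L     &= \mathrm{AIC} - 2p = 29265.882 - 52 = 29213.882,\\
\mathrm{SBC} &= -2\log L + p\log n
              = 29213.882 + 26 \times \log(10000)
              \approx 29213.882 + 239.469 = 29453.351
\end{aligned}
```

This agrees with the tabulated value 29453.350 up to rounding. Because SBC charges about 9.21 per added parameter at this sample size, whereas AIC charges 2, SBC reaches its minimum at step 14 while AIC continues to decrease until step 19.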
Output 52.4.3: Parameter Estimates for the Selected Model
Parameter | DF | Estimate |
---|---|---|
Intercept | 1 | -0.008605 |
xIn1 | 1 | -0.004242 |
xIn2 | 1 | 0.076690 |
xIn3 | 1 | -0.141401 |
xIn4 | 1 | 0.160843 |
xIn5 | 1 | -0.219377 |
xIn6 | 1 | 0.268402 |
xIn7 | 1 | -0.336859 |
xIn8 | 1 | 0.384260 |
xIn9 | 1 | -0.407838 |
xIn10 | 1 | 0.397287 |
xIn11 | 1 | -0.516073 |
xIn12 | 1 | 0.559396 |
xIn13 | 1 | -0.547200 |
xIn14 | 1 | 0.631820 |
xIn15 | 1 | -0.692216 |
xIn16 | 1 | 0.783101 |
xIn17 | 1 | -0.790773 |
xIn18 | 1 | 0.875372 |
xIn19 | 1 | -0.837447 |
xIn20 | 1 | 0.954752 |
xOut8 | 1 | 0.013643 |
xOut34 | 1 | 0.002167 |
xOut37 | 1 | 0.001344 |
xOut65 | 1 | -0.007790 |
xOut66 | 1 | 0.015540 |
Output 52.4.4 shows timing information for the PROC HPGENSELECT run. This table is produced when you specify the DETAILS option in the PERFORMANCE statement. You can see that, in this case, the majority of time is spent in performing model selection.
Output 52.4.4: Timing
You can switch to running PROC HPGENSELECT in distributed mode by specifying valid values for the NODES=, INSTALL=, and HOST= options in the PERFORMANCE statement. An alternative to specifying the INSTALL= and HOST= options in the PERFORMANCE statement is to set appropriate values for the GRIDHOST and GRIDINSTALLLOC environment variables by using OPTIONS SET commands. For more information about setting these options or environment variables, see the section Processing Modes in SAS/STAT 14.1 User's Guide: High-Performance Procedures.
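A minimal sketch of the environment-variable alternative follows. The host name and installation path are hypothetical placeholders; substitute the values for your own grid. With the environment variables set, the PERFORMANCE statement in this sketch specifies only the NODES= option.

```sas
/* Hypothetical values: replace with your grid host and install location */
options set=GRIDHOST="myhost.example.com";
options set=GRIDINSTALLLOC="/opt/TKGrid";

proc hpgenselect data=ex4Data;
   model yPoisson = x: / dist=Poisson;
   selection method=Lasso(choose=SBC) details=all;
   performance details nodes=10;
run;
```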
The following statements provide an example. All 1,000,000 observations are used in this example of running PROC HPGENSELECT in distributed mode. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to macro variables with appropriate values.

    proc hpgenselect data=ex4Data;
       model yPoisson = x: / dist=Poisson;
       selection method=Lasso(choose=SBC) details=all;
       performance details nodes = 10
                   host="&GRIDHOST" install="&GRIDINSTALLLOC";
    run;

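Before you run the preceding statements, the two macro variables could be defined as in the following sketch. Both values are hypothetical placeholders and must be replaced with the host name and installation location of your own distributed environment.

```sas
/* Hypothetical values: replace with your grid host and install location. */
/* The values are written without quotation marks because the PERFORMANCE */
/* statement already encloses the macro references in double quotes.      */
%let GRIDHOST       = myhost.example.com;
%let GRIDINSTALLLOC = /opt/TKGrid;
```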
The Execution Mode row in the "Performance Information" table shown in Output 52.4.5 indicates that the calculations were performed in a distributed environment that used 10 nodes, each of which used 32 threads.
Output 52.4.5: Performance Information in Distributed Mode
Another indication of distributed execution is the following message in the SAS log, which is issued by all high-performance statistical procedures:
NOTE: The HPGENSELECT procedure is executing in the distributed computing environment with 10 worker nodes.
Output 52.4.6 shows timing information for this distributed run of PROC HPGENSELECT. As in the case of the single-machine mode, the majority of time in distributed mode is spent in performing the model selection.
Output 52.4.6: Timing
Output 52.4.7 shows the models that were fit by maximizing the penalized log likelihoods for a sequence of regularization parameters. In this case, the model in the last step had the smallest SBC statistic and was the selected model.
Output 52.4.7: Selection Details
| Step | Description | Effects In Model | Lambda | AIC | AICC | BIC |
|---|---|---|---|---|---|---|
| 0 | Initial Model | 1 | 1 | 3931380.08 | 3931380.08 | 3931391.90 |
| 1 | xIn16 entered | 6 | 0.8 | 3811013.61 | 3811013.61 | 3811084.50 |
| | xIn17 entered | 6 | 0.8 | 3811013.61 | 3811013.61 | 3811084.50 |
| | xIn18 entered | 6 | 0.8 | 3811013.61 | 3811013.61 | 3811084.50 |
| | xIn19 entered | 6 | 0.8 | 3811013.61 | 3811013.61 | 3811084.50 |
| | xIn20 entered | 6 | 0.8 | 3811013.61 | 3811013.61 | 3811084.50 |
| 2 | xIn13 entered | 9 | 0.64 | 3611604.92 | 3611604.92 | 3611711.26 |
| | xIn14 entered | 9 | 0.64 | 3611604.92 | 3611604.92 | 3611711.26 |
| | xIn15 entered | 9 | 0.64 | 3611604.92 | 3611604.92 | 3611711.26 |
| 3 | xIn11 entered | 11 | 0.512 | 3424886.61 | 3424886.61 | 3425016.58 |
| | xIn12 entered | 11 | 0.512 | 3424886.61 | 3424886.61 | 3425016.58 |
| 4 | xIn9 entered | 13 | 0.4096 | 3273653.05 | 3273653.05 | 3273806.65 |
| | xIn10 entered | 13 | 0.4096 | 3273653.05 | 3273653.05 | 3273806.65 |
| 5 | xIn7 entered | 15 | 0.3277 | 3160535.72 | 3160535.72 | 3160712.95 |
| | xIn8 entered | 15 | 0.3277 | 3160535.72 | 3160535.72 | 3160712.95 |
| 6 | xIn6 entered | 16 | 0.2621 | 3080686.19 | 3080686.19 | 3080875.24 |
| 7 | xIn5 entered | 17 | 0.2097 | 3025122.51 | 3025122.51 | 3025323.38 |
| 8 | xIn4 entered | 18 | 0.1678 | 2987411.53 | 2987411.53 | 2987624.21 |
| 9 | xIn3 entered | 19 | 0.1342 | 2962303.01 | 2962303.01 | 2962527.50 |
| 10 | | 19 | 0.1074 | 2945736.43 | 2945736.43 | 2945960.92 |
| 11 | xIn2 entered | 20 | 0.0859 | 2934645.35 | 2934645.36 | 2934881.66 |
| 12 | | 20 | 0.0687 | 2927476.58 | 2927476.58 | 2927712.89 |
| 13 | | 20 | 0.055 | 2922886.50 | 2922886.50 | 2923122.81 |
| 14 | xIn1 entered | 21 | 0.044 | 2919875.95 | 2919875.95 | 2920124.08 |
| 15 | | 21 | 0.0352 | 2917896.15 | 2917896.15 | 2918144.28 |
| 16 | | 21 | 0.0281 | 2916627.57 | 2916627.57 | 2916875.69 |
| 17 | | 21 | 0.0225 | 2915816.54 | 2915816.54 | 2916064.66 |
| 18 | | 21 | 0.018 | 2915297.37 | 2915297.37 | 2915545.49 |
| 19 | | 21 | 0.0144 | 2914965.03 | 2914965.03 | 2915213.16 |
| 20 | | 21 | 0.0115 | 2914752.31 | 2914752.31 | 2915000.44* |

\* Optimal Value of Criterion
Output 52.4.8 shows the parameter estimates for the selected model. Selecting the final model based on the SBC criterion retains all 20 of the true effects and none of the extraneous effects.
Output 52.4.8: Parameter Estimates
Parameter | DF | Estimate |
---|---|---|
Intercept | 1 | 0.011850 |
xIn1 | 1 | -0.037295 |
xIn2 | 1 | 0.090866 |
xIn3 | 1 | -0.137335 |
xIn4 | 1 | 0.188475 |
xIn5 | 1 | -0.238960 |
xIn6 | 1 | 0.287461 |
xIn7 | 1 | -0.336226 |
xIn8 | 1 | 0.389728 |
xIn9 | 1 | -0.438611 |
xIn10 | 1 | 0.486188 |
xIn11 | 1 | -0.539309 |
xIn12 | 1 | 0.587680 |
xIn13 | 1 | -0.637463 |
xIn14 | 1 | 0.684980 |
xIn15 | 1 | -0.735036 |
xIn16 | 1 | 0.790630 |
xIn17 | 1 | -0.836793 |
xIn18 | 1 | 0.885972 |
xIn19 | 1 | -0.934294 |
xIn20 | 1 | 0.985428 |