Usage Note 60240: Regularization, regression penalties, LASSO, ridging, and elastic net


Regularization methods can be applied to shrink model parameter estimates in situations of instability. As discussed by Agresti (2013), one such situation occurs when there is a large number of covariates, only a small subset of which is strongly associated with the response, and the sample size is not large. In this case, the maximum likelihood estimates can be inflated. Parameter estimates can also be inflated by collinearity among the predictors, by separation in the case of logistic regression, and for other reasons.

To shrink the maximum likelihood estimates of the model parameters toward zero, regularization methods add a penalty term of some form to the log likelihood function used in parameter estimation. The LASSO method adds an L1 norm penalty (the sum of the absolute values of the estimates). Adding an L2 norm penalty (the sum of the squared estimates) is known as ridging. The elastic net method uses the L1 and L2 penalties in combination.
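In symbols, with ℓ(β) denoting the log likelihood and λ1 and λ2 the tuning parameters discussed below, the penalized objective maximized in place of the log likelihood is (a restatement of the definitions above):

$$\ell_p(\beta) \;=\; \ell(\beta) \;-\; \lambda_1 \sum_j |\beta_j| \;-\; \lambda_2 \sum_j \beta_j^2$$

Setting λ2 = 0 gives the LASSO, setting λ1 = 0 gives ridging, and taking both positive gives the elastic net.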

The LASSO method is particularly useful as an efficient way to do model selection because it tends to shrink some estimates all the way to zero, producing a simpler model. Ridging shrinks all estimates, but generally not to zero. The drawback of shrinkage through penalization is that the estimates become biased and their variances shrink. As a result, tests and confidence intervals are not useful. However, the overall mean square error, which combines the variance and the squared bias, may be smaller after penalization. Another penalization method, due to Firth, has been shown to reduce bias in the estimates.
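For a single estimate, this tradeoff is the standard decomposition (stated here for clarity; it is implicit in the paragraph above):

$$\mathrm{MSE}(\hat{\beta}) \;=\; \mathrm{Var}(\hat{\beta}) \;+\; \mathrm{Bias}(\hat{\beta})^2$$

so a penalty that introduces some bias can still lower the mean square error if it reduces the variance enough.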

With the methods discussed here, a tuning parameter multiplies the penalty term, allowing control over the amount of penalization. Increasing the value of the tuning parameter increases the amount of shrinkage in the parameter estimates. As with other smoothing methods that use a tuning parameter, such as generalized additive modeling, a value for the parameter must be chosen. This is done by fitting a series of models over a range of values and selecting the value that minimizes the mean square error. Typically, separate validation data are used for this assessment, or cross validation is used if separate data are not available. Information criteria, such as AIC or SBC, can also be used. This note discusses only the use of PROC NLMIXED to fit one model with a particular tuning parameter value; a sketch of how a series of values might be evaluated follows this paragraph. Note that fitting a series of models in this way incurs the full cost of fitting each model by maximum likelihood estimation. The selection facilities in the GLMSELECT and HPGENSELECT procedures employ more efficient ways to carry out this process.
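As a minimal sketch (not part of the original note), the following macro loop fits a penalized model like the one in Example 1 below over a grid of tuning values. The macro name, the grid values, and the shortened predictor are illustrative; in practice, the selection criterion for each fit would be computed on validation data or by cross validation.

      %macro lassoGrid(values=0.01 0.05 0.1 0.5);
         %local i tune;
         %let i=1;
         %let tune=%scan(&values,&i,%str( ));
         %do %while(%length(&tune) gt 0);
            /* One penalized fit per tuning value; a shortened
               predictor is used only to keep the sketch brief */
            proc nlmixed data=Simdata maxiter=500;
               mu=int + bx1*x1 + bx2*x2 + bx12*x1*x2;
               LL=-(y-mu)**2 - &tune*sumabs(of b:);  /* L1 penalty with current value */
               model y ~ general(LL);
               ods output parameterestimates=pe&i;   /* estimates for this value */
               run;
            %let i=%eval(&i+1);
            %let tune=%scan(&values,&i,%str( ));
         %end;
      %mend lassoGrid;
      %lassoGrid()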

 

Various regression penalties are available in SAS® procedures. See the LASSO, elastic net, ridge regression, and Firth items in this note. The LASSO (and related methods) and the elastic net, as provided in PROC GLMSELECT for normal response models, are illustrated and discussed in more detail in Gunes (2015).

This note illustrates how these regularization methods can be implemented using PROC NLMIXED by directly specifying the log likelihood, penalty, and tuning parameter. The following statements modify the DATA step in Gunes (2015) to add a binary variable to the simulated data that it creates. The first example below uses the normally distributed variable, y, as a response. The next example uses the binary variable, b, as the response in a logistic model. Note that the true model involves only x1, x2, and parts of the c1 predictor. The true model is 2 + 5*x2 - 17*x1*x2 + 6*(c1=2) + 5*(c1=5).

      data Simdata;
         drop i j;
         array x{5} x1-x5;
         do i=1 to 1000;
            do j=1 to 5;
               x{j} = ranuni(1); /* Continuous predictors */
            end;
            c1 = int(1.5+ranuni(1)*7); /* Classification predictors */
            c2 = 1 + mod(i,3);
            yTrue = 2 + 5*x2 - 17*x1*x2 + 6*(c1=2) + 5*(c1=5); /* True mean */
            y = yTrue + 2*rannor(1); /* Normal response */
            p=logistic(yTrue);       /* Event probability for the binary response */
            b=rantbl(1,1-p,p)-1;     /* Binary response: 1 with probability p */
            output Simdata;
         end;
      run;

Example 1: Model selection using LASSO

The statements that follow specify the same main-effects and two-way-interactions model that Gunes (2015) fits in PROC GLMSELECT, but apply the LASSO method using PROC NLMIXED. The assignment statement that defines mu specifies the model effects. The statement that defines LL gives the function to be maximized: the necessary part of the normal log likelihood minus an L1 penalty multiplied by the tuning parameter L1Tune, set here to 0.1. Note that the names of all parameters in the model (other than the intercept) begin with "b," so the SUMABS(OF b:) function computes the sum of the absolute values of the model parameters. The MAXITER= option is increased beyond its default value to allow convergence to occur. The ODS OUTPUT statement saves the parameter estimates of the final model in a data set.

      proc nlmixed data=Simdata maxiter=500;
         mu=int 
            /* main effects */
            + bx1*x1 + bx2*x2 + bx3*x3 + bx4*x4 + bx5*x5 
            + bc11*(c1=1) + bc12*(c1=2) + bc13*(c1=3) + bc14*(c1=4) + bc15*(c1=5) + bc16*(c1=6) + bc17*(c1=7)
            + bc21*(c2=1) + bc22*(c2=2)
      
            /* interaction terms among x's */
            + bx12*x1*x2 + bx13*x1*x3 + bx14*x1*x4 + bx15*x1*x5 + bx23*x2*x3 + bx24*x2*x4 + bx25*x2*x5 + bx34*x3*x4 + bx35*x3*x5 + bx45*x4*x5 
      
            /* x*c1 interaction terms */
            + bc11x1*(c1=1)*x1 + bc12x1*(c1=2)*x1 + bc13x1*(c1=3)*x1 + bc14x1*(c1=4)*x1 + bc15x1*(c1=5)*x1 + bc16x1*(c1=6)*x1 + bc17x1*(c1=7)*x1
            + bc11x2*(c1=1)*x2 + bc12x2*(c1=2)*x2 + bc13x2*(c1=3)*x2 + bc14x2*(c1=4)*x2 + bc15x2*(c1=5)*x2 + bc16x2*(c1=6)*x2 + bc17x2*(c1=7)*x2
            + bc11x3*(c1=1)*x3 + bc12x3*(c1=2)*x3 + bc13x3*(c1=3)*x3 + bc14x3*(c1=4)*x3 + bc15x3*(c1=5)*x3 + bc16x3*(c1=6)*x3 + bc17x3*(c1=7)*x3
            + bc11x4*(c1=1)*x4 + bc12x4*(c1=2)*x4 + bc13x4*(c1=3)*x4 + bc14x4*(c1=4)*x4 + bc15x4*(c1=5)*x4 + bc16x4*(c1=6)*x4 + bc17x4*(c1=7)*x4
            + bc11x5*(c1=1)*x5 + bc12x5*(c1=2)*x5 + bc13x5*(c1=3)*x5 + bc14x5*(c1=4)*x5 + bc15x5*(c1=5)*x5 + bc16x5*(c1=6)*x5 + bc17x5*(c1=7)*x5
      
            /* x*c2 interaction terms */
            + bc21x1*(c2=1)*x1 + bc22x1*(c2=2)*x1
            + bc21x2*(c2=1)*x2 + bc22x2*(c2=2)*x2
            + bc21x3*(c2=1)*x3 + bc22x3*(c2=2)*x3
            + bc21x4*(c2=1)*x4 + bc22x4*(c2=2)*x4
            + bc21x5*(c2=1)*x5 + bc22x5*(c2=2)*x5
      
            /* c1*c2 interaction terms */
            + bc11c21*(c1=1)*(c2=1) + bc12c21*(c1=2)*(c2=1) + bc13c21*(c1=3)*(c2=1) + bc14c21*(c1=4)*(c2=1) + bc15c21*(c1=5)*(c2=1) + bc16c21*(c1=6)*(c2=1) + bc17c21*(c1=7)*(c2=1)
            + bc11c22*(c1=1)*(c2=2) + bc12c22*(c1=2)*(c2=2) + bc13c22*(c1=3)*(c2=2) + bc14c22*(c1=4)*(c2=2) + bc15c22*(c1=5)*(c2=2) + bc16c22*(c1=6)*(c2=2) + bc17c22*(c1=7)*(c2=2)
            ;
         L1Tune=0.1;
         LL=-(y-mu)**2               /* kernel of normal log likelihood */
            - L1Tune*sumabs(of b:);  /* L1 penalty */
         model y ~ general(ll);
         ods output parameterestimates=pe;
         run;

Since the LASSO method shrinks many parameter estimates to zero, interest is in seeing those parameters that remain nonzero in value. The following statements find the nonzero subset of the estimates and list them in decreasing order of their absolute values.

      data pe;
         set pe;
         absEst=abs(Estimate);
         run;
      proc sort data=pe;
         by descending absEst;
         run;
      proc print data=pe label; 
         where absest>1e-3; 
         id parameter;
         var Estimate;
         run;

Notice that the LASSO method in PROC NLMIXED retained all of the truly nonzero parameters in the final model. The only truly zero effect included is the x1 main effect, probably because x1 is involved in the nonzero x1*x2 interaction. Note also the shrinkage: the true coefficient of x1*x2 is -17, but its penalized estimate below is -9.33.

Parameter    Estimate
bx12          -9.3293
bc12           5.1899
bc15           4.4225
int            4.2060
bx1           -3.2336
bx2            0.7879

Example 2: LASSO in a logistic model

A logistic model with the same linear predictor can be fit to the binary response, b, simply by using the binomial log likelihood. A smaller tuning parameter value, 0.01, is used.

      proc nlmixed data=Simdata maxiter=500;
         xb=int 
            /* main effects */
            + bx1*x1 + bx2*x2 + bx3*x3 + bx4*x4 + bx5*x5 
            + bc11*(c1=1) + bc12*(c1=2) + bc13*(c1=3) + bc14*(c1=4) + bc15*(c1=5) + bc16*(c1=6) + bc17*(c1=7)
            + bc21*(c2=1) + bc22*(c2=2)
      
            /* interaction terms among x's */
            + bx12*x1*x2 + bx13*x1*x3 + bx14*x1*x4 + bx15*x1*x5 + bx23*x2*x3 + bx24*x2*x4 + bx25*x2*x5 + bx34*x3*x4 + bx35*x3*x5 + bx45*x4*x5 
      
            /* x*c1 interaction terms */
            + bc11x1*(c1=1)*x1 + bc12x1*(c1=2)*x1 + bc13x1*(c1=3)*x1 + bc14x1*(c1=4)*x1 + bc15x1*(c1=5)*x1 + bc16x1*(c1=6)*x1 + bc17x1*(c1=7)*x1
            + bc11x2*(c1=1)*x2 + bc12x2*(c1=2)*x2 + bc13x2*(c1=3)*x2 + bc14x2*(c1=4)*x2 + bc15x2*(c1=5)*x2 + bc16x2*(c1=6)*x2 + bc17x2*(c1=7)*x2
            + bc11x3*(c1=1)*x3 + bc12x3*(c1=2)*x3 + bc13x3*(c1=3)*x3 + bc14x3*(c1=4)*x3 + bc15x3*(c1=5)*x3 + bc16x3*(c1=6)*x3 + bc17x3*(c1=7)*x3
            + bc11x4*(c1=1)*x4 + bc12x4*(c1=2)*x4 + bc13x4*(c1=3)*x4 + bc14x4*(c1=4)*x4 + bc15x4*(c1=5)*x4 + bc16x4*(c1=6)*x4 + bc17x4*(c1=7)*x4
            + bc11x5*(c1=1)*x5 + bc12x5*(c1=2)*x5 + bc13x5*(c1=3)*x5 + bc14x5*(c1=4)*x5 + bc15x5*(c1=5)*x5 + bc16x5*(c1=6)*x5 + bc17x5*(c1=7)*x5
      
            /* x*c2 interaction terms */
            + bc21x1*(c2=1)*x1 + bc22x1*(c2=2)*x1
            + bc21x2*(c2=1)*x2 + bc22x2*(c2=2)*x2
            + bc21x3*(c2=1)*x3 + bc22x3*(c2=2)*x3
            + bc21x4*(c2=1)*x4 + bc22x4*(c2=2)*x4
            + bc21x5*(c2=1)*x5 + bc22x5*(c2=2)*x5
      
            /* c1*c2 interaction terms */
            + bc11c21*(c1=1)*(c2=1) + bc12c21*(c1=2)*(c2=1) + bc13c21*(c1=3)*(c2=1) + bc14c21*(c1=4)*(c2=1) + bc15c21*(c1=5)*(c2=1) + bc16c21*(c1=6)*(c2=1) + bc17c21*(c1=7)*(c2=1)
            + bc11c22*(c1=1)*(c2=2) + bc12c22*(c1=2)*(c2=2) + bc13c22*(c1=3)*(c2=2) + bc14c22*(c1=4)*(c2=2) + bc15c22*(c1=5)*(c2=2) + bc16c22*(c1=6)*(c2=2) + bc17c22*(c1=7)*(c2=2)
            ;
         L1Tune=0.01;
         mu=logistic(xb);
         LL=b*log(mu)+(1-b)*log(1-mu)  /* binomial log likelihood */
            - L1Tune*sumabs(of b:);    /* L1 penalty */
         model b ~ general(ll);
         ods output parameterestimates=pe;
         run;

These statements are used again to find the nonzero subset of the estimates and list them in decreasing order of their absolute values.

      data pe;
         set pe;
         absEst=abs(Estimate);
         run;
      proc sort data=pe;
         by descending absEst;
         run;
      proc print data=pe label; 
         where absest>1e-3; 
         id parameter;
         var Estimate;
         run;

The LASSO method produces a final model with all of the truly nonzero parameters except for the x2 main effect. PROC HPGENSELECT can also be used to apply the LASSO, as illustrated in the next example.

Parameter    Estimate
bx12          -4.5064
int            2.7440
bx1           -2.2653
bc12           2.2284
bc15           1.9215

Example 3: Collinearity in a logistic model

A separate note discusses the issue of collinearity in generalized linear models. In the example there, the Temp predictor has a very small range, making it collinear with the intercept in the logistic model. This instability causes some of the model parameter estimates to be inflated. These statements fit the logistic model using the original three predictors.

      proc logistic data=remiss;
        model Remiss(event="1") = Li Temp Cell;
        run;

Notice the severely inflated Intercept and Temp parameter estimates and standard errors.

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    67.6339          56.8875            1.4135       0.2345
LI           1     3.8671           1.7783            4.7290       0.0297
TEMP         1   -82.0737          61.7124            1.7687       0.1835
CELL         1     9.6521           7.7511            1.5507       0.2130

The solution, discussed in that note, is to standardize the predictors. The following statements refit the model using the standardized predictors.

      proc standard data=remiss s=1 out=std; /* Scale predictors to standard deviation 1 */
        var Li Temp Cell; 
        run;
      proc logistic data=std;
        model Remiss(event="1") = Li Temp Cell;
        run;

The standardization has removed the instability and the parameters and their standard errors now appear to be well-estimated. Notice that only the Li predictor appears to be significantly associated with the response.

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -3.9917           2.2477            3.1540       0.0757
LI           1     1.8090           0.8319            4.7290       0.0297
TEMP         1    -1.2197           0.9171            1.7687       0.1835
CELL         1     1.8015           1.4467            1.5507       0.2130

The LASSO can reduce the inflation of the parameters and simultaneously find the important predictors of the response. These statements fit the logistic model and add an L1 penalty to the binomial log likelihood function.

      proc nlmixed data=remiss;
         xb=int + bLi*Li + bCell*Cell + bTemp*Temp;
         p=logistic(xb);
         L1Tune=0.02;
         LL=(remiss=1)*log(p)+(remiss=0)*log(1-p)  /* binomial log likelihood */
            - L1Tune*sumabs(of b:);                /* L1 penalty */
         model remiss ~ general(ll);
         ods output parameterestimates=pe;
         run;
      data pe;
         set pe;
         absEst=abs(Estimate);
         run;
      proc sort data=pe;
         by descending absEst;
         run;
      proc print data=pe label; 
         where absest>1e-3; 
         id parameter;
         var Estimate;
         run;

The results show that the Cell and Temp parameters shrank to zero, leaving Li as the only important predictor. This agrees with the results from the model fit to the standardized data.

Parameter    Estimate
int           -3.0533
bLi            2.2313

The LASSO can also be applied to the logistic model using PROC HPGENSELECT. This is done with the METHOD=LASSO option in the SELECTION statement. In the statements below, the AICC criterion is used to choose among models and to stop the LASSO process. Other criteria are available. See the HPGENSELECT documentation for details.

      proc hpgenselect data=remiss;
         model Remiss(event="1") = Li Temp Cell / dist=binary;
         selection method=lasso(choose=aicc stop=aicc);
         run;

The results are similar to those from PROC NLMIXED above, with only the Li predictor selected in the final model.

Parameter Estimates
Parameter   DF    Estimate
Intercept    1   -3.124353
LI           1    2.301934

A similar result can be obtained with the elastic net by replacing the LL= assignment in the PROC NLMIXED code above with the following statements. An L2 penalty, placed on the sum of the squared parameters via the USS function, and its own tuning parameter multiplier are added to the log likelihood.

         L2Tune=0.005;
         LL=(remiss=1)*log(p)+(remiss=0)*log(1-p)
            - L1Tune*sumabs(of b:)
            - L2Tune*uss(of b:);
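
For reference, the complete elastic net step, assembled from the pieces above with nothing new added, looks like this:

      proc nlmixed data=remiss;
         xb=int + bLi*Li + bCell*Cell + bTemp*Temp;
         p=logistic(xb);
         L1Tune=0.02;
         L2Tune=0.005;
         LL=(remiss=1)*log(p)+(remiss=0)*log(1-p)  /* binomial log likelihood */
            - L1Tune*sumabs(of b:)                 /* L1 penalty */
            - L2Tune*uss(of b:);                   /* L2 (ridge) penalty */
         model remiss ~ general(ll);
         ods output parameterestimates=pe;
         run;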

Notice that some additional shrinkage is provided by the ridging penalty.

Parameter    Estimate
int           -2.5345
bLi            1.7529

Example 4: Ridging a subset of parameters in a Poisson model

As discussed in a separate note, predictors that are collinear whenever the response count is greater than zero cause parameter inflation similar to that caused by separation in a logistic model. These statements construct a data set in which the response variable is a count for which Poisson regression is appropriate. The true model is μ = exp(β1x1 + β3x3), where β1 = β3 = 1. Another variable, x2, is identical to x1 whenever the response, y, is greater than zero.

      data p;
         call streaminit(57356);
         do i=1 to 100;
            x1=rand("uniform");
            x3=rand("uniform");
            mu=exp(x1+x3);               /* True Poisson mean */
            y=rand("poisson",mu);        /* Count response */
            x2=x1+(y=0)*rand("uniform"); /* x2 equals x1 whenever y>0 */
            output;
         end;
         run;

These statements attempt to fit a Poisson model to the data but result in warnings that convergence of the maximum likelihood estimation is questionable.

      proc genmod data=p;
         model y=x1 x2 x3 / dist=poisson itprint;
         run;

The ITPRINT option shows that some of the model parameters are highly unstable. The x1 and x2 parameters (labeled Prm2 and Prm3) continue to increase in absolute value with each iteration while the intercept and x3 parameters (labeled Prm1 and Prm4) stabilize relatively quickly.

Iteration History For Parameter Estimates
Iter Ridge LogLikelihood Prm1 Prm2 Prm3 Prm4
0 0 59.7861914 0.3686912 2.8349615 -1.988579 0.9084452
1 0 66.2789287 0.2046498 4.1237811 -3.258122 0.9528515
2 0 68.0371605 0.2015666 5.8759176 -5.019757 0.9553002
3 0 69.1158066 0.2126122 8.3710573 -7.528428 0.9530929
4 0 69.9580222 0.2214939 12.56063 -11.73143 0.9546535
5 0 70.581316 0.2254217 19.388698 -18.56926 0.9601877
6 0 70.9155092 0.2254955 28.224928 -27.40898 0.9644019
7 0 71.0605919 0.2253295 38.106453 -37.29131 0.9656718
8 0 71.1175969 0.2252932 48.434891 -47.61989 0.9659039
9 0 71.1390573 0.225288 58.924791 -58.10981 0.9659361
10 0 71.1470113 0.2252873 69.467573 -68.6526 0.96594
11 0 71.1499442 0.2252873 80.026963 -79.21199 0.9659405
12 0 71.1510239 0.2252872 90.59149 -89.77651 0.9659405
13 0 71.1514212 0.2252872 101.1576 -100.3426 0.9659405
14 0 71.1515674 0.2252872 111.72419 -110.9092 0.9659405
15 0 71.1516212 0.2252872 122.29093 -121.476 0.9659405
16 0 71.151641 0.2252872 132.85772 -132.0427 0.9659405
17 0 71.1516482 0.2252872 143.42452 -142.6095 0.9659405
18 0 71.1516509 0.2252872 153.99134 -153.1764 0.9659405
19 0 71.1516519 0.2252872 164.55813 -163.7432 0.9659405
20 0 71.1516523 0.2252872 175.12493 -174.31 0.9659405
21 0 71.1516524 0.2252872 185.6917 -184.8767 0.9659405
22 0 71.1516524 0.2252872 196.2585 -195.4435 0.9659405
23 0 71.1516525 0.2252872 206.824 -206.009 0.9659405
24 0.06 71.1516525 0.2252872 206.824 -206.009 0.9659405
 
Analysis Of Maximum Likelihood Parameter Estimates
Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Wald Chi-Square   Pr > ChiSq
Intercept    1     0.2253           0.1783      -0.1241      0.5747                 1.60       0.2063
x1           1   206.8240           0.2003     206.4314    207.2166              1065868       <.0001
x2           0   -206.009           0.0000     -206.009    -206.009                    .            .
x3           1     0.9659           0.2077       0.5589      1.3730                21.63       <.0001
Scale        0     1.0000           0.0000       1.0000      1.0000

The following statements illustrate the use of ridging on a subset of the parameters by introducing an L2 penalty in the Poisson log likelihood. Only the x1 and x2 parameters are penalized by using a penalty that is the sum of their squared values. With this added penalty, the model converges properly.

      proc nlmixed data=p;
         L2Tune=0.04;
         mu=exp(int + b1*x1 + b2*x2 + b3*x3);
         LL=y*log(mu)-mu                /* kernel of Poisson log likelihood */
            - L2Tune*(b1**2 + b2**2);   /* L2 penalty */
         model y~general(ll);
         run;

Notice that ridging has drastically reduced the x1 and x2 parameter estimates (labeled b1 and b2). The final estimates are fairly close to the true model values. However, standard errors and tests are not reliable in such a penalized model.

Parameter Estimates
Parameter   Estimate   Standard Error    DF   t Value   Pr > |t|   95% Confidence Limits    Gradient
int           0.1049           0.1740   100      0.60     0.5482    -0.2405      0.4502    1.552E-7
b1            0.9979           0.2324   100      4.29     <.0001     0.5368      1.4589    7.959E-9
b2           -0.2942           0.2235   100     -1.32     0.1911    -0.7375      0.1492    1.073E-7
b3            1.1631           0.2094   100      5.55     <.0001     0.7475      1.5786    3.647E-8
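
If penalized predicted means are also of interest, a PREDICT statement can be added to the PROC NLMIXED step. The sketch below simply repeats the step above with that one addition; the output data set name, PoisPred, is illustrative.

      proc nlmixed data=p;
         L2Tune=0.04;
         mu=exp(int + b1*x1 + b2*x2 + b3*x3);
         LL=y*log(mu)-mu                /* kernel of Poisson log likelihood */
            - L2Tune*(b1**2 + b2**2);   /* L2 penalty on b1 and b2 only */
         model y~general(ll);
         predict mu out=PoisPred;       /* penalized predicted means */
         run;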

References

Agresti, A. (2013). Categorical Data Analysis, Third Edition. New York: John Wiley & Sons Inc.

Gunes, F. (2015). "Penalized Regression Methods for Linear Models in SAS/STAT®." Cary, NC: SAS Institute Inc.


