Regularization methods can be applied in order to shrink model parameter estimates in situations of instability. As discussed by Agresti (2013), one such situation occurs when there is a large number of covariates, of which only a small subset are strongly associated with the response, and the sample size is not large. In this case, the maximum likelihood estimates can be inflated. Parameter estimates can also be inflated by collinearity among the predictors, by separation in the case of logistic regression, and for other reasons.
Regularization works by adding a penalty term of some form to the log likelihood function that is maximized in parameter estimation, shrinking the maximum likelihood estimates of the model parameters toward zero. The LASSO method adds an L1 norm penalty (the sum of the absolute values of the estimates). Ridging adds an L2 norm penalty (the sum of the squared estimates). The elastic net method uses the L1 and L2 penalties in combination.
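In symbols, if ℓ(β) denotes the log likelihood for model parameters β1, ..., βp, these methods choose the estimates to maximize the penalized objective

ℓ(β) − λ1(|β1| + ... + |βp|) − λ2(β1² + ... + βp²)

where λ1 and λ2 are nonnegative tuning parameters: setting λ2 = 0 gives the LASSO, setting λ1 = 0 gives ridging, and using both penalties gives the elastic net.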
The LASSO method is particularly useful as an efficient way to do model selection since it tends to shrink some estimates all the way to zero, producing a simpler model. Ridging shrinks all estimates, but generally not to zero. The drawback of shrinkage through penalization is that the estimates become biased, so tests and confidence intervals based on them are not useful, even though their variance is reduced. However, the overall mean square error (the sum of the variance and the squared bias) may be smaller after penalization. Another penalization method, due to Firth (1993), has been shown to reduce bias in the estimates.
With the methods discussed here, a tuning parameter multiplies the penalty term, allowing control over the amount of penalization. Increasing the value of the tuning parameter increases the amount of shrinkage in the parameter estimates. As with other smoothing methods that use a tuning parameter, such as generalized additive modeling, a value for the parameter must be chosen. This is done by fitting a series of models over a range of values and selecting the value that minimizes the mean square error. Typically, separate validation data are used for this assessment, or cross validation is used if separate data are not available. Information criteria, such as AIC or SBC, can also be used. This note discusses only the use of PROC NLMIXED to fit one model with a particular tuning parameter value. Note that fitting a series of models in this way to evaluate many tuning parameter values incurs the full cost of maximum likelihood estimation for each model. The selection methods in the GLMSELECT and HPGENSELECT procedures provide more efficient ways to carry out this process.
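To make the refitting cost concrete, the following sketch evaluates a small grid of tuning parameter values by refitting an L1-penalized model in PROC NLMIXED for each one. The tune_demo data set, the three-predictor model, and the grid of values are all hypothetical, and the training-sample MSE computed here only illustrates the mechanics; as noted above, validation or cross-validation MSE should be used to choose the value in practice.

/* Minimal sketch: evaluate a grid of L1 tuning values by refitting.
   The tune_demo data, three-predictor model, and grid are hypothetical. */
data tune_demo;
   call streaminit(123);
   do i=1 to 200;
      x1=rand("uniform"); x2=rand("uniform"); x3=rand("uniform");
      y=2 + 3*x2 + rand("normal");             /* only x2 is truly nonzero */
      output;
   end;
run;

%macro lassoGrid;
   %do t=1 %to 5;
      %let lam=%sysevalf(0.02*&t);
      proc nlmixed data=tune_demo;
         mu = int + b1*x1 + b2*x2 + b3*x3;
         LL = -(y-mu)**2 - &lam*sumabs(of b:); /* penalized objective */
         model y ~ general(LL);
         predict mu out=pred&t;                /* save fitted values */
      run;
      data mse&t;                              /* MSE for this grid point */
         set pred&t end=last;
         sse + (y-Pred)**2;
         if last then do;
            L1Tune=&lam; mse=sse/_n_; output;
         end;
         keep L1Tune mse;
      run;
   %end;
   data allmse; set mse1-mse5; run;
%mend;
%lassoGrid

proc print data=allmse; run;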
Various regression penalties are available in SAS® procedures, including the LASSO, elastic net, ridge regression, and Firth methods. The LASSO (and related methods) and the elastic net, as provided in PROC GLMSELECT for normal response models, are illustrated and discussed in more detail in Gunes (2015).
This note illustrates how these regularization methods can be implemented in PROC NLMIXED by directly specifying the log likelihood, penalty, and tuning parameter. The following statements modify the DATA step in Gunes (2015) to add a binary variable to the simulated data that it creates. The first example below uses the normally distributed variable, y, as the response. The next example uses the binary variable, b, as the response in a logistic model. Note that the true model involves only x1, x2, and parts of the c1 predictor: 2 + 5*x2 - 17*x1*x2 + 6*(c1=2) + 5*(c1=5).
data Simdata;
   drop i j;
   array x{5} x1-x5;
   do i=1 to 1000;
      do j=1 to 5;
         x{j} = ranuni(1);            /* Continuous predictors */
      end;
      c1 = int(1.5+ranuni(1)*7);      /* Classification predictors */
      c2 = 1 + mod(i,3);
      yTrue = 2 + 5*x2 - 17*x1*x2 + 6*(c1=2) + 5*(c1=5);
      y = yTrue + 2*rannor(1);
      p = logistic(yTrue);
      b = rantbl(1,1-p,p) - 1;
      output Simdata;
   end;
run;
The statements that follow specify the same main-effects and two-way-interaction model as fit by Gunes (2015) in PROC GLMSELECT, but apply the LASSO method using PROC NLMIXED. The assignment statement that defines MU specifies the model effects. The assignment statement that defines LL gives the function to be maximized: the necessary part of the normal log likelihood minus an L1 penalty multiplied by the tuning parameter L1Tune, which is set to 0.1. Note that the names of all parameters in the model (other than the intercept) begin with "b", so the SUMABS(of b:) function computes the sum of the absolute values of the model parameters. The MAXITER= option is increased beyond its default value to allow convergence to occur. The ODS OUTPUT statement saves the parameter estimates of the final model in a data set.
proc nlmixed data=Simdata maxiter=500;
   mu = int
        /* main effects */
        + bx1*x1 + bx2*x2 + bx3*x3 + bx4*x4 + bx5*x5
        + bc11*(c1=1) + bc12*(c1=2) + bc13*(c1=3) + bc14*(c1=4)
        + bc15*(c1=5) + bc16*(c1=6) + bc17*(c1=7)
        + bc21*(c2=1) + bc22*(c2=2)
        /* interaction terms among x's */
        + bx12*x1*x2 + bx13*x1*x3 + bx14*x1*x4 + bx15*x1*x5
        + bx23*x2*x3 + bx24*x2*x4 + bx25*x2*x5
        + bx34*x3*x4 + bx35*x3*x5 + bx45*x4*x5
        /* x*c1 interaction terms */
        + bc11x1*(c1=1)*x1 + bc12x1*(c1=2)*x1 + bc13x1*(c1=3)*x1
        + bc14x1*(c1=4)*x1 + bc15x1*(c1=5)*x1 + bc16x1*(c1=6)*x1
        + bc17x1*(c1=7)*x1
        + bc11x2*(c1=1)*x2 + bc12x2*(c1=2)*x2 + bc13x2*(c1=3)*x2
        + bc14x2*(c1=4)*x2 + bc15x2*(c1=5)*x2 + bc16x2*(c1=6)*x2
        + bc17x2*(c1=7)*x2
        + bc11x3*(c1=1)*x3 + bc12x3*(c1=2)*x3 + bc13x3*(c1=3)*x3
        + bc14x3*(c1=4)*x3 + bc15x3*(c1=5)*x3 + bc16x3*(c1=6)*x3
        + bc17x3*(c1=7)*x3
        + bc11x4*(c1=1)*x4 + bc12x4*(c1=2)*x4 + bc13x4*(c1=3)*x4
        + bc14x4*(c1=4)*x4 + bc15x4*(c1=5)*x4 + bc16x4*(c1=6)*x4
        + bc17x4*(c1=7)*x4
        + bc11x5*(c1=1)*x5 + bc12x5*(c1=2)*x5 + bc13x5*(c1=3)*x5
        + bc14x5*(c1=4)*x5 + bc15x5*(c1=5)*x5 + bc16x5*(c1=6)*x5
        + bc17x5*(c1=7)*x5
        /* x*c2 interaction terms */
        + bc21x1*(c2=1)*x1 + bc22x1*(c2=2)*x1
        + bc21x2*(c2=1)*x2 + bc22x2*(c2=2)*x2
        + bc21x3*(c2=1)*x3 + bc22x3*(c2=2)*x3
        + bc21x4*(c2=1)*x4 + bc22x4*(c2=2)*x4
        + bc21x5*(c2=1)*x5 + bc22x5*(c2=2)*x5
        /* c1*c2 interaction terms */
        + bc11c21*(c1=1)*(c2=1) + bc12c21*(c1=2)*(c2=1)
        + bc13c21*(c1=3)*(c2=1) + bc14c21*(c1=4)*(c2=1)
        + bc15c21*(c1=5)*(c2=1) + bc16c21*(c1=6)*(c2=1)
        + bc17c21*(c1=7)*(c2=1)
        + bc11c22*(c1=1)*(c2=2) + bc12c22*(c1=2)*(c2=2)
        + bc13c22*(c1=3)*(c2=2) + bc14c22*(c1=4)*(c2=2)
        + bc15c22*(c1=5)*(c2=2) + bc16c22*(c1=6)*(c2=2)
        + bc17c22*(c1=7)*(c2=2)
        ;
   L1Tune=0.1;
   LL = -(y-mu)**2              /* kernel of normal log likelihood */
        - L1Tune*sumabs(of b:); /* L1 penalty */
   model y ~ general(LL);
   ods output parameterestimates=pe;
run;
Since the LASSO method shrinks many parameter estimates to zero, interest is in seeing those parameters that remain nonzero in value. The following statements find the nonzero subset of the estimates and list them in decreasing order of their absolute values.
data pe;
   set pe;
   absEst=abs(Estimate);
run;

proc sort data=pe;
   by descending absEst;
run;

proc print data=pe label;
   where absEst>1e-3;
   id parameter;
   var Estimate;
run;
Notice that the LASSO method in PROC NLMIXED found all of the truly nonzero parameters in the final model. The only truly zero effect that remains is the x1 main effect, probably because x1 is involved in the nonzero x1*x2 interaction.
A logistic model with the same linear predictor can be fit to the binary response, b, by simply using the binomial log likelihood. A smaller tuning parameter value, 0.01, is used.
proc nlmixed data=Simdata maxiter=500;
   xb = int
        /* main effects */
        + bx1*x1 + bx2*x2 + bx3*x3 + bx4*x4 + bx5*x5
        + bc11*(c1=1) + bc12*(c1=2) + bc13*(c1=3) + bc14*(c1=4)
        + bc15*(c1=5) + bc16*(c1=6) + bc17*(c1=7)
        + bc21*(c2=1) + bc22*(c2=2)
        /* interaction terms among x's */
        + bx12*x1*x2 + bx13*x1*x3 + bx14*x1*x4 + bx15*x1*x5
        + bx23*x2*x3 + bx24*x2*x4 + bx25*x2*x5
        + bx34*x3*x4 + bx35*x3*x5 + bx45*x4*x5
        /* x*c1 interaction terms */
        + bc11x1*(c1=1)*x1 + bc12x1*(c1=2)*x1 + bc13x1*(c1=3)*x1
        + bc14x1*(c1=4)*x1 + bc15x1*(c1=5)*x1 + bc16x1*(c1=6)*x1
        + bc17x1*(c1=7)*x1
        + bc11x2*(c1=1)*x2 + bc12x2*(c1=2)*x2 + bc13x2*(c1=3)*x2
        + bc14x2*(c1=4)*x2 + bc15x2*(c1=5)*x2 + bc16x2*(c1=6)*x2
        + bc17x2*(c1=7)*x2
        + bc11x3*(c1=1)*x3 + bc12x3*(c1=2)*x3 + bc13x3*(c1=3)*x3
        + bc14x3*(c1=4)*x3 + bc15x3*(c1=5)*x3 + bc16x3*(c1=6)*x3
        + bc17x3*(c1=7)*x3
        + bc11x4*(c1=1)*x4 + bc12x4*(c1=2)*x4 + bc13x4*(c1=3)*x4
        + bc14x4*(c1=4)*x4 + bc15x4*(c1=5)*x4 + bc16x4*(c1=6)*x4
        + bc17x4*(c1=7)*x4
        + bc11x5*(c1=1)*x5 + bc12x5*(c1=2)*x5 + bc13x5*(c1=3)*x5
        + bc14x5*(c1=4)*x5 + bc15x5*(c1=5)*x5 + bc16x5*(c1=6)*x5
        + bc17x5*(c1=7)*x5
        /* x*c2 interaction terms */
        + bc21x1*(c2=1)*x1 + bc22x1*(c2=2)*x1
        + bc21x2*(c2=1)*x2 + bc22x2*(c2=2)*x2
        + bc21x3*(c2=1)*x3 + bc22x3*(c2=2)*x3
        + bc21x4*(c2=1)*x4 + bc22x4*(c2=2)*x4
        + bc21x5*(c2=1)*x5 + bc22x5*(c2=2)*x5
        /* c1*c2 interaction terms */
        + bc11c21*(c1=1)*(c2=1) + bc12c21*(c1=2)*(c2=1)
        + bc13c21*(c1=3)*(c2=1) + bc14c21*(c1=4)*(c2=1)
        + bc15c21*(c1=5)*(c2=1) + bc16c21*(c1=6)*(c2=1)
        + bc17c21*(c1=7)*(c2=1)
        + bc11c22*(c1=1)*(c2=2) + bc12c22*(c1=2)*(c2=2)
        + bc13c22*(c1=3)*(c2=2) + bc14c22*(c1=4)*(c2=2)
        + bc15c22*(c1=5)*(c2=2) + bc16c22*(c1=6)*(c2=2)
        + bc17c22*(c1=7)*(c2=2)
        ;
   L1Tune=0.01;
   mu = logistic(xb);
   LL = b*log(mu) + (1-b)*log(1-mu) /* binomial log likelihood */
        - L1Tune*sumabs(of b:);     /* L1 penalty */
   model b ~ general(LL);
   ods output parameterestimates=pe;
run;
These statements are used again to find the nonzero subset of the estimates and list them in decreasing order of their absolute values.
data pe;
   set pe;
   absEst=abs(Estimate);
run;

proc sort data=pe;
   by descending absEst;
run;

proc print data=pe label;
   where absEst>1e-3;
   id parameter;
   var Estimate;
run;
The LASSO method produces a final model that includes all of the truly nonzero parameters except for x2. PROC HPGENSELECT can also be used to apply the LASSO, as sketched below and as illustrated again in the next example.
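For comparison, a possible PROC HPGENSELECT specification of the same logistic LASSO fit is sketched below. This is a sketch only, and its output has not been verified here: the CLASS statement and the bar-operator effect list with @2 are assumed to reproduce the main-effects and two-way-interaction model that is written out explicitly in the NLMIXED code, and the CHOOSE=AICC and STOP=AICC criteria are illustrative choices.

proc hpgenselect data=Simdata;
   class c1 c2;
   /* all main effects and two-way interactions, as in the NLMIXED model */
   model b(event="1") = x1|x2|x3|x4|x5|c1|c2 @2 / dist=binary;
   selection method=lasso(choose=aicc stop=aicc);
run;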
A separate usage note discusses the issue of collinearity in generalized linear models. In the example considered there, the Temp predictor has a very small range, making it collinear with the intercept in the logistic model. This instability causes some of the model parameter estimates to be inflated. These statements fit the logistic model using the original three predictors. (The remiss data set here is the leukemia remission data that is used in the PROC LOGISTIC documentation.)
proc logistic data=remiss;
   model Remiss(event="1") = Li Temp Cell;
run;
Notice the severely inflated Intercept and Temp parameter estimates and standard errors.
The solution, discussed in the note, is to standardize the predictors. The following statements refit the model on the standardized predictors.
proc standard data=remiss s=1 out=std;
   var Li Temp Cell;
run;

proc logistic data=std;
   model Remiss(event="1") = Li Temp Cell;
run;
The standardization has removed the instability, and the parameters and their standard errors now appear to be well estimated. Notice that only the Li predictor appears to be significantly associated with the response.
The LASSO can reduce the inflation of the parameters and simultaneously find the important predictors of the response. These statements fit the logistic model and add an L1 penalty to the binomial log likelihood function.
proc nlmixed data=remiss;
   xb = int + bLi*Li + bCell*Cell + bTemp*Temp;
   p = logistic(xb);
   L1Tune=0.02;
   LL = (remiss=1)*log(p) + (remiss=0)*log(1-p)
        - L1Tune*sumabs(of b:);
   model remiss ~ general(LL);
   ods output parameterestimates=pe;
run;

data pe;
   set pe;
   absEst=abs(Estimate);
run;

proc sort data=pe;
   by descending absEst;
run;

proc print data=pe label;
   where absEst>1e-3;
   id parameter;
   var Estimate;
run;
The results show that the Cell and Temp parameters shrank to zero, leaving Li as the only important predictor. This agrees with the results from the model fit to the standardized data.
The LASSO can also be applied to the logistic model using PROC HPGENSELECT. This is done with the METHOD=LASSO option in the SELECTION statement. In the statements below, the AICC criterion is used to choose among models and to stop the LASSO process. Other criteria are available. See the HPGENSELECT documentation for details.
proc hpgenselect data=remiss;
   model Remiss(event="1") = Li Temp Cell / dist=binary;
   selection method=lasso(choose=aicc stop=aicc);
run;
The results are similar to those from PROC NLMIXED above, with only the Li predictor selected in the final model.
A similar result can be obtained using the elastic net by replacing the LL= assignment statement in the code above with the following statements. An L2 penalty on the sum of the squared parameters, computed with the USS function and multiplied by its own tuning parameter, is added to the log likelihood along with the L1 penalty. The assembled fit is shown after the fragment.
L2Tune=0.005;
LL = (remiss=1)*log(p) + (remiss=0)*log(1-p)
     - L1Tune*sumabs(of b:)
     - L2Tune*uss(of b:);
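Assembled with the PROC NLMIXED step shown earlier, the complete elastic net fit is as follows (using the tuning parameter values given above):

proc nlmixed data=remiss;
   xb = int + bLi*Li + bCell*Cell + bTemp*Temp;
   p = logistic(xb);
   L1Tune=0.02;
   L2Tune=0.005;
   LL = (remiss=1)*log(p) + (remiss=0)*log(1-p)
        - L1Tune*sumabs(of b:)   /* L1 penalty */
        - L2Tune*uss(of b:);     /* L2 penalty */
   model remiss ~ general(LL);
   ods output parameterestimates=pe;
run;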
Notice that some additional shrinkage is provided by the ridging penalty.
As discussed in a separate usage note, collinearity among predictors that occurs on the observations where the response count is greater than zero causes parameter inflation similar to that caused by separation in a logistic model. These statements construct a data set in which the response variable is a count for which Poisson regression is appropriate. The true model is μ = exp(β1x1 + β3x3), where β1 = β3 = 1. Another variable, x2, is the same as x1 whenever the response, y, is greater than zero.
data p;
   call streaminit(57356);
   do i=1 to 100;
      x1=rand("uniform");
      x3=rand("uniform");
      mu=exp(x1+x3);
      y=rand("poisson",mu);
      x2=x1+(y=0)*rand("uniform");
      output;
   end;
run;
These statements attempt to fit a Poisson model to the data but result in warnings that convergence of the maximum likelihood estimation is questionable.
proc genmod data=p;
   model y=x1 x2 x3 / dist=poisson itprint;
run;
The ITPRINT option shows that some of the model parameters are highly unstable. The x1 and x2 parameters (labeled Prm2 and Prm3) continue to increase in absolute value with each iteration while the intercept and x3 parameters (labeled Prm1 and Prm4) stabilize relatively quickly.
The following statements illustrate the use of ridging on a subset of the parameters by introducing an L2 penalty in the Poisson log likelihood. Only the x1 and x2 parameters are penalized by using a penalty that is the sum of their squared values. With this added penalty, the model converges properly.
proc nlmixed data=p;
   L2Tune=0.04;
   mu = exp(int + b1*x1 + b2*x2 + b3*x3);
   LL = y*log(mu) - mu            /* kernel of Poisson log likelihood */
        - L2Tune*(b1**2 + b2**2); /* L2 penalty */
   model y ~ general(LL);
run;
Notice that ridging has drastically reduced the x1 and x2 parameters (labeled b1 and b2). The final parameters are fairly close to the true model values. However, standard errors and tests are not reliable in such a penalized model.
References
Agresti, A. (2013). Categorical Data Analysis, Third Edition. New York: John Wiley & Sons Inc.
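Firth, D. (1993). "Bias Reduction of Maximum Likelihood Estimates." Biometrika 80:27-38.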
Gunes, F. (2015). "Penalized Regression Methods for Linear Models in SAS/STAT®." Cary, NC: SAS Institute Inc.