Researchers who conduct clinical trials often have to measure the concentration of a drug in a patient’s blood over a specified interval of time. The measured data represents the body’s reaction as evidenced by an increase or decrease in the concentration of a particular substance. Another example occurs when scientists gather data to measure the concentration of a toxin from a chemical spill that poses an acute risk of cancer and neurological effects as well as many other negative health impacts including heart damage, lung disease, kidney disease, reproductive problems, gastrointestinal illness, birth defects, or impaired bone growth in children. Such response measures are proportions that vary continuously over the zero to one range (or zero to 100 when percentages).
While the binomial distribution is generally applicable to a dependent (response) variable that represents proportions from aggregated binary responses, it might not apply to underlying continuous data. The binomial distribution is typically used when the proportions represent a count of events resulting from a known number of trials. Response measurements, like those above, that are proportions relative to a standard (such as proportion of control) or percentages of a total amount (such as proportion reduction in body fat, proportion of area damaged in a forest fire, proportion of chemical absorbed, or the amount of funds charged off out of a balance) might not be binomial since they do not represent a set of independent trials. When the response in a regression model is a proportion or a percentage, it can be difficult to decide on an appropriate modeling approach. While some researchers use the simplest approach and fit a linear regression model, ordinary linear regression can predict invalid values that are less than zero or greater than one. Proportion data typically exhibit a sigmoidal, or Sshaped, curve with asymptotes at the limits of zero and one when plotted against a predictor. In general, ordinary regression does not capture this relationship, but it might be reasonable if all of the data fall in the middle of the range (between about 0.3 and 0.7) where the curve is approximately linear. However, there are other approaches that can be used.
Two ways to model continuous proportions include using the beta distribution in the GLIMMIX procedure or defining a nonlinear model using the NLIN or NLMIXED procedures in SAS/STAT^{®} software. The following examples illustrate these approaches using the GLIMMIX and NLIN procedures to model continuous proportions or percentages. Two other models are the fractional logistic model, which uses a quasilikelihood approach, and the 4 or 5parameter logistic model also known as E_{max} models. These models are discussed and illustrated in this note.
The following statements create the data set with tissue damage proportions, Y, at various concentrations of the chemical, CONC. Note that the concentrations are approximately equally spaced on the common log scale.
data damage; input conc y; lconc=log10(conc); datalines; 0.1 .08 0.25 .09 0.5 .11 1.0 .20 2 .30 4 .53 5.0 .63 8.0 .71 10.0 .73 25.0 .84 50.0 .85 100.0 .86 ;
Normal model: REG and NLIN procedures
These statements fit a simple linear regression model using the REG procedure. The OUTPUT statement saves the predicted responses in data set REGOUT.
proc reg data=damage; model y = lconc; output out=regout p=rpredy; quit;
The model produces the following graphical results. Even though the fit plot does not look too bad and R^{2} is high (0.9119), the diagnostic plots indicate possible problems with the model fit. The residuals are not randomly scattered about zero, the residual quantile plot indicates possible violation of the normality assumption, and the plot of the response variable Y against the predicted values look more like an Sshaped curve than a straight line. Also, examination of the data set of predicted values shows that one of the values falls outside the [0,1] range.

Under the assumption that the observed proportions are from a normal distribution, the following statements estimate a nonlinear logistic curve using the NLIN procedure. The logistic function is 1/(1+e^{x}) and maps values of x to the interval [0,1] for any real value of x. Since it produces values between zero and one, it is a reasonable candidate for a model on proportions.
proc nlin data=damage plots=all hougaard; parms p1=1 p2=0.5; xbeta=p1 + p2*lconc; model y=1/(1 + exp(xbeta)); output out=nlinout p=npredy lclm=lower uclm=upper; run;
Following are the estimated parameters of the twoparameter logistic model. The skewness measures of 0.2889 and 0.3498 indicate that the parameter estimators exhibit skewed behavior and that their standard errors and confidence intervals should be used cautiously for inferences.

Notice that the fitted model does a better job of following the curvature of the observed proportions though the diagnostic plots still indicate problems similar to the linear regression model above.

Beta model: GLIMMIX procedure
Some researchers model continuous proportions using the beta distribution. Like proportions, the beta distribution supports a range from zero to one. The beta distribution can also accommodate right or left skewness, two conditions that the normal distribution cannot account for because of its symmetry. The GLIMMIX procedure fits a betalogistic model using maximum likelihood estimation when the DIST=BETA and LINK=LOGIT options are specified in the MODEL statement. The logit link ensures that the predicted mean stays within bounds (0,1). However, this model does not permit proportions equal to 0 or 1. Such observations are ignored by GLIMMIX when fitting the beta model.
The following statements fit the betalogistic model and save the predicted values and 95% confidence limits in data set GMXOUT.
proc glimmix data=damage; model y = lconc / dist=beta link=logit solution; output out=gmxout pred(ilink)=gpredy lcl(ilink)=lower ucl(ilink)=upper; run;
The Dimensions table indicates one covariance parameter was estimated that corresponds to the scale parameter. The scale parameter (59.4261) is displayed in the Parameter Estimates table and is inversely related to the variance of the response variable. For models without random effects, a good model fit is indicated when the Pearson chisquare divided by its degrees of freedom is close to one. For this model the statistic is 1.38 suggesting that the model fits the data well. The regression parameters of the beta regression model are interpretable as log odds ratios when the logit link is used. The significant parameter estimate for lconc (p<0.0001) suggests that as the log concentration increases by one unit, the log odds of tissue damage increases by 1.8406. Equivalently, e^{1.8406}=6.3 is the estimated odds ratio indicating that every one unit increase in the log concentration increases the odds of tissue damage by 6.3.

These statements plot the fitted model.
proc sgplot data=gmxout noautolegend; band upper=upper lower=lower x=conc; scatter y=y x=conc; series y=gpredy x=conc; xaxis type=log label="Concentration"; yaxis label="Proportion"; title "Beta fit plot"; run;
As with the logistic model fit with NLIN, the beta model does a reasonable job of following the data curvature.
The following statements create a comparative plot of the models fit using the REG, NLIN and GLIMMIX procedures.
data combo; merge regout nlinout gmxout; run; proc sgplot data=combo; scatter y=y x=conc; series y=rpredy x=conc / lineattrs=(color=green) name="Linear Regression" legendlabel="Linear Regression"; series y=npredy x=conc / lineattrs=(color=blue) name="Nonlinear logit Regression" legendlabel="Nonlinear logit Regression"; series y=gpredy x=conc / lineattrs=(color=red) name="Beta Regression" legendlabel="Beta Regression"; xaxis type=log label="Concentration"; yaxis label="Proportion"; keylegend "Linear Regression" "Nonlinear logit Regression" "Beta Regression"; title "Comparison of Models"; run;
The normallogistic model from NLIN and the betalogistic model from GLIMMIX capture some of the curvature in the data. However, these data seem to plateau at both low and high proportions. Another model, the 4parameter logistic model can model data that is limited to a portion of the [0,1] range, and is illustrated in this note.
The beta model can be used to fit more complex models to continuous proportion data. For example, the following statements fit a random effects model to simulated data collected on individuals in cities.
proc glimmix method=quad plots=studentpanel; class city; model y = x / dist=beta link=logit solution; random intercept / subject=city solution; run;
Product Family  Product  System  SAS Release  
Reported  Fixed*  
SAS System  SAS/STAT  z/OS  
z/OS 64bit  
OpenVMS VAX  
Microsoft® Windows® for 64Bit Itaniumbased Systems  
Microsoft Windows Server 2003 Datacenter 64bit Edition  
Microsoft Windows Server 2003 Enterprise 64bit Edition  
Microsoft Windows XP 64bit Edition  
Microsoft® Windows® for x64  
OS/2  
Microsoft Windows 8 Enterprise 32bit  
Microsoft Windows 8 Enterprise x64  
Microsoft Windows 8 Pro 32bit  
Microsoft Windows 8 Pro x64  
Microsoft Windows 8.1 Enterprise 32bit  
Microsoft Windows 8.1 Enterprise x64  
Microsoft Windows 8.1 Pro  
Microsoft Windows 8.1 Pro 32bit  
Microsoft Windows 10  
Microsoft Windows 95/98  
Microsoft Windows 2000 Advanced Server  
Microsoft Windows 2000 Datacenter Server  
Microsoft Windows 2000 Server  
Microsoft Windows 2000 Professional  
Microsoft Windows NT Workstation  
Microsoft Windows Server 2003 Datacenter Edition  
Microsoft Windows Server 2003 Enterprise Edition  
Microsoft Windows Server 2003 Standard Edition  
Microsoft Windows Server 2003 for x64  
Microsoft Windows Server 2008  
Microsoft Windows Server 2008 R2  
Microsoft Windows Server 2008 for x64  
Microsoft Windows Server 2012 Datacenter  
Microsoft Windows Server 2012 R2 Datacenter  
Microsoft Windows Server 2012 R2 Std  
Microsoft Windows Server 2012 Std  
Microsoft Windows XP Professional  
Windows 7 Enterprise 32 bit  
Windows 7 Enterprise x64  
Windows 7 Home Premium 32 bit  
Windows 7 Home Premium x64  
Windows 7 Professional 32 bit  
Windows 7 Professional x64  
Windows 7 Ultimate 32 bit  
Windows 7 Ultimate x64  
Windows Millennium Edition (Me)  
Windows Vista  
Windows Vista for x64  
64bit Enabled AIX  
64bit Enabled HPUX  
64bit Enabled Solaris  
ABI+ for Intel Architecture  
AIX  
HPUX  
HPUX IPF  
IRIX  
Linux  
Linux for x64  
Linux on Itanium  
OpenVMS Alpha  
OpenVMS on HP Integrity  
Solaris  
Solaris for x64  
Tru64 UNIX 
Type:  Usage Note 
Priority:  
Topic:  SAS Reference ==> Procedures ==> GLIMMIX SAS Reference ==> Procedures ==> NLIN SAS Reference ==> Procedures ==> REG Analytics ==> Regression 
Date Modified:  20160203 16:40:44 
Date Created:  20160120 15:09:19 