When modeling response data consisting of proportions (or percentages), the observed values can be continuous or represent a summarized (or aggregated) binary response. For example, an observed proportion of 0.3 might represent 3 out of 10 subjects responding positively at a particular dose of a drug. At the subject level, the response is binary (positive or negative). If your data are aggregated binary data and you have the numerator and denominator counts making up the proportions, then you can fit a logistic model in procedures such as LOGISTIC, PROBIT, GENMOD, GAM, ADAPTIVEREG and others by using the events/trials syntax in the MODEL statement. These models assume the proportions represent a set of independent Bernoulli trials and have a binomial distribution.
However, if the values are proportions of area covered or affected by some agent, or are proportions of a mixture, then they do not represent a set of trials and might not have a binomial distribution. Modeling approaches for such data include assuming the proportions are from normal or beta distributions as illustrated in this note. An approach that does not require specification of a distribution for the proportions is the fractional logistic model that uses a quasilikelihood function for estimation. Also commonly used are 4 and 5parameter logistic models that have a particular nonlinear form and directly estimate the ED50, the value that produces half of the predictor's effect. These models can be fit assuming a particular distribution or can also use the quasilikelihood approach. They also allow for the response proportion to be restricted to a portion of the [0,1] range. For example, there might be a natural, nonzero response rate that occurs without exposure to the agent. This allows for a lower asymptote that is greater than zero. Similarly, it might not be possible for the maximum response proportion to reach one, requiring an upper asymptote less than one.
The data below result from an experiment on the effects of a caustic chemical. The measured response is the proportion of tissue damage that occurs at a particular site. Since damage could occur due to other causes, a nonzero natural response is expected. Also, at high enough concentrations, the chemical is likely to cause substantial, but not complete, damage. So, the maximum response is expected to be less than one.
The following statements create the data set with tissue damage proportions, Y, at various concentrations of the chemical, CONC. Note that the concentrations are approximately equally spaced on the common log scale.
data damage; input conc y; lconc=log10(conc); datalines; 0.1 .08 0.25 .09 0.5 .11 1.0 .20 2 .30 4 .53 5.0 .63 8.0 .71 10.0 .73 25.0 .84 50.0 .85 100.0 .86 ;
Fractional logistic model
The distribution of these proportions is not known, but if you assume that their variance behaves proportional to the variance of binomial data, then estimation can be done using a quasilikelihood function. The quasilikelihood behaves much like a log likelihood function when estimating a generalized linear or nonlinear model. For continuous proportion responses, a quasilikelihood resembling the binomial log likelihood is:
1 φ log[μ^{y}(1μ)^{1y}] ,
where the scale parameter, φ, is included allowing the variance to be proportional to the binomial:
Var(y) = φμ(1μ)
The scale parameter can be estimated using the Pearson chisquare statistic.
Estimation of model parameters using this quasilikelihood can be done using the GLIMMIX and NLMIXED procedures. The fractional logistic model is a linear logistic model and is most easily fit in PROC GLIMMIX. Since proportions are bounded between 0 and 1, it is natural to use the logit link function. To estimate the scale parameter, the random _residual_; statement is specified. The SOLUTION option displays a table of parameter estimates. The DDFM=NONE and CHISQ options are used to produce largesample Wald tests for the parameters and fixed effects. Predicted proportions and confidence limits are saved to a data set by the OUTPUT statement.
proc glimmix data=damage; model y = lconc / dist=binomial link=logit solution ddfm=none chisq; random _residual_; output out=fracout pred(ilink)=pred lcl(ilink)=lower ucl(ilink)=upper; run;
Following are the estimated parameters of the model and Type 3 tests.

The same model can be fit in PROC LOGISTIC by creating two new variables and two observations for each observation in the original data. The analysis is then done by weighted logistic regression as described by Liu & Xin (2014). In one observation, a new binary response variable is assigned the event level and a new weight variable is set equal to the observed proportion. In the second observation, the new binary response is assigned the nonevent level and the weight variable is set equal to one minus the observed proportion. The following DATA step makes these modifications to the tissue damage data. The Y2 variable is the new binary response variable and WT is the new weight variable. Also added is the OBS variable that contains the number of the original observation that has the same value in each pair of new observations.
data damage2; set damage; obs=_n_; y2=1; wt=y; output; y2=0; wt=1y; output; run;
Using these new data, the following PROC LOGISTIC step produces the same results as the PROC GLIMMIX step above using weighted logistic regression. Note that the SCALE=P and AGGREGATE=(OBS) options are specified in order to apply a scale parameter estimated by the Pearson statistic as done in the GLIMMIX analysis. The OUTPUT statement provides the same predicted values and confidence limits as from the GLIMMIX analysis. Note that only one set of the new observations is needed and is saved by the OUTPUT statement.
proc logistic data=damage2; model y2(event="1") = lconc / scale=p aggregate=(obs); weight wt; output out=fracout(where=(y2=1)) p=pred u=upper l=lower; run;
These statements plot the fitted fractional logistic model and 95% confidence bands for the mean response.
proc sgplot data=fracout noautolegend; band upper=upper lower=lower x=conc; scatter y=y x=conc; series y=pred x=conc; xaxis type=log label="Concentration"; yaxis label="Proportion" values=(0 to 1 by .2); title "Fractional Logit Model"; run;
Another example of the fractional logistic model is Quasilikelihood Estimation for Proportions with Unknown Distribution in the Examples section of the GLIMMIX documentation.
4 and 5parameter logistic models
The 4parameter logistic model is also known as the E_{max} model. It is a nonlinear model of the form
y = β_{U} + β_{L}β_{U} 1+(conc/β_{50})^{βS} ,
where β_{L} and β_{U} are the lower and upper asymptotes, β_{50} is the ED50 (the x value producing half of the effect from lower to upper asymptote — not necessarily a 50% response unless the asymptotes are 0 and 1), and β_{S} is the slope.
The 4parameter model is symmetric in its tails. A model that allows for asymmetry is the 5parameter model of the form
y = β_{U} + β_{L}β_{U} [1+(conc/β_{50})^{βS}]^{βA} ,
where β_{A} is the asymmetry parameter.
Since these models are nonlinear in their parameters, PROC NLMIXED is used. Model estimation is done by specifying the quasilikelihood function and using it in the GENERAL distribution option in the MODEL statement. The PARMS statement below provides initial estimates for the parameters. For these data it is sufficient to provide starting estimates for the upper and lower asymptotes. The other parameters are started at 1 by default. For some data and models you might need to specify good starting values for all model parameters. The 4parameter model for the mean, MU, is specified in the next line. The quasilikelihood function, QL, is defined in the next two lines and specified in the MODEL statement. The PREDICT statement produces predicted mean values, ^ μ_{i} , for the analyzed data. The ODS OUTPUT statement saves the parameter estimates of the model in a data set.
proc nlmixed data=damage; parms bu=.9 bl=.05; mu = bu + (blbu)/(1+(conc/b50)**bs); l = mu**y * (1mu)**(1y); ql = log(l); model y ~ general(ql); predict mu out=out1; ods output parameterestimates=pe; run;
Following are estimated parameters of the initial 4parameter logistic model.

However, the above analysis does not estimate a scale parameter. The following statements estimate φ by using the predicted values from the initial model and the observed values to compute the Pearson chisquare statistic divided by its degrees of freedom. The degrees of freedom are np where p is the number of model parameters. The first SQL step below saves the number of model parameters in macro variable NPARM. The second SQL step computes the Pearson statistic:
φ = 1 np Σ_{i} (y_{i}  ^ μ_{i} )^{2} ^ μ_{i} (1  ^ μ_{i} )
The estimated value of φ is displayed and saved in macro variable PHI.
proc sql noprint; select count(Parameter) into :nparm from pe; quit; title "Scale parameter"; title2 "Pearson Chisquare / DF"; proc sql; select sum(phi_i) format=best12. as Scale into :phi from (select (ypred)**2/(pred*(1pred)*(count(y)&nparm)) as phi_i from out1); quit; title;

To use the estimated scale parameter, the model is refit and 1/φ is included as a multiplier in the quasilikelihood function. Predicted mean values and 95% confidence limits are again saved using the PREDICT statement.
proc nlmixed data=damage; parms bu=.9 bl=.05; mu = bu + (blbu)/(1+(conc/b50)**bs); l = mu**y * (1mu)**(1y); ql = (1/&phi)*log(l); model y ~ general(ql); predict mu out=FPLout; run;
Following is the table of parameter estimates and standard errors for the final 4parameter logistic model.

These statements plot the fitted 4parameter logit model and 95% confidence bands for the mean response.
proc sgplot data=fplout noautolegend; band upper=upper lower=lower x=conc; scatter y=y x=conc; series y=pred x=conc; xaxis type=log label="Concentration"; yaxis label="Proportion" values=(0 to 1 by .2); title "4parameter Logit Model"; run;
Comparison of models
The following statements plot the fractional and 4parameter models fit in the GLIMMIX and NLMIXED steps above.
data twomodels; set fplout (in=fpl) fracout (in=frac); length Model $10; if fpl then Model="4PLogit"; else if frac then Model="Fractional"; run; proc sgplot data=twomodels; scatter y=y x=conc; series y=pred x=conc / group=Model; xaxis type=log label="Concentration"; yaxis label="Proportion"; title "4Parameter and Fractional Models"; run;
Notice that the 4parameter model fits much better in the tails due to its ability to use asymptotes less than 1 and greater than 0.
Using the predicted values saved from fitting the normal, OLS model and beta model discussed in this note, the following statements can be used to produce a plot showing the fits of all four models. Note that the beta and fractional logistic models are nearly identical in their fit to these data.
data fourmodels; set fplout (in=fpl) gmxout (in=beta rename=(gpredy=pred)) fracout (in=frac) nlinout (in=norm rename=(npredy=pred)); length Model $10; if norm then Model="Normal"; else if fpl then Model="4PLogit"; else if beta then Model="Beta"; else if frac then Model="Fractional"; run; proc sgplot data=fourmodels; scatter y=y x=conc; series y=pred x=conc / group=Model; xaxis type=log label="Concentration"; yaxis label="Proportion"; title "Four Models for Continuous Proportions"; run;
The following panel of plots shows the fits of the four models along with 95% confidence bands for the mean values. Also shown in each plot is the estimated concentration producing a 50% response (y=0.5) under each model. The normal, beta, and fractional models are all linear in the parameters, and the response function is the logit. They all have the form:
log( y 1y) = β_{0} + β_{1}log_{10}(conc)
For y=0.5, the logit is log(1)=0. Solving for concentration, the estimate of the ED50 is 10^{β0/β1}. A confidence interval for the ED50 could be obtained using the methods discussed in this note. For the 4parameter model, the ED50 parameter that is directly estimated by PROC NLMIXED is the estimated concentration that yields 50% of the effect from the lower to upper asymptote. For these data, that means that the ED50 estimate (3.25) yields a (0.8625+0.0769)/2 = 0.47 tissue damage proportion. Using the form of the 4parameter model, the estimator for the concentration yielding a y=0.5 proportion is
β_{50}( β_{L}β_{U} 0.5β_{U} 1)^{1/βS}
From the fitted model, the estimated concentration yielding 50% tissue damage is 3.58.
__________
References
McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models 2d ed., London: Chapman & Hall/CRC.
Papke, L.E. and Wooldridge, J.M. (1996), "Econometric methods for fractional response variables with an application to 401(k) plan participation rates," J. Applied Econometrics, 11, 619632.
Liu, W. and Xin, J. (2014), "Modeling fractional outcomes with SAS," Proceedings of the SAS Global Forum 2014 Conference, Cary, NC: SAS Institute Inc.
Product Family  Product  System  SAS Release  
Reported  Fixed*  
SAS System  SAS/STAT  z/OS  
z/OS 64bit  
OpenVMS VAX  
Microsoft® Windows® for 64Bit Itaniumbased Systems  
Microsoft Windows Server 2003 Datacenter 64bit Edition  
Microsoft Windows Server 2003 Enterprise 64bit Edition  
Microsoft Windows XP 64bit Edition  
Microsoft® Windows® for x64  
OS/2  
Microsoft Windows 8 Enterprise 32bit  
Microsoft Windows 8 Enterprise x64  
Microsoft Windows 8 Pro 32bit  
Microsoft Windows 8 Pro x64  
Microsoft Windows 8.1 Enterprise 32bit  
Microsoft Windows 8.1 Enterprise x64  
Microsoft Windows 8.1 Pro  
Microsoft Windows 8.1 Pro 32bit  
Microsoft Windows 95/98  
Microsoft Windows 2000 Advanced Server  
Microsoft Windows 2000 Datacenter Server  
Microsoft Windows 2000 Server  
Microsoft Windows 2000 Professional  
Microsoft Windows NT Workstation  
Microsoft Windows Server 2003 Datacenter Edition  
Microsoft Windows Server 2003 Enterprise Edition  
Microsoft Windows Server 2003 Standard Edition  
Microsoft Windows Server 2003 for x64  
Microsoft Windows Server 2008  
Microsoft Windows Server 2008 R2  
Microsoft Windows Server 2008 for x64  
Microsoft Windows Server 2012 Datacenter  
Microsoft Windows Server 2012 R2 Datacenter  
Microsoft Windows Server 2012 R2 Std  
Microsoft Windows Server 2012 Std  
Microsoft Windows XP Professional  
Windows 7 Enterprise 32 bit  
Windows 7 Enterprise x64  
Windows 7 Home Premium 32 bit  
Windows 7 Home Premium x64  
Windows 7 Professional 32 bit  
Windows 7 Professional x64  
Windows 7 Ultimate 32 bit  
Windows 7 Ultimate x64  
Windows Millennium Edition (Me)  
Windows Vista  
Windows Vista for x64  
64bit Enabled AIX  
64bit Enabled HPUX  
64bit Enabled Solaris  
ABI+ for Intel Architecture  
AIX  
HPUX  
HPUX IPF  
IRIX  
Linux  
Linux for x64  
Linux on Itanium  
OpenVMS Alpha  
OpenVMS on HP Integrity  
Solaris  
Solaris for x64  
Tru64 UNIX 
Type:  Usage Note 
Priority:  
Topic:  SAS Reference ==> Procedures ==> GLIMMIX SAS Reference ==> Procedures ==> NLMIXED Analytics ==> Regression 
Date Modified:  20210910 11:49:17 
Date Created:  20151102 15:47:12 