Usage Note 67880: Generalized R-square for GEE and other models

Zheng (2000) proposed a marginal $R^2$ statistic, $R^2_{marg}$, that is applicable to Generalized Estimating Equations (GEE) models. It generalizes the R-square statistic used in ordinary least squares (OLS) regression models and is identical to the ordinary $R^2$ in those models. Ballinger (2004) also discusses this statistic for assessing and comparing the fit of GEE models. For a non-GEE model for a binary response, this statistic is available with the GOF option in the MODEL statement in PROC LOGISTIC and is labeled as Efron's R-square in the Model Fit Statistics table. See "Model Fitting Information" in the Details section of the LOGISTIC procedure documentation.
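As a rough sketch of the PROC LOGISTIC route just mentioned, the statements below request the goodness-of-fit statistics for an ordinary (non-GEE) logistic model. The data set and variable names are borrowed from the example later in this note. Fitting them this way treats the repeated visits as independent observations, which is an assumption made only for illustration.

```sas
/* Sketch only: ordinary logistic model, ignoring the clustering by subject. */
/* The GOF option adds goodness-of-fit statistics to the output.            */
proc logistic data=resp2;
   class center trt sex / param=ref;
   model dichot(event="1") = di_base trt age center sex / gof;
run;
```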

Note that for generalized linear models that do not involve clustering (repeated, correlated measures) and for some other models, an $R^2$ statistic based on the variance function of the response distribution is available. See the description of the RsquareV macro. Zheng's generalized R-square statistic can also be used for generalized linear and other models by using the same computations that are illustrated in this note.
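The following is a minimal sketch of that use for a model without clustering: a Poisson generalized linear model fit to a small, hypothetical data set (COUNTS, with made-up values of X and Y that are not from any source). The statistic is computed as 1 minus the ratio of the sum of squared raw residuals to the corrected sum of squares of the response.

```sas
/* Hypothetical toy data: Y is a count response, X is a numeric predictor */
data counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15
;
run;

/* Generalized linear model with no REPEATED statement (no clustering) */
proc genmod data=counts;
   model y = x / dist=poisson link=log;
   output out=glmout resraw=res;
run;

/* 1 - SS(raw residuals)/SS(deviations of Y from its overall mean) */
proc sql;
   select 1 - uss(res)/css(y) as R2 from glmout;
quit;
```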

$R^2_{marg}$ is computed as

$$R^2_{marg} = 1 - \frac{SS(\text{residuals})}{SS(\text{total})} ,$$

where SS(residuals) is the sum of squares of the raw residuals, $y_{it} - \hat{y}_{it}$, i is the cluster (or subject) index, and t is the time (or repeated measure) index. SS(total) is the sum of squared deviations from the overall mean.

Zheng says that $R^2_{marg}$ "shares the same interpretation and intuitive appeal as $R^2$ and reduces to $R^2$ for T = 1. It equals 1, its upper bound, when there is perfect prediction. It equals 0 when there is no association between the response and the predictors. It assumes a negative value when the variation is greater under the model of interest than under the null model, indicating poor prediction." T is the number of time points (repeated measures). She notes that as the sample size increases, $R^2_{marg}$ approaches a value bounded by (0,1).
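These three cases can be checked with a small sketch. The data set TOY and its columns are hypothetical: Y is a binary response, PERFECT reproduces Y exactly, NULLFIT predicts the overall mean (0.5) for every observation, and REVERSED predicts the opposite of every response.

```sas
/* Hypothetical toy data: one response and three sets of predicted values */
data toy;
   input y perfect nullfit reversed;
   datalines;
0 0 0.5 1
1 1 0.5 0
1 1 0.5 0
0 0 0.5 1
;
run;

/* 1 - SS(residuals)/SS(total) for each set of predictions */
proc sql;
   select 1 - uss(y - perfect)/css(y)  as R2_perfect,
          1 - uss(y - nullfit)/css(y)  as R2_null,
          1 - uss(y - reversed)/css(y) as R2_reversed
      from toy;
quit;
```

Here SS(total) is 1, so the three columns are 1, 0, and -3: perfect prediction, no improvement over the mean, and prediction that is worse than the mean.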

Concerning the fact that the GEE covariance matrix is not involved in the computation of $R^2_{marg}$, Zheng says, "In our opinion, goodness of fit is concerned with the agreement between the response and the prediction. The covariance matrix is only relevant to the point that it affects the fitted value through the parameter estimates, but is not of interest by itself. A measure that incorporates the covariance matrix may assume a low value because of poor fit and/or inappropriate correlation structure, and therefore confounds goodness of fit with correlation modelling. Such a measure does not serve the unique purpose of summarizing the goodness of fit and therefore is not considered."

Example

Consider the respiratory data analyzed in Stokes et al. (2012). As described there, eligible patients in each of two centers were randomly assigned to active treatment or placebo. Respiratory status was determined at baseline and at four visits during treatment and was recorded on a five-point scale from 0 (terrible) to 4 (excellent). Potential explanatory variables in addition to treatment were center, sex, baseline respiratory status, and age (in years) at the time of study entry. For this example, a dichotomized outcome is analyzed that indicates whether the patient experienced a good or excellent response (dichot=1) or a worse response (dichot=0). Note that the response must be numeric in order to compute $R^2_{marg}$.

The following statements fit a logistic GEE model with the dichotomized baseline response, treatment, age, center, sex, and visit as predictors of the dichotomized response at the later visits.

     /* Logistic GEE model; OUT holds the raw residuals (RES) and predicted means (PRED) */
     proc genmod data=resp2;
        class id center trt sex visit;
        model dichot(event="1") = di_base trt age center sex visit / link=logit dist=bin;
        repeated subject=id*center / type=un;
        output out=out resraw=res pred=pred;
        run;

These statements compute $R^2_{marg}$. The USS and CSS functions compute the uncorrected and corrected sums of squares of their argument variables, respectively.

     /* uss(res) = SS(residuals); css(dichot) = SS(total) about the overall mean */
     proc sql;
        select 1-(uss(res)/css(dichot)) as R2marg from out;
        quit;

The resulting value of $R^2_{marg}$ is 0.25.

The above computation of $R^2_{marg}$ uses the simple, overall average of all responses to compute SS(total). However, $R^2$ can be viewed as a comparison of the fitted model to a reference model. The reference model is generally taken to be the model containing only an intercept (the null model). For OLS models, the overall average response is the estimate of the mean under the null model. This is not necessarily the case in GEE models, though it can be the case when the independence correlation structure is used. The null GEE model is easily fit, and its mean estimate can then be used in the computation.

The following statements illustrate this for the above model. The same model as above is fit first in PROC GENMOD. The output data set from this model, OUT, is then used as input to the second GENMOD step, which fits the null GEE model. Note that the null model has no predictors and therefore contains only an intercept. The output data set from this step, OUT2, contains both the raw residuals, RES, from the model of interest and the predicted mean, NULLMEAN, from the null GEE model. Note that the value of NULLMEAN is the same for all observations. The expression uss(dichot-nullmean) computes SS(total) using the null mean rather than the simple average as above.

     /* Model of interest: save the raw residuals (RES) in OUT */
     proc genmod data=resp2;
        class id center trt sex visit;
        model dichot(event="1") = di_base trt age center sex visit / link=logit dist=bin;
        repeated subject=id*center / type=un;
        output out=out resraw=res;
        run;
     /* Null (intercept-only) GEE model: add its predicted mean (NULLMEAN) in OUT2 */
     proc genmod data=out;
        class id center;
        model dichot(event="1") =  / link=logit dist=bin;
        repeated subject=id*center / type=un;
        output out=out2 pred=nullmean;
        run;
     /* SS(total) is now computed about the null-model mean */
     proc sql;
        select 1-(uss(res)/uss(dichot-nullmean)) as R2marg from out2;
        quit;

The resulting value of $R^2_{marg}$ is still 0.25 since the difference between the overall average response, 0.55855, and the mean estimate from the null GEE model, 0.56174, is quite small. If the independence correlation structure, TYPE=IND, were used rather than the unstructured correlation matrix, then the mean estimate from the null GEE would be the same as the overall average response and the two values of $R^2_{marg}$ would be identical.
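One way to check this is to rerun both steps with TYPE=IND and compute the statistic both ways in a single query. The following is a sketch under that assumption. The output data set names OUT_IND and OUT2_IND and the column names R2marg_avg and R2marg_null are introduced here for illustration only.

```sas
/* Model of interest with the independence working correlation structure */
proc genmod data=resp2;
   class id center trt sex visit;
   model dichot(event="1") = di_base trt age center sex visit / link=logit dist=bin;
   repeated subject=id*center / type=ind;
   output out=out_ind resraw=res;
run;

/* Null (intercept-only) GEE model, also with TYPE=IND */
proc genmod data=out_ind;
   class id center;
   model dichot(event="1") = / link=logit dist=bin;
   repeated subject=id*center / type=ind;
   output out=out2_ind pred=nullmean;
run;

/* SS(total) about the overall average versus about the null-model mean */
proc sql;
   select 1 - uss(res)/css(dichot)          as R2marg_avg,
          1 - uss(res)/uss(dichot-nullmean) as R2marg_null
      from out2_ind;
quit;
```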

Partial $R^2$

A partial $R^2$ can be obtained using the above concept of a reference model, but with a reference model that is a nested simplification of the model of interest rather than the null model. $R^2_{marg}$ computed with a submodel as the reference model is equivalent to the partial $R^2$ for an OLS model as can be obtained with the PCORR2 option in PROC REG. The partial $R^2$ is the proportion of the variation left unexplained by the reference model that is accounted for by the full model.
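Written out, the partial statistic described above takes the following form. The superscripts "full" and "ref" and the shorthand $SS_{full}$ and $SS_{ref}$ are notation assumed here for illustration and do not come from Zheng (2000).

```latex
R^2_{partial}
  = 1 - \frac{\sum_{i}\sum_{t} \left( y_{it} - \hat{y}^{\,full}_{it} \right)^2}
             {\sum_{i}\sum_{t} \left( y_{it} - \hat{y}^{\,ref}_{it} \right)^2}
  = \frac{SS_{ref} - SS_{full}}{SS_{ref}}
```

Here $SS_{full}$ and $SS_{ref}$ are the sums of squared raw residuals from the full and reference models. When the reference model is the null model, every $\hat{y}^{\,ref}_{it}$ equals the null-model mean, and the expression reduces to the $R^2_{marg}$ computed with the null GEE model as the reference.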

The following statements assess the effect of adding age, sex, and visit to the model. The model containing only the dichotomized baseline response, treatment, and center serves as the reference model.

     /* Full model: save the raw residuals (RES) in OUT */
     proc genmod data=resp2;
        class id center trt sex visit;
        model dichot(event="1") = di_base trt age center sex visit / link=logit dist=bin;
        repeated subject=id*center / type=un;
        output out=out resraw=res;
        run;
     /* Reference (reduced) model: add its predicted means (REFMEAN) in OUT2 */
     proc genmod data=out;
        class id center trt;
        model dichot(event="1") = di_base trt center / link=logit dist=bin;
        repeated subject=id*center / type=un;
        output out=out2 pred=refmean;
        run;
     /* 1 - SS(full-model residuals)/SS(reference-model residuals) */
     proc sql;
        select 1-(uss(res)/uss(dichot-refmean)) as R2marg from out2;
        quit;

The resulting partial $R^2_{marg}$ is 0.017. This small value suggests that the addition of age, sex, and visit explains only a small proportion of the variability that is left unexplained by the reference model.

References

Ballinger, G. A. 2004. “Using Generalized Estimating Equations for Longitudinal Data Analysis.” Organizational Research Methods. 7(2): 127–150.

Stokes, M. E., C. S. Davis, and G. G. Koch. 2012. Categorical Data Analysis Using SAS. 3rd ed. Cary, NC: SAS Institute Inc.

Zheng, B. 2000. “Summarizing the Goodness of Fit of Generalized Linear Models for Longitudinal Data.” Statistics in Medicine. 19(10): 1265–1275.


