FOCUS AREAS


PDF

Generalized Estimating Equations

Introduction

The generalized estimating equations (GEEs) methodology, introduced by Liang and Zeger (1986), enables you to analyze correlated data that otherwise could be modeled as a generalized linear model. GEEs have become an important strategy in the analysis of correlated data. These data sets can arise from longitudinal studies, in which subjects are measured at different points in time, or from clustering, in which measurements are taken on subjects who share a common characteristic, such as belonging to the same litter. SAS/STAT software provides two procedures that enable you to perform GEE analysis: the GENMOD procedure and the GEE procedure. Both procedures implement the standard generalized estimating equation approach for longitudinal data; this approach is appropriate for complete data or when data are missing completely at random (MCAR). When the data are missing at random (MAR), the weighted GEE method, which is implemented in the GEE procedure, produces valid inference. The weighted GEE method is described by Molenberghs and Kenward (2007); Fitzmaurice, Laird, and Ware (2011); Mallinckrodt (2013); O’Kelly and Ratitch (2014).

The GENMOD Procedure

The GENMOD procedure enables you to perform GEE analysis by specifying a REPEATED statement in which you provide clustering information and a working correlation matrix. The generalized linear model estimates are used as the starting values. Both model-based and empirical standard errors of the parameter estimates are produced. Many correlation structures are available, including first-order autoregressive, exchangeable, independent, m-dependent, and unstructured. You can also input your own correlation structures. The GENMOD procedure also provides the following:

  • Type III tests for model effects

  • CONTRAST, LSMEANS, and ESTIMATE statements

  • alternating logistic regression estimation

  • models for ordinal data

The proportional odds model is a popular method of GEE analysis of ordinal data and is based on modeling cumulative logit functions. The GENMOD procedure also models cumulative probits and cumulative complementary log-log functions.

Example

A study of the effects of pollution on children produced the following data. The binary response indicates whether children exhibited symptoms during the period of study at ages 8, 9, 10, and 11. A logistic regression is fit to the data with the explanatory variables age, city of residence, and a passive smoking index. The correlations among the binary outcomes are modeled as exchangeable.

The following statements create the data set Children and fit a GEE model by using the GENMOD procedure.


data children;
  input id city$ @@;
  do i=1 to 4;
    input age smoke symptom @@;
    output;
  end;
  datalines;
1 steelcity   8 0 1  9 0 1  10 0 1  11 0 0
2 steelcity   8 2 1  9 2 1  10 2 1  11 1 0
3 steelcity   8 2 1  9 2 0  10 1 0  11 0 0
4 greenhills  8 0 0  9 1 1  10 1 1  11 0 0
5 steelcity   8 0 0  9 1 0  10 1 0  11 1 0
6 greenhills  8 0 1  9 0 0  10 0 0  11 0 1
7 steelcity   8 1 1  9 1 1  10 0 1  11 0 0
8 greenhills  8 1 0  9 1 0  10 1 0  11 2 0
9 greenhills  8 2 1  9 2 0  10 1 1  11 1 0
10 steelcity  8 0 0  9 0 0  10 0 0  11 1 0
11 steelcity  8 1 1  9 0 0  10 0 0  11 0 1
12 greenhills 8 0 0  9 0 0  10 0 0  11 0 0
13 steelcity  8 2 1  9 2 1  10 1 0  11 0 1
14 greenhills 8 0 1  9 0 1  10 0 0  11 0 0
15 steelcity  8 2 0  9 0 0  10 0 0  11 2 1
16 greenhills 8 1 0  9 1 0  10 0 0  11 1 0
17 greenhills 8 0 0  9 0 1  10 0 1  11 1 1
18 steelcity  8 1 1  9 2 1  10 0 0  11 1 0
19 steelcity  8 2 1  9 1 0  10 0 1  11 0 0
20 greenhills 8 0 0  9 0 1  10 0 1  11 0 0
21 steelcity  8 1 0  9 1 0  10 1 0  11 2 1
22 greenhills 8 0 1  9 0 1  10 0 0  11 0 0
23 steelcity  8 1 1  9 1 0  10 0 1  11 0 0
24 greenhills 8 1 0  9 1 1  10 1 1  11 2 1
25 greenhills 8 0 1  9 0 0  10 0 0  11 0 0
;
run;
proc genmod data=children;
  class id city smoke;
  model  symptom = city age smoke / dist=bin type3;
  repeated  subject=id / type=exch covb corrw;
  contrast 'Smoke=0 vs Smoke=1' smoke 1 -1 0;
run;

The REPEATED statement requests a GEE analysis. The SUBJECT=ID option identifies ID as the clustering variable, and the TYPE=EXCH option specifies an exchangeable correlation structure. The TYPE3 option in the MODEL statement requests Type III statistics for each effect in the model. The CONTRAST statement requests a test that compares the first and second levels of the SMOKE effect.

Output 1: GEE Analysis Results

The GENMOD Procedure
GEE Model Information
Correlation Structure Exchangeable
Subject Effect id (25 levels)
Number of Clusters 25
Correlation Matrix Dimension 4
Maximum Cluster Size 4
Minimum Cluster Size 4

Covariance Matrix (Model-Based)
  Prm1 Prm2 Prm4 Prm5 Prm6
Prm1 3.55587 -0.10887 -0.33581 -0.28538 -0.36770
Prm2 -0.10887 0.23866 0.003817 -0.06498 -0.03167
Prm4 -0.33581 0.003817 0.03567 -0.006532 0.002418
Prm5 -0.28538 -0.06498 -0.006532 0.47909 0.36456
Prm6 -0.36770 -0.03167 0.002418 0.36456 0.49376

Covariance Matrix (Empirical)
  Prm1 Prm2 Prm4 Prm5 Prm6
Prm1 3.83245 -0.54727 -0.36518 0.13082 -0.24490
Prm2 -0.54727 0.28779 0.05330 -0.16329 -0.03665
Prm4 -0.36518 0.05330 0.03752 -0.03672 -0.000823
Prm5 0.13082 -0.16329 -0.03672 0.37495 0.26015
Prm6 -0.24490 -0.03665 -0.000823 0.26015 0.43395

Algorithm converged.

Working Correlation Matrix
  Col1 Col2 Col3 Col4
Row1 1.0000 0.0758 0.0758 0.0758
Row2 0.0758 1.0000 0.0758 0.0758
Row3 0.0758 0.0758 1.0000 0.0758
Row4 0.0758 0.0758 0.0758 1.0000

Exchangeable Working Correlation
Correlation 0.0758342784

GEE Fit Criteria
QIC 135.3185
QICu 134.4346

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Parameter   Estimate Standard
Error
95% Confidence Limits Z Pr > |Z|
Intercept   -4.2569 1.9577 -8.0938 -0.4199 -2.17 0.0297
city greenhil -0.0287 0.5365 -1.0802 1.0227 -0.05 0.9573
city steelcit 0.0000 0.0000 0.0000 0.0000 . .
age   0.3330 0.1937 -0.0467 0.7126 1.72 0.0856
smoke 0 1.6781 0.6123 0.4780 2.8783 2.74 0.0061
smoke 1 1.7418 0.6588 0.4507 3.0330 2.64 0.0082
smoke 2 0.0000 0.0000 0.0000 0.0000 . .

Score Statistics For Type 3 GEE Analysis
Source DF Chi-Square Pr > ChiSq
city 1 0.00 0.9575
age 1 2.77 0.0963
smoke 2 5.28 0.0715

Contrast Results for GEE Analysis
Contrast DF Chi-Square Pr > ChiSq Type
Smoke=0 vs Smoke=1 1 0.01 0.9089 Score


The GEE Procedure

For longitudinal studies, missing data are common, and they can be caused by dropouts or skipped visits. If missing responses depend on previous responses, the usual GEE approach can lead to biased estimates. So the GEE procedure also implements the weighted GEE method to handle missing responses that are caused by dropouts in longitudinal studies (Robins and Rotnitzky 1995; Preisser, Lohman, and Rathouz 2002).

The GEE procedure includes alternating logistic regression (ALR) analysis for binary and ordinal multinomial responses. In ordinary GEEs, the association between pairs of responses are modeled with correlations. The ALR approach provides an alternative by using the log odds ratio to model the association between pairs. For binary responses, the ALR algorithm of Carey, Zeger, and Diggle (1993) is implemented in both the GEE and GENMOD procedures. PROC GEE also implements the ALR algorithm of Heagerty and Zeger (1996), which extends the ALR approach to ordinal multinomial responses. An ordinary GEE with the independent working correlation structure is also available for both nominal and ordinal multinomial data.

Example

This example shows how you can use the GEE procedure to analyze longitudinal data that contain missing values. The data set is taken from a longitudinal study of women who used contraception during one year (Fitzmaurice, Laird, and Ware 2011). In this study, 1,151 women were randomly assigned to one of two treatments: 100 mg or 150 mg of depot medroxyprogesterone acetate (DMPA) at baseline and at three-month intervals. The response variable indicates the women’s amenorrhea status during the four consecutive three-month intervals. The question of interest is whether the treatment has an effect on the rate of amenorrhea over time. The example follows the analysis by Fitzmaurice, Laird, and Ware (2011).

The following statements create the data set Amenorrhea:

data Amenorrhea;
   input ID Dose Time Y@@;
   datalines;
   1     0       1     0
   1     0       2     .
   1     0       3     .
   1     0       4     .

   ... more lines ...

1150     1       4     1
1151     1       1     1
1151     1       2     1
1151     1       3     1
1151     1       4     1
;

The variables in the data are as follows:

  • ID: patient’s ID

  • Y: indicator of amenorrhea status (1 for amenorrhea; 0 otherwise)

  • Time: four consecutive three-month intervals with values 1, 2, 3, and 4

  • Dose: 0 for treatment with 100 mg injection; 1 for treatment with 150 mg injection

To prepare for the analysis, two additional variables are created:

  • Prevy: the patient’s amenorrhea status in the previous three-month interval. For the baseline visit, this is set to an arbitrary nonmissing value (0 here). In the 14.1 release of PROC GEE, this arbitrary value must be nonmissing and valid for the response variable—for example, it should be 0 or 1 for a binary response—but it does not otherwise affect the results.

  • Ctime: a copy of Time, which you can include in the marginal model as a continuous effect and also in the missingness model as a classification effect

The following statements add these two variables to the data set:

data Amenorrhea;
   set Amenorrhea;
   by ID;
   Prevy=lag(Y);
   if first.id then Prevy=0;
   Time=Time-1;
   Ctime=Time;
run;

Suppose $y_{ij}$ denotes the amenorrhea status of woman i at the jth visit, $j=1,\ldots ,4$, and suppose $\mu _{ij}=\mr{P}(y_{ij}=1)$ denotes the average rate of high dosage. To explore whether the treatment has an effect on the rate of amenorrhea over time, consider the following marginal model:

\[ \mr{logit}(\mu _{ij}) = \beta _0 + \beta _1 \mr{time}_{ij} + \beta _2 \mr{time}_{ij}^2+ \beta _3 \mr{dose}_ i + \beta _4 \mr{dose}_ i \times \mr{time} + \beta _5 \mr{dose}_ i \times \mr{time}^2 \]

Of the 1,151 women in this study, 576 are from the low-dose group, and 575 are from the high-dose group. In the low-dose group, 62.67% of the women completed the trial; in the high-dose group, 61.39% of the women completed this trial. Thus, both groups have substantial dropouts.

To obtain the weights for the weighted GEE analysis, consider the following logistic regression model for missingness:

\begin{align*} \mr{logit}p(r_{ij}=1|r_{ij-1}=1,\mr{dose}_{i},\mr{ctime}_{ij},y_{ij-1})=& \alpha _0 + \alpha _1 I(\mr{ctime}_{ij}=1) + \alpha _2 I(\mr{ctime}_{ij}=2) \\ & +\alpha _3 \mr{dose}_ i + \alpha _4 y_{ij-1}+ \alpha _5 \mr{dose}_ i \times y_{ij-1} \end{align*}

The following statements use the observation-specific weighted GEE method and the specified response and missingness models to analyze the data:

ods graphics on;
proc gee data=Amenorrhea desc plots=histogram;
   class ID Ctime;
   missmodel Ctime Prevy Dose Dose*Prevy / type=obslevel;
   model Y = Time Dose Time*Time Dose*Time Dose*Time*Time / dist=bin;
   repeated subject=ID / within=Ctime corr=cs;
run;

The MODEL statement specifies logistic regression and the model effects. The DESCEND option in the PROC GEE statement models the probability that Y = 1.

The REPEATED statement requests GEE analysis. The SUBJECT=ID option specifies that the variable ID indexes the subjects. The CORR=CS option specifies the compound symmetric working correlation structure.

The MISSMODEL statement requests the weighted GEE analysis. It specifies the logistic regression model for missingness. Note that no response variable is needed in weighted GEE analysis to specify a missingness model because the response is completely determined by the response variable in the MODEL statement. Without the MISSMODEL statement, PROC GEE would use the standard GEE approach, the same approach that PROC GENMOD provides. The TYPE=OBSLEVEL option requests observation-specific weights.

Figure 1 shows the parameter estimates for the missingness model. The estimate of $\alpha _4$ is –0.4514 with a p-value of 0.0053, which suggests that the possibility that a participant will drop out is related to her previous amenorrhea status. This suggests that the assumption of MAR is more appropriate than that of MCAR.

Figure 1: Parameter Estimates for the Missingness Model

Parameter Estimates for Missingness Model
Parameter   Estimate Standard
Error
95% Confidence Limits Z Pr > |Z|
Intercept   2.3967 0.1438 2.1149 2.6785 16.67 <.0001
Ctime 0 0.0000 0.0000 0.0000 0.0000 . .
Ctime 1 -0.7286 0.1439 -1.0106 -0.4466 -5.06 <.0001
Ctime 2 -0.5919 0.1469 -0.8798 -0.3040 -4.03 <.0001
Ctime 3 0.0000 0.0000 0.0000 0.0000 . .
Prevy   -0.4514 0.1619 -0.7687 -0.1341 -2.79 0.0053
Dose   0.0680 0.1313 -0.1893 0.3253 0.52 0.6046
Prevy*Dose   -0.2381 0.2196 -0.6685 0.1923 -1.08 0.2782


The classification variable Ctime has two levels whose estimates are equal to 0. One is the reference level Ctime = 3. The first level, Ctime = 0, also has an estimate of 0, because the first visit is always observed and the first level is never used in estimating the weights in the missing model.

Figure 2 displays the results of the weighted GEE analysis.

Figure 2: Parameter Estimates for Amenorrhea Data Analysis Using Weighted GEE

The GEE Procedure
Parameter Estimates for Response Model
with Empirical Standard Error Estimates
Parameter Estimate Standard
Error
95% Confidence Limits Z Pr > |Z|
Intercept -1.4965 0.1072 -1.7067 -1.2863 -13.95 <.0001
Time 0.5379 0.1334 0.2764 0.7994 4.03 <.0001
Dose 0.1061 0.1491 -0.1861 0.3983 0.71 0.4767
Time*Time -0.0037 0.0405 -0.0831 0.0757 -0.09 0.9275
Dose*Time 0.4092 0.1903 0.0362 0.7823 2.15 0.0315
Dose*Time*Time -0.1264 0.0577 -0.2395 -0.0134 -2.19 0.0284


The estimate of $\beta _4$ (the parameter estimate for the Dose*Time interaction) is 0.4092, which indicates that the change of amenorrhea rate over time depends on the dose of DMPA. Specifically, for women in the low-dose group the amenorrhea rates $\mu _{ij}$ at the four consecutive time intervals are 0.1830, 0.2764, 0.3928, and 0.5210, and for women in the high-dose group the amenorrhea rates are 0.1997, 0.3609, 0.4963, and 0.5701. In other words, the amenorrhea rate increases over time for both treatments, and the rates of increase are slightly different.

You can request subject-level weights by specifying the TYPE=SUBLEVEL option. The results (not shown here) from the subject-level weighted method are similar to the results from the observation-level weighted method. Both weighted GEE methods provide unbiased regression parameter estimates if the missingness model is specified correctly. Preisser, Lohman, and Rathouz (2002) note that the observation-level weighted GEE method produces more efficient estimates than the cluster-level weighted GEE method produces for incomplete longitudinal binary data.

Large weights can have impacts on the parameter estimates. Consequently, it is recommended that you check the distribution of the estimated weights. If there are large weights, you might consider trimming them by specifying the MAXWEIGHT= option in the MISSMODEL statement. Figure 3 shows that the estimated weights in this example range between 1 and 2.1, so no trimming is needed.

Figure 3: Histogram of Estimated Weights

Histogram of Estimated Weights


References

  • Carey, V., Zeger, S. L., and Diggle, P. J. (1993). “Modelling Multivariate Binary Data with Alternating Logistic Regressions.” Biometrika 80:517–526.

  • Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal Analysis. Hoboken, NJ: John Wiley & Sons.

  • Heagerty, P., and Zeger, S. L. (1996). “Marginal Regression Models for Clustered Ordinal Measurements.” Journal of the American Statistical Association 91:1024–1036.

  • Liang, K.-Y., and Zeger, S. L. (1986). “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73:13–22.

  • Mallinckrodt, C. (2013). Preventing and Treating Missing Data in Longitudinal Clinical Trials: A Practical Guide. Cambridge: Cambridge University Press.

  • Molenberghs, G., and Kenward, M. G. (2007). Missing Data in Clinical Studies. New York: John Wiley & Sons.

  • O’Kelly, M., and Ratitch, B. (2014). Clinical Trials with Missing Data: A Guide for Practitioners. Chichester, UK: John Wiley & Sons.

  • Preisser, J. S., Lohman, K. K., and Rathouz, P. J. (2002). “Performance of Weighted Estimating Equations for Longitudinal Binary Data with Drop-Outs Missing at Random.” Statistics in Medicine 21:3035–3054.

  • Robins, J. M., and Rotnitzky, A. (1995). “Semiparametric Efficiency in Multivariate Regression Models with Missing Data.” Journal of the American Statistical Association 90:122–129.