The GEE Procedure (Experimental)

Weighted Generalized Estimating Equations under the MAR Assumption

In longitudinal studies, response measurements are often missing because of skipped visits or dropouts. Suppose $r_{ij}$ is the indicator that the response $y_{ij}$ is observed, where $r_{ij}=1$ if $y_{ij}$ is observed and 0 otherwise. Missing data patterns can be classified into two types: dropout and intermittent. A dropout occurs if an individual skips a particular visit and then never comes back for subsequent visits. That is, if $r_{ij}=0$, then $r_{ik}=0$ for all $k>j$. Otherwise, the missing data pattern is intermittent. Intermittent patterns can be quite complicated; only dropout patterns are considered here.

The mechanism for missingness can be described by a statistical model for the probability of observing a missing value, and making the right assumption about the mechanism is crucial to methods that handle missing data. Missingness mechanisms are classified into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin, 1976).

Assumptions about longitudinal data that include missing responses caused by dropouts are classified as follows:

  • The data are said to be MCAR if the probability of a missing response is independent of its past, current, and future responses conditional on the covariates. That is, $P(r_{ij}=0|\mb{Y}_{i},\mb{X}_{i})= P(r_{ij}=0|\mb{X}_{i})$.

  • The data are said to be MAR if the probability of a missing response is independent of its current and future responses conditional on the observed past responses and the covariates. That is, $P(r_{ij}=0|r_{ij-1}=1, X_ i,Y_ i)=P(r_{ij}=0|r_{ij-1}=1, X_ i,y_{i1}, \ldots , y_{ij-1})$. MAR is a weaker assumption than MCAR.

  • The data are said to be MNAR if the probability of a missing response depends on the unobserved responses. MNAR is the most general and the most problematic missing-data scenario.

The GEE procedure implements two different weighted methods (observation-specific and subject-specific) for estimating the regression parameter $\bbeta $ when dropouts occur. Both provide consistent estimates if the data are MAR.

Observation-Specific Weighted GEE Method

Suppose $w_{ij}$ is the weight for $y_{ij}$, which is defined as the inverse probability of observing $y_{ij}$. In other words, $w_{ij}=P(r_{ij}=1| X_ i,Y_ i)^{-1}$. Suppose $W_ i$ is a $T \times T$ diagonal matrix whose jth diagonal is $r_{ij}w_{ij}$. The responses for the ith subject are $\mb{Y}_{i}=(y_{i1}, y_{i2}, \ldots , y_{i T})^{\prime }$. Consider the following weighted generalized estimating equations (Robins and Rotnitzky, 1995; Preisser, Lohman, and Rathouz, 2002):

\[  \mb{S}_{ow}(\bbeta ) = \sum _{i=1}^ K \frac{\partial \bmu _{i}^{\prime }}{\partial \bbeta } \mb{V}_{i}^{-1} W_ i (\mb{Y}_{i}-\bmu _{i}(\bbeta )) = \mb{0}  \]

Unlike the standard generalized estimating equations, the weighted generalized estimating equations are unbiased when the observations are appropriately weighted and lead to consistent estimates of $\bbeta $.

The weights $w_{ij}$ are often unknown in practice and are estimated by a logistic regression model under the MAR assumption. Specifically, say that $\lambda _{ij}=P(r_{ij}=1|r_{ij-1}=1, X_ i,Y_ i)$ denotes the probability of observing the response $y_{ij}$ given its observed previous responses.

Under the MAR assumption,

\[  \lambda _{ij}=P(r_{ij}=1|r_{ij-1}=1, X_ i,Y_ i)=P(r_{ij}=1|r_{ij-1}=1, X_ i,Y_1, \ldots , Y_{j-1})  \]

Using the observed data, $\lambda _{ij}$ can be predicted from a logistic regression model,

\[  \mr{logit} \{ \lambda _{ij}\} = z_{ij}\balpha  \]

where the $z_{ij}$ are predictors that usually include the covariates $x_{ij}$, the past responses, and the indicators for visit times. The dropout process implies that the estimated probability of observing $y_{ij}$ can be expressed as a cumulative product of conditional probabilities:

\[  \hat P(r_{ij}=1| X_ i,Y_ i)= \lambda _{i1}(\hat\balpha ) \times \lambda _{i2}(\hat\balpha ) \times \cdots \times \lambda _{ij}(\hat\balpha )  \]

With the estimated weights $\hat w_{ij}= \hat P(r_{ij}=1| X_ i,Y_ i)^{-1}$, the regression parameter $\bbeta $ is estimated by solving the equation for $\mb{S}_{ow}(\bbeta )$.

The regression parameter $\bbeta $ can be estimated by solving for $\mb{S}_{ow}(\bbeta )$ after plugging in the estimated weights. The fitting algorithm is described in the section Fitting Algorithm for Weighted GEE.

Subject-Specific Weighted GEE Method

Unlike the observation-specific weighted method, which assigns an observation-specific weight to each observation, the subject-specific weighted method assigns a single weight to each subject. In other words, all the observations from a subject receive the same weight. Specifically, the subject-specific weighted method obtains the regression parameter estimates by solving the equations

\[  \mb{S}_{sw}(\bbeta ) = \sum _{i=1}^ K \mb{D}^\prime _ i \mb{V}_{i}^{-1} w_ i (\mb{Y}_{i}-\bmu _{i}(\bbeta )) = \mb{0}  \]

where the responses for the ith subject are $\mb{Y}_{i}=(y_{i1}, y_{i2}, \ldots , y_{i n_ i})^{\prime }$ and the weight $w_ i$ for subject i is the inverse probability of a subject i dropping out at the observed time (Fitzmaurice, Molenberghs, and Lipsitz, 1995; Preisser, Lohman, and Rathouz, 2002). Note that the weight $w_ i$ is a scalar, in contrast to the weight matrix $\mb{W}_ i$ that the observation-specific weighted GEE method uses.

The subject-specific weighted estimating equations are also unbiased when the subjects are appropriately weighted and lead to consistent estimates of the regression parameters $\bbeta $.

The weight $w_ i$ is usually unknown in practice and needs to be estimated. Suppose subject i drops out at time $m_ i=\sum _{j=1}^ T r_{ij}+1$. Assume that the first visit $y_{i1}$ is always observed with $r_{i1}=1$. Thus, the dropout times $m_ i$ range from 2 to T+1. Note that a dropout time of T+1 indicates that subject i completes all the $T$ visits and dropout does not occur.

The weight $w_ i$ is defined as follows: if subject i drops out before completing the last visit (that is, $m_ i \le T $), then $w_{i} = P(r_{im_ i}=0, r_{im_ i-1}=1| X_ i,Y_ i)^{-1}$; otherwise, the subject completes all the $T$ visits (that is, $m_ i=T+1$), and $w_{i} = P(r_{iT}=1| X_ i,Y_ i)^{-1}$.

Similar to the process for the observation-specific weighted method, the dropout process for the subject-specific weighted method implies that subject-specific weights can be estimated as a cumulative product of conditional probabilities:

\begin{align*}  \hat w_{i} & = P(r_{im_ i}=0, r_{im_ i-1}=1| X_ i,Y_ i)^{-1}=[\lambda _{i1}(\hat\balpha ) \times \cdots \times \lambda _{im_ i-1}(\hat\balpha ) \times (1-\lambda _{im_ i}(\hat\balpha ))]^{-1},\,  \mr{if}\,  m_ i \le T \\ \hat w_{i} & = P(r_{im_ i-1}=1| X_ i,Y_ i)^{-1} = [\lambda _{i1}(\hat\balpha ) \times \lambda _{i2}(\hat\balpha ) \times \cdots \times \lambda _{im_ i-1}(\hat\balpha )]^{-1}, \,  \mr{if}\,  m_ i=T+1 \end{align*}

Thus, the subject-specific weights $\hat w_{i}$ can be obtained after $\lambda _{ij}$ is estimated by fitting a logistic regression to the data $(r_{ij}, z_{ij})$.

The regression parameter $\bbeta $ from the subject-specific weighted GEE method can be estimated by solving for $\mb{S}_{sw}(\bbeta )$ after plugging in the estimated weights. The fitting algorithm is described in the section Fitting Algorithm for Weighted GEE. The subject-specific weighting scheme was originally developed for computational convenience. Preisser, Lohman, and Rathouz (2002) showed that the observation-level weighted GEE method produces more efficient estimates than the cluster-level weighted GEE method for incomplete longitudinal binary data.

Fitting Algorithm for Weighted GEE

The following fitting algorithm fits marginal models by using the observation-specific or the subject-specific weighted GEE method when the dropout process is missing at random:

  1. Fit a logistic regression to the data $(r_{ij}, z_{ij})$ to obtain an estimate of $\balpha $ and estimate the weights.

  2. Compute an initial estimate of $\bbeta $ by using an ordinary generalized linear model, assuming independence of the responses.

  3. Compute the working correlation matrix $\mb{R}$ based on the standardized residuals, the current estimate of $\bbeta $, and the specified structure of $\mb{R}$.

  4. Compute the estimated covariance matrix:

    \[  \mb{V}_{i} = \phi \mb{A}_ i^{\frac{1}{2}} \hat{\mb{R}}(\balpha ) \mb{A}_ i^{\frac{1}{2}}  \]
  5. Update $\hat\bbeta $:

    \[  \hat\bbeta _{r+1} = \hat\bbeta _{r} + \left[\sum _{i=1}^{K} \frac{\partial \bmu _ i}{\partial \bbeta }^{\prime } \mb{V}_{i}^{-1} \frac{\partial \bmu _ i}{\partial \bbeta } \right]^{-1} \left[ \sum _{i=1}^ K \frac{\partial \bmu _ i}{\partial \bbeta }^{\prime }\mb{V}_{i}^{-1} \mb{W}_ i (\mb{Y}_{i}-\bmu _ i) \right]  \]

    where $\mb{Y}_{i}, \bmu _ i, \mb{V}_{i}$, and $\mb{W}_ i$ are as follows:

    • For the observation-specific weighted method, $\mb{Y}_{i}=(y_{i1}, y_{i2}, \ldots , y_{i T})^{\prime }$; $\bmu _ i$ and $\mb{V}_{i}$ are its corresponding mean vector and working covariance matrix, respectively; and $\mb{W}_ i$ is a $T \times T$ diagonal matrix whose jth diagonal is $r_{ij}\hat w_{ij}$.

    • For the subject-specific weighted method, $\mb{Y}_{i}=(y_{i1}, y_{i2}, \ldots , y_{i n_ i})^{\prime }$; $\bmu _ i$ and $\mb{V}_{i}$ are its corresponding mean vector and working covariance matrix, respectively; and $\mb{W}_ i$ is a $n_ i \times n_ i$ diagonal matrix whose jth diagonal is $\hat w_{i}$.

  6. Repeat steps 3–5 until convergence.

Note that you can use the WEIGHT statement in the GENMOD procedure to perform a two-stage strategy that is often used in practice to obtain the weighted GEE estimates. You fit a logistic regression to the data $(r_{ij}, z_{ij})$ to obtain the weights as described in the preceding steps. Then you estimate $\bbeta $ by specifying the estimated weights in the WEIGHT statement in PROC GENMOD for the GEE analysis. For the subject-specific weighted GEE method, this approach is appropriate for any working correlation structure. However, for the observation-specific weighted method, this approach is appropriate only for the independent working correlation structure.

The two-stage approach results in standard errors that are larger than those that are produced by using the MISSMODEL statement in the GEE procedure (because PROC GENMOD treats the weights as fixed and known). Thus, the two-stage approach that uses PROC GENMOD results in conservative inference (Fitzmaurice, Laird, and Ware, 2011). The GEE procedure computes the parameter estimate covariances as described in (Fitzmaurice, Laird, and Ware, 2011) and Preisser, Lohman, and Rathouz (2002).

Missing Data

Suppose that each subject in a longitudinal study is measured at T times. In other words, for the ith subject you measure T responses $(y_{i1}, y_{i2}, \ldots , y_{i T})$ and T corresponding covariates $(\mb{x}_{i1}, \mb{x}_{i2}, , \ldots , x_{i T})$.

By default, the GEE procedure handles missing data in the same manner as the standard GEE method in the GENMOD procedure. The working correlation matrix is estimated from data that contain both intermittent and dropout types of missing values by using the all-available-pairs method, in which all nonmissing pairs of data are used in the moment estimators. The resulting covariances and standard errors are valid under the missing completely at random (MCAR) assumption. For more information, see the section Missing Data in Chapter 43: The GENMOD Procedure.

When you specify the MISSMODEL statement in the GEE procedure to use the weighted GEE method to analyze the data, the procedure uses observations that have missing values in the response, provided that the missing values for all subjects are caused by dropouts. If the missing values are intermittent for any of the subjects, then the weighted GEE method does not apply and the procedure terminates.

For the observation-specific weighted GEE method, the covariates for all the observations for a subject must be observed, regardless of whether the response is missing. For each subject, the input data set must provide T observations.

For the subject-specific weighted GEE method, the covariates for a subject who drops out at time k must be observed for the observations up to and including time k. The input data set must provide at least k observations for this subject. The covariates must be observed for all observations on a subject who completes the study, and the input data set must provide T observations for this subject.

For more information about how weighted GEE methods handle missing values, see Fitzmaurice, Laird, and Ware (2011) and Preisser, Lohman, and Rathouz (2002).