The GENMOD Procedure

Case Deletion Diagnostic Statistics

For ordinary generalized linear models, regression diagnostic statistics developed by Williams (1987) can be requested in an output data set or in the OBSTATS table by specifying the DIAGNOSTICS | INFLUENCE option in the MODEL statement. These diagnostics measure the influence of an individual observation on model fit, and generalize the one-step diagnostics developed by Pregibon (1981) for the logistic regression model for binary data.

Preisser and Qaqish (1996) further generalized regression diagnostics to apply to models for correlated data fit by generalized estimating equations (GEEs), where the influence of entire clusters of correlated observations, or the influence of individual observations within a cluster, is measured. These diagnostic statistics can be requested in an output data set or in the OBSTATS table if a model for correlated data is specified with a REPEATED statement.

The next two sections use the following notation:

$\hat{\bbeta }$

is the maximum likelihood estimate of the regression parameters $\bbeta $, or, in the case of correlated data, the solution of the GEEs.

$\hat{\bbeta }_{[i]}$

is the corresponding estimate evaluated with the ith observation deleted, or, in the case of correlated data, with the ith cluster deleted.

p

is the dimension of the regression parameter vector $\bbeta $.

$r_{pi}$

is the standardized Pearson residual $\frac{y_ i-\mu _ i}{\sqrt {v_ i(1-h_ i)}}$, where $v_ i$ is the variance of the ith response and $h_ i$ is the leverage defined in the section H |LEVERAGE.

$v_ i$

is the variance of response i, $\mr{Var}(Y_ i)=\phi V(\mu _ i)$, where $V(\mu )$ is the variance function and $\phi $ is the dispersion parameter.

$w_ i$

is the prior weight of the ith observation specified with the WEIGHT statement. If there is no WEIGHT statement, $w_ i=1$ for all i.

All unknown quantities are replaced by their estimated values in the following two sections.

Diagnostics for Ordinary Generalized Linear Models

The following statistics are available for generalized linear models.

DFBETA

The DFBETA statistic for measuring the influence of the ith observation is defined as the one-step approximation to the difference between the MLE of the regression parameter vector computed from all the data and the MLE computed without the ith observation. This one-step approximation assumes a Fisher scoring step, and is given by

\[ \hat{\bbeta } - \hat{\bbeta }_{[i]} \approx \mr{DFBETA}_ i = (\bX ^{\prime }\bW \bX )^{-1}\bX _ i^{\prime }\bW _ i^{\frac{1}{2}}(1-h_ i)^{-\frac{1}{2}}r_{pi} \]

where $h_ i$ is the leverage defined in the section H |LEVERAGE.
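
The formula can be checked numerically in the one case where the one-step approximation is exact: a normal model with identity link. The following is a minimal sketch, not the procedure's implementation. The toy data, the helper `fit_line`, and the choice of estimating $\phi$ as SSE/(n-p) are all assumptions made for illustration, and the model is restricted to p = 2 (intercept and slope) so that the matrix inverse can be written in closed form.

```python
# Sketch only: toy data and helper names are illustrative assumptions.

def fit_line(x, y):
    """Least squares fit of y = b0 + b1*x; returns (beta, (X'X)^{-1})."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    xtx_inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    beta = [xtx_inv[0][0] * sy + xtx_inv[0][1] * sxy,
            xtx_inv[1][0] * sy + xtx_inv[1][1] * sxy]
    return beta, xtx_inv

x = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 7.0]
n, p = len(x), 2

beta, xtx_inv = fit_line(x, y)
mu = [beta[0] + beta[1] * xi for xi in x]
phi = sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / (n - p)  # assumed estimator

# Normal model, identity link: V(mu) = 1 and g'(mu) = 1, so the expected
# weight is w_e = 1/phi and (X'WX)^{-1} = phi * (X'X)^{-1}.
w_e = 1.0 / phi
xwx_inv = [[phi * v for v in row] for row in xtx_inv]

dfbeta = []
for xi, yi, mi in zip(x, y, mu):
    # a = (X'WX)^{-1} x_i, with x_i = (1, xi)'
    a = [xwx_inv[0][0] + xwx_inv[0][1] * xi,
         xwx_inv[1][0] + xwx_inv[1][1] * xi]
    h = w_e * (a[0] + a[1] * xi)                 # leverage h_i
    r_p = (yi - mi) / (phi * (1.0 - h)) ** 0.5   # standardized Pearson residual
    scale = w_e ** 0.5 * (1.0 - h) ** -0.5 * r_p
    dfbeta.append([a[0] * scale, a[1] * scale])

# Exact deletion differences, obtained by refitting without observation i.
exact = []
for i in range(n):
    b_i, _ = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    exact.append([beta[0] - b_i[0], beta[1] - b_i[1]])
```

In this special case `dfbeta` and `exact` agree to rounding error. For other distributions and links the one-step value is only an approximation to the refitted difference.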

DFBETAS

The standardized DFBETA statistic for assessing the influence of the ith observation on the jth regression parameter is defined as the DFBETA statistic for the jth parameter divided by its estimated standard deviation, where the standard deviation is estimated from all the data.

\[ \mr{DFBETAS}_{ij} = \mr{DFBETA}_{ij} / \hat{\sigma }(\beta _ j) \]

DOBS |COOKD |COOKSD

In normal linear regression, the influence of observation i can be measured by Cook’s distance (Cook and Weisberg 1982). A measure of influence of observation i for generalized linear models that is equivalent to Cook’s distance for normal linear regression is given by

\[ \mr{DOBS}_ i = p^{-1}h_ i(1-h_ i)^{-1}r^2_{pi} \]

where $h_ i$ is the leverage defined in the section H |LEVERAGE. This measure is the one-step approximation to $2p^{-1}[L(\hat{\bbeta })- L(\hat{\bbeta }_{[i]})]$, where $L(\bbeta )$ is the log likelihood evaluated at $\bbeta $.
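
As a quick check of the equivalence with Cook's distance, consider normal linear regression with identity link and unit prior weights. Write $e_i = y_i - \hat{\mu}_i$ for the raw residual and $\sigma^2$ for the dispersion $\phi$; both symbols are introduced here only for illustration. The standardized Pearson residual is then $r_{pi} = e_i/\sqrt{\sigma^2(1-h_i)}$, so that

\[ \mr{DOBS}_i = p^{-1}h_i(1-h_i)^{-1}\,\frac{e_i^2}{\sigma^2(1-h_i)} = \frac{e_i^2\,h_i}{p\,\sigma^2(1-h_i)^2} \]

which is the usual closed form of Cook's distance.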

H |LEVERAGE

The Fisher scoring, or expected, weight for observation i is $w_{ei}=\frac{w_ i}{\phi V(\mu _ i)(g^{\prime }(\mu _ i))^2}$. Let $\bW $ be the diagonal matrix with $w_{ei}$ as the ith diagonal element. The leverage $h_ i$ of the ith observation is defined as the ith diagonal element of the hat matrix

\[ \bH = \bW ^\frac {1}{2}\bX (\bX ^{\prime }\bW \bX )^{-1}\bX ^{\prime }\bW ^\frac {1}{2} \]
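
To make the definition concrete, the following sketch computes the expected weights and leverages for a Poisson model with log link. It is an illustration under stated assumptions, not the procedure's code: the covariate values are made up, `beta_hat` holds hypothetical values standing in for fitted estimates (no model is fit here), $\phi = 1$, and there is no WEIGHT statement. The model has p = 2 parameters so that $(\bX ^{\prime }\bW \bX )^{-1}$ has a closed form.

```python
import math

# Sketch only: data and beta_hat are hypothetical, not fitted values.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
beta_hat = [0.2, 0.5]
phi = 1.0
w_prior = [1.0] * len(x)

mu = [math.exp(beta_hat[0] + beta_hat[1] * xi) for xi in x]

# Poisson: V(mu) = mu. Log link: g(mu) = log(mu), so g'(mu) = 1/mu.
# Expected weight: w_e = w / (phi * V(mu) * g'(mu)^2), which equals w * mu / phi here.
w_e = [w / (phi * m * (1.0 / m) ** 2) for w, m in zip(w_prior, mu)]

# X'WX for the design matrix with rows (1, x_i), and its closed-form inverse.
s0 = sum(w_e)
s1 = sum(w * xi for w, xi in zip(w_e, x))
s2 = sum(w * xi * xi for w, xi in zip(w_e, x))
det = s0 * s2 - s1 * s1
inv = [[s2 / det, -s1 / det], [-s1 / det, s0 / det]]

# h_i = w_e_i * x_i' (X'WX)^{-1} x_i, the ith diagonal element of H.
h = [w * (inv[0][0] + 2.0 * inv[0][1] * xi + inv[1][1] * xi * xi)
     for w, xi in zip(w_e, x)]
```

Because $\bH $ is a projection matrix of rank p, the leverages lie between 0 and 1 and sum to p, which gives a simple check on any implementation.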

Diagnostics for Models Fit by Generalized Estimating Equations (GEEs)

The diagnostic statistics in this section were developed by Preisser and Qaqish (1996). See the section Generalized Estimating Equations for further information and notation for generalized estimating equations (GEEs). The following additional notation is used in this section.

Partition the design matrix $\mb{X}$ and response vector $\mb{Y}$ by cluster; that is, let $\mb{X} = (X_1^\prime , \ldots , X_ K^\prime )^\prime $, and $\mb{Y} = (Y_1^\prime , \ldots , Y_ K^\prime )^\prime $ corresponding to the K clusters.

Let $n_ i$ be the number of responses for cluster i, and denote by $N = \sum _{i=1}^ K n_ i$ the total number of observations. Denote by $\bA _ i$ the $n_ i \times n_ i$ diagonal matrix with $V(\mu _{ij})$ as the jth diagonal element. If there is a WEIGHT statement, the jth diagonal element of $\bA _ i$ is $V(\mu _{ij})/w_{ij}$, where $w_{ij}$ is the specified weight of the jth observation in the ith cluster. Let $\bB $ be the $N\times N$ diagonal matrix with $g^{\prime }(\mu _{ij})$ as diagonal elements, $i=1,\ldots ,K$, $j=1,\ldots ,n_ i$. Let $\bB _ i$ be the $n_ i \times n_ i$ diagonal matrix corresponding to cluster i with $g^{\prime }(\mu _{ij})$ as the jth diagonal element.

Let $\bW $ be the $N \times N$ block diagonal weight matrix whose ith block, corresponding to the ith cluster, is the $n_ i \times n_ i$ matrix

\[ \bW _{ei} = \bB _ i^{-1}\bA _ i^{-\frac{1}{2}}\bR _ i^{-1}(\hat{\balpha })\bA _ i^{-\frac{1}{2}}\bB _ i^{-1} \]

where $\bR _ i$ is the working correlation matrix for cluster i.

Let

\[ \bQ _ i = \bX _ i(\bX ^{\prime }\bW \bX )^{-1} \bX _ i^{\prime } \]

where $\bX _ i$ is the $n_ i \times p$ design matrix corresponding to cluster i.

Define the adjusted residual vector as

\[ \bE = \bB (\mb{Y}-\hat{\bmu }) \]

and $\bE _ i=\bB _ i(\mb{Y}_ i-\hat{\bmu }_ i)$, the estimated residual for the ith cluster.

Let the subscript $[i]$ denote estimates evaluated without the ith cluster, $[it]$ estimates evaluated using all the data except the tth observation of the ith cluster, and let $i[t]$ denote matrices corresponding to the ith cluster without the tth observation.

The following statistics are available for generalized estimating equation models.

CH |CLUSTERH |CLEVERAGE

The leverage of cluster i is contained in the matrix $\bH _ i=\bQ _{i}\bW _{ei}$, and is summarized by the trace of $\bH _ i$,

\[ \mathit{ch}_ i = \mr{tr}(\bH _ i) \]

The leverage $h_{it}$ of the tth observation in the ith cluster is the tth diagonal element of $\bH _ i$.
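
A useful sanity check follows from the definitions above (it is a short derivation added for illustration, assuming $\bX ^{\prime }\bW \bX $ is nonsingular). Because $\bW $ is block diagonal, $\bX ^{\prime }\bW \bX = \sum _{i=1}^ K \bX _ i^{\prime }\bW _{ei}\bX _ i$, and therefore

\[ \sum _{i=1}^ K \mathit{ch}_ i = \sum _{i=1}^ K \mr{tr}\left(\bX _ i(\bX ^{\prime }\bW \bX )^{-1}\bX _ i^{\prime }\bW _{ei}\right) = \mr{tr}\left((\bX ^{\prime }\bW \bX )^{-1}\sum _{i=1}^ K \bX _ i^{\prime }\bW _{ei}\bX _ i\right) = p \]

so the cluster leverages sum to the number of regression parameters, and the average cluster leverage is $p/K$.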

DFBETAC

The effect of deleting cluster i on the estimated parameter vector is given by the following one-step approximation for $\hat{\bbeta } - \hat{\bbeta }_{[i]}$:

\[ \mr{DBETAC}_ i = (\bX ^{\prime }\bW \bX )^{-1}\bX _ i^{\prime }(\bW _{ei}^{-1} - \bQ _ i)^{-1}\bE _ i \]
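
The cluster deletion formula can be checked in a case where it is exact: a normal model with identity link and independence working correlation, for which $\bB _ i$, $\bA _ i$, and $\bR _ i$ are all identity matrices and hence $\bW _{ei}$ is the identity. The sketch below is illustrative only. The toy data, the three clusters of two observations each, and the helper functions `inv2`, `matmul`, `transpose`, and `ols` are assumptions made for this example, with p = 2 so that 2 x 2 inverses suffice.

```python
# Sketch only: toy data and helpers are illustrative assumptions.

def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def ols(X, y):
    """Return (beta, (X'X)^{-1}) for a two-column design matrix X."""
    Xt = transpose(X)
    xtx_inv = inv2(matmul(Xt, X))
    beta = matmul(xtx_inv, matmul(Xt, [[v] for v in y]))
    return [b[0] for b in beta], xtx_inv

clusters_x = [[1.0, 2.0], [3.0, 5.0], [6.0, 9.0]]
clusters_y = [[1.2, 1.9], [3.1, 5.3], [5.8, 9.4]]

X = [[1.0, v] for c in clusters_x for v in c]
y = [v for c in clusters_y for v in c]
beta, xwx_inv = ols(X, y)          # W is the identity, so X'WX = X'X

dbetac, exact = [], []
for i, (cx, cy) in enumerate(zip(clusters_x, clusters_y)):
    Xi = [[1.0, v] for v in cx]
    # E_i = B_i (Y_i - mu_i) with B_i = I for the identity link
    Ei = [[yv - (beta[0] + beta[1] * xv)] for xv, yv in zip(cx, cy)]
    Qi = matmul(matmul(Xi, xwx_inv), transpose(Xi))
    # W_ei^{-1} - Q_i, with W_ei^{-1} = I
    M = [[1.0 - Qi[0][0], -Qi[0][1]],
         [-Qi[1][0], 1.0 - Qi[1][1]]]
    step = matmul(xwx_inv, matmul(transpose(Xi), matmul(inv2(M), Ei)))
    dbetac.append([s[0] for s in step])

    # Exact difference: refit without cluster i.
    Xr = [[1.0, v] for k, c in enumerate(clusters_x) if k != i for v in c]
    yr = [v for k, c in enumerate(clusters_y) if k != i for v in c]
    b_r, _ = ols(Xr, yr)
    exact.append([beta[0] - b_r[0], beta[1] - b_r[1]])
```

Here `dbetac` matches `exact` to rounding error. With a non-identity link or a working correlation estimated from the data, the one-step value approximates the refitted difference rather than reproducing it.
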

DFBETACS

The cluster deletion statistic DFBETAC can be standardized using the variances of $\hat{\bbeta }$ based on the complete data. The standardized one-step approximation for the change in $\hat{\beta }_ j$ due to deletion of cluster i is

\[ \mr{DBETACS}_{ij} = \frac{\mr{DBETAC}_{ij}}{\hat{\phi }[(\bX ^{\prime }\bW \bX )^{-1}]_{jj}^\frac {1}{2}} \]

DFBETA

Partition the matrices $\bW _{ei}$ and $\bV _ i$ as

\[ \bW _{ei} = \left( \begin{array}{ll} W_{eit} & \bW _{eit[t]} \\ \bW _{ei[t]t} & \bW _{ei[t]} \\ \end{array} \right) \]
\[ \bV _{i} = \bW _{ei}^{-1} = \left( \begin{array}{ll} \bV _{it} & \bV _{it[t]} \\ \bV _{i[t]t} & \bV _{i[t]} \\ \end{array} \right) \]

and let $\bE _{it} = \bB _{it}(\bY _{it}-\hat{\mu }_{it})$ and $\bE _{i[t]}=\bB _{i[t]}(\bY _{i[t]}-\hat{\mu }_{i[t]})$.

The effect of deleting the tth observation from the ith cluster is given by the following one-step approximation to $\hat{\bbeta }-\hat{\bbeta }_{[it]}$:

\[ \mr{DBETAO}_{it} = (\bX ^{\prime }\bW \bX )^{-1}\tilde{\bX }_{it}^\prime \frac{\tilde{E}_{it}}{W_{eit}^{-1}-\tilde{Q}_{it}} \]

where $\tilde{\bX }_{it}=\bX _{it}-\bV _{it[t]}\bV _{i[t]}^{-1}\bX _{i[t]}$, $\tilde{Q}_{it}=\tilde{\bX }_{it}(\bX ^{\prime }\bW \bX )^{-1}\tilde{\bX }_{it}^\prime $, and $\tilde{E}_{it}=\bE _{it}-\bV _{it[t]}\bV _{i[t]}^{-1}\bE _{i[t]}$. Note that $W_{eit}$, $\tilde{Q}_{it}$, and $\tilde{E}_{it}$ are scalars.

DFBETAS

The observation deletion statistic DFBETA can be standardized using the variances of $\hat{\bbeta }$ based on the complete data. The standardized one-step approximation for the change in $\hat{\beta }_ j$ due to deletion of observation t in cluster i is

\[ \mr{DBETAOS}_{itj} = \frac{\mr{DBETAO}_{itj}}{\hat{\phi }[(\bX ^{\prime }\bW \bX )^{-1}]_{jj}^\frac {1}{2}} \]

DOBS |COOKD |COOKSD

A measure of the standardized influence of the subset m of observations on the overall fit is $(\hat{\bbeta }-\hat{\bbeta }_{[m]})^{\prime }(\bX ^{\prime }\bW \bX )(\hat{\bbeta }-\hat{\bbeta }_{[m]})/p\hat{\phi }$. For deletion of cluster i, this is approximated by

\[ \mr{DCLS}_ i = \bE _ i^{\prime }(\bW _{ei}^{-1}-\bQ _ i)^{-1}\bQ _ i(\bW _{ei}^{-1}-\bQ _ i)^{-1}\bE _ i/p\hat{\phi } \]

DOBS |COOKD |COOKSD

The measure of overall fit in the section DCLS |CLUSTERCOOKD |CLUSTERCOOKSD for the deletion of the tth observation in the ith cluster is approximated by

\[ \mr{DOBS}_{it} = \frac{\tilde{E}_{it}^{2}\tilde{Q}_{it}}{p\hat{\phi }(W_{eit}^{-1}-\tilde{Q}_{it})^2} \]

where $\tilde{E}_{it}$, $\tilde{Q}_{it}$, and $W_{eit}$ are defined in the section DFBETA. In the case of the independence working correlation, this is equal to the measure for ordinary generalized linear models defined in the section DOBS |COOKD |COOKSD.
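
The reduction to the ordinary generalized linear model measure can be verified directly. The following sketch assumes an independence working correlation and no WEIGHT statement. In that case $\bW _{ei}$ and $\bV _ i$ are diagonal, so $\bV _{it[t]}=0$, $\tilde{\bX }_{it}=\bX _{it}$, $\tilde{E}_{it}=g^{\prime }(\mu _{it})(y_{it}-\hat{\mu }_{it})$, and $W_{eit}=[V(\mu _{it})(g^{\prime }(\mu _{it}))^2]^{-1}$. The leverage of observation t in cluster i, the tth diagonal element of $\bH _ i$, is $h_{it}=\tilde{Q}_{it}W_{eit}$, so $W_{eit}^{-1}-\tilde{Q}_{it}=W_{eit}^{-1}(1-h_{it})$ and

\[ \mr{DOBS}_{it} = \frac{\tilde{E}_{it}^{2}W_{eit}^{2}\tilde{Q}_{it}}{p\hat{\phi }(1-h_{it})^2} = \frac{h_{it}}{p(1-h_{it})}\cdot \frac{(y_{it}-\hat{\mu }_{it})^2}{\hat{\phi }V(\mu _{it})(1-h_{it})} = p^{-1}h_{it}(1-h_{it})^{-1}r_{p,it}^2 \]

where $r_{p,it}$ denotes the standardized Pearson residual of that observation, written with a double index here only for this illustration.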

MCLS |CLUSTERDFIT

A studentized distance measure of the influence of the ith cluster, of the type defined in the section DCLS |CLUSTERCOOKD |CLUSTERCOOKSD, is given by

\[ \mr{MCLS}_ i = \bE _ i^\prime (\bW _{ei}^{-1}-\bQ _ i)^{-1}\bH _ i\bE _ i/p\hat{\phi } \]