PROC GENMOD: Case Deletion Diagnostic Statistics :: SAS/STAT(R) 9.2 User's Guide, Second Edition

The GENMOD Procedure

Case Deletion Diagnostic Statistics

For ordinary generalized linear models, regression diagnostic statistics developed by Williams (1987) can be requested in an output data set or in the OBSTATS table by specifying the DIAGNOSTICS | INFLUENCE option in the MODEL statement. These diagnostics measure the influence of an individual observation on model fit, and generalize the one-step diagnostics developed by Pregibon (1981) for the logistic regression model for binary data.

Preisser and Qaqish (1996) further generalize regression diagnostics to apply to models for correlated data fit by generalized estimating equations (GEEs), where the influence of entire clusters of correlated observations is measured. These diagnostic statistics can be requested in an output data set or in the OBSTATS table if a model for correlated data is specified with a REPEATED statement.

The next two sections use the following notation:

$\text{[math]}$: the maximum likelihood estimate of the regression parameters $\text{[math]}$, or, in the case of correlated data, the solution of the GEEs.
$\text{[math]}$: the corresponding estimate evaluated with the $\text{[math]}$th observation deleted, or, in the case of correlated data, with the $\text{[math]}$th cluster deleted.
$\text{[math]}$: the dimension of the regression parameter vector $\text{[math]}$.
$\text{[math]}$: the standardized Pearson residual $\text{[math]}$, where $\text{[math]}$ is the variance of the $\text{[math]}$th response and $\text{[math]}$ is the leverage defined in the section H | LEVERAGE.
$\text{[math]}$: the variance of response $\text{[math]}$, $\text{[math]}$, where $\text{[math]}$ is the variance function and $\text{[math]}$ is the dispersion parameter.
$\text{[math]}$: the prior weight of the $\text{[math]}$th observation specified with the WEIGHT statement. If there is no WEIGHT statement, $\text{[math]}$ for all $\text{[math]}$.

All unknown quantities are replaced by their estimated values in the following two sections.

Diagnostics for Ordinary Generalized Linear Models

The following statistics are available for generalized linear models.

DFBETA

The DFBETA statistic for measuring the influence of the $\text{[math]}$th observation is defined as the one-step approximation to the difference between the MLE of the regression parameter vector computed from all the data and the MLE computed without the $\text{[math]}$th observation. This one-step approximation assumes a Fisher scoring step, and is given by

$\text{[math]}$

where $\text{[math]}$ is the leverage defined in the section H | LEVERAGE.
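As an illustration of what a one-step deletion formula buys, the following is a minimal sketch in Python, not the GENMOD implementation. It assumes the special case of normal linear regression (identity link, unit weights, one covariate plus an intercept), where the textbook expression $(X'X)^{-1} x_i e_i / (1 - h_i)$ reproduces the change in the estimates exactly, so it can be checked against a brute-force refit. The data and the function names `fit`, `one_step_dfbeta`, and `exact_dfbeta` are hypothetical.

```python
# Assumption: normal linear regression with an intercept and one covariate.
# In this special case the one-step deletion formula is exact.

def fit(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    det = n * sxx - sx * sx
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    return b0, b1

def one_step_dfbeta(x, y, i):
    """(X'X)^{-1} x_i e_i / (1 - h_i) for the design row x_i = (1, x[i])."""
    n, sx = len(x), sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    b0, b1 = fit(x, y)
    e = y[i] - (b0 + b1 * x[i])                      # raw residual
    h = (sxx - 2 * sx * x[i] + n * x[i] ** 2) / det  # leverage of row i
    v0 = (sxx - sx * x[i]) / det                     # first entry of (X'X)^{-1} x_i
    v1 = (n * x[i] - sx) / det                       # second entry of (X'X)^{-1} x_i
    return v0 * e / (1 - h), v1 * e / (1 - h)

def exact_dfbeta(x, y, i):
    """Full-data estimate minus the estimate refit without observation i."""
    b0, b1 = fit(x, y)
    c0, c1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return b0 - c0, b1 - c1

# Hypothetical data; the last point has a large x value and high leverage.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 7.0]

approx = [one_step_dfbeta(x, y, i) for i in range(len(x))]
exact = [exact_dfbeta(x, y, i) for i in range(len(x))]
```

For a generalized linear model with a non-identity link the two quantities generally differ, because the one-step version takes a single Fisher scoring step from the full-data estimate instead of iterating to convergence.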

DFBETAS

The standardized DFBETA statistic for assessing the influence of the $\text{[math]}$th observation on the $\text{[math]}$th regression parameter is defined as the DFBETA statistic for the $\text{[math]}$th parameter divided by its estimated standard deviation, where the standard deviation is estimated from all the data:

$\text{[math]}$

DOBS | COOKD | COOKSD

In normal linear regression, the influence of observation $\text{[math]}$ can be measured by Cook’s distance (Cook and Weisberg 1982). A measure of influence of observation $\text{[math]}$ for generalized linear models that is equivalent to Cook’s distance for normal linear regression is given by

$\text{[math]}$

where $\text{[math]}$ is the leverage defined in the section H | LEVERAGE. This measure is the one-step approximation to $\text{[math]}$, where $\text{[math]}$ is the log likelihood evaluated at $\text{[math]}$.
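The stated equivalence with Cook's distance in normal linear regression can be checked numerically. The sketch below is an illustration under assumptions, not the GENMOD computation: it uses the common textbook form $r_i^2 h_i / \{p (1 - h_i)\}$, with $r_i$ the standardized residual, and compares it with the deletion-based definition $(\hat\beta - \hat\beta_{[i]})' X'X (\hat\beta - \hat\beta_{[i]}) / (p s^2)$ obtained by refitting. The data and function names are hypothetical.

```python
# Assumption: normal linear regression with an intercept and one covariate,
# so p = 2 and the dispersion is estimated by s2 = RSS / (n - p).

def fit(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    det = n * sxx - sx * sx
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    return b0, b1

def cooks_from_residuals(x, y):
    """Cook's distance from standardized residuals and leverages."""
    n, p, sx = len(x), 2, sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    b0, b1 = fit(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in e) / (n - p)
    out = []
    for xi, ei in zip(x, e):
        h = (sxx - 2 * sx * xi + n * xi * xi) / det  # leverage
        r2 = ei * ei / (s2 * (1 - h))                # squared standardized residual
        out.append(r2 * h / (p * (1 - h)))
    return out

def cooks_from_deletion(x, y):
    """Cook's distance by refitting without each observation in turn."""
    n, p, sx = len(x), 2, sum(x)
    sxx = sum(v * v for v in x)
    b0, b1 = fit(x, y)
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - p)
    out = []
    for i in range(n):
        c0, c1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        d0, d1 = b0 - c0, b1 - c1
        # Quadratic form d' (X'X) d with X'X = [[n, sx], [sx, sxx]]
        quad = n * d0 * d0 + 2 * sx * d0 * d1 + sxx * d1 * d1
        out.append(quad / (p * s2))
    return out

# Hypothetical data; the last point has a large x value and high leverage.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 7.0]

d_resid = cooks_from_residuals(x, y)
d_refit = cooks_from_deletion(x, y)
```

The residual-based form needs only one model fit, which is the practical reason for preferring it when the number of observations is large.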

H | LEVERAGE

The Fisher scoring, or expected, weight for observation $\text{[math]}$ is $\text{[math]}$. Let $\text{[math]}$ be the diagonal matrix with $\text{[math]}$ as the $\text{[math]}$th diagonal element. The leverage $\text{[math]}$ of the $\text{[math]}$th observation is defined as the $\text{[math]}$th diagonal element of the hat matrix

$\text{[math]}$
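To make the role of the Fisher scoring weights concrete, here is a minimal Python sketch under stated assumptions; it is not the GENMOD code. It assumes a Poisson model with log link, dispersion 1, and no WEIGHT statement, for which the expected weight reduces to the fitted mean, and it assumes the usual weighted hat matrix $W^{1/2} X (X'WX)^{-1} X' W^{1/2}$, whose $i$th diagonal element is $w_i x_i' (X'WX)^{-1} x_i$. The coefficient values `beta0` and `beta1` and the function name `leverages` are hypothetical.

```python
import math

def leverages(x, w):
    """Diagonal of the weighted hat matrix for design rows (1, x_i)."""
    # Entries of the 2x2 matrix X'WX = [[a, b], [b, d]]
    a = sum(w)
    b = sum(wi * xi for wi, xi in zip(w, x))
    d = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = a * d - b * b
    # x_i' (X'WX)^{-1} x_i = (d - 2 b x_i + a x_i^2) / det
    return [wi * (d - 2 * b * xi + a * xi * xi) / det for wi, xi in zip(w, x)]

# Hypothetical covariate values and hypothetical fitted coefficients.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
beta0, beta1 = 0.2, 0.3

# Assumption: Poisson with log link, so the expected weight equals the mean.
mu = [math.exp(beta0 + beta1 * xi) for xi in x]
h = leverages(x, mu)

# The hat matrix is a projection, so the leverages sum to the number of
# regression parameters (2 here) and each lies between 0 and 1.
total = sum(h)
```

Unlike in ordinary least squares, these leverages depend on the fitted means through the weights, so they change with the estimated parameters and not only with the design.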

Diagnostics for Models Fit by Generalized Estimating Equations (GEEs)

The diagnostic statistics in this section were developed by Preisser and Qaqish (1996). See the section Generalized Estimating Equations for further information and notation for generalized estimating equations (GEEs). The following additional notation is used in this section.

Partition the design matrix $\text{[math]}$ and response vector $\text{[math]}$ by cluster; that is, let $\text{[math]}$ and $\text{[math]}$, corresponding to the $\text{[math]}$ clusters.

Let $\text{[math]}$ be the number of responses for cluster $\text{[math]}$, and denote by $\text{[math]}$ the total number of observations. Denote by $\text{[math]}$ the $\text{[math]}$ diagonal matrix with $\text{[math]}$ as the $\text{[math]}$th diagonal element. If there is a WEIGHT statement, the diagonal element of $\text{[math]}$ is $\text{[math]}$, where $\text{[math]}$ is the specified weight of the $\text{[math]}$th observation in the $\text{[math]}$th cluster. Let $\text{[math]}$ be the $\text{[math]}$ diagonal matrix with $\text{[math]}$ as diagonal elements, $\text{[math]}$, $\text{[math]}$. Let $\text{[math]}$ be the $\text{[math]}$ diagonal matrix corresponding to cluster $\text{[math]}$ with $\text{[math]}$ as the $\text{[math]}$th diagonal element.

Let $\text{[math]}$ be the $\text{[math]}$ block diagonal weight matrix whose $\text{[math]}$th block, corresponding to the $\text{[math]}$th cluster, is the $\text{[math]}$ matrix

$\text{[math]}$

where $\text{[math]}$ is the working correlation matrix for cluster $\text{[math]}$.

Let

$\text{[math]}$

where $\text{[math]}$ is the $\text{[math]}$ design matrix corresponding to cluster $\text{[math]}$.

Define the adjusted residual vector as

$\text{[math]}$

and $\text{[math]}$, the estimated residual for the $\text{[math]}$th cluster.

Let the subscript $\text{[math]}$ denote estimates evaluated without the $\text{[math]}$th cluster, let $\text{[math]}$ denote estimates evaluated using all the data except the $\text{[math]}$th observation of the $\text{[math]}$th cluster, and let $\text{[math]}$ denote matrices corresponding to the $\text{[math]}$th cluster without the $\text{[math]}$th observation.

The following statistics are available for generalized estimating equation models.

CH | CLUSTERH | CLEVERAGE

The leverage of cluster $\text{[math]}$ is contained in the matrix $\text{[math]}$, and is summarized by the trace of $\text{[math]}$,

$\text{[math]}$

The leverage $\text{[math]}$ of the $\text{[math]}$th observation in the $\text{[math]}$th cluster is the $\text{[math]}$th diagonal element of $\text{[math]}$.

DFBETAC

The effect of deleting cluster $\text{[math]}$ on the estimated parameter vector is given by the following one-step approximation for $\text{[math]}$:

$\text{[math]}$

DFBETACS

The cluster deletion statistic DFBETAC can be standardized using the variances of $\text{[math]}$ based on the complete data. The standardized one-step approximation for the change in $\text{[math]}$ due to deletion of cluster $\text{[math]}$ is

$\text{[math]}$

DFBETAO

Partition the matrices $\text{[math]}$ and $\text{[math]}$ as

$\text{[math]}$

and let $\text{[math]}$ and $\text{[math]}$.

The effect of deleting the $\text{[math]}$th observation from the $\text{[math]}$th cluster is given by the following one-step approximation to $\text{[math]}$:

$\text{[math]}$

where $\text{[math]}$, $\text{[math]}$, and $\text{[math]}$. Note that $\text{[math]}$, $\text{[math]}$, and $\text{[math]}$ are scalars.

DFBETAOS

The observation deletion statistic DFBETAO can be standardized using the variances of $\text{[math]}$ based on the complete data. The standardized one-step approximation for the change in $\text{[math]}$ due to deletion of observation $\text{[math]}$ in cluster $\text{[math]}$ is

$\text{[math]}$

DCLS | CLUSTERCOOKD | CLUSTERCOOKSD

A measure of the standardized influence of the subset $\text{[math]}$ of observations on the overall fit is $\text{[math]}$. For deletion of cluster $\text{[math]}$, this is approximated by

$\text{[math]}$

DOBS | COOKD | COOKSD

The measure of overall fit in the section DCLS | CLUSTERCOOKD | CLUSTERCOOKSD for the deletion of the $\text{[math]}$th observation in the $\text{[math]}$th cluster is approximated by

$\text{[math]}$

where $\text{[math]}$, $\text{[math]}$, and $\text{[math]}$ are defined in the section DFBETAO. In the case of the independence working correlation, this is equal to the measure for ordinary generalized linear models defined in the section DOBS | COOKD | COOKSD.

MCLS | CLUSTERDFIT

A studentized distance measure of the influence of the $\text{[math]}$th cluster, of the type defined in the section DCLS | CLUSTERCOOKD | CLUSTERCOOKSD, is given by

$\text{[math]}$
