Case Deletion Diagnostic Statistics
For ordinary generalized linear models, regression diagnostic statistics developed by Williams (1987) can be requested in an output data set or in the OBSTATS table by specifying the DIAGNOSTICS | INFLUENCE option in the MODEL statement. These diagnostics measure the influence of an individual observation on model fit, and generalize the one-step diagnostics developed by Pregibon (1981) for the logistic regression model for binary data.
Preisser and Qaqish (1996) further generalized regression diagnostics to apply to models for correlated data fit by generalized estimating equations (GEEs), where the influence of entire clusters of correlated observations, or the influence of individual observations within a cluster, is measured. These diagnostic statistics can be requested in an output data set or in the OBSTATS table if a model for correlated data is specified with a REPEATED statement.
The next two sections use the following notation:
$\hat{\boldsymbol{\beta}}$ is the maximum likelihood estimate of the regression parameters $\boldsymbol{\beta}$, or, in the case of correlated data, the solution of the GEEs.
$\hat{\boldsymbol{\beta}}_{[i]}$ is the corresponding estimate evaluated with the $i$th observation deleted, or, in the case of correlated data, with the $i$th cluster deleted.
$p$ is the dimension of the regression parameter vector $\boldsymbol{\beta}$.
$r_{pi}$ is the standardized Pearson residual $r_{pi} = (y_i - \mu_i) / \sqrt{v_i (1 - h_i)}$, where $v_i$ is the variance of the $i$th response and $h_i$ is the leverage defined in the section H | LEVERAGE.
$v_i$ is the variance of response $i$, $v_i = \phi V(\mu_i) / w_i$, where $V(\mu)$ is the variance function and $\phi$ is the dispersion parameter.
$w_i$ is the prior weight of the $i$th observation specified with the WEIGHT statement. If there is no WEIGHT statement, $w_i = 1$ for all $i$.
All unknown quantities are replaced by their estimated values in the following two sections.
The following statistics are available for generalized linear models.
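The DFBETA statistic for measuring the influence of the $i$th observation is defined as the one-step approximation to the difference between the MLE of the regression parameter vector and the MLE of the regression parameter vector without the $i$th observation. This one-step approximation assumes a Fisher scoring step, and is given by
$$\mathrm{DFBETA}_i = \frac{(\mathbf{X}'\mathbf{W}_e\mathbf{X})^{-1}\,\mathbf{x}_i\, w_{ei}\, g'(\mu_i)\,(y_i - \mu_i)}{1 - h_i}$$
where $\mathbf{x}_i$ is the $i$th row of the design matrix $\mathbf{X}$ written as a column vector, $g$ is the link function, and $w_{ei}$, $\mathbf{W}_e$, and the leverage $h_i$ are defined in the section H | LEVERAGE.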
The standardized DFBETA statistic for assessing the influence of the $i$th observation on the $j$th regression parameter is defined as the DFBETA statistic for the $j$th parameter divided by its estimated standard deviation, where the standard deviation is estimated from all the data.
In normal linear regression, the influence of observation $i$ can be measured by Cook's distance (Cook and Weisberg 1982). A measure of influence of observation $i$ for generalized linear models that is equivalent to Cook's distance for normal linear regression is given by
$$C_i = \frac{1}{p}\,\frac{h_i}{1 - h_i}\, r_{pi}^2$$
where $h_i$ is the leverage defined in the section H | LEVERAGE. This measure is the one-step approximation to $2p^{-1}[L(\hat{\boldsymbol{\beta}}) - L(\hat{\boldsymbol{\beta}}_{[i]})]$, where $L(\boldsymbol{\beta})$ is the log likelihood evaluated at $\boldsymbol{\beta}$.
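The DFBETA and Cook-type statistics above, together with the leverage from the section H | LEVERAGE, can be sketched in a few lines of plain Python. This is a minimal illustration, not the procedure's implementation: the helper names (`inverse`, `glm_case_diagnostics`, `least_squares`) and the five data points are hypothetical, and the formulas assume the Fisher scoring weight $w_{ei} = w_i / (\phi V(\mu_i) g'(\mu_i)^2)$. The demonstration uses a normal model with identity link because there the one-step DFBETA coincides with the exact change from refitting without the observation, which makes the result easy to check.

```python
def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def glm_case_diagnostics(X, y, mu, var_fn, dlink, phi=1.0, w=None):
    """Leverage, one-step DFBETA, and Cook-type distance, given fitted means mu.

    var_fn is the variance function V(mu); dlink is the link derivative g'(mu).
    """
    n, p = len(X), len(X[0])
    w = w or [1.0] * n
    # Fisher scoring weights (assumed form): w_ei = w_i / (phi * V(mu_i) * g'(mu_i)^2)
    we = [w[i] / (phi * var_fn(mu[i]) * dlink(mu[i]) ** 2) for i in range(n)]
    xtwx = [[sum(we[i] * X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
            for a in range(p)]
    xtwx_inv = inverse(xtwx)
    out = []
    for i in range(n):
        # u = (X' W_e X)^{-1} x_i
        u = [sum(xtwx_inv[a][b] * X[i][b] for b in range(p)) for a in range(p)]
        h = we[i] * sum(X[i][a] * u[a] for a in range(p))        # leverage h_i
        resid = y[i] - mu[i]
        dfbeta = [u[a] * we[i] * dlink(mu[i]) * resid / (1.0 - h) for a in range(p)]
        v = phi * var_fn(mu[i]) / w[i]                           # variance v_i
        r_p = resid / (v * (1.0 - h)) ** 0.5                     # standardized Pearson residual
        cook = h * r_p ** 2 / (p * (1.0 - h))
        out.append({"h": h, "dfbeta": dfbeta, "cook": cook})
    return out


def least_squares(X, y):
    """Ordinary least squares fit, used here only to obtain fitted means."""
    p = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    xtx_inv = inverse(xtx)
    return [sum(xtx_inv[a][b] * xty[b] for b in range(p)) for a in range(p)]


# Hypothetical data: intercept plus one covariate, last point far from the others.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 6.0]]
y = [1.0, 2.1, 2.9, 4.2, 9.0]
beta = least_squares(X, y)
mu = [sum(b * x for b, x in zip(beta, row)) for row in X]
diag = glm_case_diagnostics(X, y, mu, var_fn=lambda m: 1.0, dlink=lambda m: 1.0)
```

For a non-normal family, the same function could be called with the appropriate `var_fn` and `dlink` (for example `lambda m: m` and `lambda m: 1.0 / m` for a Poisson model with log link), in which case the DFBETA values are one-step approximations rather than exact differences.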
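The Fisher scoring, or expected, weight for observation $i$ is $w_{ei} = w_i / [\phi V(\mu_i) (g'(\mu_i))^2]$, where $g$ is the link function. Let $\mathbf{W}_e$ be the diagonal matrix with $w_{ei}$ as the $i$th diagonal element. The leverage $h_i$ of the $i$th observation is defined as the $i$th diagonal element of the hat matrix
$$\mathbf{H} = \mathbf{W}_e^{1/2}\,\mathbf{X}\,(\mathbf{X}'\mathbf{W}_e\mathbf{X})^{-1}\,\mathbf{X}'\,\mathbf{W}_e^{1/2}$$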
The diagnostic statistics in this section were developed by Preisser and Qaqish (1996). See the section Generalized Estimating Equations for further information and notation for generalized estimating equations (GEEs). The following additional notation is used in this section.
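Partition the design matrix $\mathbf{X}$ and response vector $\mathbf{Y}$ by cluster; that is, let $\mathbf{X} = (\mathbf{X}_1', \ldots, \mathbf{X}_K')'$ and $\mathbf{Y} = (\mathbf{Y}_1', \ldots, \mathbf{Y}_K')'$, corresponding to the $K$ clusters.
Let $n_i$ be the number of responses for cluster $i$, and denote by $N = \sum_{i=1}^{K} n_i$ the total number of observations. Denote by $\mathbf{A}_i$ the $n_i \times n_i$ diagonal matrix with $V(\mu_{ij})$ as the $j$th diagonal element. If there is a WEIGHT statement, the diagonal element of $\mathbf{A}_i$ is $V(\mu_{ij}) / w_{ij}$, where $w_{ij}$ is the specified weight of the $j$th observation in the $i$th cluster. Let $\mathbf{B}$ be the $N \times N$ diagonal matrix with $g'(\mu_{ij})$ as diagonal elements, $i = 1, \ldots, K$, $j = 1, \ldots, n_i$. Let $\mathbf{B}_i$ be the $n_i \times n_i$ diagonal matrix corresponding to cluster $i$ with $g'(\mu_{ij})$ as the $j$th diagonal element.
Let $\mathbf{W}$ be the block diagonal weight matrix whose $i$th block, corresponding to the $i$th cluster, is the $n_i \times n_i$ matrix
$$\mathbf{W}_i = \mathbf{B}_i^{-1}\,\mathbf{A}_i^{-1/2}\,\mathbf{R}_i^{-1}\,\mathbf{A}_i^{-1/2}\,\mathbf{B}_i^{-1}$$
where $\mathbf{R}_i$ is the working correlation matrix for cluster $i$.
Let
$$\mathbf{Q}_i = \mathbf{X}_i\,(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\,\mathbf{X}_i'$$
where $\mathbf{X}_i$ is the $n_i \times p$ design matrix corresponding to cluster $i$.
Define the adjusted residual vector as
$$\mathbf{E} = \mathbf{B}\,(\mathbf{Y} - \boldsymbol{\mu})$$
and $\mathbf{E}_i = \mathbf{B}_i (\mathbf{Y}_i - \boldsymbol{\mu}_i)$, the estimated residual for the $i$th cluster.
Let the subscript $[i]$ denote estimates evaluated without the $i$th cluster, $[it]$ estimates evaluated using all the data except the $t$th observation of the $i$th cluster, and let $[t]$ denote matrices corresponding to the $i$th cluster without the $t$th observation.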
The following statistics are available for generalized estimating equation models.
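The leverage of cluster $i$ is contained in the matrix $\mathbf{H}_i = \mathbf{Q}_i \mathbf{W}_i$, and is summarized by the trace of $\mathbf{H}_i$,
$$c_i = \mathrm{tr}(\mathbf{H}_i)$$
The leverage $h_{it}$ of the $t$th observation in the $i$th cluster is the $t$th diagonal element of $\mathbf{H}_i$.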
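The effect of deleting cluster $i$ on the estimated parameter vector is given by the following one-step approximation for $\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{[i]}$:
$$\mathrm{DFBETAC}_i = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\,\mathbf{X}_i'\,(\mathbf{W}_i^{-1} - \mathbf{Q}_i)^{-1}\,\mathbf{E}_i$$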
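The cluster deletion statistic DFBETAC can be standardized using the variances of $\hat{\boldsymbol{\beta}}$ based on the complete data. The standardized one-step approximation for the change in $\hat{\beta}_j$ due to deletion of cluster $i$ is
$$\mathrm{DFBETACS}_{ij} = \frac{\mathrm{DFBETAC}_{ij}}{[(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}_{jj}]^{1/2}}$$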
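The cluster leverage and the cluster deletion approximation can be illustrated with a small plain-Python sketch. Everything named here is an assumption made for illustration: the helpers (`inv`, `mul`, `tr`, `exch`, `fit`, `cluster_diagnostics`), the four made-up clusters, and the fixed exchangeable working correlation with `alpha = 0.3`. The sketch takes the identity link with $V(\mu) = 1$ and no weights, so that $\mathbf{W}_i = \mathbf{R}_i^{-1}$, and it assumes the forms $\mathbf{H}_i = \mathbf{Q}_i\mathbf{W}_i$ and $(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}_i'(\mathbf{W}_i^{-1} - \mathbf{Q}_i)^{-1}\mathbf{E}_i$ for the leverage matrix and the deletion statistic. In this linear setting with a fixed working correlation, the one-step value equals the exact change obtained by refitting without the cluster.

```python
def inv(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def mul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def tr(a):
    """Matrix transpose."""
    return [list(col) for col in zip(*a)]


def exch(n, alpha):
    """Exchangeable working correlation matrix R_i of size n."""
    return [[1.0 if i == j else alpha for j in range(n)] for i in range(n)]


def fit(clusters, alpha):
    """Solve the estimating equations for the identity link with fixed alpha.

    Returns beta-hat and (X'WX)^{-1}; here W_i = R_i^{-1}.
    """
    p = len(clusters[0][0][0])
    xtwx = [[0.0] * p for _ in range(p)]
    xtwy = [0.0] * p
    for Xi, yi in clusters:
        XtW = mul(tr(Xi), inv(exch(len(yi), alpha)))
        A = mul(XtW, Xi)
        b = mul(XtW, [[v] for v in yi])
        for r in range(p):
            xtwy[r] += b[r][0]
            for c in range(p):
                xtwx[r][c] += A[r][c]
    m_inv = inv(xtwx)
    beta = [sum(m_inv[r][c] * xtwy[c] for c in range(p)) for r in range(p)]
    return beta, m_inv


def cluster_diagnostics(clusters, alpha):
    """Cluster leverage c_i = tr(Q_i W_i) and the one-step cluster deletion vector."""
    beta, m_inv = fit(clusters, alpha)
    out = []
    for Xi, yi in clusters:
        Vi = exch(len(yi), alpha)                      # W_i^{-1}
        Qi = mul(mul(Xi, m_inv), tr(Xi))               # Q_i = X_i (X'WX)^{-1} X_i'
        Hi = mul(Qi, inv(Vi))                          # H_i = Q_i W_i
        ci = sum(Hi[t][t] for t in range(len(yi)))
        Ei = [[y - sum(b * x for b, x in zip(beta, row))] for row, y in zip(Xi, yi)]
        VmQ = [[v - q for v, q in zip(rv, rq)] for rv, rq in zip(Vi, Qi)]
        d = mul(mul(mul(m_inv, tr(Xi)), inv(VmQ)), Ei)
        out.append({"c": ci, "dfbetac": [row[0] for row in d]})
    return beta, out


# Hypothetical clusters: (design rows, responses), intercept plus one covariate.
clusters = [
    ([[1.0, 0.0], [1.0, 1.0]], [1.0, 1.8]),
    ([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [0.5, 2.2, 2.9]),
    ([[1.0, 1.0], [1.0, 3.0]], [2.4, 3.9]),
    ([[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]], [1.2, 3.1, 5.5]),
]
beta, diag = cluster_diagnostics(clusters, alpha=0.3)
```

In this sketch the cluster leverages sum to the number of parameters, which gives a quick sanity check on any implementation. With a nonlinear link or a re-estimated working correlation, the deletion vector is only an approximation to the refitted difference.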
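Partition the matrices $\mathbf{Q}_i$ and $\mathbf{W}_i^{-1}$ as
$$\mathbf{Q}_i = \begin{pmatrix} Q_{it} & \mathbf{Q}_{it[t]} \\ \mathbf{Q}_{i[t]t} & \mathbf{Q}_{i[t]} \end{pmatrix}, \qquad \mathbf{W}_i^{-1} = \mathbf{V}_i = \begin{pmatrix} V_{it} & \mathbf{V}_{it[t]} \\ \mathbf{V}_{i[t]t} & \mathbf{V}_{i[t]} \end{pmatrix}$$
and let $\tilde{\mathbf{X}}_{it} = \mathbf{X}_{it} - \mathbf{V}_{it[t]}\mathbf{V}_{i[t]}^{-1}\mathbf{X}_{i[t]}$ and $\tilde{W}_{it}^{-1} = V_{it} - \mathbf{V}_{it[t]}\mathbf{V}_{i[t]}^{-1}\mathbf{V}_{i[t]t}$, where $\mathbf{X}_{it}$ is the row of $\mathbf{X}_i$ for the $t$th observation and $\mathbf{X}_{i[t]}$ contains the remaining rows.
The effect of deleting the $t$th observation from the $i$th cluster is given by the following one-step approximation to $\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{[it]}$:
$$\mathrm{DFBETAO}_{it} = \frac{(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\,\tilde{\mathbf{X}}_{it}'\,\tilde{E}_{it}}{\tilde{W}_{it}^{-1} - \tilde{Q}_{it}}$$
where $\tilde{E}_{it} = E_{it} - \mathbf{V}_{it[t]}\mathbf{V}_{i[t]}^{-1}\mathbf{E}_{i[t]}$ and $\tilde{Q}_{it} = \tilde{\mathbf{X}}_{it}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\tilde{\mathbf{X}}_{it}'$. Note that $\tilde{E}_{it}$, $\tilde{Q}_{it}$, and $\tilde{W}_{it}$ are scalars.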
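The observation deletion statistic DFBETAO can be standardized using the variances of $\hat{\boldsymbol{\beta}}$ based on the complete data. The standardized one-step approximation for the change in $\hat{\beta}_j$ due to deletion of observation $t$ in cluster $i$ is
$$\mathrm{DFBETAOS}_{itj} = \frac{\mathrm{DFBETAO}_{itj}}{[(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}_{jj}]^{1/2}}$$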
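A measure of the standardized influence of the subset $m$ of observations on the overall fit is $(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{[m]})'(\mathbf{X}'\mathbf{W}\mathbf{X})(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{[m]}) / (p\phi)$. For deletion of cluster $i$, this is approximated by
$$\mathrm{DCLS}_i = \frac{\mathbf{E}_i'\,(\mathbf{W}_i^{-1} - \mathbf{Q}_i)^{-1}\,\mathbf{Q}_i\,(\mathbf{W}_i^{-1} - \mathbf{Q}_i)^{-1}\,\mathbf{E}_i}{p\phi}$$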
The measure of overall fit in the section DCLS | CLUSTERCOOKD | CLUSTERCOOKSD for the deletion of the $t$th observation in the $i$th cluster is approximated by
$$\mathrm{DOBS}_{it} = \frac{\tilde{E}_{it}^2\,\tilde{Q}_{it}}{p\phi\,(\tilde{W}_{it}^{-1} - \tilde{Q}_{it})^2}$$
where $\tilde{E}_{it}$, $\tilde{Q}_{it}$, and $\tilde{W}_{it}$ are defined in the section DFBETAO. In the case of the independence working correlation, this is equal to the measure for ordinary generalized linear models defined in the section DOBS | COOKD | COOKSD.
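The equality under the independence working correlation can be checked with a short derivation. This is a sketch under stated assumptions: the observation deletion measure has the form $\tilde{E}_{it}^2 \tilde{Q}_{it} / [p\phi(\tilde{W}_{it}^{-1} - \tilde{Q}_{it})^2]$, the generalized linear model measure is $p^{-1} h (1 - h)^{-1} r_p^2$ with Fisher scoring weight $w_e = w / [\phi V(\mu) g'(\mu)^2]$, and the subscript $it$ is dropped on the right-hand sides for readability. With $\mathbf{R}_i = \mathbf{I}$ the off-diagonal blocks of $\mathbf{W}_i^{-1}$ vanish, so the tilde quantities reduce to their unadjusted versions:

$$
\begin{aligned}
\tilde{E}_{it} &= g'(\mu)\,(y - \mu), \qquad \tilde{W}_{it}^{-1} = \frac{g'(\mu)^2\, V(\mu)}{w} = \frac{1}{\phi\, w_e}, \\
\tilde{Q}_{it} &= \mathbf{x}'(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{x} = h\,\tilde{W}_{it}^{-1}
\quad\text{because}\quad h = w_e\,\mathbf{x}'(\mathbf{X}'\mathbf{W}_e\mathbf{X})^{-1}\mathbf{x}
\ \text{and}\ \mathbf{W}_e = \mathbf{W}/\phi, \\
\frac{\tilde{E}_{it}^2\,\tilde{Q}_{it}}{p\phi\,(\tilde{W}_{it}^{-1} - \tilde{Q}_{it})^2}
&= \frac{g'(\mu)^2 (y - \mu)^2\, h\, \tilde{W}_{it}}{p\phi\,(1 - h)^2}
= \frac{1}{p}\,\frac{h}{1 - h}\,\frac{(y - \mu)^2}{\{\phi V(\mu)/w\}\,(1 - h)}
= \frac{1}{p}\,\frac{h}{1 - h}\, r_p^2 .
\end{aligned}
$$

The last step uses $v = \phi V(\mu) / w$ and the definition of the standardized Pearson residual, $r_p = (y - \mu) / \sqrt{v (1 - h)}$.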
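A studentized distance measure of the influence of the $i$th cluster, of the type defined in the section DCLS | CLUSTERCOOKD | CLUSTERCOOKSD, is given by
$$\mathrm{MCLS}_i = \frac{\mathbf{E}_i'\,(\mathbf{W}_i^{-1} - \mathbf{Q}_i)^{-1}\,\mathbf{H}_i\,\mathbf{E}_i}{p\phi}$$
where $\mathbf{H}_i = \mathbf{Q}_i\mathbf{W}_i$ is the leverage matrix of cluster $i$.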