Introduction to Statistical Modeling with SAS/STAT Software

Residual Analysis

The model errors $\bepsilon = \bY - \bX \bbeta$ are unobservable. Yet important features of the statistical model are connected to them, such as the distribution of the data, the correlation among observations, and the constancy of variance. It is customary to diagnose and investigate features of the model errors through the fitted residuals $\widehat{\bepsilon } = \bY - \widehat{\bY } = \bY - \bH \bY = \bM \bY$ . These residuals are projections of the data onto the null space of $\bX$ and are also referred to as the "raw" residuals to contrast them with other forms of residuals that are transformations of $\widehat{\bepsilon }$ . For the classical linear model, the statistical properties of $\widehat{\bepsilon }$ are affected by the features of that projection and can be summarized as follows:

$\begin{align*} \mr{E}[\widehat{\bepsilon }] & = \mb{0} \\ \mr{Var}[\widehat{\bepsilon }] & = \sigma ^2\bM \\ \mr{rank}(\bM ) & = n - \mr{rank}(\bX ) \end{align*}$

Furthermore, if $\bepsilon \sim N(\mb{0},\sigma ^2\bI )$ , then $\widehat{\bepsilon } \sim N(\mb{0},\sigma ^2\bM )$ .

Because $\bM = \bI - \bH$ , and the "hat" matrix $\bH$ satisfies $\partial \widehat{\bY }/\partial \bY$ , the hat matrix is also the leverage matrix of the model. If $h_{ii}$ denotes the ith diagonal element of $\bH$ (the leverage of observation i), then the leverages are bounded in a model with intercept, $1/n \leq h_{ii} \leq 1$ . Consequently, the variance of a raw residual is less than that of an observation: $\mr{Var}[\widehat{\epsilon }_ i] = \sigma ^2(1 - h_{ii}) < \sigma ^2$ . In applications where the variability of the data is estimated from fitted residuals, the estimate is invariably biased low. An example is the computation of an empirical semivariogram based on fitted (detrended) residuals.

More important, the diagonal entries of $\bH$ are not necessarily identical; the residuals are heteroscedastic. The "hat" matrix is also not a diagonal matrix; the residuals are correlated. In summary, the only property that the fitted residuals $\widehat{\bepsilon }$ share with the model errors is a zero mean. It is thus commonplace to use transformations of the fitted residuals for diagnostic purposes.

Raw and Studentized Residuals

A standardized residual is a raw residual that is divided by its standard deviation:

$\widehat{\epsilon }_{i}^* = \frac{Y_ i - \widehat{Y}_ i}{\sqrt {\mr{Var}[Y_ i - \widehat{Y}_ i]}} = \frac{\widehat{\epsilon }_ i}{\sqrt {\sigma ^2(1-h_{ii})}}$

Because $\sigma ^2$ is unknown, residual standardization is usually not practical. A studentized residual is a raw residual that is divided by its estimated standard deviation. If the estimate of the standard deviation is based on the same data that were used in fitting the model, the residual is also called an internally studentized residual:

$\widehat{\epsilon }_{is} = \frac{Y_ i - \widehat{Y}_ i}{\sqrt {\widehat{\mr{Var}}[Y_ i - \widehat{Y}_ i]}} = \frac{\widehat{\epsilon }_ i}{\sqrt {\widehat{\sigma }^2(1-h_{ii})}}$

If the estimate of the residual’s variance does not involve the ith observation, it is called an externally studentized residual. Suppose that $\widehat{\sigma }^2_{-i}$ denotes the estimate of the residual variance obtained without the ith observation; then the externally studentized residual is

$\widehat{\epsilon }_{ir} = \frac{\widehat{\epsilon }_ i}{\sqrt {\widehat{\sigma }^2_{-i}(1-h_{ii})}}$

Scaled Residuals

A scaled residual is simply a raw residual divided by a scalar quantity that is not an estimate of the variance of the residual. For example, residuals divided by the standard deviation of the response variable are scaled and referred to as Pearson or Pearson-type residuals:

$\widehat{\epsilon }_{ic} = \frac{Y_ i - \widehat{Y}_ i}{\sqrt {\widehat{\mr{Var}}[Y_ i]}}$

In generalized linear models, where the variance of an observation is a function of the mean $\mu$ and possibly of an extra scale parameter, $\mr{Var}[Y] = a(\mu )\phi$ , the Pearson residual is

$\widehat{\epsilon }_{iP} = \frac{Y_ i - \widehat{\mu }_ i}{\sqrt {a(\widehat{\mu })}}$

because the sum of the squared Pearson residuals equals the Pearson $X^2$ statistic:

$X^2 = \sum _{i=1}^ n \widehat{\epsilon }_{iP}^2$

When the scale parameter $\phi$ participates in the scaling, the residual is also referred to as a Pearson-type residual:

$\widehat{\epsilon }_{iP} = \frac{Y_ i - \widehat{\mu }_ i}{\sqrt {a(\widehat{\mu })\phi }}$

Other Residuals

You might encounter other residuals in SAS/STAT software. A "leave-one-out" residual is the difference between the observed value and the residual obtained from fitting a model in which the observation in question did not participate. If $\widehat{Y}_ i$ is the predicted value of the ith observation and $\widehat{Y}_{i,-i}$ is the predicted value if $Y_ i$ is removed from the analysis, then the "leave-one-out" residual is

$\widehat{\epsilon }_{i,-i} = Y_ i - \widehat{Y}_{i,-i}$

Since the sum of the squared "leave-one-out" residuals is the PRESS statistic (prediction sum of squares; Allen 1974), $\widehat{\epsilon }_{i,-i}$ is also called the PRESS residual. The concept of the PRESS residual can be generalized if the deletion residual can be based on the removal of sets of observations. In the classical linear model, the PRESS residual for case deletion has a particularly simple form:

$\widehat{\epsilon }_{i,-i} = Y_ i - \widehat{Y}_{i,-i} = \frac{\widehat{\epsilon }_ i}{1 - h_{ii}}$

That is, the PRESS residual is simply a scaled form of the raw residual, where the scaling factor is a function of the leverage of the observation.

When data are correlated, $\mr{Var}[\bY ] = \bV$ , you can scale the vector of residuals rather than scale each residual separately. This takes the covariances among the observations into account. This form of scaling is accomplished by forming the Cholesky root $\bC ’\bC = \bV$ , where $\bC ’$ is a lower-triangular matrix. Then $\bC ’^{-1}\bY$ is a vector of uncorrelated variables with unit variance. The Cholesky residuals in the model $\bY = \bX \bbeta + \bepsilon$ are

$\widehat{\bepsilon }_ C = \bC ’^{-1}\left(\bY - \bX \widehat{\bbeta }\right)$

In generalized linear models, the fit of a model can be measured by the scaled deviance statistic $D^*$ . It measures the difference between the log likelihood under the model and the maximum log likelihood that is achievable. In models with a scale parameter $\phi$ , the deviance is $D = \phi \times D^* = \sum _{i=1}^ n d_ i$ . The deviance residuals are the signed square roots of the contributions to the deviance statistic:

$\widehat{\epsilon }_{id} = \mr{sign}\{ y_ i - \widehat{\mu }_ i\} \, \sqrt {d_ i}$