Introduction to Statistical Modeling with SAS/STAT Software


Test of Hypotheses

Consider a general linear hypothesis of the form $H\colon \bL \bbeta = \mb{d}$, where $\bL $ is a $(k \times p)$ matrix. It is assumed that $\mb{d}$ is such that this hypothesis is linearly consistent; that is, there exists some $\bbeta $ for which $\bL \bbeta = \mb{d}$. This is always the case if $\mb{d}$ is in the column space of $\bL $, if $\bL $ has full row rank, or if $\mb{d}=\mb{0}$; the last of these is the most common case. Since many linear models have a rank-deficient $\bX $ matrix, the question arises whether the hypothesis is testable. The idea of testability of a hypothesis is, not surprisingly, connected to the concept of estimability as introduced previously. The hypothesis $H\colon \bL \bbeta = \mb{d}$ is testable if it consists of estimable functions.
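As an illustration of testability, the following Python sketch checks whether the rows of $\bL $ are estimable by using the condition $\bL (\bX '\bX )^{-}\bX '\bX = \bL $ that appears later in this section. The one-way design, the helper name `is_testable`, and the use of the Moore-Penrose inverse as the generalized inverse are assumptions made for the example, not part of SAS/STAT software.

```python
import numpy as np

# Hypothetical one-way layout: intercept plus three group indicators.
# X has 4 columns but rank 3, so beta itself is not estimable.
groups = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(12), np.eye(3)[groups]])

def is_testable(L, X, tol=1e-8):
    """Return True if L (X'X)^- X'X equals L, i.e. every row of L is estimable."""
    XtX = X.T @ X
    G = np.linalg.pinv(XtX, rcond=1e-10)   # one particular generalized inverse
    return bool(np.allclose(L @ G @ XtX, L, atol=tol))

L_contrast = np.array([[0.0, 1.0, -1.0, 0.0],
                       [0.0, 1.0, 0.0, -1.0]])   # equality of group effects
L_single = np.array([[0.0, 1.0, 0.0, 0.0]])      # a single group effect

print(is_testable(L_contrast, X))
print(is_testable(L_single, X))
```

In this design the contrasts among group effects are estimable, whereas a single group effect on its own is not, so only the first hypothesis is testable.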

There are two important approaches to testing hypotheses in statistical applications—the reduction principle and the linear inference approach. The reduction principle states that the validity of the hypothesis can be inferred by comparing a suitably chosen summary statistic between the model at hand and a reduced model in which the constraint $\bL \bbeta = \mb{d}$ is imposed. The linear inference approach relies on the fact that $\widehat{\bbeta }$ is an estimator of $\bbeta $ and its stochastic properties are known, at least approximately. A test statistic can then be formed using $\widehat{\bbeta }$, and its behavior under the restriction $\bL \bbeta = \mb{d}$ can be ascertained.

The two principles lead to identical results in certain cases, for example, least squares estimation in the classical linear model. In more complex situations the two approaches lead to similar but not identical results. This is the case, for example, when weights or unequal variances are involved, or when $\widehat{\bbeta }$ is a nonlinear estimator.

Reduction Tests

The two main reduction principles are the sum of squares reduction test and the likelihood ratio test. The test statistic in the former is proportional to the difference of the residual sum of squares between the reduced model and the full model. The test statistic in the likelihood ratio test is proportional to the difference of the log likelihoods between the full and reduced models. To fix these ideas, suppose that you are fitting the model $\bY = \bX \bbeta + \bepsilon $, where $\bepsilon \sim N(\mb{0},\sigma ^2\bI )$. Suppose that $\mr{SSR}$ denotes the residual sum of squares in this model and that $\mr{SSR}_ H$ is the residual sum of squares in the model for which $\bL \bbeta = \mb{d}$ holds. Then under the hypothesis the ratio

\[ (\mr{SSR}_ H - \mr{SSR}) / \sigma ^2 \]

follows a chi-square distribution with degrees of freedom equal to the rank of $\bL $. Perhaps surprisingly, the residual sum of squares in the full model is distributed independently of this quantity, so that under the hypothesis,

\[ F = \frac{ (\mr{SSR}_ H - \mr{SSR})/\mr{rank}(\bL ) }{ \mr{SSR} / (n - \mr{rank}(\bX )) } \]

follows an F distribution with $\mr{rank}(\bL )$ numerator and $n - \mr{rank}(\bX )$ denominator degrees of freedom. Note that the quantity in the denominator of the F statistic is a particular estimator of $\sigma ^2$—namely, the unbiased moment-based estimator that is customarily associated with least squares estimation. It is also the restricted maximum likelihood estimator of $\sigma ^2$ if $\bY $ is normally distributed.
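The sum of squares reduction test can be sketched numerically as follows. The simulated data, the one-way design, and the helper `rss` are assumptions for illustration; the reduced model is the intercept-only model implied by the hypothesis of equal group effects.

```python
import numpy as np

rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(12), np.eye(3)[groups]])   # 4 columns, rank 3
y = np.array([1.0, 2.0, 3.0])[groups] + rng.normal(size=12)

def rss(X, y):
    """Residual sum of squares from least squares; X may be rank deficient."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=1e-10)
    resid = y - X @ beta
    return float(resid @ resid)

# Under H (equal group effects) the mean reduces to a single constant,
# so the reduced model keeps only the intercept column.
X_H = X[:, :1]
ssr, ssr_H = rss(X, y), rss(X_H, y)

n = len(y)
rank_X = np.linalg.matrix_rank(X)
rank_L = 2          # two independent contrasts among three groups
F = ((ssr_H - ssr) / rank_L) / (ssr / (n - rank_X))
print(F)
```

For this balanced layout the result coincides with the usual one-way analysis of variance F statistic with 2 and 9 degrees of freedom.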

In the case of the likelihood ratio test, suppose that $l(\widehat{\bbeta },\widehat{\sigma }^2;\mb{y})$ denotes the log likelihood evaluated at the ML estimators. Also suppose that $l(\widehat{\bbeta }_ H,\widehat{\sigma }^2_ H;\mb{y})$ denotes the log likelihood in the model for which $\bL \bbeta = \mb{d}$ holds. Then under the hypothesis the statistic

\[ \lambda = 2 \left( l(\widehat{\bbeta } ,\widehat{\sigma }^2 ;\mb{y}) - l(\widehat{\bbeta }_ H,\widehat{\sigma }^2_ H;\mb{y}) \right) \]

follows approximately a chi-square distribution with degrees of freedom equal to the rank of $\bL $. In the case of a normally distributed response, the log-likelihood function can be profiled with respect to $\bbeta $. The resulting profile log likelihood is

\[ l(\widehat{\sigma }^2;\mb{y}) = -\frac{n}{2}\log \{ 2\pi \} - \frac{n}{2} \left(\log \{ \widehat{\sigma }^2\} + 1 \right) \]

and the likelihood ratio test statistic becomes

\[ \lambda = n \left( \log \left\{ \widehat{\sigma }^2_ H\right\} - \log \left\{ \widehat{\sigma }^2 \right\} \right) = n \left( \log \left\{ \mr{SSR}_ H \right\} - \log \left\{ \mr{SSR} \right\} \right) = n \left( \log \left\{ \mr{SSR}_ H / \mr{SSR}\right\} \right) \]
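A short numerical check of this identity, assuming hypothetical values for the two residual sums of squares: the helper `profile_loglik` evaluates the normal log likelihood at the ML estimate $\widehat{\sigma }^2 = \mr{SSR}/n$, and the additive constants cancel in the difference.

```python
import numpy as np

def profile_loglik(ssr, n):
    """Normal log likelihood at the ML estimates, with sigma2_hat = SSR / n."""
    sigma2_hat = ssr / n
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * (np.log(sigma2_hat) + 1.0)

# Hypothetical residual sums of squares from a full and a reduced fit.
n, ssr, ssr_H = 12, 8.0, 20.0

lam_from_loglik = 2.0 * (profile_loglik(ssr, n) - profile_loglik(ssr_H, n))
lam_from_ssr = n * np.log(ssr_H / ssr)
print(lam_from_loglik, lam_from_ssr)   # the two values agree up to rounding
```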

The preceding expressions show that, in the case of normally distributed data, both reduction principles lead to simple functions of the residual sums of squares in two models. As Pawitan (2001, p. 151) puts it, there is, however, an important difference not in the computations but in the statistical content. The least squares principle, where sum of squares reduction tests are widely used, does not require a distributional specification. Assumptions about the distribution of the data are added to provide a framework for confirmatory inferences, such as the testing of hypotheses. This framework stems directly from the assumption about the data’s distribution, or from the sampling distribution of the least squares estimators. The likelihood principle, on the other hand, requires a distributional specification at the outset. Inference about the parameters is implicit in the model; it is the result of further computations following the estimation of the parameters. In the least squares framework, inference about the parameters is the result of further assumptions.

Linear Inference

The principle of linear inference is to formulate a test statistic for $H\colon \bL \bbeta = \mb{d}$ that builds on the linearity of the hypothesis about $\bbeta $. For many models that have linear components, the estimator $\bL \widehat{\bbeta }$ is also linear in $\bY $. It is then simple to establish the distributional properties of $\bL \widehat{\bbeta }$ based on the distributional assumptions about $\bY $ or based on large-sample arguments. For example, $\widehat{\bbeta }$ might be a nonlinear estimator, but it is known to asymptotically follow a normal distribution; this is the case in many nonlinear and generalized linear models.

If the sampling distribution or the asymptotic distribution of $\widehat{\bbeta }$ is normal, then one can easily derive quadratic forms with known distributional properties. For example, if the random vector $\bU $ is distributed as $N(\bmu ,\bSigma )$, then $\bU '\bA \bU $ follows a chi-square distribution with $\mr{rank}(\bA )$ degrees of freedom and noncentrality parameter $\frac{1}{2}\bmu '\bA \bmu $, provided that $\bA \bSigma \bA \bSigma = \bA \bSigma $.

In the classical linear model, suppose that $\bX $ is deficient in rank and that $\widehat{\bbeta } = \left(\bX '\bX \right)^{-}\bX '\bY $ is a solution to the normal equations. Then, if the errors are normally distributed,

\[ \widehat{\bbeta } \sim N\left(\left(\bX '\bX \right)^{-}\bX '\bX \bbeta , \sigma ^2\left(\bX '\bX \right)^{-}\bX '\bX \left(\bX '\bX \right)^{-}\right) \]

Because $H\colon \bL \bbeta = \mb{d}$ is testable, $\bL \bbeta $ is estimable, and thus $\bL (\bX '\bX )^{-}\bX '\bX = \bL $, as established in the previous section. Hence,

\[ \bL \widehat{\bbeta } \sim N\left(\bL \bbeta ,\sigma ^2\bL \left(\bX ’\bX \right)^{-}\bL ^\prime \right) \]

The conditions for a chi-square distribution of the quadratic form

\[ (\bL \widehat{\bbeta } - \mb{d})' \left(\bL (\bX '\bX )^{-}\bL '\right)^{-} (\bL \widehat{\bbeta } - \mb{d}) \]

are thus met, provided that

\[ (\bL (\bX '\bX )^{-}\bL ')^{-}\bL (\bX '\bX )^{-}\bL ' (\bL (\bX '\bX )^{-}\bL ')^{-}\bL (\bX '\bX )^{-}\bL ' = (\bL (\bX '\bX )^{-}\bL ')^{-}\bL (\bX '\bX )^{-}\bL ' \]

This condition is obviously met if $\bL (\bX '\bX )^{-}\bL '$ is of full rank. The condition is also met if $(\bL (\bX '\bX )^{-}\bL ')^{-}$ is a reflexive inverse (a $g_2$-inverse) of $\bL (\bX '\bX )^{-}\bL '$.
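The condition can be checked numerically for a singular case. In the sketch below, which assumes a hypothetical one-way design with three linearly dependent contrasts, the Moore-Penrose inverse computed by `numpy.linalg.pinv` serves as one example of a reflexive inverse.

```python
import numpy as np

groups = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(12), np.eye(3)[groups]])
G = np.linalg.pinv(X.T @ X, rcond=1e-10)     # a generalized inverse of X'X

# Three pairwise contrasts; only two are linearly independent, so
# L (X'X)^- L' is a singular 3 x 3 matrix.
L = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, 1.0, 0.0, -1.0],
              [0.0, 0.0, 1.0, -1.0]])
S = L @ G @ L.T
A = np.linalg.pinv(S, rcond=1e-10)           # Moore-Penrose inverse of S

AS = A @ S
rank_S = np.linalg.matrix_rank(S, tol=1e-10)
print(np.allclose(AS @ AS, AS))      # idempotency condition from the text
print(np.allclose(A @ S @ A, A))     # reflexive property of the inverse
print(rank_S)
```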

The test statistic to test the linear hypothesis $H\colon \bL \bbeta = \mb{d}$ is thus

\[ F = \frac{(\bL \widehat{\bbeta } - \mb{d})' \left(\bL (\bX '\bX )^{-}\bL '\right)^{-} (\bL \widehat{\bbeta } - \mb{d}) / \mr{rank}(\bL ) }{\mr{SSR} / (n - \mr{rank}(\bX )) } \]

and it follows an F distribution with $\mr{rank}(\bL )$ numerator and $n-\mr{rank}(\bX )$ denominator degrees of freedom under the hypothesis.
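A minimal sketch of this computation for a rank-deficient design follows. The simulated data and the choice of the Moore-Penrose inverse for both generalized inverses are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
groups = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(12), np.eye(3)[groups]])
y = np.array([1.0, 2.0, 3.0])[groups] + rng.normal(size=12)

L = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, 1.0, 0.0, -1.0]])
d = np.zeros(2)

G = np.linalg.pinv(X.T @ X, rcond=1e-10)     # a generalized inverse of X'X
beta_hat = G @ X.T @ y                       # one solution of the normal equations
resid = y - X @ beta_hat
ssr = float(resid @ resid)

n = len(y)
rank_X = np.linalg.matrix_rank(X)
rank_L = np.linalg.matrix_rank(L)

# Quadratic form in L beta_hat - d, then the F ratio.
diff = L @ beta_hat - d
quad = float(diff @ np.linalg.pinv(L @ G @ L.T, rcond=1e-10) @ diff)
F = (quad / rank_L) / (ssr / (n - rank_X))
print(F)
```

Because $\bL \bbeta $ is estimable, $\bL \widehat{\bbeta }$ and $\bL (\bX '\bX )^{-}\bL '$ do not depend on which generalized inverse of $\bX '\bX $ is chosen.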

This test statistic looks very similar to the F statistic for the sum of squares reduction test. This is no accident. If the model is linear and parameters are estimated by ordinary least squares, then you can show that the quadratic form $(\bL \widehat{\bbeta } - \mb{d})' \left(\bL (\bX '\bX )^{-}\bL '\right)^{-}(\bL \widehat{\bbeta } - \mb{d})$ equals the difference in the residual sums of squares, $\mr{SSR}_ H - \mr{SSR}$, where $\mr{SSR}_ H$ is obtained as the residual sum of squares from OLS estimation in a model that satisfies $\bL \bbeta = \mb{d}$. However, this correspondence between the two test formulations does not apply when a different estimation principle is used. For example, assume that $\bepsilon \sim N(\mb{0},\bV )$ and that $\bbeta $ is estimated by generalized least squares:

\[ \widehat{\bbeta }_ g = \left(\bX '\bV ^{-1}\bX \right)^{-}\bX '\bV ^{-1}\bY \]

The construction of $\bL $ matrices associated with hypotheses in SAS/STAT software is frequently based on the properties of the $\bX $ matrix, not of $\bX '\bV ^{-1}\bX $. In other words, the construction of the $\bL $ matrix is governed only by the design. A sum of squares reduction test for $H\colon \bL \bbeta = \mb{0}$ that uses the generalized residual sum of squares $(\bY - \bX \widehat{\bbeta }_ g)'\bV ^{-1}(\bY - \bX \widehat{\bbeta }_ g)$ is not identical to a linear hypothesis test with the statistic

\[ F^*= \frac{\widehat{\bbeta }_ g^\prime \bL ^\prime \left(\bL \left(\bX '\bV ^{-1}\bX \right)^{-}\bL '\right)^{-} \bL \widehat{\bbeta }_ g}{\mr{rank}(\bL ) } \]

Furthermore, $\bV $ is usually unknown and must be estimated as well. The estimate for $\bV $ depends on the model, and imposing a constraint on the model would change the estimate. Asymptotically, $\mr{rank}(\bL )\, F^*$ follows a chi-square distribution with $\mr{rank}(\bL )$ degrees of freedom. However, in practical applications the F distribution with $\mr{rank}(\bL )$ numerator and $\nu $ denominator degrees of freedom is often used because it provides a better approximation to the sampling distribution of $F^*$ in finite samples. The computation of the denominator degrees of freedom $\nu $, however, is a matter of considerable discussion. A number of methods have been proposed and are implemented in various forms in SAS/STAT (see, for example, the degrees-of-freedom methods in the MIXED and GLIMMIX procedures).
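The statistic $F^*$ can be sketched as follows under the simplifying assumption that $\bV $ is known. The diagonal $\bV $ with group-specific variances and the simulated data are hypothetical; in practice $\bV $ would be replaced by an estimate, and the choice of denominator degrees of freedom is not addressed here.

```python
import numpy as np

rng = np.random.default_rng(2)
groups = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(12), np.eye(3)[groups]])

# Assumed known covariance: independent errors with group-specific variances.
V = np.diag(np.array([0.5, 1.0, 2.0])[groups])
y = np.array([1.0, 2.0, 3.0])[groups] + rng.normal(size=12) * np.sqrt(np.diag(V))

L = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, 1.0, 0.0, -1.0]])

V_inv = np.linalg.inv(V)
G = np.linalg.pinv(X.T @ V_inv @ X, rcond=1e-10)   # generalized inverse of X'V^{-1}X
beta_g = G @ X.T @ V_inv @ y                       # generalized least squares solution

Lb = L @ beta_g
rank_L = np.linalg.matrix_rank(L)
F_star = float(Lb @ np.linalg.pinv(L @ G @ L.T, rcond=1e-10) @ Lb) / rank_L
print(F_star)
```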