Parameter Estimates and Associated Statistics

Least Squares Estimators

Least squares estimators of the regression parameters are found by solving the normal equations

\[  \left(\bX ^\prime \bW \bX \right)\bbeta = \bX ^{\prime } \bW \mb {Y}  \]

for the vector $\bbeta $, where $\bW $ is a diagonal matrix of observed weights. The resulting estimator of the parameter vector is

\[  \widehat{\bbeta } = \left(\bX ^{\prime }\bW \bX \right)^{-1}\bX ^{\prime }\bW \mb {Y}  \]

This is an unbiased estimator, because

\begin{align*}  \mr {E}\left[\widehat{\bbeta }\right] &=  \left(\bX ^{\prime }\bW \bX \right)^{-1}\bX ^{\prime }\bW \mr {E}[\bY ] \\ &=  \left(\bX ^{\prime }\bW \bX \right)^{-1}\bX ^{\prime }\bW \bX \bbeta = \bbeta \end{align*}

Notice that the only assumption necessary in order for the least squares estimators to be unbiased is that of a zero mean of the model errors. If the estimator is evaluated at the observed data $\mb {y}$, it is referred to as the least squares estimate,

\[  \widehat{\bbeta } = \left(\bX ^{\prime }\bW \bX \right)^{-1}\bX ^{\prime }\bW \mb {y}  \]
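As a minimal illustration of the normal equations above, the following NumPy sketch computes a weighted least squares estimate for a small made-up data set. The data, the weights, and all variable names are assumptions chosen for illustration; they do not correspond to any SAS procedure output. The system is solved directly instead of forming the inverse explicitly.

```python
import numpy as np

# Hypothetical toy data: an intercept column plus one regressor.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 4.0, 8.0])
w = np.array([1.0, 2.0, 2.0, 1.0])   # assumed observation weights
W = np.diag(w)                       # diagonal weight matrix

# Solve the normal equations (X'WX) beta = X'Wy.
XtWX = X.T @ W @ X
XtWy = X.T @ W @ y
beta_hat = np.linalg.solve(XtWX, XtWy)

# At the solution, the weighted residuals are orthogonal to the columns of X.
residuals = y - X @ beta_hat
print(beta_hat)
print(X.T @ W @ residuals)   # numerically close to zero
```

For these numbers the normal equations reduce to a 2 x 2 system whose solution is (23/33, 23/11).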

If the standard classical assumptions are met, the least squares estimators of the regression parameters are the best linear unbiased estimators (BLUE). In other words, the estimators have minimum variance in the class of estimators that are unbiased and that are linear functions of the responses. If the additional assumption of normally distributed errors is satisfied, then the following are true:

  • The statistics that are computed have the proper sampling distributions for hypothesis testing.

  • Parameter estimators are normally distributed.

  • Various sums of squares are distributed proportional to chi-square, at least under proper hypotheses.

  • Ratios of estimators to standard errors follow the Student’s t distribution under certain hypotheses.

  • Appropriate ratios of sums of squares follow an F distribution for certain hypotheses.

When regression analysis is used to model data that do not meet the assumptions, you should interpret the results in a cautious, exploratory fashion. The significance probabilities under these circumstances are unreliable.

Box (1966) and Mosteller and Tukey (1977, Chapters 12 and 13) discuss the problems that are encountered in regression data, especially when the data are not under experimental control.

Estimating the Precision

Assume for the present that $\bX ^{\prime }\bW \bX $ has full rank k, where k is the number of columns of $\bX $ (this assumption is relaxed later). The variance of the error terms, $\mr {Var}[\epsilon _ i] = \sigma ^2$, is then estimated by the mean square error,

\[  s^2 = \mbox{MSE} = \frac{\mbox{SSE}}{n-k} = \frac{1}{n-k}\sum _{i=1}^ n w_ i\left( y_ i - \mb {x}_ i^{\prime } \widehat{\bbeta } \right)^2  \]

where $\mb {x}_ i^{\prime }$ is the ith row of the design matrix $\bX $. The residual variance estimator is also unbiased: $\mr {E}[s^2] = \sigma ^2$.

The covariance matrix of the least squares estimators is

\[  \mr {Var}\left[\widehat{\bbeta }\right] = \sigma ^2 \left(\bX ^{\prime } \bW \bX \right)^{-1}  \]

An estimate of the covariance matrix is obtained by replacing $\sigma ^2$ with its estimate, $s^2$, in the preceding formula. This estimate is often referred to as COVB in SAS/STAT modeling procedures:

\[  \mbox{COVB} = \widehat{\mr {Var}}\left[\widehat{\bbeta }\right] = s^2 \left(\bX ^{\prime } \bW \bX \right)^{-1}  \]

The correlation matrix of the estimates, often referred to as CORRB, is derived by scaling the covariance matrix. Let $\mb {S} = \mr {diag}\left(\left(\bX ^{\prime }\bW \bX \right)^{-1}\right)^{-\frac{1}{2}}$. Then the correlation matrix of the estimates is

\[  \mbox{CORRB} = \bS \left( \bX ^{\prime } \bW \bX \right)^{-1} \bS  \]

The estimated standard error of the ith parameter estimator is obtained as the square root of the ith diagonal element of the COVB matrix. Formally,

\[  \mr {STDERR}\left(\widehat{\bbeta }_ i\right) = \sqrt {\left[s^2 \left(\bX ^{\prime }\bW \bX \right)^{-1}\right]_{ii}}  \]

The ratio

\[  t = \frac{\widehat{\bbeta }_ i}{\mr {STDERR}\left(\widehat{\bbeta }_ i\right)}  \]

follows a Student’s t distribution with $(n-k)$ degrees of freedom under the hypothesis that $\beta _ i$ is zero and provided that the model errors are normally distributed.

Regression procedures display the t ratio and the significance probability, which is the probability under the hypothesis $H\colon \beta _ i=0$ of a larger absolute t value than was actually obtained. When the probability is less than some small level, the event is considered so unlikely that the hypothesis is rejected.
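The quantities in this section can be sketched in a few lines of NumPy. The data and weights below are made up, and the variable names (covb, corrb, stderr, t_ratio) are chosen here only to mirror the notation in the text. The significance probability is not computed, because it requires the cumulative distribution function of the t distribution.

```python
import numpy as np

# Hypothetical data: intercept plus one regressor, with assumed weights.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
w = np.array([1.0, 1.0, 2.0, 2.0, 1.0])
W = np.diag(w)
n, k = X.shape

XtWX_inv = np.linalg.inv(X.T @ W @ X)
beta_hat = XtWX_inv @ X.T @ W @ y

# Mean square error: weighted SSE divided by n - k.
sse = np.sum(w * (y - X @ beta_hat) ** 2)
s2 = sse / (n - k)

# Estimated covariance matrix of the estimates (COVB in the text).
covb = s2 * XtWX_inv

# Correlation matrix of the estimates (CORRB): scale by S = diag(.)^(-1/2).
S = np.diag(1.0 / np.sqrt(np.diag(XtWX_inv)))
corrb = S @ XtWX_inv @ S

# Standard errors and t ratios for H: beta_i = 0.
stderr = np.sqrt(np.diag(covb))
t_ratio = beta_hat / stderr

print(stderr)
print(t_ratio)
```

Because the factor $s^2$ cancels in the scaling, the correlation matrix has unit diagonal and equals COVB divided elementwise by the outer product of the standard errors.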

Type I SS and Type II SS measure the contribution of a variable to the reduction in SSE. Type I SS measure the reduction in SSE as that variable is entered into the model in sequence. Type II SS are the increment in SSE that results from removing the variable from the full model. Type II SS are equivalent to the Type III and Type IV SS that are reported in PROC GLM. If Type II SS are used in the numerator of an F test, the test is equivalent to the t test for the hypothesis that the parameter is zero. In polynomial models, Type I SS measure the contribution of each polynomial term after it is orthogonalized to the previous terms in the model. The four types of SS are described in Chapter 15: The Four Types of Estimable Functions.
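The relationship between Type I SS, Type II SS, and the t test can be checked numerically. The sketch below assumes an unweighted model (W is the identity) with an intercept and two made-up regressors; the helper function sse is hypothetical and simply refits the model by least squares.

```python
import numpy as np

def sse(X, y):
    """Error sum of squares from an unweighted least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Hypothetical data (assumed for illustration).
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = np.array([1.0, 2.1, 4.2, 4.8, 7.9, 8.1])
ones = np.ones_like(y)

X0 = ones.reshape(-1, 1)                 # intercept only
X1 = np.column_stack([ones, x1])         # intercept, x1
X2 = np.column_stack([ones, x2])         # intercept, x2
X_full = np.column_stack([ones, x1, x2])
n, k = X_full.shape
sse_full = sse(X_full, y)
mse = sse_full / (n - k)

# Type I SS: reduction in SSE as each variable enters in sequence.
type1_x1 = sse(X0, y) - sse(X1, y)
type1_x2 = sse(X1, y) - sse_full

# Type II SS: increase in SSE when a variable is dropped from the full model.
type2_x1 = sse(X2, y) - sse_full
type2_x2 = sse(X1, y) - sse_full

# t ratios from the full model.
XtX_inv = np.linalg.inv(X_full.T @ X_full)
beta_hat = XtX_inv @ X_full.T @ y
t_ratio = beta_hat / np.sqrt(mse * np.diag(XtX_inv))

# F statistics with Type II SS in the numerator equal the squared t ratios.
f_x1 = type2_x1 / mse
f_x2 = type2_x2 / mse
print(f_x1, t_ratio[1] ** 2)
print(f_x2, t_ratio[2] ** 2)
```

For the variable that enters last (x2 here), the Type I and Type II SS coincide, because both compare the full model with the model that omits only that variable.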

Coefficient of Determination

The coefficient of determination in a regression model, also known as the R-square statistic, measures the proportion of variability in the response that is explained by the regressor variables. In a linear regression model with intercept, R square is defined as

\[  R^2 = 1 - \frac{\mr {SSE}}{\mr {SST}}  \]

where SSE is the residual (error) sum of squares and SST is the total sum of squares corrected for the mean. The adjusted R-square statistic is an alternative to R square that takes into account the number of parameters in the model. The adjusted R-square statistic is calculated as

\[  \mr {ADJRSQ} = 1 - \frac{n - i}{n - p} \left(1 - R^2 \right)  \]

where n is the number of observations that are used to fit the model, p is the number of parameters in the model (including the intercept), and i is 1 if the model includes an intercept term, and 0 otherwise.
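A small worked example of both statistics follows. It assumes an unweighted simple linear regression with an intercept on four made-up observations; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data (assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 8.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape   # p counts the intercept
i = 1            # the model includes an intercept

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

sse = np.sum((y - X @ beta_hat) ** 2)    # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)        # total SS corrected for the mean

r2 = 1.0 - sse / sst
adj_r2 = 1.0 - (n - i) / (n - p) * (1.0 - r2)
print(r2, adj_r2)
```

Here SSE = 1.8 and SST = 26, so R square is 1 - 1.8/26 and the adjusted R square is 1 - (3/2)(1.8/26), which is smaller because of the penalty for the number of parameters.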

R-square statistics also play an important indirect role in regression calculations. For example, the proportion of variability in a particular regressor that is explained by regressing it on all other variables in the model can provide insights into the interrelationship among the regressors.

Tolerances and variance inflation factors measure the strength of interrelationships among the regressor variables in the model. If all variables are orthogonal to each other, both tolerance and variance inflation are 1. If a variable is very closely related to other variables, the tolerance approaches 0 and the variance inflation gets very large. Tolerance (TOL) is 1 minus the R square that results from the regression of that regressor on the other variables in the model. Variance inflation (VIF) is the diagonal of $(\bX ^{\prime }\bW \bX )^{-1}$, if $(\bX ^{\prime }\bW \bX )$ is scaled to correlation form. The statistics are related as

\[  \mbox{VIF} = \frac{1}{\mbox{TOL}}  \]
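The two definitions can be compared numerically. The sketch below assumes an unweighted model with an intercept, so that "correlation form" is the ordinary correlation matrix of the regressors; the regressor values are made up.

```python
import numpy as np

# Hypothetical regressors, one per column (assumed for illustration).
Z = np.array([
    [1.0, 2.0, 1.0],
    [2.0, 1.0, 0.0],
    [3.0, 4.0, 0.0],
    [4.0, 3.0, 1.0],
    [5.0, 6.0, 1.0],
    [6.0, 5.0, 0.0],
    [7.0, 8.0, 1.0],
    [8.0, 7.0, 0.0],
])
n, m = Z.shape

# VIF as the diagonal of the inverse of the regressors' correlation matrix.
R = np.corrcoef(Z, rowvar=False)
vif = np.diag(np.linalg.inv(R))

# Tolerance as 1 - R^2 from regressing each regressor on the others.
tol = np.empty(m)
for j in range(m):
    others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, Z[:, j], rcond=None)
    resid = Z[:, j] - others @ coef
    sst = np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
    tol[j] = (resid @ resid) / sst   # SSE / SST = 1 - R^2

print(vif)
print(1.0 / tol)   # agrees with vif
```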

Explicit and Implicit Intercepts

A linear model contains an explicit intercept if the $\bX $ matrix contains a column whose nonzero values do not vary, usually a column of ones. Many SAS/STAT procedures automatically add this column of ones as the first column in the $\bX $ matrix. Procedures that support a NOINT option in the MODEL statement provide the capability to suppress the automatic addition of the intercept column.

In general, models without an intercept should be the exception, especially if your model does not contain classification (CLASS) variables. An overall intercept is provided in many models to adjust for the grand total or overall mean in your data. A simple linear regression without intercept, such as

\[  Y_ i = \beta _1 x_ i + \epsilon _ i  \]

assumes that Y has mean zero if X takes on the value zero. This might not be a reasonable assumption.

If you explicitly suppress the intercept in a statistical model, the calculation and interpretation of your results can change. For example, the exclusion of the intercept in the following PROC REG statements leads to a different calculation of the R-square statistic. It also affects the calculation of the sums of squares in the analysis of variance for the model: in the absence of an intercept, the model and error sums of squares add up to the uncorrected total sum of squares.

proc reg;
   model y = x / noint;
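The effect of dropping the intercept on the sums of squares can be sketched outside of SAS. The following NumPy example uses made-up data and is not a reproduction of PROC REG output; it shows the decomposition into model and error sums of squares and how an R square based on the uncorrected total differs from one based on the corrected total.

```python
import numpy as np

# Hypothetical data (assumed for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 4.0, 6.0, 7.0])

# No-intercept fit: the single normal equation gives sum(x*y) / sum(x*x).
beta1 = np.sum(x * y) / np.sum(x * x)
fitted = beta1 * x

ss_error = np.sum((y - fitted) ** 2)
ss_model = np.sum(fitted ** 2)
ss_total_uncorrected = np.sum(y ** 2)
ss_total_corrected = np.sum((y - y.mean()) ** 2)

# Model SS + error SS reproduces the uncorrected total, not the corrected one.
print(ss_model + ss_error, ss_total_uncorrected, ss_total_corrected)

# R square computed against each total.
r2_uncorrected = 1.0 - ss_error / ss_total_uncorrected
r2_corrected = 1.0 - ss_error / ss_total_corrected
print(r2_uncorrected, r2_corrected)
```

With these numbers the slope is 1.9, the model and error sums of squares are 108.3 and 1.7, and their sum is the uncorrected total of 110, whereas the corrected total is 10.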

Many statistical models contain an implicit intercept, which occurs when a linear function of one or more columns in the $\bX $ matrix produces a column of constant, nonzero values. For example, the presence of a CLASS variable in the GLM parameterization always implies an intercept in the model. If a model contains an implicit intercept, then adding an intercept to the model does not alter the quality of the model fit, but it changes the interpretation (and number) of the parameter estimates.

The way in which the implicit intercept is detected and accounted for in the analysis depends on the procedure. For example, the following statements in PROC GLM lead to an implied intercept:

proc glm;
   class a;
   model y = a / solution noint;

Whereas the analysis of variance table uses the uncorrected total sum of squares (because of the NOINT option), the implied intercept does not lead to a redefinition or recalculation of the R-square statistic (compared to the model without the NOINT option). Also, because the intercept is implied by the presence of the CLASS variable a in the model, the same error sum of squares results whether the NOINT option is specified or not.
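The statement that the error sum of squares is unchanged can be illustrated with a small NumPy sketch. It assumes a hypothetical three-level classification variable coded with one indicator column per level, which is analogous to (but not a reproduction of) the GLM parameterization; the helper fit_sse is illustrative.

```python
import numpy as np

def fit_sse(X, y):
    """Error sum of squares of a least squares fit; X may be rank deficient."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=1e-10)
    resid = y - X @ beta
    return float(resid @ resid)

# Hypothetical class variable with three levels and a made-up response.
a = np.array([0, 0, 1, 1, 2, 2])
y = np.array([1.0, 2.0, 4.0, 6.0, 9.0, 10.0])

# One indicator column per level.
D = (a[:, None] == np.arange(3)).astype(float)
ones = np.ones((len(y), 1))

# The indicator columns sum to a column of ones: an implicit intercept.
print(D.sum(axis=1))

X_noint = D                    # class effect only
X_int = np.hstack([ones, D])   # explicit intercept plus class effect

sse_noint = fit_sse(X_noint, y)
sse_int = fit_sse(X_int, y)
print(sse_noint, sse_int)              # same error sum of squares
print(np.linalg.matrix_rank(X_int))    # 3, although X_int has 4 columns
```

Adding the explicit intercept adds a column but not a dimension to the column space, so the fitted values (the level means) and the SSE are the same in both fits.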

A different approach is taken, for example, by PROC TRANSREG. The ZERO=NONE option in the CLASS parameterization of the following statements leads to an implicit intercept model:

proc transreg;
   model ide(y) = class(a / zero=none) / ss2;

The analysis of variance table or the regression fit statistics are not affected in PROC TRANSREG. Only the interpretation of the parameter estimates changes because of the way in which the intercept is accounted for in the model.

Implied intercepts do not occur only when classification effects are present in the model. They also occur with B-splines and other sets of constructed columns.

Models Not of Full Rank

If the $\bX $ matrix is not of full rank, then a generalized inverse can be used to solve the normal equations to minimize the SSE:

\[  \widehat{\bbeta } = \left(\bX ^{\prime }\bW \bX \right)^{-}\bX ^{\prime }\bW \mb {Y}  \]

However, these estimates are not unique, because there are infinitely many solutions that correspond to different generalized inverses. PROC REG and other regression procedures choose a nonzero solution for all variables that are linearly independent of previous variables and a zero solution for other variables. This approach corresponds to using a generalized inverse in the normal equations, and the expected values of the estimates are the Hermite normal form of $\bX ^{\prime }\bW \bX $ multiplied by the true parameters:

\[  \mr {E}\left[\widehat{\bbeta }\right] = \left(\bX ^{\prime }\bW \bX \right)^{-}\left(\bX ^{\prime }\bW \bX \right)\bbeta  \]
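The non-uniqueness can be made concrete with a small NumPy sketch. It assumes an unweighted model (W is the identity) whose design matrix has an intercept and two indicator columns that sum to the intercept. One solution uses the Moore-Penrose pseudoinverse; the other zeroes the coefficient of the last, linearly dependent column, which imitates the zero-solution convention described above without claiming to reproduce the algorithm that any SAS procedure uses.

```python
import numpy as np

# Hypothetical rank-deficient design: column 0 = column 1 + column 2.
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
])
y = np.array([1.0, 3.0, 6.0, 8.0])
XtX = X.T @ X
Xty = X.T @ y

# Solution 1: Moore-Penrose pseudoinverse (explicit cutoff for the zero
# singular value of the singular matrix X'X).
beta_pinv = np.linalg.pinv(XtX, rcond=1e-10) @ Xty

# Solution 2: a generalized inverse that inverts the leading full-rank block
# and sets the row and column of the dependent variable to zero.
G = np.zeros((3, 3))
G[:2, :2] = np.linalg.inv(XtX[:2, :2])
beta_zero = G @ Xty

# Expected value of solution 2 is H @ beta, with H = G (X'X).
H = G @ XtX

print(beta_pinv)                        # approximately [ 3. -1.  4.]
print(beta_zero)                        # approximately [ 7. -5.  0.]
print(X @ beta_pinv, X @ beta_zero)     # identical fitted values
print(H)
```

Both vectors solve the normal equations and give the same fitted values (the two group means, 2 and 7). For the second solution, H has rows (1, 0, 1), (0, 1, -1), and (0, 0, 0), so the first estimate has expectation $\beta _0 + \beta _2$, the second has expectation $\beta _1 - \beta _2$, and the third is identically zero.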

Degrees of freedom for the estimates that correspond to singularities are not counted (reported as zero). The hypotheses that are not testable have t tests displayed as missing. The message that the model is not of full rank includes a display of the relations that exist in the matrix.

For information about generalized inverses and their importance for statistical inference in linear models, see the sections Generalized Inverse Matrices and Linear Model Theory in Chapter 3: Introduction to Statistical Modeling with SAS/STAT Software.