Least Squares

The idea of the ordinary least squares (OLS) principle is to choose parameter estimates that minimize the sum of squared distances between the data and the model. In terms of the general additive model,

\[  Y_ i = f(x_{i1},\cdots ,x_{ik};\beta _1,\cdots ,\beta _ p) + \epsilon _ i  \]

the OLS principle minimizes

\[  \mr {SSE} = \sum _{i=1}^ n\left(y_ i - f(x_{i1},\cdots ,x_{ik};\beta _1,\cdots ,\beta _ p) \right)^2  \]

The least squares principle is sometimes called nonparametric in the sense that it does not require the distributional specification of the response or the error term, but it might be better termed distributionally agnostic. In an additive-error model it is only required that the model errors have zero mean. For example, the specification

\begin{align*}  Y_ i & = \beta _0 + \beta _1 x_ i + \epsilon _ i \\ \mr {E}[\epsilon _ i] & = 0 \end{align*}

is sufficient to derive ordinary least squares (OLS) estimators for $\beta _0$ and $\beta _1$ and to study a number of their properties. It is easy to show that the OLS estimators in this simple linear regression (SLR) model are

\begin{align*}  \widehat{\beta }_1 & = \left( \sum _{i=1}^ n \left(Y_ i-\overline{Y}\right) \left(x_ i-\overline{x}\right) \right) \Big/ \sum _{i=1}^ n \left(x_ i-\overline{x}\right)^2 \\ \widehat{\beta }_0 & = \overline{Y} - \widehat{\beta }_1\overline{x} \end{align*}

Based on the assumption of a zero mean of the model errors, you can show that these estimators are unbiased, $\mr {E}[\widehat{\beta }_1] = \beta _1$, $\mr {E}[\widehat{\beta }_0] = \beta _0$. However, without further assumptions about the distribution of the $\epsilon _ i$, you cannot derive the variability of the least squares estimators or perform statistical inferences such as hypothesis tests or confidence intervals. In addition, depending on the distribution of the $\epsilon _ i$, other forms of least squares estimation can be more efficient than OLS estimation.
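
The closed-form SLR estimators can be checked numerically. The following is a minimal sketch in plain Python; the function names `ols_simple` and `sse` and the data values are made up for illustration and are not part of the original text. It computes $\widehat{\beta }_1$ as the sum of cross-products of the centered responses and regressors divided by the sum of squared centered regressors, and then evaluates the SSE criterion at the estimates.

```python
def ols_simple(x, y):
    """Closed-form OLS estimates (b0, b1) for the model y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of centered y and centered x.
    sxy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    # Denominator: sum of squared centered x values.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Error sum of squares for a straight-line mean function."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, chosen only to illustrate the computation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_simple(x, y)
print("intercept:", b0, "slope:", b1)
print("SSE at the OLS estimates:", sse(x, y, b0, b1))
```

Moving either estimate away from the OLS solution, for example `sse(x, y, b0 + 0.1, b1)`, yields a larger SSE, which is what the minimization principle requires.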

The conditions for which ordinary least squares estimation is efficient are zero mean, homoscedastic, uncorrelated model errors. Mathematically,

\begin{align*}  \mr {E}[\epsilon _ i] & = 0 \\ \mr {Var}[\epsilon _ i] & = \sigma ^2 \\ \mr {Cov}[\epsilon _ i,\epsilon _ j] & = 0 \mbox{ if } i \not= j \end{align*}

The second and third assumptions are met if the errors have an iid distribution, that is, if they are independent and identically distributed. Note, however, that the notion of stochastic independence is stronger than that of absence of correlation. If the data are jointly normally distributed, the latter implies the former; in general it does not.
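
A small exact calculation illustrates why zero correlation does not imply independence. The sketch below uses a hypothetical pair of random variables that is not taken from the original text: $X$ takes the values $-1$, $0$, $1$ with equal probability and $Y = X^2$. The covariance is exactly zero, yet $Y$ is completely determined by $X$.

```python
from fractions import Fraction

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X**2.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)            # E[X]
e_y = sum(p * x**2 for x in support)         # E[Y]
e_xy = sum(p * x * x**2 for x in support)    # E[XY] = E[X^3]
cov_xy = e_xy - e_x * e_y

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = p        # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p  # P(X=0) * P(Y=0) = 1/3 * 1/3

print("Cov[X, Y] =", cov_xy)
print("P(X=0, Y=0) =", p_joint, "but P(X=0) P(Y=0) =", p_product)
```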

The various other forms of the least squares principle are motivated by different extensions of these assumptions in order to find more efficient estimators.

Weighted Least Squares

The objective function in weighted least squares (WLS) estimation is

\[  \mr {SSE}_ w = \sum _{i=1}^ n w_ i \left(Y_ i - f(x_{i1},\cdots ,x_{ik};\beta _1,\cdots ,\beta _ p) \right)^2  \]

where $w_ i$ is a weight associated with the $i$th observation. A situation where WLS estimation is appropriate is when the errors are uncorrelated but not homoscedastic. If the weights for the observations are proportional to the reciprocals of the error variances, $\mr {Var}[\epsilon _ i] = \sigma ^2/w_ i$, then the weighted least squares estimates are best linear unbiased estimators (BLUE). Suppose that the weights $w_ i$ are collected in the diagonal matrix $\mb {W}$ and that the mean function has the form of a linear model. The weighted sum of squares criterion then can be written as

\[  \mr {SSE}_ w = \left(\bY -\bX \bbeta \right)'\mb {W} \left(\bY -\bX \bbeta \right)  \]

which gives rise to the weighted normal equations

\[  (\bX '\mb {W}\bX )\bbeta = \bX '\mb {W}\bY  \]

The resulting WLS estimator of $\bbeta $ is

\[  \widehat{\bbeta }_ w = \left(\bX '\mb {W}\bX \right)^{-}\bX '\mb {W}\bY  \]
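
For a straight-line mean function the weighted normal equations reduce to a 2-by-2 linear system that can be solved directly. The following sketch assumes a design matrix with an intercept column and a single regressor, and it assumes the system is nonsingular, so an ordinary inverse replaces the generalized inverse in the estimator. The function name `wls_line`, the data, and the choice of weights $w_ i = 1/x_ i^2$ are illustrative assumptions.

```python
def wls_line(x, y, w):
    """Solve the weighted normal equations (X'WX) b = X'WY for X = [1, x]."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    # X'WX = [[sw, swx], [swx, swxx]] and X'WY = [swy, swxy]; apply Cramer's rule.
    det = sw * swxx - swx * swx
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    return b0, b1


# Hypothetical data in which the error variance is assumed to grow like x**2,
# so the weights are taken proportional to 1 / x**2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.3, 3.8, 6.5, 7.1, 11.0]
w = [1.0 / xi**2 for xi in x]

b0_w, b1_w = wls_line(x, y, w)
b0_eq, b1_eq = wls_line(x, y, [1.0] * len(x))  # equal weights reproduce OLS

print("WLS estimates:", b0_w, b1_w)
print("Equal-weight (OLS) estimates:", b0_eq, b1_eq)
```

At the WLS solution the weighted residuals are orthogonal to both columns of the design matrix, which is exactly what the weighted normal equations state.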

Iteratively Reweighted Least Squares

If the weights in a least squares problem depend on the parameters, then a change in the parameters also changes the weight structure of the model. Iteratively reweighted least squares (IRLS) estimation is an iterative technique that solves a series of weighted least squares problems, where the weights are recomputed between iterations. IRLS estimation can be used, for example, to derive maximum likelihood estimates in generalized linear models.
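
As one concrete instance, the sketch below applies IRLS to a Poisson regression with a log link and a single regressor. This particular model, the starting values, the convergence rule, and the names `wls_line` and `irls_poisson` are assumptions made for illustration; the original text does not prescribe them. Each pass recomputes the weights $w_ i = \mu _ i$ and the working response $z_ i = \eta _ i + (y_ i - \mu _ i)/\mu _ i$ from the current estimates and then solves a weighted least squares problem.

```python
import math


def wls_line(x, z, w):
    """Weighted least squares fit of z on [1, x] via the 2x2 normal equations."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swz = sum(wi * zi for wi, zi in zip(w, z))
    swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    det = sw * swxx - swx * swx
    return (swxx * swz - swx * swxz) / det, (sw * swxz - swx * swz) / det


def irls_poisson(x, y, tol=1e-10, max_iter=50):
    """IRLS for a Poisson mean exp(b0 + b1*x); returns (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(max_iter):
        eta = [b0 + b1 * xi for xi in x]                 # linear predictor
        mu = [math.exp(e) for e in eta]                  # current fitted means
        w = mu[:]                                        # weights depend on the parameters
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]  # working response
        new_b0, new_b1 = wls_line(x, z, w)
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical count data.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 2, 2, 4, 6, 9]

b0, b1 = irls_poisson(x, y)
print("IRLS estimates:", b0, b1)
```

At convergence the likelihood score equations for this model, $\sum (y_ i - \mu _ i) = 0$ and $\sum x_ i (y_ i - \mu _ i) = 0$, hold up to numerical tolerance, so the IRLS solution coincides with the maximum likelihood estimate.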

Generalized Least Squares

The previously discussed least squares methods have in common that the observations are assumed to be uncorrelated—that is, $\mr {Cov}[\epsilon _ i,\epsilon _ j] = 0$, whenever $i \not= j$. The weighted least squares estimation problem is a special case of a more general least squares problem, where the model errors have a general covariance matrix, $\mr {Var}[\bepsilon ] = \bSigma $. Suppose again that the mean function is linear, so that the model becomes

\[  \bY = \bX \bbeta + \bepsilon \quad \quad \bepsilon \sim \left(\mb {0},\bSigma \right)  \]

The generalized least squares (GLS) principle is to minimize the generalized error sum of squares

\[  \mr {SSE}_ g = \left(\bY -\bX \bbeta \right)'\bSigma ^{-1} \left(\bY -\bX \bbeta \right)  \]

This leads to the generalized normal equations

\[  (\bX '\bSigma ^{-1}\bX )\bbeta = \bX '\bSigma ^{-1}\bY  \]

and the GLS estimator

\[  \widehat{\bbeta }_ g = \left(\bX '\bSigma ^{-1}\bX \right)^{-}\bX '\bSigma ^{-1}\bY  \]

Obviously, WLS estimation is a special case of GLS estimation, where $\bSigma = \sigma ^2\mb {W}^{-1}$—that is, the model is

\[  \bY = \bX \bbeta + \bepsilon \quad \quad \bepsilon \sim \left(\mb {0},\sigma ^2\mb {W}^{-1}\right)  \]
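
The relationship between the GLS and WLS estimators can be verified numerically. The sketch below is a plain-Python illustration under the assumption that both $\bSigma $ and the design matrix have full rank, so ordinary linear solves stand in for the generalized inverse. The helper names `solve`, `matmul_t`, and `gls`, the data, and the covariance matrices are hypothetical. It first sets $\bSigma = \sigma ^2\mb {W}^{-1}$ and compares the result with a direct solution of the weighted normal equations, and then fits a model with a nondiagonal covariance matrix.

```python
def solve(a, b):
    """Solve A X = B by Gaussian elimination with partial pivoting.

    a is n x n and b is n x k, both lists of rows; the inputs are not modified.
    """
    n, k = len(a), len(b[0])
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + k):
                m[r][c] -= factor * m[col][c]
    sol = [[0.0] * k for _ in range(n)]
    for r in range(n - 1, -1, -1):  # back substitution
        for j in range(k):
            s = m[r][n + j] - sum(m[r][c] * sol[c][j] for c in range(r + 1, n))
            sol[r][j] = s / m[r][r]
    return sol


def matmul_t(a, b):
    """Return A'B for matrices stored as lists of rows."""
    return [[sum(a[i][r] * b[i][c] for i in range(len(a)))
             for c in range(len(b[0]))] for r in range(len(a[0]))]


def gls(X, y, Sigma):
    """GLS estimate from the generalized normal equations (full rank assumed)."""
    si_x = solve(Sigma, X)                   # Sigma^{-1} X
    si_y = solve(Sigma, [[yi] for yi in y])  # Sigma^{-1} Y
    lhs = matmul_t(X, si_x)                  # X' Sigma^{-1} X
    rhs = matmul_t(X, si_y)                  # X' Sigma^{-1} Y
    return [row[0] for row in solve(lhs, rhs)]


# Hypothetical data and weights.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.3, 3.8, 6.5, 7.1, 11.0]
w = [1.0 / xi**2 for xi in x]
sigma2 = 2.0
n = len(x)
X = [[1.0, xi] for xi in x]

# Special case: Sigma = sigma^2 * W^{-1} (diagonal).
Sigma_w = [[sigma2 / w[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
beta_gls = gls(X, y, Sigma_w)

# Direct WLS solution of (X'WX) b = X'WY for comparison.
WX = [[w[i] * v for v in X[i]] for i in range(n)]
Wy = [[w[i] * y[i]] for i in range(n)]
beta_wls = [row[0] for row in solve(matmul_t(X, WX), matmul_t(X, Wy))]

# A nondiagonal covariance matrix with correlation 0.5**|i-j| between errors.
Sigma_ar = [[0.5 ** abs(i - j) for j in range(n)] for i in range(n)]
beta_ar = gls(X, y, Sigma_ar)

print("GLS with Sigma = sigma^2 W^{-1}:", beta_gls)
print("WLS:", beta_wls)
print("GLS with correlated errors:", beta_ar)
```

The factor $\sigma ^2$ cancels from both sides of the generalized normal equations, which is why the value chosen for `sigma2` does not affect the estimates.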