Robust Regression Examples


Overview

SAS/IML has four subroutines that you can use for robust estimation of location and scale, for outlier detection, and for robust regression. The least median of squares (LMS) and least trimmed squares (LTS) subroutines perform robust regression (sometimes called resistant regression). These subroutines can detect outliers and perform a least squares regression on the remaining observations. You can use the minimum volume ellipsoid estimation (MVE) and minimum covariance determinant estimation (MCD) subroutines to find a robust location and a robust covariance matrix that you can use to construct confidence regions, to detect multivariate outliers and leverage points, and to conduct robust canonical correlation and principal component analyses.
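The following statements sketch one way these subroutines are typically invoked from PROC IML. The data are synthetic (a linear trend with one gross outlier), and the option vector is filled with missing values on the assumption that missing entries request default settings; the meanings of the option entries and of the output arguments are described in the documentation for the individual subroutines, so treat this as an outline rather than a complete analysis.

   proc iml;
      /* synthetic data: a linear trend in x with one gross outlier in y */
      x = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
      y = {2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 40.0, 20.2};

      optn = j(9, 1, .);     /* missing entries are assumed to request defaults */

      /* robust regression of y on x */
      call lms(scLMS, coefLMS, wgtLMS, optn, y, x);
      call lts(scLTS, coefLTS, wgtLTS, optn, y, x);

      /* robust location and scatter estimates for the bivariate sample (x,y) */
      z = x || y;
      call mve(scMVE, coefMVE, distMVE, optn, z);
      call mcd(scMCD, coefMCD, distMCD, optn, z);
   quit;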

The LMS, LTS, MVE, and MCD methods were developed by Rousseeuw (1984) and Rousseeuw and Leroy (1987). All these methods have a high breakdown value. The breakdown value is a measure of the proportion of contamination that a procedure can withstand and still maintain its robustness.

The algorithm that the LMS subroutine uses is based on the program for robust regression (PROGRESS) of Rousseeuw and Hubert (1996), which is an updated version of Rousseeuw and Leroy (1987). In the special case of regression through the origin for a single regressor, Barreto and Maharry (2006) show that the PROGRESS algorithm does not, in general, find the slope that yields the least median of squares. Starting with SAS/IML 9.2, the LMS subroutine includes the algorithm of Barreto and Maharry (2006) as a special case.

The algorithm that the LTS subroutine uses is based on the FAST-LTS algorithm of Rousseeuw and Van Driessen (2000). The MCD algorithm is based on the FAST-MCD algorithm of Rousseeuw and Van Driessen (1999), which is similar to the FAST-LTS algorithm. The MVE algorithm is based on the algorithm that is used in the MINVOL program by Rousseeuw (1984). LTS estimation has higher statistical efficiency than LMS estimation. Using the FAST-LTS algorithm, LTS is also faster than LMS for large data sets. Similarly, MCD is faster than MVE for large data sets.

In addition to LTS estimation and LMS estimation, there are other methods for robust regression and outlier detection. For more information, see the documentation of the ROBUSTREG procedure in SAS/STAT User's Guide. A summary of these robust tools in SAS can be found in Chen (2002).

The four SAS/IML subroutines are designed for the following tasks:

  • LMS minimizes the hth ordered squared residual.

  • LTS minimizes the sum of the h smallest squared residuals.

  • MCD minimizes the determinant of the covariance of h points.

  • MVE minimizes the volume of an ellipsoid that contains h points.

The value h specifies the number of observations on which the estimate is based; its value lies in the range

\[ \frac{N}{2} + 1 ~ \leq ~ h ~ \leq ~ \frac{3N}{4} + \frac{n+1}{4} \]

where N is the number of observations and n is the number of regressors. (The value of h can be specified, but in most applications the default value works well and the results seem to be quite stable for different choices of h.) The value of h determines the breakdown value, which is "the smallest fraction of contamination that can cause the estimator T to take on values arbitrarily far from $T(Z)$" (Rousseeuw and Leroy 1987). Here, $T(Z)$ denotes the result of applying the estimator T to the uncontaminated sample Z of N observations.
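For concreteness, the following statements compute the admissible range of h for a hypothetical data set with N = 50 observations and n = 3 regressors (arbitrary values), together with the PROGRESS default $h = \left[ \frac{N+n+1}{2} \right]$ that is described in the list below.

   proc iml;
      N = 50;                         /* number of observations (arbitrary) */
      n = 3;                          /* number of regressors (arbitrary)   */
      hLower   = N/2 + 1;             /* smallest admissible value of h     */
      hUpper   = 3*N/4 + (n+1)/4;     /* largest admissible value of h      */
      hDefault = floor((N+n+1)/2);    /* default h used by PROGRESS         */
      print hLower hDefault hUpper;
   quit;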

For a linear regression model with parameter vector $\mb{b}=(b_1,\ldots,b_n)$, the residual of observation i is $r_i = y_i - \mb{x}_i\mb{b}$. Let $(r^2)_{1:N} \leq \ldots \leq (r^2)_{N:N}$ denote the ordered, squared residuals. The objective functions for the LMS, LTS, MCD, and MVE optimization problems are defined as follows; each method seeks the estimate that minimizes its objective function. A sketch that evaluates these criteria for a candidate fit follows the list.

  • The objective function for the LMS optimization problem is the hth ordered squared residual:

    \[ F_\mr {LMS} = (r^2)_{h:N} \]

    For $h = N/2+1$, the hth quantile is the median of the squared residuals. The default h in PROGRESS is the optimal value $h = \left[ \frac{N+n+1}{2} \right]$, where $[k]$ denotes the integer part of k; this choice yields the breakdown value $(N-h+1)/N$.

  • The objective function for the LTS optimization problem is based on the sum of the h smallest ordered squared residuals:

    \[ F_\mr {LTS} = \sqrt { \frac{1}{h} \sum _{i=1}^ h (r^2)_{i:N} } \]

  • The objective function for the MCD optimization problem is based on the determinant of the covariance of the selected h points,

    \[ F_\mr {MCD} = \mbox{det}(\mb{C}_ h) \]

    where $\mb{C}_ h$ is the covariance matrix of the selected h points.

  • The objective function for the MVE optimization problem is based on the hth quantile $d_{h:N}$ of the Mahalanobis-type distances $\mb{d}=(d_1,\ldots ,d_ N)$,

    \[ F_\mr {MVE} = \sqrt {d_{h:N} \mbox{det}(\mb{C})} \]

    subject to $d_{h:N} = \sqrt {\chi ^2_{n,0.5}}$, where $\mb{C}$ is the scatter matrix estimate, and the Mahalanobis-type distances are computed as

    \[ \mb{d} = \mbox{diag}(\sqrt { (\mb{X} - T)^ T \mb{C}^{-1} (\mb{X} - T)} ) \]

    where T is the location estimate.
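The following statements evaluate the LMS, LTS, and MCD criteria for one candidate fit and one candidate h-point subset. The data, the coefficient vector b, and the chosen subset are arbitrary illustrations, and the MVE criterion is omitted because it involves the chi-square scaling constraint shown above. The sketch shows only how the criteria are computed; it does not reproduce the search that the subroutines perform.

   proc iml;
      /* synthetic data: intercept column, one regressor, one gross outlier */
      x = {1 1, 1 2, 1 3, 1 4, 1 5, 1 6, 1 7, 1 8, 1 9, 1 10};
      y = {2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 40.0, 20.2};
      b = {0.1, 2.0};                   /* an arbitrary candidate coefficient vector */

      N = nrow(x);   n = ncol(x);
      h = floor((N+n+1)/2);             /* default h used by PROGRESS */

      r  = y - x*b;                     /* residuals for the candidate fit */
      r2 = r##2;
      call sort(r2, 1);                 /* ascending squared residuals */

      F_LMS = r2[h];                    /* hth ordered squared residual */
      F_LTS = sqrt( sum(r2[1:h]) / h ); /* scaled sum of the h smallest squared residuals */

      /* MCD criterion for an arbitrary h-point subset of the bivariate sample */
      z   = x[,2] || y;
      sub = z[1:h, ];                   /* an arbitrary subset of h points */
      zc  = sub - j(h, 1, 1) * sub[:,]; /* center the subset at its column means */
      C_h = zc` * zc / (h-1);           /* covariance matrix of the subset */
      F_MCD = det(C_h);

      print F_LMS F_LTS F_MCD;
   quit;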

Because of the nonsmooth form of these objective functions, the estimates cannot be obtained by using traditional optimization algorithms. For LMS and LTS, the algorithm, as in the PROGRESS program, selects a number of subsets of n observations out of the N specified observations, evaluates the objective function, and saves the subset with the lowest objective function. As long as the problem size enables you to evaluate all such subsets, the result is a global optimum. If computing time does not permit you to evaluate all the different subsets, a random collection of subsets is evaluated. In such a case, you might not obtain the global optimum.
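A conceptual sketch of this subset search, written for the LMS criterion, is given below. It draws random n-point subsets, fits each subset exactly, and keeps the fit with the smallest hth ordered squared residual. The data are synthetic, the number of subsets is arbitrary, and the LMS subroutine itself evaluates subsets far more systematically; the sketch only illustrates the idea.

   proc iml;
      call randseed(123);
      /* synthetic data: a linear trend with one gross outlier */
      x = {1 1, 1 2, 1 3, 1 4, 1 5, 1 6, 1 7, 1 8, 1 9, 1 10};
      y = {2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 40.0, 20.2};
      N = nrow(x);   n = ncol(x);
      h = floor((N+n+1)/2);

      bestF = .;
      bestB = j(n, 1, .);
      do rep = 1 to 200;                /* number of random subsets (arbitrary) */
         p   = ranperm(N);
         idx = p[1:n];                  /* indices of an n-point subset */
         xs  = x[idx, ];
         ys  = y[idx, ];
         if det(xs) ^= 0 then do;       /* skip singular (collinear) subsets */
            b  = solve(xs, ys);         /* exact fit through the n points */
            r2 = (y - x*b)##2;
            call sort(r2, 1);
            F  = r2[h];                 /* LMS criterion for this fit */
            if (bestF = .) | (F < bestF) then do;
               bestF = F;   bestB = b;
            end;
         end;
      end;
      print bestB bestF;
   quit;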

The LMS, LTS, MCD, and MVE subroutines require that the number of observations, N, be more than twice the number of explanatory variables, n (including the intercept). That is, they require $N > 2n$.