M Estimation :: SAS/STAT(R) 12.3 User's Guide

M Estimation

Algorithm
Weight Functions
Convergence Criteria
Asymptotic Covariance and Confidence Intervals
R Square and Deviance
Linear Tests
Model Selection

M estimation in the context of regression was first introduced by Huber (1973) as a result of making the least squares approach robust. Although M estimators are not robust with respect to leverage points, they are popular in applications where leverage points are not an issue.

Instead of minimizing a sum of squares of the residuals, a Huber-type M estimator ${\hat\btheta }_ M$ of $\btheta$ minimizes a sum of less rapidly increasing functions of the residuals:

$Q(\btheta ) = \sum _{i=1}^ n \rho \left({r_ i \over \sigma }\right)$

where $\mb {r} = \mb {y} - \bX \btheta$ . For the ordinary least squares estimation, $\rho$ is the square function, $\rho (z)=z^2$ .

If $\sigma$ is known, then by taking derivatives with respect to $\btheta$ , ${\hat\btheta }_ M$ is also a solution of the system of p equations:

$\sum _{i=1}^ n \psi \left({r_ i \over \sigma }\right) x_{ij}= 0, \ j=1,\ldots ,p$

where $\psi ={\frac{\partial \rho }{\partial z}}$ . If $\rho$ is convex, ${\hat\btheta }_ M$ is the unique solution.

The ROBUSTREG procedure solves this system by using iteratively reweighted least squares (IRLS). The weight function is defined as

$w(z) = {\psi (z) \over z}$

The ROBUSTREG procedure provides 10 kinds of weight functions through the WEIGHTFUNCTION= option in the MODEL statement. Each weight function corresponds to a $\rho$ function. See the section Weight Functions for a complete discussion. You can specify the scale parameter $\sigma$ with the SCALE= option in the PROC ROBUSTREG statement.

If $\sigma$ is unknown, both $\btheta$ and $\sigma$ are estimated by minimizing the function

$Q(\btheta ,\sigma ) = \sum _{i=1}^ n \left[\rho \left({r_ i \over \sigma }\right) + a \right]\sigma , \ a> 0$

The algorithm proceeds by alternately improving $\hat\btheta$ in a location step and $\hat\sigma$ in a scale step.

For the scale step, three methods are available to estimate $\sigma$ , which you can select with the SCALE= option.

(SCALE=HUBER <(D=d)>) Compute $\hat\sigma$ by the iteration

$\left({\hat\sigma }^{(m+1)}\right)^2 = {1\over nh}\sum _{i=1}^ n \chi _ d \left( {r_ i\over {\hat\sigma }^{(m)}} \right) \left({\hat\sigma }^{(m)}\right)^2$

where

$\chi _ d(x) = \left\{ \begin{array}{ll} x^2 / 2 & {\mbox{ if }} |x| < d \\ d^2 / 2 & {\mbox{ otherwise }} \end{array} \right.$

is the Huber function and $h={n-p\over n}\left(d^2 + (1-d^2)\Phi (d) -0.5 - d{\sqrt {2\pi }}e^{-{1\over 2} d^2}\right)$ is the Huber constant (Huber, 1981, p. 179). You can specify d with the D=d option. By default, D=2.5.
(SCALE=TUKEY <(D=d)>) Compute ${\hat\sigma }$ by solving the supplementary equation

${1\over n-p}\sum _{i=1}^ n \chi _ d \left({r_ i \over \sigma }\right) = \beta$

where

$\chi _ d(x) = \left\{ \begin{array}{ll} {3x^2\over d^2} - {3x^4\over d^4} + {x^6\over d^6} & {\mbox{ if }} |x| < d \\ 1 & {\mbox{ otherwise }} \end{array} \right.$

Here $\psi ={1\over 6}\chi _1’$ is Tukey’s bisquare function, and $\beta = \int \chi _ d(s) d \Phi (s)$ is the constant such that the solution ${\hat\sigma }$ is asymptotically consistent when $L(\cdot / \sigma ) = \Phi (\cdot )$ (Hampel et al., 1986, p. 149). You can specify d with the D=d option. By default, D=2.5.
(SCALE=MED) Compute ${\hat\sigma }$ by the iteration

${\hat\sigma }^{(m+1)} = {\mbox{median}} \left\{ |y_ i - \mb {x}_ i’ {\hat\btheta }^{(m)} |/ \beta _0, i=1,\ldots ,n \right\}$

where $\beta _0 = \Phi ^{-1}(.75)$ is the constant such that the solution ${\hat\sigma }$ is asymptotically consistent when $L(\cdot / \sigma ) = \Phi (\cdot )$ (Hampel et al., 1986, p. 312).

SCALE=MED is the default.

Algorithm

The basic algorithm for computing M estimates for regression is iteratively reweighted least squares (IRLS). As the name suggests, a weighted least squares fit is carried out inside an iteration loop. For each iteration, a set of weights for the observations is used in the least squares fit. The weights are constructed by applying a weight function to the current residuals. Initial weights are based on residuals from an initial fit. The ROBUSTREG procedure uses the unweighted least squares fit as a default initial fit. The iteration terminates when a convergence criterion is satisfied. The maximum number of iterations is set to 1,000. You can specify the weight function and the convergence criteria.

Weight Functions

You can specify the weight function for M estimation with the WEIGHTFUNCTION= option. The ROBUSTREG procedure provides 10 weight functions. By default, the procedure uses the bisquare weight function. In most cases, M estimates are more sensitive to the parameters of these weight functions than to the type of the weight function. The median weight function is not stable and is seldom recommended in data analysis; it is included in the procedure for completeness. You can specify the parameters for these weight functions. Except for the Hampel and median weight functions, default values for these parameters are defined such that the corresponding M estimates have 95% asymptotic efficiency in the location model with the Gaussian distribution (Holland and Welsch, 1977).

The following list shows the weight functions available. See Table 80.5 for the default values of the constants in these weight functions.

Andrews	$W(x,c) = \left\{ \begin{array}{ll} {\sin ({x/c}) \over { x/c}} & {\mbox{if }} \|x\| \leq \pi c \\ 0 & {\mbox{otherwise}} \end{array} \right.$
Bisquare	$W(x,c)= \left\{ \begin{array}{ll} \left(1 - ({x/c})^2\right)^2 & {\mbox{if }} \|x\| < c \\ 0 & {\mbox{otherwise}} \end{array} \right.$
Cauchy	$W(x,c) = {1\over 1 + ({\|x\|/c})^2}$
Fair	$W(x,c) = {1\over (1 + {\|x\|/c})}$
Hampel	$W(x,a,b,c) = \left\{ \begin{array}{ll} 1 & \|x\| < a \\ {a\over \|x\|} & a < \|x\| \leq b \\ {a\over \|x\|} {c-\|x\| \over c-b} & b<\|x\|\leq c \\ 0 & {\mbox{otherwise}} \end{array} \right.$
Huber	$W(x,c) = \left\{ \begin{array}{ll} 1 & {\mbox{ if }} \|x\| < c \\ {c\over \|x\|} & {\mbox{ otherwise } } \end{array} \right.$
Logistic	$W(x,c) = {\tanh ({x/c}) \over {x/c}}$
Median	$W(x,c) = \left\{ \begin{array}{ll} {1\over c} & {\mbox{ if }} x = 0 \\ {1\over \|x\|} & {\mbox{ otherwise }} \end{array} \right.$
Talworth	$W(x,c) = \left\{ \begin{array}{ll} 1 & {\mbox{ if }} \|x\| < c \\ 0 & {\mbox{otherwise}} \end{array} \right.$
Welsch	$W(x,c) = \exp \left(-{ ({x/c})^2\over 2}\right)$

Convergence Criteria

The following convergence criteria are available in PROC ROBUSTREG:

relative change in the coefficients (CONVERGENCE= COEF)
relative change in the scaled residuals (CONVERGENCE= RESID)
relative change in weights (CONVERGENCE= WEIGHT)

You can specify the criteria with the CONVERGENCE= option in the PROC ROBUSTREG statement. The default is CONVERGENCE= COEF.

You can specify the precision of the convergence criterion with the EPS= suboption. The default is EPS=1E–8.

In addition to these convergence criteria, a convergence criterion based on scale-independent measure of the gradient is always checked. See Coleman et al. (1980) for more details. A warning is issued if this criterion is not satisfied.

Asymptotic Covariance and Confidence Intervals

The following three estimators of the asymptotic covariance of the robust estimator are available in PROC ROBUSTREG:

${\mbox{H1: }} K^2 {[1 / (n-p)] \sum (\psi (r_ i))^2 \over [(1 / n) \sum (\psi ’(r_ i))]^2 } (\bX ’ \bX )^{-1}$

${\mbox{H2: }} K {[1 / (n-p)] \sum (\psi (r_ i))^2 \over [(1 / n) \sum (\psi ’(r_ i))] } \bW ^{-1}$

${\mbox{H3: }} K^{-1} {1\over (n-p)} \sum (\psi (r_ i))^2 \bW ^{-1}(\bX ’ \bX )\bW ^{-1}$

where $K=1 + {p\over n} {{\mbox{Var}}(\psi ’)\over (E \psi ’)^2}$ is a correction factor and $W_{jk} = \sum \psi ’(r_ i) x_{ij}x_{ik}$ . See Huber (1981, p. 173) for more details.

You can specify the asymptotic covariance estimate with the option ASYMPCOV= option. The ROBUSTREG procedure uses H1 as the default because of its simplicity and stability. Confidence intervals are computed from the diagonal elements of the estimated asymptotic covariance matrix.

R Square and Deviance

The robust version of R-square is defined as

$R^2 = {{\sum \rho \left({y_ i-{\hat\mu } \over {\hat s}}\right) - \sum \rho \left({y_ i-\mb {x}_ i’ {\hat\btheta } \over {\hat s}}\right) \over \sum \rho \left({y_ i-{\hat\mu } \over {\hat s}}\right)}}$

and the robust deviance is defined as the optimal value of the objective function on the $\sigma ^2$ scale

$D = 2 {\hat s}^2 \sum \rho \left({y_ i-\mb {x}_ i’ {\hat\btheta } \over {\hat s}}\right)$

where $\rho ’ = \psi$ , ${\hat\btheta }$ is the M estimator of $\btheta$ , ${\hat\mu }$ is the M estimator of location, and $\hat s$ is the M estimator of the scale parameter in the full model.

Linear Tests

Two tests are available in PROC ROBUSTREG for the canonical linear hypothesis

$H_0: \ \ \ \theta _ j = 0, \ \ j={i_1},\ldots ,{i_ q}$

where q is the total number of parameters of the tested effects. The first test is a robust version of the F test, which is referred to as the $\rho$ test. Denote the M estimators in the full and reduced models as ${\hat\btheta }(0) \in \Omega _0$ and ${\hat\btheta }(1) \in \Omega _1$ , respectively. Let

$\displaystyle Q_0$	$\displaystyle =$	$\displaystyle Q({\hat\btheta }(0)) = \min \{ Q(\btheta ) \| \btheta \in \Omega _0 \}$
$\displaystyle Q_1$	$\displaystyle =$	$\displaystyle Q({\hat\btheta }(1)) = \min \{ Q(\btheta ) \| \btheta \in \Omega _1 \}$

with

$Q = \sum _{i=1}^ n \rho \left({r_ i \over \sigma }\right)$

The robust F test is based on the test statistic

$S_ n^2 = {2\over q} [Q_1 -Q_0]$

Asymptotically $S_ n^2 \sim \lambda \chi _{q}^2$ under , where the standardization factor is $\lambda = \int \psi ^2(s) d\Phi (s) / \int \psi ’(s)$ $d\Phi (s)$ and $\Phi$ is the cumulative distribution function of the standard normal distribution. Large values of are significant. This test is a special case of the general $\tau$ test of Hampel et al. (1986, Section 7.2).

The second test is a robust version of the Wald test, which is referred to as test. The test uses a test statistic

$R_ n^2 = n ({\hat\theta }_{i_1},\ldots ,{\hat\theta }_{i_ q}) \bH _{22}^{-1} ({\hat\theta }_{i_1},\ldots ,{\hat\theta }_{i_ q})’$

where ${1\over n}\bH _{22}$ is the $q\times q$ block (corresponding to $\theta _{i_1},\ldots ,\theta _{i_ q}$ ) of the asymptotic covariance matrix of the M estimate ${ \hat\btheta }_ M$ of $\btheta$ in a p-parameter linear model.

Under , the statistic has an asymptotic $\chi ^2$ distribution with q degrees of freedom. Large values of are significant. See Hampel et al. (1986, Chapter 7) for more details.

Model Selection

When M estimation is used, two criteria are available in PROC ROBUSTREG for model selection. The first criterion is a counterpart of the Akaike (1974) information criterion for robust regression (AICR); it is defined as

$\mbox{AICR} = 2\sum _{i=1}^ n \rho (r_{i:p}) + \alpha p$

where $r_{i:p} = (y_ i - \mb {x}_ i’{\hat\btheta }) / {\hat\sigma }$ , ${\hat\sigma }$ is a robust estimate of $\sigma$ and ${\hat\btheta }$ is the M estimator with p-dimensional design matrix.

As with AIC, $\alpha$ is the weight of the penalty for dimensions. The ROBUSTREG procedure uses $\alpha = 2E\psi ^2 / E\psi ’$ (Ronchetti, 1985) and estimates it by using the final robust residuals.

The second criterion is a robust version of the Schwarz information criteria (BICR); it is defined as

$\mbox{BICR} = 2\sum _{i=1}^ n \rho (r_{i:p}) + p \log (n)$

The ROBUSTREG Procedure