The UNIVARIATE Procedure

Robust Estimators

Subsections:

Winsorized Means
Trimmed Means
Robust Estimates of Scale

A statistical method is robust if it is insensitive to moderate or even large departures from the assumptions that justify the method. PROC UNIVARIATE provides several methods for robust estimation of location and scale. See Example 4.11.

Winsorized Means

The Winsorized mean is a robust estimator of the location that is relatively insensitive to outliers. The k-times Winsorized mean is calculated as

$\bar{x}_{wk}=\frac{1}{n} \left((k+1)x_{(k+1)} +\sum _{i=k+2}^{n-k-1}x_{(i)}+(k+1)x_{(n-k)} \right)$

where n is the number of observations and $x_{(i)}$ is the ith order statistic when the observations are arranged in increasing order:

$x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$

The Winsorized mean is computed as the ordinary mean after the k smallest observations are replaced by the $(k+1)$st smallest observation and the k largest observations are replaced by the $(k+1)$st largest observation.
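The replacement rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure's implementation; the function name `winsorized_mean` and the sample data are assumptions made for the example.

```python
def winsorized_mean(data, k):
    """k-times Winsorized mean: clamp the k smallest and k largest values, then average."""
    x = sorted(data)
    n = len(x)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n/2")
    # x[k] is x_(k+1) and x[n-k-1] is x_(n-k) in the 1-based notation of the text.
    low, high = x[k], x[n - k - 1]
    clamped = [min(max(v, low), high) for v in x]
    return sum(clamped) / n


# Hypothetical sample with one gross outlier (95).
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 95]
print(sum(sample) / len(sample))   # ordinary mean: 15.7
print(winsorized_mean(sample, 1))  # once-Winsorized mean: 7.5
```

With k = 1, the values 2 and 95 are replaced by 4 and 11 respectively, so the outlier no longer dominates the average.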

For data from a symmetric distribution, the Winsorized mean is an unbiased estimate of the population mean. However, the Winsorized mean does not have a normal distribution even if the data are from a normal population.

The Winsorized sum of squared deviations is defined as

$s^2_{wk}=(k+1)(x_{(k+1)} -\bar{x}_{wk} )^2 +\sum _{i=k+2}^{n-k-1} (x_{(i)}-\bar{x}_{wk})^2 + (k+1)(x_{(n-k)} -\bar{x}_{wk} )^2$

The Winsorized t statistic is given by

$t_{wk} =\frac{\bar{x}_{wk} - \mu _0}{\mathrm{SE}(\bar{x}_{wk} )}$

where $\mu _0$ denotes the location under the null hypothesis and the standard error of the Winsorized mean is

$\mathrm{SE}(\bar{x}_{wk})=\frac{n-1}{n-2k-1} \times \frac{s_{wk}}{\sqrt {n(n-1)}}$

When the data are from a symmetric distribution, the distribution of $t_{wk}$ is approximated by a Student’s t distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin, 1963; Dixon and Tukey, 1968).

The Winsorized $100(1-\alpha )\%$ confidence interval for the location parameter has upper and lower limits

$\bar{x}_{wk} \pm t_{1-\frac{\alpha }{2};n-2k-1} \mathrm{SE}(\bar{x}_{wk} )$

where $t_{1-\frac{\alpha }{2};n-2k-1}$ is the $100(1-\frac{\alpha }{2})$th percentile of the Student’s t distribution with $n-2k-1$ degrees of freedom.
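The following Python sketch puts the Winsorized sum of squared deviations, the standard error, and the t statistic together. It is a hedged illustration of the formulas in this section, not the procedure's code; the function name `winsorized_t` and the sample are assumptions. The Student's t percentile needed for the confidence limits is not computed here and would have to come from a table or a statistics library.

```python
import math


def winsorized_t(data, k, mu0=0.0):
    """Return (Winsorized mean, standard error, t statistic, degrees of freedom)."""
    x = sorted(data)
    n = len(x)
    df = n - 2 * k - 1
    if k < 0 or df <= 0:
        raise ValueError("need k >= 0 and n - 2k - 1 > 0")
    low, high = x[k], x[n - k - 1]            # x_(k+1) and x_(n-k)
    w = [min(max(v, low), high) for v in x]   # Winsorized sample
    mean_w = sum(w) / n
    # Winsorized sum of squared deviations, s^2_wk.
    ss_w = sum((v - mean_w) ** 2 for v in w)
    se = (n - 1) / df * math.sqrt(ss_w) / math.sqrt(n * (n - 1))
    return mean_w, se, (mean_w - mu0) / se, df


# Hypothetical sample; test the null location mu0 = 5 with k = 1.
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 95]
mean_w, se, t_stat, df = winsorized_t(sample, 1, mu0=5.0)
print(mean_w, se, t_stat, df)
```

Summing squared deviations over the clamped sample reproduces the three-term formula for $s^2_{wk}$, because each boundary value appears $k+1$ times after clamping.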

Trimmed Means

Like the Winsorized mean, the trimmed mean is a robust estimator of the location that is relatively insensitive to outliers. The k-times trimmed mean is calculated as

$\bar{x}_{tk} =\frac{1}{n-2k}\sum _{i=k+1}^{n-k}x_{(i)}$

where n is the number of observations and $x_{(i)}$ is the ith order statistic when the observations are arranged in increasing order:

$x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$

The trimmed mean is computed after the k smallest and k largest observations are deleted from the sample. In other words, the observations are trimmed at each end.
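As a minimal sketch of the trimming step described above (the function name `trimmed_mean` and the sample values are assumptions for illustration only):

```python
def trimmed_mean(data, k):
    """k-times trimmed mean: drop the k smallest and k largest values, then average."""
    x = sorted(data)
    n = len(x)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n/2")
    kept = x[k:n - k]              # x_(k+1), ..., x_(n-k)
    return sum(kept) / (n - 2 * k)


# Hypothetical sample with one gross outlier (95).
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 95]
print(trimmed_mean(sample, 1))  # 7.5, versus an ordinary mean of 15.7
```

Unlike Winsorizing, trimming discards the extreme observations rather than pulling them in, so the divisor is $n-2k$ rather than n.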

For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. However, the trimmed mean does not have a normal distribution even if the data are from a normal population.

A robust estimate of the variance of the trimmed mean $\bar{x}_{tk}$ can be based on the Winsorized sum of squared deviations $s^2_{wk}$, which is defined in the section Winsorized Means; see Tukey and McLaughlin (1963). This estimate can be used to compute a trimmed t test, which is based on the test statistic

$t_{tk} =\frac{(\bar{x}_{tk} -\mu _{0})}{\mathrm{SE}(\bar{x}_{tk})}$

where the standard error of the trimmed mean is

$\mathrm{SE}(\bar{x}_{tk})=\frac{s_{wk}}{\sqrt {(n-2k)(n-2k-1)}}$

When the data are from a symmetric distribution, the distribution of $t_{tk}$ is approximated by a Student’s t distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin, 1963; Dixon and Tukey, 1968).

The "trimmed" $100(1-\alpha )\%$ confidence interval for the location parameter has upper and lower limits

$\bar{x}_{tk} \pm t_{1-\frac{\alpha }{2};n-2k-1} \mathrm{SE}(\bar{x}_{tk})$

where $t_{1-\frac{\alpha }{2};n-2k-1}$ is the $100(1-\frac{\alpha }{2})$th percentile of the Student’s t distribution with $n-2k-1$ degrees of freedom.
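The trimmed t statistic combines the trimmed mean with the Winsorized sum of squared deviations. The sketch below follows the formulas in this section under that reading; the function name `trimmed_t` and the sample are assumptions, and the Student's t percentile for the confidence limits is left to a table or a statistics library.

```python
import math


def trimmed_t(data, k, mu0=0.0):
    """Return (trimmed mean, standard error, t statistic, degrees of freedom)."""
    x = sorted(data)
    n = len(x)
    df = n - 2 * k - 1
    if k < 0 or df <= 0:
        raise ValueError("need k >= 0 and n - 2k - 1 > 0")
    mean_t = sum(x[k:n - k]) / (n - 2 * k)
    # The Winsorized sample supplies the sum of squared deviations s^2_wk.
    low, high = x[k], x[n - k - 1]
    w = [min(max(v, low), high) for v in x]
    mean_w = sum(w) / n
    ss_w = sum((v - mean_w) ** 2 for v in w)
    se = math.sqrt(ss_w) / math.sqrt((n - 2 * k) * df)
    return mean_t, se, (mean_t - mu0) / se, df


# Hypothetical sample; test the null location mu0 = 5 with k = 1.
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 95]
mean_t, se, t_stat, df = trimmed_t(sample, 1, mu0=5.0)
print(mean_t, se, t_stat, df)
```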

Robust Estimates of Scale

The sample standard deviation, which is the most commonly used estimator of scale, is sensitive to outliers. Robust scale estimators, on the other hand, remain bounded when a single data value is replaced by an arbitrarily large or small value. The UNIVARIATE procedure computes several robust measures of scale, including the interquartile range, Gini’s mean difference G, the median absolute deviation about the median (MAD), $Q_n$, and $S_n$. In addition, the procedure computes estimates of the normal standard deviation $\sigma$ derived from each of these measures.

The interquartile range (IQR) is simply the difference between the upper and lower quartiles. For a normal population, $\sigma$ can be estimated as IQR/1.34898.
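The divisor 1.34898 is the interquartile range of a standard normal distribution, that is, twice its upper quartile. A short check using Python's standard library (the names `iqr_per_sigma` and `sigma_from_iqr` are illustrative assumptions):

```python
from statistics import NormalDist

# Upper quartile of the standard normal distribution (about 0.6745).
z_upper = NormalDist().inv_cdf(0.75)

# For a normal population, IQR = 2 * z_upper * sigma.
iqr_per_sigma = 2 * z_upper
print(round(iqr_per_sigma, 5))  # 1.34898


def sigma_from_iqr(iqr):
    """Estimate the normal standard deviation from an interquartile range."""
    return iqr / iqr_per_sigma
```

The sketch takes the IQR as given, because the value of a sample quartile depends on the percentile definition in use.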

Gini’s mean difference is computed as

$G = \frac{1}{\binom{n}{2}} \sum _{i<j} |x_i - x_j|$

For a normal population, the expected value of G is $2\sigma /\sqrt {\pi }$. Thus $G\sqrt {\pi }/2$ is a robust estimator of $\sigma$ when the data are from a normal population. For the normal distribution, this estimator has high efficiency relative to the usual sample standard deviation, and it is also less sensitive to the presence of outliers.
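A direct, quadratic-time sketch of Gini's mean difference and the derived estimate of $\sigma$ follows. The function names are assumptions chosen for illustration, and nothing here is meant to describe how the procedure computes G internally.

```python
import math
from itertools import combinations


def gini_mean_difference(data):
    """Average absolute difference over all n-choose-2 pairs of observations."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(abs(a - b) for a, b in combinations(data, 2))
    return total / math.comb(n, 2)


def sigma_from_gini(data):
    """Estimate of the normal standard deviation: G * sqrt(pi) / 2."""
    return gini_mean_difference(data) * math.sqrt(math.pi) / 2


# Pairwise distances of [1, 2, 4] are 1, 3, and 2, so G = 6 / 3 = 2.0.
print(gini_mean_difference([1, 2, 4]))
```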

A very robust scale estimator is the MAD, the median absolute deviation from the median (Hampel, 1974), which is computed as

$\mathrm{MAD} = \mathrm{med}_{i}( | x_i - \mathrm{med}_{j}( x_j ) | )$

where the inner median, $\mathrm{med}_{j}(x_j)$, is the median of the n observations, and the outer median (taken over i) is the median of the n absolute values of the deviations about the inner median. For a normal population, $1.4826 \times \mathrm{MAD}$ is an estimator of $\sigma$.
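The two nested medians translate directly into code. The sketch below uses `statistics.median` from the Python standard library; the function names `mad` and `sigma_from_mad` and the sample are illustrative assumptions.

```python
from statistics import median


def mad(data):
    """Median absolute deviation about the median."""
    center = median(data)                               # inner median
    return median([abs(v - center) for v in data])      # outer median


def sigma_from_mad(data):
    """Estimate of the normal standard deviation: 1.4826 * MAD."""
    return 1.4826 * mad(data)


# Hypothetical sample with one gross outlier (95).
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 95]
print(mad(sample))  # 2.5
```

For this sample the median is 7.5 and the outlier contributes only one large deviation (87.5), which the outer median ignores.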

The MAD has low efficiency for normal distributions, and it may not always be appropriate for asymmetric distributions. Rousseeuw and Croux (1993) proposed two statistics as alternatives to the MAD. The first is

$S_n = 1.1926 \times \mathrm{med}_{i} ( \mathrm{med}_{j} ( |x_i - x_j | ) )$

where the outer median (taken over i) is the median of the n medians of $|x_i - x_j|$, $j = 1, 2, \ldots , n$. To reduce small-sample bias, $c_{sn}S_{n}$ is used to estimate $\sigma$, where $c_{sn}$ is a correction factor; see Croux and Rousseeuw (1992).
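A naive quadratic-time sketch of $S_n$, taken literally from the formula above, is shown below. It rests on two assumptions that should be stated plainly: it uses the ordinary median at both levels (published algorithms may use low and high median variants), and it omits the small-sample correction factor $c_{sn}$. The function name `s_n` is illustrative.

```python
from statistics import median


def s_n(data):
    """1.1926 times the median over i of the median over j of |x_i - x_j|."""
    # For each observation, the median distance to all n observations (j = 1..n).
    inner = [median([abs(xi - xj) for xj in data]) for xi in data]
    return 1.1926 * median(inner)


# Inner medians for [1, 2, 4, 7] are 2.0, 1.5, 2.5, and 4.0; their median is 2.25.
print(s_n([1, 2, 4, 7]))
```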

The second statistic proposed by Rousseeuw and Croux (1993) is

$Q_n = 2.2219 \{ | x_i - x_j |; i < j \} _{(k)}$

where

$k = \binom{\left[\frac{n}{2}\right]+1}{2}$

and $\left[\frac{n}{2}\right]$ denotes the integer part of $n/2$. In other words, $Q_n$ is 2.2219 times the kth order statistic of the $\binom{n}{2}$ distances between the data points. The bias-corrected statistic $c_{qn}Q_{n}$ is used to estimate $\sigma$, where $c_{qn}$ is a correction factor; see Croux and Rousseeuw (1992).
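The definition of $Q_n$ can be sketched by sorting all pairwise distances and picking the kth smallest. This brute-force version is for illustration only: it takes the bracket in the formula for k to mean the integer part of n/2, it omits the correction factor $c_{qn}$, and the function name `q_n` is an assumption.

```python
import math
from itertools import combinations


def q_n(data):
    """2.2219 times the kth smallest of the n-choose-2 pairwise distances."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    h = n // 2 + 1            # integer part of n/2, plus 1
    k = math.comb(h, 2)       # 1-based rank of the order statistic
    distances = sorted(abs(a - b) for a, b in combinations(data, 2))
    return 2.2219 * distances[k - 1]


# For [1, 2, 4, 7]: sorted distances are 1, 2, 3, 3, 5, 6 and k = 3.
print(q_n([1, 2, 4, 7]))
```

Sorting all pairs costs on the order of $n^2 \log n$ operations, which is acceptable for a sketch but slow for large samples.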