A statistical method is robust if it is insensitive to moderate or even large departures from the assumptions that justify the method. PROC UNIVARIATE provides several methods for robust estimation of location and scale. See Example 4.11.
The Winsorized mean is a robust estimator of the location that is relatively insensitive to outliers. The $k$-times Winsorized mean is calculated as
$$\bar{x}_{wk} = \frac{1}{n}\left( (k+1)\,x_{(k+1)} + \sum_{i=k+2}^{n-k-1} x_{(i)} + (k+1)\,x_{(n-k)} \right)$$
where $n$ is the number of observations and $x_{(i)}$ is the $i$th order statistic when the observations are arranged in increasing order:
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$
The Winsorized mean is computed as the ordinary mean after the $k$ smallest observations are replaced by the $(k+1)$st smallest observation and the $k$ largest observations are replaced by the $(k+1)$st largest observation.
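The replacement rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure's implementation; the function name `winsorized_mean` and the sample data are hypothetical.

```python
def winsorized_mean(data, k):
    """k-times Winsorized mean: replace the k smallest values by the
    (k+1)st smallest and the k largest by the (k+1)st largest, then average."""
    x = sorted(data)
    n = len(x)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n/2")
    # x[k] is x_(k+1) and x[n-k-1] is x_(n-k) in the 1-based notation of the text
    low, high = x[k], x[n - k - 1]
    winsorized = [min(max(v, low), high) for v in x]
    return sum(winsorized) / n


# Hypothetical sample with one gross outlier
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 100]
print(winsorized_mean(sample, 0))  # ordinary mean: 16.2
print(winsorized_mean(sample, 1))  # 1-times Winsorized mean: 7.5
```

With `k = 0` nothing is replaced, so the function returns the ordinary mean; a single Winsorization step already removes the influence of the outlying value 100.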
For data from a symmetric distribution, the Winsorized mean is an unbiased estimate of the population mean. However, the Winsorized mean does not have a normal distribution even if the data are from a normal population.
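The Winsorized sum of squared deviations is defined as
$$s_{wk}^2 = (k+1)\left(x_{(k+1)} - \bar{x}_{wk}\right)^2 + \sum_{i=k+2}^{n-k-1}\left(x_{(i)} - \bar{x}_{wk}\right)^2 + (k+1)\left(x_{(n-k)} - \bar{x}_{wk}\right)^2$$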
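The Winsorized $t$ statistic is given by
$$t_{wk} = \frac{\bar{x}_{wk} - \mu_0}{\mathrm{SE}(\bar{x}_{wk})}$$
where $\mu_0$ denotes the location under the null hypothesis and the standard error of the Winsorized mean is
$$\mathrm{SE}(\bar{x}_{wk}) = \frac{n-1}{n-2k-1} \times \frac{s_{wk}}{\sqrt{n(n-1)}}$$
When the data are from a symmetric distribution, the distribution of $t_{wk}$ is approximated by a Student’s $t$ distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin, 1963; Dixon and Tukey, 1968).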
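As a rough illustration of how the Winsorized sum of squared deviations, the standard error, and the test statistic fit together, the following Python sketch computes them for a small sample. The function name `winsorized_t`, its return layout, and the data are assumptions made for this example; it does not compute p-values.

```python
import math


def winsorized_t(data, k, mu0=0.0):
    """Return (t statistic, standard error, degrees of freedom) for the
    k-times Winsorized mean, tested against the null location mu0."""
    x = sorted(data)
    n = len(x)
    df = n - 2 * k - 1
    if k < 0 or df <= 0:
        raise ValueError("need k >= 0 and n - 2k - 1 > 0")
    low, high = x[k], x[n - k - 1]
    w = [min(max(v, low), high) for v in x]       # Winsorized sample
    mean_wk = sum(w) / n                          # Winsorized mean
    ss_wk = sum((v - mean_wk) ** 2 for v in w)    # Winsorized sum of squared deviations
    se = (n - 1) / df * math.sqrt(ss_wk) / math.sqrt(n * (n - 1))
    return (mean_wk - mu0) / se, se, df


# Hypothetical sample; test H0: location = 5
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 100]
t_wk, se_wk, df_wk = winsorized_t(sample, 1, mu0=5.0)
print(t_wk, se_wk, df_wk)
```

Summing squared deviations over the Winsorized sample is equivalent to the three-term definition, because the replaced observations each contribute the same squared deviation as the order statistic that replaces them.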
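The Winsorized $100(1-\alpha)\%$ confidence interval for the location parameter has upper and lower limits
$$\bar{x}_{wk} \pm t_{1-\frac{\alpha}{2};\,n-2k-1}\,\mathrm{SE}(\bar{x}_{wk})$$
where $t_{1-\frac{\alpha}{2};\,n-2k-1}$ is the $\left(1-\frac{\alpha}{2}\right)100$th percentile of the Student’s $t$ distribution with $n-2k-1$ degrees of freedom.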
Like the Winsorized mean, the trimmed mean is a robust estimator of the location that is relatively insensitive to outliers.
The $k$-times trimmed mean is calculated as
$$\bar{x}_{tk} = \frac{1}{n-2k}\sum_{i=k+1}^{n-k} x_{(i)}$$
where $n$ is the number of observations and $x_{(i)}$ is the $i$th order statistic when the observations are arranged in increasing order:
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$
The trimmed mean is computed after the $k$ smallest and $k$ largest observations are deleted from the sample. In other words, the observations are trimmed at each end.
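The trimming rule can be sketched in Python as follows. The function name `trimmed_mean` and the sample are hypothetical, and the sketch is meant only to illustrate the definition.

```python
def trimmed_mean(data, k):
    """k-times trimmed mean: drop the k smallest and k largest
    observations, then average the remaining n - 2k values."""
    x = sorted(data)
    n = len(x)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n/2")
    kept = x[k:n - k]          # x_(k+1), ..., x_(n-k) in 1-based notation
    return sum(kept) / (n - 2 * k)


# Hypothetical sample with one gross outlier
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 100]
print(trimmed_mean(sample, 0))  # ordinary mean: 16.2
print(trimmed_mean(sample, 1))  # 1-times trimmed mean: 7.5
```

Unlike the Winsorized mean, which keeps all $n$ values after replacement, the trimmed mean divides by $n-2k$ because the extreme observations are removed outright.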
For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. However, the trimmed mean does not have a normal distribution even if the data are from a normal population.
A robust estimate of the variance of the trimmed mean can be based on the Winsorized sum of squared deviations $s_{wk}^2$, which is defined in the section Winsorized Means; see Tukey and McLaughlin (1963). This can be used to compute a trimmed $t$ test, which is based on the test statistic
$$t_{tk} = \frac{\bar{x}_{tk} - \mu_0}{\mathrm{SE}(\bar{x}_{tk})}$$
where the standard error of the trimmed mean is
$$\mathrm{SE}(\bar{x}_{tk}) = \frac{s_{wk}}{\sqrt{(n-2k)(n-2k-1)}}$$
When the data are from a symmetric distribution, the distribution of $t_{tk}$ is approximated by a Student’s $t$ distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin, 1963; Dixon and Tukey, 1968).
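The trimmed $t$ statistic combines the trimmed mean with the Winsorized sum of squared deviations. The Python sketch below shows one way to assemble the pieces; the function name `trimmed_t`, its return layout, and the data are assumptions for illustration, and no p-value is computed.

```python
import math


def trimmed_t(data, k, mu0=0.0):
    """Return (t statistic, standard error, degrees of freedom) for the
    k-times trimmed mean, tested against the null location mu0."""
    x = sorted(data)
    n = len(x)
    df = n - 2 * k - 1
    if k < 0 or df <= 0:
        raise ValueError("need k >= 0 and n - 2k - 1 > 0")
    # Trimmed mean: average of x_(k+1), ..., x_(n-k)
    mean_tk = sum(x[k:n - k]) / (n - 2 * k)
    # Winsorized sum of squared deviations, taken about the Winsorized mean
    low, high = x[k], x[n - k - 1]
    w = [min(max(v, low), high) for v in x]
    mean_wk = sum(w) / n
    ss_wk = sum((v - mean_wk) ** 2 for v in w)
    se = math.sqrt(ss_wk) / math.sqrt((n - 2 * k) * df)
    return (mean_tk - mu0) / se, se, df


# Hypothetical sample; test H0: location = 5
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 100]
t_tk, se_tk, df_tk = trimmed_t(sample, 1, mu0=5.0)
print(t_tk, se_tk, df_tk)
```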
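The “trimmed” $100(1-\alpha)\%$ confidence interval for the location parameter has upper and lower limits
$$\bar{x}_{tk} \pm t_{1-\frac{\alpha}{2};\,n-2k-1}\,\mathrm{SE}(\bar{x}_{tk})$$
where $t_{1-\frac{\alpha}{2};\,n-2k-1}$ is the $\left(1-\frac{\alpha}{2}\right)100$th percentile of the Student’s $t$ distribution with $n-2k-1$ degrees of freedom.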
The sample standard deviation, which is the most commonly used estimator of scale, is sensitive to outliers. Robust scale
estimators, on the other hand, remain bounded when a single data value is replaced by an arbitrarily large or small value.
The UNIVARIATE procedure computes several robust measures of scale, including the interquartile range, Gini’s mean difference $G$, the median absolute deviation about the median (MAD), $S_n$, and $Q_n$. In addition, the procedure computes estimates of the normal standard deviation $\sigma$ derived from each of these measures.
The interquartile range (IQR) is simply the difference between the upper and lower quartiles. For a normal population, $\sigma$ can be estimated as IQR/1.34898.
Gini’s mean difference is computed as
$$G = \frac{1}{\binom{n}{2}} \sum_{i<j} \left|x_i - x_j\right|$$
For a normal population, the expected value of $G$ is $2\sigma/\sqrt{\pi}$. Thus $G\sqrt{\pi}/2$ is a robust estimator of $\sigma$ when the data are from a normal sample. For the normal distribution, this estimator has high efficiency relative to the usual sample standard deviation, and it is also less sensitive to the presence of outliers.
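A direct translation of the pairwise definition into Python might look like the sketch below. The function names `gini_mean_difference` and `sigma_from_gini` are hypothetical, and the quadratic-time loop over all pairs is chosen for clarity rather than speed.

```python
import math
from itertools import combinations


def gini_mean_difference(data):
    """Average absolute difference over all n-choose-2 pairs of observations."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(abs(a - b) for a, b in combinations(data, 2))
    return total / math.comb(n, 2)


def sigma_from_gini(data):
    """Estimate of the normal standard deviation: G * sqrt(pi) / 2."""
    return gini_mean_difference(data) * math.sqrt(math.pi) / 2


# Hypothetical sample
sample = [1, 2, 4]
print(gini_mean_difference(sample))  # pairs differ by 1, 3, 2, so G = 2.0
print(sigma_from_gini(sample))
```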
A very robust scale estimator is the MAD, the median absolute deviation from the median (Hampel, 1974), which is computed as
$$\mathrm{MAD} = \mathrm{med}_i\left(\left|x_i - \mathrm{med}_j(x_j)\right|\right)$$
where the inner median, $\mathrm{med}_j(x_j)$, is the median of the $n$ observations, and the outer median (taken over $i$) is the median of the $n$ absolute values of the deviations about the inner median. For a normal population, $1.4826\,\mathrm{MAD}$ is an estimator of $\sigma$.
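The nested medians can be sketched with Python's standard `statistics.median`. The function names `mad` and `sigma_from_mad` and the sample data are hypothetical; the sketch only illustrates the definition.

```python
from statistics import median


def mad(data):
    """Median absolute deviation about the median."""
    center = median(data)                            # inner median
    return median([abs(v - center) for v in data])   # outer median


def sigma_from_mad(data):
    """Estimate of the normal standard deviation: 1.4826 * MAD."""
    return 1.4826 * mad(data)


# Hypothetical sample with one gross outlier
sample = [1, 2, 3, 4, 100]
print(mad(sample))  # median is 3; absolute deviations 2, 1, 0, 1, 97 give MAD = 1
print(sigma_from_mad(sample))
```

The outlying value 100 changes neither the inner nor the outer median here, which illustrates why the MAD stays bounded when a single observation is made arbitrarily large.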
The MAD has low efficiency for normal distributions, and it may not always be appropriate for symmetric distributions. Rousseeuw and Croux (1993) proposed two statistics as alternatives to the MAD. The first is
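$$S_n = 1.1926\,\mathrm{med}_i\left(\mathrm{med}_j\left(\left|x_i - x_j\right|\right)\right)$$
where the outer median (taken over $i$) is the median of the $n$ medians of $\left|x_i - x_j\right|$, $j = 1, 2, \ldots, n$. To reduce small-sample bias, $c_{sn} S_n$ is used to estimate $\sigma$, where $c_{sn}$ is a correction factor; see Croux and Rousseeuw (1992).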
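The median-of-medians structure of $S_n$ can be illustrated with the short Python sketch below. It follows the formula as written in the text, using ordinary medians with $j$ running over all $n$ observations; production implementations may instead use low and high medians and a faster algorithm, which this sketch does not attempt to reproduce. The small-sample correction factor is omitted because its values are not given here, and the function name `s_n` is hypothetical.

```python
from statistics import median


def s_n(data):
    """Uncorrected S_n: 1.1926 times the median (over i) of the
    medians (over j) of |x_i - x_j|."""
    inner = [median([abs(xi - xj) for xj in data]) for xi in data]
    return 1.1926 * median(inner)


# Hypothetical sample with one gross outlier
sample = [1, 2, 3, 4, 100]
print(s_n(sample))
```

For this sample the inner medians are 2, 1, 1, 2, and 97, so the outer median is 2 and the outlier has no effect on the result.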
The second statistic proposed by Rousseeuw and Croux (1993) is
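$$Q_n = 2.2219\,\left\{\left|x_i - x_j\right|;\ i<j\right\}_{(k)}$$
where
$$k = \binom{\lfloor n/2 \rfloor + 1}{2}$$
In other words, $Q_n$ is 2.2219 times the $k$th order statistic of the $\binom{n}{2}$ distances between the data points. The bias-corrected statistic $c_{qn} Q_n$ is used to estimate $\sigma$, where $c_{qn}$ is a correction factor; see Croux and Rousseeuw (1992).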
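A brute-force Python sketch of the uncorrected $Q_n$ is shown below. It sorts all pairwise distances, which is simple but quadratic in $n$; the constant is taken from the text above and exposed as a parameter, the small-sample correction factor is omitted because its values are not given here, and the function name `q_n` is hypothetical.

```python
import math
from itertools import combinations


def q_n(data, constant=2.2219):
    """Uncorrected Q_n: constant times the k-th smallest pairwise distance,
    where k = C(h, 2) and h = floor(n/2) + 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    h = n // 2 + 1
    k = math.comb(h, 2)
    distances = sorted(abs(a - b) for a, b in combinations(data, 2))
    return constant * distances[k - 1]   # k is 1-based in the text


# Hypothetical sample with one gross outlier
sample = [1, 2, 3, 4, 100]
print(q_n(sample))
```

For $n = 5$ the index is $k = \binom{3}{2} = 3$, and the third smallest of the ten pairwise distances in this sample is 1, so the outlier does not influence the result.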