Distribution Analysis: Outlier Detection

The Outlier Detection analysis identifies outliers in contaminated normally distributed data. This analysis defines outliers as values that are sufficiently far from an estimate of the central tendency of the data.

More formally, suppose the data are normally distributed with location parameter $\mu$ and scale parameter $\sigma$. Let $\hat{\mu}$ be an estimate of the location parameter. Let $\hat{\sigma}$ be an estimate of the scale parameter. Then a value $x$ is considered an outlier if

$| x - \hat{\mu} | > c \hat{\sigma}$

where $c$ is a constant that you can specify. The constant $c$ is called the scale multiplier.
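The rule above can be sketched in a few lines of Python. This is only an illustration of the inequality, not the code that Stat Studio runs; the function name `flag_outliers`, the sample values, and the location and scale estimates passed in are all hypothetical.

```python
def flag_outliers(values, location, scale, c=3.0):
    """Return 1 for each value x with |x - location| > c * scale, else 0."""
    return [1 if abs(x - location) > c * scale else 0 for x in values]


# Hypothetical data: five values near 10 and one distant value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]

# Assumed estimates: location 10.0, scale 0.2, scale multiplier c = 3.
flags = flag_outliers(data, location=10.0, scale=0.2, c=3.0)
print(flags)
```

With these assumed estimates, only the last value lies more than $c \hat{\sigma} = 0.6$ from the location estimate, so it is the only value flagged.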

The basic idea is that if the data are normally distributed, then about 99.7% of the data are within three standard deviations of the mean. Therefore, if you can accurately estimate the mean (location parameter) and standard deviation (scale parameter), you can identify values in the tails of the distribution. However, if the data contain outliers, then the sample mean and standard deviation are themselves distorted by those outliers, so you need to use robust estimators of the location and scale parameters. Robust estimates are described in the "Details" section of the documentation for the UNIVARIATE procedure, in the Base SAS Procedures Guide.
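The following Python sketch illustrates both points: the coverage of a normal distribution within three standard deviations, and the way outliers can hide themselves when traditional estimates are used. It assumes the median and the scaled median absolute deviation (MAD, multiplied by 1.4826 for consistency with the normal standard deviation) as the robust pair. That is one common choice, not necessarily the estimator that the analysis uses, and the data and function names are hypothetical.

```python
import math
import statistics


def normal_coverage(k):
    """Probability that a normal variate lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


def traditional_estimates(values):
    """Sample mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def robust_estimates(values):
    """Median and scaled MAD (an assumed robust location/scale pair)."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return med, 1.4826 * mad


def flag_outliers(values, location, scale, c=3.0):
    """Return 1 for each value x with |x - location| > c * scale, else 0."""
    return [1 if abs(x - location) > c * scale else 0 for x in values]


# Hypothetical contaminated sample: eight values near 10 plus two gross outliers.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0, 30.0]

coverage = normal_coverage(3)  # roughly 0.9973

trad_flags = flag_outliers(data, *traditional_estimates(data))
robust_flags = flag_outliers(data, *robust_estimates(data))

print(f"coverage within 3 sd: {coverage:.4f}")
print("traditional:", trad_flags)
print("robust:     ", robust_flags)
```

In this sample the two extreme values inflate the sample standard deviation so much that the traditional estimates flag nothing, whereas the median and MAD flag both extreme values.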

You can use the analysis to specify traditional or robust estimates of location and scale parameters for a numerical variable. You can create a histogram with a normal curve overlaid. You can create an indicator variable that has the value 1 for observations that are sufficiently far from the location estimate.

You can run an Outlier Detection analysis by selecting Analysis $\blacktriangleright\,$ Distribution Analysis $\blacktriangleright\,$ Outlier Detection from the main menu. When you request outlier detection, Stat Studio calls the UNIVARIATE procedure in Base SAS to compute location and scale estimates. SAS/IML is then used to compute the outliers.
