The UNIVARIATE Procedure

Goodness-of-Fit Tests

Subsections:

Shapiro-Wilk Statistic
EDF Goodness-of-Fit Tests

When you specify the NORMAL option in the PROC UNIVARIATE statement or you request a fitted parametric distribution in the HISTOGRAM statement, the procedure computes goodness-of-fit tests for the null hypothesis that the values of the analysis variable are a random sample from the specified theoretical distribution. See Example 4.22.

When you specify the NORMAL option, these tests, which are summarized in the output table labeled "Tests for Normality," include the following:

Shapiro-Wilk test
Kolmogorov-Smirnov test
Anderson-Darling test
Cramér–von Mises test

The Kolmogorov-Smirnov D statistic, the Anderson-Darling statistic, and the Cramér–von Mises statistic are based on the empirical distribution function (EDF). However, some EDF tests are not supported when certain combinations of the parameters of a specified distribution are estimated. See Table 4.31 for a list of the EDF tests available. You determine whether to reject the null hypothesis by examining the p-value that is associated with a goodness-of-fit statistic. When the p-value is less than the predetermined critical value ( $\alpha$ ), you reject the null hypothesis and conclude that the data did not come from the specified distribution.

If you want to test the normality assumptions for analysis of variance methods, beware of using a statistical test for normality alone. A test’s ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Because small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality. The skewness and kurtosis measures and the plots that are provided by the PLOTS option, the HISTOGRAM statement, the PROBPLOT statement, and the QQPLOT statement can be very helpful. For small sample sizes, power is low for detecting larger departures from normality that may be important. To increase the test’s ability to detect such deviations, you may want to declare significance at higher levels, such as 0.15 or 0.20, rather than the often-used 0.05 level. Again, consulting plots and additional statistics can help you assess the severity of the deviations from normality.

Shapiro-Wilk Statistic

If the sample size is less than or equal to 2000 and you specify the NORMAL option, PROC UNIVARIATE computes the Shapiro-Wilk statistic, W (also denoted as $W_ n$ to emphasize its dependence on the sample size n). The W statistic is the ratio of the best estimator of the variance (based on the square of a linear combination of the order statistics) to the usual corrected sum of squares estimator of the variance (Shapiro and Wilk 1965). When n is greater than three, the coefficients to compute the linear combination of the order statistics are approximated by the method of Royston (1992). The statistic W is always greater than zero and less than or equal to one $(0 < W \leq 1)$ .

Small values of W lead to the rejection of the null hypothesis of normality. The distribution of W is highly skewed. Seemingly large values of W (such as 0.90) may be considered small and lead you to reject the null hypothesis. The method for computing the p-value (the probability of obtaining a W statistic less than or equal to the observed value) depends on n. For n = 3, the probability distribution of W is known and is used to determine the p-value. For $n>4$ , a normalizing transformation is computed:

$\begin{eqnarray*} Z_{n} = \left\{ \begin{array}{ll} ( - \log ( \gamma - \log ( 1- W_ n ) ) - \mu ) / \sigma & \mbox{if }4 \leq n \leq 11 \\ ( \log ( 1 - W_ n ) - \mu ) / \sigma & \mbox{if }12 \leq n \leq 2000 \end{array} \right. \end{eqnarray*}$

The values of $\sigma$ , $\gamma$ , and $\mu$ are functions of n obtained from simulation results. Large values of $Z_ n$ indicate departure from normality, and because the statistic $Z_ n$ has an approximately standard normal distribution, this distribution is used to determine the p-values for $n>4$ .

EDF Goodness-of-Fit Tests

When you fit a parametric distribution, PROC UNIVARIATE provides a series of goodness-of-fit tests based on the empirical distribution function (EDF). The EDF tests offer advantages over traditional chi-square goodness-of-fit test, including improved power and invariance with respect to the histogram midpoints. For a thorough discussion, refer to D’Agostino and Stephens (1986).

The empirical distribution function is defined for a set of n independent observations $X_1,\ldots ,X_ n$ with a common distribution function $F(x)$ . Denote the observations ordered from smallest to largest as $X_{(1)},\ldots ,X_{(n)}$ . The empirical distribution function, $F_ n(x)$ , is defined as

$\begin{array}{llr} F_ n(x) = 0, & x < X_{(1)} \\ F_ n(x) = \frac{i}{n}, & X_{(i)} \leq x < X_{(i+1)} & i=1,\ldots ,n-1 \\ \nonumber F_ n(x) = 1, & X_{(n)} \leq x \end{array}$

Note that $F_ n(x)$ is a step function that takes a step of height $\frac{1}{n}$ at each observation. This function estimates the distribution function $F(x)$ . At any value x, $F_ n(x)$ is the proportion of observations less than or equal to x, while $F(x)$ is the probability of an observation less than or equal to x. EDF statistics measure the discrepancy between $F_ n(x)$ and $F(x)$ .

The computational formulas for the EDF statistics make use of the probability integral transformation $U=F(X)$ . If $F(X)$ is the distribution function of X, the random variable U is uniformly distributed between 0 and 1.

Given n observations $X_{(1)},\ldots ,X_{(n)}$ , the values $U_{(i)}=F(X_{(i)})$ are computed by applying the transformation, as discussed in the next three sections.

PROC UNIVARIATE provides three EDF tests:

Kolmogorov-Smirnov
Anderson-Darling
Cramér–von Mises

The following sections provide formal definitions of these EDF statistics.

Kolmogorov D Statistic

The Kolmogorov-Smirnov statistic (D) is defined as

$D = \mbox{sup}_ x|F_{n}(x)-F(x)|$

The Kolmogorov-Smirnov statistic belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between $F(x)$ and $F_ n(x)$ .

The Kolmogorov-Smirnov statistic is computed as the maximum of $D^{+}$ and $D^{-}$ , where $D^{+}$ is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function, and $D^{-}$ is the largest vertical distance when the EDF is less than the distribution function.

$\begin{array}{lll} D^{+} & = & \max _{i}\left(\frac{i}{n} - U_{(i)}\right) \\ D^{-} & = & \max _{i}\left(U_{(i)} - \frac{i-1}{n}\right) \\ D & = & \max \left(D^{+},D^{-}\right) \end{array}$

PROC UNIVARIATE uses a modified Kolmogorov D statistic to test the data against a normal distribution with mean and variance equal to the sample mean and variance.

Anderson-Darling Statistic

The Anderson-Darling statistic and the Cramér–von Mises statistic belong to the quadratic class of EDF statistics. This class of statistics is based on the squared difference $(F_ n(x)- F(x))^2$ . Quadratic statistics have the following general form:

$Q = n \int _{-\infty }^{+\infty } (F_ n(x)-F(x))^2 \psi (x) dF(x)$

The function $\psi (x)$ weights the squared difference $(F_ n(x)-F(x))^2$ .

The Anderson-Darling statistic ( $A^2$ ) is defined as

$A^{2} = n\int _{-\infty }^{+\infty }(F_ n(x)-F(x))^2 \left[F(x)\left(1-F(x)\right)\right]^{-1} dF(x)$

Here the weight function is $\psi (x) = \left[F(x)\left(1-F(x)\right)\right]^{-1}$ .

The Anderson-Darling statistic is computed as

$A^2 = -n-\frac{1}{n}\sum _{i=1}^ n \left[(2i-1)\log U_{(i)} + (2n+1-2i) \log (1-U_{(i)})\right]$

Cramér–von Mises Statistic

The Cramér–von Mises statistic ( $W^2$ ) is defined as

$W^2 = n \int _{-\infty }^{+\infty } (F_{n}(x)-F(x))^2 dF(x)$

Here the weight function is $\psi (x) = 1$ .

The Cramér–von Mises statistic is computed as

$W^2 = \sum _{i=1}^ n\left(U_{(i)}-\frac{2i-1}{2n}\right)^2 + \frac{1}{12n}$

Probability Values of EDF Tests

Once the EDF test statistics are computed, PROC UNIVARIATE computes the associated probability values (p-values).

For the Gumbel, inverse Gaussian, generalized Pareto, and Rayleigh distributions, PROC UNIVARIATE computes associated probability values (p-values) by resampling from the estimated distribution. By default 500 EDF test statistics are computed and then compared to the EDF test statistic for the specified (fitted) distribution. The number of samples can be controlled by setting EDFNSAMPLES=n. For example, to request Gumbel distribution Goodness-of-Fit test p-values based on 5000 simulations, use the following statement:

proc univariate data=test;
   histogram / gumbel(edfnsamples=5000);
run;

For the beta, exponential, gamma, lognormal, normal, power function, and Weibull distributions the UNIVARIATE procedure uses internal tables of probability levels similar to those given by D’Agostino and Stephens (1986). If the value is between two probability levels, then linear interpolation is used to estimate the probability value.

The probability value depends upon the parameters that are known and the parameters that are estimated for the distribution. Table 4.31 summarizes different combinations fitted for which EDF tests are available.

Table 4.31: Availability of EDF Tests

Distribution	Parameters			Tests Available
	Threshold	Scale	Shape
beta	$\theta$ known	$\sigma$ known	$\alpha , \beta$ known	all
	$\theta$ known	$\sigma$ known	$\alpha ,\beta <5$ unknown	all
exponential	$\theta$ known,	$\sigma$ known		all
	$\theta$ known	$\sigma$ unknown		all
	$\theta$ unknown	$\sigma$ known		all
	$\theta$ unknown	$\sigma$ unknown		all
gamma	$\theta$ known	$\sigma$ known	$\alpha$ known	all
	$\theta$ known	$\sigma$ unknown	$\alpha$ known	all
	$\theta$ known	$\sigma$ known	$\alpha$ unknown	all
	$\theta$ known	$\sigma$ unknown	$\alpha >1$ unknown	all
	$\theta$ unknown	$\sigma$ known	$\alpha >1$ known	all
	$\theta$ unknown	$\sigma$ unknown	$\alpha >1$ known	all
	$\theta$ unknown	$\sigma$ known	$\alpha >1$ unknown	all
	$\theta$ unknown	$\sigma$ unknown	$\alpha >1$ unknown	all
lognormal	$\theta$ known	$\zeta$ known	$\sigma$ known	all
	$\theta$ known	$\zeta$ known	$\sigma$ unknown	$A^2$ and $W^2$
	$\theta$ known	$\zeta$ unknown	$\sigma$ known	$A^2$ and $W^2$
	$\theta$ known	$\zeta$ unknown	$\sigma$ unknown	all
	$\theta$ unknown	$\zeta$ known	$\sigma <3$ known	all
	$\theta$ unknown	$\zeta$ known	$\sigma <3$ unknown	all
	$\theta$ unknown	$\zeta$ unknown	$\sigma <3$ known	all
	$\theta$ unknown	$\zeta$ unknown	$\sigma <3$ unknown	all
normal	$\theta$ known	$\sigma$ known		all
	$\theta$ known	$\sigma$ unknown		$A^2$ and $W^2$
	$\theta$ unknown	$\sigma$ known		$A^2$ and $W^2$
	$\theta$ unknown	$\sigma$ unknown		all
power function	$\theta$ known	$\sigma$ known	$\alpha$ known	all
	$\theta$ known	$\sigma$ known	$\alpha <5$ unknown	all
Weibull	$\theta$ known	$\sigma$ known	$c$ known	all
	$\theta$ known	$\sigma$ unknown	c known	$A^2$ and $W^2$
	$\theta$ known	$\sigma$ known	c unknown	$A^2$ and $W^2$
	$\theta$ known	$\sigma$ unknown	c unknown	$A^2$ and $W^2$
	$\theta$ unknown	$\sigma$ known	$c>2$ known	all
	$\theta$ unknown	$\sigma$ unknown	$c>2$ known	all
	$\theta$ unknown	$\sigma$ known	$c>2$ unknown	all
	$\theta$ unknown	$\sigma$ unknown	$c>2$ unknown	all