24983 - The MultNorm macro tests multivariate normality

MultNorm macro to test multivariate normality

Contents:

Purpose / History / Requirements / Usage / Details / Limitations / Missing Values / References

PURPOSE:

The MultNorm macro provides tests and plots of multivariate normality, including the Mardia skewness and kurtosis tests, the Royston H test, the Henze-Zirkler test, and the Doornik-Hansen test. A test of univariate normality is also given for each of the variables. You can obtain a chi-square quantile-quantile plot of the observations' squared Mahalanobis distances, allowing a visual assessment of multivariate normality. Univariate histograms with overlaid normal curves are also available.

HISTORY:

The version of the MultNorm macro that you are using is displayed when you specify anything as the first argument. Here is an example:

%multnorm(v)

The MultNorm macro always attempts to check for a later version of itself. If it is unable to do this (such as if there is no active internet connection available), the macro issues the following message:

NOTE: Unable to check for newer version of MultNorm macro.

The computations performed by the macro are not affected by the appearance of this message. However, you can avoid this check by specifying nochk as the first macro argument. This action can be useful if your machine has no connection to the internet.
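As a minimal sketch of skipping the version check: the data set name mydata and the variables x1, x2, and x3 are hypothetical, and combining nochk with the keyword options assumes the usual SAS macro rule that positional arguments come before keyword arguments.

```sas
/* nochk as the first (positional) argument skips the check for a newer version. */
/* mydata, x1, x2, and x3 are placeholder names, not part of the macro.          */
%multnorm(nochk, data=mydata, var=x1 x2 x3)
```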

Version	Update Notes
2.0	Added the Royston and Doornik-Hansen tests. SAS/ETS® is no longer required or used to compute any test. var= is now optional. plot=mult is now the default. nochk is added. All tests are saved in data set _STATS.
1.4	SAS/IML® is no longer required if SAS/ETS PROC MODEL is not found. SAS/STAT® PROC PRINCOMP is required instead. Univariate plots, if requested, and tests are now presented first. High-resolution plotting is done by Base SAS® PROC SGPLOT if available, or by SAS/GRAPH® PROC GPLOT if not. It checks that the specified data set and variables exist.
1.3	Added a message showing whether MODEL or IML is selected. Added a check for error status after MODEL or IML. Errors terminate the macro. Added an automatic check for a newer version. Documented the difference between tests in PROC MODEL and PROC UNIVARIATE.
1.2	Use SAS/ETS PROC MODEL if available to get all tests, then SAS/IML, and then univariate only. Use ODS SELECT to obtain only the normal table from MODEL (requires SAS® 8 or later). Provide univariate histograms with overlaid normal curves and tests controlled by the expanded PLOT= parameter.
1.1	Use the PVALUE format. Prefix notes from macro with MULTNORM: instead of NOTE:.
1.0	Initial coding.

REQUIREMENTS:

Only Base SAS is required for the univariate tests and histograms and for the Royston multivariate test. SAS/STAT is required for the Mardia, Henze-Zirkler, and Doornik-Hansen multivariate tests and for the multivariate chi-square plot.

USAGE:

Follow the instructions on the Downloads tab of this sample to save the MultNorm macro definition. Replace the text within quotes in the following statement with the location of the MultNorm macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the MultNorm macro and make it available for use:

   %inc "<location of your file containing the MultNorm macro>";

Following this statement, you can call the MultNorm macro. See the Results tab for an example.

The options and allowable values are as follows:

data=data-set-name: SAS data set to be analyzed. If data= is not supplied, the most recently created SAS data set is used.
var=variable-list: The list of variables to use when testing for univariate and multivariate normality. Individual variable names, separated by blanks, must be specified. Special variable lists (such as VAR1-VAR10 or ABC--XYZ) cannot be used. If not specified, all numeric variables in the data= data set are used.
plot=both | mult | uni | none: plot=mult (the default) requests a chi-square quantile-quantile (Q-Q) plot of the squared Mahalanobis distances of the observations from the mean vector. plot=uni requests a univariate histogram for each variable with overlaid normal density curves and additional univariate tests of normality. plot=both requests both of the above. plot=none suppresses all plots and the additional univariate normality tests.
hires=yes | no: Ignored if plot=none. hires=yes (the default) requests that high-resolution graphics be used when creating plots. PROC SGPLOT is used by default but, if not found, SAS/GRAPH PROC GPLOT is used. If SAS/GRAPH PROC GPLOT is used, you must specify any needed graphics-related options before invoking the macro. hires=no requests that the multivariate plot be drawn with low-resolution using PROC PLOT. The univariate plots are not available with hires=no.
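The options above might be combined as in the following sketch. It assumes that the SASHELP.IRIS sample data set is available in your installation and copies it to a work data set named iris; those names are illustrative and are not part of the macro.

```sas
/* Replace the quoted text with the actual location of the macro definition file. */
%inc "<location of your file containing the MultNorm macro>";

/* Working copy of a sample data set (assumes SASHELP.IRIS exists). */
data iris;
   set sashelp.iris;
run;

/* Test four named variables and request both the univariate histograms */
/* and the multivariate chi-square Q-Q plot.                            */
%multnorm(data=iris, var=SepalLength SepalWidth PetalLength PetalWidth, plot=both)

/* Omit var= to analyze all numeric variables; plot=none suppresses all plots. */
%multnorm(data=iris, plot=none)
```

Note that var= takes individual names separated by blanks, so a list such as SepalLength--PetalWidth would not be accepted.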

DETAILS:

In order for a set of variables to be distributed as multivariate normal, each variable must be normally distributed. When all individual variables are normally distributed, the set of variables might or might not be distributed as multivariate normal. Hence, testing each variable only for univariate normality is not a sufficient test of multivariate normality.

Univariate tests and plots

Univariate normality for each of the analysis variables specified in var= is assessed using the Shapiro-Wilk W test (for sample size 2000 or less) or the Kolmogorov-Smirnov test, depending on the sample size, as done in the UNIVARIATE procedure. Additional tests are provided if univariate plots are requested with plot=uni or plot=both. For details about the univariate tests of normality, see Goodness Of Fit Tests in the Details section of the PROC UNIVARIATE documentation.

If the p-value of any of the tests is small, then multivariate, as well as univariate, normality can be rejected. However, it is important to note that the univariate Shapiro-Wilk W test is very powerful and is capable of detecting trivially small departures from univariate normality as the sample size becomes large. This might cause you to reject univariate, and therefore multivariate, normality unnecessarily if the tests are being done to validate the use of methods that are robust to small departures from normality. For such situations, the plots are useful by providing a visual assessment of approximate normality.

Multivariate tests and plot

Four tests of multivariate normality are available in the MultNorm macro. Mardia (1974) proposed tests of multivariate normality based on sample measures of multivariate skewness and kurtosis. The Henze-Zirkler test of multivariate normality is based on a nonnegative function that measures the distance between two distribution functions and is used to assess the distance between the distribution function of the data and the multivariate normal distribution function. Royston (1983, 1992, 1993) introduced a multivariate test based on and extending the Shapiro-Wilk W test for univariate normality. The Doornik-Hansen test uses skewness and kurtosis to create an omnibus test of multivariate normality.

For all of the tests provided, a small p-value rejects the null hypothesis of multivariate normality.

All tests except the Royston test, as well as the chi-square plot, require the PRINCOMP procedure in SAS/STAT to compute principal component scores or the eigenvalues and eigenvectors of the correlation matrix of the original variables. If the correlation matrix of the data is singular or PROC PRINCOMP is not available, a message is written to the log and only the Royston multivariate normality test is performed.

Under the normal distribution, the expected multivariate skewness is p(p+2)[(n+1)(p+1)-6]/[(n+1)(n+3)] and the expected multivariate kurtosis is p(p+2)(n-1)/(n+1), where n is the number of observations and p is the number of variables. MultNorm displays centered values (observed minus expected) of these statistics, and a small p-value indicates a significant deviation of the observed measure from its expected value under normality. Mardia's multivariate skewness statistic and its p-value are computed using a small-sample correction multiplier. Because this correction is very near 1 once the sample size exceeds about 100, it is always included. The uncentered skewness and kurtosis measures, their expected values under normality, and the uncorrected multivariate skewness statistic and its p-value are not included in the table displayed by the macro but are included in the output data set, _STATS.
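To make the expected values concrete, the following DATA step evaluates both expressions for a hypothetical case of n = 150 observations and p = 4 variables. The variable names exp_skew and exp_kurt are illustrative only; this is not code taken from the macro.

```sas
data _null_;
   n = 150;   /* number of observations (hypothetical) */
   p = 4;     /* number of variables (hypothetical)    */

   /* Expected multivariate skewness under normality; note that the */
   /* whole product (n+1)(n+3) is the denominator.                  */
   exp_skew = p*(p+2)*((n+1)*(p+1) - 6) / ((n+1)*(n+3));

   /* Expected multivariate kurtosis under normality */
   exp_kurt = p*(p+2)*(n-1) / (n+1);

   put exp_skew= 8.4 exp_kurt= 8.4;
run;
```

For these inputs the expected skewness is about 0.778 and the expected kurtosis is about 23.682. As n grows, the expected skewness tends to 0 and the expected kurtosis tends to p(p+2) = 24.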

Many tests for multivariate normality have been proposed, and while no single test has been found to be uniformly best, those offered by the MultNorm macro are among the ones most often used. Farrell et al. (2007) found that the Type I error rate was well-preserved by both the Royston test and the Doornik-Hansen test over a wide range of sample sizes and numbers of variables. The Henze-Zirkler test also performs well in this regard for sample sizes above 75 and has good statistical power against alternative distributions, but is slightly conservative for smaller sample sizes. Royston's test exhibits good power for smaller sample sizes.

Chi-square Q-Q plot

For p variables and a large sample size, the squared Mahalanobis distances of the observations to the mean vector are distributed approximately as chi-square with p degrees of freedom. However, unless p is very small, the sample size must be quite large for the chi-square approximation to hold. This plot is also sensitive to the presence of outliers. It should therefore be used cautiously, as a rough indicator of multivariate normality.
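One way to build such a plot by hand is sketched below. This is not the macro's own code. It assumes the SASHELP.IRIS sample data set (p = 4 variables), uses the plotting position (i - 0.5)/n, and uses working data set names (_pc, _d2, _qq) chosen only for illustration. It relies on the fact that the sum of squared standardized principal component scores equals the squared Mahalanobis distance.

```sas
/* Working copy of a sample data set (assumes SASHELP.IRIS exists). */
data iris;
   set sashelp.iris;
run;

/* STD scales each principal component score to unit variance. */
proc princomp data=iris std out=_pc noprint;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Squared Mahalanobis distance = sum of squared standardized scores. */
data _d2;
   set _pc;
   d2 = uss(of prin1-prin4);
   keep d2;
run;

proc sort data=_d2;
   by d2;
run;

/* Pair the i-th ordered distance with a chi-square quantile on 4 df. */
data _qq;
   set _d2 nobs=n;
   chisq_q = cinv((_n_ - 0.5) / n, 4);
run;

/* Points near the 45-degree reference line are consistent with normality. */
proc sgplot data=_qq;
   scatter x=chisq_q y=d2;
   lineparm x=0 y=0 slope=1;
   xaxis label="Chi-square quantile (4 df)";
   yaxis label="Squared Mahalanobis distance";
run;
```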

BY group processing

While the MultNorm macro does not directly support BY group processing, this capability can be provided by the RunBY macro that can run the MultNorm macro repeatedly for each of the BY groups in your data. See the RunBY macro documentation for details about its use. Also see the example titled "BY group processing" on the Results tab.

Output data set

The results of the univariate Shapiro-Wilk or Kolmogorov-Smirnov tests as well as any multivariate tests are saved in data set _STATS. If many variables are tested for univariate normality, the p-values in this data set can be adjusted for multiple testing by using it as input to the MULTTEST procedure.
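A hedged sketch of that adjustment follows. The name of the p-value variable in _STATS is not given in this documentation, so Prob below is only a placeholder assumption; inspect the data set first and substitute the actual name. You might also need a WHERE= data set option to restrict the input to the univariate tests.

```sas
/* Inspect _STATS to find the actual variable names. */
proc contents data=_stats;
run;

/* ASSUMPTION: Prob is a placeholder for the p-value variable in _STATS. */
%let pvar = Prob;

/* Holm and false discovery rate adjustments of the raw p-values. */
proc multtest inpvalues(&pvar)=_stats holm fdr;
run;
```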

LIMITATIONS:

The Doornik-Hansen test requires 8 or more observations. The Royston test is limited to sample sizes between 4 and 2,000 observations.

Memory and time requirements increase with both the sample size and number of variables. Data sets having thousands of observations and/or hundreds of variables might require excessive running time or memory.

MISSING VALUES:

Observations with missing values in any of the analysis variables are omitted from the analysis and plot.

REFERENCES:

Doornik, J.A. and Hansen, H. (2008), "An Omnibus Test for Univariate and Multivariate Normality," Oxford Bulletin of Economics and Statistics, 70(Supplement 1), 927-939.

Farrell, P.J., Salibian-Barrera, M. and Naczk, K. (2007), "On tests for multivariate normality and associated simulation studies," Journal of Statistical Computation & Simulation, 77(12), 1065-1080.

Henze, N. and Zirkler, B. (1990), "A Class of Invariant Consistent Tests for Multivariate Normality," Communications in Statistics, Part A - Theory and Methods, 19(10), 3595-3617.

Mardia, K.V. (1970), "Measures of Multivariate Skewness and Kurtosis with Applications," Biometrika, 57(3), 519-530.

Mardia, K.V. (1974), "Applications of some measures of multivariate skewness and kurtosis in testing normality and robustness studies," Sankhya B, 36, 115-128.

Mardia, K.V. (1975), "Assessment of Multinormality and the Robustness of Hotelling's T-squared Test," Applied Statistics, 24(2), 163-171.

Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Analysis, New York: Academic Press.

Royston, J.P. (1982), "An Extension of Shapiro and Wilk's W Test for Normality to Large Samples," Applied Statistics, 31, 115-124.

Royston, J.P. (1983), "Techniques for Assessing Multivariate Normality Based on the Shapiro-Wilk W," Journal of the Royal Statistical Society, Series C (Applied Statistics), 32(2), 121-133.

Royston, J.P. (1992), "Approximating the Shapiro-Wilk W-Test for Non-normality," Statistics and Computing, 2, 117-119.

Royston, J.P. (1993), "A Toolkit for Testing for Non-Normality in Complete and Censored Samples," Journal of the Royal Statistical Society, Series D (The Statistician), 42(1), 37-43.

Shapiro, S.S. and Wilk, M.B. (1965), "An Analysis of Variance Test for Normality (complete samples)," Biometrika, 52, 591-611.

Svantesson, T. and Wallace, J.W. (2003), "Tests for assessing multivariate normality and the covariance structure of MIMO data," Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing.

These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.

Type:	Sample
Topic:	Analytics ==> Exploratory Data Analysis; Analytics ==> Multivariate Analysis; Analytics ==> Descriptive Statistics; SAS Reference ==> Macro; Analytics ==> Analysis of Variance

Date Modified:	2022-08-24 10:20:37
Date Created:	2005-01-13 15:02:42

Product Family	Product	Host	SAS Release (Starting)	SAS Release (Ending)
SAS System	SAS/STAT	All	n/a	n/a
