SAS Institute. The Power to Know


New Tools for Visualizing and Analyzing Distributions

Version 8 SAS software provides enhancements for visualizing and modeling distributions. The UNIVARIATE procedure provides new statements that enable you to produce high-resolution graphic displays including histograms, probability plots, and quantile-quantile plots. Comparative displays of these graphics are available, and you can enhance the plots with an additional box or table of summary statistics referred to as an inset. Many new statistics have been added to PROC UNIVARIATE, and new distribution fitting facilities enable you to fit a wide range of parametric distributions as well as nonparametric kernel density estimators. In addition, Version 8 SAS/STAT software now includes the BOXPLOT procedure for generating box plots.
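
As a minimal sketch of the new plotting statements, the following statements request a normal probability plot and a normal quantile-quantile plot. They assume the BPChange data set created in the example later in this article. The MU=EST and SIGMA=EST suboptions, which add a reference line based on the estimated mean and standard deviation, are an illustrative choice and not a recommendation.

   /* Sketch only: assumes the BPChange data set from the example below */
   proc univariate data=BPChange noprint;
     var BPChange;
     probplot BPChange / normal(mu=est sigma=est);
     qqplot BPChange / normal(mu=est sigma=est);
   run;

The NOPRINT option suppresses the tables of summary statistics so that only the plots are produced.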

Statistical Enhancements

PROC UNIVARIATE now provides confidence limits for basic parameters and percentiles. You can compute confidence limits for the mean, standard deviation, variance, and percentiles based on the assumption that the data are normally distributed. You can also request confidence limits for percentiles based on a distribution-free method.
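
The following statements sketch how these confidence limits might be requested. They assume the BPChange data set created in the example later in this article, and the 95% confidence level (ALPHA=0.05) is an illustrative choice. The two kinds of percentile limits are requested in separate steps here only to keep the output easy to compare.

   /* Normal-theory limits for the mean, standard deviation, */
   /* variance, and percentiles                              */
   proc univariate data=BPChange cibasic cipctlnormal alpha=0.05;
     var BPChange;
   run;

   /* Distribution-free limits for percentiles */
   proc univariate data=BPChange cipctldf alpha=0.05;
     var BPChange;
   run;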

PROC UNIVARIATE now computes both Winsorized and trimmed means as well as robust measures of scale including Gini's mean difference, the median absolute deviation about the median (MAD), and the Sn and Qn statistics. The procedure also provides facilities for fitting and plotting continuous distributions, including the normal, lognormal, exponential, gamma, beta, and Weibull, and it enables you to smooth the data distribution using kernel density estimation.
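
A minimal sketch of these options follows, again assuming the BPChange data set created in the example later in this article. The 10% trimming and Winsorizing proportion is an arbitrary value chosen for illustration, and the kernel density estimate uses the procedure's default settings.

   /* Sketch only: the proportion 0.1 is an illustrative assumption */
   proc univariate data=BPChange trimmed=0.1 winsorized=0.1 robustscale;
     var BPChange;
     histogram BPChange / kernel;   /* kernel density estimate, default settings */
   run;

A fractional value for TRIMMED= or WINSORIZED= is interpreted as the proportion of observations handled in each tail, and a whole number is interpreted as a count.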

When you fit a parametric distribution, PROC UNIVARIATE provides a series of goodness-of-fit tests and p-values based on the empirical distribution function (EDF), including the Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises statistics. Other new PROC UNIVARIATE options enable you to specify the location parameter value in the null hypothesis of a test of location, request a table of all possible modes, and control the number of tabulated extreme values and observations.
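
As a sketch of how these options fit together, the following statements fit a normal distribution within each treatment group, which produces the EDF goodness-of-fit tests, and also exercise the other new options. The data set is the BPChange data set created in the example later in this article. The null value of zero and the limit of three extremes are illustrative assumptions.

   /* Sketch only: MU0=0 and the count of 3 are illustrative values */
   proc univariate data=BPChange mu0=0 modes nextrobs=3 nextrval=3;
     class Treatment;
     var BPChange;
     histogram BPChange / normal(mu=est sigma=est);
   run;

Here MU0=0 tests whether the location of the blood pressure change differs from zero, MODES requests the table of modes, and NEXTROBS= and NEXTRVAL= control the number of extreme observations and extreme values that are tabulated.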

Box Plots

Box plots, also known as box-and-whisker plots, provide a convenient tool for comparing distributions of a quantitative variable across levels of a grouping variable. These plots display a wide range of quantitative information about a variable including its mean, median, quartiles, minimum, maximum, and outlying observations. The BOXPLOT procedure enables you to specify different methods for calculating quantiles, control the layout and appearance of the plot, and select from styles including skeletal and schematic box plots.
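
The following statements give a minimal sketch of a schematic box plot, assuming the BPChange data set created in the example later in this article. The sorted data set name BPSorted is hypothetical, and PCTLDEF=5 is shown only to illustrate where the quantile calculation method is chosen.

   /* PROC BOXPLOT expects the input data sorted by the group variable */
   proc sort data=BPChange out=BPSorted;   /* BPSorted is a hypothetical name */
     by Treatment;
   run;

   proc boxplot data=BPSorted;
     plot BPChange*Treatment / boxstyle=schematic pctldef=5;
   run;

Specifying BOXSTYLE=SKELETAL instead draws whiskers to the minimum and maximum values and does not flag outlying observations separately.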

For more detail on the enhancements to PROC UNIVARIATE and PROC BOXPLOT, refer to the paper "Are Histograms Giving You Fits? New SAS Software for Analyzing Distributions" by Nathan Curtis.

Example: Creating a Comparative Histogram with an Inset

Consider a trial testing the efficacy of a new antihypertensive medication. The study measured the change in blood pressure after nine months for patients receiving either a new medication or a placebo. The following statements create a data set that contains a treatment grouping variable and a variable that contains the change in blood pressure.

   data BPChange;
     input Treatment $ BPChange;
     datalines;
   Placebo  -14.0
   Active    -8.0
   Active   -23.0
   Placebo   2.5
   Active   -15.5
   Placebo  -12.0
   ...
   ;

The following statements create a comparative histogram with an inset to compare the change in blood pressure across treatment groups.

   proc univariate data=BPChange;
     class Treatment;
     var BPChange;
     histogram BPChange / 
       normal
       midpoints = -50 to 50 by 5
       href = 0
       vscale = count;
     inset n = "N" (5.0) mean = "Mean" (5.1) std = "Std Dev" (5.1) /
       pos = ne
       height = 3;
   run;

The Treatment variable, which is specified in the CLASS statement, defines the classification levels for the analysis. The HISTOGRAM statement requests a histogram of BPChange with a superimposed normal curve and with bins centered at values ranging from -50 to 50 in increments of 5. The HREF= option in the HISTOGRAM statement draws a reference line perpendicular to the horizontal axis at a change in blood pressure of zero, and the VSCALE=COUNT option scales the vertical axis as the number of observations per bin. By default, PROC UNIVARIATE scales the vertical axis as the percentage of observations per bin.

The INSET statement positions a table of summary statistics in the northeast corner of the display as defined by POS=NE. The table will include the number of observations, mean, and standard deviation, labeled "N," "Mean," and "Std Dev," respectively. The values in parentheses specify the format for the statistics in the inset, and the HEIGHT= option specifies the height of the text. Figure 1 displays the histogram comparing change in blood pressure for the active and placebo treatment groups.

Figure 1: Comparative Histogram with Inset

