Summary Statistics Task

About the Summary Statistics Task

The Summary Statistics task provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations. You can also summarize your data in a graphical display, such as a histogram.
For example, you could use this task to create a report on the number of new sales, arranged by product type and country.

Example: Summary Statistics of Unit Sales

In this example, you want to analyze unit sales. In addition to the tabular results, you choose to display a histogram of the distribution.
To create this example:
  1. In the Tasks section, expand the Statistics folder and double-click Summary Statistics. The user interface for the Summary Statistics task opens.
  2. On the Data tab, select the SASHELP.PRICEDATA data set.
  3. To the Analysis variables role, assign the sale column.
  4. On the Options tab, expand the Plots section and select the Histogram check box.
  5. To run the task, click Submit SAS code.
Here are the results:
Results from the MEANS and UNIVARIATE Procedures for Unit Sales

Assigning Data to Roles

To run the Summary Statistics task, you must assign a column to the Analysis variables role.
Role
Description
Roles
Analysis variables
The variables that you assign to this role are the numeric variables for which you want statistics. You must assign at least one variable to this role.
Classification variables
The variables that you assign to this role are character or discrete numeric variables that are used to divide the input data into categories or subgroups. The statistics are calculated on all selected analysis variables for each unique combination of classification variables.
Additional Roles
Group analysis by
The variables that you assign to this role are used to compute separate statistics for each distinct value or combination of values of the Group analysis by variables. The data is automatically sorted by the variables in this role before the statistics are computed.
Frequency count
When you assign a variable to this role, each observation in the table is assumed to represent n observations, where n is the value of the frequency count for that row. Statistics are calculated accordingly. You can assign a maximum of one variable to this role.
Weight variable
If you assign a variable to this role, the value of the variable for each observation is used to calculate weighted means, variances, and sums. You can assign a maximum of one variable to this role.
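As a rough illustration of how these roles affect the calculations, the following Python sketch computes a frequency-adjusted mean for each combination of two classification variables and a weighted mean across all rows. It is not the code that the task generates. The rows, the column order, and the function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical rows: (region, product, units, freq, weight).
# These values are made up and do not come from any SAS data set.
rows = [
    ("East", "A", 10.0, 2, 1.0),
    ("East", "A", 20.0, 1, 3.0),
    ("West", "A", 30.0, 1, 1.0),
    ("West", "B", 40.0, 3, 2.0),
]

def mean_by_class(rows):
    """Mean of units per (region, product) combination.

    Each row counts as `freq` observations, as with a frequency count variable.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of freq * units, sum of freq]
    for region, product, units, freq, _weight in rows:
        acc = totals[(region, product)]
        acc[0] += freq * units
        acc[1] += freq
    return {key: total / count for key, (total, count) in totals.items()}

def weighted_mean(rows):
    """Weighted mean of units over all rows: sum(w * x) / sum(w)."""
    numerator = sum(w * x for _, _, x, _, w in rows)
    denominator = sum(w for _, _, _, _, w in rows)
    return numerator / denominator

class_means = mean_by_class(rows)
overall_weighted_mean = weighted_mean(rows)
```

In this sketch, the classification variables define the dictionary keys, the frequency count multiplies each row's contribution, and the weight changes only the weighted mean.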

Setting Options

Option Name
Description
Basic Statistics
Mean
is the arithmetic average, calculated by adding the values of an analysis variable and dividing this sum by the number of nonmissing observations.
Standard deviation
is a statistical measure of the variability of a group of data values. This measure, which is the most widely used measure of the dispersion of a frequency distribution, is equal to the positive square root of the variance.
Minimum value
is the smallest value for an analysis variable.
Maximum value
is the largest value for an analysis variable.
Median
is the middle value for an analysis variable when its nonmissing values are sorted. Half of the values are at or below the median.
Number of observations
is the total number of observations with nonmissing values.
Number of missing values
is the number of observations with missing values.
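The basic statistics above can be sketched in Python as follows. This is a minimal illustration, not the task's implementation. It assumes that missing values are represented by None and that the standard deviation uses the default divisor (degrees of freedom). The function name and sample values are hypothetical.

```python
import statistics

def basic_stats(values):
    """Basic statistics over the nonmissing values; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),                      # number of nonmissing observations
        "nmiss": len(values) - len(present),    # number of missing values
        "mean": statistics.mean(present),       # sum divided by nonmissing count
        "std": statistics.stdev(present),       # divisor n - 1
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
    }

# Hypothetical sample with two missing values.
sample = [4.0, None, 8.0, 6.0, 2.0, None, 10.0]
stats = basic_stats(sample)
```

Note that the missing values are excluded before any statistic is computed, so the mean is divided by the number of nonmissing observations.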
Additional Statistics
Standard error
is the standard deviation of the sample mean. The standard error is defined as the ratio of the sample standard deviation to the square root of the sample size.
Note: This option is available only if Degrees of freedom is selected in the Divisor for standard deviation and variance drop-down list.
Variance
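is a statistical measure of the dispersion of data values. This measure is the sum of the squared deviations of each observation from the sample mean, divided by the divisor that is selected for the standard deviation and variance (by default, the degrees of freedom).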
Mode
is the most frequent value for the analysis variable.
Range
is the difference between the largest and the smallest values in the data.
Sum
is the sum of all values in the analysis variable.
Sum of weights
is the sum of the numeric variable that is used to weight each observation.
Note: You cannot compute the sum of the weights unless you assign a variable to the Weight variable role.
Confidence limits for the mean
are the two-sided confidence limits for the mean. A two-sided $100(1-\alpha)\%$ confidence interval for the mean has the following upper and lower limits: $\bar{x} \pm t_{(1-\alpha/2;\,n-1)} \frac{s}{\sqrt{n}}$, where $s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$ and $t_{(1-\alpha/2;\,n-1)}$ is the $(1-\alpha/2)$ quantile of the Student's t distribution with $n-1$ degrees of freedom.
Coefficient of variation
is a unitless measure of relative variability. This measure is defined as the ratio of the standard deviation to the mean expressed as a percentage. The coefficient of variation is meaningful only if the variable is measured on a ratio scale.
Skewness
measures the tendency of the deviations from the mean to be larger in one direction than in the other.
Kurtosis
measures the heaviness of the tails of the distribution.
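To make the standard error, the confidence limits for the mean, and the coefficient of variation concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the task's implementation. The Student's t quantile is passed in as a tabulated value because the Python standard library has no t distribution. The function names, the constant name, and the sample values are hypothetical.

```python
import math
import statistics

def mean_confidence_limits(values, t_crit):
    """Return (lower, upper) limits: mean -/+ t_crit * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)        # sample standard deviation, divisor n - 1
    std_err = s / math.sqrt(n)          # standard error of the mean
    half_width = t_crit * std_err
    return mean - half_width, mean + half_width

def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

values = [2.0, 4.0, 6.0, 8.0, 10.0]

# Tabulated 0.975 quantile of Student's t with 4 degrees of freedom
# (alpha = 0.05, n = 5); supplied by hand as an assumption of this sketch.
T_CRIT_95_DF4 = 2.776

lower, upper = mean_confidence_limits(values, T_CRIT_95_DF4)
cv = coefficient_of_variation(values)
```

For other sample sizes or confidence levels, the quantile would need to be looked up for the matching degrees of freedom.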
Percentile Statistics
1st, 5th, 10th, Lower quartile, Median, Upper quartile, 90th, 95th, 99th, Interquartile range
choose the percentiles and quantiles to compute.
Quantile method
specifies the method that is used to compute the quantiles, median, and percentiles.
Order statistics
reads all of the data into memory and sorts it by the unique values.
Piecewise-parabolic algorithm
approximates the quantile and is a less memory-intensive method.
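The order statistics approach can be sketched in Python as shown below. The percentile definition used here (average the two neighboring order statistics when n times p is an integer, otherwise take the next order statistic) is one common definition and is an assumption of this sketch. This section does not state which definition the task applies. The function name and data are hypothetical, and the piecewise-parabolic approximation is not shown.

```python
def percentile(values, pct):
    """Percentile from fully sorted data; pct is an integer from 1 to 99.

    Assumed definition: write n * pct / 100 = j + g with integer j.
    If g == 0, return the average of the j-th and (j+1)-th order statistics;
    otherwise return the (j+1)-th order statistic (1-based positions).
    """
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)                  # all data is held and sorted in memory
    n = len(ordered)
    j, remainder = divmod(n * pct, 100)       # integer arithmetic avoids rounding issues
    if remainder == 0:
        return (ordered[j - 1] + ordered[j]) / 2
    return ordered[j]

# Hypothetical data.
data = [7, 1, 5, 3, 9, 11, 15, 13]
lower_quartile = percentile(data, 25)
median = percentile(data, 50)
upper_quartile = percentile(data, 75)
interquartile_range = upper_quartile - lower_quartile
p90 = percentile(data, 90)
```

Because every value is sorted in memory, this approach is exact under the chosen definition but uses more memory than an approximation would on large data.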
Plots
Histogram
creates a graph that is used to determine the distribution of the data. If you add a normal density curve, the task uses the sample mean and sample standard deviation for $\mu$ and $\sigma$. If you add a kernel density curve, the task uses the AMISE method to compute the kernel density estimates.
To include the statistics in the graph, select the Add inset statistics check box.
Comparative box plot
creates a graph that shows a measure of central location (the median), two measures of dispersion (the range and interquartile range), the skewness (from the orientation of the median relative to the quartiles), and potential outliers. Box plots are especially useful in comparing two or more sets of data.
Note: The Comparative box plot option is available only when no column is assigned to the Classification variables role.
You can choose to add the overall inset statistics to the graph or only the inset statistics for each group.
Combine histogram and box plot
displays the histogram and box plots together in a single panel, sharing common X axes. You can choose to add the overall inset statistics to the graph.
Note: The Combine histogram and box plot option is available only when no column is assigned to the Classification variables role.
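As a small illustration of the normal density curve described for the Histogram option, the following Python sketch fits a normal distribution by using the sample mean and sample standard deviation as its parameters. It is not the task's plotting code. The sample values are hypothetical, and the sketch assumes the divisor n - 1 for the standard deviation.

```python
import statistics

# Hypothetical analysis variable values.
values = [2.0, 4.0, 6.0, 8.0, 10.0]

mu = statistics.mean(values)        # sample mean used for the curve's location
sigma = statistics.stdev(values)    # sample standard deviation (divisor n - 1)

# Normal density with the estimated parameters; pdf(x) gives the curve height at x.
curve = statistics.NormalDist(mu, sigma)
density_at_mean = curve.pdf(mu)
curve_points = [(x, curve.pdf(x)) for x in (2.0, 6.0, 10.0)]
```

The curve is highest at the sample mean and is symmetric around it, which is why a strongly skewed histogram fits it poorly.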
Details
Divisor for standard deviation and variance
specifies the divisor to use in the calculation of the variance and standard deviation. Here are the valid options:
Degrees of freedom
$n - 1$
By default, the divisor for the variance is the degrees of freedom.
Number of observations
$n$
Sum of weights minus one
$\left(\sum_i w_i\right) - 1$
Sum of weights
$\sum_i w_i$
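The effect of each divisor can be sketched in Python as follows. This is an illustration under stated assumptions, not the task's implementation. It assumes that the variance is the weighted sum of squared deviations from the weighted mean, divided by the selected divisor. The function name and the divisor keys ("df", "n", "wdf", "weight") are names chosen for this sketch.

```python
def variance(values, weights=None, divisor="df"):
    """Variance with a selectable divisor.

    divisor: "df" (n - 1), "n" (n), "wdf" (sum of weights - 1),
    or "weight" (sum of weights). Without weights, every weight is 1.
    """
    if weights is None:
        weights = [1.0] * len(values)
    n = len(values)
    sum_w = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / sum_w
    # Weighted sum of squared deviations from the (weighted) mean.
    css = sum(w * (x - mean) ** 2 for w, x in zip(weights, values))
    denominators = {
        "df": n - 1,
        "n": n,
        "wdf": sum_w - 1,
        "weight": sum_w,
    }
    return css / denominators[divisor]

# Hypothetical values and weights.
values = [2.0, 4.0, 6.0, 8.0, 10.0]
weights = [1.0, 1.0, 2.0, 1.0, 1.0]

var_df = variance(values)                              # 40 / 4
var_n = variance(values, divisor="n")                  # 40 / 5
var_wdf = variance(values, weights, divisor="wdf")     # 40 / 5
var_weight = variance(values, weights, divisor="weight")  # 40 / 6
```

The standard deviation for each choice is the positive square root of the corresponding variance.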
Output Data Set
You can specify whether to save the statistics in an output data set. By default, this data set is saved in the Work library.