The UNIVARIATE Procedure

Descriptive Statistics

This section provides computational details for the descriptive statistics that are computed with the PROC UNIVARIATE statement. These statistics can also be saved in an OUT= data set by specifying keywords listed in Table 4.14 in the OUTPUT statement.

Standard algorithms (Fisher 1973) are used to compute the moment statistics. The computational methods used by the UNIVARIATE procedure are consistent with those used by other SAS procedures for calculating descriptive statistics.

The following sections give specific details on a number of statistics calculated by the UNIVARIATE procedure.

Mean

The sample mean is calculated as

\[ \bar{x}_ w = \frac{\sum ^ n_{i=1} w_ i x_ i}{\sum ^ n_{i=1} w_ i} \]

where n is the number of nonmissing values for a variable, $x_ i$ is the ith value of the variable, and $w_ i$ is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the formula reduces to

\[ \bar{x} = \frac{1}{n} \sum ^ n_{i=1} x_ i \]
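The two formulas above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the procedure's implementation; the function name `weighted_mean` is hypothetical, and the sketch assumes the inputs contain no missing values and that the weights sum to a nonzero value.

```python
def weighted_mean(x, w=None):
    """Return sum(w_i * x_i) / sum(w_i); with w=None every weight is 1."""
    if w is None:
        w = [1.0] * len(x)  # no WEIGHT variable: the formula reduces to sum(x) / n
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    total_weight = sum(w)                                # the "sum of the weights"
    weighted_sum = sum(wi * xi for xi, wi in zip(x, w))  # the weighted "sum"
    return weighted_sum / total_weight


print(weighted_mean([1, 2, 3]))             # unweighted: (1 + 2 + 3) / 3
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # weighted: (1 + 2 + 6) / 4
```

The intermediate values `weighted_sum` and `total_weight` correspond to the sum and the sum of the weights.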

Sum

The sum is calculated as $\sum ^ n_{i=1} w_ i x_ i$, where n is the number of nonmissing values for a variable, $x_ i$ is the ith value of the variable, and $w_ i$ is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the formula reduces to $\sum ^ n_{i=1} x_ i$.

Sum of the Weights

The sum of the weights is calculated as $\sum ^ n_{i=1} w_ i$, where n is the number of nonmissing values for a variable and $w_ i$ is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the sum of the weights is n.

Variance

The variance is calculated as

\[ \frac{1}{d} \sum ^ n_{i=1} w_ i (x_ i-{\bar{x}}_ w)^2 \]

where n is the number of nonmissing values for a variable, $x_ i$ is the ith value of the variable, ${\bar{x}}_ w$ is the weighted mean, $w_ i$ is the weight associated with the ith value of the variable, and d is the divisor controlled by the VARDEF= option in the PROC UNIVARIATE statement:

\[ d = \left\{ \begin{array}{cl} n-1 & \mbox{if VARDEF=DF (default)} \\ n & \mbox{if VARDEF=N} \\ (\sum _ i w_ i) - 1 & \mbox{if VARDEF=WDF} \\ \sum _ i w_ i & \mbox{if VARDEF=WEIGHT | WGT} \end{array} \right. \]

If there is no WEIGHT variable, the formula reduces to

\[ \frac{1}{d} \sum ^ n_{i=1} (x_ i-\bar{x})^2 \]
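To make the role of the divisor d concrete, the following Python sketch computes the variance for each VARDEF= setting. The function name `variance` and the string-valued `vardef` argument are hypothetical, chosen only to mirror the option names. Returning `None` when the divisor is not positive is an assumption of this sketch, not documented behavior.

```python
def variance(x, w=None, vardef="DF"):
    """Weighted variance with the divisor d chosen as for the VARDEF= option."""
    n = len(x)
    if w is None:
        w = [1.0] * n  # no WEIGHT variable
    sum_w = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sum_w
    divisors = {
        "DF": n - 1,       # default
        "N": n,
        "WDF": sum_w - 1,
        "WEIGHT": sum_w,
        "WGT": sum_w,      # alias of WEIGHT
    }
    d = divisors[vardef.upper()]
    if d <= 0:
        return None  # assumption of this sketch: treat the result as missing
    # Weighted corrected sum of squares, then divide by d.
    css = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w))
    return css / d


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))               # divisor n - 1
print(variance(data, vardef="N"))   # divisor n
```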

Standard Deviation

The standard deviation is calculated as

\[ s_ w = \sqrt { \frac{1}{d} \sum ^ n_{i=1} w_ i (x_ i-\bar{x}_ w)^2 } \]

where n is the number of nonmissing values for a variable, $x_ i$ is the ith value of the variable, ${\bar{x}}_ w$ is the weighted mean, $w_ i$ is the weight associated with the ith value of the variable, and d is the divisor controlled by the VARDEF= option in the PROC UNIVARIATE statement. If there is no WEIGHT variable, the formula reduces to

\[ s = \sqrt { \frac{1}{d} \sum ^ n_{i=1} (x_ i-\bar{x})^2 } \]
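As a worked illustration with made-up data (not taken from the procedure's output), take $x = (1, 2, 3)$ with weights $w = (1, 1, 2)$ and the default VARDEF=DF, so that $n = 3$ and $d = n - 1 = 2$:

\[ \bar{x}_ w = \frac{1 + 2 + 6}{4} = 2.25, \qquad \sum ^ n_{i=1} w_ i (x_ i-\bar{x}_ w)^2 = 1.5625 + 0.0625 + 1.125 = 2.75, \qquad s_ w = \sqrt { \frac{2.75}{2} } \approx 1.1726 \]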

Skewness

The sample skewness, which measures the tendency of the deviations to be larger in one direction than in the other, is calculated as follows depending on the VARDEF= option:

Table 4.29: Formulas for Skewness

\[ \begin{array}{ll}
\mbox{VARDEF} & \mbox{Formula} \\ \hline
\mbox{DF (default)} & {\displaystyle \frac{n}{(n-1)(n-2)} \sum _{i=1}^ n w_ i^{3/2} \left( \frac{x_ i-\bar{x}_ w}{s_ w} \right)^3} \\
\mbox{N} & {\displaystyle \frac{1}{n} \sum _{i=1}^ n w_ i^{3/2} \left( \frac{x_ i-\bar{x}_ w}{s_ w} \right)^3} \\
\mbox{WDF} & \mbox{missing} \\
\mbox{WEIGHT | WGT} & \mbox{missing}
\end{array} \]

where n is the number of nonmissing values for a variable, $x_ i$ is the ith value of the variable, $\bar{x}_ w$ is the sample average, $s_ w$ is the sample standard deviation, and $w_ i$ is the weight associated with the ith value of the variable. If VARDEF=DF, then n must be greater than 2. If there is no WEIGHT variable, then $w_ i = 1$ for all $i=1,\ldots ,n$.

The sample skewness can be positive or negative; it measures the asymmetry of the data distribution and estimates the theoretical skewness $\sqrt {\beta _1} = \mu _3 \mu _2^{-\frac{3}{2}}$, where $\mu _2$ and $\mu _3$ are the second and third central moments. Observations that are normally distributed should have a skewness near zero.
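The skewness formulas for VARDEF=DF and VARDEF=N can be sketched in Python as follows. The function name `skewness` is hypothetical, `None` stands in for a missing value, and the sketch assumes that $s_ w$ is computed with the divisor that matches the chosen VARDEF= setting. Returning `None` when the standard deviation is zero is also an assumption of this sketch.

```python
import math


def skewness(x, w=None, vardef="DF"):
    """Sample skewness for VARDEF=DF or N; None (missing) otherwise."""
    n = len(x)
    if w is None:
        w = [1.0] * n  # no WEIGHT variable: w_i = 1 for all i
    key = vardef.upper()
    if key not in ("DF", "N"):
        return None  # WDF and WEIGHT | WGT give a missing value
    if key == "DF" and n <= 2:
        return None  # VARDEF=DF requires n > 2
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sum(w)
    d = n - 1 if key == "DF" else n
    css = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w))
    if css == 0:
        return None  # assumption: undefined when all values are equal
    s = math.sqrt(css / d)
    total = sum(wi ** 1.5 * ((xi - mean) / s) ** 3 for xi, wi in zip(x, w))
    coef = n / ((n - 1) * (n - 2)) if key == "DF" else 1.0 / n
    return coef * total


print(skewness([1, 2, 3, 4, 5]))            # symmetric data
print(skewness([0, 0, 0, 4], vardef="N"))   # long right tail, positive value
```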

Kurtosis

The sample kurtosis, which measures the heaviness of tails, is calculated as follows depending on the VARDEF= option:

Table 4.30: Formulas for Kurtosis

\[ \begin{array}{ll}
\mbox{VARDEF} & \mbox{Formula} \\ \hline
\mbox{DF (default)} & {\displaystyle \frac{n (n+1)}{(n-1)(n-2)(n-3)} \sum _{i=1}^ n w_ i^2 \left( \frac{x_ i-\bar{x}_ w}{s_ w} \right)^4 - \frac{3 (n-1)^2}{(n-2)(n-3)}} \\
\mbox{N} & {\displaystyle \frac{1}{n} \sum _{i=1}^ n w_ i^2 \left( \frac{x_ i-\bar{x}_ w}{s_ w} \right)^4 - 3} \\
\mbox{WDF} & \mbox{missing} \\
\mbox{WEIGHT | WGT} & \mbox{missing}
\end{array} \]


where n is the number of nonmissing values for a variable, $x_ i$ is the ith value of the variable, $\bar{x}_ w$ is the sample average, $s_ w$ is the sample standard deviation, and $w_ i$ is the weight associated with the ith value of the variable. If VARDEF=DF, then n must be greater than 3. If there is no WEIGHT variable, then $w_ i = 1$ for all $i=1,\ldots ,n$.

The sample kurtosis measures the heaviness of the tails of the data distribution. It estimates the adjusted theoretical kurtosis denoted as $\beta _2-3$, where $\beta _2 = \frac{\mu _4}{{\mu _2}^2}$, and $\mu _4$ is the fourth central moment. Observations that are normally distributed should have a kurtosis near zero.
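A Python sketch of the kurtosis formulas for VARDEF=DF and VARDEF=N is given below. As before, this illustrates the arithmetic only: the function name `kurtosis` is hypothetical, `None` stands in for a missing value, and $s_ w$ is assumed to use the divisor that matches the chosen VARDEF= setting.

```python
import math


def kurtosis(x, w=None, vardef="DF"):
    """Sample (excess) kurtosis for VARDEF=DF or N; None (missing) otherwise."""
    n = len(x)
    if w is None:
        w = [1.0] * n  # no WEIGHT variable: w_i = 1 for all i
    key = vardef.upper()
    if key not in ("DF", "N"):
        return None  # WDF and WEIGHT | WGT give a missing value
    if key == "DF" and n <= 3:
        return None  # VARDEF=DF requires n > 3
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sum(w)
    d = n - 1 if key == "DF" else n
    css = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w))
    if css == 0:
        return None  # assumption: undefined when all values are equal
    s = math.sqrt(css / d)
    total = sum(wi ** 2 * ((xi - mean) / s) ** 4 for xi, wi in zip(x, w))
    if key == "DF":
        lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
        correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
        return lead * total - correction
    return total / n - 3


print(kurtosis([1, 2, 3, 4, 5]))            # flat data, negative value
print(kurtosis([0, 0, 0, 4], vardef="N"))
```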

Coefficient of Variation (CV)

The coefficient of variation is calculated as

\[ CV = \frac{100 \times s_ w}{\bar{x}_ w} \]
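The coefficient of variation combines the two quantities defined earlier. The Python sketch below is a minimal illustration: the function name `coefficient_of_variation` is hypothetical, and returning `None` when the mean is zero or the divisor is not positive is an assumption of this sketch, not documented behavior.

```python
import math


def coefficient_of_variation(x, w=None, vardef="DF"):
    """Return 100 * s_w / mean_w, or None when it cannot be computed."""
    n = len(x)
    if w is None:
        w = [1.0] * n  # no WEIGHT variable
    sum_w = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sum_w
    d = {"DF": n - 1, "N": n, "WDF": sum_w - 1,
         "WEIGHT": sum_w, "WGT": sum_w}[vardef.upper()]
    if d <= 0 or mean == 0:
        return None  # assumption of this sketch: treat the result as missing
    s = math.sqrt(sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w)) / d)
    return 100.0 * s / mean


print(coefficient_of_variation([1, 2, 3]))  # s = 1, mean = 2
```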

Geometric Mean

The geometric mean is calculated as

\[ \left(\ \prod ^ n_{i=1} x_ i^{w_ i} \right)^{1/\sum ^ n_{i=1} w_ i} \]

where n is the number of nonmissing values for a variable, $x_ i$ is the ith value of the variable, and $w_ i$ is the weight associated with the ith value of the variable.

If there is no WEIGHT variable, the formula reduces to

\[ \left(\ \prod ^ n_{i=1} x_ i \right)^{1/n} \]

If any $x_ i$ is negative, the geometric mean is set to missing.
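A direct product can overflow or underflow for long series, so a common way to evaluate a geometric mean is through logarithms. The Python sketch below takes that approach, with the weights entering as exponents on the values, that is, $\prod x_ i^{w_ i}$ raised to the power $1/\sum w_ i$. It is not necessarily how the procedure computes the statistic. The function name `geometric_mean` is hypothetical, `None` stands in for a missing value, and the sketch assumes positive weights. Returning 0 when any value is zero follows from the product being zero.

```python
import math


def geometric_mean(x, w=None):
    """Weighted geometric mean via logs; None (missing) if any value is negative."""
    if w is None:
        w = [1.0] * len(x)  # no WEIGHT variable: reduces to (prod x_i) ** (1 / n)
    if any(xi < 0 for xi in x):
        return None  # negative value: the geometric mean is set to missing
    if any(xi == 0 for xi in x):
        return 0.0   # the product is zero (assuming positive weights)
    log_sum = sum(wi * math.log(xi) for xi, wi in zip(x, w))
    return math.exp(log_sum / sum(w))


print(geometric_mean([1, 4]))           # square root of 1 * 4
print(geometric_mean([2, 8], [3, 1]))   # (2**3 * 8) ** (1 / 4)
print(geometric_mean([-1, 4]))          # None
```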