The HPSUMMARY Procedure

Keywords and Formulas

Simple Statistics

The HPSUMMARY procedure uses a standardized set of keywords to refer to statistics. You specify these keywords in SAS statements to request the statistics to be displayed or stored in an output data set.

In the following notation, summation is over observations that contain nonmissing values of the analyzed variable and, except where shown, over nonmissing weights and frequencies of one or more:

$x_ i$

is the nonmissing value of the analyzed variable for observation $i$.

$f_ i$

is the frequency that is associated with $x_ i$ if you use a FREQ statement. If you omit the FREQ statement, then $f_ i = 1$ for all $i$.

$w_ i$

is the weight that is associated with $x_ i$ if you use a WEIGHT statement. The HPSUMMARY procedure automatically excludes the values of $x_ i$ with missing weights from the analysis.

By default, the HPSUMMARY procedure treats a negative weight as if it is equal to 0. However, if you use the EXCLNPWGT option in the PROC HPSUMMARY statement, then the procedure also excludes those values of $x_ i$ with nonpositive weights.

If you omit the WEIGHT statement, then $w_ i = 1$ for all $i$.

$n$

is the number of nonmissing values of $x_ i$, which equals $\sum f_ i$. If you use the EXCLNPWGT option and the WEIGHT statement, then $n$ is the number of nonmissing values with positive weights.

$\bar{x}$

is the mean

\[  \left.\sum w_ i x_ i\middle /\sum w_ i\right.  \]
$s^2$

is the variance

\[  \frac{1}{d} \sum w_ i (x_ i - \bar{x})^2  \]

where $d$ is the variance divisor (the VARDEF= option) that you specify in the PROC HPSUMMARY statement. Valid values are as follows:

Table 9.9: Variance Divisors

When VARDEF=      $d$ equals
N                 $n$
DF                $n - 1$
WEIGHT | WGT      $\sum _ i w_ i$
WDF               $\left(\sum _ i w_ i\right) - 1$


The default is DF.
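The four divisors can be checked outside SAS. The following Python sketch (illustrative only; the function name and data are hypothetical, not SAS code) computes the variance under each VARDEF= setting:

```python
def weighted_variance(x, w, vardef="DF"):
    """Variance (1/d) * sum(w_i * (x_i - xbar)^2) under each VARDEF= divisor."""
    sw = sum(w)                                          # sum of the weights
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw     # weighted mean
    css = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    n = len(x)                                           # number of nonmissing values
    d = {"N": n, "DF": n - 1, "WEIGHT": sw, "WGT": sw, "WDF": sw - 1}[vardef]
    return css / d

# With unit weights, WEIGHT behaves like N and WDF like DF:
x, w = [2.0, 4.0, 6.0, 8.0], [1.0, 1.0, 1.0, 1.0]
print(weighted_variance(x, w, "DF"))   # ordinary sample variance, 20/3
```

With unit weights the weighted mean is 5, the corrected sum of squares is 20, and the DF divisor is 3.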

$z_ i$

is the standardized variable

\[  (x_ i - \bar{x}) / s  \]

PROC HPSUMMARY calculates the following simple statistics:

  • number of missing values

  • number of nonmissing values

  • number of observations

  • sum of weights

  • mean

  • sum

  • extreme values

  • minimum

  • maximum

  • range

  • uncorrected sum of squares

  • corrected sum of squares

  • variance

  • standard deviation

  • standard error of the mean

  • coefficient of variation

  • skewness

  • kurtosis

  • confidence limits of the mean

  • median

  • mode

  • percentiles/deciles/quartiles

  • $t$ test for mean=0

The standard keywords and formulas for each statistic follow. Some formulas use keywords to designate the corresponding statistic.

Descriptive Statistics

The keywords for descriptive statistics are as follows:

CSS

is the sum of squares corrected for the mean, computed as

\[  \sum w_ i (x_ i - \bar{x})^2  \]
CV

is the percent coefficient of variation, computed as

\[  (100s) / \bar{x}  \]
KURTOSIS | KURT

is the kurtosis, which measures heaviness of tails. When VARDEF=DF, the kurtosis is computed as

\[  c_{4_ n} \sum z_ i^4 - \frac{3 \left( n-1 \right)^2}{\left( n-2 \right) \left( n-3 \right)}  \]

where $c_{4_ n}$ is $\frac{n \left( n+1 \right)}{\left( n-1 \right) \left( n-2 \right) \left( n-3 \right)}$

When VARDEF=N, the kurtosis is computed as

\[  \frac{1}{n} \sum z_ i^4 - 3  \]

The formula is invariant under the transformation $w_ i^* = zw_ i$, $z > 0$. When you use VARDEF=WDF or VARDEF=WEIGHT, the kurtosis is set to missing.

MAX

is the maximum value of $x_ i$.

MEAN

is the arithmetic mean $\bar{x}$.

MIN

is the minimum value of $x_ i$.

MODE

is the most frequent value of $x_ i$.

Note: When QMETHOD=P2, PROC HPSUMMARY does not compute MODE.

N

is the number of $x_ i$ values that are not missing. Observations with $f_ i < 1$ and $w_ i$ equal to missing or $w_ i \leq 0$ (when you use the EXCLNPWGT option) are excluded from the analysis and are not included in the calculation of N.

NMISS

is the number of $x_ i$ values that are missing. Observations with $f_ i < 1$ and $w_ i$ equal to missing or $w_ i \leq 0$ (when you use the EXCLNPWGT option) are excluded from the analysis and are not included in the calculation of NMISS.

NOBS

is the total number of observations and is calculated as the sum of N and NMISS. However, if you use the WEIGHT statement, then NOBS is calculated as the sum of N, NMISS, and the number of observations excluded because of missing or nonpositive weights.

RANGE

is the range and is calculated as the difference between the maximum and minimum values.

SKEWNESS | SKEW

is skewness, which measures the tendency of the deviations to be larger in one direction than in the other. When VARDEF=DF, the skewness is computed as

\[  c_{3_ n} \sum z_ i^3  \]

where $c_{3_ n}$ is $\frac{n}{\left( n-1 \right) \left( n-2 \right)}$.

When VARDEF=N, the skewness is computed as

\[  \frac{1}{n} \sum z_ i^3  \]

The formula is invariant under the transformation $w_ i^* = zw_ i$, $z > 0$. When you use VARDEF=WDF or VARDEF=WEIGHT, the skewness is set to missing.
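The VARDEF=DF formulas for skewness and kurtosis can be sketched in Python for the unweighted case (illustrative only; the function names are hypothetical, not SAS code):

```python
from math import sqrt

def skewness_df(x):
    """SKEWNESS with VARDEF=DF (unweighted): c3 * sum(z_i^3)."""
    n = len(x)
    xbar = sum(x) / n
    s = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))   # standard deviation
    z3 = sum(((xi - xbar) / s) ** 3 for xi in x)
    return n / ((n - 1) * (n - 2)) * z3

def kurtosis_df(x):
    """KURTOSIS with VARDEF=DF (unweighted): c4 * sum(z_i^4) - 3(n-1)^2 / ((n-2)(n-3))."""
    n = len(x)
    xbar = sum(x) / n
    s = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    z4 = sum(((xi - xbar) / s) ** 4 for xi in x)
    c4 = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return c4 * z4 - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

print(skewness_df([1, 2, 3, 4, 5]))   # symmetric data -> 0
print(kurtosis_df([1, 2, 3, 4, 5]))
```

For the symmetric data 1, ..., 5 the cubed standardized values cancel, so the skewness is exactly zero.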

STDDEV | STD

is the standard deviation $s$ and is computed as the square root of the variance, $s^2$.

STDERR | STDMEAN

is the standard error of the mean, computed as

\[  \frac{s}{\sqrt {\sum w_ i}}  \]

when VARDEF=DF, which is the default. Otherwise, STDERR is set to missing.

SUM

is the sum, computed as

\[  \sum w_ i x_ i  \]
SUMWGT

is the sum of the weights, $W$, computed as

\[  \sum w_ i  \]
USS

is the uncorrected sum of squares, computed as

\[  \sum w_ i x_ i^2  \]
VAR

is the variance $s^2$.
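A useful identity connects CSS, USS, SUM, and SUMWGT: expanding the square gives $\mathrm{CSS} = \mathrm{USS} - \left(\sum w_ i x_ i\right)^2 / \sum w_ i$. A quick Python check with hypothetical data (illustrative only, not SAS code):

```python
x = [1.0, 2.0, 3.0, 4.0]
w = [2.0, 1.0, 1.0, 2.0]

sumwgt = sum(w)                                           # SUMWGT
wsum = sum(wi * xi for wi, xi in zip(w, x))               # SUM
xbar = wsum / sumwgt                                      # MEAN
uss = sum(wi * xi * xi for wi, xi in zip(w, x))           # USS
css = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))  # CSS

# Expanding the square: CSS = USS - (sum of w_i x_i)^2 / (sum of w_i)
assert abs(css - (uss - wsum ** 2 / sumwgt)) < 1e-12
print(css, uss)
```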

Quantile and Related Statistics

The keywords for quantiles and related statistics are as follows:

MEDIAN

is the middle value.

P$n$

is the $n$th percentile. For example, P1 is the first percentile, P5 is the fifth percentile, P50 is the 50th percentile, and P99 is the 99th percentile.

Q1

is the lower quartile (25th percentile).

Q3

is the upper quartile (75th percentile).

QRANGE

is the interquartile range and is calculated as

\[  Q_3 - Q_1  \]

You use the QNTLDEF= option to specify the method that the HPSUMMARY procedure uses to compute percentiles. Let $n$ be the number of nonmissing values for a variable, and let $x_1, \,  x_2, \,  \ldots , \,  x_ n$ represent the ordered values of the variable such that $x_1$ is the smallest value, $x_2$ is the next smallest value, and $x_ n$ is the largest value. For the $t$th percentile, let $p = t/100$. Then define $j$ as the integer part and $g$ as the fractional part of $np$ or $(n+1)p$, so that

\[  np = j + g \quad \mbox{when QNTLDEF=1, 2, 3, or 5}  \]
\[  (n+1)p = j + g \quad \mbox{when QNTLDEF=4}  \]

Here, QNTLDEF= specifies the method that the procedure uses to compute the $t$th percentile, as shown in Table 9.10.

When you use the WEIGHT statement, the $t$th percentile is computed as

\[  y = \begin{cases}  \frac{1}{2} \left( x_ i + x_{i+1} \right) &  \mbox{if} \displaystyle \sum _{j=1}^ i w_ j = pW \\ x_{i+1} &  \mbox{if} \displaystyle \sum _{j=1}^ i w_ j < pW < \displaystyle \sum _{j=1}^{i+1} w_ j \end{cases}  \]

where $w_ j$ is the weight associated with $x_ j$ and $W = \displaystyle \sum _{i=1}^{n} w_ i$ is the sum of the weights. When the observations have identical weights, the weighted percentiles are the same as the unweighted percentiles with QNTLDEF=5.
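The two-case weighted-percentile definition can be sketched in Python (illustrative only, not SAS internals; exact float equality is used for the "partial sum equals $pW$" branch, as in the definition):

```python
def weighted_percentile(x, w, p):
    """Weighted t-th percentile (p = t/100) per the two-case definition."""
    pairs = sorted(zip(x, w))              # order the values, carrying their weights
    target = p * sum(w)                    # p * W
    partial = 0.0
    for i, (xi, wi) in enumerate(pairs):
        partial += wi
        if partial == target and i + 1 < len(pairs):
            return (xi + pairs[i + 1][0]) / 2   # sum of w_j up to i equals pW exactly
        if partial > target:
            return xi                           # pW falls strictly inside this observation
    return pairs[-1][0]

# With identical weights this reproduces the unweighted QNTLDEF=5 median:
print(weighted_percentile([3, 1, 4, 2], [1, 1, 1, 1], 0.5))   # -> 2.5
```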

Table 9.10: Methods for Computing Quantile Statistics

QNTLDEF=   Description                                      Formula
1          Weighted average at $x_{np}$                     $y = (1-g)x_ j + gx_{j+1}$
                                                            where $x_0$ is taken to be $x_1$
2          Observation numbered closest to $np$             $y = x_ i$ if $g \neq \frac{1}{2}$
                                                            $y = x_ j$ if $g = \frac{1}{2}$ and $j$ is even
                                                            $y = x_{j+1}$ if $g = \frac{1}{2}$ and $j$ is odd
                                                            where $i$ is the integer part of $np + \frac{1}{2}$
3          Empirical distribution function                  $y = x_ j$ if $g = 0$
                                                            $y = x_{j+1}$ if $g > 0$
4          Weighted average aimed at $x_{(n+1)p}$           $y = (1-g)x_ j + gx_{j+1}$
                                                            where $x_{n+1}$ is taken to be $x_ n$
5          Empirical distribution function with averaging   $y = \frac{1}{2}(x_ j + x_{j+1})$ if $g = 0$
                                                            $y = x_{j+1}$ if $g > 0$
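To make Table 9.10 concrete, here is a Python sketch (illustrative only, not SAS internals; the function name is hypothetical) of the five QNTLDEF= methods for unweighted data:

```python
from math import floor

def percentile(xs, p, qntldef=5):
    """Unweighted t-th percentile (p = t/100) under each QNTLDEF= method."""
    x = sorted(xs)
    n = len(x)
    # 1-based lookup, clamped so that x_0 -> x_1 and x_{n+1} -> x_n (per Table 9.10)
    get = lambda j: x[min(max(j, 1), n) - 1]
    np_ = (n + 1) * p if qntldef == 4 else n * p
    j = floor(np_)                              # integer part
    g = np_ - j                                 # fractional part
    if qntldef in (1, 4):
        return (1 - g) * get(j) + g * get(j + 1)
    if qntldef == 2:
        if g != 0.5:
            i = floor(n * p + 0.5)              # observation numbered closest to np
            return get(i)
        return get(j) if j % 2 == 0 else get(j + 1)
    if qntldef == 3:
        return get(j) if g == 0 else get(j + 1)
    if qntldef == 5:
        return (get(j) + get(j + 1)) / 2 if g == 0 else get(j + 1)

# Median of 1, 2, 3, 4 under each definition:
print([percentile([1, 2, 3, 4], 0.5, q) for q in (1, 2, 3, 4, 5)])
```

For the median of 1, 2, 3, 4 ($np = 2$, $g = 0$), definitions 1 to 3 return 2 while definitions 4 and 5 average to 2.5.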


Hypothesis Testing Statistics

The keywords for hypothesis testing statistics are as follows:

T

is the Student’s $t$ statistic to test the null hypothesis that the population mean is equal to $\mu _0$ and is calculated as

\[  \frac{\bar{x} - \mu _0}{\left.s\middle /\sqrt {\sum w_ i}\right.}  \]

By default, $\mu _0$ is equal to zero. You must use VARDEF=DF, which is the default variance divisor; otherwise T is set to missing.

By default, when you use a WEIGHT statement, the procedure counts the $x_ i$ values with nonpositive weights in the degrees of freedom. Use the EXCLNPWGT option in the PROC HPSUMMARY statement to exclude values with nonpositive weights.
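In the unweighted case ($w_ i = 1$, so $\sum w_ i = n$) the statistic reduces to the familiar one-sample $t$. A Python sketch (illustrative only; the function name is hypothetical):

```python
from math import sqrt

def t_statistic(x, mu0=0.0):
    """Student's t for H0: mean = mu0, unweighted, VARDEF=DF: (xbar - mu0) / (s / sqrt(n))."""
    n = len(x)
    xbar = sum(x) / n
    s = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))   # standard deviation
    return (xbar - mu0) / (s / sqrt(n))

# xbar = 3, s^2 = 2.5, so t = 3 / sqrt(2.5/5) = 3 * sqrt(2)
print(t_statistic([1.0, 2.0, 3.0, 4.0, 5.0]))
```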

PROBT | PRT

is the two-tailed $p$-value for the Student’s $t$ statistic, T, with $n - 1$ degrees of freedom. This value is the probability under the null hypothesis of obtaining a more extreme value of T than is observed in this sample.

Confidence Limits for the Mean

The keywords for confidence limits are as follows:

CLM

is the two-sided confidence limit for the mean. A two-sided $100(1-\alpha )$ percent confidence interval for the mean has upper and lower limits

\[  \bar{x} \pm t_{(1-\alpha /2;\,  n-1)} \frac{s}{\sqrt {\sum w_ i}}  \]

where $s$ is $\sqrt {\frac{1}{n-1}\sum (x_ i - \bar{x})^2}$, $t_{(1-\alpha /2;\,  n-1)}$ is the $(1 - \alpha / 2)$ critical value of the Student’s $t$ statistic with $n-1$ degrees of freedom, and $\alpha $ is the value of the ALPHA= option, which is 0.05 by default. Unless you use VARDEF=DF (which is the default variance divisor), CLM is set to missing.
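As a numerical illustration (the values here are chosen for this example, not taken from the source): with $n = 25$, $\bar{x} = 10$, $s = 2$, unit weights, and the default $\alpha = 0.05$, the critical value is $t_{(0.975;\,  24)} \approx 2.064$, so the limits are

\[  10 \pm 2.064 \times \frac{2}{\sqrt {25}} = 10 \pm 0.826  \]

that is, approximately $(9.17, \,  10.83)$.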

LCLM

is the one-sided confidence limit below the mean. The one-sided $100(1-\alpha )$ percent confidence interval for the mean has the lower limit

\[  \bar{x} - t_{(1-\alpha ;\,  n-1)} \frac{s}{\sqrt {\sum w_ i}}  \]

Unless you use VARDEF=DF (which is the default variance divisor), LCLM is set to missing.

UCLM

is the one-sided confidence limit above the mean. The one-sided $100(1-\alpha )$ percent confidence interval for the mean has the upper limit

\[  \bar{x} + t_{(1-\alpha ;\,  n-1)} \frac{s}{\sqrt {\sum w_ i}}  \]

Unless you use VARDEF=DF (which is the default variance divisor), UCLM is set to missing.

Data Requirements for the HPSUMMARY Procedure

The following are the minimal data requirements for computing unweighted statistics; they do not describe recommended sample sizes. Statistics are reported as missing if VARDEF=DF (the default) is in effect and the following requirements are not met:

  • N and NMISS are computed regardless of the number of missing or nonmissing observations.

  • SUM, MEAN, MAX, MIN, RANGE, USS, and CSS require at least one nonmissing observation.

  • VAR, STD, STDERR, CV, T, PRT, and PROBT require at least two nonmissing observations.

  • SKEWNESS requires at least three nonmissing observations.

  • KURTOSIS requires at least four nonmissing observations.

  • SKEWNESS, KURTOSIS, T, PROBT, and PRT require that STD is greater than zero.

  • CV requires that MEAN is not equal to zero.

  • CLM, LCLM, UCLM, STDERR, T, PRT, and PROBT require that VARDEF=DF.