The SURVEYMEANS Procedure

Definitions and Notation

For a stratified clustered sample design, the sample, together with the sampling weights, can be represented by an $n \times (P+1)$ matrix

$\begin{eqnarray*} (\mb {w,Y}) & =& \left( w_{hij}, \mb {y}_{hij} \right) \\ & =& \left( w_{hij}, y_{hij}^{(1)}, y_{hij}^{(2)}, \ldots , y_{hij}^{(P)}\right) \end{eqnarray*}$

where

$h=1, 2, \ldots , H$ is the stratum index
$i=1, 2, \ldots , n_ h$ is the cluster index within stratum h
$j=1, 2, \ldots , m_{hi}$ is the unit index within cluster i of stratum h
$p=1, 2, \ldots , P$ is the analysis variable number, with a total of P variables
$n=\sum _{h=1}^ H \sum _{i=1}^{n_ h} {m_{hi}}$ is the total number of observations in the sample
$w_{hij}$ denotes the sampling weight for unit j in cluster i of stratum h
$\mb {y}_{hij}=\left( y_{hij}^{(1)}, y_{hij}^{(2)}, \ldots , y_{hij}^{(P)}\right)$ are the observed values of the analysis variables for unit j in cluster i of stratum h, including both the values of numerical variables and the values of indicator variables for levels of categorical variables.
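
As an illustration of this notation only, the following Python sketch flattens a small nested sample into rows of the form $(w_{hij}, y_{hij}^{(1)}, \ldots , y_{hij}^{(P)})$ and checks that the row count equals $n=\sum _h \sum _i m_{hi}$. The data values, the nested-dictionary layout, and the name `flatten` are hypothetical and are not part of PROC SURVEYMEANS.

```python
# Hypothetical toy sample: stratum h -> cluster i -> list of units,
# where each unit is (w_hij, (y^(1), ..., y^(P))). Values are made up.
sample = {
    1: {1: [(10.0, (2.5, 1.0)), (10.0, (3.0, 0.0))],
        2: [(12.0, (1.5, 1.0))]},
    2: {1: [(8.0, (4.0, 0.0)), (8.0, (2.0, 1.0)), (8.0, (3.5, 1.0))]},
}

def flatten(sample):
    """Return rows (h, i, j, w_hij, y^(1), ..., y^(P)) in stratum, cluster, unit order."""
    rows = []
    for h, clusters in sorted(sample.items()):
        for i, units in sorted(clusters.items()):
            for j, (w, y) in enumerate(units, start=1):
                rows.append((h, i, j, w, *y))
    return rows

rows = flatten(sample)

# n is the sum of the cluster sizes m_hi over all strata and clusters.
n = sum(len(units) for clusters in sample.values() for units in clusters.values())

# Each row carries three index columns, then the weight, then the P analysis values.
P = len(rows[0]) - 4
```

Dropping the three index columns from each row leaves the $n \times (P+1)$ matrix described above.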

For a categorical variable C, let l denote the number of levels of C, and denote the level values as $c_1, c_2, \ldots , c_ l$ . Let $y^{(q)}$ $(q\in \{ 1, 2, \ldots , P\} )$ be an indicator variable for the category $C=c_ k$ $(k=1, 2, \ldots , l)$ with the observed value in unit j in cluster i of stratum h:

$y_{hij}^{(q)} = I_{\{ C=c_ k\} }(h,i,j) = \left\{ \begin{array}{ll} 1 & \mbox{if $C_{hij}=c_ k$ } \\ 0 & \mbox{otherwise} \end{array} \right.$

Note that the indicator variable $y_{hij}^{(q)}$ is set to missing when $C_{hij}$ is missing. Because each level of a categorical variable contributes one indicator variable, the total number of analysis variables, P, is the total number of numerical variables plus the total number of levels of all categorical variables.
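
The indicator coding and the count of analysis variables can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `indicators` and `count_analysis_variables` are hypothetical, and a missing value is represented here by `None`.

```python
def indicators(value, levels):
    """Return one indicator per level of C for a single unit.

    Each indicator is 1 if the unit's value equals that level and 0 otherwise.
    If the value is missing (None in this sketch), every indicator is missing.
    """
    if value is None:
        return [None] * len(levels)
    return [1 if value == c else 0 for c in levels]

def count_analysis_variables(num_numeric, levels_per_categorical):
    """P = number of numerical variables + total number of categorical levels."""
    return num_numeric + sum(len(levels) for levels in levels_per_categorical)
```

For example, a variable C with levels `["a", "b", "c"]` yields the indicators `[0, 1, 0]` for a unit where C is `"b"`. With two numerical variables and two categorical variables that have three and two levels, P is 2 + 3 + 2 = 7.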

The sampling rate $f_ h$ for stratum h, which is used in Taylor series variance estimation, is the fraction of first-stage units (PSUs) selected for the sample. You can use the TOTAL= or RATE= option to input population totals or sampling rates. See the section "Specification of Population Totals and Sampling Rates" for details. If you input stratum totals, PROC SURVEYMEANS computes $f_ h$ as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYMEANS uses these values directly for $f_ h$ . If you specify neither the TOTAL= nor the RATE= option, the procedure assumes that the stratum sampling rates $f_ h$ are negligible and does not apply a finite population correction when computing variances. Replication methods specified by the VARMETHOD=BRR or the VARMETHOD=JACKKNIFE option do not use the sampling rates $f_ h$ for a finite population correction.
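
The three cases for $f_ h$ can be summarized in a short Python sketch. It is an illustration only, not the procedure's implementation: the function `sampling_rate` and its parameters are hypothetical, the stratum sample size and stratum total are both assumed to be counted in PSUs, and a "negligible" rate is modeled here as 0.

```python
def sampling_rate(n_h, total=None, rate=None):
    """Return f_h for one stratum under the three cases described above.

    n_h   : number of PSUs sampled in the stratum (assumed unit of count)
    total : stratum population total, if supplied (analogous to TOTAL=)
    rate  : stratum sampling rate, if supplied (analogous to RATE=)
    """
    if rate is not None:
        # A supplied rate is used directly.
        return rate
    if total is not None:
        # A supplied total gives f_h = sample size / stratum total.
        return n_h / total
    # Neither supplied: treat the rate as negligible (modeled as 0 here),
    # so no finite population correction applies.
    return 0.0
```

For example, a stratum with 5 sampled PSUs and a supplied total of 50 gives $f_ h = 0.1$.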