Definitions and Notation

For a stratified clustered sample design, together with the sampling weights, the sample can be represented by an $n \times (P+1)$ matrix

\begin{eqnarray*}  (\mb {w,Y}) & =&  \left( w_{hij}, \mb {y}_{hij} \right) \\ & =&  \left( w_{hij}, y_{hij}^{(1)}, y_{hij}^{(2)}, \ldots , y_{hij}^{(P)}\right) \end{eqnarray*}

where


  • $h=1, 2, \ldots , H$   is the stratum index

  • $i=1, 2, \ldots , n_ h$   is the cluster index within stratum h

  • $j=1, 2, \ldots , m_{hi}$   is the unit index within cluster i of stratum h

  • $p=1, 2, \ldots , P$   is the analysis variable number, with a total of P variables

  • $n=\sum _{h=1}^ H \sum _{i=1}^{n_ h} {m_{hi}}$   is the total number of observations in the sample

  • $w_{hij}$   denotes the sampling weight for unit j in cluster i of stratum h

  • $\mb {y}_{hij}=\left( y_{hij}^{(1)}, y_{hij}^{(2)}, \ldots , y_{hij}^{(P)}\right)$   are the observed values of the analysis variables for unit j in cluster i of stratum h, including both the values of numerical variables and the values of indicator variables for levels of categorical variables.
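
The notation above can be illustrated with a small sketch. The following Python fragment is not part of PROC SURVEYMEANS; the toy data and the names `records`, `m`, `n_h`, and `matrix` are hypothetical and chosen only to mirror the symbols $m_{hi}$, $n_ h$, $n$, and $(\mb {w,Y})$.

```python
# Hypothetical toy sample with H = 2 strata and P = 2 analysis variables.
# Each record is (stratum h, cluster i, unit j, weight w_hij, (y^(1), y^(2))).
records = [
    (1, 1, 1, 10.0, (2.5, 1.0)),
    (1, 1, 2, 10.0, (3.1, 0.0)),
    (1, 2, 1, 12.0, (1.8, 1.0)),
    (2, 1, 1, 8.0, (4.0, 0.0)),
    (2, 1, 2, 8.0, (3.7, 1.0)),
    (2, 1, 3, 8.0, (2.9, 1.0)),
]
P = 2

# m[(h, i)] is the number of units in cluster i of stratum h.
m = {}
for h, i, j, w, y in records:
    m[(h, i)] = m.get((h, i), 0) + 1

# n_h[h] is the number of clusters in stratum h.
n_h = {}
for (h, i) in m:
    n_h[h] = n_h.get(h, 0) + 1

# n is the double sum of m_hi over strata and clusters.
n = sum(m.values())

# Each row of the n x (P+1) matrix holds the weight followed by the P values.
matrix = [[w, *y] for (_, _, _, w, y) in records]
```

In this sketch the number of rows of `matrix` equals `n`, and every row has `P + 1` entries, matching the $n \times (P+1)$ dimension stated above.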

For a categorical variable C, let l denote the number of levels of C, and denote the level values as $c_1, c_2, \ldots , c_ l$. Let $y^{(q)}$ $(q\in \{ 1, 2, \ldots , P\} )$ be an indicator variable for the category $C=c_ k$ $(k=1, 2, \ldots , l)$ with the observed value in unit j in cluster i of stratum h:

\[  y_{hij}^{(q)} = I_{\{ C=c_ k\} }(h,i,j) = \left\{  \begin{array}{ll} 1 &  \mbox{if $C_{hij}=c_ k$ } \\ 0 &  \mbox{otherwise} \end{array} \right.  \]

Note that the indicator variable $ y_{hij}^{(q)}$ is set to missing when $C_{hij}$ is missing. Because each level of a categorical variable contributes one indicator variable, the total number of analysis variables, P, is the total number of numerical variables plus the total number of levels of all categorical variables.
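
The indicator coding can be sketched as follows. This Python fragment only illustrates the rule described above and is not the procedure's implementation; the function `indicator_columns`, the variable `region`, and the use of `None` and NaN to represent missing values are assumptions made for the example.

```python
import math

def indicator_columns(values, levels):
    """Return one indicator column per level of a categorical variable.

    A missing category (None here, by assumption) yields a missing
    indicator (NaN) in every column, rather than 0.
    """
    columns = []
    for c_k in levels:
        column = []
        for c in values:
            if c is None:
                column.append(math.nan)
            else:
                column.append(1.0 if c == c_k else 0.0)
        columns.append(column)
    return columns

# Hypothetical categorical variable with l = 2 levels and one missing value.
region = ["east", "west", None, "east"]
levels = ["east", "west"]
cols = indicator_columns(region, levels)

# With one numerical variable and one categorical variable with 2 levels,
# the total number of analysis variables is P = 1 + 2.
num_numeric = 1
P = num_numeric + len(levels)
```

Keeping the indicator missing instead of coding it as 0 means that a unit with an unknown category is not counted as belonging to "not $c_ k$".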

The sampling rate $f_ h$ for stratum h, which is used in Taylor series variance estimation, is the fraction of first-stage units (PSUs) selected for the sample. You can use the TOTAL= or RATE= option to input population totals or sampling rates. See the section Specification of Population Totals and Sampling Rates for details. If you input stratum totals, PROC SURVEYMEANS computes $f_ h$ as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYMEANS uses these values directly for $f_ h$. If you do not specify the TOTAL= or RATE= option, then the procedure assumes that the stratum sampling rates $f_ h$ are negligible, and a finite population correction is not used when computing variances. Replication methods specified by the VARMETHOD=BRR or the VARMETHOD=JACKKNIFE option do not use the sampling rates $f_ h$ for a finite population correction.
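
The three cases for $f_ h$ can be summarized in a short sketch. The Python function below is a hypothetical helper, not an interface of PROC SURVEYMEANS; the arguments `totals` and `rates` stand in for the values supplied through TOTAL= and RATE=, and the sketch assumes that `n_h` counts the sampled PSUs in each stratum and that at most one of the two arguments is given.

```python
def sampling_rates(n_h, totals=None, rates=None):
    """Return a dict of f_h by stratum, following the rule described above.

    n_h    : dict mapping stratum -> number of sampled PSUs (assumed)
    totals : dict mapping stratum -> stratum total, or None
    rates  : dict mapping stratum -> sampling rate, or None
    """
    if totals is not None and rates is not None:
        # Design choice of this sketch only: reject ambiguous input.
        raise ValueError("give totals or rates, not both")
    if totals is not None:
        # f_h is the stratum sample size divided by the stratum total.
        return {h: n_h[h] / totals[h] for h in n_h}
    if rates is not None:
        # The supplied rates are used directly.
        return {h: rates[h] for h in n_h}
    # Neither given: rates are treated as negligible (no correction).
    return {h: 0.0 for h in n_h}

# Hypothetical strata with 2 and 1 sampled PSUs.
n_h = {1: 2, 2: 1}
f_from_totals = sampling_rates(n_h, totals={1: 20, 2: 4})
f_from_rates = sampling_rates(n_h, rates={1: 0.05, 2: 0.5})
f_default = sampling_rates(n_h)
```

Returning 0 for every stratum in the default case corresponds to omitting the finite population correction when variances are computed.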