The SURVEYMEANS Procedure

Poststratification

After a probability sample is drawn and survey data are collected, researchers sometimes want to stratify the sample according to auxiliary information about the sampled population. This process is often called poststratification.

When poststratification is done properly, it can improve the efficiency of the estimates. It can also be used to adjust the sampling weights so that their marginal distribution agrees with known auxiliary information from other sources, such as a census. The adjusted weight is often called the poststratification weight. It is quite common for researchers to use poststratification techniques in survey data analysis.

Poststratification is also used by epidemiologists, who frequently analyze health survey data. They often compute statistics based on a process called direct standardization, a form of poststratification. For example, certain diseases, such as cancer, are more common among older populations. Therefore, to compare the prevalence rates among geographic regions that are populated with different age groups, it is necessary to make adjustments according to such demographic categories and to compute relative prevalence rates of the diseases.

For more information about poststratification, see Fuller (2009), Lohr (2010), Wolter (2007), and Rao, Yung, and Hidiroglou (2002).

After you provide the population controls for each poststratum that is defined by the poststratification variables, the SURVEYMEANS procedure creates the poststratification weights accordingly. Then the procedure computes statistics that you request by using poststratification weights.

You can save the poststratification weights in an OUTPSWGT= data set to be used in subsequent analyses.

For a selected sample, let $p=1, 2, \ldots , P$ be the poststratum index; let $Z_1, Z_2, \ldots , Z_ P$ be the population totals for each corresponding poststratum; and let $I_ p$ be a corresponding indicator variable for the poststratum p defined by

\[  I_{p}(h,i,j) = \left\{  \begin{array}{ll} 1 &  \mbox{if observation $(h,i,j)$ belongs to the $p$th poststratum} \\ 0 &  \mbox{otherwise} \end{array} \right.  \]

Denote the total sum of original weights in the sample for each poststratum as

\[  \psi _ p = \sum _{h=1}^ H\sum _{i=1}^{n_ h}\sum _{j=1}^{m_{hi}} ~  w_{hij} I_{p}(h,i,j)  \]

Then the poststratification weight for the observation (h, i, j), which belongs to the pth poststratum, is

\[  \tilde{w}_{hij} = w_{hij} \frac{Z_ p}{\psi _ p}  \]

The SURVEYMEANS procedure computes statistics by using the poststratification weights $\tilde{w}_{hij}$ instead of the original weights $w_{hij}$.
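The weight adjustment above can be sketched in a few lines of Python. This is only an illustration of the formula, not the procedure's implementation; the function name `poststratify`, the sample values, and the control totals are hypothetical.

```python
from collections import defaultdict


def poststratify(weights, poststrata, control_totals):
    """Return w_tilde = w * Z_p / psi_p for each observation."""
    # psi_p: sum of the original weights in each poststratum
    psi = defaultdict(float)
    for w, p in zip(weights, poststrata):
        psi[p] += w
    return [w * control_totals[p] / psi[p] for w, p in zip(weights, poststrata)]


# Hypothetical sample: six observations in two poststrata
weights = [10.0, 10.0, 20.0, 20.0, 20.0, 20.0]
poststrata = ["young", "young", "old", "old", "old", "old"]
control_totals = {"young": 30.0, "old": 60.0}  # Z_p, assumed known

ps_weights = poststratify(weights, poststrata, control_totals)
print(ps_weights)  # [15.0, 15.0, 15.0, 15.0, 15.0, 15.0]
```

Within each poststratum the adjusted weights sum to the control total $Z_ p$, which is the defining property of the poststratification weights.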

The standard error and confidence intervals of computed statistics are based on the estimated variance, using either a replication method or the Taylor series method.

Replication Methods

When you specify VARMETHOD=BRR or VARMETHOD=JACKKNIFE, PROC SURVEYMEANS computes the variance of a statistic by using replication methods, as described in the section Replication Methods for Variance Estimation. However, with poststratification, an extra step is needed to adjust the weights.

First, PROC SURVEYMEANS constructs a replicate and computes appropriate replicate weights for the replicate. Then, by using the poststratification control totals, the procedure adjusts these replicate weights in the same way as described previously for constructing the poststratification weights for the full sample. Finally, PROC SURVEYMEANS computes the estimate for a desired statistic by using the poststratification weights that are adjusted from the replicate weights in the current replicate. Then the final variance is estimated by the variability among replicate estimates, as described in the section Replication Methods for Variance Estimation.
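The three steps can be sketched as follows for a delete-one jackknife. The sketch makes simplifying assumptions that are not part of the text above: an unstratified design in which every observation is its own cluster, and the usual jackknife coefficient $(n-1)/n$ for that design. All function names are hypothetical.

```python
def poststratify(weights, poststrata, control_totals):
    """Scale weights so they sum to Z_p within each poststratum p."""
    psi = {}
    for w, p in zip(weights, poststrata):
        psi[p] = psi.get(p, 0.0) + w
    return [w * control_totals[p] / psi[p] for w, p in zip(weights, poststrata)]


def weighted_mean(weights, values):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def jackknife_ps_mean(weights, poststrata, control_totals, y):
    """Poststratified mean and its delete-one jackknife variance.

    Assumes every poststratum keeps at least one unit in each replicate.
    """
    n = len(y)
    full = weighted_mean(poststratify(weights, poststrata, control_totals), y)
    sum_sq = 0.0
    for r in range(n):
        keep = [k for k in range(n) if k != r]
        # Step 1: replicate weights (drop unit r, inflate the rest)
        rep_w = [weights[k] * n / (n - 1) for k in keep]
        rep_p = [poststrata[k] for k in keep]
        rep_y = [y[k] for k in keep]
        # Step 2: poststratify the replicate weights to the same totals
        adj_w = poststratify(rep_w, rep_p, control_totals)
        # Step 3: replicate estimate from the adjusted replicate weights
        sum_sq += (weighted_mean(adj_w, rep_y) - full) ** 2
    # Assumed jackknife coefficient (n - 1) / n for an unstratified design
    return full, (n - 1) / n * sum_sq


weights = [10.0, 10.0, 20.0, 20.0, 20.0, 20.0]
poststrata = ["young", "young", "old", "old", "old", "old"]
control_totals = {"young": 30.0, "old": 60.0}
estimate, variance = jackknife_ps_mean(
    weights, poststrata, control_totals, [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
)
print(estimate, variance)
```

Because every replicate is re-adjusted to the control totals, a variable that is constant within each poststratum gives identical replicate estimates and therefore a variance of zero.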

Taylor Series Method

When you specify VARMETHOD=TAYLOR, or by default when you do not specify the VARMETHOD= option, PROC SURVEYMEANS uses the Taylor series method to estimate the variances of requested statistics.

Variance of the Mean and Sum

The sum and mean of variable Y under poststratification are

\begin{eqnarray*}  {\hat{{Y}}}^{(PS)} &  = &  \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  \tilde{w}_{hij} ~  y_{hij} \\ {\hat{\bar{Y}}}^{(PS)} &  = &  {\hat{{Y}}}^{(PS)} / ~  \tilde{w}_{\cdot \cdot \cdot } \end{eqnarray*}

where

\[  \tilde{w}_{\cdot \cdot \cdot } = \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} \tilde{w}_{hij}  \]

is the sum of the poststratification weights over all observations in the sample.

For each poststratum $p=1, 2, \ldots , P$, let the mean of variable Y in each poststratum be

\[  {\hat{\bar{Y}}}^{(p)} = \left( \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  I_{p}(h,i,j) ~  \tilde{w}_{hij} ~  y_{hij} \right) / ~  Z_ p  \]

where $Z_ p$ is the total of the poststratification weights in poststratum p.

For observation (h, i, j), assume that it belongs to the pth poststratum. Let

\[  \tilde{y}_{hij}=y_{hij}- \hat{\bar{Y}}^{(p)}  \]

PROC SURVEYMEANS estimates the variance of ${\hat{\bar{Y}}}^{(PS)}$ as

\[  \hat{V}\left({\hat{\bar{Y}}}^{(PS)}\right) =\sum _{h=1}^ H {\hat{V}_ h \left({\hat{\bar{Y}}}^{(PS)}\right)}  \]

where, if $n_ h>1$, then

\begin{eqnarray*}  \hat{V}_ h \left({\hat{\bar{Y}}}^{(PS)}\right) &  = &  \frac{n_ h(1-f_ h)}{n_ h-1} ~  \sum _{i=1}^{n_ h} {(e_{hi\cdot }-\bar{e}_{h\cdot \cdot })^2} \\ e_{hi\cdot }& =&  \left( \sum _{j=1}^{m_{hi}} \tilde{w}_{hij} ~  \tilde{y}_{hij} \right) / \tilde{w}_{\cdot \cdot \cdot } \\ \bar{e}_{h\cdot \cdot } & =&  \left( \sum _{i=1}^{n_ h}e_{hi\cdot } \right) / ~  n_ h \end{eqnarray*}

and if $n_ h=1$, then

\[  \hat{V}_ h \left({\hat{\bar{Y}}}^{(PS)}\right) = \left\{  \begin{array}{ll} \mbox{missing} &  \mbox{ if } n_{h'}=1 \mbox{ for } h'=1, 2, \ldots , H \\ 0 &  \mbox{ if } n_{h'}>1 \mbox{ for some } 1 \le h' \le H \end{array} \right.  \]

PROC SURVEYMEANS estimates the variance of ${\hat{{Y}}}^{(PS)}$ as

\[  \hat{V}_ h \left({\hat{{Y}}}^{(PS)}\right) = \hat{V}_ h \left({\hat{\bar{Y}}}^{(PS)}\right) ~  {\tilde{w}}_{\cdot \cdot \cdot }^2  \]
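The Taylor series computation for the mean and sum can be traced step by step in the following Python sketch. It is an illustration of the formulas in this section under stated assumptions, not the procedure's code: the record layout, the function name `taylor_ps_mean`, and the default sampling fraction $f_ h = 0$ are all hypothetical choices.

```python
from collections import defaultdict


def taylor_ps_mean(records, control_totals, f=None):
    """records: tuples (stratum h, cluster i, poststratum p, weight w, value y).

    Returns (mean, variance of mean, variance of sum); both variances are
    None when every stratum has a single cluster (the "missing" case).
    """
    f = f or {}  # sampling fractions f_h, assumed 0 when not supplied
    psi = defaultdict(float)
    for h, i, p, w, y in records:
        psi[p] += w
    # Poststratification weights and their total over the sample
    wt = [w * control_totals[p] / psi[p] for h, i, p, w, y in records]
    w_total = sum(wt)
    mean = sum(wps * r[4] for r, wps in zip(records, wt)) / w_total
    # Poststratum means: weighted sum of y in poststratum p divided by Z_p
    ybar_p = defaultdict(float)
    for (h, i, p, w, y), wps in zip(records, wt):
        ybar_p[p] += wps * y / control_totals[p]
    # Cluster-level totals e_hi. of the weighted residuals
    e = defaultdict(lambda: defaultdict(float))
    for (h, i, p, w, y), wps in zip(records, wt):
        e[h][i] += wps * (y - ybar_p[p]) / w_total
    if all(len(clusters) == 1 for clusters in e.values()):
        return mean, None, None
    var_mean = 0.0
    for h, clusters in e.items():
        n_h = len(clusters)
        if n_h == 1:
            continue  # a single-cluster stratum contributes 0
        e_bar = sum(clusters.values()) / n_h
        ss = sum((v - e_bar) ** 2 for v in clusters.values())
        var_mean += n_h * (1 - f.get(h, 0.0)) / (n_h - 1) * ss
    return mean, var_mean, var_mean * w_total ** 2


# Hypothetical data: one stratum, six single-observation clusters
ys = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
ps = ["young", "young", "old", "old", "old", "old"]
ws = [10.0, 10.0, 20.0, 20.0, 20.0, 20.0]
records = [("A", k, ps[k], ws[k], ys[k]) for k in range(6)]
control_totals = {"young": 30.0, "old": 60.0}
print(taylor_ps_mean(records, control_totals))
```

The residuals $\tilde{y}_{hij}$ remove the poststratum means before the between-cluster variability is measured, which is why poststrata that explain much of the variation in Y reduce the estimated variance.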

Variance of the Domain Mean and Sum

For a domain D, let $I_ D$ be the corresponding indicator variable:

\[  I_{D}(h,i,j) = \left\{  \begin{array}{ll} 1 &  \mbox{if observation $(h,i,j)$ belongs to $D$} \\ 0 &  \mbox{otherwise} \end{array} \right.  \]

Let

\[ \tilde{v}_{hij}= \tilde{w}_{hij}I_ D(h,i,j) = \left\{  \begin{array}{ll} \tilde{w}_{hij} &  \mbox{if observation $(h,i,j)$ belongs to $D$} \\ 0 &  \mbox{otherwise} \end{array} \right.  \]

The sum and mean of variable Y under poststratification in domain D are

\begin{eqnarray*}  {\hat{{Y}}_ D}^{(PS)} &  = &  \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  \tilde{v}_{hij} ~  y_{hij} \\ {\hat{\bar{Y}}_ D}^{(PS)} &  = &  {\hat{{Y}}_ D}^{(PS)} / ~  \tilde{v}_{\cdot \cdot \cdot } \end{eqnarray*}

where

\[  \tilde{v}_{\cdot \cdot \cdot } = \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} \tilde{v}_{hij}  \]

is the sum of the poststratification weights over all observations in the sample in domain D. For each poststratum $p=1, 2, \ldots , P$, let the mean of variable Y and the mean of the domain indicator variable in each poststratum be

\begin{eqnarray*}  {\hat{\bar{Y}}}_{D}^{(p)} &  = &  \left(\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  I_{p}(h,i,j) ~  I_{D}(h,i,j) ~  \tilde{w}_{hij} ~  y_{hij} \right) / ~  Z_ p \\ {\hat{\bar{I}}}_{D}^{(p)} &  = &  \left(\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  I_{p}(h,i,j) ~  I_{D}(h,i,j) ~  \tilde{w}_{hij} \right) / ~  Z_ p \end{eqnarray*}

Assume that the observation (h, i, j) belongs to the pth poststratum. Let

\begin{eqnarray*}  d_{hij} & =&  y_{hij}I_ D(h,i,j) ~ -~  {\hat{\bar{Y}}}_{D}^{(p)} \\ e_{hij} & =&  d_{hij} ~ -~  \left( ~  I_ D(h,i,j) ~ -~  {\hat{\bar{I}}}_{D}^{(p)} ~ \right) ~  {\hat{\bar{Y}}_ D}^{(PS)} \end{eqnarray*}

Then PROC SURVEYMEANS estimates the variance of domain sum ${\hat{{Y}}_ D}^{(PS)}$ as

\[  \hat{V}\left({\hat{{Y}}_ D}^{(PS)}\right) =\sum _{h=1}^ H {\hat{V}_ h \left({\hat{{Y}}_ D}^{(PS)}\right)}  \]

where, if $n_ h>1$, then

\begin{eqnarray*}  \hat{V}_ h \left({\hat{{Y}}_ D}^{(PS)}\right) &  = &  \frac{n_ h(1-f_ h)}{n_ h-1} ~  \sum _{i=1}^{n_ h} {(d_{hi\cdot }-\bar{d}_{h\cdot \cdot })^2} \\ d_{hi\cdot }& =&  \sum _{j=1}^{m_{hi}}~ \tilde{w}_{hij} d_{hij} \\ \bar{d}_{h\cdot \cdot } & =&  \left( \sum _{i=1}^{n_ h}d_{hi\cdot } \right) / ~  n_ h \end{eqnarray*}

and if $n_ h=1$, then

\[  \hat{V}_ h \left({\hat{{Y}}_ D}^{(PS)}\right) = \left\{  \begin{array}{ll} \mbox{missing} &  \mbox{ if } n_{h'}=1 \mbox{ for } h'=1, 2, \ldots , H \\ 0 &  \mbox{ if } n_{h'}>1 \mbox{ for some } 1 \le h' \le H \end{array} \right.  \]
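As an illustration of the domain sum and its variance, the following Python sketch applies the formulas for $d_{hij}$ and $d_{hi\cdot }$ to a small hypothetical data set. The record layout and the function name `taylor_ps_domain_sum` are assumptions made for the example, and the sampling fractions $f_ h$ default to 0.

```python
from collections import defaultdict


def taylor_ps_domain_sum(records, control_totals, f=None):
    """records: tuples (stratum h, cluster i, poststratum p, weight w,
    domain indicator I_D as 0 or 1, value y).

    Returns (domain sum, estimated variance). Strata with one cluster
    contribute 0; the all-singleton "missing" case is not handled here.
    """
    f = f or {}
    psi = defaultdict(float)
    for h, i, p, w, in_d, y in records:
        psi[p] += w
    wt = [w * control_totals[p] / psi[p] for h, i, p, w, in_d, y in records]
    # Domain sum uses v_tilde = w_tilde * I_D
    domain_sum = sum(wps * r[4] * r[5] for r, wps in zip(records, wt))
    # Poststratum means of y * I_D, each divided by Z_p
    ybar_dp = defaultdict(float)
    for (h, i, p, w, in_d, y), wps in zip(records, wt):
        ybar_dp[p] += wps * in_d * y / control_totals[p]
    # Cluster totals d_hi. = sum_j w_tilde * d_hij
    d = defaultdict(lambda: defaultdict(float))
    for (h, i, p, w, in_d, y), wps in zip(records, wt):
        d[h][i] += wps * (y * in_d - ybar_dp[p])
    variance = 0.0
    for h, clusters in d.items():
        n_h = len(clusters)
        if n_h > 1:
            d_bar = sum(clusters.values()) / n_h
            ss = sum((v - d_bar) ** 2 for v in clusters.values())
            variance += n_h * (1 - f.get(h, 0.0)) / (n_h - 1) * ss
    return domain_sum, variance


# Hypothetical data: one stratum, six single-observation clusters
ys = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
ps = ["young", "young", "old", "old", "old", "old"]
ws = [10.0, 10.0, 20.0, 20.0, 20.0, 20.0]
in_domain = [1, 0, 1, 1, 0, 0]
records = [("A", k, ps[k], ws[k], in_domain[k], ys[k]) for k in range(6)]
print(taylor_ps_domain_sum(records, {"young": 30.0, "old": 60.0}))
```

Observations outside the domain still enter the variance through $d_{hij} = -{\hat{\bar{Y}}}_{D}^{(p)}$, because domain membership is itself random in the sample.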

Then PROC SURVEYMEANS estimates the variance of domain mean ${\hat{\bar{Y}}_ D}^{(PS)}$ as

\[  \hat{V}\left({\hat{\bar{Y}}_ D}^{(PS)}\right) =\sum _{h=1}^ H {\hat{V}_ h \left({\hat{\bar{Y}}_ D}^{(PS)}\right)}  \]

where, if $n_ h>1$, then

\begin{eqnarray*}  \hat{V}_ h \left({\hat{\bar{Y}}_ D}^{(PS)}\right) &  = &  \frac{n_ h(1-f_ h)}{n_ h-1} ~  \sum _{i=1}^{n_ h} {(e_{hi\cdot }-\bar{e}_{h\cdot \cdot })^2} \\ e_{hi\cdot }& =&  \sum _{j=1}^{m_{hi}}~ \tilde{w}_{hij} e_{hij} /\tilde{v}_{\cdot \cdot \cdot } \\ \bar{e}_{h\cdot \cdot } & =&  \left( \sum _{i=1}^{n_ h}e_{hi\cdot } \right) / ~  n_ h \end{eqnarray*}

and if $n_ h=1$, then

\[  \hat{V}_ h \left({\hat{\bar{Y}}_ D}^{(PS)}\right) = \left\{  \begin{array}{ll} \mbox{missing} &  \mbox{ if } n_{h'}=1 \mbox{ for } h'=1, 2, \ldots , H \\ 0 &  \mbox{ if } n_{h'}>1 \mbox{ for some } 1 \le h' \le H \end{array} \right.  \]

Variance of the Ratio

Suppose you want to calculate the ratio of variable Y to variable X. Let $x_{hij}$ and $y_{hij}$ be the values of variable X and variable Y, respectively, for observation (h, i, j).

The ratio of Y to X after poststratification is

\[  \hat{R}^{(PS)} = \frac{ \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  \tilde{w}_{hij} ~  y_{hij} }{ \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  \tilde{w}_{hij} ~  x_{hij} }  \]

where $\tilde{w}_{hij}$ is the poststratification weight for observation $(h, i, j)$.

Assume that the observation (h, i, j) belongs to the pth poststratum. Let

\begin{eqnarray*}  \tilde{y}_{hij} & =&  y_{hij}- \hat{\bar{Y}}^{(p)} \\ \tilde{x}_{hij} & =&  x_{hij}- \hat{\bar{X}}^{(p)} \end{eqnarray*}

where $\hat{\bar{Y}}^{(p)}$ and $\hat{\bar{X}}^{(p)}$ are the means of variable Y and variable X, respectively, in poststratum p.

The variance of $\hat{R}^{(PS)}$ is estimated by

\[  \hat{V}(\hat{R}^{(PS)}) = \sum _{h=1}^ H \hat{V_ h}(\hat{R}^{(PS)})  \]

where, if $n_ h>1$, then

\begin{eqnarray*}  \hat{V_ h}(\hat{R}^{(PS)}) & =&  \frac{n_ h(1-f_ h)}{n_ h-1} ~  \sum _{i=1}^{n_ h} {(g_{hi\cdot }-\bar{g}_{h\cdot \cdot })^2}\\ g_{hi\cdot }& =&  \frac{\sum _{j=1}^{m_{hi}}\tilde{w}_{hij}~ (\tilde{y}_{hij}- \tilde{x}_{hij}\hat{R}^{(PS)}) }{\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  \tilde{w}_{hij} ~  x_{hij}}\\ \bar{g}_{h\cdot \cdot } & =&  \left( \sum _{i=1}^{n_ h}g_{hi\cdot } \right) / ~  n_ h \end{eqnarray*}

and if $n_ h=1$, then

\[  \hat{V_ h}(\hat{R}^{(PS)}) = \left\{  \begin{array}{ll} \mbox{missing} &  \mbox{ if } n_{h'}=1 \mbox{ for } h'=1, 2, \ldots , H \\ 0 &  \mbox{ if } n_{h'}>1 \mbox{ for some } 1 \le h' \le H \end{array} \right.  \]
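The ratio computation can be sketched in Python along the same lines. This is a minimal illustration of the formulas for $\hat{R}^{(PS)}$ and $g_{hi\cdot }$, with a hypothetical record layout and function name (`taylor_ps_ratio`) and with $f_ h$ assumed to be 0 unless supplied.

```python
from collections import defaultdict


def taylor_ps_ratio(records, control_totals, f=None):
    """records: tuples (stratum h, cluster i, poststratum p, weight w, x, y).

    Returns (ratio, estimated variance). Strata with one cluster
    contribute 0; the all-singleton "missing" case is not handled here.
    """
    f = f or {}
    psi = defaultdict(float)
    for h, i, p, w, x, y in records:
        psi[p] += w
    wt = [w * control_totals[p] / psi[p] for h, i, p, w, x, y in records]
    x_total = sum(wps * r[4] for r, wps in zip(records, wt))
    y_total = sum(wps * r[5] for r, wps in zip(records, wt))
    ratio = y_total / x_total
    # Poststratum means of X and Y, each divided by Z_p
    xbar_p = defaultdict(float)
    ybar_p = defaultdict(float)
    for (h, i, p, w, x, y), wps in zip(records, wt):
        xbar_p[p] += wps * x / control_totals[p]
        ybar_p[p] += wps * y / control_totals[p]
    # Cluster totals g_hi. of the linearized residuals
    g = defaultdict(lambda: defaultdict(float))
    for (h, i, p, w, x, y), wps in zip(records, wt):
        resid = (y - ybar_p[p]) - (x - xbar_p[p]) * ratio
        g[h][i] += wps * resid / x_total
    variance = 0.0
    for h, clusters in g.items():
        n_h = len(clusters)
        if n_h > 1:
            g_bar = sum(clusters.values()) / n_h
            ss = sum((v - g_bar) ** 2 for v in clusters.values())
            variance += n_h * (1 - f.get(h, 0.0)) / (n_h - 1) * ss
    return ratio, variance


# Hypothetical data: one stratum, six single-observation clusters
xs = [1.0, 2.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
ps = ["young", "young", "old", "old", "old", "old"]
ws = [10.0, 10.0, 20.0, 20.0, 20.0, 20.0]
records = [("A", k, ps[k], ws[k], xs[k], ys[k]) for k in range(6)]
print(taylor_ps_ratio(records, {"young": 30.0, "old": 60.0}))
```

Two checks follow directly from the formulas: if X equals 1 for every observation, then $\tilde{x}_{hij}=0$ and the result reduces to the variance of the poststratified mean; if Y is an exact multiple of X, every residual is zero and so is the variance.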

Variance of the Domain Ratio

For a domain D, let $I_ D$ be the corresponding indicator variable:

\[  I_{D}(h,i,j) = \left\{  \begin{array}{ll} 1 &  \mbox{if observation $(h,i,j)$ belongs to $D$} \\ 0 &  \mbox{otherwise} \end{array} \right.  \]

Let

\[ \tilde{v}_{hij}= \tilde{w}_{hij}I_ D(h,i,j) = \left\{  \begin{array}{ll} \tilde{w}_{hij} &  \mbox{if observation $(h,i,j)$ belongs to $D$} \\ 0 &  \mbox{otherwise} \end{array} \right.  \]

The ratio of variable Y to variable X in domain D after poststratification is estimated by

\[  {\hat{R}}_ D^{(PS)} = \frac{ \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  \tilde{w}_{hij} ~  y_{hij} ~ I_ D(h,i,j) }{ \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  \tilde{w}_{hij} ~  x_{hij} ~  I_ D(h,i,j) }  \]

For each poststratum $p=1, 2, \ldots , P$, let the domain means of variables Y and X in each poststratum be

\begin{eqnarray*}  {\hat{\bar{Y}}}_{D}^{(p)} &  = &  \left(\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  I_{p}(h,i,j) ~  \tilde{v}_{hij} ~  y_{hij} \right) / ~  Z_ p \\ {\hat{\bar{X}}}_{D}^{(p)} &  = &  \left(\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  I_{p}(h,i,j) ~  \tilde{v}_{hij} ~  x_{hij} \right) / ~  Z_ p \end{eqnarray*}

Assume that the observation (h, i, j) belongs to the pth poststratum. Let

\[  r_{hij} = y_{hij}I_ D(h,i,j) ~ -~  {\hat{\bar{Y}}}_{D}^{(p)} ~ -~  \left( ~  x_{hij}I_ D(h,i,j) ~ -~  {\hat{\bar{X}}}_{D}^{(p)} ~ \right) ~  {\hat{R}}_ D^{(PS)}  \]

Then PROC SURVEYMEANS estimates the variance of domain ratio ${\hat{R}}_ D^{(PS)}$ after poststratification as

\[  \hat{V}\left({\hat{R}}_ D^{(PS)}\right) =\sum _{h=1}^ H {\hat{V}_ h \left({\hat{{R}}_ D}^{(PS)}\right)}  \]

where, if $n_ h>1$, then

\begin{eqnarray*}  \hat{V}_ h \left({\hat{R}}_ D^{(PS)}\right) &  = &  \frac{n_ h(1-f_ h)}{n_ h-1} ~  \sum _{i=1}^{n_ h} {(r_{hi\cdot }-\bar{r}_{h\cdot \cdot })^2} \\ r_{hi\cdot }& =&  \sum _{j=1}^{m_{hi}}~ \tilde{w}_{hij} r_{hij} / \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~  \tilde{v}_{hij} ~  x_{hij} \\ \bar{r}_{h\cdot \cdot } & =&  \left( \sum _{i=1}^{n_ h}r_{hi\cdot } \right) \big / ~  n_ h \end{eqnarray*}

and if $n_ h=1$, then

\[  \hat{V}_ h \left({\hat{R}}_ D^{(PS)}\right) = \left\{  \begin{array}{ll} \mbox{missing} &  \mbox{ if } n_{h'}=1 \mbox{ for } h'=1, 2, \ldots , H \\ 0 &  \mbox{ if } n_{h'}>1 \mbox{ for some } 1 \le h' \le H \end{array} \right.  \]
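A Python sketch of the domain ratio computation follows. It takes ${\hat{\bar{Y}}}_{D}^{(p)}$ as the poststratum domain mean of Y and ${\hat{\bar{X}}}_{D}^{(p)}$ as that of X, and it is meant only to illustrate the formulas for $r_{hij}$ and $r_{hi\cdot }$. The record layout and the function name `taylor_ps_domain_ratio` are hypothetical, and $f_ h$ is assumed to be 0 unless supplied.

```python
from collections import defaultdict


def taylor_ps_domain_ratio(records, control_totals, f=None):
    """records: tuples (stratum h, cluster i, poststratum p, weight w,
    domain indicator I_D as 0 or 1, x, y).

    Returns (domain ratio, estimated variance). Strata with one cluster
    contribute 0; the all-singleton "missing" case is not handled here.
    """
    f = f or {}
    psi = defaultdict(float)
    for h, i, p, w, in_d, x, y in records:
        psi[p] += w
    wt = [w * control_totals[p] / psi[p] for h, i, p, w, in_d, x, y in records]
    # Domain totals use v_tilde = w_tilde * I_D
    x_total = sum(wps * r[4] * r[5] for r, wps in zip(records, wt))
    y_total = sum(wps * r[4] * r[6] for r, wps in zip(records, wt))
    ratio = y_total / x_total
    # Poststratum domain means of X and Y, each divided by Z_p
    xbar_dp = defaultdict(float)
    ybar_dp = defaultdict(float)
    for (h, i, p, w, in_d, x, y), wps in zip(records, wt):
        xbar_dp[p] += wps * in_d * x / control_totals[p]
        ybar_dp[p] += wps * in_d * y / control_totals[p]
    # Cluster totals r_hi. = sum_j w_tilde * r_hij / (sum of v_tilde * x)
    r_tot = defaultdict(lambda: defaultdict(float))
    for (h, i, p, w, in_d, x, y), wps in zip(records, wt):
        r_hij = (y * in_d - ybar_dp[p]) - (x * in_d - xbar_dp[p]) * ratio
        r_tot[h][i] += wps * r_hij / x_total
    variance = 0.0
    for h, clusters in r_tot.items():
        n_h = len(clusters)
        if n_h > 1:
            r_bar = sum(clusters.values()) / n_h
            ss = sum((v - r_bar) ** 2 for v in clusters.values())
            variance += n_h * (1 - f.get(h, 0.0)) / (n_h - 1) * ss
    return ratio, variance


# Hypothetical data: one stratum, six single-observation clusters
xs = [1.0, 2.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
ps = ["young", "young", "old", "old", "old", "old"]
ws = [10.0, 10.0, 20.0, 20.0, 20.0, 20.0]
in_domain = [1, 0, 1, 1, 0, 0]
records = [
    ("A", k, ps[k], ws[k], in_domain[k], xs[k], ys[k]) for k in range(6)
]
print(taylor_ps_domain_ratio(records, {"young": 30.0, "old": 60.0}))
```

Setting X to 1 for every observation turns the domain ratio estimate into the domain mean, because the denominator then equals $\tilde{v}_{\cdot \cdot \cdot }$.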