The SURVEYMEANS Procedure

Quantiles

Let Y be the variable of interest in a complex survey. Denote $F(t)=\Pr (Y\le t)$ as the cumulative distribution function of Y. For $0<p<1$ , the pth quantile of the population cumulative distribution function is

$Q(p)=\inf \{ y: F(y)\ge p\}$

Estimate of Quantile

Let $\{ y_{hij}, w_{hij}\}$ be the observed values for variable Y that are associated with sampling weights, where $(h,i,j)$ are the stratum index, cluster index, and member index, respectively, as shown in the section Definitions and Notation. Let $y_{(1)} < y_{(2)} < \cdots < y_{(n)}$ denote the sample order statistics for variable Y.

An estimate of quantile $Q(p)$ is

$\hat Q(p)= \left\{ \begin{array}{ll} y_{(1)} & \mbox{ if } p<\hat F(y_{(1)}) \\ y_{(k)}+\displaystyle {\frac{p-\hat F(y_{(k)})}{\hat F(y_{(k+1)})-\hat F(y_{(k)})}} (y_{(k+1)}-y_{(k)}) & \mbox{ if } \hat F(y_{(k)}) \le p < \hat F(y_{(k+1)}) \\ y_{(n)} & \mbox{ if } p=1 \end{array} \right.$

where $\hat F(t)$ is the estimated cumulative distribution for Y,

$\hat F(t)=\frac{\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}}w_{hij}I(y_{hij}\le t)}{\sum _{h=1}^ H\sum _{i=1}^{n_ h}\sum _{j=1}^{m_{hi}}w_{hij}}$

and $I(\cdot )$ is the indicator function.

Standard Error

When you specify VARMETHOD=TAYLOR, or by default if you do not specify the VARMETHOD= option, PROC SURVEYMEANS uses Woodruff’s method (Dorfman and Valliant, 1993; Särndal, Swensson, and Wretman, 1992; Francisco and Fuller, 1991) to estimate the variances of quantiles. This method first constructs a confidence interval on a quantile. Then it uses the width of the confidence interval to estimate the standard error of a quantile.

In order to estimate the variance of $\hat Q(p)$ , PROC SURVEYMEANS first estimates the variance of the estimated distribution function $\hat F(\hat Q(p))$ by

$\hat V(\hat F(\hat Q(p))) =\sum _{h=1}^ H \frac{n_ h(1-f_ h)}{n_ h-1} ~ \sum _{i=1}^{n_ h} {(e_{hi\cdot }-\bar{e}_{h\cdot \cdot })^2}$

where

$\begin{eqnarray*} e_{hi\cdot }& =& \left( \sum _{j=1}^{m_{hi}}w_{hij}~ (I(y_{hij} \le \hat Q(p)) - \hat F(\hat Q(p))) \right) / ~ w_{\cdot \cdot \cdot } \\ \bar{e}_{h\cdot \cdot } & =& \left( \sum _{i=1}^{n_ h}e_{hi\cdot } \right) / ~ n_ h \\ w_{\cdot \cdot \cdot } & =& \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} w_{hij} \end{eqnarray*}$

Then $100(1-\alpha )$ % confidence limits for $\hat F(\hat Q(p))$ can be constructed by

$\left(\hat p_ L, \, \, \, \, \, \, \hat p_ U\right)=\left(\hat F(\hat Q(p))-t_{\mi {df},\, \, \alpha /2}\sqrt {\hat V(\hat F(\hat Q(p)))}, \, \, \, \, \, \, \, \, \hat F(\hat Q(p))+t_{\mi {df},\, \, \alpha /2}\sqrt {\hat V(\hat F(\hat Q(p)))}\right)$

where $t_{\mi {df},\, \, \alpha /2}$ is the $100(1-\alpha /2)$ percentile of the t distribution with df degrees of freedom, described in the section Degrees of Freedom.

When $(\hat p_ L, \hat p_ U)$ is out of the range of [0,1], the procedure does not compute the standard error of $\hat Q(p)$ .

The $\hat p_ L$ th quantile is defined as

$\hat Q(\hat p_ L)= \left\{ \begin{array}{ll} y_{(1)} & \mbox{ if } \hat p_ L<\hat F(y_{(1)}) \\ y_{(k_ L)}+\displaystyle {\frac{\hat p_ L-\hat F(y_{(k_ L)})}{\hat F(y_{(k_ L+1)})-\hat F(y_{(k_ L)})}} (y_{(k_ L+1)}-y_{(k_ L)}) & \mbox{ if } \hat F(y_{(k_ L)}) \le \hat p_ L < \hat F(y_{(k_ L+1)}) \\ y_{(d)} & \mbox{ if } \hat p_ L=1 \end{array} \right.$

and the $\hat p_ U$ th quantile is defined as

$\hat Q(\hat p_ U)= \left\{ \begin{array}{ll} y_{(1)} & \mbox{ if } \hat p_ U<\hat F(y_{(1)}) \\ y_{(k_ U)}+\displaystyle {\frac{\hat p_ U-\hat F(y_{(k_ U)})}{\hat F(y_{(k_ U+1)})-\hat F(y_{(k_ U)})}} (y_{(k_ U+1)}-y_{(k_ U)}) & \mbox{ if } \hat F(y_{(k_ U)}) \le \hat p_ U < \hat F(y_{(k_ U+1)}) \\ y_{(d)} & \mbox{ if } \hat p_ U=1 \end{array} \right.$

The standard error of $\hat Q(p)$ is then estimated by

$\hat{\mr {sd}}(\hat Q(p)) = \frac{\hat Q(\hat p_ U) - \hat Q(\hat p_ L)}{2t_{\mi {df},\, \, \alpha /2}}$

where $t_{\mi {df},\, \, \alpha /2}$ is the $100(1-\alpha /2)$ percentile of the t distribution with df degrees of freedom.

When you use the replication method, PROC SURVEYMEANS uses the usual variance estimates for a quantile as described in the section Replication Methods for Variance Estimation. However, you should proceed cautiously, because this variance estimator can have poor properties (Dorfman and Valliant, 1993).

Confidence Limits

Symmetric $100(1-\alpha )$ % confidence limits are computed as

$\left( \hat Q(p) - \hat{\mr {sd}}(\hat Q(p)) ~ t_{\mi {df},\, \, \alpha /2} , \, \, \, \, \, \, \hat Q(p) + \hat{\mr {sd}}(\hat Q(p)) ~ t_{\mi {df},\, \, \alpha /2} \right)$

If you specify the NONSYMCL option in the PROC SURVEYMEANS statement when you use the VARMETHOD=TAYLOR option, the procedure computes $100(1-\alpha )$ % nonsymmetric confidence limits:

$\left(\hat Q(\hat p_ L), \, \, \, \, \, \hat Q(\hat p_ U) \right)$

Quantile Estimation with Poststratification

When you specify a POSTSTRATA statement, the quantile estimation and its variance estimation incorporate poststratification. For more information about poststratification, see the section Poststratification.

For a selected sample, let $r=1, 2, \ldots , R$ be the poststratum index; let $Z_1, Z_2, \ldots , Z_ R$ be the population totals for each corresponding poststratum, and let $I_ r$ be the indicator variable for the poststratum r that is defined by

$I_{r}(h,i,j) = \left\{ \begin{array}{ll} 1 & \mbox{if observation } (h,i,j) \mbox{ belongs to the } r\mbox{th poststratum} \\ 0 & \mbox{otherwise} \end{array} \right.$

Denote the total sum of original weights in the sample for each poststratum as

$\psi _ r = \sum _{h=1}^ H\sum _{i=1}^{n_ h}\sum _{j=1}^{m_{hi}} ~ w_{hij} I_{r}(h,i,j)$

Assume that the observation (h, i, j) belongs to the rth poststratum. Then the poststratification weight for the observation (h, i, j) is

$\tilde{w}_{hij} = w_{hij} \frac{Z_ r}{\psi _ r}$

Then the estimated cumulative distribution function of Y, $\hat F(t)$ and the estimated pth quantile estimation $\hat Q(p)$ can be computed as in the section Estimate of Quantile by replacing the original weights, $w_{hij}$ , with the poststratification weights, $\tilde{w}_{hij}$ .

When you specify VARMETHOD=TAYLOR (or by default), the variance of $\hat Q(p)$ is estimated as in the section Standard Error, except that the variance of the estimated distribution function $\hat F(\hat Q(p))$ is computed as follows.

For each poststratum $r=1, 2, \ldots , r$ , define

${\hat{\theta }}^{(r)}(\hat Q(p))= Z_ r^{-1} \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~ I_{r}(h,i,j) ~ \tilde{w}_{hij} ~ (I(y_{hij} \le \hat Q(p)) - \hat F(\hat Q(p)))$

where $I(\cdot )$ is the indicator function.

Assume that the observation (h, i, j) belongs to the rth poststratum. Let

$\tilde{y}_{hij}=I(y_{hij} \le \hat Q(p)) - \hat F(\hat Q(p)) - {\hat{\theta }}^{(r)}(\hat Q(p))$

PROC SURVEYMEANS estimates the variance of the estimated distribution function $\hat F(\hat Q(p))$ with poststratification by

$\hat V(\hat F(\hat Q(p))) =\sum _{h=1}^ H \frac{n_ h(1-f_ h)}{n_ h-1} ~ \sum _{i=1}^{n_ h} {(u_{hi\cdot }-\bar{u}_{h\cdot \cdot })^2}$

where

$\begin{eqnarray*} u_{hi\cdot }& =& \left( \sum _{j=1}^{m_{hi}}\tilde{w}_{hij}~ \tilde{y}_{hij} \right) / ~ \tilde{w}_{\cdot \cdot \cdot } \\ \bar{u}_{h\cdot \cdot } & =& \left( \sum _{i=1}^{n_ h}u_{hi\cdot } \right) / ~ n_ h \\ \tilde{w}_{\cdot \cdot \cdot } & =& \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} \tilde{w}_{hij} \end{eqnarray*}$

Domain Quantile

Let Y be the variable of interest in a complex survey, and let a subpopulation of interest be domain D. Denote $F_ D(t)$ as the cumulative distribution function of Y in domain D. For $0<p<1$ , the pth quantile of the population cumulative distribution function is

$Q_ D(p)=\inf \{ y: F_ D(y)\ge p\}$

Let $I_ D$ be the corresponding indicator variable:

$I_{D}(h,i,j) = \left\{ \begin{array}{ll} 1 & \mbox{if observation $(h,i,j)$ belongs to \Mathtext{D}} \\ 0 & \mbox{otherwise} \end{array} \right.$

Assume that there are a total of d observations among the n observations in the entire sample that belong to domain D. Let $y_{(1)} < y_{(2)} < \cdots < y_{(d)}$ denote the order statistics of variable Y for these d observations that fall in domain D.

The cumulative distribution function of Y in domain D is estimated by

$\hat F_ D(t)=\frac{\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}}w_{hij}I(y_{hij}\le t)I_{D}(h,i,j)}{\sum _{h=1}^ H\sum _{i=1}^{n_ h}\sum _{j=1}^{m_{hi}}w_{hij}I_{D}(h,i,j)}$

and $I(\cdot )$ is the indicator function. Then the estimated quantile in domain D is

$\hat Q_ D(p)= \left\{ \begin{array}{ll} y_{(1)} & \mbox{ if } p<\hat F_ D(y_{(1)}) \\ y_{(k)}+\displaystyle {\frac{p-\hat F_ D(y_{(k)})}{\hat F_ D(y_{(k+1)})-\hat F_ D(y_{(k)})}} (y_{(k+1)}-y_{(k)}) & \mbox{ if } \hat F_ D(y_{(k)}) \le p < \hat F_ D(y_{(k+1)}) \\ y_{(d)} & \mbox{ if } p=1 \end{array} \right.$

In order to estimate the variance for $\hat Q_ D(p)$ , PROC SURVEYMEANS first estimates the variance of the estimated distribution function $\hat F_ D(\hat Q_ D(p))$ in domain D. When you specify VARMETHOD=TAYLOR (or by default), the variance of $\hat F_ D(\hat Q_ D(p))$ is estimated by

$\hat V(\hat F_ D(\hat Q_ D(p))) =\sum _{h=1}^ H \frac{n_ h(1-f_ h)}{n_ h-1} ~ \sum _{i=1}^{n_ h} {(d_{hi\cdot }-\bar{d}_{h\cdot \cdot })^2}$

where

$\begin{eqnarray*} v_{hij} & =& I_{D}(h,i,j) w_{hij} \\ v_{\cdot \cdot \cdot } & =& \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} v_{hij} \\ d_{hi\cdot }& =& \left( \sum _{j=1}^{m_{hi}}v_{hij}~ (I(y_{hij} \le \hat Q_ D(p)) - \hat F_ D(\hat Q_ D(p))) \right) / ~ v_{\cdot \cdot \cdot } \\ \bar{d}_{h\cdot \cdot } & =& \left( \sum _{i=1}^{n_ h}d_{hi\cdot } \right) / ~ n_ h \end{eqnarray*}$

Then $100(1-\alpha )$ % confidence limits for $\hat F_ D(\hat Q_ D(p))$ can be constructed by $(\hat{p}_{DL}, \hat{p}_{DU})$ , where

$\begin{eqnarray*} \hat{p}_{DL} & =& \hat F_ D(\hat Q_ D(p)) -t_{\mi {df},\, \, \alpha /2}\sqrt {\hat V(\hat F_ D(\hat Q_ D(p)))} \\ \hat{p}_{DU} & =& \hat F_ D(\hat Q_ D(p)) +t_{\mi {df},\, \, \alpha /2}\sqrt {\hat V(\hat F_ D(\hat Q_ D(p)))} \end{eqnarray*}$

and $t_{\mi {df},\, \, \alpha /2}$ is the $100(1-\alpha /2)$ percentile of the t distribution with df degrees of freedom, described in the section Degrees of Freedom. When $(\hat{p}_{DL}, \hat{p}_{DU})$ is out of the range of [0,1], PROC SURVEYMEANS does not compute the standard error of $\hat Q_ D(p)$ .

The $\hat{p}_{DL}$ th quantile is then estimated as

$\hat Q_ D(\hat{p}_{DL})= \left\{ \begin{array}{ll} y_{(1)} & \mbox{ if } \hat{p}_{DL}<\hat F_ D(y_{(1)}) \\ y_{(k_ L)}+\displaystyle {\frac{(\hat{p}_{DL}-\hat F_ D(y_{(k_ L)}))(y_{(k_ L+1)}-y_{(k_ L)})}{\hat F_ D(y_{(k_ L+1)})-\hat F_ D(y_{(k_ L)})}} & \mbox{ if } \hat F_ D(y_{(k_ L)}) \le \hat p_{DL} < \hat F_ D(y_{(k_ L+1)}) \\ y_{(d)} & \mbox{ if } \hat p_{DL}=1 \end{array} \right.$

The $\hat{p}_{DU}$ th quantile is then estimated as

$\hat Q_ D(\hat{p}_{DU})= \left\{ \begin{array}{ll} y_{(1)} & \mbox{ if } \hat{p}_{DU}<\hat F_ D(y_{(1)}) \\ y_{(k_ U)}+\displaystyle {\frac{(\hat{p}_{DU}-\hat F_ D(y_{(k_ U)}))(y_{(k_ U+1)}-y_{(k_ U)})}{\hat F_ D(y_{(k_ U+1)})-\hat F_ D(y_{(k_ U)})}} & \mbox{ if } \hat F_ D(y_{(k_ U)}) \le \hat p_{DU} < \hat F_ D(y_{(k_ U+1)}) \\ y_{(d)} & \mbox{ if } \hat p_{DU}=1 \end{array} \right.$

The standard error of $\hat Q_ D(p)$ is then estimated by

$\hat{\mr {sd}}(\hat Q_ D(p)) = \frac{\hat Q_ D(\hat{p}_{DU}) - \hat Q_ D(\hat{p}_{DL})}{2t_{\mi {df},\, \, \alpha /2}}$

where $t_{\mi {df},\, \, \alpha /2}$ is the $100(1-\alpha /2)$ percentile of the t distribution with df degrees of freedom.

Symmetric $100(1-\alpha )$ % confidence limits for $\hat Q_ D(p)$ are computed as

$\left( \hat Q_ D(p) - \hat{\mr {sd}}(\hat Q_ D(p)) ~ t_{\mi {df},\, \, \alpha /2} , \, \, \, \, \, \, \hat Q_ D(p) + \hat{\mr {sd}}(\hat Q_ D(p)) ~ t_{\mi {df},\, \, \alpha /2} \right)$

If you specify the NONSYMCL option in the PROC SURVEYMEANS statement, the procedure displays $100(1-\alpha )$ % nonsymmetric confidence limits as

$\left(\hat Q_ D(\hat{p}_{DL}), \, \, \, \, \, \hat Q_ D(\hat{p}_{DU}) \right)$

Domain Quantile Estimation with Poststratification

When you specify both a POSTSTRATA statement and a DOMAIN statement, the domain quantile estimation and its variance estimation incorporate poststratification. For more information about poststratification, see the section Poststratification.

For a selected sample, let $r=1, 2, \ldots , R$ be the poststratum index, let $Z_1, Z_2, \ldots , Z_ R$ be the population totals for each corresponding poststratum, and let $I_ r$ be the indicator variable for the poststratum r:

$I_{r}(h,i,j) = \left\{ \begin{array}{ll} 1 & \mbox{if observation } (h,i,j) \mbox{ belongs to the } r\mbox{th poststratum} \\ 0 & \mbox{otherwise} \end{array} \right.$

The poststratification weights, $\tilde{w}_{hij}$ , are defined as in the section Quantile Estimation with Poststratification.

For domain D, let $I_ D$ be the corresponding indicator variable:

$I_{D}(h,i,j) = \left\{ \begin{array}{ll} 1 & \mbox{if observation $(h,i,j)$ belongs to \Mathtext{D}} \\ 0 & \mbox{otherwise} \end{array} \right.$

With poststratification, for variable Y, the estimated cumulative distribution in domain D, $\hat F_ D(t)$ , and its pth quantile estimation, $\hat Q_ D(p)$ , can be computed as in the section Domain Quantile by replacing the original weights, $w_{hij}$ , with the poststratification weights, $\tilde{w}_{hij}$ . However, the variance of $\hat F_ D(\hat Q_ D(p))$ , which is described in the section Domain Quantile, is computed as follows when you specify the VARMETHOD=TAYLOR option (or by default).

Define

$\begin{eqnarray*} {\hat{\theta }}^{(r)}_ D(\hat Q_ D(p))& =& Z_ r^{-1} \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~ I_{D}(h,i,j) I_{r}(h,i,j) ~ \tilde{w}_{hij} ~ (I(y_{hij} \le \hat Q_ D(p)) - \hat F_ D(\hat Q_ D(p))) \\ {\hat{\bar{I}}}_{D}^{(r)} & = & Z_ r^{-1} \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~ I_{r}(h,i,j) ~ I_{D}(h,i,j) ~ \tilde{w}_{hij} \\ \hat\theta _ D(p) & =& \frac{\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~ I_{D}(h,i,j) ~ \tilde{w}_{hij} ~ (I(y_{hij} \le \hat Q_ D(p)) - \hat F_ D(\hat Q_ D(p))) }{\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} ~ I_{D}(h,i,j) ~ \tilde{w}_{hij}} \end{eqnarray*}$

Assume that the observation (h, i, j) belongs to the rth poststratum. Then the variance of $\hat F_ D(\hat Q_ D(p))$ is estimated by

$\begin{eqnarray*} \hat V(\hat F_ D(\hat Q_ D(p))) & =& \sum _{h=1}^ H \frac{n_ h(1-f_ h)}{n_ h-1} ~ \sum _{i=1}^{n_ h} {(e_{hi\cdot }-\bar{e}_{h\cdot \cdot })^2} \\ e_{hij} & =& I_ D(h,i,j) \left( I(y_{hij} \le \hat Q_ D(p)) - \hat F_ D(\hat Q_ D(p))\right) ~ -~ {\hat{\theta }}^{(r)}_ D(\hat Q_ D(p)) \\ & & ~ -~ \left( ~ I_ D(h,i,j) ~ -~ {\hat{\bar{I}}}_{D}^{(r)} ~ \right) ~ \hat\theta _ D(p) \\ e_{hi\cdot }& =& \sum _{j=1}^{m_{hi}}~ \tilde{w}_{hij} e_{hij} /\tilde{w}_{\cdot \cdot \cdot } \\ \bar{e}_{h\cdot \cdot } & =& \left( \sum _{i=1}^{n_ h}e_{hi\cdot } \right) / ~ n_ h \end{eqnarray*}$