Using Bootstrap Replicate Weights with SAS/STAT Survey Procedures


Overview

Providing bootstrap replicate weights with survey data for the purpose of variance estimation is a common practice within the survey sampling community. However, the SAS/STAT survey procedures, at present, do not provide options specifically designed to accommodate bootstrap replicate weights. Fortunately, some survey bootstrap variance estimators share a common mathematical form with the balanced repeated replication (BRR) variance estimator, and all of the SAS/STAT survey procedures do support BRR variance estimation. This example shows how, and under what conditions, this commonality can be exploited so that bootstrap replicate weights can be used with existing SAS/STAT survey procedures to perform variance estimation.

Analysis

Suppose that $\theta $ is a population parameter of interest. Let $\hat{\theta }$ be the estimate of $\theta $ from the full sample. Let $\hat{\theta _ r}$ be the estimate from the $r$th bootstrap replicate subsample, and let $R$ represent the total number of replicates. Lehtonen and Pahkinen (2004) show that when there are an equal number, $\alpha $, of primary sampling units (PSUs) per stratum, a consistent bootstrap estimator for the variance of $\hat{\theta }$ has the following form:

\[  \widehat{V}(\hat{\theta }) = \frac{\alpha }{\alpha - 1}\frac{1}{R} \sum _{r=1}^ R \left( \hat{\theta _ r} - \hat{\theta } \right)^2  \]

Now compare the bootstrap variance estimator to the variance estimator that the SAS/STAT survey procedures use when you specify Fay’s BRR method:

\[  \widehat{V}(\hat{\theta }) = \frac{1}{R{(1-\epsilon )}^2} \sum _{r=1}^ R \left( \hat{\theta _ r} - \hat{\theta } \right)^2  \]

where $\epsilon $ is the Fay coefficient. The two variance estimators are equivalent if you set

\[  \frac{\alpha }{\alpha - 1} = \frac{1}{{(1-\epsilon )}^2}  \]

or

\[  \epsilon = 1 - \sqrt {\frac{\alpha -1}{\alpha }}  \]
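For completeness, the step between the two preceding expressions can be written out. Taking reciprocals of both sides and then the positive square root (positive because a Fay coefficient satisfies $0 \le \epsilon < 1$, so that $1-\epsilon > 0$) gives

\[  {(1-\epsilon )}^2 = \frac{\alpha - 1}{\alpha } \quad \Longrightarrow \quad 1-\epsilon = \sqrt {\frac{\alpha - 1}{\alpha }}  \]

Because $\alpha \ge 2$ is needed for the factor $\frac{\alpha }{\alpha - 1}$ to be defined, the resulting $\epsilon $ always lies strictly between 0 and 1. It is largest when $\alpha = 2$ and shrinks toward 0 as $\alpha $ grows.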

The equivalence of these two estimators means that you can use bootstrap replicate weights with any of the SAS/STAT survey procedures by specifying VARMETHOD=BRR and setting the appropriate value for the FAY=$\epsilon $ method-option.

Example

To demonstrate, this example uses a subset of the Mini-Finland Health Survey data set, which is available at the Virtual Laboratory Survey Sampling (VLSS) web site. The data set has a stratified, two-stage sampling design with two clusters per stratum and equal probabilities of selection. It includes the variables STR, CLU, CHRON, SYSBP, and X, together with a constructed variable, WGT. STR and CLU are stratum and cluster identifiers, respectively. They matter to the example only because they enable you to verify that the sampling design satisfies the requirement of a constant number of PSUs per stratum. In this case, there are two clusters per stratum, so $\alpha =2$. The variable CHRON contains the within-cluster sums of a binary response variable that indicates chronic morbidity. The variable SYSBP contains the within-cluster sums of a continuous response variable that measures systolic blood pressure. The variable X contains the number of respondents within each cluster. The constructed variable WGT represents the sampling weight and has a constant value of 1, which reflects the equal probabilities of selection.


data MFH;
   input STR CLU CHRON SYSBP X;
   wgt=1;
   label
      STR="Stratum ID"
      CLU="Cluster ID"
      CHRON="Cluster sample sum of Chronic morbidity (CHRON)"
      SYSBP="Cluster sample sum for Systolic blood pressure (SYSBP)"
      X="Number of sample elements in the clusters";
datalines;
   1  1 70 29056  204
   1  2 74 29417  210
   2  1 12  3692   26
   2  2 14  4564   30
   3  1 15  7741   59
   3  2 16  8585   63
   4  1  9  6277   45
   4  2 14  5668   43
   5  1 10  2322   17
   5  2 16  3960   30
   6  1 10  3080   21
   6  2  6  3252   22
   7  1 10  3966   27
   7  2  4  3261   24
   8  1 12  4156   28
   8  2  6  2852   20
   9  1 15  6617   46
   9  2 23  6616   48
   10 1 37 10552   73
   10 2 25 11032   77
   11 1 11  8759   60
   11 2 25  9876   72
   12 1 33  9901   69
   12 2 24  6828   47
   13 1 31  8624   61
   13 2 27  9390   66
   14 1 22  6960   48
   14 2 20  7130   49
   15 1 18  6646   49
   15 2 22  7094   49
   16 1 24  9841   69
   16 2 37 11786   83
   17 1 19  6910   48
   17 2 23  6446   45
   18 1 25 10742   73
   18 2 29  9026   61
   19 1 36  9350   65
   19 2 34  8912   62
   20 1  9  3810   26
   20 2 22  7098   51
   21 1 18  6998   53
   21 2 34  9970   69
   22 1 29 11146   79
   22 2 41 13215   94
   23 1 22  6596   48
   23 2 18  6002   41
   24 1 15  3808   27
   24 2  7  3148   22
;
run;
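The requirement of a constant number of PSUs per stratum can be checked directly from STR and CLU. The following is a minimal sketch of one way to do so; the table name psu_counts and the column names n_psu, min_psu, and max_psu are hypothetical names introduced only for illustration.

```sas
/* Sketch: count the distinct clusters (PSUs) within each stratum,   */
/* then confirm that the minimum and maximum counts coincide.        */
proc sql;
   create table psu_counts as
      select STR, count(distinct CLU) as n_psu
      from MFH
      group by STR;

   /* If min_psu equals max_psu, that common value is alpha. */
   select min(n_psu) as min_psu,
          max(n_psu) as max_psu
      from psu_counts;
quit;
```

If the two reported values differ, the design has a variable number of PSUs per stratum and the single constant $\alpha $ is not available.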

A data set that contains 1000 bootstrap replicate weights was produced using the SAS macro %BOOT, which is available for download from the VLSS web site. This data set is then merged with the MFH data set.


data MFH;
   /* No BY statement: a one-to-one merge that pairs observations by position */
   merge mfh survey.weights;
run;

The population parameters that are to be estimated are the proportion of the population that exhibits chronic morbidity and the mean systolic blood pressure. You can use the SURVEYMEANS procedure to estimate both proportions and means. Before submitting a call to PROC SURVEYMEANS, use the knowledge that $\alpha = 2$ to calculate the value for $\epsilon $ as follows:

\[  \epsilon = 1 - \sqrt {\frac{2-1}{2}} = 0.29289322  \]
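Rather than typing the rounded value by hand, you might compute it in a DATA step. The following sketch assumes $\alpha = 2$ as in this example; the macro variable name FAY is a hypothetical choice made for illustration.

```sas
/* Sketch: compute the Fay coefficient from the number of PSUs per stratum. */
data _null_;
   alpha   = 2;                               /* PSUs per stratum in this example     */
   epsilon = 1 - sqrt((alpha - 1) / alpha);   /* epsilon = 1 - sqrt((alpha-1)/alpha)  */
   put epsilon= 10.8;                         /* write the value to the SAS log       */
   call symputx('fay', epsilon);              /* store it in macro variable FAY       */
run;

%put Fay coefficient: &fay;
```

The stored value could then be supplied to a survey procedure as VARMETHOD=BRR(FAY=&fay) in place of the literal constant.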

In the PROC SURVEYMEANS statement, specify VARMETHOD=BRR together with the FAY=$\epsilon $ method-option. Because both the proportion and the mean to be estimated are ratios of cluster totals, you must list the two numerators, CHRON and SYSBP, and the common denominator, X, in a VAR statement and specify the form of the ratios in a RATIO statement. Use the REPWEIGHTS statement to specify the variables that contain the bootstrap weights, and use the WEIGHT statement to specify the variable that contains the sampling weights.


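proc surveymeans data=mfh varmethod=brr(fay=0.29289322);
   var chron sysbp x;            /* numerators and common denominator */
   ratio chron sysbp / x;        /* requests CHRON/X and SYSBP/X      */
   repweights w1-w1000;          /* bootstrap replicate weights       */
   weight wgt;                   /* full-sample sampling weight       */
run;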

Figure 1 displays the results of the analysis. The estimates of the population parameters of interest and their bootstrap standard errors are listed under the heading Ratio Analysis. The estimate of the proportion of the population that exhibits chronic morbidity is 0.3976, and its standard error is 0.0108. The estimate of the population mean systolic blood pressure is 141.785, and its standard error is 0.497.

Figure 1: SURVEYMEANS Procedure Results Using Bootstrap Replicate Weights

The SURVEYMEANS Procedure

Data Summary
  Number of Observations    48
  Sum of Weights            48

Variance Estimation
  Method                    BRR
  Replicate Weights         MFH
  Number of Replicates      1000
  Fay Coefficient           0.29289322

Statistics
Variable  Label                                                    N         Mean  Std Error of Mean  95% CL for Mean
CHRON     Cluster sample sum of Chronic morbidity (CHRON)         48    22.354167           0.819272    20.74648    23.96186
SYSBP     Cluster sample sum for Systolic blood pressure (SYSBP)  48  7972.458333         142.064951  7693.67873  8251.23794
X         Number of sample elements in the clusters               48    56.229167           1.002902    54.26113    58.19720

Ratio Analysis
Numerator  Denominator   N       Ratio   Std Err  95% CL for Ratio
CHRON      X            48    0.397555  0.010810    0.376342    0.418767
SYSBP      X            48  141.785106  0.497474  140.808894  142.761317
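As a sanity check on the point estimates, note that the replicate weights affect only the standard errors. With WGT equal to 1 for every cluster, each ratio estimate is simply a ratio of column totals. The following sketch recomputes the two ratios directly; the column names chron_ratio and sysbp_ratio are hypothetical.

```sas
/* Sketch: recompute the ratio point estimates as ratios of totals. */
/* Valid here because the sampling weight WGT is constant at 1.     */
proc sql;
   select sum(CHRON) / sum(X) as chron_ratio format=10.6,
          sum(SYSBP) / sum(X) as sysbp_ratio format=10.6
      from MFH;
quit;
```

The two values should reproduce the Ratio column of the Ratio Analysis table in Figure 1.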


A variant of the bootstrap is the mean bootstrap. To compute mean bootstrap weights, you compute regular bootstrap weights and average the weights in groups of size C. Thus, if you generate K bootstrap weights and average them in groups of size C, you are left with $R=\frac{K}{C}$ mean bootstrap replicates. The mean bootstrap variance estimator has the following form:

\[  \widehat{V}(\hat{\theta }) = \frac{C\alpha }{\alpha - 1}\frac{1}{R} \sum _{r=1}^ R \left( \hat{\theta _ r} - \hat{\theta } \right)^2  \]
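The averaging step that produces mean bootstrap weights can be sketched in a DATA step. This is an illustration under stated assumptions, not the method of any particular macro: it assumes that the K = 1000 regular bootstrap weights are the variables W1-W1000 from the preceding example, that groups are formed from consecutive weights, and that K is an exact multiple of C. The group size C = 10, the output data set MFH_MEAN, and the variables MW1-MW100 are hypothetical choices.

```sas
%let K = 1000;              /* number of regular bootstrap weights (w1-w1000) */
%let C = 10;                /* hypothetical group size                        */
%let R = %eval(&K / &C);    /* number of mean bootstrap replicates            */

data mfh_mean;
   set mfh;
   array w{&K}  w1-w&K;     /* regular bootstrap weights   */
   array mw{&R} mw1-mw&R;   /* mean bootstrap weights      */
   do r = 1 to &R;
      total = 0;
      /* sum the C consecutive weights that belong to group r */
      do j = 1 to &C;
         total = total + w{(r - 1) * &C + j};
      end;
      mw{r} = total / &C;   /* average within the group */
   end;
   drop r j total w1-w&K;
run;
```

The resulting variables would be listed in the REPWEIGHTS statement in place of the regular bootstrap weights.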

The procedure for using mean bootstrap weights with the SAS/STAT survey procedures is the same as demonstrated in the preceding example, except that $\epsilon $ is now computed as follows:

\[  \epsilon = 1 - \sqrt {\frac{\alpha - 1}{C\alpha }}  \]
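As a worked illustration, suppose that $\alpha = 2$ as in the preceding example and that the weights are averaged in groups of a hypothetical size $C = 5$. Then

\[  \epsilon = 1 - \sqrt {\frac{2 - 1}{5 \cdot 2}} = 1 - \sqrt {0.1} \approx 0.68377223  \]

Setting $C = 1$ recovers the value for the ordinary bootstrap, $\epsilon = 1 - \sqrt {\frac{\alpha - 1}{\alpha }}$, as expected.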

Consistency of the variance estimator for both the naive bootstrap and the mean bootstrap requires that the number of PSUs per stratum be constant so that the squared deviations can be properly scaled. However, Rao and Wu (1988) show that when the number of PSUs per stratum is variable, alternative methods do exist for computing bootstrap weights. The squared deviations of the variance estimator must still be properly scaled, but the scaling factor is no longer the single constant $\frac{\alpha }{\alpha - 1}$. Therefore, when the sampling design includes a variable number of PSUs per stratum, the scaling factor must be applied directly to the bootstrap weights. If the scaling factors are included in the bootstrap replicate weights, you can still use the SAS/STAT survey procedures to estimate population parameters and their variances by using the BRR method (VARMETHOD=BRR), but you no longer need to specify the FAY=$\epsilon $ method-option.
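For the variable-PSU case just described, a call might look like the following sketch. It assumes that the scaling factors have already been built into the replicate weights. The data set SCALED_BOOT, the analysis variable Y, the sampling weight BASEWGT, and the replicate weight variables RW1-RW500 are all hypothetical names used only for illustration.

```sas
/* Sketch: replicate weights that already incorporate the scaling factors. */
/* No FAY= method-option is specified.                                     */
proc surveymeans data=scaled_boot varmethod=brr;
   var y;                    /* hypothetical analysis variable              */
   repweights rw1-rw500;     /* hypothetical pre-scaled bootstrap weights   */
   weight basewgt;           /* hypothetical full-sample sampling weight    */
run;
```

Without the FAY= method-option, the estimator in the Fay form shown earlier has $\epsilon = 0$ and reduces to the plain average of the squared deviations over the $R$ replicates, so all of the required scaling must come from the weights themselves.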

References

  • Lehtonen, R. and Pahkinen, E. (2004), Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition, Chichester, UK: John Wiley & Sons.

  • Rao, J. N. K. and Wu, C. F. J. (1988), “Resampling Inference with Complex Survey Data,” Journal of the American Statistical Association, 83, 231–241.