# SAS/STAT Examples

## Using Bootstrap Replicate Weights with SAS/STAT Survey Procedures


## Overview

Providing bootstrap replicate weights with survey data for the purpose of variance estimation is a common practice within the survey sampling community. However, the SAS/STAT survey procedures, at present, do not provide options specifically designed to accommodate bootstrap replicate weights. Fortunately, some survey bootstrap variance estimators share a common mathematical form with the balanced repeated replication (BRR) variance estimator, and all of the SAS/STAT survey procedures do support BRR variance estimation. This example shows how, and under what conditions, this commonality can be exploited so that bootstrap replicate weights can be used with existing SAS/STAT survey procedures to perform variance estimation.

## Analysis

Suppose that $\theta$ is a population parameter of interest. Let $\hat{\theta}$ be the estimate of $\theta$ from the full sample. Let $\hat{\theta}_r$ be the estimate from the $r$th bootstrap replicate subsample, and let $R$ represent the total number of replicates. Lehtonen and Pahkinen (2004) show that when there are an equal number, $m$, of primary sampling units (PSUs) per stratum, a consistent bootstrap estimator for the variance of $\hat{\theta}$ has the following form:

$$\hat{V}_{\mathrm{boot}}(\hat{\theta}) = \frac{m}{m-1} \cdot \frac{1}{R} \sum_{r=1}^{R} \left(\hat{\theta}_r - \hat{\theta}\right)^2$$

Now compare the bootstrap variance estimator to the variance estimator that the SAS/STAT survey procedures use when you specify Fay's BRR method:

$$\hat{V}_{\mathrm{Fay}}(\hat{\theta}) = \frac{1}{R(1-\epsilon)^2} \sum_{r=1}^{R} \left(\hat{\theta}_r - \hat{\theta}\right)^2$$

where $\epsilon$ is the Fay coefficient. The two variance estimators are equivalent if you set

$$\frac{1}{(1-\epsilon)^2} = \frac{m}{m-1}$$

or

$$\epsilon = 1 - \sqrt{\frac{m-1}{m}}$$

The equivalence of these two estimators means that you can use bootstrap replicate weights with any of the SAS/STAT survey procedures by specifying VARMETHOD=BRR and setting the appropriate value for the FAY= method-option.
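As a quick numerical check, the following DATA step is a minimal sketch that tabulates the Fay coefficient $\epsilon = 1 - \sqrt{(m-1)/m}$ for several values of $m$ and compares the resulting BRR multiplier $1/(1-\epsilon)^2$ with the bootstrap multiplier $m/(m-1)$. The data set name FAY_COEF and the variable names are illustrative assumptions and are not part of the original program.

```sas
/* Sketch: Fay coefficient implied by a constant number m of PSUs per stratum */
data fay_coef;
   do m = 2 to 5;
      fay       = 1 - sqrt((m - 1) / m);   /* value to pass to FAY=          */
      brr_mult  = 1 / (1 - fay)**2;        /* multiplier used by Fay's BRR   */
      boot_mult = m / (m - 1);             /* multiplier in the bootstrap    */
      output;
   end;
run;

proc print data=fay_coef noobs;
   format fay 10.8 brr_mult boot_mult 8.5;
run;
```

If the algebra holds, the BRR_MULT and BOOT_MULT columns should agree in every row.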

## Example

To demonstrate, this example uses a subset of the Mini-Finland Health Survey data set, which is available at the Virtual Laboratory Survey Sampling (VLSS) Web site. The data set has a stratified, two-stage sampling design with two clusters per stratum and equal probabilities of selection. The data set includes the variables STR, CLU, CHRON, SYSBP, and X, and a constructed variable WGT. STR and CLU are the stratum and cluster identifiers, respectively. They matter to the example only because they enable you to verify that the sampling design satisfies the requirement of a constant number of PSUs per stratum. In this case, there are two clusters per stratum, so $m = 2$. The variable CHRON contains the within-cluster sums of a binary response variable that indicates chronic morbidity. The variable SYSBP contains the within-cluster sums of a continuous response variable that measures systolic blood pressure. The variable X contains the number of respondents within each cluster. The constructed variable WGT represents the sampling weight and has a constant value of 1, which reflects the equal probabilities of selection.


```sas
data MFH;
   input STR CLU CHRON SYSBP X;
   wgt = 1;   /* constant sampling weight: equal probabilities of selection */
   label
      STR   = "Stratum ID"
      CLU   = "Cluster ID"
      CHRON = "Cluster sample sum of Chronic morbidity (CHRON)"
      SYSBP = "Cluster sample sum for Systolic blood pressure (SYSBP)"
      X     = "Number of sample elements in the clusters";
   datalines;
1  1 70 29056  204
1  2 74 29417  210
2  1 12  3692   26
2  2 14  4564   30
3  1 15  7741   59
3  2 16  8585   63
4  1  9  6277   45
4  2 14  5668   43
5  1 10  2322   17
5  2 16  3960   30
6  1 10  3080   21
6  2  6  3252   22
7  1 10  3966   27
7  2  4  3261   24
8  1 12  4156   28
8  2  6  2852   20
9  1 15  6617   46
9  2 23  6616   48
10 1 37 10552   73
10 2 25 11032   77
11 1 11  8759   60
11 2 25  9876   72
12 1 33  9901   69
12 2 24  6828   47
13 1 31  8624   61
13 2 27  9390   66
14 1 22  6960   48
14 2 20  7130   49
15 1 18  6646   49
15 2 22  7094   49
16 1 24  9841   69
16 2 37 11786   83
17 1 19  6910   48
17 2 23  6446   45
18 1 25 10742   73
18 2 29  9026   61
19 1 36  9350   65
19 2 34  8912   62
20 1  9  3810   26
20 2 22  7098   51
21 1 18  6998   53
21 2 34  9970   69
22 1 29 11146   79
22 2 41 13215   94
23 1 22  6596   48
23 2 18  6002   41
24 1 15  3808   27
24 2  7  3148   22
;
run;
```


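A data set that contains 1,000 bootstrap replicate weights was produced by using the SAS macro %BOOT, which is available for download from the VLSS Web site. This data set is then merged with the MFH data set. Because the MERGE statement is used without a BY statement, observations are paired by position, so both data sets must be in the same order.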


```sas
/* One-to-one merge (no BY statement): observations are paired by position */
data MFH;
   merge mfh survey.weights;
run;
```


The population parameters to be estimated are the proportion of the population that exhibits chronic morbidity and the mean systolic blood pressure. You can use the SURVEYMEANS procedure to estimate both proportions and means. Before you call PROC SURVEYMEANS, use the knowledge that $m = 2$ to calculate the value of $\epsilon$ as follows:

$$\epsilon = 1 - \sqrt{\frac{2-1}{2}} = 1 - \sqrt{0.5} \approx 0.29289322$$

In the PROC SURVEYMEANS statement, specify VARMETHOD=BRR and the FAY= method-option. Because both the proportion and the mean to be estimated are ratios, you must specify the two numerators, CHRON and SYSBP, and the common denominator, X, in a VAR statement and specify the form of the ratios in a RATIO statement. Use the REPWEIGHTS statement to specify the variables that contain the bootstrap weights, and use the WEIGHT statement to specify the variable that contains the sampling weights.


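```sas
proc surveymeans data=mfh varmethod=brr(fay=.29289322);
   var chron sysbp x;
   ratio chron sysbp / x;   /* CHRON/X is a proportion, SYSBP/X is a mean */
   repweights w1-w1000;     /* bootstrap replicate weights                */
   weight wgt;              /* sampling weight                            */
run;
```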
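Instead of typing the decimal value, you could compute the coefficient and pass it to the procedure through a macro variable. The following is a minimal sketch of that approach; the macro variable name FAY is an assumption made for illustration, and the PROC SURVEYMEANS step otherwise repeats the call from this example.

```sas
/* Sketch: compute the Fay coefficient from m and pass it via a macro variable */
data _null_;
   m = 2;                                 /* PSUs per stratum in the MFH design */
   fay = 1 - sqrt((m - 1) / m);
   call symputx('fay', put(fay, 10.8));   /* &fay holds the formatted value     */
run;

proc surveymeans data=mfh varmethod=brr(fay=&fay);
   var chron sysbp x;
   ratio chron sysbp / x;
   repweights w1-w1000;
   weight wgt;
run;
```

Deriving the value from $m$ keeps the coefficient tied to the design and avoids transcription errors in the decimal digits.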


Figure 1 displays the results of the analysis. The estimates of the population parameters of interest and their bootstrap standard errors are listed under the heading Ratio Analysis. The estimate of the proportion of the population that exhibits chronic morbidity is 0.398, and its standard error is 0.0108. The estimate of the population mean systolic blood pressure is 141.785, and its standard error is 0.497.

Figure 1: SURVEYMEANS Procedure Results Using Bootstrap Replicate Weights

The SURVEYMEANS Procedure

| Data Summary | |
|---|---|
| Number of Observations | 48 |
| Sum of Weights | 48 |

| Variance Estimation | |
|---|---|
| Method | BRR |
| Replicate Weights | MFH |
| Number of Replicates | 1000 |
| Fay Coefficient | 0.29289322 |

Statistics

| Variable | Label | N | Mean | Std Error of Mean | 95% CL for Mean (Lower) | 95% CL for Mean (Upper) |
|---|---|---|---|---|---|---|
| CHRON | Cluster sample sum of Chronic morbidity (CHRON) | 48 | 22.354167 | 0.819272 | 20.74648 | 23.96186 |
| SYSBP | Cluster sample sum for Systolic blood pressure (SYSBP) | 48 | 7972.458333 | 142.064951 | 7693.67873 | 8251.23794 |
| X | Number of sample elements in the clusters | 48 | 56.229167 | 1.002902 | 54.26113 | 58.19720 |

Ratio Analysis

| Numerator | Denominator | N | Ratio | Std Err | 95% CL for Ratio (Lower) | 95% CL for Ratio (Upper) |
|---|---|---|---|---|---|---|
| CHRON | X | 48 | 0.397555 | 0.010810 | 0.376342 | 0.418767 |
| SYSBP | X | 48 | 141.785106 | 0.497474 | 140.808894 | 142.761317 |
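To see what the procedure does with the replicate weights, you can recompute the bootstrap standard error of the ratio CHRON/X directly from the formula with the multiplier $m/(m-1)$ and $m = 2$. The following DATA step is a hedged sketch, not part of the original program. It assumes that MFH already contains WGT and the replicate weights w1-w1000, and that each replicate weight is used as the complete weight for its replicate (which makes no difference here because WGT is 1).

```sas
/* Sketch: recompute the bootstrap standard error of CHRON/X by hand */
data _null_;
   set mfh end=last;
   array w{1000} w1-w1000;
   array num{1000} _temporary_ (1000*0);   /* replicate numerator totals   */
   array den{1000} _temporary_ (1000*0);   /* replicate denominator totals */

   full_num + wgt * chron;                 /* full-sample weighted totals  */
   full_den + wgt * x;
   do r = 1 to 1000;
      num{r} = num{r} + w{r} * chron;
      den{r} = den{r} + w{r} * x;
   end;

   if last then do;
      m = 2;                               /* PSUs per stratum             */
      theta = full_num / full_den;         /* full-sample ratio estimate   */
      ss = 0;
      do r = 1 to 1000;
         ss = ss + (num{r} / den{r} - theta)**2;
      end;
      se_boot = sqrt((m / (m - 1)) * ss / 1000);
      put theta= 8.6 se_boot= 8.6;
   end;
run;
```

Under these assumptions, the values written to the log should be comparable to the Ratio and Std Err entries for CHRON in the Ratio Analysis table.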

A variant of the bootstrap is the mean bootstrap. To compute mean bootstrap weights, you compute regular bootstrap weights and average the weights in groups of size $C$. Thus, if you generate $K$ bootstrap weights and average them in groups of size $C$, you are left with $R = K/C$ mean bootstrap replicates. The mean bootstrap variance estimator has the following form:

$$\hat{V}_{\mathrm{mboot}}(\hat{\theta}) = \frac{m}{m-1} \cdot \frac{C}{R} \sum_{r=1}^{R} \left(\hat{\theta}_r - \hat{\theta}\right)^2$$

The procedure for using mean bootstrap weights with the SAS/STAT survey procedures is the same as demonstrated in the preceding example, except that $\epsilon$ is now computed as follows:

$$\epsilon = 1 - \sqrt{\frac{m-1}{mC}}$$
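The following DATA step is a minimal sketch of that calculation, assuming the mean bootstrap coefficient is $\epsilon = 1 - \sqrt{(m-1)/(mC)}$. The group sizes in the DO list and the data set name FAY_MEAN_BOOT are hypothetical values chosen only for illustration.

```sas
/* Sketch: Fay coefficient for mean bootstrap weights (values of C are hypothetical) */
data fay_mean_boot;
   m = 2;                                 /* constant number of PSUs per stratum */
   do C = 1, 5, 10, 25;                   /* group sizes used for averaging      */
      fay = 1 - sqrt((m - 1) / (m * C));  /* value to pass to FAY=               */
      output;
   end;
run;

proc print data=fay_mean_boot noobs;
   format fay 10.8;
run;
```

With $C = 1$ the expression reduces to the coefficient for the regular bootstrap, which gives a simple sanity check on the formula.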

Consistency of the variance estimator for both the naive bootstrap and the mean bootstrap requires that the number of PSUs per stratum be constant so that the squared deviations can be properly scaled. However, Rao and Wu (1988) show that when the number of PSUs per stratum varies, alternative methods exist for computing bootstrap weights. The squared deviations of the variance estimator must still be properly scaled, but the scaling factor is no longer the single constant $m/(m-1)$. Therefore, when the sampling design includes a variable number of PSUs per stratum, the scaling factor must be applied directly to the bootstrap weights. If the scaling factors are included in the bootstrap replicate weights, you can still use the SAS/STAT survey procedures to estimate population parameters and their variances by using the BRR method (VARMETHOD=BRR), but you no longer need to specify the FAY= method-option.
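For that case, the call might look like the following sketch. The data set name MFH_SCALED and the replicate weight names rw1-rw1000 are hypothetical; the sketch assumes that the scaling factors have already been built into those weights, so the FAY= method-option is omitted.

```sas
/* Sketch: rescaled bootstrap weights, so no FAY= method-option is specified */
/* MFH_SCALED and rw1-rw1000 are hypothetical names                          */
proc surveymeans data=mfh_scaled varmethod=brr;
   var chron sysbp x;
   ratio chron sysbp / x;
   repweights rw1-rw1000;   /* replicate weights with scaling already applied */
   weight wgt;
run;
```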

## References

• Lehtonen, R. and Pahkinen, E. (2004), Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition, Chichester, UK: John Wiley & Sons.

• Rao, J. N. K. and Wu, C. F. J. (1988), “Resampling Inference with Complex Survey Data,” Journal of the American Statistical Association, 83, 231–241.