Stratified Sampling

Suppose that the sample of students described in the previous section was actually selected by using stratified random sampling. In stratified sampling, the study population is divided into nonoverlapping strata, and samples are selected from each stratum independently.

The list of students in this junior high school was stratified by grade, yielding three strata: grades 7, 8, and 9. A simple random sample of students was selected from each grade. Table 88.1 shows the total number of students in each grade.

Table 88.1 Number of Students by Grade

Grade    Number of Students
7                     1,824
8                     1,025
9                     1,151
Total                 4,000

To analyze this stratified sample, you need to provide the population totals for each stratum to PROC SURVEYMEANS. The SAS data set StudentTotals contains the information from Table 88.1:

data StudentTotals;
   input Grade _total_; 
   datalines;
7 1824
8 1025
9 1151
;

The variable Grade is the stratum identification variable, and the variable _TOTAL_ contains the total number of students for each stratum. PROC SURVEYMEANS requires you to use the variable name _TOTAL_ for the stratum population totals.

The procedure uses the stratum population totals to adjust variance estimates for the effects of sampling from a finite population. If you do not provide population totals or sampling rates, then the procedure assumes that the proportion of the population in the sample is very small, and the computation does not involve a finite population correction.

In a stratified sample design, when the sampling rates in the strata are unequal, you need to use sampling weights to reflect this information in order to produce an unbiased mean estimator. In this example, the appropriate sampling weights are reciprocals of the probabilities of selection. You can use the following DATA step to create the sampling weights:

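data IceCream;
   set IceCream;
   /* Selection probability = stratum sample size / stratum population total */
   if Grade=7 then Prob=20/1824;
   else if Grade=8 then Prob=9/1025;
   else if Grade=9 then Prob=11/1151;
   /* Sampling weight is the reciprocal of the selection probability */
   Weight=1/Prob;
run;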
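As a quick sanity check on these weights, the weights within each grade should sum to that grade's population total in Table 88.1, because each weight equals the stratum total divided by the stratum sample size. The following is a minimal sketch that uses PROC MEANS for this check; it assumes that the IceCream data set already contains the Weight variable created above.

```sas
/* Sum the sampling weights within each grade.                   */
/* If the weights are correct, the sums should match the stratum */
/* population totals (1824, 1025, and 1151).                     */
proc means data=IceCream n sum maxdec=1;
   class Grade;
   var Weight;
run;
```

A mismatch between these sums and the totals in StudentTotals would suggest that a selection probability was entered incorrectly.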

When you use PROC SURVEYSELECT to select your sample, the procedure creates these sampling weights for you.
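For illustration only, a stratified simple random sample with these stratum sample sizes might be drawn as in the following sketch. The data set names StudentList (a sampling frame with one observation per student) and StudentSample, and the seed value, are hypothetical and do not come from this example.

```sas
/* StudentList is a hypothetical sampling frame; it must be */
/* sorted by the stratification variable before selection.  */
proc sort data=StudentList;
   by Grade;
run;

/* Select 20, 9, and 11 students from grades 7, 8, and 9.   */
/* The seed is arbitrary and only makes the draw repeatable. */
proc surveyselect data=StudentList out=StudentSample
                  method=srs n=(20 9 11) seed=12345;
   strata Grade;
run;
```

With a design like this, the output data set is expected to include the selection probability and sampling weight for each selected student, so a separate DATA step to compute the weights would not be needed.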

The following SAS statements perform the stratified analysis of the survey data:

title1 'Analysis of Ice Cream Spending';
title2 'Stratified Sample Design';
proc surveymeans data=IceCream total=StudentTotals;
   strata Grade / list;
   var Spending Group;
   weight Weight;
run;

The PROC SURVEYMEANS statement invokes the procedure. The DATA= option names the SAS data set IceCream as the input data set to be analyzed. The TOTAL= option names the data set StudentTotals as the input data set that contains the stratum population totals. Unlike the analysis in the section Simple Random Sampling, which specifies TOTAL=4000, this analysis specifies TOTAL=StudentTotals. In this stratified sample design, the population totals differ from stratum to stratum, so you need to provide them to PROC SURVEYMEANS in a SAS data set.

The STRATA statement identifies the stratification variable Grade. The LIST option in the STRATA statement requests that the procedure display stratum information. The WEIGHT statement tells the procedure that the variable Weight contains the sampling weights.

Figure 88.2 displays information about the input data set. There are three strata in the design and 40 observations in the sample. The categorical variable Group has two levels, 'less' and 'more'.

Figure 88.3 displays information for each stratum. The table displays a stratum index and the values of the STRATA variable. The stratum index identifies each stratum by a sequentially assigned number. For each stratum, the table gives the population total (total number of students), the sampling rate, and the sample size. The stratum sampling rate is the ratio of the number of students in the sample to the number of students in the population for that stratum. The table also lists each analysis variable and the number of stratum observations for that variable. For categorical variables, the table lists each level and the number of sample observations in that level.
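The sampling rates can be reproduced by hand from the stratum totals. The following sketch divides the stratum sample sizes (20, 9, and 11, as used in the weight computation) by the totals in StudentTotals. The data set name StratumRates and the variables SampleSize, SamplingRate, and FPC are names chosen for this illustration. The FPC variable holds one minus the sampling rate, which is the usual textbook form of the finite population correction factor; it is shown as an assumption about what the correction looks like, not as the procedure's internal computation.

```sas
data StratumRates;
   set StudentTotals;
   /* Stratum sample sizes from the sample design */
   select (Grade);
      when (7) SampleSize = 20;
      when (8) SampleSize = 9;
      when (9) SampleSize = 11;
      otherwise SampleSize = .;
   end;
   /* Sampling rate = sample size / population total */
   SamplingRate = SampleSize / _total_;
   /* Textbook finite population correction factor (assumption) */
   FPC = 1 - SamplingRate;
   format SamplingRate percent8.2;
run;

proc print data=StratumRates noobs;
run;
```

Because every sampling rate here is close to 1%, each correction factor is close to 1 and has only a small effect on the variance estimates.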

Figure 88.2 Data Summary

Analysis of Ice Cream Spending
Stratified Sample Design

The SURVEYMEANS Procedure

Data Summary
Number of Strata             3
Number of Observations      40
Sum of Weights            4000

Class Level Information
Class Variable   Levels   Values
Group                 2   less more

Figure 88.3 Stratum Information

Stratum Information
Stratum           Population   Sampling
  Index   Grade        Total       Rate   N Obs   Variable   Level    N
      1       7         1824      1.10%      20   Spending           20
                                                  Group      less    17
                                                             more     3
      2       8         1025      0.88%       9   Spending            9
                                                  Group      less     0
                                                             more     9
      3       9         1151      0.96%      11   Spending           11
                                                  Group      less     6
                                                             more     5

Figure 88.4 shows the following:

  • The estimate of average weekly ice cream expense is $9.14 for students in this school, with a standard error of $0.53, and a 95% confidence interval from $8.06 to $10.22.

  • An estimated 54.5% of all students spend less than $10 weekly on ice cream, and an estimated 45.5% spend more. Each of these estimates has a standard error of 5.8%.

Figure 88.4 Analysis of Ice Cream Spending

Statistics
Variable   Level    N       Mean   Std Error of Mean       95% CL for Mean
Spending           40   9.141298            0.531799   8.06377052   10.2188254
Group      less    23   0.544555            0.058424   0.42617678    0.6629323
           more    17   0.455445            0.058424   0.33706769    0.5738232
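The reported estimates can be cross-checked with a few lines of arithmetic. The following sketch recomputes the weighted proportion for the 'less' level from the stratum counts in Figure 88.3, and the 95% confidence limits for mean Spending from the mean and standard error in Figure 88.4. It assumes that the degrees of freedom equal the number of observations minus the number of strata (40 - 3 = 37); the variable names are chosen for this illustration.

```sas
data _null_;
   /* Weighted proportion of 'less': stratum counts from Figure 88.3, */
   /* each multiplied by its stratum weight (population / sample).    */
   PropLess = (17*(1824/20) + 0*(1025/9) + 6*(1151/11)) / 4000;

   /* 95% confidence limits for mean Spending.            */
   /* Assumption: DF = observations minus number of strata */
   DF    = 40 - 3;
   T     = quantile('T', 0.975, DF);
   Lower = 9.141298 - T*0.531799;
   Upper = 9.141298 + T*0.531799;

   /* Write the results to the SAS log */
   put PropLess= DF= T= Lower= Upper=;
run;
```

The values written to the log should agree with the Group 'less' estimate and the Spending confidence limits in Figure 88.4, up to rounding.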