Example 94.2 Cluster Sampling

This example illustrates the use of regression analysis in a simple random cluster sample design. The data are from Särndal, Swensson, and Wretman (1992, p. 652). A total of 284 Swedish municipalities are grouped into 50 clusters of neighboring municipalities. Five clusters with a total of 32 municipalities are randomly selected. The results of a regression analysis that accounts for the clusters in the sample design are compared with the results of a regression analysis that ignores the clusters. Both analyses investigate the linear relationship between the municipality populations in 1975 and in 1985.

The 32 selected municipalities in the sample are saved in the data set Municipalities:

data Municipalities;
input Municipality Cluster Population85 Population75;
datalines;
205   37    5    5
206   37   11   11
207   37   13   13
208   37    8    8
209   37   17   19
6    2   16   15
7    2   70   62
8    2   66   54
9    2   12   12
10    2   60   50
94   17    7    7
95   17   16   16
96   17   13   11
97   17   12   11
98   17   70   67
99   17   20   20
100   17   31   28
101   17   49   48
276   50    6    7
277   50    9   10
278   50   24   26
279   50   10    9
280   50   67   64
281   50   39   35
282   50   29   27
283   50   10    9
284   50   27   31
52   10    7    6
53   10    9    8
54   10   28   27
55   10   12   11
56   10  107  108
;
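
Before fitting the model, it can be useful to confirm that the data set matches the design described above (five clusters, 32 municipalities). The following step is a minimal sketch of such a check and is not part of the original example; it only tabulates the Cluster column of the Municipalities data set:

```sas
/* Illustrative check: count the sampled municipalities in each cluster. */
/* The one-way table should list five clusters and 32 observations.      */
proc freq data=Municipalities;
   tables Cluster / nocum;
run;
```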

The variable Municipality identifies the municipalities in the sample; the variable Cluster indicates the cluster to which a municipality belongs; and the variables Population85 and Population75 contain the municipality populations in 1985 and in 1975 (in thousands), respectively. A regression analysis is performed by PROC SURVEYREG with a CLUSTER statement:

title1 'Regression Analysis for Swedish Municipalities';
title2 'Cluster Sampling';
proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75;
run;

The TOTAL=50 option specifies the total number of clusters in the sampling frame.
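
PROC SURVEYREG also accepts a RATE= option, which supplies the sampling rate instead of the population total. The following statements are a sketch under the assumption that the first-stage sampling rate is 5/50 = 0.1 (five clusters selected out of 50); they should lead to the same finite population correction as TOTAL=50, although this variant is not part of the original example:

```sas
/* Sketch: express the design with a sampling rate instead of a total.   */
/* 5 sampled clusters out of 50 in the frame gives a rate of 0.1.        */
title1 'Regression Analysis for Swedish Municipalities';
title2 'Cluster Sampling (RATE= instead of TOTAL=)';
proc surveyreg data=Municipalities rate=0.1;
   cluster Cluster;
   model Population85=Population75;
run;
```

Specify either TOTAL= or RATE=, whichever is more natural for the information you have about the sampling frame.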

Output 94.2.1 displays the data and design summary. Since the sample design includes clusters, the procedure displays the total number of clusters in the sample in the Design Summary table.

Output 94.2.1: Regression Analysis for Cluster Sampling

Regression Analysis for Swedish Municipalities
Cluster Sampling

The SURVEYREG Procedure

Regression Analysis for Dependent Variable Population85

Data Summary
Number of Observations 32
Mean of Population85 27.50000
Sum of Population85 880.00000

Design Summary
Number of Clusters 5

Output 94.2.2 displays the fit statistics and regression coefficient estimates. In the Estimated Regression Coefficients table, the estimated slope for the linear relationship is 1.05, which is significant at the 5% level; the intercept is not significant. This suggests that a regression line passing through the origin can be established between the populations in 1975 and in 1985.

Output 94.2.2: Regression Analysis for Cluster Sampling

Fit Statistics
R-square 0.9860
Root MSE 3.0488
Denominator DF 4

Estimated Regression Coefficients
Parameter        Estimate   Standard Error   t Value   Pr > |t|
Intercept      -0.0191292       0.89204053     -0.02     0.9839
Population75    1.0546253       0.05167565     20.41     <.0001

 Note: The denominator degrees of freedom for the t tests is 4.
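
The t value and p-value for Population75 in Output 94.2.2 can be checked by hand. The following DATA step is a minimal sketch, assuming that with no STRATA statement the denominator degrees of freedom equal the number of clusters minus one (5 - 1 = 4), which agrees with the note above. The variable names are illustrative only:

```sas
/* Illustrative check of the Population75 row in Output 94.2.2. */
data _null_;
   estimate = 1.0546253;            /* slope estimate from the output     */
   se       = 0.05167565;           /* its estimated standard error       */
   df       = 5 - 1;                /* clusters minus 1 (assumed rule)    */
   t = estimate / se;               /* t statistic                        */
   p = 2 * (1 - probt(abs(t), df)); /* two-sided p-value from t(df)       */
   put t= 8.2 p= pvalue6.4;         /* write both values to the SAS log   */
run;
```

The values written to the log should agree with the t Value and Pr > |t| columns of the table.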

The CLUSTER statement is necessary in PROC SURVEYREG in order to incorporate the sample design. If you do not specify a CLUSTER statement in the regression analysis, as in the following statements, the standard errors of the regression coefficients are incorrectly estimated.

title1 'Regression Analysis for Swedish Municipalities';
title2 'Simple Random Sampling';
proc surveyreg data=Municipalities total=284;
   model Population85=Population75;
run;

This analysis ignores the clusters in the sample and assumes that the sample design is simple random sampling. Therefore, the TOTAL= option specifies the total number of municipalities, which is 284.

Output 94.2.3 displays the regression results when the clusters are ignored. The regression coefficient estimates are the same as those in Output 94.2.2, but the variance estimates are smaller. When the clusters are used in the analysis, the estimated regression coefficient for the effect Population75 is 1.05 with an estimated standard error of 0.05 (Output 94.2.2); when the clusters are ignored, the estimate is still 1.05 but the estimated standard error is 0.04 (Output 94.2.3). To estimate the variance of the regression coefficients correctly, you should include the clustering information in the regression analysis.
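
To see how much the standard error changes, you can compare the unrounded values reported in Output 94.2.2 and Output 94.2.3. The following DATA step is an illustrative sketch with hypothetical variable names. The ratio it computes is only a rough indication of the effect of ignoring the clusters, not a formal design effect, because the two analyses also use different TOTAL= values:

```sas
/* Illustrative comparison of the Population75 standard errors. */
data _null_;
   se_cluster = 0.05167565;          /* with CLUSTER statement (Output 94.2.2) */
   se_srs     = 0.03668414;          /* clusters ignored (Output 94.2.3)       */
   se_ratio   = se_cluster / se_srs; /* ratio of standard errors               */
   var_ratio  = se_ratio**2;         /* ratio of variance estimates            */
   put se_ratio= 6.3 var_ratio= 6.3; /* write both ratios to the SAS log       */
run;
```

The standard error from the clustered analysis is roughly 1.4 times the one obtained when the clusters are ignored, so the corresponding variance estimate is about twice as large.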

Output 94.2.3: Regression Analysis for Simple Random Sampling

Regression Analysis for Swedish Municipalities
Simple Random Sampling

The SURVEYREG Procedure

Regression Analysis for Dependent Variable Population85

Data Summary
Number of Observations 32
Mean of Population85 27.50000
Sum of Population85 880.00000

Fit Statistics
R-square 0.9860
Root MSE 3.0488
Denominator DF 31

Estimated Regression Coefficients
Parameter        Estimate   Standard Error   t Value   Pr > |t|
Intercept      -0.0191292       0.67417606     -0.03     0.9775
Population75    1.0546253       0.03668414     28.75     <.0001

 Note: The denominator degrees of freedom for the t tests is 31.