This example illustrates the use of regression analysis in a simple random cluster sample design. The data are from Särndal, Swensson, and Wretman (1992, p. 652). A total of 284 Swedish municipalities are grouped into 50 clusters of neighboring municipalities. Five clusters with a total of 32 municipalities are randomly selected. The results from the regression analysis in which clusters are used in the sample design are compared to the results of a regression analysis that ignores the clusters. The linear relationship between the population in 1975 and in 1985 is investigated.
The 32 selected municipalities in the sample are saved in the data set Municipalities:
   data Municipalities;
      input Municipality Cluster Population85 Population75;
      datalines;
   205 37   5   5
   206 37  11  11
   207 37  13  13
   208 37   8   8
   209 37  17  19
     6  2  16  15
     7  2  70  62
     8  2  66  54
     9  2  12  12
    10  2  60  50
    94 17   7   7
    95 17  16  16
    96 17  13  11
    97 17  12  11
    98 17  70  67
    99 17  20  20
   100 17  31  28
   101 17  49  48
   276 50   6   7
   277 50   9  10
   278 50  24  26
   279 50  10   9
   280 50  67  64
   281 50  39  35
   282 50  29  27
   283 50  10   9
   284 50  27  31
    52 10   7   6
    53 10   9   8
    54 10  28  27
    55 10  12  11
    56 10 107 108
   ;
The variable Municipality identifies the municipalities in the sample; the variable Cluster indicates the cluster to which a municipality belongs; and the variables Population85 and Population75 contain the municipality populations in 1985 and in 1975 (in thousands), respectively. A regression analysis is performed by PROC SURVEYREG with a CLUSTER statement:
   title1 'Regression Analysis for Swedish Municipalities';
   title2 'Cluster Sampling';
   proc surveyreg data=Municipalities total=50;
      cluster Cluster;
      model Population85=Population75;
   run;
The TOTAL=50 option specifies the total number of clusters in the sampling frame.
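Because 5 of the 50 clusters are sampled, the same finite population correction can presumably be requested through the sampling rate instead of the population total. The following statements are a minimal sketch of that alternative; they assume that the RATE= option of the PROC SURVEYREG statement is interpreted at the cluster level when a CLUSTER statement is present, so that RATE=0.1 corresponds to TOTAL=50 for this sample.

```sas
/* Sketch: RATE=0.1 is assumed to be equivalent to TOTAL=50 here, */
/* because 5 of the 50 clusters were sampled (5/50 = 0.1).        */
proc surveyreg data=Municipalities rate=0.1;
   cluster Cluster;
   model Population85=Population75;
run;
```

Specify either TOTAL= or RATE=, not both.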
Output 101.2.1 displays the data and design summary. Since the sample design includes clusters, the procedure displays the total number of clusters in the sample in the "Design Summary" table.
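One way to cross-check the cluster and observation counts against the input data is a simple frequency table of the Cluster variable. The following statements are an illustrative check, not part of the original example. From the data listing, clusters 2, 10, and 37 should each contain 5 municipalities, cluster 17 should contain 8, and cluster 50 should contain 9, for a total of 32 municipalities in 5 clusters.

```sas
/* Illustrative check: count the sampled municipalities in each cluster. */
/* The number of rows in the table is the number of sampled clusters.    */
proc freq data=Municipalities;
   tables Cluster / nocum nopercent;
run;
```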
Output 101.2.2 displays the fit statistics and regression coefficient estimates. In the "Estimated Regression Coefficients" table, the estimated slope for the linear relationship is 1.05, which is significant at the 5% level, but the intercept is not significant. This suggests that a regression line passing through the origin can be established between populations in 1975 and in 1985.
The CLUSTER statement is necessary in PROC SURVEYREG in order to incorporate the sample design. If you do not specify a CLUSTER statement in the regression analysis, as in the following statements, the standard errors of the regression coefficients are incorrectly estimated.
   title1 'Regression Analysis for Swedish Municipalities';
   title2 'Simple Random Sampling';
   proc surveyreg data=Municipalities total=284;
      model Population85=Population75;
   run;
This analysis ignores the clusters in the sample and treats the sample design as simple random sampling of municipalities. Therefore, the TOTAL= option specifies the total number of municipalities, which is 284.
Output 101.2.3 displays the regression results ignoring the clusters. Compared to the results in Output 101.2.2, the regression coefficient estimates are the same. However, when the clusters are ignored, the regression coefficients have smaller variance estimates. When clusters are used in the analysis, the estimated regression coefficient for the effect Population75 is 1.05, with an estimated standard error of 0.05, as displayed in Output 101.2.2; when the clusters are ignored, the estimate is still 1.05, but the estimated standard error is 0.04, as displayed in Output 101.2.3. To estimate the variance of the regression coefficients correctly, you should include the clustering information in the regression analysis.
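To see how much the clustering changes the variance estimates without running two separate analyses, you can request design effects in the clustered analysis. The following statements are a sketch that assumes the DEFF option in the MODEL statement, which reports for each coefficient the ratio of the design-based variance to the variance under simple random sampling. As a rough check using only the rounded standard errors quoted above, the ratio for Population75 would be about (0.05/0.04)**2, or roughly 1.6; the value that the procedure reports can differ because it is computed from unrounded quantities.

```sas
/* Sketch: request design effects for the coefficient estimates.  */
/* A design effect greater than 1 indicates that the cluster      */
/* design yields a larger variance than simple random sampling.   */
proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75 / deff;
run;
```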