24555 - Using PROC SURVEYSELECT for single-stage cluster sampling

Usage Note 24555: Using PROC SURVEYSELECT for single-stage cluster sampling

Background

Cluster sampling selects groups (clusters) as the sampling units, where each cluster consists of one or more subunits. Often a listing of clusters is available, but a complete listing of the subunits or observations within the clusters is not. In that case, clusters can be sampled first and the subunits enumerated later for data collection or further subsampling. Even when an enumerated list of all subunits is available, other constraints can make it impractical to collect data from units selected at random from the entire population, so cluster sampling is used instead. (Note that when a listing of all subunits is available, estimates based on a random sample from the entire population are often more precise than those obtained from a cluster sample of the same size. This is because units within a cluster tend to be more alike than units in different clusters.)
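The loss of precision mentioned above is commonly summarized by the design effect. The following is a textbook approximation and is not part of the original note. It assumes clusters of equal size m and an intraclass correlation rho among units in the same cluster; both symbols are introduced here for illustration only.

```latex
% Approximate design effect for single-stage cluster sampling,
% assuming equal cluster sizes m and intraclass correlation \rho:
\mathrm{deff} \;\approx\; 1 + (m - 1)\,\rho
% Illustration with assumed values m = 5 and \rho = 0.1:
\mathrm{deff} \;\approx\; 1 + (5 - 1)(0.1) = 1.4
```

Under these assumed values, the variance of an estimate from the cluster sample would be roughly 1.4 times the variance from a simple random sample with the same number of units. When rho is near zero, the two designs are about equally precise.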

SAS/STAT® 9.22 in SAS® 9.2 TS2M3

Beginning with SAS/STAT 9.22 in SAS 9.2 TS2M3, use the SAMPLINGUNIT or CLUSTER statement to name variables that identify the sampling units as groups of observations (clusters).

For example, suppose you have 10 different clusters with one to five people per cluster.

data A;
   do ClusterID=1 to 10;
      do i=1 to 1+int(5*ranuni(34920));
         if i=1 then PersonID=0;
         PersonID+1;
         output;
      end;
   end;
   drop i;
run;

These statements select a simple random sample of three clusters without replacement:

proc surveyselect data=a out=sample method=srs sampsize=3
                  seed=377183 noprint;
   samplingunit ClusterID;
run;
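One way to confirm the result is to tabulate the output data set by cluster. The following is a minimal sketch, not part of the original note. It assumes that the SAMPLE data set created by the PROC SURVEYSELECT step above exists and contains the ClusterID variable.

```sas
/* List each selected cluster and the number of people in it.      */
/* Three ClusterID values are expected, one for each sampled       */
/* cluster, and every person in those clusters should be counted.  */
proc freq data=sample;
   tables ClusterID / nocum nopercent;
run;
```

If more or fewer than three ClusterID values appear, check that the SAMPLINGUNIT statement names the correct cluster variable.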

Prior to SAS/STAT 9.22 in SAS 9.2 TS2M3

If a listing of the entire target population is available and you want to select a cluster sample, you can use PROC SURVEYSELECT as follows in releases prior to SAS/STAT 9.22 in SAS 9.2 TS2M3. The steps are to identify the individual clusters, select a random sample of clusters, and then collect all the original observations from each sampled cluster.

Using the same 10-cluster data set as above, first identify the individual clusters:

proc freq data=A noprint;
   tables ClusterID / out=ClusterIDList(drop=count percent);
run;

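The following statements select a simple random sample, without replacement, of three of the cluster IDs. Because no SEED= option is specified, the initial seed is taken from the system clock and the selected clusters can differ from run to run; specify SEED= as in the earlier example if you need a reproducible sample.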

proc surveyselect data=ClusterIDList out=ClusterSample method=srs n=3 noprint;
run;

Collect all the observations for each sampled cluster from the original data set to create the final sample:

data Sample;
   merge ClusterSample(in=sample) A(in=all);
   by ClusterID;
   if Sample and All;
run;

The IN= data set option creates a temporary variable that indicates whether the data set contributes to the current observation. Match-merging the sample of clusters with the original data by using the MERGE and BY statements, and then subsetting with the IF statement, keeps only those ClusterID values that exist in both the sample of clusters and the original data set in the Sample data set. The BY statement requires both data sets to be sorted by ClusterID, which is already the case here.

proc print data=Sample;
run;
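The same collection step can also be written as an inner join in PROC SQL, which does not require the inputs to be sorted. This is an alternative technique, not the method shown in this note. The sketch assumes that the A and ClusterSample data sets created above exist, and the output name SampleSQL is a hypothetical name chosen for illustration.

```sas
/* Keep every observation in A whose ClusterID appears in the      */
/* sampled list of clusters. SampleSQL is a hypothetical name.     */
proc sql;
   create table SampleSQL as
   select a.*
   from A as a
        inner join ClusterSample as c
        on a.ClusterID = c.ClusterID
   order by a.ClusterID, a.PersonID;
quit;
```

Because ClusterSample holds one row per sampled cluster, the join returns each person in a sampled cluster exactly once, which should match the rows kept by the DATA step merge.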

Operating System and Release Information

Product Family	Product	System	SAS Release (Reported)	SAS Release (Fixed*)
SAS System	SAS/STAT	All	n/a	

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:	low
Topic:	Analytics ==> Survey Sampling and Analysis SAS Reference ==> Procedures ==> SURVEYSELECT

Date Modified:	2011-01-05 14:46:58
Date Created:	2007-02-12 11:13:22
