SAS Institute. The Power to Know

Sample Survey Design and Analysis

Overview

Researchers often use sample survey methodology to obtain information about a large aggregate or population by selecting and measuring a sample from that population. Due to the variability of characteristics among items in the population, researchers apply scientific sample designs in the sample selection process to reduce the risk of a distorted view of the population, and they make inferences about the population based on the information from the sample survey data. In order to make statistically valid inferences for the population, they must incorporate the sample design in the data analysis.

Traditional SAS procedures, such as the MEANS procedure and the GLM procedure, compute statistics under the assumption that the sample is drawn from an infinite population by simple random sampling. These procedures generally do not correctly estimate the variance of an estimator if they are applied to a sample drawn by a complex sample design. SAS users have requested procedures that analyze data from complex sample surveys. In response to this request, SAS/STAT software now provides the SURVEYSELECT, SURVEYMEANS, SURVEYFREQ, SURVEYREG, and SURVEYLOGISTIC procedures.

To select probability-based random samples from a study population, you can use the SURVEYSELECT procedure, which provides a variety of methods for probability sampling. To analyze sample survey data, you can use the SURVEYMEANS, SURVEYFREQ, SURVEYREG, and SURVEYLOGISTIC procedures, which incorporate the sample design into the analyses. These procedures can be used for multistage designs or for single-stage designs, with or without stratification, and with or without unequal weighting.

PROC SURVEYSELECT
  • simple random sampling
  • unrestricted random sampling (with replacement)
  • systematic
  • sequential
  • selection probability proportional to size (PPS) with and without replacement
  • PPS systematic
  • PPS for two units per stratum
  • sequential PPS with minimum replacement
PROC SURVEYMEANS
  • estimates of population means and totals
  • estimates of population proportions
  • standard errors
  • confidence limits
  • hypothesis tests (t tests)
  • domain analysis
  • ratio estimates
PROC SURVEYFREQ
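  • one-way to n-way frequency and crosstabulation tables
  • estimates of population totals
  • estimates of population proportions
  • standard errors
  • confidence limits
  • coefficients of variation and design effects
  • goodness-of-fit tests (Rao-Scott chi-square)
  • tests of independence (Rao-Scott and Wald chi-square)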
PROC SURVEYREG
  • linear regression model fitting
  • estimates of regression coefficients
  • estimates of covariance matrices
  • significance tests
  • confidence limits
  • estimable functions
  • estimates and standard errors for contrasts
PROC SURVEYLOGISTIC
  • cumulative logit regression model fitting
  • logit, complementary log-log and probit link functions
  • generalized logit regression model fitting
  • estimates of regression coefficients
  • estimates of covariance matrices
  • hypothesis tests
  • model diagnostics
  • estimates of odds ratios
  • confidence limits
  • estimable functions
  • estimates and standard errors for contrasts

The SURVEYSELECT Procedure

The SURVEYSELECT procedure provides a variety of methods for selecting probability-based random samples. The procedure can select a simple random sample or can sample according to a complex multistage sample design that includes stratification, clustering, and unequal probabilities of selection. With probability sampling, each unit in the survey population has a known, positive probability of selection. This property of probability sampling avoids selection bias and enables you to use statistical theory to make valid inferences from the sample to the survey population.

In addition to methods for equal probability selection, PROC SURVEYSELECT also provides probability proportional to size (PPS) selection methods. In PPS sampling, a unit's selection probability is proportional to its size measure. PPS sampling is often used in cluster sampling, where you select clusters (groups of sampling units) of varying sizes in the first stage of selection. Available PPS methods include without replacement, with replacement, systematic, and sequential with minimum replacement. The procedure can apply these methods for stratified and replicated sample designs.

To select a sample with PROC SURVEYSELECT, you input a SAS data set that contains the sampling frame, that is, the list of units from which the sample will be selected. You also specify the selection method, the desired sample size or sampling rate, and other parameters. The SURVEYSELECT procedure selects the sample and produces an output data set that contains the selected units, their selection probabilities, and their sampling weights.

The SURVEYSELECT procedure uses fast, efficient algorithms for these sample selection methods. Thus, it performs well even for the very large input data sets or sampling frames that can occur in large-scale sample surveys.

The following statements illustrate the syntax of PROC SURVEYSELECT:

 
proc surveyselect data=Customers
   method=srs n=15
   seed=1953 out=SampleStrata;
   strata State Type;
run;
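
The PPS selection described earlier uses the same statements plus a SIZE statement that names the size measure. The following is a minimal sketch rather than an example from the documentation: the frame data set HospitalFrame, the stratum variable Region, and the size measure Beds are hypothetical names chosen for illustration.

```sas
/* Sketch only: HospitalFrame, Region, and Beds are hypothetical names. */
/* Stratified selection expects the frame to be sorted by the strata.   */
proc sort data=HospitalFrame;
   by Region;
run;

/* Select 4 hospitals per region, without replacement, with             */
/* probability proportional to the size measure Beds.                   */
proc surveyselect data=HospitalFrame
   method=pps n=4
   seed=1953 out=SampleHospitals;
   strata Region;
   size Beds;
run;
```

With PPS selection without replacement, no unit's size measure can exceed its stratum total divided by the stratum sample size, so a frame with a few very large units may call for a different PPS method.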

The SURVEYMEANS Procedure

The SURVEYMEANS procedure provides estimates of population means and population totals from sample survey data. The sample design can be a complex sample design with stratification, clustering, and unequal weighting. PROC SURVEYMEANS also provides domain analysis (subgroup or subpopulation analysis). The procedure provides estimates of design-based variances for the quantities of interest using the Taylor series expansion method.

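You can use the SURVEYMEANS procedure to compute estimates of population means, totals, proportions, and ratios, together with their standard errors, confidence limits, and t tests.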

The following statements illustrate the syntax of PROC SURVEYMEANS:

proc surveymeans data=Company
   mean sum;
   var Asset Sale Value Profit;
   weight Weight;
run;
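
Domain analysis and ratio estimation are requested with the DOMAIN and RATIO statements. The sketch below reuses the Company data set and the Weight, Sale, and Profit variables from the statements above; the stratum variable Region and the domain variable Industry are hypothetical and are included only to show where design and subgroup information would go.

```sas
/* Sketch only: Region and Industry are hypothetical variables. */
proc surveymeans data=Company
   mean clm sum;
   strata Region;           /* first-stage strata                  */
   var Sale Profit;
   ratio Profit / Sale;     /* estimated ratio of Profit to Sale   */
   domain Industry;         /* separate estimates for each subgroup */
   weight Weight;
run;
```

Using a DOMAIN statement, rather than subsetting the data with a WHERE clause or BY statement, keeps the full sample design available for variance estimation.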

The SURVEYFREQ Procedure

The SURVEYFREQ procedure produces one-way to n-way frequency and crosstabulation tables from sample survey data. These tables include estimates of population totals, population proportions (overall proportions, and also row and column proportions), and corresponding standard errors. Confidence limits, coefficients of variation, and design effects are also available. The procedure provides a variety of options to customize your table display.

For one-way frequency tables, PROC SURVEYFREQ provides Rao-Scott chi-square goodness-of-fit tests, which are adjusted for the sample design. You can test a null hypothesis of equal proportions for a one-way frequency table, or you can input other null hypothesis proportions for the test. For two-way frequency tables, PROC SURVEYFREQ provides design-based tests of independence, or no association, between the row and column variables. These tests include the Rao-Scott chi-square test, the Rao-Scott likelihood-ratio test, the Wald chi-square test, and the Wald loglinear chi-square test.

The following statements illustrate the syntax of PROC SURVEYFREQ:

proc surveyfreq data=SIS_Survey;
   tables Response;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;
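
A design-based test of independence for a two-way table can be requested with an option in the TABLES statement. The following sketch keeps the design statements shown above and assumes that SIS_Survey also contains a categorical variable named SchoolType, which is an assumption made for illustration.

```sas
/* Sketch only: SchoolType is an assumed variable in SIS_Survey. */
proc surveyfreq data=SIS_Survey;
   tables SchoolType*Response / chisq;   /* Rao-Scott chi-square test */
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
run;
```

The other tests named above are requested in the same position; for example, LRCHISQ requests the Rao-Scott likelihood-ratio test, WCHISQ the Wald chi-square test, and WLLCHISQ the Wald loglinear chi-square test.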

The SURVEYREG Procedure

PROC SURVEYREG fits linear regression models for survey data and computes regression coefficients and their variance-covariance matrix. The procedure allows you to specify classification effects using the same syntax as in the GLM procedure. It also provides hypothesis tests for the model effects, for any specified estimable linear functions of the model parameters, and for custom hypotheses about linear combinations of the regression parameters. The procedure computes design-based confidence limits for the parameter estimates and for their estimable linear functions.

The following statements illustrate the syntax of PROC SURVEYREG:

proc surveyreg data=Farms
   total=TotalInStrata;
   strata State Region / list;
   model CornYield = FarmArea / covb;
   weight Weight;
run;
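
Classification effects, confidence limits, and estimable functions can be added to the same kind of model. The sketch below is illustrative only: treating Region as a classification effect and the label given in the ESTIMATE statement are assumptions, not part of the example above.

```sas
/* Sketch only: using Region as a CLASS effect is an assumption. */
proc surveyreg data=Farms
   total=TotalInStrata;
   strata State Region;
   class Region;                               /* GLM-style classification effect */
   model CornYield = FarmArea Region / solution clparm;
   estimate 'FarmArea slope' FarmArea 1;       /* a simple estimable function     */
   weight Weight;
run;
```

The SOLUTION option displays the parameter estimates when the model contains classification effects, and CLPARM requests design-based confidence limits for them.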

The SURVEYLOGISTIC Procedure

The SURVEYLOGISTIC procedure fits linear logistic regression models for discrete response survey data by the method of pseudo maximum likelihood, incorporating the sample design into the analysis. The SURVEYLOGISTIC procedure enables you to use categorical classification variables as explanatory variables, using the familiar syntax for main effects and interactions employed in the GLM and LOGISTIC procedures. The following link functions are available in PROC SURVEYLOGISTIC: the cumulative logit function (CLOGIT), the generalized logit function (GLOGIT), the probit function (PROBIT), and the complementary log-log function (CLOGLOG). Design-based variances of the estimated regression parameters and the odds ratios are estimated using a Taylor series expansion approximation.

The following statements illustrate the syntax of PROC SURVEYLOGISTIC:

proc surveylogistic data=SampleStrata;
   strata State Type / list;
   model Rating (order=internal) = Usage;
   weight SamplingWeight;
run;
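
The link function is chosen with the LINK= option in the MODEL statement, and classification variables are declared in a CLASS statement. The following sketch fits a generalized logit model to the same data; including Type as a classification effect is an assumption made for illustration.

```sas
/* Sketch only: adding Type as a CLASS effect is an illustration. */
proc surveylogistic data=SampleStrata;
   strata State Type;
   class Type;
   model Rating (order=internal) = Usage Type / link=glogit;
   weight SamplingWeight;
run;
```

Replacing GLOGIT with CLOGIT, PROBIT, or CLOGLOG selects one of the other link functions listed above.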

A Note on Variance Estimation

The survey analysis procedures use the Taylor series expansion method to estimate sampling errors of estimators based on complex sample designs. This method obtains a linear approximation of an estimator based on Taylor series expansion and then estimates the variance of this approximation. When there are clusters, or primary sampling units (PSUs), in the sample design, the procedures estimate the variance from the variation among the PSUs. When the design is stratified, the procedures pool stratum variance estimates to compute the overall variance estimate.

For a multistage sample design, the variance estimator depends only on the first stage of the sample design. Thus, the required input includes only first-stage cluster (PSU) and first-stage stratum identification. You do not need to input design information about any additional stages of sampling. This variance estimation method assumes that the first-stage sampling fraction is small or that the first-stage sample is drawn with replacement, as it often is in practice.
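
As a worked illustration of the description above, consider the estimated population total under a stratified design with clusters. The notation is introduced here for illustration and does not appear in the original text: h indexes strata, i indexes PSUs within a stratum, j indexes observations within a PSU, w_{hij} is the sampling weight, n_h is the number of sampled PSUs in stratum h, and f_h is the first-stage sampling fraction in stratum h (taken as 0 when it is assumed negligible).

```latex
\begin{align*}
\hat{Y} &= \sum_{h} \sum_{i=1}^{n_h} \sum_{j} w_{hij}\, y_{hij},
\qquad
y_{hi\cdot} = \sum_{j} w_{hij}\, y_{hij},
\qquad
\bar{y}_{h\cdot\cdot} = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi\cdot}, \\[4pt]
\widehat{V}(\hat{Y}) &= \sum_{h} \frac{n_h \,(1 - f_h)}{n_h - 1}
   \sum_{i=1}^{n_h} \left( y_{hi\cdot} - \bar{y}_{h\cdot\cdot} \right)^{2}.
\end{align*}
```

The inner sum measures the variation among weighted PSU totals within a stratum, and the outer sum pools the stratum contributions. For a nonlinear estimator such as a mean or a ratio, the same form is applied to the linearized values obtained from the Taylor series expansion.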

Updates in Release 9.1.3

Release 9.1.3 includes two new procedures, PROC SURVEYFREQ and PROC SURVEYLOGISTIC, for the analysis of survey data. These two procedures provide more functionality to analyze categorical responses for survey data.

Documentation

For more information, refer to the chapters "Introduction to Survey Procedures", "The SURVEYSELECT Procedure", "The SURVEYMEANS Procedure", "The SURVEYFREQ Procedure", "The SURVEYREG Procedure", and "The SURVEYLOGISTIC Procedure" in the SAS/STAT User's Guide, Version 9.1.3.

