This chapter introduces the SAS/STAT procedures for survey sampling and describes how you can use these procedures to analyze survey data.
Researchers often use sample survey methodology to obtain information about a large population by selecting and measuring a sample from that population. Because of variability among items, researchers apply probability-based scientific designs to select the sample. This reduces the risk of a distorted view of the population and enables statistically valid inferences to be made from the sample. For more information about statistical sampling and analysis of complex survey data, see Lohr (2010), Kalton (1983), Cochran (1977), and Kish (1965).
To select probability-based random samples from a study population, you can use the SURVEYSELECT procedure, which provides a variety of methods of probability sampling. To perform imputation of missing values in survey data, you can use the SURVEYIMPUTE procedure, which provides methods of donor-based imputation. To analyze sample survey data, you can use the SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, and SURVEYPHREG procedures, which incorporate the sample design into the analyses.
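As a minimal sketch of the selection step, the following statements draw a stratified simple random sample with PROC SURVEYSELECT. The data set names Frame and Sample, the stratum variable State, the sample size, and the seed are hypothetical and chosen only for illustration.

```sas
/* Assumption: Frame is a hypothetical sampling frame with one row per
   sampling unit and a stratum variable named State. */
proc sort data=Frame;
   by State;                      /* STRATA processing expects sorted input */
run;

proc surveyselect data=Frame out=Sample
                  method=srs      /* simple random sampling without replacement */
                  sampsize=50     /* assumes every stratum has at least 50 units */
                  seed=20240101;  /* fixed seed so that the selection is reproducible */
   strata State;
run;
```

For a stratified design such as this one, the OUT= data set should contain the selected units together with selection probabilities and sampling weights, which the survey analysis procedures can then use.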
Many SAS/STAT procedures, such as the MEANS, FREQ, GLM, LOGISTIC, and PHREG procedures, can compute sample means, produce crosstabulation tables, and estimate regression relationships. However, in most of these procedures, statistical inference is based on the assumption that the sample is drawn from an infinite population by simple random sampling. If the sample is in fact selected from a finite population by using a complex survey design, these procedures generally do not calculate the estimates and their variances according to the design actually used. Using analyses that are not appropriate for your sample design can lead to incorrect statistical inferences.
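The difference can be illustrated by running both kinds of procedure on the same data. This is a sketch under stated assumptions: Sample is a hypothetical stratified sample that contains the variables State (stratum), SamplingWeight, and Income.

```sas
/* Conventional analysis: the weights affect the estimated mean, but the
   standard error does not reflect the stratified sample design. */
proc means data=Sample mean stderr;
   weight SamplingWeight;
   var Income;
run;

/* Design-based analysis: the strata and the weights both enter the
   variance estimate. */
proc surveymeans data=Sample mean stderr clm;
   strata State;
   weight SamplingWeight;
   var Income;
run;
```

The two steps are expected to agree on the weighted mean but to report different standard errors. If stratum population totals are available, the TOTAL= option in PROC SURVEYMEANS can supply them so that a finite population correction is applied.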
The SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, and SURVEYPHREG procedures properly analyze complex survey data by taking into account the sample design. You can use these procedures for multistage or single-stage designs, with or without stratification, and with or without unequal weighting. The survey analysis procedures provide a choice of variance estimation methods, which include Taylor series linearization, balanced repeated replication (BRR), and the jackknife.
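For example, a stratified multistage design with unequal weights might be analyzed as in the following sketch. The data set HouseholdSurvey and the variables Region, PSU, FinalWeight, Expenditure, HouseholdSize, and Income are hypothetical.

```sas
/* Assumption: HouseholdSurvey is a hypothetical data set from a stratified
   cluster sample. Region identifies strata, PSU identifies the
   first-stage clusters, and FinalWeight holds the analysis weights. */
proc surveyreg data=HouseholdSurvey
               varmethod=jackknife;   /* omit to use Taylor series linearization */
   strata Region;
   cluster PSU;
   weight FinalWeight;
   model Expenditure = HouseholdSize Income;
run;
```

Omitting the VARMETHOD= option gives the default Taylor series linearization. VARMETHOD=BRR is another choice, but it is intended for designs that have two primary sampling units in each stratum.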
Table 14.1 briefly describes the SAS/STAT sampling and analysis procedures.
Table 14.1: Survey Sampling and Analysis Procedures
Procedure | Feature Group | Features
SURVEYSELECT | Selection Methods | Simple random sampling (without replacement); Unrestricted random sampling (with replacement); Systematic; Sequential; Bernoulli; Poisson; Probability proportional to size (PPS) sampling, with and without replacement; PPS systematic; PPS for two units per stratum; PPS sequential with minimum replacement
SURVEYSELECT | Allocation Methods | Proportional; Optimal; Neyman
SURVEYSELECT | Sampling Tools | Stratified sampling; Cluster sampling; Replicated sampling; Serpentine sorting; Random assignment
SURVEYIMPUTE | Imputation Methods | Single and multiple hot-deck; Approximate Bayesian bootstrap; Fully efficient fractional
SURVEYMEANS | Statistics | Means and totals; Proportions; Quantiles; Geometric means; Ratios; Standard errors; Confidence limits
SURVEYMEANS | Analyses | Hypothesis tests; Domain analysis; Poststratification
SURVEYMEANS | Graphics | Histograms; Box plots; Summary panel plots; Domain box plots
SURVEYFREQ | Tables | One-way frequency tables; Two-way and multiway crosstabulation tables; Estimates of population totals and proportions; Standard errors; Confidence limits
SURVEYFREQ | Analyses | Tests of goodness of fit; Tests of independence; Risks and risk differences; Odds ratios and relative risks; Kappa coefficients
SURVEYFREQ | Graphics | Weighted frequency and percent plots; Mosaic plots; Odds ratio, relative risk, and risk difference plots; Kappa plots
SURVEYREG | Analyses | Linear regression model fitting; Regression coefficients; Covariance matrices; Confidence limits; Hypothesis tests; Estimable functions; Contrasts; Least squares means (LS-means) of effects; Custom hypothesis tests among LS-means; Regression with constructed effects; Predicted values and residuals; Domain analysis
SURVEYREG | Graphics | Fit plots
SURVEYLOGISTIC | Analyses | Cumulative logit regression model fitting; Logit, probit, and complementary log-log link functions; Generalized logit regression model fitting; Regression coefficients; Covariance matrices; Confidence limits; Hypothesis tests; Odds ratios; Estimable functions; Contrasts; Least squares means (LS-means) of effects; Custom hypothesis tests among LS-means; Regression with constructed effects; Model diagnostics; Domain analysis
SURVEYPHREG | Analyses | Proportional hazards regression model fitting; Breslow and Efron likelihoods; Regression coefficients; Covariance matrices; Confidence limits; Hypothesis tests; Hazard ratios; Contrasts; Predicted values and standard errors; Martingale, Schoenfeld, score, and deviance residuals; Domain analysis