
This example uses PROC SURVEYMEANS to obtain poststratified totals, means, and ratios. The data are sampled from county-level data sets that are publicly available from the USDA Economic Research Service website, at http://www.ers.usda.gov/data-products/county-level-data-sets.aspx. The sample consists of the county-level information about population size, the number of individuals in the labor force, and the number of unemployed persons in the 48 contiguous states of the United States of America in 2011. The sampling frame is stratified by state, and a simple random sample of two counties per state is selected. The analysis consists of a comparison between the non-poststratified estimates and the poststratified estimates of the total and average labor force size, number of unemployed, population size, and two ratios: the unemployment rate and the labor force participation rate. Table 1 describes the contents of the sample data set `Unemployment`, and Table 2 describes the interpretation of the six levels of the National Center for Health Statistics (NCHS) urban-rural classification for each county.

Table 1: Example Data Set `Unemployment`

Variable | Description |
---|---|
FIPS | Federal information processing standards (FIPS) code for counties |
ST_FIPS | FIPS code for states |
State | Abbreviation of state name |
County | County name |
Code2006 | National Center for Health Statistics (NCHS) 2006 urban-rural classification code |
Population | Resident total population estimate as of July 1, 2011 |
LaborForce | Number of individuals in the civilian labor force in 2011 |
Unemployed | Number of unemployed individuals in 2011 |
SamplingWeight | Sampling weight generated by the SURVEYSELECT procedure |

Table 2: 2006 NCHS Urban-Rural Classification Scheme

Code | Urbanization Level | Classification Rules |
---|---|---|
1 | Large metro, central | Counties in metropolitan statistical area (MSA) with population of 1 million or more that have the following characteristics: 1) contain the entire population of the largest principal city of the MSA, or 2) are completely contained within the largest principal city of the MSA, or 3) contain at least 250,000 residents of any principal city in the MSA |
2 | Large metro, fringe | Counties in MSA with 1 million or more population that do not qualify as large central |
3 | Medium metro | Counties in MSA with 250,000–999,999 population |
4 | Small metro | Counties in MSA with 50,000–249,999 population |
5 | Micropolitan | Counties in micropolitan statistical area |
6 | Noncore | Counties not in micropolitan statistical area |

The following SAS statements create the SAS data set `Unemployment`:

    data unemployment;
       input FIPS 1-5 ST_FIPS 7-8 State $ 10-11 County $ 13-34 Code2006 35
             Population 37-45 LaborForce 46-52 Unemployed 53-58 SamplingWeight 59-64;
       datalines;
     1005  1 AL Barbour County        5     27313   9761  1110  33.5
     1019  1 AL Cherokee County       6     26094  11696  1020  33.5
     4021  4 AZ Pinal County          2    383553 139864 14466   7.5
     4027  4 AZ Yuma County           4    200374  89500 24270   7.5
     5105  5 AR Perry County          3     10384   4788   414  37.5
    ... more lines ...
    55119 55 WI Taylor County         6     20759  10406   915  36.0
    56025 56 WY Natrona County        4     76356  42907  2537  11.5
    56037 56 WY Sweetwater County     5     44078  25138  1271  11.5
    ;
    run;

You begin the comparative analysis by using PROC SURVEYMEANS as in the following statements to estimate the means, totals, and ratios of interest. The MEAN and SUM keywords in the PROC SURVEYMEANS statement request estimates of the population means and totals, respectively. The VAR statement requests estimates of the variables `LaborForce`, `Unemployed`, and `Population`. So, for example, if you specify the keyword MEAN in the PROC SURVEYMEANS statement and the variable `Unemployed` in the VAR statement, you are requesting an estimate of how many unemployed persons, on average, reside in a county. The first RATIO statement requests an estimate of the population’s unemployment rate, which is the ratio of the number of unemployed to the size of the labor force. The second RATIO statement requests an estimate of the labor force participation rate, which is the ratio of the size of the labor force to the size of the population of the county. The STRATA and WEIGHT statements identify the sampling design: the STRATA statement specifies that the strata are identified by the variable `ST_FIPS`, and the WEIGHT statement specifies that the sampling weights are contained in the variable `SamplingWeight`.

    proc surveymeans data=unemployment mean sum;
       strata st_fips;
       weight SamplingWeight;
       var LaborForce Unemployed Population;
       ratio 'Unemployment Rate' Unemployed / LaborForce;
       ratio 'Labor Force Participation Rate' LaborForce / Population;
    run;

Output 1 displays the estimated means, totals, ratios, and their standard errors. For example, the average county has an estimated 110,064 residents, of whom 53,472 are in the labor force and 4,925 are unemployed. The estimated unemployment rate is 9.2%, and the estimated labor force participation rate is 48.58%.

Output 1: Stratified Design

The SURVEYMEANS Procedure

Data Summary | |
---|---|
Number of Strata | 48 |
Number of Observations | 96 |
Sum of Weights | 3108 |

Statistics | | | | |
---|---|---|---|---|
Variable | Mean | Std Error of Mean | Sum | Std Dev |
LaborForce | 53472 | 6488.570784 | 166190527 | 20166478 |
Unemployed | 4924.943050 | 594.657745 | 15306723 | 1848196 |
Population | 110064 | 13105 | 342078597 | 40729501 |

Ratio Analysis: Unemployment Rate | | | |
---|---|---|---|
Numerator | Denominator | Ratio | Std Err |
Unemployed | LaborForce | 0.092103 | 0.003090 |

Ratio Analysis: Labor Force Participation Rate | | | |
---|---|---|---|
Numerator | Denominator | Ratio | Std Err |
LaborForce | Population | 0.485826 | 0.004186 |
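As a quick check on Output 1, each ratio estimate is the ratio of the corresponding estimated totals in the Sum column. The symbols below ($w_i$ for the sampling weight, $y_i$ and $x_i$ for the numerator and denominator variables) are introduced here for illustration only and do not appear in the original example:

$$
\hat{R} = \frac{\hat{Y}}{\hat{X}} = \frac{\sum_{i} w_i \, y_i}{\sum_{i} w_i \, x_i}
$$

$$
\frac{15{,}306{,}723}{166{,}190{,}527} \approx 0.0921
\qquad\text{and}\qquad
\frac{166{,}190{,}527}{342{,}078{,}597} \approx 0.4858
$$

Both values agree with the Ratio column for the unemployment rate and the labor force participation rate, respectively.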

In addition to the sample, the NCHS urban-rural classification code (Ingram and Franco, 2012) for each county in the sample and the total number of counties in the population that have each of the six levels of the NCHS classification are known. If the totals, means, and ratios of the variables of interest are homogeneous for counties that have the same NCHS urban-rural classification, but there is significant heterogeneity between counties whose classifications differ, then poststratifying by the NCHS urban-rural classification can potentially yield more efficient estimates.

The following SAS statements create the poststratum totals data set `Poststrata`. This data set is to be used in the PSTOTAL= option of the SURVEYMEANS procedure’s POSTSTRATA statement. A poststratum total data set must contain all the poststratification variables that are listed in the POSTSTRATA statement, and it must have a variable named `_PSTOTAL_` that contains the poststratum totals. In the `Poststrata` data set, the variable `Code2006` contains the poststratum identification code, and the variable `_PSTOTAL_` contains the total number of counties in that poststratum in 2011.

    data poststrata;
       input Code2006 _PSTOTAL_;
       datalines;
    1 62
    2 354
    3 329
    4 340
    5 688
    6 1336
    ;
    run;

Figure 1 compares the distributions of `Code2006` in the population and the weighted sample. Based on the weighted sample, counties that have values of 3 and 4 are overrepresented in the sample, and counties that have values of 5 and 6 are underrepresented in the sample. Poststratifying on `Code2006` reweights the data such that the poststratified weighted sample distribution of `Code2006` equals the population distribution.
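The reweighting can be written compactly. The following is a sketch of the standard poststratification adjustment; the notation ($w_i$, $N_c$, $\hat{N}_c$, $s_c$) is assumed here for illustration and is not taken from the original example. For a sampled county $i$ in poststratum $c$ (a level of `Code2006`), let $N_c$ be the known poststratum total (the `_PSTOTAL_` value) and let $s_c$ be the set of sampled counties in that poststratum:

$$
\tilde{w}_i = w_i \, \frac{N_c}{\hat{N}_c},
\qquad
\hat{N}_c = \sum_{j \in s_c} w_j
\qquad\Longrightarrow\qquad
\sum_{i \in s_c} \tilde{w}_i = N_c
$$

When a poststratum is overrepresented in the weighted sample ($\hat{N}_c > N_c$), the adjustment factor is less than 1 and the weights shrink; when it is underrepresented, the factor exceeds 1 and the weights grow.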

To perform a poststratified analysis, you simply add a POSTSTRATA statement to the SURVEYMEANS procedure, as in the following statements. Specifically, you designate `Code2006` as the poststratification variable, and you specify the SAS data set `Poststrata` in the PSTOTAL= option. The OUT= option saves the poststratification weights to the SAS data set `Pswgt`.

    proc surveymeans data=unemployment mean sum;
       strata st_fips;
       weight SamplingWeight;
       var LaborForce Unemployed Population;
       ratio 'Unemployment Rate' Unemployed / LaborForce;
       ratio 'Labor Force Participation Rate' LaborForce / Population;
       poststrata code2006 / pstotal=poststrata out=pswgt;
    run;

Figure 2 shows the ratios of the poststratification weights to the original sampling weights for each category of `Code2006`. Poststratification reduces the weights for counties that have `Code2006` values of 3 and 4 and increases the weights for counties that have `Code2006` values of 5 and 6.

Figure 3 shows that, as expected, the poststratified weighted sample has the same distribution as the population.
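If you want to inspect the saved weights yourself, a minimal sketch such as the following might help. It assumes that the `Pswgt` data set stores the poststratification weight in a variable named `_PSWt_`, which is believed to be the default name; that name is an assumption here, so check the contents of `Pswgt` (for example, with PROC CONTENTS) before you rely on it. The data set `pswgt_ratio` and the variable `WeightRatio` are hypothetical names introduced for this illustration.

```sas
/* Sketch only: _PSWt_ is assumed to be the name of the poststratification */
/* weight variable in the OUT= data set Pswgt. Adjust it if yours differs. */

/* Sum of original and poststratified weights within each poststratum.     */
/* The poststratified sums should match _PSTOTAL_ in the Poststrata data.  */
proc means data=pswgt sum maxdec=1;
   class Code2006;
   var SamplingWeight _PSWt_;
run;

/* Ratio of the poststratification weight to the sampling weight.          */
/* The ratio is constant within a poststratum, so its mean is that factor. */
data pswgt_ratio;
   set pswgt;
   WeightRatio = _PSWt_ / SamplingWeight;
run;

proc means data=pswgt_ratio mean maxdec=3;
   class Code2006;
   var WeightRatio;
run;
```

The first step corresponds to the comparison in Figure 3, and the second to the weight ratios in Figure 2.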

Output 2 displays the poststratified estimates and their standard errors. All the poststratified estimates of the population means and totals are smaller than the non-poststratified estimates, but the two poststratified ratio estimates are larger. For example, the poststratified estimates indicate that the average county has an estimated 100,215 residents, of whom 48,755 are in the labor force and 4,518 are unemployed. The estimated unemployment rate is 9.3%, and the estimated labor force participation rate is 48.65%. Without exception, the standard errors (and hence the estimated variances) are smaller for the poststratified analysis, indicating that the poststratified estimates are more efficient for this sample.

Output 2: Poststratified Analysis

The SURVEYMEANS Procedure

Data Summary | |
---|---|
Number of Strata | 48 |
Number of Poststrata | 6 |
Number of Observations | 96 |
Sum of Weights | 3108 |

Statistics | | | | |
---|---|---|---|---|
Variable | Mean | Std Error of Mean | Sum | Std Dev |
LaborForce | 48755 | 4808.671480 | 151579056 | 14950160 |
Unemployed | 4517.976061 | 477.440072 | 14046388 | 1484361 |
Population | 100215 | 9964.992605 | 311568502 | 30981162 |

Ratio Analysis: Unemployment Rate | | | |
---|---|---|---|
Numerator | Denominator | Ratio | Std Err |
Unemployed | LaborForce | 0.092667 | 0.002727 |

Ratio Analysis: Labor Force Participation Rate | | | |
---|---|---|---|
Numerator | Denominator | Ratio | Std Err |
LaborForce | Population | 0.486503 | 0.003853 |
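To put a number on the efficiency gain, one option is to compare the squared standard errors from Output 1 and Output 2. The following sketch types the reported standard errors in by hand; the data set `se_compare` and its variable names are hypothetical and are not part of the original example. A `VarianceRatio` below 1 means that the poststratified estimate has the smaller estimated variance.

```sas
/* Sketch: standard errors are copied from Output 1 (SE_Strat)  */
/* and Output 2 (SE_Post). All names here are illustrative.     */
data se_compare;
   input Estimate :$24. SE_Strat SE_Post;
   /* Ratio of estimated variances: poststratified to stratified only */
   VarianceRatio = (SE_Post / SE_Strat)**2;
   datalines;
Mean_LaborForce    6488.570784 4808.671480
Mean_Unemployed    594.657745  477.440072
Mean_Population    13105       9964.992605
UnemploymentRate   0.003090    0.002727
ParticipationRate  0.004186    0.003853
;
run;

proc print data=se_compare noobs;
run;
```

For the mean labor force size, for instance, the ratio of standard errors is roughly 0.74, so the estimated variance falls to a little over half of its non-poststratified value.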

Suppose you want to compare the mortality rates of Florida and California. If you have samples from the two populations, computing the crude mortality rate for each population is straightforward. However, because many health outcomes vary by age and the two populations have different age distributions, a direct comparison of the crude mortality rates might be inappropriate. To make a relative comparison, you can use age-adjusted mortality rates. A common method of computing age-adjusted rates is called *direct standardization*; it is mathematically equivalent to poststratification.

The following SAS statements create the data sets `Florida` and `California`, which contain samples from a one-stage clustered sampling design that has a sampling rate of 0.5; the clusters consist of counties from the respective states, and the observations are age-specific groups. Each observation records the variable `FIPS`, which identifies the clusters (counties); the categorical variable `Age`, which identifies the age group; the variable `Population`, which records the total number of individuals in an age-specific group in 1968; the variable `Deaths`, which records the total number of recorded deaths in an age-specific group in 1968; and the variable `SamplingWeight`, which is the inverse of the probability of selecting a county in the sample. The data are sampled from the Compressed Mortality File (CMF), which is publicly available from the Centers for Disease Control and Prevention website, at http://www.cdc.gov/nchs/data_access/cmf.htm#data_availability.

    data Florida;
       input FIPS Age Population Deaths;
       SamplingWeight=1.9705882353;
       datalines;
    12011 4 7730 177
    12011 5 32956 44
    12011 6 49587 22
    12011 7 49407 23
    12011 8 40175 46
    12011 9 29425 52
    ... more lines ...
    12133 11 1048 5
    12133 12 1149 13
    12133 13 1252 20
    12133 14 896 33
    12133 15 425 33
    12133 16 92 27
    ;

    data California;
       input FIPS Age Population Deaths;
       SamplingWeight=2;
       datalines;
    6001 4 17412 348
    6001 5 72709 58
    6001 6 101367 41
    6001 7 95572 33
    6001 8 89730 87
    6001 9 107173 124
    ... more lines ...
    6115 11 5421 11
    6115 12 3720 34
    6115 13 2766 58
    6115 14 1752 77
    6115 15 796 74
    6115 16 180 39
    ;

Table 3 describes the different levels of the categorical variable `Age`.

Table 3: Age Categories

Age Category | Description |
---|---|
4 | Less than 1 year |
5 | 1–4 years |
6 | 5–9 years |
7 | 10–14 years |
8 | 15–19 years |
9 | 20–24 years |
10 | 25–34 years |
11 | 35–44 years |
12 | 45–54 years |
13 | 55–64 years |
14 | 65–74 years |
15 | 75–84 years |
16 | 85+ years |

The following SAS statements use the SURVEYMEANS procedure to estimate the crude mortality rates for Florida and California. The RATE= option in the PROC SURVEYMEANS statement identifies the sampling rate. The SURVEYMEANS procedure uses the sampling rate to compute a finite population correction for the Taylor series variance estimates. The RATIO and SUM keywords in the PROC SURVEYMEANS statement request estimates of the population ratios and totals, respectively. The VAR statement requests estimates of the variables `Deaths` and `Population`. The CLUSTER statement specifies that the variable `FIPS` identify the primary sampling units. The WEIGHT statement specifies that the variable `SamplingWeight` contain the sampling weights. The RATIO statement identifies the ratio of interest to be the number of deaths divided by the population size.

    proc surveymeans data=Florida ratio sum rate=.5;
       cluster fips;
       weight SamplingWeight;
       var deaths population;
       ratio 'Florida Crude Mortality Rate' deaths/population;
    run;

    proc surveymeans data=California ratio sum rate=.5;
       cluster fips;
       weight SamplingWeight;
       var deaths population;
       ratio 'California Crude Mortality Rate' deaths/population;
    run;

Output 3 and Output 4 show the estimation results.

Output 3: Crude Mortality Rate for Florida

The SURVEYMEANS Procedure

Data Summary | |
---|---|
Number of Clusters | 34 |
Number of Observations | 442 |
Sum of Weights | 871 |

Ratio Analysis: Florida Crude Mortality Rate | | | |
---|---|---|---|
Numerator | Denominator | Ratio | Std Err |
Deaths | Population | 0.010774 | 0.000464 |

Output 4: Crude Mortality Rate for California

The SURVEYMEANS Procedure

Data Summary | |
---|---|
Number of Clusters | 29 |
Number of Observations | 377 |
Sum of Weights | 754 |

Ratio Analysis: California Crude Mortality Rate | | | |
---|---|---|---|
Numerator | Denominator | Ratio | Std Err |
Deaths | Population | 0.007702 | 0.000595 |

The estimated crude mortality rates for Florida and California are 1.08% and 0.77%, respectively. The ratio of the crude mortality rates is 1.40. However, before you conclude that the mortality rate is higher in Florida than in California, consider the following two exhibits. Figure 4 shows that the age-specific mortality rates are decidedly a function of age in both states.

Figure 5 shows that the populations in Florida and California exhibit different age distributions. The percentage of residents in the age groups 13, 14, and 15 is higher in Florida than in California, whereas the percentage of residents in the age groups 5, 6, 7, 8, 9, 10, and 11 is lower in Florida than in California. Together these facts indicate that the crude mortality rates are not an appropriate measure for comparing differences between these two populations (Curtin and Klein, 1995).

**Note**: The SAS statements that generate Figure 4 and Figure 5 are not shown here but are included in the downloadable SAS program that is available with this web example.

Because the crude rate is not appropriate, and because age-specific mortality rates provide too much detail and require a large number of comparisons, you can use a summary measure that controls for a population’s age distribution. A commonly used measure is the age-adjusted mortality rate, which you can compute by performing direct standardization (Curtin and Klein, 1995).

As mentioned earlier, direct standardization is mathematically equivalent to poststratification. The difference between poststratification for the purpose of performing direct standardization and other forms of poststratification is this: when you perform direct standardization, the poststratum totals or proportions represent a standard or reference population rather than the population from which your sample was drawn.
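A short derivation shows why the two techniques coincide. It is a sketch for the simplest case only, in which each sampled unit $i$ is an individual with weight $w_i$, age group $a$, and a 0/1 death indicator $y_i$. The example's data are aggregated by county and age group, so treat the notation as illustrative. All symbols are assumptions introduced here: $\pi_a$ is the standard-population proportion for age group $a$ (with $\sum_a \pi_a = 1$), $s_a$ is the set of sampled units in age group $a$, and $T$ is any positive constant that scales the poststratum totals.

$$
\tilde{w}_i = w_i \, \frac{\pi_a T}{\hat{N}_a},
\qquad
\hat{N}_a = \sum_{j \in s_a} w_j,
\qquad
\hat{r}_a = \frac{\sum_{i \in s_a} w_i \, y_i}{\hat{N}_a}
$$

$$
\hat{R}_{\text{adj}}
= \frac{\sum_a \sum_{i \in s_a} \tilde{w}_i \, y_i}{\sum_a \sum_{i \in s_a} \tilde{w}_i}
= \frac{T \sum_a \pi_a \, \hat{r}_a}{T \sum_a \pi_a}
= \sum_a \pi_a \, \hat{r}_a
$$

The final expression is the directly standardized rate: the age-specific rates $\hat{r}_a$ averaged with the reference population's age proportions. The constant $T$ cancels, which is why proportions are enough and totals are not required.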

To compute comparable age-adjusted rates for Florida and California by using poststratification, you need a data set that contains the age distribution proportions from a standard or reference population. The following SAS statements create the data set `USbyAge`, which contains the age-specific proportions for the US population in 1968:

    data USbyAge;
       input Age _PSPCT_;
       datalines;
    4 0.01755
    5 0.07291
    6 0.10231
    7 0.10202
    8 0.09116
    9 0.07545
    10 0.11879
    11 0.11822
    12 0.11391
    13 0.09065
    14 0.06103
    15 0.02980
    16 0.00621
    ;
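As a quick sanity check that is not part of the original example, you can confirm that `USbyAge` has one row for each of the 13 age categories in Table 3 and that the proportions add up to approximately 1. The values listed above sum to 1.00001 because of rounding.

```sas
/* Sketch: count the rows and total the proportions in USbyAge */
proc means data=USbyAge n sum maxdec=5;
   var _PSPCT_;
run;
```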

You can then use PROC SURVEYMEANS to compute age-adjusted mortality rates for Florida and California. The procedure specification in the following SAS statements is the same as when you compute the crude rates, except that you add a POSTSTRATA statement, which specifies poststratification on the variable `Age`, and the PSPCT= option, which specifies that the population proportions be contained in the data set `USbyAge`.

    proc surveymeans data=Florida ratio rate=.5;
       cluster fips;
       weight SamplingWeight;
       var deaths population;
       poststrata age / pspct=USbyAge;
       ratio 'Florida Standardized Mortality Rate' deaths/population;
    run;

    proc surveymeans data=California ratio rate=.5;
       cluster fips;
       weight SamplingWeight;
       var deaths population;
       poststrata age / pspct=USbyAge;
       ratio 'California Standardized Mortality Rate' deaths/population;
    run;

Output 5 and Output 6 show the estimation results. The age-adjusted mortality rates for Florida and California are 0.70% and 0.48%, respectively. The ratio of the age-adjusted mortality rates is 1.45. Therefore, on an age-adjusted basis, the mortality rate in Florida in 1968 is almost 1.5 times the mortality rate in California in the same year.

Output 5: Standardized Mortality Rate for Florida

The SURVEYMEANS Procedure

Data Summary | |
---|---|
Number of Clusters | 34 |
Number of Poststrata | 13 |
Number of Observations | 442 |
Sum of Weights | 871 |

Ratio Analysis: Florida Standardized Mortality Rate | | | |
---|---|---|---|
Numerator | Denominator | Ratio | Std Err |
Deaths | Population | 0.006952 | 0.000248 |

Output 6: Standardized Mortality Rate for California

The SURVEYMEANS Procedure

Data Summary | |
---|---|
Number of Clusters | 29 |
Number of Poststrata | 13 |
Number of Observations | 377 |
Sum of Weights | 754 |

Ratio Analysis: California Standardized Mortality Rate | | | |
---|---|---|---|
Numerator | Denominator | Ratio | Std Err |
Deaths | Population | 0.004791 | 0.000385 |
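The two rate ratios quoted in the text can be reproduced directly from the reported estimates. The following DATA step is a small sketch with the values from Output 3 through Output 6 typed in by hand; the variable names are illustrative. The results round to 1.40 and 1.45.

```sas
/* Sketch: Florida-to-California ratios of the estimated mortality rates */
data _null_;
   CrudeRatio    = 0.010774 / 0.007702;   /* Output 3 / Output 4 */
   AdjustedRatio = 0.006952 / 0.004791;   /* Output 5 / Output 6 */
   put CrudeRatio= 5.2 AdjustedRatio= 5.2;
run;
```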

Curtin, L. R. and Klein, R. J. (1995), “Direct Standardization (Age-Adjusted Death Rates),” Healthy People 2000: Statistical Notes, DHHS Publication No. (PHS) 95-1237.

Ingram, D. D. and Franco, S. J. (2012), “NCHS Urban-Rural Classification Scheme for Counties,” Vital and Health Statistics, Series 2: Data Evaluation and Methods Research no. 154, DHHS publication no. (PHS) 2012-1354.

Lehtonen, R. and Pahkinen, E. (2004), *Practical Methods for Design and Analysis of Complex Surveys*, 2nd Edition, Chichester, UK: John Wiley & Sons.

Lohr, S. L. (2010), *Sampling: Design and Analysis*, 2nd Edition, Boston: Brooks/Cole.

Särndal, C. E., Swensson, B., and Wretman, J. (1992), *Model Assisted Survey Sampling*, New York: Springer-Verlag.