The SURVEYIMPUTE Procedure

Example 110.3 Fully Efficient Fractional Imputation, Fay’s Balanced Repeated Replication, and Domain Analysis

This example demonstrates the FEFI method by using data from the third National Health and Nutrition Examination Survey (NHANES III). The data set contains a set of BRR replicate weights. The REPWEIGHTS statement in PROC SURVEYIMPUTE is used to create imputation-adjusted replicate weights. The imputed data set and the imputation-adjusted replicate weights are then used in PROC SURVEYFREQ to create crosstabulation tables and to perform domain analysis.

The objective of NHANES is to study the health and nutritional status of the US population. NHANES uses a multistage stratified area sample with typically two PSUs per stratum. Strata are created based on geographic location, Metropolitan Statistical Areas (MSAs), and other demographics. MSAs or groups of counties are selected as PSUs from each stratum. Sampling weights are unequal because selection probabilities differ among subgroups and because of adjustments for nonresponse and undercoverage. For more information about NHANES, see http://www.cdc.gov/nchs/nhanes/about_nhanes.htm.

NHANES III data contain missing values in many items. Multiple imputation was used to impute some of the missing items. Five multiply imputed data sets are available for public use. Because FEFI will be used in this example to impute the missing values, you need the observed data, the missing (or imputation) flag for every item, and only one imputed data set. The data sets Core and IMP1 have been downloaded from http://www.cdc.gov/nchs/nhanes/nh3data.htm#7a. The Core data set contains the demographic variables, full sample weights, replicate weights, and imputation flags. The replicate weights are created by using Fay’s BRR method with a Fay coefficient of 0.3. The IMP1 data set contains the first version of the five multiply imputed data sets.

For this example, a new data set, Smoke, is created by merging the Core and IMP1 data sets by the observation sequence number, SEQN. The Smoke data set contains the following items:

  • SEQN: observation sequence number

  • WTPFQX6: observation weight

  • WTPQRP1 to WTPQRP52: 52 replicate weights from the BRR method

  • DMARETHN: race-ethnicity; 1 for white, 2 for black, 3 for Mexican American, and 4 for other

  • HSSEX: gender; 1 for male and 2 for female

  • HFF1IF: imputation flag for HFF1MI; 1 for observed and 2 for imputed

  • HAN6SRIF: imputation flag for HAN6SRMI; 0 for not applicable, 1 for observed, and 2 for imputed

  • HAR3RIF: imputation flag for HAR3RMI; 0 for not applicable, 1 for observed, and 2 for imputed

  • HAT28IF: imputation flag for HAT28MI; 0 for not applicable, 1 for observed, and 2 for imputed

  • HFF1MI: anyone smokes cigarettes in the home; 1 for yes, and 2 for no

  • HAN6SRMI: beer, wine, or liquor per month; –9 for not applicable, 1 for 0 time in the past month, 2 for 1 to 10 times in the past month, and 3 for more than 10 times in the past month

  • HAR3RMI: smoke cigarettes now; –9 for not applicable, 1 for yes, and 2 for no

  • HAT28MI: activity level compared to others; –9 for not applicable, 1 for more active, 2 for less active, and 3 for about the same

  • Education: highest education attained; levels are elementary, high school, college, and unknown
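The merge that produces Smoke is not shown in this example. A minimal sketch of a one-to-one merge by SEQN follows. It assumes that Core and IMP1 are already available as SAS data sets in the WORK library, and the IN= flags and the subsetting IF statement are illustrative choices, not the documented steps:

```sas
/* Sketch only: assumes WORK.Core and WORK.IMP1 already exist */
proc sort data=Core; by SEQN; run;
proc sort data=IMP1; by SEQN; run;

data Smoke;
   merge Core(in=inCore) IMP1(in=inImp);
   by SEQN;
   /* Keep only units that appear in both data sets (assumption) */
   if inCore and inImp;
run;
```

In practice you would probably also add a KEEP= data set option to retain only the items listed above, along with HFA7, which the next DATA step uses to create Education.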

For donor-based imputation methods, auxiliary information is used to create imputation cells. Imputation cells divide the data into groups of similar units such that the recipient units share similar characteristics with the donor units in the same group. Characteristics for imputation cells might come from the same survey or from other auxiliary sources such as census data or previous surveys. The cell identification is known for every unit in the sample. Categorical levels of auxiliary variables are often used to create imputation cells. For a helpful review, see Brick and Kalton (1996). For the purpose of this example, seven imputation cells were created by using only two demographic variables: race-ethnicity status (DMARETHN) and gender (HSSEX). Both variables are available in the Core data set, and both have no missing values. The imputation cells are identified by the variable ImputationCells in the Smoke data set.

The following DATA step creates the imputation cells and the variable Education, and replaces the multiply imputed values with missing values:

/*--Create education levels, imputation cells, and
    assign . to missing items --*/
data Smoke;
   set Smoke;

   /* Education level from HFA7; the literals are padded so that
      the first assignment gives Education a length of 11 */
   if      HFA7 <= 8  then Education='Elementary ';
   else if HFA7 <= 12 then Education='High School';
   else if HFA7 <= 17 then Education='College    ';
   else                    Education='Unknown    ';

   /* Seven imputation cells: race-ethnicity by gender,
      with DMARETHN = 4 (other) kept as a single cell */
   if DMARETHN = 1 & HSSEX = 1 then ImputationCells=1;
   if DMARETHN = 1 & HSSEX = 2 then ImputationCells=2;
   if DMARETHN = 2 & HSSEX = 1 then ImputationCells=3;
   if DMARETHN = 2 & HSSEX = 2 then ImputationCells=4;
   if DMARETHN = 3 & HSSEX = 1 then ImputationCells=5;
   if DMARETHN = 3 & HSSEX = 2 then ImputationCells=6;
   if DMARETHN = 4             then ImputationCells=7;

   /* Set values that are flagged as imputed (flag = 2) back to missing */
   if HFF1IF   = 2 then HFF1MI   = .;
   if HAN6SRIF = 2 then HAN6SRMI = .;
   if HAR3RIF  = 2 then HAR3RMI  = .;
   if HAT28IF  = 2 then HAT28MI  = .;
run;
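Before you impute, it can be worth checking that every unit falls into one of the seven cells and that the items now contain missing values. The following PROC FREQ step is a sketch of such a check and is not part of the documented example:

```sas
/* Sketch only: sanity checks on the cells and the missing items */
proc freq data=Smoke;
   /* MISSING shows any unit that was not assigned to a cell */
   tables ImputationCells DMARETHN*HSSEX / missing;
   /* MISSING reports the number of missing values for each item */
   tables HFF1MI HAN6SRMI HAR3RMI HAT28MI / missing;
run;
```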

The following statements request that the missing values be imputed by using the FEFI method:

proc surveyimpute data=Smoke method=FEFI varmethod=BRR;
   weight wtpfqx6;
   repweights wtpqrp:;
   id seqn;
   class hff1mi han6srmi har3rmi hat28mi;
   var   hff1mi han6srmi har3rmi hat28mi;
   cells ImputationCells;
   output out=SmokeImputed;
run;

The PROC SURVEYIMPUTE statement invokes the procedure, the DATA= option specifies the input data set Smoke, the METHOD= option requests the FEFI method, and the VARMETHOD= option requests imputation-adjusted BRR replicate weights. The WEIGHT statement specifies the weight variable, and the REPWEIGHTS statement specifies the unadjusted BRR replicate weights. Because you specify replicate weights by using the REPWEIGHTS statement, you do not need to specify the Fay coefficient in PROC SURVEYIMPUTE. The variable SEQN in the ID statement identifies the observation units. The CLASS statement names the four analysis variables as classification variables. The VAR statement specifies the variables to be imputed, the CELLS statement identifies the imputation cell variable ImputationCells, and the OUT= option in the OUTPUT statement names the output data set SmokeImputed. Because all four variables are listed in the VAR statement, they are imputed jointly, and the imputed data are saved in the SmokeImputed data set.

Note that this example creates imputation-adjusted BRR replicate weights from the unadjusted BRR replicate weights that are available for these data. If the unadjusted BRR replicate weights are not available to you, then PROC SURVEYIMPUTE first creates the unadjusted BRR replicate weights and then updates the unadjusted weights for imputation to create the imputation-adjusted BRR replicate weights. For more information, see the section Balanced Repeated Replication (BRR) Method.
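As a rough sketch of that second case, the step might look like the following. The variable names Stratum and PSU are hypothetical placeholders for the design variables (they are not among the Smoke items listed earlier), the output data set name is also hypothetical, and the FAY= method-option is assumed to be accepted here in the same form that PROC SURVEYFREQ uses:

```sas
/* Sketch only: Stratum and PSU are hypothetical design variables */
proc surveyimpute data=Smoke method=FEFI varmethod=BRR(fay=0.3);
   weight  wtpfqx6;
   strata  Stratum;   /* BRR expects two PSUs in each stratum */
   cluster PSU;
   id seqn;
   class hff1mi han6srmi har3rmi hat28mi;
   var   hff1mi han6srmi har3rmi hat28mi;
   cells ImputationCells;
   output out=SmokeImputedBRR;   /* hypothetical output name */
run;
```

Without a REPWEIGHTS statement, the procedure needs the design information and the Fay coefficient in order to construct the replicates itself.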

Summary information about the number of observations and the class levels is shown in Output 110.3.1. The "Number of Observations" table displays the number of observations (33,994) that are read and used. The sum of weights shows that the 33,994 observation units in the sample represent over 251,000,000 units in the population. The "Class Level Information" table displays the observed levels for the analysis variables. The "Missing Data Patterns" table in Output 110.3.2 shows an arbitrary missing pattern that has 12 different missing pattern groups. An "X" denotes that the variable is observed in that group, and a "." denotes that the variable is missing. Almost 94% of the observation units have no missing values (Group 1), 4.5% are missing only the variable HAN6SRMI (Group 4), and about 1% are missing only the variable HAT28MI (Group 2).

Output 110.3.1: Imputation Information

The SURVEYIMPUTE Procedure

Number of Observations Read 33994
Number of Observations Used 33994
Sum of Weights Read 2.511E8
Sum of Weights Used 2.511E8

Class Level Information
Class Levels Values
HFF1MI 2 1 2
HAN6SRMI 4 -9 1 2 3
HAR3RMI 3 -9 1 2
HAT28MI 4 -9 1 2 3



Output 110.3.2: Missing Data Patterns

Missing Data Patterns

Group  HFF1MI  HAN6SRMI  HAR3RMI  HAT28MI   Freq  Sum of Weights  Unweighted Percent  Weighted Percent
    1  X       X         X        X        31916        2.2918E8               93.89             91.27
    2  X       X         X        .          383         3584033                1.13              1.43
    3  X       X         .        X            2         6696.18                0.01              0.00
    4  X       .         X        X         1536        17201676                4.52              6.85
    5  X       .         X        .           25        255366.8                0.07              0.10
    6  X       .         .        .            1         1137.27                0.00              0.00
    7  .       X         X        X          106          723162                0.31              0.29
    8  .       X         X        .            5        21175.81                0.01              0.01
    9  .       X         .        .            2        43867.66                0.01              0.02
   10  .       .         X        X            4         11751.6                0.01              0.00
   11  .       .         X        .            1         4483.45                0.00              0.00
   12  .       .         .        .           13        59745.61                0.04              0.02

Group Means (first header row: variable; second header row: level)

         HFF1MI    HFF1MI  HAN6SRMI  HAN6SRMI  HAN6SRMI  HAN6SRMI   HAR3RMI   HAR3RMI   HAR3RMI   HAT28MI   HAT28MI   HAT28MI   HAT28MI
Group         1         2        -9         1         2         3        -9         1         2        -9         1         2         3
    1  0.370883  0.629117  0.275976  0.365368  0.219609  0.139048  0.275976  0.199395  0.524629  0.275976  0.239253  0.160019  0.324752
    2  0.312278  0.687722         0  0.533112  0.322946  0.143942         0  0.231400  0.768600         .         .         .         .
    3         0  1.000000         0         0  0.695886  0.304114         .         .         .         0  0.695886         0  0.304114
    4  0.417227  0.582773         .         .         .         .         0  0.362741  0.637259         0  0.339890  0.222725  0.437384
    5  0.512251  0.487749         .         .         .         .         0  0.371613  0.628387         .         .         .         .
    6         0  1.000000         .         .         .         .         .         .         .         .         .         .         .
    7         .         .  0.277402  0.570801  0.099198  0.052599  0.277402  0.156784  0.565813  0.277402  0.185712  0.192402  0.344483
    8         .         .         0  0.414365  0.585635         0         0  0.078044  0.921956         .         .         .         .
    9         .         .         0  0.262239  0.737761         0         .         .         .         .         .         .         .
   10         .         .         .         .         .         .         0         0  1.000000         0  0.664584         0  0.335416
   11         .         .         .         .         .         .         0         0  1.000000         .         .         .         .
   12         .         .         .         .         .         .         .         .         .         .         .         .         .



The "Iteration History" table shown in Output 110.3.3 displays the maximum absolute and relative differences of the fractional weights for the EM algorithm for the full sample. The algorithm converged after four iterations. The "Imputation Summary" table shown in Output 110.3.4 displays the number of observed units (31,916), the number of missing units (2,078), and the number of imputed units (2,078). All units that have missing values have been imputed.

Output 110.3.3: Iteration History for the EM

Iteration History

Iteration    Maximum Absolute    Maximum Relative
   Number          Difference          Difference
        1            830.4733             0.18278
        2            93.33655             0.00904
        3            16.14668             0.00237
        4            4.138731             0.00061



Output 110.3.4: Imputation Summary

Imputation Summary

Observation Status            Number of Observations  Sum of Weights
Nonmissing                                     31916        2.2918E8
Missing                                         2078        21913095
Missing, Imputed                                2078        21913095
Missing, Not Imputed                               0               0
Missing, Partially Imputed                         0               0



The imputed data set SmokeImputed contains the imputation-adjusted weight (ImpWt) and 52 imputation-adjusted replicate weights (ImpRepWt_1 to ImpRepWt_52). The SmokeImputed data set has 38,701 data lines. The number of imputed values for an observation unit ranges from two to six, but around 80% of the units are imputed by using two or three imputed values.
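One way to inspect this structure is to count the data lines for each observation unit and to sum the imputation-adjusted weights within each unit. The following steps are a sketch, not part of the documented example, and the names RowsPerUnit, NRows, and SumImpWt are hypothetical:

```sas
/* Sketch only: one summary row per observation unit */
proc sql;
   create table RowsPerUnit as
   select SEQN,
          count(*)   as NRows,     /* data lines for this unit        */
          sum(ImpWt) as SumImpWt   /* total adjusted weight for unit  */
   from SmokeImputed
   group by SEQN;
quit;

/* Distribution of the number of data lines per unit */
proc freq data=RowsPerUnit;
   tables NRows;
run;

/* Grand totals: number of data lines and sum of adjusted weights */
proc means data=RowsPerUnit sum;
   var NRows SumImpWt;
run;
```

If the imputation ran as described, NRows should total 38,701 and SumImpWt should total about 251,097,002, the sum of weights that PROC SURVEYFREQ reports for SmokeImputed. Fully observed units are expected to contribute a single data line each.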

You can use the imputed data set, the imputation-adjusted replicate weights, and the appropriate Fay coefficient to compute any estimators from your imputed data. You should use the REPWEIGHTS statement in SAS/STAT survey analysis procedures to specify the imputation-adjusted replicate weights. This example uses PROC SURVEYFREQ to perform the following analyses:

  • estimate the percentage of smokers and nonsmokers in the population

  • describe the smoking habits of an individual and of anyone who smokes in the home

  • perform a domain analysis of activity levels for different levels of education

The PROC SURVEYFREQ statement invokes the procedure, the DATA= option names the imputed data set SmokeImputed, and the VARMETHOD= option requests BRR variance estimation. The FAY= option for VARMETHOD=BRR specifies the Fay coefficient 0.3. Because your replicate weights come from Fay’s BRR method, you must specify the FAY= option in the SAS/STAT survey analysis procedures to appropriately estimate the variance. The VARHEADER=LABEL option in the PROC SURVEYFREQ statement requests that the labels of the variables be displayed in the output. The WEIGHT statement specifies the imputation-adjusted full sample weights, and the REPWEIGHTS statement specifies the imputation-adjusted replicate weights. Note that the imputation-adjusted full sample and replicate weights are created by PROC SURVEYIMPUTE, and they are different from the unadjusted weights available in the Smoke data set. The first TABLE statement requests a two-way frequency analysis for HFF1MI and HAR3RMI. The second TABLE statement requests a domain analysis for HAT28MI, where the variable Education is used as the domain variable. The ROW option in the TABLE statement is required in order to compute the distribution of HAT28MI for different levels of Education. The NOTOTAL, NOFREQ, and NOWT options suppress some output columns.

proc surveyfreq data=SmokeImputed varmethod=brr(fay=0.3) varheader=label;
   weight ImpWt;
   repweights ImpRepWt_:;
   table HFF1MI*HAR3RMI;
   table Education*HAT28MI / row nototal nofreq nowt;
run;
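To see why the FAY= option matters, recall the usual form of the Fay BRR variance estimator. The notation here is introduced only for illustration and does not appear elsewhere in this example: $\hat{\theta}$ is the estimate computed with ImpWt, $\hat{\theta}_r$ is the same estimate computed with the $r$th replicate weight, $R = 52$ is the number of replicates, and $\epsilon = 0.3$ is the Fay coefficient.

```latex
\hat{V}(\hat{\theta})
  = \frac{1}{R\,(1-\epsilon)^{2}} \sum_{r=1}^{R} \left(\hat{\theta}_r - \hat{\theta}\right)^{2}
  = \frac{1}{52 \times 0.49} \sum_{r=1}^{52} \left(\hat{\theta}_r - \hat{\theta}\right)^{2}
```

Under this form, omitting FAY=0.3 would drop the factor $(1-\epsilon)^2 = 0.49$ from the denominator, so the reported standard errors would be about 0.7 times what they should be.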

The data summary and the variance estimation information are displayed in Output 110.3.5. There are 38,701 data lines in the SmokeImputed data set. These 38,701 data lines represent the 33,994 observation units in the Smoke data set. The observation units are identified by the variable SEQN. The sum of weights is over 251,000,000, which is the same as the sum of weights in the Smoke data set. The sum of weights is an estimate of the population size. The "Variance Estimation" table shows that 52 replicate weights from Fay’s BRR method are used for variance estimation with the Fay coefficient 0.3.

Output 110.3.5: Summary Information

The SURVEYFREQ Procedure

Data Summary
Number of Observations 38701
Sum of Weights 251097002

Variance Estimation
Method BRR
Replicate Weights SMOKEIMPUTED
Number of Replicates 52
Fay Coefficient 0.300



A two-way table of smoking in the home by the smoking habit of the observation unit is shown in Output 110.3.6. An estimated 21% of the population are smokers and 54% are nonsmokers; the smoking question is not applicable to the remaining 25%. Nearly 19% of the individuals are smokers who live in a home where at least one person smokes in the home, but only about 2% of the individuals are smokers who live in a home where no one smokes in the home. However, almost 9% of the individuals are nonsmokers who live in a home where at least one household member smokes in the home. The standard errors that are reported in the table properly account for the imputation.

Output 110.3.6: Two-Way Table for Smoking Status

Table of Anyone living here smoke cigs in home by Smoke cigarettes now (recode)

Anyone living here    Smoke cigarettes                  Weighted    Std Err of               Std Err of
smoke cigs in home    now (recode)        Frequency    Frequency      Wgt Freq    Percent       Percent
1                     -9                       5431     24762830        717089     9.8619        0.2856
                      1                        5788     47166758       1361381    18.7843        0.5422
                      2                        3341     21790932        614088     8.6783        0.2446
                      Total                   14560     93720520       2180565    37.3244        0.8684
2                     -9                       8582     38702422        701855    15.4133        0.2795
                      1                         881      5837874        397358     2.3249        0.1582
                      2                       14678    112836186       1602093    44.9373        0.6380
                      Total                   24141    157376482       2180565    62.6756        0.8684
Total                 -9                      14013     63465252        260068    25.2752        0.1036
                      1                        6669     53004633       1276926    21.1092        0.5085
                      2                       18019    134627118       1328833    53.6156        0.5292
                      Total                   38701    251097002       2.25270    100.000



Suppose you want to perform a domain analysis by using the imputed data. If a list of domain variables is available before the imputation, then sometimes it is desirable to use the domain variables to create the imputation cells. However, requests for domain analyses often come after the imputation. In addition, data users might use domain variables that are different from the variables that are used to create the imputation cells. In this example, the domain variable Education was not used to create the imputation cells. Although education level is not used in the imputation, it is reasonable to use the imputed data to perform domain analysis for every level of education. Domain analysis for activity levels for different education levels is shown in Output 110.3.7. According to the row percentages, among individuals whose highest education level is college, 38% are reported as more active and 21% are reported as less active than their peers. Among individuals whose highest education level is high school, 28% are reported as more active and 20% are reported as less active than their peers. The standard errors that are reported in the table properly account for the imputation.

Output 110.3.7: Domain Analysis for Activity Levels by Education

Table of Education by Compare own activity level to others

             Compare own activity             Std Err of      Row   Std Err of
Education    level to others         Percent     Percent  Percent  Row Percent
College      -9                       0.0017      0.0016   0.0055       0.0052
             1                       11.9908      0.4117  38.3446       0.9792
             2                        6.5292      0.3253  20.8795       0.8461
             3                       12.7494      0.4641  40.7704       0.9978
Elementary   -9                      21.8961      0.1269  74.2986       0.9154
             1                        1.8571      0.1193   6.3015       0.3596
             2                        2.0051      0.1500   6.8039       0.4513
             3                        3.7121      0.1970  12.5961       0.5420
High School  -9                       3.2191      0.1238   8.3653       0.3251
             1                       10.6977      0.2990  27.7997       0.6634
             2                        7.8486      0.2782  20.3959       0.6333
             3                       16.7160      0.4958  43.4391       0.7062
Unknown      -9                       0.1583      0.0289  20.3674       3.2674
             1                        0.1981      0.0463  25.4966       3.7217
             2                        0.1502      0.0346  19.3311       3.3973
             3                        0.2704      0.0426  34.8049       3.2211
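The same combination of WEIGHT, REPWEIGHTS, and VARMETHOD=BRR(FAY=0.3) should carry over to the other SAS/STAT survey analysis procedures. As an illustrative sketch that is not part of the documented example, the following PROC SURVEYMEANS step estimates the proportion in each level of HAR3RMI, overall and within each level of Education:

```sas
/* Sketch only: proportions of HAR3RMI levels from the imputed data */
proc surveymeans data=SmokeImputed varmethod=brr(fay=0.3) mean;
   weight     ImpWt;          /* imputation-adjusted full sample weight */
   repweights ImpRepWt_:;     /* imputation-adjusted replicate weights  */
   class      HAR3RMI;        /* treat the item as categorical          */
   var        HAR3RMI;
   domain     Education;      /* domain estimates by education level    */
run;
```

Because HAR3RMI is named in the CLASS statement, the MEAN keyword reports a proportion for each level instead of a numeric mean.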