This example demonstrates the FEFI method by using data from the third National Health and Nutrition Examination Survey (NHANES III). The data set contains a set of BRR replicate weights. The REPWEIGHTS statement in PROC SURVEYIMPUTE is used to create imputation-adjusted replicate weights. The imputed data set and the imputation-adjusted replicate weights are then used in PROC SURVEYFREQ to create crosstabulation tables and to perform domain analysis.
The objective of NHANES is to study the health and nutritional status of the US population. NHANES uses a multistage stratified area sample with typically two PSUs per stratum. Strata are created based on geographic location, Metropolitan Statistical Areas (MSAs), and other demographics. MSAs or groups of counties are selected as PSUs from each stratum. Sampling weights are unequal because of different selection probabilities among different subgroups and for reasons such as nonresponse and undercoverage. For more information about NHANES, see http://www.cdc.gov/nchs/nhanes/about_nhanes.htm.
NHANES III data contain missing values in many items. Multiple imputation was used to impute some of the missing items. Five multiply imputed data sets are available for public use. Because FEFI will be used in this example to impute the missing values, you need the observed data, the missing (or imputation) flag for every item, and only one imputed data set. The data sets Core and IMP1 have been downloaded from http://www.cdc.gov/nchs/nhanes/nh3data.htm#7a. The Core data set contains the demographic variables, full sample weights, replicate weights, and imputation flags. The replicate weights are created by using Fay’s BRR method with a Fay coefficient of 0.3. The IMP1 data set contains the first version of the five multiply imputed data sets.
For this example, a new data set, Smoke, is created by merging the Core and IMP1 data sets by the observation sequence number, SEQN. The Smoke data set contains the following items:
SEQN: observation sequence number
WTPFQX6: observation weight
WTPQRP1 to WTPQRP52: 52 replicate weights from the BRR method
DMARETHN: race-ethnicity; 1 for white, 2 for black, 3 for Mexican American, and 4 for other
HSSEX: gender; 1 for male and 2 for female
HFF1IF: imputation flag for HFF1MI; 1 for observed and 2 for imputed
HAN6SRIF: imputation flag for HAN6SRMI; 0 for not applicable, 1 for observed, and 2 for imputed
HAR3RIF: imputation flag for HAR3RMI; 0 for not applicable, 1 for observed, and 2 for imputed
HAT28IF: imputation flag for HAT28MI; 0 for not applicable, 1 for observed, and 2 for imputed
HFF1MI: anyone smokes cigarettes in the home; 1 for yes and 2 for no
HAN6SRMI: beer, wine, or liquor per month; –9 for not applicable, 1 for 0 times in the past month, 2 for 1 to 10 times in the past month, and 3 for more than 10 times in the past month
HAR3RMI: smoke cigarettes now; –9 for not applicable, 1 for yes, and 2 for no
HAT28MI: activity level compared to others; –9 for not applicable, 1 for more active, 2 for less active, and 3 for about the same
Education: highest education attained; levels are elementary, high school, college, and unknown
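The merge that creates Smoke is described but not shown. A minimal sketch of one way to do it follows, assuming that Core and IMP1 are in the WORK library and that each has one row per SEQN; the sort steps and the unrestricted variable list are assumptions, not the original code.

```sas
/* Sketch only: assumes Core and IMP1 exist in WORK with one row per SEQN */
proc sort data=Core; by SEQN; run;
proc sort data=IMP1; by SEQN; run;

/* Match-merge on the observation sequence number */
data Smoke;
   merge Core IMP1;
   by SEQN;
run;
```

In practice you would probably add a KEEP= data set option to retain only the items listed above plus HFA7, which the later DATA step uses to derive Education.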
For donor-based imputation methods, auxiliary information is used to create imputation cells. Imputation cells divide the data into groups of similar units such that the recipient units share similar characteristics with the donor units in the same group. Characteristics for imputation cells might come from the same survey or from other auxiliary sources such as census data or previous surveys. The cell identification is known for every unit in the sample. Categorical levels of auxiliary variables are often used to create imputation cells. For a helpful review, see Brick and Kalton (1996). For the purpose of this example, seven imputation cells were created by using only two demographic variables: race-ethnicity status (DMARETHN) and gender (HSSEX). Both variables are available in the Core data set, and both have no missing values. The imputation cells are identified by the variable ImputationCells in the Smoke data set.
The following DATA step creates the imputation cells and the variable Education, and replaces the multiply imputed values with missing values:
/*-- Create education levels and imputation cells, and assign . to missing items --*/
data Smoke;
   set Smoke;
   /* Education level from HFA7; the first literal fixes the variable length at 11 */
   if HFA7 <= 8 then Education='Elementary ';
   if HFA7 > 8 and HFA7 <= 12 then Education='High School';
   if HFA7 > 12 and HFA7 <= 17 then Education='College ';
   if HFA7 > 17 then Education='Unknown ';
   /* Seven imputation cells from race-ethnicity and gender */
   if DMARETHN = 1 & HSSEX = 1 then ImputationCells=1;
   if DMARETHN = 1 & HSSEX = 2 then ImputationCells=2;
   if DMARETHN = 2 & HSSEX = 1 then ImputationCells=3;
   if DMARETHN = 2 & HSSEX = 2 then ImputationCells=4;
   if DMARETHN = 3 & HSSEX = 1 then ImputationCells=5;
   if DMARETHN = 3 & HSSEX = 2 then ImputationCells=6;
   if DMARETHN = 4 then ImputationCells=7;
   /* Reset multiply imputed values (flag = 2) to missing */
   if HFF1IF = 2 then HFF1MI = .;
   if HAN6SRIF = 2 then HAN6SRMI = .;
   if HAR3RIF = 2 then HAR3RMI = .;
   if HAT28IF = 2 then HAT28MI = .;
run;
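Before imputing, it can be useful to confirm that the DATA step behaved as intended. The following sketch is an added check, not part of the original example. It lists how the race-ethnicity and gender levels map to ImputationCells, and tabulates the four items with missing values shown as a level.

```sas
/* Sketch only: sanity checks on the recoded Smoke data set */
proc freq data=Smoke;
   /* One row per observed combination; every row should have a cell number */
   tables DMARETHN*HSSEX*ImputationCells / list missing;
   /* The MISSING option counts . as a level, showing how many values need imputation */
   tables HFF1MI HAN6SRMI HAR3RMI HAT28MI / missing;
run;
```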
The following statements request that the missing values be imputed by using the FEFI method:
proc surveyimpute data=Smoke method=FEFI varmethod=BRR;
   weight wtpfqx6;
   repweights wtpqrp:;
   id seqn;
   class hff1mi han6srmi har3rmi hat28mi;
   var hff1mi han6srmi har3rmi hat28mi;
   cells ImputationCells;
   output out=SmokeImputed;
run;
The PROC SURVEYIMPUTE statement invokes the procedure, the DATA= option specifies the input data set Smoke, the METHOD= option requests the FEFI method, and the VARMETHOD= option requests the imputation-adjusted BRR replicate weights. The WEIGHT statement specifies the weight variable, and the REPWEIGHTS statement specifies the unadjusted BRR replicate weights. Because you specify replicate weights by using the REPWEIGHTS statement, you do not need to specify the Fay coefficient in PROC SURVEYIMPUTE. The variable SEQN in the ID statement identifies the observation units. The VAR statement specifies the variables to be imputed, the CELLS statement identifies the imputation cell variable ImputationCells, and the OUT= option in the OUTPUT statement names the output data set SmokeImputed. You request that all four variables be imputed jointly and that the imputed data be saved in the SmokeImputed data set.
Note that this example creates imputation-adjusted BRR replicate weights from the unadjusted BRR replicate weights that are available for these data. If the unadjusted BRR replicate weights are not available to you, then PROC SURVEYIMPUTE first creates the unadjusted BRR replicate weights and then updates the unadjusted weights for imputation to create the imputation-adjusted BRR replicate weights. For more information, see the section Balanced Repeated Replication (BRR) Method.
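For comparison, the following is a hedged sketch of the second situation, in which no replicate weights are supplied and the procedure builds them from the design information. The stratum and PSU variable names (SDPSTRA6 and SDPPSU6) and the output data set name are assumptions; they are not among the Smoke items listed earlier. The FAY= spelling follows the BRR method-options used by other SAS/STAT survey procedures and should be checked against the PROC SURVEYIMPUTE syntax for your release.

```sas
/* Sketch only: SDPSTRA6 and SDPPSU6 are assumed names for the stratum
   and PSU identifiers; they would have to be kept in the Smoke data set. */
proc surveyimpute data=Smoke method=FEFI varmethod=BRR(fay=0.3);
   strata SDPSTRA6;          /* assumed stratum identifier */
   cluster SDPPSU6;          /* assumed PSU identifier     */
   weight WTPFQX6;
   id SEQN;
   class HFF1MI HAN6SRMI HAR3RMI HAT28MI;
   var HFF1MI HAN6SRMI HAR3RMI HAT28MI;
   cells ImputationCells;
   output out=SmokeImputedBRR;   /* hypothetical output name */
run;
```

Replicate weights that are generated this way would generally differ from the 52 replicate weights distributed with the data, so the two approaches should not be expected to give identical standard errors.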
The number of observations and the class level information are shown in Output 110.3.1. The "Number of Observations" table displays the number of observations (33,994) that are read and used. The weighted number of observations that are read shows that the 33,994 observation units in the sample represent over 251,000,000 observation units in the population. The "Class Level Information" table displays the observed levels for the analysis variables. The "Missing Data Patterns" table in Output 110.3.2 shows an arbitrary missing pattern. There are 12 different missing pattern groups. An "X" denotes that the variable is observed in that group, and a "." denotes that the variable is missing. Almost 94% of the observation units have no missing values (Group 1), 4.5% of the observation units have missing values only for the variable HAN6SRMI (Group 4), and about 1% of the observation units have missing values only for the variable HAT28MI (Group 2).
Output 110.3.1: Imputation Information
Output 110.3.2: Missing Data Patterns
Group | HFF1MI | HAN6SRMI | HAR3RMI | HAT28MI | Freq | Sum of Weights | Unweighted Percent | Weighted Percent | Mean HFF1MI 1 | Mean HFF1MI 2 | Mean HAN6SRMI -9 | Mean HAN6SRMI 1 | Mean HAN6SRMI 2 | Mean HAN6SRMI 3 | Mean HAR3RMI -9 | Mean HAR3RMI 1 | Mean HAR3RMI 2 | Mean HAT28MI -9 | Mean HAT28MI 1 | Mean HAT28MI 2 | Mean HAT28MI 3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | X | X | X | X | 31916 | 2.2918E8 | 93.89 | 91.27 | 0.370883 | 0.629117 | 0.275976 | 0.365368 | 0.219609 | 0.139048 | 0.275976 | 0.199395 | 0.524629 | 0.275976 | 0.239253 | 0.160019 | 0.324752 |
2 | X | X | X | . | 383 | 3584033 | 1.13 | 1.43 | 0.312278 | 0.687722 | 0 | 0.533112 | 0.322946 | 0.143942 | 0 | 0.231400 | 0.768600 | . | . | . | . |
3 | X | X | . | X | 2 | 6696.18 | 0.01 | 0.00 | 0 | 1.000000 | 0 | 0 | 0.695886 | 0.304114 | . | . | . | 0 | 0.695886 | 0 | 0.304114 |
4 | X | . | X | X | 1536 | 17201676 | 4.52 | 6.85 | 0.417227 | 0.582773 | . | . | . | . | 0 | 0.362741 | 0.637259 | 0 | 0.339890 | 0.222725 | 0.437384 |
5 | X | . | X | . | 25 | 255366.8 | 0.07 | 0.10 | 0.512251 | 0.487749 | . | . | . | . | 0 | 0.371613 | 0.628387 | . | . | . | . |
6 | X | . | . | . | 1 | 1137.27 | 0.00 | 0.00 | 0 | 1.000000 | . | . | . | . | . | . | . | . | . | . | . |
7 | . | X | X | X | 106 | 723162 | 0.31 | 0.29 | . | . | 0.277402 | 0.570801 | 0.099198 | 0.052599 | 0.277402 | 0.156784 | 0.565813 | 0.277402 | 0.185712 | 0.192402 | 0.344483 |
8 | . | X | X | . | 5 | 21175.81 | 0.01 | 0.01 | . | . | 0 | 0.414365 | 0.585635 | 0 | 0 | 0.078044 | 0.921956 | . | . | . | . |
9 | . | X | . | . | 2 | 43867.66 | 0.01 | 0.02 | . | . | 0 | 0.262239 | 0.737761 | 0 | . | . | . | . | . | . | . |
10 | . | . | X | X | 4 | 11751.6 | 0.01 | 0.00 | . | . | . | . | . | . | 0 | 0 | 1.000000 | 0 | 0.664584 | 0 | 0.335416 |
11 | . | . | X | . | 1 | 4483.45 | 0.00 | 0.00 | . | . | . | . | . | . | 0 | 0 | 1.000000 | . | . | . | . |
12 | . | . | . | . | 13 | 59745.61 | 0.04 | 0.02 | . | . | . | . | . | . | . | . | . | . | . | . | . |
The "Iteration History" table shown in Output 110.3.3 displays the maximum absolute and relative differences of the fractional weights for the EM algorithm for the full sample. The algorithm converged after four iterations. The "Imputation Summary" table shown in Output 110.3.4 displays the number of observed units (31,916), the number of missing units (2,078), and the number of imputed units. All units that have missing values have been imputed.
Output 110.3.3: Iteration History for the EM
Output 110.3.4: Imputation Summary
The imputed data set SmokeImputed contains the imputation-adjusted weight (ImpWt) and 52 imputation-adjusted replicate weights (ImpRepWt_1 to ImpRepWt_52). The SmokeImputed data set has 38,701 data lines. The number of imputed values for an observation unit ranges from two to six, but around 80% of the units are imputed by using two or three imputed values.
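One way to examine this structure yourself is to count the data lines per observation unit. The following sketch is an added illustration; the names RowsPerUnit, NRows, and TotalImpWt are hypothetical. It also sums ImpWt within each unit, on the expectation that fractional weights split a unit's weight across its imputed rows.

```sas
/* Sketch only: RowsPerUnit, NRows, and TotalImpWt are illustrative names */
proc sql;
   create table RowsPerUnit as
   select SEQN,
          count(*)   as NRows,        /* data lines for this unit */
          sum(ImpWt) as TotalImpWt    /* expected to match the unit's original weight */
   from SmokeImputed
   group by SEQN;
quit;

/* Distribution of the number of data lines per unit (1 = fully observed) */
proc freq data=RowsPerUnit;
   tables NRows;
run;
```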
You can use the imputed data set, the imputation-adjusted replicate weights, and the appropriate Fay coefficient to compute any estimators from your imputed data. You should use the REPWEIGHTS statement in SAS/STAT survey analysis procedures to specify the imputation-adjusted replicate weights. This example uses PROC SURVEYFREQ to perform the following analyses:
estimate the percentage of smokers and nonsmokers in the population
describe the smoking habits of an individual and of anyone who smokes in the home
perform a domain analysis of activity levels for different levels of education
The PROC SURVEYFREQ statement invokes the procedure, the DATA= option names the imputed data set SmokeImputed, and the VARMETHOD= option requests the BRR variance estimation. The FAY= option for VARMETHOD=BRR specifies the Fay coefficient 0.3. Because your replicate weights come from Fay’s BRR method, you must specify the FAY= option in the SAS/STAT survey analysis procedures to appropriately estimate the variance. The VARHEADER=LABEL option in the PROC SURVEYFREQ statement requests that the labels of the variables be displayed in the output. The WEIGHT statement specifies the imputation-adjusted full sample weights, and the REPWEIGHTS statement specifies the imputation-adjusted replicate weights. Note that the imputation-adjusted full sample and replicate weights are created by PROC SURVEYIMPUTE, and they are different from the unadjusted weights available in the Smoke data set. The first TABLE statement requests a two-way frequency analysis for HFF1MI and HAR3RMI. The second TABLE statement requests a domain analysis for HAT28MI, where the variable Education is used as the domain variable. The ROW option in the TABLE statement is required in order to compute the distribution of HAT28MI for different levels of Education. The NOTOTAL, NOFREQ, and NOWT options suppress some output columns.
proc surveyfreq data=SmokeImputed varmethod=brr(fay=0.3) varheader=label;
   weight ImpWt;
   repweights ImpRepWt_:;
   table HFF1MI*HAR3RMI;
   table Education*HAT28MI / row nototal nofreq nowt;
run;
The data summary and the variance estimation information are displayed in Output 110.3.5. There are 38,701 data lines in the SmokeImputed data set. These 38,701 data lines represent the 33,994 observation units in the Smoke data set. The observation units are identified by the variable SEQN. The sum of weights is over 251,000,000, which is the same as the sum of weights in the Smoke data set. The sum of weights is an estimate of the population size. The "Variance Estimation" table shows that 52 replicate weights from Fay’s BRR method are used for variance estimation with the Fay coefficient 0.3.
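To see why the Fay coefficient matters at the analysis stage, it helps to write out the variance estimator for Fay's BRR method in its usual textbook form. The notation here is introduced for illustration and is not taken from the example: the full-sample estimate is computed with ImpWt, the r-th replicate estimate is computed with the r-th imputation-adjusted replicate weight, R = 52 is the number of replicates, and epsilon = 0.3 is the Fay coefficient.

```latex
\widehat{V}(\hat{\theta})
  = \frac{1}{R\,(1-\epsilon)^{2}} \sum_{r=1}^{R} \bigl(\hat{\theta}_{r} - \hat{\theta}\bigr)^{2}
  = \frac{1}{52 \times 0.49} \sum_{r=1}^{52} \bigl(\hat{\theta}_{r} - \hat{\theta}\bigr)^{2}
```

Under this form, omitting the coefficient would replace the divisor 52 × 0.49 = 25.48 with 52, which would shrink every variance estimate to 49% of its intended value.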
Output 110.3.5: Summary Information
A two-way table of individual smoking status by smoking in the home is shown in Output 110.3.6. An estimated 21% of the population are smokers and 54% are nonsmokers; for the remaining 25%, the item is not applicable (coded –9). Nearly 19% of the individuals are smokers who live in a home where at least one person smokes in the home, but only 2% are smokers who live in a home where no household member smokes in the home. However, almost 9% of the individuals are nonsmokers who live in a home where at least one household member smokes in the home. The standard errors that are reported in the table properly account for the imputation.
Output 110.3.6: Two-Way Table for Smoking Status
Table of Anyone living here smoke cigs in home by Smoke cigarettes now (recode)

Anyone living here smoke cigs in home | Smoke cigarettes now (recode) | Frequency | Weighted Frequency | Std Err of Wgt Freq | Percent | Std Err of Percent |
---|---|---|---|---|---|---|
1 | -9 | 5431 | 24762830 | 717089 | 9.8619 | 0.2856 |
1 | 1 | 5788 | 47166758 | 1361381 | 18.7843 | 0.5422 |
1 | 2 | 3341 | 21790932 | 614088 | 8.6783 | 0.2446 |
1 | Total | 14560 | 93720520 | 2180565 | 37.3244 | 0.8684 |
2 | -9 | 8582 | 38702422 | 701855 | 15.4133 | 0.2795 |
2 | 1 | 881 | 5837874 | 397358 | 2.3249 | 0.1582 |
2 | 2 | 14678 | 112836186 | 1602093 | 44.9373 | 0.6380 |
2 | Total | 24141 | 157376482 | 2180565 | 62.6756 | 0.8684 |
Total | -9 | 14013 | 63465252 | 260068 | 25.2752 | 0.1036 |
Total | 1 | 6669 | 53004633 | 1276926 | 21.1092 | 0.5085 |
Total | 2 | 18019 | 134627118 | 1328833 | 53.6156 | 0.5292 |
Total | Total | 38701 | 251097002 | 2.25270 | 100.000 | |
Suppose you want to perform a domain analysis by using the imputed data. If a list of domain variables is available before the imputation, then sometimes it is desirable to use the domain variables to create the imputation cells. However, requests for domain analyses often come after the imputation. In addition, data users might use domain variables that are different from those that are used to create the imputation cells. In this example, the domain variable Education was not used to create the imputation cells. Although education level is not used in the imputation, it is reasonable to use the imputed data to perform domain analysis for every level of education. Domain analysis for activity levels for different education levels is shown in Output 110.3.7. Among individuals whose highest education level is college, 38% are reported as more active and 21% are reported as less active than their peers. Among individuals whose highest education level is high school, 28% are reported as more active and 20% are reported as less active than their peers. The standard errors that are reported in the table properly account for the imputation.
Output 110.3.7: Domain Analysis for Activity Levels by Education
Table of Education by Compare own activity level to others

Education | Compare own activity level to others | Percent | Std Err of Percent | Row Percent | Std Err of Row Percent |
---|---|---|---|---|---|
College | -9 | 0.0017 | 0.0016 | 0.0055 | 0.0052 |
College | 1 | 11.9908 | 0.4117 | 38.3446 | 0.9792 |
College | 2 | 6.5292 | 0.3253 | 20.8795 | 0.8461 |
College | 3 | 12.7494 | 0.4641 | 40.7704 | 0.9978 |
Elementary | -9 | 21.8961 | 0.1269 | 74.2986 | 0.9154 |
Elementary | 1 | 1.8571 | 0.1193 | 6.3015 | 0.3596 |
Elementary | 2 | 2.0051 | 0.1500 | 6.8039 | 0.4513 |
Elementary | 3 | 3.7121 | 0.1970 | 12.5961 | 0.5420 |
High School | -9 | 3.2191 | 0.1238 | 8.3653 | 0.3251 |
High School | 1 | 10.6977 | 0.2990 | 27.7997 | 0.6634 |
High School | 2 | 7.8486 | 0.2782 | 20.3959 | 0.6333 |
High School | 3 | 16.7160 | 0.4958 | 43.4391 | 0.7062 |
Unknown | -9 | 0.1583 | 0.0289 | 20.3674 | 3.2674 |
Unknown | 1 | 0.1981 | 0.0463 | 25.4966 | 3.7217 |
Unknown | 2 | 0.1502 | 0.0346 | 19.3311 | 3.3973 |
Unknown | 3 | 0.2704 | 0.0426 | 34.8049 | 3.2211 |