The SURVEYIMPUTE Procedure

Example 110.3 Fully Efficient Fractional Imputation, Fay’s Balanced Repeated Replication, and Domain Analysis

This example demonstrates the FEFI method by using data from the third National Health and Nutrition Examination Survey (NHANES III). The data set contains a set of BRR replicate weights. The REPWEIGHTS statement in PROC SURVEYIMPUTE is used to create imputation-adjusted replicate weights. The imputed data set and the imputation-adjusted replicate weights are then used in PROC SURVEYFREQ to create crosstabulation tables and to perform domain analysis.

The objective of NHANES is to study the health and nutritional status of the US population. NHANES uses a multistage stratified area sample with typically two PSUs per stratum. Strata are created based on geographic location, Metropolitan Statistical Areas (MSAs), and other demographics. MSAs or groups of counties are selected as PSUs from each stratum. Sampling weights are unequal because selection probabilities differ among subgroups and for reasons such as nonresponse and undercoverage. For more information about NHANES, see http://www.cdc.gov/nchs/nhanes/about_nhanes.htm.

NHANES III data contain missing values in many items. Multiple imputation was used to impute some of the missing items. Five multiply imputed data sets are available for public use. Because FEFI will be used in this example to impute the missing values, you need the observed data, the missing (or imputation) flag for every item, and only one imputed data set. The data sets Core and IMP1 have been downloaded from http://www.cdc.gov/nchs/nhanes/nh3data.htm#7a. The Core data set contains the demographic variables, full sample weights, replicate weights, and imputation flags. The replicate weights are created by using Fay’s BRR method with a Fay coefficient of 0.3. The IMP1 data set contains the first version of the five multiply imputed data sets.

For this example, a new data set, Smoke, is created by merging the Core and IMP1 data sets by the observation sequence number, SEQN. The Smoke data set contains the following items:

SEQN: observation sequence number
WTPFQX6: observation weight
WTPQRP1 to WTPQRP52: 52 replicate weights from the BRR method
DMARETHN: race-ethnicity; 1 for white, 2 for black, 3 for Mexican American, and 4 for other
HSSEX: gender; 1 for male and 2 for female
HFF1IF: imputation flag for HFF1MI; 1 for observed and 2 for imputed
HAN6SRIF: imputation flag for HAN6SRMI; 0 for not applicable, 1 for observed, and 2 for imputed
HAR3RIF: imputation flag for HAR3RMI; 0 for not applicable, 1 for observed, and 2 for imputed
HAT28IF: imputation flag for HAT28MI; 0 for not applicable, 1 for observed, and 2 for imputed
HFF1MI: anyone smokes cigarettes in the home; 1 for yes, and 2 for no
HAN6SRMI: beer, wine, or liquor per month; –9 for not applicable, 1 for 0 times in the past month, 2 for 1 to 10 times in the past month, and 3 for more than 10 times in the past month
HAR3RMI: smoke cigarettes now; –9 for not applicable, 1 for yes, and 2 for no
HAT28MI: activity level compared to others; –9 for not applicable, 1 for more active, 2 for less active, and 3 for about the same
Education: highest education attained; levels are elementary, high school, college, and unknown
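The merge described above can be sketched as follows. This is a minimal sketch, not the original preparation code: it assumes that Core and IMP1 already exist as SAS data sets in the WORK library and that both contain SEQN. Any variable selection or renaming that the original preparation used is not shown.

```sas
/* Sketch only: Core and IMP1 are assumed to be in the WORK library */
proc sort data=Core;
   by SEQN;
run;

proc sort data=IMP1;
   by SEQN;
run;

/* One-to-one match merge on the observation sequence number */
data Smoke;
   merge Core IMP1;
   by SEQN;
run;
```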

For donor-based imputation methods, auxiliary information is used to create imputation cells. Imputation cells divide the data into groups of similar units such that the recipient units share similar characteristics with the donor units in the same group. Characteristics for imputation cells might come from the same survey or from other auxiliary sources such as census data or previous surveys. The cell identification is known for every unit in the sample. Categorical levels of auxiliary variables are often used to create imputation cells. For a helpful review, see Brick and Kalton (1996). For the purpose of this example, seven imputation cells were created by using only two demographic variables: race-ethnicity status (DMARETHN) and gender (HSSEX). Both variables are available in the Core data set, and both have no missing values. The imputation cells are identified by the variable ImputationCells in the Smoke data set.

The following DATA step creates the imputation cells and the variable Education, and replaces the multiply imputed values with missing values:

/*-- Create education levels and imputation cells, and
     assign . to missing items --*/
data Smoke;
   set Smoke;
   /* Education level from HFA7 */
   if HFA7 <= 8                then Education='Elementary ';
   if HFA7 > 8  and HFA7 <= 12 then Education='High School';
   if HFA7 > 12 and HFA7 <= 17 then Education='College    ';
   if HFA7 > 17                then Education='Unknown    ';
   /* Seven imputation cells from race-ethnicity and gender */
   if DMARETHN = 1 & HSSEX = 1 then ImputationCells=1;
   if DMARETHN = 1 & HSSEX = 2 then ImputationCells=2;
   if DMARETHN = 2 & HSSEX = 1 then ImputationCells=3;
   if DMARETHN = 2 & HSSEX = 2 then ImputationCells=4;
   if DMARETHN = 3 & HSSEX = 1 then ImputationCells=5;
   if DMARETHN = 3 & HSSEX = 2 then ImputationCells=6;
   if DMARETHN = 4             then ImputationCells=7;
   /* Replace multiply imputed values (flag = 2) with missing */
   if HFF1IF   = 2 then HFF1MI   = .;
   if HAN6SRIF = 2 then HAN6SRMI = .;
   if HAR3RIF  = 2 then HAR3RMI  = .;
   if HAT28IF  = 2 then HAT28MI  = .;
run;
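Before imputing, it can be useful to confirm that every unit falls into exactly one imputation cell and to see how many values were set to missing. The following statements are one possible check, offered as a sketch and not as part of the original example; they use only the variables that the DATA step above creates or modifies.

```sas
/* Each DMARETHN-by-HSSEX combination should map to one cell,
   and no unit should have a missing ImputationCells value */
proc freq data=Smoke;
   tables ImputationCells*DMARETHN*HSSEX / list missing;
run;

/* Count observed (N) and missing (NMISS) values per item */
proc means data=Smoke n nmiss;
   var HFF1MI HAN6SRMI HAR3RMI HAT28MI;
run;
```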

The following statements request that the missing values be imputed by using the FEFI method:

proc surveyimpute data=Smoke method=FEFI varmethod=BRR;
   weight wtpfqx6;
   repweights wtpqrp:;
   id seqn;
   class hff1mi han6srmi har3rmi hat28mi;
   var   hff1mi han6srmi har3rmi hat28mi;
   cells ImputationCells;
   output out=SmokeImputed;
run;

The PROC SURVEYIMPUTE statement invokes the procedure, the DATA= option specifies the input data set Smoke, the METHOD= option requests the FEFI method, and the VARMETHOD= option requests the imputation-adjusted BRR replicate weights. The WEIGHT statement specifies the weight variable, and the REPWEIGHTS statement specifies the unadjusted BRR replicate weights. Because you specify replicate weights by using the REPWEIGHTS statement, you do not need to specify the Fay coefficient in PROC SURVEYIMPUTE. The variable SEQN in the ID statement identifies the observation units. The CLASS statement names the four analysis variables as classification variables, the VAR statement specifies the variables to be imputed, the CELLS statement identifies the imputation cell variable ImputationCells, and the OUT= option in the OUTPUT statement names the output data set SmokeImputed. You request that all four variables be imputed jointly and that the imputed data be saved in the SmokeImputed data set.

Note that this example creates imputation-adjusted BRR replicate weights from the unadjusted BRR replicate weights that are available for these data. If the unadjusted BRR replicate weights are not available to you, then PROC SURVEYIMPUTE first creates the unadjusted BRR replicate weights and then updates the unadjusted weights for imputation to create the imputation-adjusted BRR replicate weights. For more information, see the section Balanced Repeated Replication (BRR) Method.
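If no replicate weights were supplied, a call along the following lines could be used instead. This is a hedged sketch, not part of the original example. It assumes that the design stratum and PSU variables are named SDPSTRA6 and SDPPSU6 and are present in the Smoke data set, that each stratum contains exactly two PSUs as BRR requires, and that the same Fay coefficient of 0.3 is wanted. The output data set name SmokeImputedBRR is hypothetical. The replicate weights that the procedure constructs in this way would generally not be identical to the 52 replicate weights that are distributed with the data.

```sas
/* Sketch only: SDPSTRA6 and SDPPSU6 are assumed to be the design
   stratum and PSU variables, and are assumed to be present in Smoke */
proc surveyimpute data=Smoke method=FEFI varmethod=BRR(fay=0.3);
   weight  wtpfqx6;
   strata  sdpstra6;
   cluster sdppsu6;
   id      seqn;
   class   hff1mi han6srmi har3rmi hat28mi;
   var     hff1mi han6srmi har3rmi hat28mi;
   cells   ImputationCells;
   output  out=SmokeImputedBRR;
run;
```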

Summary information about the number of observations and the class levels is shown in Output 110.3.1. The "Number of Observations" table displays the number of observations (33,994) that are read and used. The sum of weights shows that the 33,994 observation units in the sample represent over 251,000,000 units in the population. The "Class Level Information" table displays the observed levels for the analysis variables. The "Missing Data Patterns" table in Output 110.3.2 shows an arbitrary missing pattern with 12 different missing pattern groups. An "X" denotes that the variable is observed in that group, and a "." denotes that the variable is missing. In unweighted terms, almost 94% of the observation units have no missing values (Group 1), 4.5% have missing values only for the variable HAN6SRMI (Group 4), and about 1% have missing values only for the variable HAT28MI (Group 2).

Output 110.3.1: Imputation Information

The SURVEYIMPUTE Procedure

Number of Observations Read	33994
Number of Observations Used	33994
Sum of Weights Read	2.511E8
Sum of Weights Used	2.511E8

Class Level Information
Class	Levels	Values
HFF1MI	2	1 2
HAN6SRMI	4	-9 1 2 3
HAR3RMI	3	-9 1 2
HAT28MI	4	-9 1 2 3

Output 110.3.2: Missing Data Patterns

Missing Data Patterns
Group	HFF1MI	HAN6SRMI	HAR3RMI	HAT28MI	Freq	Sum of Weights	Unweighted Percent	Weighted Percent	Group Mean HFF1MI 1	Group Mean HFF1MI 2	Group Mean HAN6SRMI -9	Group Mean HAN6SRMI 1	Group Mean HAN6SRMI 2	Group Mean HAN6SRMI 3	Group Mean HAR3RMI -9	Group Mean HAR3RMI 1	Group Mean HAR3RMI 2	Group Mean HAT28MI -9	Group Mean HAT28MI 1	Group Mean HAT28MI 2	Group Mean HAT28MI 3
1	X	X	X	X	31916	2.2918E8	93.89	91.27	0.370883	0.629117	0.275976	0.365368	0.219609	0.139048	0.275976	0.199395	0.524629	0.275976	0.239253	0.160019	0.324752
2	X	X	X	.	383	3584033	1.13	1.43	0.312278	0.687722	0	0.533112	0.322946	0.143942	0	0.231400	0.768600	.	.	.	.
3	X	X	.	X	2	6696.18	0.01	0.00	0	1.000000	0	0	0.695886	0.304114	.	.	.	0	0.695886	0	0.304114
4	X	.	X	X	1536	17201676	4.52	6.85	0.417227	0.582773	.	.	.	.	0	0.362741	0.637259	0	0.339890	0.222725	0.437384
5	X	.	X	.	25	255366.8	0.07	0.10	0.512251	0.487749	.	.	.	.	0	0.371613	0.628387	.	.	.	.
6	X	.	.	.	1	1137.27	0.00	0.00	0	1.000000	.	.	.	.	.	.	.	.	.	.	.
7	.	X	X	X	106	723162	0.31	0.29	.	.	0.277402	0.570801	0.099198	0.052599	0.277402	0.156784	0.565813	0.277402	0.185712	0.192402	0.344483
8	.	X	X	.	5	21175.81	0.01	0.01	.	.	0	0.414365	0.585635	0	0	0.078044	0.921956	.	.	.	.
9	.	X	.	.	2	43867.66	0.01	0.02	.	.	0	0.262239	0.737761	0	.	.	.	.	.	.	.
10	.	.	X	X	4	11751.6	0.01	0.00	.	.	.	.	.	.	0	0	1.000000	0	0.664584	0	0.335416
11	.	.	X	.	1	4483.45	0.00	0.00	.	.	.	.	.	.	0	0	1.000000	.	.	.	.
12	.	.	.	.	13	59745.61	0.04	0.02	.	.	.	.	.	.	.	.	.	.	.	.	.

The "Iteration History" table shown in Output 110.3.3 displays the maximum absolute and relative differences of the fractional weights for the EM algorithm for the full sample. The algorithm converged after four iterations. The "Imputation Summary" table shown in Output 110.3.4 displays the number of observed units (31,916), the number of missing units (2,078), and the number of imputed units. All units that have missing values have been imputed.

Output 110.3.3: Iteration History for the EM

Iteration History
Iteration Number	Maximum Absolute Difference	Maximum Relative Difference
1	830.4733	0.18278
2	93.33655	0.00904
3	16.14668	0.00237
4	4.138731	0.00061

Output 110.3.4: Imputation Summary

Imputation Summary
Observation Status	Number of Observations	Sum of Weights
Nonmissing	31916	2.2918E8
Missing	2078	21913095
Missing, Imputed	2078	21913095
Missing, Not Imputed	0	0
Missing, Partially Imputed	0	0

The imputed data set SmokeImputed contains the imputation-adjusted weight (ImpWt) and 52 imputation-adjusted replicate weights (ImpRepWt_1 to ImpRepWt_52). The SmokeImputed data set has 38,701 data lines. The number of imputed values for an observation unit ranges from two to six, but around 80% of the units are imputed by using two or three imputed values.
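One way to inspect this structure is to count the data lines per observation unit and to compare the sum of ImpWt within each unit against the original full sample weight. The following statements are a sketch under two assumptions: that SEQN is carried into SmokeImputed (it is named in the ID statement), and that the fractional weights for a unit sum to one, so that the within-unit sum of ImpWt reproduces WTPFQX6. The names UnitCheck, NRows, SumImpWt, and AbsDiff are introduced here for illustration only.

```sas
/* Sketch only: summarize SmokeImputed by observation unit */
proc sql;
   create table UnitCheck as
   select i.SEQN,
          count(*)                              as NRows,
          sum(i.ImpWt)                          as SumImpWt,
          abs(sum(i.ImpWt) - max(s.WTPFQX6))    as AbsDiff
   from SmokeImputed as i
        inner join Smoke as s
        on i.SEQN = s.SEQN
   group by i.SEQN;
quit;

/* Distribution of the number of data lines per unit */
proc freq data=UnitCheck;
   tables NRows;
run;

/* AbsDiff should be near zero if the assumption above holds */
proc means data=UnitCheck n max;
   var AbsDiff;
run;
```

Units with no missing items are expected to contribute a single data line, so the NRows distribution should be dominated by the value 1.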

You can use the imputed data set, the imputation-adjusted replicate weights, and the appropriate Fay coefficient to compute any estimators from your imputed data. You should use the REPWEIGHTS statement in SAS/STAT survey analysis procedures to specify the imputation-adjusted replicate weights. This example uses PROC SURVEYFREQ to perform the following analyses:

estimate the percentage of smokers and nonsmokers in the population
describe the smoking habits of an individual and of anyone who smokes in the home
perform a domain analysis of activity levels for different levels of education

The PROC SURVEYFREQ statement invokes the procedure, the DATA= option names the imputed data set SmokeImputed, and the VARMETHOD= option requests BRR variance estimation. The FAY= option for VARMETHOD=BRR specifies the Fay coefficient 0.3. Because your replicate weights come from Fay’s BRR method, you must specify the FAY= option in the SAS/STAT survey analysis procedures to appropriately estimate the variance. The VARHEADER=LABEL option in the PROC SURVEYFREQ statement requests that the labels of the variables be displayed in the output. The WEIGHT statement specifies the imputation-adjusted full sample weights, and the REPWEIGHTS statement specifies the imputation-adjusted replicate weights. Note that the imputation-adjusted full sample and replicate weights are created by PROC SURVEYIMPUTE, and they are different from the unadjusted weights available in the Smoke data set. The first TABLE statement requests a two-way frequency analysis for HFF1MI and HAR3RMI. The second TABLE statement requests a domain analysis for HAT28MI, where the variable Education is used as the domain variable. The ROW option in the TABLE statement is required in order to compute the distribution of HAT28MI for different levels of Education. The NOTOTAL, NOFREQ, and NOWT options suppress some output columns.

proc surveyfreq data=SmokeImputed varmethod=brr(fay=0.3) varheader=label;
   weight ImpWt;
   repweights ImpRepWt_:;
   table HFF1MI*HAR3RMI;
   table Education*HAT28MI / row nototal nofreq nowt;
run;

The data summary and the variance estimation information are displayed in Output 110.3.5. There are 38,701 data lines in the SmokeImputed data set. These 38,701 data lines represent the 33,994 observation units in the Smoke data set. The observation units are identified by the variable SEQN. The sum of weights is over 251,000,000, which is the same as the sum of weights in the Smoke data set. The sum of weights is an estimate of the population size. The "Variance Estimation" table shows that 52 replicate weights from Fay’s BRR method are used for variance estimation with the Fay coefficient 0.3.

Output 110.3.5: Summary Information

The SURVEYFREQ Procedure

Data Summary
Number of Observations	38701
Sum of Weights	251097002

Variance Estimation
Method	BRR
Replicate Weights	SMOKEIMPUTED
Number of Replicates	52
Fay Coefficient	0.300
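To see why the Fay coefficient matters for the variance, recall the general form of the replication variance estimator under Fay's method. The notation here is introduced for illustration and does not appear in the example: the symbol \hat{\theta} is the full sample estimate, \hat{\theta}_r is the estimate from the r-th set of replicate weights, R is the number of replicates, and \epsilon is the Fay coefficient.

```latex
\hat{V}(\hat{\theta})
  = \frac{1}{R\,(1-\epsilon)^{2}} \sum_{r=1}^{R} \left( \hat{\theta}_r - \hat{\theta} \right)^{2},
\qquad
R = 52,\ \epsilon = 0.3
\;\Rightarrow\;
\frac{1}{R\,(1-\epsilon)^{2}} = \frac{1}{52 \times 0.49} \approx 0.0392
```

Omitting the Fay coefficient would replace the factor 1/(52 × 0.49) with 1/52, which would understate the variance by a factor of about two.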

A two-way table for the smoking habit of the observation unit and smoking in the home is shown in Output 110.3.6. An estimated 21% of the population are smokers and 54% are nonsmokers; the question is not applicable for the remaining 25%. Nearly 19% of the individuals are smokers who live in a home where at least one person smokes in the home, but only 2% are smokers who live in a home where no other household member smokes in the home. However, almost 9% of the individuals are nonsmokers who live in a home where at least one household member smokes in the home. The standard errors that are reported in the table properly account for the imputation.

Output 110.3.6: Two-Way Table for Smoking Status

Table of Anyone living here smoke cigs in home by Smoke cigarettes now (recode)
Anyone living here smoke cigs in home	Smoke cigarettes now (recode)	Frequency	Weighted Frequency	Std Err of Wgt Freq	Percent	Std Err of Percent
1	-9	5431	24762830	717089	9.8619	0.2856
	1	5788	47166758	1361381	18.7843	0.5422
	2	3341	21790932	614088	8.6783	0.2446
	Total	14560	93720520	2180565	37.3244	0.8684
2	-9	8582	38702422	701855	15.4133	0.2795
	1	881	5837874	397358	2.3249	0.1582
	2	14678	112836186	1602093	44.9373	0.6380
	Total	24141	157376482	2180565	62.6756	0.8684
Total	-9	14013	63465252	260068	25.2752	0.1036
	1	6669	53004633	1276926	21.1092	0.5085
	2	18019	134627118	1328833	53.6156	0.5292
	Total	38701	251097002	2.25270	100.000

Suppose you want to perform a domain analysis by using the imputed data. If a list of domain variables is available before the imputation, then sometimes it is desirable to use the domain variables to create the imputation cells. However, requests for domain analyses often come after the imputation. In addition, data users might use domain variables that are different from those that are used to create the imputation cells. In this example, the domain variable Education was not used to create the imputation cells. Although education level is not used in the imputation, it is reasonable to use the imputed data to perform domain analysis for every level of education. Domain analysis for activity levels for different education levels is shown in Output 110.3.7. Among individuals whose highest education level is college, 38% are reported as more active and 21% are reported as less active than their peers. Among individuals whose highest education level is high school, 28% are reported as more active and 20% are reported as less active than their peers. The standard errors that are reported in the table properly account for the imputation.

Output 110.3.7: Domain Analysis for Activity Levels by Education

Table of Education by Compare own activity level to others
Education	Compare own activity level to others	Percent	Std Err of Percent	Row Percent	Std Err of Row Percent
College	-9	0.0017	0.0016	0.0055	0.0052
	1	11.9908	0.4117	38.3446	0.9792
	2	6.5292	0.3253	20.8795	0.8461
	3	12.7494	0.4641	40.7704	0.9978
Elementary	-9	21.8961	0.1269	74.2986	0.9154
	1	1.8571	0.1193	6.3015	0.3596
	2	2.0051	0.1500	6.8039	0.4513
	3	3.7121	0.1970	12.5961	0.5420
High School	-9	3.2191	0.1238	8.3653	0.3251
	1	10.6977	0.2990	27.7997	0.6634
	2	7.8486	0.2782	20.3959	0.6333
	3	16.7160	0.4958	43.4391	0.7062
Unknown	-9	0.1583	0.0289	20.3674	3.2674
	1	0.1981	0.0463	25.4966	3.7217
	2	0.1502	0.0346	19.3311	3.3973
	3	0.2704	0.0426	34.8049	3.2211
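As a worked check on how the Row Percent column relates to the Percent column, the row percentage for a cell is the cell percentage divided by the sum of the cell percentages in the same Education row. The following arithmetic uses the displayed (rounded) values for the College row and the "more active" level, so the result agrees with the table only up to rounding.

```latex
\text{Row Percent}_{\text{College},\,1}
  = 100 \times \frac{11.9908}{0.0017 + 11.9908 + 6.5292 + 12.7494}
  = 100 \times \frac{11.9908}{31.2711}
  \approx 38.34
```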