The SURVEYIMPUTE Procedure

Example 110.2 Fully Efficient Fractional Imputation

This example illustrates the fully efficient fractional imputation (FEFI) method by using the data set DrugAbuse from Example 110.1, which comes from a fictitious survey of drug abusers. The survey collects information about the substances that are used (such as drugs, alcohol, and marijuana) along with insurance information and treatment information. Some participants did not respond to all questions, so the data set contains missing values for many variables. The data set contains 736 observation units in 35 PSUs and 10 strata. The sum of the weights is approximately 19,600.

As in Example 110.1, to impute the missing items, you first need to decide whether to impute within imputation cells. Imputation cells divide the data into groups of similar units such that the recipient units share similar characteristics with the donor units in the same group. For example, it is reasonable to believe that different age groups, races, and income categories might have different responses to the drug abuse survey. You can use these characteristics to create imputation cells. Characteristics for imputation cells might come from the same survey or from other sources such as census data or previous surveys. In this example, assume that the imputation cells are available as a variable called ImputationCell in the data set.
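Before you impute within cells, it can help to confirm that each cell contains enough observed values to serve as donors. The following statements are a minimal sketch of such a check and are not part of the original example. They assume that the six items are stored as numeric variables (as the coded values in the example suggest), so that PROC MEANS can count their missing values.

```sas
/* Count observed (N) and missing (NMISS) values of each item   */
/* within each imputation cell.                                 */
/* Assumption: the items are numeric; for character variables,  */
/* use PROC FREQ with the MISSING option instead.               */
proc means data=DrugAbuse n nmiss;
   class ImputationCell;
   var Sex Race Insurance Drug Alcohol Treatment;
run;
```

A cell in which an item has very few observed values offers few donors for that item, which might be a reason to collapse cells before imputation.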

The following statements request that the missing items be imputed by using the FEFI method:

proc surveyimpute data=DrugAbuse method=FEFI varmethod=Jackknife;
   class Sex Race Insurance Drug Alcohol Treatment;
   var   Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   strata Strata;
   cluster PSU;
   weight ObsWeight;
   output out=DrugAbuseFEFI outjkcoefs=DrugAbuseJKCOEFS;
run;

The PROC SURVEYIMPUTE statement invokes the procedure, the DATA= option specifies the input data set DrugAbuse, the METHOD= option requests the FEFI method, and the VARMETHOD= option requests that imputation-adjusted jackknife replicate weights be created. The VAR statement specifies the variables that are to be imputed, and the CELLS statement identifies the imputation cell variable ImputationCell. Because no IMPJOINT statements are specified, all the variables in the VAR statement are to be imputed by using their joint categories. For more information, see the section IMPJOINT Statement. The STRATA, CLUSTER, and WEIGHT statements specify the strata, cluster, and weight variables. The OUT= option in the OUTPUT statement names the output data set DrugAbuseFEFI to store the imputed values, and the OUTJKCOEFS= option in the OUTPUT statement names the output data set DrugAbuseJKCOEFS to store the jackknife coefficients.

Summary information about the imputation method, number of observations, and survey design is shown in Output 110.2.1. The "Imputation Information" table summarizes the imputation method. The "Number of Observations" table displays the number of observations that are read and used (736) and the weighted number of observations that are read and used (approximately 19,600) by PROC SURVEYIMPUTE. The "Design Summary" table shows that there are 10 strata and 35 PSUs (clusters).

Output 110.2.1: Imputation Information

The SURVEYIMPUTE Procedure

Imputation Information
Data Set	WORK.DRUGABUSE
Weight Variable	ObsWeight
Stratum Variable	Strata
Cluster Variable	PSU
Imputation Method	FEFI

Number of Observations Read	736
Number of Observations Used	736
Sum of Weights Read	19599.99
Sum of Weights Used	19599.99

Design Summary
Number of Strata	10
Number of Clusters	35
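If you want to cross-check these counts against the input data, a query along the following lines could be used. This is only a sketch: it assumes that a PSU is identified by the combination of Strata and PSU (because PSU numbers might restart within each stratum), and the column aliases NObs, NStrata, SumWeights, and NClusters are hypothetical names chosen for illustration.

```sas
proc sql;
   /* Number of observations, number of strata, and sum of the weights */
   select count(*)               as NObs,
          count(distinct Strata) as NStrata,
          sum(ObsWeight)         as SumWeights
      from DrugAbuse;

   /* Number of PSUs, counting each stratum-PSU combination once */
   select count(*) as NClusters
      from (select distinct Strata, PSU from DrugAbuse);
quit;
```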

Selected observations for some variables from the output data set are displayed in Output 110.2.2. The output data set, DrugAbuseFEFI, contains the unit identification (UnitID), the recipient index (Recipient), the imputation-adjusted full-sample weights (ImpWt), the imputation-adjusted jackknife replicate weights, and all the variables from the input data set DrugAbuse. The output data set contains 35 sets of replicate weights, but only the first three sets are shown in Output 110.2.2. Units that are complete respondents have one row in the output data set, but units that are incomplete respondents have multiple rows. For example, unit 21 is a complete respondent, so it has only one row and its Recipient value is 0. Unit 22 is an incomplete respondent; it has 20 rows, its Recipient values range from 1 to 20, and its ImpWt values range from about 0.02 to about 1.02. The sum of ImpWt over all rows (donor cells) for unit 22 is 5, which is the full-sample weight (ObsWeight) for unit 22.

Output 110.2.2: Observations for Selected Units

UnitID	Recipient	ImpWt	Strata	PSU	ObsWeight	ImputationCell	Sex	Race	Insurance	Drug	Alcohol	Treatment	ImpRepWt_1	ImpRepWt_2	ImpRepWt_3
20	0	5.00000	1	1	5	1	0	1	1	1	2	1	0.00000	6.66667	6.66667
21	0	5.00000	1	2	5	1	0	3	3	1	1	1	6.66667	0.00000	6.66667
22	1	0.49238	1	2	5	2	1	1	1	1	1	1	0.65757	0.00000	0.66513
22	2	1.01597	1	2	5	2	1	1	1	1	2	1	1.36836	0.00000	1.33813
22	3	0.14661	1	2	5	2	1	1	1	1	2	2	0.18561	0.00000	0.19845
22	4	0.11496	1	2	5	2	1	1	1	2	1	1	0.15441	0.00000	0.15268
22	5	0.03894	1	2	5	2	1	1	1	2	1	2	0.05231	0.00000	0.05172
22	6	0.32730	1	2	5	2	1	1	1	2	2	1	0.44358	0.00000	0.43861
22	7	0.21661	1	2	5	2	1	1	1	2	2	2	0.29074	0.00000	0.28749
22	8	0.24024	1	2	5	2	1	1	2	1	1	1	0.32268	0.00000	0.31907
22	9	0.10453	1	2	5	2	1	1	2	1	1	2	0.12908	0.00000	0.14255
22	10	0.46726	1	2	5	2	1	1	2	1	2	1	0.61595	0.00000	0.62442
22	11	0.19382	1	2	5	2	1	1	2	1	2	2	0.26067	0.00000	0.25730
22	12	0.37531	1	2	5	2	1	1	2	2	2	1	0.50409	0.00000	0.49845
22	13	0.03697	1	2	5	2	1	1	2	2	2	2	0.04966	0.00000	0.04911
22	14	0.15533	1	2	5	2	1	1	3	1	1	1	0.19731	0.00000	0.21002
22	15	0.08672	1	2	5	2	1	1	3	1	1	2	0.11647	0.00000	0.11517
22	16	0.69416	1	2	5	2	1	1	3	1	2	1	0.93613	0.00000	0.92566
22	17	0.03688	1	2	5	2	1	1	3	1	2	2	0.04953	0.00000	0.04897
22	18	0.02106	1	2	5	2	1	1	3	2	1	1	0.02828	0.00000	0.02796
22	19	0.18442	1	2	5	2	1	1	3	2	2	1	0.23638	0.00000	0.24865
22	20	0.05053	1	2	5	2	1	1	3	2	2	2	0.06788	0.00000	0.06712
23	0	5.00000	1	2	5	2	1	1	1	1	1	1	6.66667	0.00000	6.66667
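The statement that a unit's fractional weights add up to its original weight can be checked for every unit, not only for unit 22. The following query is a minimal sketch of that check. It assumes that UnitID uniquely identifies an observation unit; the aliases NRows, SumImpWt, and UnitWeight and the tolerance of 1E-6 are illustrative choices.

```sas
proc sql;
   /* List any unit whose ImpWt values do not sum to its ObsWeight */
   select UnitID,
          count(*)       as NRows,
          sum(ImpWt)     as SumImpWt,
          max(ObsWeight) as UnitWeight
      from DrugAbuseFEFI
      group by UnitID
      having abs(sum(ImpWt) - max(ObsWeight)) > 1e-6;
quit;
```

If the weights are consistent, the query should select no rows.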

You can use the imputed data set and the imputation-adjusted replicate weights to compute estimates from your imputed data. You can use the REPWEIGHTS statement in any SAS/STAT survey analysis procedure to specify the imputation-adjusted replicate weights. For example, the following statements use PROC SURVEYLOGISTIC to perform logistic regression analysis of the imputed data:

proc surveylogistic data=DrugAbuseFEFI varmethod=Jackknife;
   class Treatment Insurance Sex Race;
   model Drug=Treatment Insurance Age Sex Race;
   weight ImpWt;
   repweights ImpRepWt_: / jkcoefs=DrugAbuseJKCOEFS;
   ods output parameterestimates=FEFILogisticAnalysis;
run;

The WEIGHT statement specifies the imputation-adjusted full-sample weight (ImpWt), and the REPWEIGHTS statement specifies the imputation-adjusted replicate weights (ImpRepWt_1, …, ImpRepWt_35). The JKCOEFS= option in the REPWEIGHTS statement specifies the jackknife coefficients.
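The same WEIGHT and REPWEIGHTS specification carries over to other survey analysis procedures. As an illustration that is not part of the original example, the following sketch uses PROC SURVEYFREQ to cross-tabulate two of the imputed items; the choice of Drug and Treatment as table variables is arbitrary.

```sas
/* Weighted two-way table from the fractionally imputed data.   */
/* The replicate weights and jackknife coefficients are the     */
/* ones that PROC SURVEYIMPUTE created earlier in this example. */
proc surveyfreq data=DrugAbuseFEFI varmethod=Jackknife;
   tables Drug*Treatment;
   weight ImpWt;
   repweights ImpRepWt_: / jkcoefs=DrugAbuseJKCOEFS;
run;
```

Because the replicate weights are supplied directly, no STRATA or CLUSTER statement is specified here, which mirrors the PROC SURVEYLOGISTIC step.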

The parameter estimates and their standard errors are displayed in Output 110.2.3. The variance estimators correctly account for both the design variability and the imputation variability.

Output 110.2.3: Logistic Regression Analysis of the Fractionally Imputed Data Set

The SURVEYLOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
Parameter		Estimate	Standard Error	t Value	Pr > |t|
Intercept		0.5105	0.2281	2.24	0.0317
Treatment	1	0.1048	0.1600	0.65	0.5168
Insurance	1	-0.1152	0.1454	-0.79	0.4337
Insurance	2	-0.0461	0.1183	-0.39	0.6988
Age		0.00195	0.00459	0.42	0.6736
Sex	0	-0.0719	0.0931	-0.77	0.4452
Race	1	0.4463	0.1574	2.84	0.0075
Race	2	-0.1212	0.2110	-0.57	0.5695
NOTE: The degrees of freedom for the t tests is 35.
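To see how the columns of this table relate to one another, you can recompute the t value and the two-sided p-value for one row. The following DATA step is a sketch that uses the rounded estimate and standard error for Race 1 as they are displayed in the table, together with the 35 degrees of freedom from the note. Because the inputs are rounded, the result should agree with the table only up to rounding.

```sas
data _null_;
   estimate = 0.4463;   /* Race 1 estimate as displayed in the table  */
   stderr   = 0.1574;   /* its displayed standard error               */
   df       = 35;       /* degrees of freedom reported in the note    */
   t = estimate / stderr;             /* t value = estimate / SE      */
   p = 2 * (1 - probt(abs(t), df));   /* two-sided p-value            */
   put t= 8.2 p= 8.4;                 /* write the results to the log */
run;
```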