The SURVEYIMPUTE Procedure

Example 110.2 Fully Efficient Fractional Imputation

This example illustrates the fully efficient fractional imputation (FEFI) method by using the data set DrugAbuse, which comes from the fictitious survey of drug abusers in Example 110.1. The survey collects information about substances that are used (such as drugs, alcohol, and marijuana), along with insurance information and treatment information. Some participants did not respond to all questions, so the data set contains missing values for many variables. The data set contains 736 observation units in 35 PSUs and 10 strata, and the sum of the weights is 19,600.
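Before imputing, it can help to see how much is missing for each item. The following statements are a minimal sketch and are not part of the original example. They assume that the items of interest are the six variables that this example later lists in the VAR statement of PROC SURVEYIMPUTE (Sex, Race, Insurance, Drug, Alcohol, and Treatment). The MISSING option in the TABLES statement makes PROC FREQ treat missing values as a separate level of each variable.

```sas
/* Sketch: one-way frequency tables that show missing values as a level.  */
/* Variable names are the ones used in the VAR statement of this example. */
proc freq data=DrugAbuse;
   tables Sex Race Insurance Drug Alcohol Treatment / missing;
run;
```

These counts are unweighted; they describe the sample, not the population that the weights represent.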

As in Example 110.1, to impute the missing items, you first need to decide whether to impute within imputation cells. Imputation cells divide the data into groups of similar units such that the recipient units share similar characteristics with the donor units in the same group. For example, it is reasonable to believe that different age groups, races, and income categories might have different responses to the drug abuse survey. You can use these characteristics to create imputation cells. Characteristics for imputation cells might come from the same survey or from other sources such as census data or previous surveys. In this example, assume that the imputation cells are available as a variable called ImputationCell in the data set.
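One informal way to judge a set of imputation cells is to count, within each cell, how many units responded to each item (potential donors) and how many did not (recipients). The following sketch is an illustration, not part of the original example. It assumes that the items are stored as numeric variables, as the values displayed in Output 110.2.2 suggest; for character variables, PROC FREQ with the MISSING option would be a closer fit.

```sas
/* Sketch: per imputation cell, count nonmissing (N) and missing (NMISS) */
/* values for each item. Assumes that the six items are numeric.         */
proc means data=DrugAbuse n nmiss;
   class ImputationCell;
   var Sex Race Insurance Drug Alcohol Treatment;
run;
```

The counts are per item, not per joint category, so they give only a rough indication of whether a cell has enough donors.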

The following statements request that the missing items be imputed by using the FEFI method:

proc surveyimpute data=DrugAbuse method=FEFI varmethod=Jackknife;
   class Sex Race Insurance Drug Alcohol Treatment;
   var   Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   strata Strata;
   cluster PSU;
   weight ObsWeight;
   output out=DrugAbuseFEFI outjkcoefs=DrugAbuseJKCOEFS;
run;

The PROC SURVEYIMPUTE statement invokes the procedure, the DATA= option specifies the input data set DrugAbuse, the METHOD= option requests the FEFI method, and the VARMETHOD= option requests that imputation-adjusted jackknife replicate weights be created. The VAR statement specifies the variables that are to be imputed, and the CELLS statement identifies the imputation cell variable ImputationCell. Because no IMPJOINT statements are specified, all the variables in the VAR statement are to be imputed by using their joint categories. For more information, see the section IMPJOINT Statement. The STRATA, CLUSTER, and WEIGHT statements specify the strata, cluster, and weight variables. The OUT= option in the OUTPUT statement names the output data set DrugAbuseFEFI to store the imputed values, and the OUTJKCOEFS= option in the OUTPUT statement names the output data set DrugAbuseJKCOEFS to store the jackknife coefficients.

Summary information about the imputation method, number of observations, and survey design is shown in Output 110.2.1. The "Imputation Information" table summarizes the imputation method. The "Number of Observations" table displays the number of observations that are read and used (736) and the weighted number of observations that are read and used (19,599.99, or approximately 19,600) by PROC SURVEYIMPUTE. The "Design Summary" table shows that there are 10 strata and 35 clusters (PSUs).

Output 110.2.1: Imputation Information

The SURVEYIMPUTE Procedure

Imputation Information
Data Set WORK.DRUGABUSE
Weight Variable ObsWeight
Stratum Variable Strata
Cluster Variable PSU
Imputation Method FEFI

Number of Observations Read 736
Number of Observations Used 736
Sum of Weights Read 19599.99
Sum of Weights Used 19599.99

Design Summary
Number of Strata 10
Number of Clusters 35



Selected observations for some variables from the output data set are displayed in Output 110.2.2. The output data set, DrugAbuseFEFI, contains the unit identification (UnitID), the recipient index (Recipient), the imputation-adjusted full-sample weights (ImpWt), the imputation-adjusted jackknife replicate weights (ImpRepWt_1, ..., ImpRepWt_35), and all the variables from the input data set DrugAbuse. The output data set contains 35 sets of replicate weights, but only the first three sets are shown in Output 110.2.2. Units that are complete respondents have one row in the output data set, but units that are incomplete respondents have multiple rows. For example, unit 21 is a complete respondent, so it has only one row in the output data set and its Recipient value is 0. Unit 22 is an incomplete respondent; it has 20 rows in the output data set, its Recipient values range from 1 to 20, and its imputation-adjusted full-sample weights (ImpWt) range from about 0.02 to about 1.02. The sum of ImpWt over all rows (donor cells) for observation unit 22 is 5, which is the full-sample weight (ObsWeight) for unit 22.

Output 110.2.2: Observations for Selected Units

UnitID Recipient ImpWt Strata PSU ObsWeight ImputationCell Sex Race Insurance Drug Alcohol Treatment ImpRepWt_1 ImpRepWt_2 ImpRepWt_3
20 0 5.00000 1 1 5 1 0 1 1 1 2 1 0.00000 6.66667 6.66667
21 0 5.00000 1 2 5 1 0 3 3 1 1 1 6.66667 0.00000 6.66667
22 1 0.49238 1 2 5 2 1 1 1 1 1 1 0.65757 0.00000 0.66513
22 2 1.01597 1 2 5 2 1 1 1 1 2 1 1.36836 0.00000 1.33813
22 3 0.14661 1 2 5 2 1 1 1 1 2 2 0.18561 0.00000 0.19845
22 4 0.11496 1 2 5 2 1 1 1 2 1 1 0.15441 0.00000 0.15268
22 5 0.03894 1 2 5 2 1 1 1 2 1 2 0.05231 0.00000 0.05172
22 6 0.32730 1 2 5 2 1 1 1 2 2 1 0.44358 0.00000 0.43861
22 7 0.21661 1 2 5 2 1 1 1 2 2 2 0.29074 0.00000 0.28749
22 8 0.24024 1 2 5 2 1 1 2 1 1 1 0.32268 0.00000 0.31907
22 9 0.10453 1 2 5 2 1 1 2 1 1 2 0.12908 0.00000 0.14255
22 10 0.46726 1 2 5 2 1 1 2 1 2 1 0.61595 0.00000 0.62442
22 11 0.19382 1 2 5 2 1 1 2 1 2 2 0.26067 0.00000 0.25730
22 12 0.37531 1 2 5 2 1 1 2 2 2 1 0.50409 0.00000 0.49845
22 13 0.03697 1 2 5 2 1 1 2 2 2 2 0.04966 0.00000 0.04911
22 14 0.15533 1 2 5 2 1 1 3 1 1 1 0.19731 0.00000 0.21002
22 15 0.08672 1 2 5 2 1 1 3 1 1 2 0.11647 0.00000 0.11517
22 16 0.69416 1 2 5 2 1 1 3 1 2 1 0.93613 0.00000 0.92566
22 17 0.03688 1 2 5 2 1 1 3 1 2 2 0.04953 0.00000 0.04897
22 18 0.02106 1 2 5 2 1 1 3 2 1 1 0.02828 0.00000 0.02796
22 19 0.18442 1 2 5 2 1 1 3 2 2 1 0.23638 0.00000 0.24865
22 20 0.05053 1 2 5 2 1 1 3 2 2 2 0.06788 0.00000 0.06712
23 0 5.00000 1 2 5 2 1 1 1 1 1 1 6.66667 0.00000 6.66667
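The row structure described for Output 110.2.2 can be checked directly. The following statements are a sketch, not part of the original example. They assume that UnitID is a numeric variable and use the variable names shown in the column headings of Output 110.2.2; the aliases NRows, SumImpWt, and FullWt are illustrative names. The first step lists the rows for a few units, and the second step counts the rows per unit and compares the sum of ImpWt with the original weight.

```sas
/* Sketch: list the rows for a few units (UnitID assumed to be numeric). */
proc print data=DrugAbuseFEFI noobs;
   where UnitID in (20, 21, 22, 23);
   var UnitID Recipient ImpWt ObsWeight ImpRepWt_1-ImpRepWt_3;
run;

/* Sketch: per unit, count the rows and compare the sum of the          */
/* fractional weights (ImpWt) with the original weight (ObsWeight).     */
proc sql;
   select UnitID,
          count(*)       as NRows,
          sum(ImpWt)     as SumImpWt,
          max(ObsWeight) as FullWt
   from DrugAbuseFEFI
   where UnitID in (20, 21, 22, 23)
   group by UnitID;
quit;
```

For the rows displayed in Output 110.2.2, unit 22 should show 20 rows and a sum of ImpWt equal to its original weight of 5, up to rounding.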



You can use the imputed data set and the imputation-adjusted replicate weights to compute estimators from your imputed data. You can use the REPWEIGHTS statement in any SAS/STAT survey analysis procedure to specify the imputation-adjusted replicate weights. For example, the following statements use PROC SURVEYLOGISTIC to perform a logistic regression analysis of the imputed data:

proc surveylogistic data=DrugAbuseFEFI varmethod=Jackknife;
   class Treatment Insurance Sex Race;
   model Drug=Treatment Insurance Age Sex Race;
   weight ImpWt;
   repweights ImpRepWt_: / jkcoefs=DrugAbuseJKCOEFS;
   ods output parameterestimates=FEFILogisticAnalysis;
run;

The WEIGHT statement specifies the imputation-adjusted full-sample weight (ImpWt), and the REPWEIGHTS statement specifies the imputation-adjusted replicate weights (ImpRepWt_1, …, ImpRepWt_35). The JKCOEFS= option in the REPWEIGHTS statement specifies the jackknife coefficients.

The parameter estimates and their standard errors are displayed in Output 110.2.3. The variance estimators correctly account for both the design variability and the imputation variability.

Output 110.2.3: Logistic Regression Analysis of the Fractionally Imputed Data Set

The SURVEYLOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
Parameter   Estimate Standard Error t Value Pr > |t|
Intercept   0.5105 0.2281 2.24 0.0317
Treatment 1 0.1048 0.1600 0.65 0.5168
Insurance 1 -0.1152 0.1454 -0.79 0.4337
Insurance 2 -0.0461 0.1183 -0.39 0.6988
Age   0.00195 0.00459 0.42 0.6736
Sex 0 -0.0719 0.0931 -0.77 0.4452
Race 1 0.4463 0.1574 2.84 0.0075
Race 2 -0.1212 0.2110 -0.57 0.5695
NOTE: The degrees of freedom for the t tests is 35.
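
The same WEIGHT and REPWEIGHTS specification carries over to the other survey analysis procedures. As a hedged sketch that is not part of the original example, the following statements use PROC SURVEYFREQ to request a two-way table of Drug by Treatment from the imputed data. The choice of table is purely illustrative; the data set and variable names are the ones created earlier in this example.

```sas
/* Sketch: two-way table from the fractionally imputed data.           */
/* ImpWt is the imputation-adjusted full-sample weight; ImpRepWt_:     */
/* selects all replicate-weight variables that share that prefix.      */
proc surveyfreq data=DrugAbuseFEFI varmethod=jackknife;
   tables Drug*Treatment;
   weight ImpWt;
   repweights ImpRepWt_: / jkcoefs=DrugAbuseJKCOEFS;
run;
```

Because the replicate weights are supplied directly, the STRATA and CLUSTER statements are omitted here, just as in the PROC SURVEYLOGISTIC step above.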