This example illustrates the fully efficient fractional imputation (FEFI) method by using the data set DrugAbuse from Example 110.1, which comes from a fictitious survey of drug abusers. The survey collects information about substances that are used (such as drugs, alcohol, and marijuana) along with insurance
information and treatment information. Some participants did not respond to all questions. The data set contains 736 observation
units in 35 PSUs and 10 strata. The sum of the weights is 19,600. The data set contains missing values for many variables.
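The counts quoted above can be confirmed from the input data before any imputation is run. The following is a minimal sketch, assuming the design variables are named Strata, PSU, and ObsWeight as in the PROC SURVEYIMPUTE step of this example; the column aliases are illustrative. Counting distinct Strata and PSU pairs works whether PSU identifiers are unique overall or only within strata.

```sas
proc sql;
   /* Unweighted count, weight total, and design counts for the input data */
   select count(*)                               as NObs,
          sum(ObsWeight)                         as SumWeights,
          count(distinct Strata)                 as NStrata,
          count(distinct catx('_', Strata, PSU)) as NPSUs
   from DrugAbuse;
quit;
```

If the data set matches the description, the query should report 736 observations, a weight total of 19,600, 10 strata, and 35 PSUs.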
As in Example 110.1, to impute the missing items, you first need to decide whether to impute within imputation cells. Imputation cells divide
the data into groups of similar units such that the recipient units share similar characteristics with the donor units in
the same group. For example, it is reasonable to believe that different age groups, races, and income categories might have
different responses to the drug abuse survey. You can use these characteristics to create imputation cells. Characteristics
for imputation cells might come from the same survey or from other sources such as census data or previous surveys. In this
example, assume that the imputation cells are available as a variable called ImputationCell
in the data set.
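Before relying on a cell variable, it can help to see how the missing items are distributed across cells, because a cell with few respondents for an item offers few donors. The following is a minimal sketch, assuming that ImputationCell, Drug, Alcohol, and Treatment are the variable names used later in this example. The MISSING option makes PROC FREQ tabulate missing values as a separate level, and the counts are unweighted.

```sas
proc freq data=DrugAbuse;
   /* One two-way table per item: imputation cell by item level, */
   /* with missing values shown as their own level               */
   tables ImputationCell*(Drug Alcohol Treatment)
          / missing norow nocol nopercent;
run;
```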
The following statements request that the missing items be imputed by using the FEFI method:
proc surveyimpute data=DrugAbuse method=FEFI varmethod=Jackknife;
   class Sex Race Insurance Drug Alcohol Treatment;
   var Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   strata Strata;
   cluster PSU;
   weight ObsWeight;
   output out=DrugAbuseFEFI outjkcoefs=DrugAbuseJKCOEFS;
run;
The PROC SURVEYIMPUTE statement invokes the procedure, the DATA= option specifies the input data set DrugAbuse, the METHOD= option requests the FEFI method, and the VARMETHOD= option requests that imputation-adjusted jackknife replicate weights be created. The VAR statement specifies the variables that are to be imputed, and the CELLS statement identifies the imputation cell variable ImputationCell. Because no IMPJOINT statements are specified, all the variables in the VAR statement are to be imputed by using their joint categories. For more information, see the section IMPJOINT Statement. The STRATA, CLUSTER, and WEIGHT statements specify the strata, cluster, and weight variables. The OUT= option in the OUTPUT statement names the output data set DrugAbuseFEFI to store the imputed values, and the OUTJKCOEFS= option in the OUTPUT statement names the output data set DrugAbuseJKCOEFS to store the jackknife coefficients.
Summary information about the imputation method, number of observations, and survey design is shown in Output 110.2.1. The "Imputation Information" table summarizes the imputation method. The "Number of Observations" table displays the number of observations that are read and used (736) and the weighted number of observations that are read and used (19,600) by PROC SURVEYIMPUTE. The "Design Information" table shows that there are 35 PSUs and 10 strata.
Output 110.2.1: Imputation Information
Selected observations for some variables from the output data set are displayed in Output 110.2.2. The output data set, DrugAbuseFEFI, contains unit identification, the recipient index, imputation-adjusted full-sample weights, imputation-adjusted jackknife weights, and all the variables from the input data set DrugAbuse. The output data set contains 35 sets of replicate weights, but only the first three sets of replicate weights are shown in Output 110.2.2. Units that are complete respondents have one row, but units that are incomplete respondents have multiple rows in the output data set. For example, unit 21 is a complete respondent, so it has only one row in the output data set and its Recipient value is 0. Unit 22 is an incomplete respondent; it has 20 rows in the output data set, its Recipient values range from 1 to 20, and its imputation-adjusted full-sample weights (ImpWt) range from 0.02 to 1.02. The sum of ImpWt for all rows (donor cells) for observation unit 22 is 5, which is the full-sample weight for unit 22.
Output 110.2.2: Observations for Selected Units
UnitID | Recipient | ImpWt | Strata | PSU | ObsWeight | ImputationCell | Sex | Race | Insurance | Drug | Alcohol | Treatment | ImpRepWt_1 | ImpRepWt_2 | ImpRepWt_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | 0 | 5.00000 | 1 | 1 | 5 | 1 | 0 | 1 | 1 | 1 | 2 | 1 | 0.00000 | 6.66667 | 6.66667 |
21 | 0 | 5.00000 | 1 | 2 | 5 | 1 | 0 | 3 | 3 | 1 | 1 | 1 | 6.66667 | 0.00000 | 6.66667 |
22 | 1 | 0.49238 | 1 | 2 | 5 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0.65757 | 0.00000 | 0.66513 |
22 | 2 | 1.01597 | 1 | 2 | 5 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1.36836 | 0.00000 | 1.33813 |
22 | 3 | 0.14661 | 1 | 2 | 5 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 0.18561 | 0.00000 | 0.19845 |
22 | 4 | 0.11496 | 1 | 2 | 5 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 0.15441 | 0.00000 | 0.15268 |
22 | 5 | 0.03894 | 1 | 2 | 5 | 2 | 1 | 1 | 1 | 2 | 1 | 2 | 0.05231 | 0.00000 | 0.05172 |
22 | 6 | 0.32730 | 1 | 2 | 5 | 2 | 1 | 1 | 1 | 2 | 2 | 1 | 0.44358 | 0.00000 | 0.43861 |
22 | 7 | 0.21661 | 1 | 2 | 5 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 0.29074 | 0.00000 | 0.28749 |
22 | 8 | 0.24024 | 1 | 2 | 5 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 0.32268 | 0.00000 | 0.31907 |
22 | 9 | 0.10453 | 1 | 2 | 5 | 2 | 1 | 1 | 2 | 1 | 1 | 2 | 0.12908 | 0.00000 | 0.14255 |
22 | 10 | 0.46726 | 1 | 2 | 5 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 0.61595 | 0.00000 | 0.62442 |
22 | 11 | 0.19382 | 1 | 2 | 5 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 0.26067 | 0.00000 | 0.25730 |
22 | 12 | 0.37531 | 1 | 2 | 5 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 0.50409 | 0.00000 | 0.49845 |
22 | 13 | 0.03697 | 1 | 2 | 5 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 0.04966 | 0.00000 | 0.04911 |
22 | 14 | 0.15533 | 1 | 2 | 5 | 2 | 1 | 1 | 3 | 1 | 1 | 1 | 0.19731 | 0.00000 | 0.21002 |
22 | 15 | 0.08672 | 1 | 2 | 5 | 2 | 1 | 1 | 3 | 1 | 1 | 2 | 0.11647 | 0.00000 | 0.11517 |
22 | 16 | 0.69416 | 1 | 2 | 5 | 2 | 1 | 1 | 3 | 1 | 2 | 1 | 0.93613 | 0.00000 | 0.92566 |
22 | 17 | 0.03688 | 1 | 2 | 5 | 2 | 1 | 1 | 3 | 1 | 2 | 2 | 0.04953 | 0.00000 | 0.04897 |
22 | 18 | 0.02106 | 1 | 2 | 5 | 2 | 1 | 1 | 3 | 2 | 1 | 1 | 0.02828 | 0.00000 | 0.02796 |
22 | 19 | 0.18442 | 1 | 2 | 5 | 2 | 1 | 1 | 3 | 2 | 2 | 1 | 0.23638 | 0.00000 | 0.24865 |
22 | 20 | 0.05053 | 1 | 2 | 5 | 2 | 1 | 1 | 3 | 2 | 2 | 2 | 0.06788 | 0.00000 | 0.06712 |
23 | 0 | 5.00000 | 1 | 2 | 5 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 6.66667 | 0.00000 | 6.66667 |
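The weight property described for unit 22 (its fractional weights add up to its original weight) can be checked for every unit in the output data set. The following is a minimal sketch, assuming that UnitID uniquely identifies an observation unit and that ObsWeight is carried over unchanged from the input data; the tolerance of 1e-6 and the column aliases are arbitrary choices for illustration.

```sas
proc sql;
   /* List any unit whose fractional weights do not add up */
   /* to its original full-sample weight                   */
   select UnitID,
          count(*)       as NRows,
          sum(ImpWt)     as SumImpWt,
          max(ObsWeight) as FullWt
   from DrugAbuseFEFI
   group by UnitID
   having abs(sum(ImpWt) - max(ObsWeight)) > 1e-6;
quit;
```

An empty result would indicate that the property holds for every unit. For unit 22, the 20 ImpWt values shown in Output 110.2.2 add to 5.00000.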
You can use the imputed data set and the imputation-adjusted replicate weights to compute any estimators from your imputed data. You can use the REPWEIGHTS statement in any SAS/STAT survey analysis procedure to specify the imputation-adjusted replicate weights. For example, the following statements use PROC SURVEYLOGISTIC to perform logistic regression analysis of the imputed data:
proc surveylogistic data=DrugAbuseFEFI varmethod=Jackknife;
   class Treatment Insurance Sex Race;
   model Drug=Treatment Insurance Age Sex Race;
   weight ImpWt;
   repweights ImpRepWt_: / jkcoefs=DrugAbuseJKCOEFS;
   ods output parameterestimates=FEFILogisticAnalysis;
run;
The WEIGHT statement specifies the imputation-adjusted full-sample weight (ImpWt), and the REPWEIGHTS statement specifies the imputation-adjusted replicate weights (ImpRepWt_1, …, ImpRepWt_35). The JKCOEFS= option in the REPWEIGHTS statement specifies the jackknife coefficients.
The parameter estimates and their standard errors are displayed in Output 110.2.3. The variance estimators correctly account for both the design variability and the imputation variability.
Output 110.2.3: Logistic Regression Analysis of the Fractionally Imputed Data Set
Analysis of Maximum Likelihood Estimates

Parameter | | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | | 0.5105 | 0.2281 | 2.24 | 0.0317 |
Treatment | 1 | 0.1048 | 0.1600 | 0.65 | 0.5168 |
Insurance | 1 | -0.1152 | 0.1454 | -0.79 | 0.4337 |
Insurance | 2 | -0.0461 | 0.1183 | -0.39 | 0.6988 |
Age | | 0.00195 | 0.00459 | 0.42 | 0.6736 |
Sex | 0 | -0.0719 | 0.0931 | -0.77 | 0.4452 |
Race | 1 | 0.4463 | 0.1574 | 2.84 | 0.0075 |
Race | 2 | -0.1212 | 0.2110 | -0.57 | 0.5695 |

NOTE: The degrees of freedom for the t tests is 35.
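The WEIGHT and REPWEIGHTS pattern used for the logistic regression is not specific to PROC SURVEYLOGISTIC. As an illustration only, the following sketch estimates the weighted distribution of Drug from the imputed data with PROC SURVEYFREQ. It assumes that the REPWEIGHTS statement in PROC SURVEYFREQ accepts the JKCOEFS= data set in the same way as in the PROC SURVEYLOGISTIC step; the choice of Drug as the analysis variable is arbitrary.

```sas
proc surveyfreq data=DrugAbuseFEFI varmethod=Jackknife;
   tables Drug;                 /* one-way table of the imputed item   */
   weight ImpWt;                /* imputation-adjusted full-sample wt  */
   repweights ImpRepWt_: / jkcoefs=DrugAbuseJKCOEFS;
run;
```

Because each incomplete respondent contributes several rows whose ImpWt values add up to its original weight, the weighted totals remain on the scale of the original sample.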