This example illustrates the approximate Bayesian bootstrap hot-deck imputation method by using a simulated data set from a fictitious survey of drug abusers. A stratified clustered sample of drug abuse treatment centers is taken from a list of available treatment centers. The list is first stratified by geographic location. From each stratum, two or three treatment centers are sampled as the primary sampling units (PSUs). Data are collected from individual patients within the selected treatment centers. The survey collects information about the substances that the patients used (such as drugs, alcohol, and marijuana) along with insurance information and treatment information.
The data set contains 736 observation units in 35 PSUs and 10 strata. The sum of the weights is 19,600. Therefore, the survey data represent a population of 19,600 patients from the study area. Some participants did not respond to all questions. The data set contains missing values in many variables.
To impute the missing items, you first need to decide whether to impute within imputation cells. Imputation cells divide the data into groups of similar units such that the recipient units share similar characteristics with the donor units in the same group. For example, it is reasonable to believe that different age groups, races, and income categories might have different responses to the drug abuse survey. You can use these characteristics to create imputation cells. Characteristics for imputation cells might come from the same survey or might come from other sources such as census data or previous surveys. In this example, assume that the imputation cells are available as a variable called ImputationCell in the data set.
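If a cell variable were not already available, one could be derived from observed characteristics before imputation. The following is only a minimal sketch: the age cutoffs (35 and 60), the variable name AgeCell, and the data set name DrugAbuseCells are hypothetical and are not part of this example.

```sas
/* Hypothetical sketch: derive an imputation cell variable from age bands. */
data DrugAbuseCells;
   set DrugAbuse;
   /* Guard first: a missing Age compares as less than any number in SAS. */
   if missing(Age) then AgeCell = .;
   else if Age < 35 then AgeCell = 1;   /* illustrative cutoff */
   else if Age < 60 then AgeCell = 2;   /* illustrative cutoff */
   else AgeCell = 3;
run;
```

A variable that is built this way would then be named in the CELLS statement. Cells should be large enough that each one contains donors for its recipients.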
The data set DrugAbuse contains the following items:

- Strata: stratum identification
- PSU: PSU identification (treatment centers)
- ObsWeight: observation weight for patients
- ImputationCell: imputation cell identification
- Age: age, in years
- Sex: 1 for female and 2 for male
- Race: 1 for white, 2 for black, and 3 for others
- Insurance: 1 if the patient has any insurance, and 2 otherwise
- Drug: 1 if the patient used any drugs in the past three months, and 2 otherwise
- Alcohol: 1 if the patient consumed any alcohol in the past month, and 2 otherwise
- Treatment: 1 if the patient is being treated for the first time, and 2 otherwise
data DrugAbuse;
   input Strata PSU ObsWeight ImputationCell Age Sex Race
         Insurance Drug Alcohol Treatment;
   datalines;
1 1 5 1 74 1 1 3 2 2 1
1 1 5 1 20 0 3 1 2 2 1
1 1 5 3 42 1 2 1 1 2 1
1 1 5 3 65 1 3 2 1 2 1
1 1 5 2 53 1 1 1 1 1 1
1 1 5 3 49 1 1 1 1 2 1
1 1 5 3 51 0 2 1 2 1 1
1 1 5 2 77 0 3 1 1 1 1
1 1 5 2 26 1 1 1 1 2 1
1 1 5 3 28 0 3 1 1 1 1
1 1 5 1 71 1 1 1 2 2 1
1 1 5 2 72 1 1 3 2 2 1
1 1 5 3 24 1 1 1 1 1 1
1 1 5 2 65 1 1 2 1 1 2
1 1 5 3 47 1 1 1 1 1 1
1 1 5 2 37 1 1 2 1 2 1
1 1 5 2 46 1 1 3 1 1 1
1 1 5 2 52 1 1 1 1 2 2
1 1 5 3 60 0 3 1 1 2 1
1 1 5 1 31 0 1 1 1 2 1
1 2 5 1 23 0 3 3 1 1 1
1 2 5 2 78 1 1 . . . .
1 2 5 2 29 1 1 1 1 1 1
1 2 5 2 21 . . . . . .

   ... more lines ...

10 4 55.5556 1 40 0 3 2 1 1 2
10 4 55.5556 1 32 1 3 1 2 2 1
10 4 55.5556 3 68 0 1 2 2 1 2
10 4 55.5556 3 35 1 1 2 1 2 2
;
The following statements request that the missing items be imputed by using the approximate Bayesian bootstrap hot-deck imputation method:
proc surveyimpute data=DrugAbuse method=hotdeck(selection=abb)
                  ndonors=5 seed=773269;
   var Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   output out=DrugAbuseABB;
run;
The PROC SURVEYIMPUTE statement invokes the procedure, the DATA= option specifies the input data set DrugAbuse, the METHOD= option requests the hot-deck imputation method, the METHOD=HOTDECK(SELECTION=ABB) option requests the approximate Bayesian bootstrap method, the NDONORS= option requests five donor units for every recipient unit, and the SEED= option specifies the random number generator seed. The VAR statement specifies the variables that are to be imputed, the CELLS statement identifies the imputation cell variable ImputationCell, and the OUT= option in the OUTPUT statement names the output data set DrugAbuseABB.
You do not need to use the WEIGHT, STRATA, and CLUSTER statements for the approximate Bayesian bootstrap method unless you want to create the jackknife replication weights by including the VARMETHOD=JACKKNIFE option in the PROC SURVEYIMPUTE statement. The selection of donors does not use the design information. However, if you want to select donors from the same stratum or the same group of clusters, then you must include that information in the imputation cell.
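The jackknife variant that is mentioned above might look like the following sketch. It assumes that the design variables in DrugAbuse (Strata, PSU, and ObsWeight) are the ones to supply; the output data set name DrugAbuseABBJK is hypothetical.

```sas
/* Sketch: same imputation request, plus design statements so that      */
/* jackknife replicate weights can be created (VARMETHOD=JACKKNIFE).    */
proc surveyimpute data=DrugAbuse method=hotdeck(selection=abb)
                  ndonors=5 seed=773269 varmethod=jackknife;
   var Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   strata Strata;
   cluster PSU;
   weight ObsWeight;
   output out=DrugAbuseABBJK;   /* hypothetical data set name */
run;
```

Donor selection is still driven only by the imputation cells in this sketch; the design statements are used for the replicate weights.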
Summary information about the imputation method, number of observations, and missing data patterns is shown in Output 110.1.1. The "Imputation Information" table summarizes the imputation method. The "Number of Observations" table shows that PROC SURVEYIMPUTE read and used all 736 observations. The "Missing Data Pattern" table displays the missing patterns in the data set. There are four different missing data pattern groups: all items observed, one item missing, four items missing, and all items missing. Of the observation units, 92.53% have all items observed; 4.62% have missing values in Treatment; 1.77% have missing values in Insurance, Drug, Alcohol, and Treatment; and 1.09% have missing values in all variables. Because the WEIGHT statement is not specified, these percentages represent the percentages of missing units in the input data.
Output 110.1.1: Imputation Summary
Missing Data Patterns

Group | Sex | Race | Insurance | Drug | Alcohol | Treatment | Freq | Sum of Weights | Unweighted Percent | Weighted Percent | Group Mean: Sex | Group Mean: Race | Group Mean: Insurance | Group Mean: Drug | Group Mean: Alcohol | Group Mean: Treatment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | X | X | X | X | X | X | 681 | 681 | 92.53 | 92.53 | 0.566814 | 1.491924 | 1.718062 | 1.292217 | 1.716593 | 1.201175 |
2 | X | X | X | X | X | . | 34 | 34 | 4.62 | 4.62 | 0.500000 | 1.441176 | 1.588235 | 1.294118 | 1.588235 | . |
3 | X | X | . | . | . | . | 13 | 13 | 1.77 | 1.77 | 0.692308 | 1.230769 | . | . | . | . |
4 | . | . | . | . | . | . | 8 | 8 | 1.09 | 1.09 | . | . | . | . | . | . |
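As a quick check, the unweighted percentages in Output 110.1.1 follow directly from the Freq column and the 736 observation units (681 + 34 + 13 + 8 = 736):

$$
\frac{681}{736} \approx 92.53\%, \qquad
\frac{34}{736} \approx 4.62\%, \qquad
\frac{13}{736} \approx 1.77\%, \qquad
\frac{8}{736} \approx 1.09\%
$$

The weighted and unweighted percentages coincide here because no WEIGHT statement is specified, so every unit counts once.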
Some selected observations from the output data set are displayed in Output 110.1.2. The output data set DrugAbuseABB contains the unit identification, the recipient index, and all the variables from the input data set DrugAbuse. Units that are complete respondents have one row, but units that are incomplete respondents have five rows in the output data set. For example, unit 21 is a complete respondent, so it has only one row in the output data set and its Recipient value is 0. Unit 22 is an incomplete respondent, so it has five rows in the output data set and its Recipient values range from 1 to 5.
Output 110.1.2: Observations for Some Selected Units
UnitID | Recipient | Strata | PSU | ObsWeight | ImputationCell | Age | Sex | Race | Insurance | Drug | Alcohol | Treatment |
---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | 0 | 1 | 1 | 5 | 1 | 31 | 0 | 1 | 1 | 1 | 2 | 1 |
21 | 0 | 1 | 2 | 5 | 1 | 23 | 0 | 3 | 3 | 1 | 1 | 1 |
22 | 1 | 1 | 2 | 5 | 2 | 78 | 1 | 1 | 1 | 1 | 1 | 1 |
22 | 2 | 1 | 2 | 5 | 2 | 78 | 1 | 1 | 1 | 1 | 1 | 1 |
22 | 3 | 1 | 2 | 5 | 2 | 78 | 1 | 1 | 2 | 1 | 2 | 1 |
22 | 4 | 1 | 2 | 5 | 2 | 78 | 1 | 1 | 3 | 1 | 2 | 1 |
22 | 5 | 1 | 2 | 5 | 2 | 78 | 1 | 1 | 1 | 1 | 2 | 2 |
23 | 0 | 1 | 2 | 5 | 2 | 29 | 1 | 1 | 1 | 1 | 1 | 1 |
24 | 1 | 1 | 2 | 5 | 2 | 21 | 1 | 2 | 2 | 1 | 2 | 2 |
24 | 2 | 1 | 2 | 5 | 2 | 21 | 1 | 3 | 1 | 1 | 1 | 1 |
24 | 3 | 1 | 2 | 5 | 2 | 21 | 1 | 2 | 2 | 1 | 2 | 1 |
24 | 4 | 1 | 2 | 5 | 2 | 21 | 1 | 1 | 2 | 1 | 2 | 1 |
24 | 5 | 1 | 2 | 5 | 2 | 21 | 1 | 2 | 2 | 2 | 1 | 1 |
25 | 0 | 1 | 2 | 5 | 3 | 85 | 0 | 1 | 3 | 1 | 2 | 1 |
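A listing like Output 110.1.2 could be produced with PROC PRINT. This is a sketch only; the unit range 20 to 25 is inferred from the rows that are displayed, and the statements that were actually used to create the listing are not shown in this example.

```sas
/* Sketch: list the output rows for units 20 through 25. */
proc print data=DrugAbuseABB noobs;
   where UnitID between 20 and 25;
   var UnitID Recipient Strata PSU ObsWeight ImputationCell
       Age Sex Race Insurance Drug Alcohol Treatment;
run;
```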
Suppose you want to perform a logistic regression analysis by using the imputed data set. If you want to use the multiple imputation variance estimator that is available in the MIANALYZE procedure with the imputed data set, then you need to create one complete data set for every imputation. The following SAS statements create five complete data sets and then merge the five data sets into one. Each complete data set contains the complete respondents and only one donor unit for the incomplete respondents. Each data set also contains the imputation number (_Imputation_).
data DAIMP;
   set DrugAbuseABB;
   if (Recipient = 0) then do;      /* Include complete respondents */
      do _Imputation_=1 to 5;       /* in all imputations.          */
         output;
      end;
   end;
   else do;                         /* Put incomplete respondents   */
      _Imputation_ = Recipient;     /* in separate imputations.     */
      output;
   end;
run;

proc sort data=DAIMP;
   by _Imputation_ UnitID;
run;
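One way to sanity-check the stacked data set DAIMP is to count rows per imputation. This sketch assumes, as described above, that every complete respondent has one row and every incomplete respondent has exactly five rows in DrugAbuseABB; under that assumption each value of _Imputation_ should account for all 736 units (681 complete plus 55 incomplete respondents).

```sas
/* Sketch: each imputation should contain one row per observation unit. */
proc freq data=DAIMP;
   tables _Imputation_ / nocum nopercent;
run;
```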
The following SAS statements first use the SURVEYLOGISTIC procedure (see Chapter 111: The SURVEYLOGISTIC Procedure) to perform a separate logistic regression analysis within each imputed data set and then use the MIANALYZE procedure (see Chapter 76: The MIANALYZE Procedure) to combine the logistic regression results from the five imputed data sets:
ods select none;
proc surveylogistic data=DAIMP;
   by _imputation_;
   class Treatment Insurance Sex Race;
   strata Strata;
   cluster PSU;
   weight ObsWeight;
   model Drug=Treatment Insurance Age Sex Race / covb;
   ods output parameterestimates=Estimates covb=Covariances;
run;
ods select all;
proc mianalyze parms(classvar=classval)=Estimates
               covb(effectvar=stacking)=Covariances edf=25;
   class Treatment Insurance Sex Race;
   modeleffects Intercept Treatment Insurance Age Sex Race;
   ods output parameterestimates=ABBLogisticAnalysis;
run;
Although the survey design information was not directly used in the imputation, you must use the complete design information, including strata, clusters, and weights, to estimate the design variance within each imputed data set. The STRATA, CLUSTER, and WEIGHT statements in PROC SURVEYLOGISTIC specify the design information. However, separate logistic regression results from any single imputed data set should not be used for inference.
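For reference, the standard multiple imputation combining rules can be written as follows. This is a sketch of the usual formulas in notation that is introduced here, not a transcription of the procedure's internals: $\hat{Q}_i$ is a parameter estimate from imputed data set $i$, $\hat{U}_i$ is its design-based variance estimate, and $m = 5$ is the number of imputations.

$$
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right)B
$$

The combined estimate is $\bar{Q}$ and its standard error is $\sqrt{T}$. The between-imputation component $B$ is what any single imputed data set cannot provide, which is why the separate results should not be used for inference on their own.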
Degrees of freedom for survey data are often much smaller than the number of observation units. In this example, there are 736 observation units but only 35 PSUs in 10 strata, so the Taylor series linearized variance estimator has 25 (35 – 10) degrees of freedom. You should specify the reduced degrees of freedom by using the EDF= option in PROC MIANALYZE. For more information, see the section EDF=number in Chapter 76: The MIANALYZE Procedure; also see Barnard and Rubin (1999).
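The effect of the complete-data degrees of freedom can be sketched with the adjustment of Barnard and Rubin (1999). The notation is introduced here for illustration: $\nu_0 = 25$ is the complete-data degrees of freedom, $m$ is the number of imputations, and $\gamma$ is the ratio of the between-imputation variance $B$ (inflated by $1 + 1/m$) to the total variance $T$.

$$
\gamma = \frac{(1 + 1/m)\,B}{T}, \qquad
\nu_m = \frac{m-1}{\gamma^{2}}, \qquad
\nu_{\text{obs}} = \frac{\nu_0 + 1}{\nu_0 + 3}\,\nu_0\,(1 - \gamma), \qquad
\nu^{*} = \left(\frac{1}{\nu_m} + \frac{1}{\nu_{\text{obs}}}\right)^{-1}
$$

With $\nu_0 = 25$, the adjusted degrees of freedom cannot exceed $\frac{26}{28}\cdot 25 \approx 23.21$, no matter how little between-imputation variation there is.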
The estimated regression parameters and their standard errors from a multiply imputed data set are shown in Output 110.1.3.
Output 110.1.3: Logistic Regression Analysis Using a Multiply Imputed Data Set
Parameter Estimates (5 Imputations)

Parameter | Treatment | Insurance | Sex | Race | Estimate | Std Error | 95% Lower Confidence Limit | 95% Upper Confidence Limit | DF | Minimum | Maximum | Theta0 | t for H0: Parameter=Theta0 | Pr > \|t\| |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Intercept |  |  |  |  | 0.527080 | 0.232094 | 0.04656 | 1.007603 | 22.658 | 0.492600 | 0.574832 | 0 | 2.27 | 0.0330 |
Treatment | 1 |  |  |  | 0.094937 | 0.153297 | -0.22225 | 0.412121 | 22.916 | 0.078459 | 0.114233 | 0 | 0.62 | 0.5418 |
Insurance |  | 1 |  |  | -0.128475 | 0.144781 | -0.42822 | 0.171268 | 22.671 | -0.157661 | -0.108998 | 0 | -0.89 | 0.3842 |
Insurance |  | 2 |  |  | -0.038353 | 0.119353 | -0.28536 | 0.208658 | 22.816 | -0.059264 | -0.024690 | 0 | -0.32 | 0.7509 |
Age |  |  |  |  | 0.001926 | 0.004635 | -0.00767 | 0.011524 | 22.577 | 0.000858 | 0.002539 | 0 | 0.42 | 0.6817 |
Sex |  |  | 0 |  | -0.088823 | 0.088702 | -0.27265 | 0.095000 | 22.279 | -0.101860 | -0.063631 | 0 | -1.00 | 0.3274 |
Race |  |  |  | 1 | 0.436564 | 0.156414 | 0.11307 | 0.760060 | 23.092 | 0.418673 | 0.443484 | 0 | 2.79 | 0.0104 |
Race |  |  |  | 2 | -0.111091 | 0.195368 | -0.51509 | 0.292904 | 23.16 | -0.118899 | -0.096429 | 0 | -0.57 | 0.5751 |