The SURVEYIMPUTE Procedure

Example 110.1 Approximate Bayesian Bootstrap Imputation

This example illustrates the approximate Bayesian bootstrap hot-deck imputation method by using a simulated data set from a fictitious survey of drug abusers. A stratified clustered sample of drug abuse treatment centers is taken from a list of available treatment centers. The list is first stratified based on geographic location. From each stratum, two or three treatment centers are sampled as the primary sampling units (PSUs). Data are collected from individual patients within the selected treatment centers. The survey collects information about the substances that the patients used (such as drugs, alcohol, and marijuana) along with insurance information and treatment information.

The data set contains 736 observation units in 35 PSUs and 10 strata. The sum of the weights is 19,600. Therefore, the survey data represent a population of 19,600 patients from the study area. Because some participants did not respond to all questions, the data set contains missing values in many variables.

To impute the missing items, you first need to decide whether to impute within imputation cells. Imputation cells divide the data into groups of similar units such that the recipient units share similar characteristics with the donor units in the same group. For example, it is reasonable to believe that different age groups, races, and income categories might have different responses to the drug abuse survey. You can use these characteristics to create imputation cells. Characteristics for imputation cells might come from the same survey or might come from other sources such as census data or previous surveys. In this example, assume that the imputation cells are available as a variable called ImputationCell in the data set.

The data set DrugAbuse contains the following items:

  • Strata: stratum identification

  • PSU: PSU identification (treatment centers)

  • ObsWeight: observation weight for patients

  • ImputationCell: imputation cell identification

  • Age: age, in years

  • Sex: 1 for female and 2 for male

  • Race: 1 for white, 2 for black, and 3 for others

  • Insurance: 1 if the patient has any insurance, and 2 otherwise

  • Drug: 1 if the patient used any drugs in the past three months, and 2 otherwise

  • Alcohol: 1 if the patient consumed any alcohol in the past month, and 2 otherwise

  • Treatment: 1 if the patient is being treated for the first time, and 2 otherwise

data DrugAbuse;
   input Strata PSU ObsWeight ImputationCell Age Sex Race Insurance 
         Drug Alcohol Treatment;
   datalines;
 1    1     5        1      74  1    1      3       2      2        1    
 1    1     5        1      20  0    3      1       2      2        1    
 1    1     5        3      42  1    2      1       1      2        1    
 1    1     5        3      65  1    3      2       1      2        1    
 1    1     5        2      53  1    1      1       1      1        1    
 1    1     5        3      49  1    1      1       1      2        1    
 1    1     5        3      51  0    2      1       2      1        1    
 1    1     5        2      77  0    3      1       1      1        1    
 1    1     5        2      26  1    1      1       1      2        1    
 1    1     5        3      28  0    3      1       1      1        1    
 1    1     5        1      71  1    1      1       2      2        1    
 1    1     5        2      72  1    1      3       2      2        1    
 1    1     5        3      24  1    1      1       1      1        1    
 1    1     5        2      65  1    1      2       1      1        2    
 1    1     5        3      47  1    1      1       1      1        1    
 1    1     5        2      37  1    1      2       1      2        1    
 1    1     5        2      46  1    1      3       1      1        1    
 1    1     5        2      52  1    1      1       1      2        2    
 1    1     5        3      60  0    3      1       1      2        1    
 1    1     5        1      31  0    1      1       1      2        1    
 1    2     5        1      23  0    3      3       1      1        1    
 1    2     5        2      78  1    1      .       .      .        .    
 1    2     5        2      29  1    1      1       1      1        1    
 1    2     5        2      21  .    .      .       .      .        .    

   ... more lines ...   

10    4  55.5556      1      40  0    3      2       1      1        2    
10    4  55.5556      1      32  1    3      1       2      2        1    
10    4  55.5556      3      68  0    1      2       2      1        2    
10    4  55.5556      3      35  1    1      2       1      2        2    
;
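
The design counts quoted earlier (736 observation units, 35 PSUs, 10 strata, and a weight total of 19,600) can be checked against the data once the DATA step has run. The following statements are a minimal sketch, not part of the original example. They assume that the full DrugAbuse data set (including the rows elided above) is available, and that PSU identifiers might repeat across strata, so PSUs are counted as distinct Strata-PSU combinations.

```sas
/* Sketch: verify the design counts stated in the text.          */
/* Assumes the complete DrugAbuse data set exists in WORK.       */
proc sql;
   select count(*)                               as NObs,
          count(distinct Strata)                 as NStrata,
          /* Count PSUs within strata in case IDs are reused */
          count(distinct catx('_', Strata, PSU)) as NPSU,
          sum(ObsWeight)                         as SumWeights
   from DrugAbuse;
quit;
```

If the data match the description, the query would be expected to report 736 observations, 10 strata, 35 PSUs, and a weight total of approximately 19,600.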

The following statements request that the missing items be imputed by using the approximate Bayesian bootstrap hot-deck imputation method:

proc surveyimpute data=DrugAbuse method=hotdeck(selection=abb)
                  ndonors=5 seed=773269;
   var Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   output out=DrugAbuseABB;
run;

The PROC SURVEYIMPUTE statement invokes the procedure, and the DATA= option specifies the input data set DrugAbuse. The METHOD=HOTDECK option requests the hot-deck imputation method, and the SELECTION=ABB suboption requests the approximate Bayesian bootstrap method for selecting donors. The NDONORS= option requests five donor units for every recipient unit, and the SEED= option specifies the random number generator seed. The VAR statement specifies the variables that are to be imputed, the CELLS statement identifies the imputation cell variable ImputationCell, and the OUT= option in the OUTPUT statement names the output data set DrugAbuseABB.

You do not need to use the WEIGHT, STRATA, and CLUSTER statements for the approximate Bayesian bootstrap method unless you want to create the jackknife replication weights by including the VARMETHOD=JACKKNIFE option in the PROC SURVEYIMPUTE statement. The selection of donors does not use the design information. However, if you want to select donors from the same strata or the same group of clusters, then you must include that information in the imputation cell.
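
If jackknife replication weights are wanted, the design statements become necessary. The following variant is a sketch under that assumption, not a tested program. It adds the VARMETHOD=JACKKNIFE option and the WEIGHT, STRATA, and CLUSTER statements to the earlier call. The output data set name DrugAbuseABBJK is hypothetical.

```sas
/* Sketch: same ABB hot-deck imputation, but also requesting     */
/* jackknife replication weights, which require the design       */
/* information. DrugAbuseABBJK is a hypothetical data set name.  */
proc surveyimpute data=DrugAbuse method=hotdeck(selection=abb)
                  ndonors=5 seed=773269 varmethod=jackknife;
   weight ObsWeight;          /* observation weights             */
   strata Strata;             /* design strata                   */
   cluster PSU;               /* primary sampling units          */
   var Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   output out=DrugAbuseABBJK;
run;
```

Because a WEIGHT statement is present in this variant, the weighted columns of the summary tables would no longer be expected to equal the unweighted ones.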

Summary information about the imputation method, number of observations, and missing data patterns is shown in Output 110.1.1. The "Imputation Information" table summarizes the imputation method. The "Number of Observations" table shows that PROC SURVEYIMPUTE read and used all 736 observations. The "Missing Data Patterns" table displays the missing data patterns in the data set. There are four different missing data pattern groups: all items observed, one item missing, four items missing, and all items missing. Of the observation units, 92.53% have all items observed; 4.62% have missing values in Treatment; 1.77% have missing values in Insurance, Drug, Alcohol, and Treatment; and 1.09% have missing values in all variables. Because the WEIGHT statement is not specified, every unit has a weight of 1, so the weighted percentages equal the unweighted percentages of units in the input data.

Output 110.1.1: Imputation Summary

The SURVEYIMPUTE Procedure

Imputation Information
Data Set WORK.DRUGABUSE
Imputation Method HOTDECK
Selection Method ABB
Random Number Seed 773269

Number of Observations Read 736
Number of Observations Used 736

Missing Data Patterns
                                                              Sum of  Unweighted  Weighted
Group  Sex  Race  Insurance  Drug  Alcohol  Treatment  Freq  Weights     Percent   Percent
    1  X    X     X          X     X        X           681      681       92.53     92.53
    2  X    X     X          X     X        .            34       34        4.62      4.62
    3  X    X     .          .     .        .            13       13        1.77      1.77
    4  .    .     .          .     .        .             8        8        1.09      1.09

Missing Data Patterns: Group Means
Group       Sex      Race  Insurance      Drug   Alcohol  Treatment
    1  0.566814  1.491924   1.718062  1.292217  1.716593   1.201175
    2  0.500000  1.441176   1.588235  1.294118  1.588235          .
    3  0.692308  1.230769          .         .         .          .
    4         .         .          .         .         .          .

Imputation Summary
                         Number of   Sum of
Observation Status    Observations  Weights
Nonmissing                     681      681
Missing                         55       55
Missing, Imputed                55       55
Missing, Not Imputed             0        0
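
The pattern frequencies in Output 110.1.1 can be cross-checked directly from the input data. The following statements are a minimal sketch that counts the number of missing items per unit with the NMISS function. The data set name Patterns and the variable name NMissing are hypothetical. The count identifies the pattern group only because, in this data set, each count (0, 1, 4, or 6) corresponds to exactly one pattern.

```sas
/* Sketch: count missing items per unit among the six imputed   */
/* variables, then tabulate the counts.                          */
data Patterns;
   set DrugAbuse;
   NMissing = nmiss(Sex, Race, Insurance, Drug, Alcohol, Treatment);
run;

proc freq data=Patterns;
   tables NMissing;
run;
```

Given the table above, the frequencies would be expected to be 681, 34, 13, and 8 for 0, 1, 4, and 6 missing items, respectively.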



Some selected observations from the output data set are displayed in Output 110.1.2. The output data set DrugAbuseABB contains the unit identification, the recipient index, and all the variables from the input data set DrugAbuse. Units that are complete respondents have one row, but units that are incomplete respondents have five rows in the output data set. For example, unit 21 is a complete respondent, so it has only one row in the output data set and its Recipient value is 0. Unit 22 is an incomplete respondent, so it has five rows in the output data set and its Recipient values range from 1 to 5.

Output 110.1.2: Observations for Some Selected Units

UnitID Recipient Strata PSU ObsWeight ImputationCell Age Sex Race Insurance Drug Alcohol Treatment
20 0 1 1 5 1 31 0 1 1 1 2 1
21 0 1 2 5 1 23 0 3 3 1 1 1
22 1 1 2 5 2 78 1 1 1 1 1 1
22 2 1 2 5 2 78 1 1 1 1 1 1
22 3 1 2 5 2 78 1 1 2 1 2 1
22 4 1 2 5 2 78 1 1 3 1 2 1
22 5 1 2 5 2 78 1 1 1 1 2 2
23 0 1 2 5 2 29 1 1 1 1 1 1
24 1 1 2 5 2 21 1 2 2 1 2 2
24 2 1 2 5 2 21 1 3 1 1 1 1
24 3 1 2 5 2 21 1 2 2 1 2 1
24 4 1 2 5 2 21 1 1 2 1 2 1
24 5 1 2 5 2 21 1 2 2 2 1 1
25 0 1 2 5 3 85 0 1 3 1 2 1
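
The row structure described above can be checked with a frequency table of the Recipient variable. This is a minimal sketch that assumes the output data set DrugAbuseABB was created by the PROC SURVEYIMPUTE step shown earlier.

```sas
/* Sketch: tabulate the recipient index in the imputed data set. */
/* Recipient = 0 marks complete respondents; 1 to 5 index the    */
/* donors that were selected for each incomplete respondent.     */
proc freq data=DrugAbuseABB;
   tables Recipient / nocum;
run;
```

With 681 complete respondents and 55 incomplete respondents that each receive five donors, the table would be expected to show 681 rows for Recipient 0 and 55 rows for each value from 1 to 5.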



Suppose you want to perform a logistic regression analysis by using the imputed data set. If you want to use the multiple imputation variance estimator that is available in the MIANALYZE procedure with the imputed data set, then you need to create one complete data set for every imputation. The following SAS statements create five complete data sets and stack them in a single data set, DAIMP, sorted by imputation number. Each complete data set contains the complete respondents and only one donor unit for each incomplete respondent. Each data set is identified by the imputation number (_Imputation_).

data DAIMP;
   set DrugAbuseABB;
   if (Recipient = 0) then do;  /* Include complete respondents */
      do _Imputation_=1 to 5;   /* in all imputations.          */
         output;
      end;
   end;
   else do;                     /* Put incomplete respondents   */
      _Imputation_ = Recipient; /* in separate imputations.     */
      output;
   end;
run;

proc sort data=DAIMP;
   by _Imputation_ UnitID;
run;
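
A quick way to confirm that the restructuring worked is to count the rows in each imputation. The following sketch assumes that DAIMP was created by the preceding DATA step.

```sas
/* Sketch: each of the five imputations should be a complete    */
/* data set with one row per observation unit.                   */
proc freq data=DAIMP;
   tables _Imputation_ / nocum;
run;
```

Each value of _Imputation_ would be expected to have 736 rows: the 681 complete respondents plus one imputed row for each of the 55 incomplete respondents.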

The following SAS statements first use the SURVEYLOGISTIC procedure (see Chapter 111: The SURVEYLOGISTIC Procedure) to perform a separate logistic regression analysis within each imputed data set and then use the MIANALYZE procedure (see Chapter 76: The MIANALYZE Procedure) to combine the logistic regression results from the five imputed data sets:

ods select none;
proc surveylogistic data=DAIMP;
   by _imputation_;
   class Treatment Insurance Sex Race;
   strata Strata;
   cluster PSU;
   weight ObsWeight;
   model Drug=Treatment Insurance Age Sex Race / covb;
   ods output parameterestimates=Estimates covb=Covariances;
run;
ods select all;
proc mianalyze parms(classvar=classval)=Estimates
               covb(effectvar=stacking)=Covariances
               edf=25;
   class Treatment Insurance Sex Race;
   modeleffects Intercept Treatment Insurance Age Sex Race;
   ods output parameterestimates=ABBLogisticAnalysis;
run;

Although the survey design information was not directly used in the imputation, you must use the complete design information, including strata, clusters, and weights, to estimate the design variance within each imputed data set. The STRATA, CLUSTER, and WEIGHT statements in PROC SURVEYLOGISTIC specify the design information. However, separate logistic regression results from any single imputed data set should not be used for inference.
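
For reference, the standard multiple-imputation combining rules (Rubin's rules) for a single parameter can be written as follows. This is a general summary, not output from this example. The notation is an assumption introduced here: \hat{Q}_i and \hat{W}_i denote the point estimate and its design-based variance estimate from imputed data set i, and m = 5 is the number of imputations.

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i,
\qquad
\bar{W} = \frac{1}{m}\sum_{i=1}^{m}\hat{W}_i,
\qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2,
\qquad
T = \bar{W} + \Bigl(1+\frac{1}{m}\Bigr)B
```

Here \bar{Q} is the combined estimate, \bar{W} is the within-imputation variance, B is the between-imputation variance, and the combined standard error is \sqrt{T}. The between-imputation term is what a single imputed data set cannot supply, which is why results from any one imputation understate the uncertainty.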

Degrees of freedom values for survey data are often much less than the number of observation units. In this example, there are 736 observation units, but there are 35 PSUs in 10 strata. The number of degrees of freedom for the Taylor series linearized variance estimator is 25 (35 – 10). You should specify the reduced degrees of freedom by using the EDF= option in PROC MIANALYZE. For more information, see the section EDF=number in Chapter 76: The MIANALYZE Procedure; also see Barnard and Rubin (1999).
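
The effect of EDF=25 can be sketched with the adjusted degrees of freedom of Barnard and Rubin (1999). The notation is an assumption introduced here: \nu_0 = 25 is the complete-data degrees of freedom, m = 5 is the number of imputations, \bar{W} and B are the within-imputation and between-imputation variances, and T = \bar{W} + (1 + 1/m)B is the total variance.

```latex
\nu_m = (m-1)\left[1+\frac{\bar{W}}{(1+m^{-1})B}\right]^{2},
\qquad
\gamma = \frac{(1+m^{-1})B}{T},
\qquad
\hat{\nu}_{\mathrm{obs}} = (1-\gamma)\,\frac{\nu_0(\nu_0+1)}{\nu_0+3},
\qquad
\nu_m^{*} = \left[\frac{1}{\nu_m}+\frac{1}{\hat{\nu}_{\mathrm{obs}}}\right]^{-1}
```

Because \nu_m^{*} \le \hat{\nu}_{\mathrm{obs}} \le \nu_0(\nu_0+1)/(\nu_0+3), the adjusted degrees of freedom in this example cannot exceed 25 \cdot 26 / 28 \approx 23.21, regardless of how small the between-imputation variance is.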

The estimated regression parameters and their standard errors from a multiply imputed data set are shown in Output 110.1.3.

Output 110.1.3: Logistic Regression Analysis Using a Multiply Imputed Data Set

The MIANALYZE Procedure

Parameter Estimates (5 Imputations)
Parameter  Treatment  Insurance  Sex  Race   Estimate  Std Error  95% Confidence Limits      DF    Minimum    Maximum
Intercept                                    0.527080   0.232094     0.04656   1.007603  22.658   0.492600   0.574832
Treatment  1                                 0.094937   0.153297    -0.22225   0.412121  22.916   0.078459   0.114233
Insurance             1                     -0.128475   0.144781    -0.42822   0.171268  22.671  -0.157661  -0.108998
Insurance             2                     -0.038353   0.119353    -0.28536   0.208658  22.816  -0.059264  -0.024690
Age                                          0.001926   0.004635    -0.00767   0.011524  22.577   0.000858   0.002539
Sex                              0          -0.088823   0.088702    -0.27265   0.095000  22.279  -0.101860  -0.063631
Race                                  1      0.436564   0.156414     0.11307   0.760060  23.092   0.418673   0.443484
Race                                  2     -0.111091   0.195368    -0.51509   0.292904   23.16  -0.118899  -0.096429

Parameter Estimates (5 Imputations), continued
                                                    t for H0:
Parameter  Treatment  Insurance  Sex  Race  Theta0  Parameter=Theta0  Pr > |t|
Intercept                                        0              2.27    0.0330
Treatment  1                                     0              0.62    0.5418
Insurance             1                          0             -0.89    0.3842
Insurance             2                          0             -0.32    0.7509
Age                                              0              0.42    0.6817
Sex                              0               0             -1.00    0.3274
Race                                  1          0              2.79    0.0104
Race                                  2          0             -0.57    0.5751