The SURVEYIMPUTE Procedure

Example 110.1 Approximate Bayesian Bootstrap Imputation

This example illustrates the approximate Bayesian bootstrap hot-deck imputation method by using a simulated data set from a fictitious survey of drug abusers. A stratified clustered sample of drug abuse treatment centers is taken from a list of available treatment centers. The list is first stratified based on geographic location. From each stratum, two or three treatment centers are sampled as the primary sampling units (PSUs). Data are collected from individual patients within the selected treatment centers. The survey collects information about the substances that the patients used (such as drugs, alcohol, and marijuana) along with insurance information and treatment information.

The data set contains 736 observation units in 35 PSUs and 10 strata. The sum of the weights is 19,600. Therefore, the survey data represent a population of 19,600 patients from the study area. Some participants did not respond to all questions. The data set contains missing values in many variables.

To impute the missing items, you first need to decide whether to impute within imputation cells. Imputation cells divide the data into groups of similar units such that the recipient units share similar characteristics with the donor units in the same group. For example, it is reasonable to believe that different age groups, races, and income categories might have different responses to the drug abuse survey. You can use these characteristics to create imputation cells. Characteristics for imputation cells might come from the same survey or might come from other sources such as census data or previous surveys. In this example, assume that the imputation cells are available as a variable called ImputationCell in the data set.
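As a minimal sketch of how a cell variable could be built from a respondent characteristic, the following DATA step assigns age-group cells in the DrugAbuse data set that is created later in this example. This is not how ImputationCell was constructed for the example. The output data set name DrugAbuseCells, the variable name AgeCell, and the cutoffs 35 and 60 are assumptions chosen only for illustration.

```sas
/* Hypothetical sketch: age-based imputation cells.                 */
/* DrugAbuseCells, AgeCell, and the cutoffs are illustrative only.  */
data DrugAbuseCells;
   set DrugAbuse;
   if missing(Age) then AgeCell = .;   /* cannot classify without age */
   else if Age < 35 then AgeCell = 1;  /* younger patients            */
   else if Age < 60 then AgeCell = 2;  /* middle group                */
   else AgeCell = 3;                   /* older patients              */
run;
```

A variable such as AgeCell could then be named in the CELLS statement, provided that it is observed for every unit and that every cell contains enough donors.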

The data set DrugAbuse contains the following items:

Strata: stratum identification
PSU: PSU identification (treatment centers)
ObsWeight: observation weight for patients
ImputationCell: imputation cell identification
Age: age, in years
Sex: 1 for female and 2 for male
Race: 1 for white, 2 for black, and 3 for others
Insurance: 1 if the patient has any insurance, and 2 otherwise
Drug: 1 if the patient used any drugs in the past three months, and 2 otherwise
Alcohol: 1 if the patient consumed any alcohol in the past month, and 2 otherwise
Treatment: 1 if the patient is being treated for the first time, and 2 otherwise

data DrugAbuse;
   input Strata PSU ObsWeight ImputationCell Age Sex Race Insurance 
         Drug Alcohol Treatment;
   datalines;
 1    1     5        1      74  1    1      3       2      2        1    
 1    1     5        1      20  0    3      1       2      2        1    
 1    1     5        3      42  1    2      1       1      2        1    
 1    1     5        3      65  1    3      2       1      2        1    
 1    1     5        2      53  1    1      1       1      1        1    
 1    1     5        3      49  1    1      1       1      2        1    
 1    1     5        3      51  0    2      1       2      1        1    
 1    1     5        2      77  0    3      1       1      1        1    
 1    1     5        2      26  1    1      1       1      2        1    
 1    1     5        3      28  0    3      1       1      1        1    
 1    1     5        1      71  1    1      1       2      2        1    
 1    1     5        2      72  1    1      3       2      2        1    
 1    1     5        3      24  1    1      1       1      1        1    
 1    1     5        2      65  1    1      2       1      1        2    
 1    1     5        3      47  1    1      1       1      1        1    
 1    1     5        2      37  1    1      2       1      2        1    
 1    1     5        2      46  1    1      3       1      1        1    
 1    1     5        2      52  1    1      1       1      2        2    
 1    1     5        3      60  0    3      1       1      2        1    
 1    1     5        1      31  0    1      1       1      2        1    
 1    2     5        1      23  0    3      3       1      1        1    
 1    2     5        2      78  1    1      .       .      .        .    
 1    2     5        2      29  1    1      1       1      1        1    
 1    2     5        2      21  .    .      .       .      .        .    

   ... more lines ...   

10    4  55.5556      1      40  0    3      2       1      1        2    
10    4  55.5556      1      32  1    3      1       2      2        1    
10    4  55.5556      3      68  0    1      2       2      1        2    
10    4  55.5556      3      35  1    1      2       1      2        2    
;
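Before imputing, it can be useful to confirm that the data set matches the design summary given earlier. The following PROC SQL step is one possible check, not part of the original example. It assumes that PSU identifiers might repeat across strata, so it counts distinct combinations of Strata and PSU; the column aliases NObs, NStrata, NPSU, and SumWeights are illustrative names.

```sas
/* Sketch: compare the data set against the stated design counts. */
proc sql;
   select count(*)                               as NObs,
          count(distinct Strata)                 as NStrata,
          count(distinct catx('-', Strata, PSU)) as NPSU,
          sum(ObsWeight)                         as SumWeights
   from DrugAbuse;
quit;
```

If the full data set agrees with the description, this query reports 736 observations, 10 strata, 35 PSUs, and a weight total of 19,600.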

The following statements request that the missing items be imputed by using the approximate Bayesian bootstrap hot-deck imputation method:

proc surveyimpute data=DrugAbuse method=hotdeck(selection=abb)
                  ndonors=5 seed=773269;
   var Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   output out=DrugAbuseABB;
run;

The PROC SURVEYIMPUTE statement invokes the procedure. The DATA= option specifies the input data set DrugAbuse, the METHOD=HOTDECK option requests the hot-deck imputation method, the SELECTION=ABB suboption requests the approximate Bayesian bootstrap method for selecting donors, the NDONORS= option requests five donor units for every recipient unit, and the SEED= option specifies the random number generator seed. The VAR statement specifies the variables that are to be imputed, the CELLS statement identifies the imputation cell variable ImputationCell, and the OUT= option in the OUTPUT statement names the output data set DrugAbuseABB.

You do not need to use the WEIGHT, STRATA, and CLUSTER statements for the approximate Bayesian bootstrap method unless you want to create the jackknife replication weights by including the VARMETHOD=JACKKNIFE option in the PROC SURVEYIMPUTE statement. The selection of donors does not use the design information. However, if you want to select donors from the same strata or the same group of clusters, then you must include that information in the imputation cell.
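For comparison, a variant of the call that also requests jackknife replicate weights might look like the following sketch. It adds the design statements and the VARMETHOD=JACKKNIFE option that are described above. The output data set names DrugAbuseABBJK and DrugAbuseJKCoefs are hypothetical, and the OUTJKCOEFS= option is included on the assumption that you want the jackknife coefficients saved for later analysis.

```sas
/* Sketch: ABB hot-deck imputation with jackknife replicate weights. */
/* Output data set names are illustrative.                           */
proc surveyimpute data=DrugAbuse method=hotdeck(selection=abb)
                  ndonors=5 seed=773269 varmethod=jackknife;
   strata Strata;        /* design strata                   */
   cluster PSU;          /* treatment centers (PSUs)        */
   weight ObsWeight;     /* patient-level weights           */
   var Sex Race Insurance Drug Alcohol Treatment;
   cells ImputationCell;
   output out=DrugAbuseABBJK outjkcoefs=DrugAbuseJKCoefs;
run;
```

Donor selection still ignores the design in this variant; the design statements are used only to construct the replicate weights.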

Summary information about the imputation method, number of observations, and missing data patterns is shown in Output 110.1.1. The "Imputation Information" table summarizes the imputation method. The "Number of Observations" table shows that PROC SURVEYIMPUTE read and used all 736 observations. The "Missing Data Patterns" table displays the missing patterns in the data set. There are four different missing data pattern groups: all items observed, one item missing, four items missing, and all items missing. Of the observation units, 92.53% have all items observed; 4.62% have missing values in Treatment; 1.77% have missing values in Insurance, Drug, Alcohol, and Treatment; and 1.09% have missing values in all variables. Because the WEIGHT statement is not specified, these percentages represent the percentages of missing units in the input data.

Output 110.1.1: Imputation Summary

The SURVEYIMPUTE Procedure

Imputation Information
Data Set	WORK.DRUGABUSE
Imputation Method	HOTDECK
Selection Method	ABB
Random Number Seed	773269

Number of Observations Read	736
Number of Observations Used	736

Missing Data Patterns
Group	Sex	Race	Insurance	Drug	Alcohol	Treatment	Freq	Sum of Weights	Unweighted Percent	Weighted Percent	Group Mean: Sex	Group Mean: Race	Group Mean: Insurance	Group Mean: Drug	Group Mean: Alcohol	Group Mean: Treatment
1	X	X	X	X	X	X	681	681	92.53	92.53	0.566814	1.491924	1.718062	1.292217	1.716593	1.201175
2	X	X	X	X	X	.	34	34	4.62	4.62	0.500000	1.441176	1.588235	1.294118	1.588235	.
3	X	X	.	.	.	.	13	13	1.77	1.77	0.692308	1.230769	.	.	.	.
4	.	.	.	.	.	.	8	8	1.09	1.09	.	.	.	.	.	.

Imputation Summary
Observation Status	Number of Observations	Sum of Weights
Nonmissing	681	681
Missing	55	55
Missing, Imputed	55	55
Missing, Not Imputed	0	0
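The pattern frequencies in the preceding tables can be cross-checked directly against the input data. The following steps are a sketch of one way to do so; the data set name MissCount and the variable name NMissing are hypothetical.

```sas
/* Sketch: count missing items per unit among the imputed variables. */
data MissCount;
   set DrugAbuse;
   NMissing = nmiss(Sex, Race, Insurance, Drug, Alcohol, Treatment);
run;

proc freq data=MissCount;
   tables NMissing;      /* one row per distinct number of missing items */
run;
```

If the data agree with the "Missing Data Patterns" table, the frequencies for 0, 1, 4, and 6 missing items are 681, 34, 13, and 8, respectively. This check counts missing items only, so it does not distinguish between two patterns that have the same number of missing items.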

Some selected observations from the output data set are displayed in Output 110.1.2. The output data set DrugAbuseABB contains the unit identification, the recipient index, and all the variables from the input data set DrugAbuse. Units that are complete respondents have one row, but units that are incomplete respondents have five rows in the output data set. For example, unit 21 is a complete respondent, so it has only one row in the output data set and its Recipient value is 0. Unit 22 is an incomplete respondent, so it has five rows in the output data set and its Recipient values range from 1 to 5.

Output 110.1.2: Observations for Some Selected Units

UnitID	Recipient	Strata	PSU	ObsWeight	ImputationCell	Age	Sex	Race	Insurance	Drug	Alcohol	Treatment
20	0	1	1	5	1	31	0	1	1	1	2	1
21	0	1	2	5	1	23	0	3	3	1	1	1
22	1	1	2	5	2	78	1	1	1	1	1	1
22	2	1	2	5	2	78	1	1	1	1	1	1
22	3	1	2	5	2	78	1	1	2	1	2	1
22	4	1	2	5	2	78	1	1	3	1	2	1
22	5	1	2	5	2	78	1	1	1	1	2	2
23	0	1	2	5	2	29	1	1	1	1	1	1
24	1	1	2	5	2	21	1	2	2	1	2	2
24	2	1	2	5	2	21	1	3	1	1	1	1
24	3	1	2	5	2	21	1	2	2	1	2	1
24	4	1	2	5	2	21	1	1	2	1	2	1
24	5	1	2	5	2	21	1	2	2	2	1	1
25	0	1	2	5	3	85	0	1	3	1	2	1
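A listing like the preceding one, together with a tabulation of the recipient index, can be produced with statements such as the following. This is a sketch that assumes the variable names UnitID and Recipient shown in the output; it is not necessarily how the listing was generated.

```sas
/* Sketch: list selected units and tabulate the recipient index. */
proc print data=DrugAbuseABB noobs;
   where 20 <= UnitID <= 25;
run;

proc freq data=DrugAbuseABB;
   tables Recipient;     /* 0 = complete respondent, 1-5 = donor rows */
run;
```

If every incomplete respondent received five donors, the tabulation shows 681 rows with Recipient equal to 0 and 55 rows for each Recipient value from 1 to 5.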

Suppose you want to perform a logistic regression analysis by using the imputed data set. If you want to use the multiple imputation variance estimator that is available in the MIANALYZE procedure, then you need one complete data set for every imputation. The following SAS statements create five complete data sets, stack them in a single data set, and sort the result by imputation number. Each complete data set contains the complete respondents and exactly one donor row for each incomplete respondent. The variable _Imputation_ identifies the imputation number.

data DAIMP;
   set DrugAbuseABB;
   if (Recipient = 0) then do;  /* Include complete respondents */
      do _Imputation_=1 to 5;   /* in all imputations.          */
         output;
      end;
   end;
   else do;                     /* Put incomplete respondents   */
      _Imputation_ = Recipient; /* in separate imputations.     */
      output;
   end;
run;

proc sort data=DAIMP;
   by _Imputation_ UnitID;
run;
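A quick way to confirm that the stacked data set is well formed is to count the rows in each imputation. The following step is a sketch of such a check and is not part of the original example.

```sas
/* Sketch: each imputation should contain one row per sampled unit. */
proc freq data=DAIMP;
   tables _Imputation_ / nocum;
run;
```

If every incomplete respondent has five donor rows, each of the five imputations contains 736 rows (681 complete respondents plus 55 imputed units).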

The following SAS statements first use the SURVEYLOGISTIC procedure (see Chapter 111: The SURVEYLOGISTIC Procedure) to perform a separate logistic regression analysis within each imputed data set and then use the MIANALYZE procedure (see Chapter 76: The MIANALYZE Procedure) to combine the logistic regression results from the five imputed data sets:

ods select none;
proc surveylogistic data=DAIMP;
   by _imputation_;
   class Treatment Insurance Sex Race;
   strata Strata;
   cluster PSU;
   weight ObsWeight;
   model Drug=Treatment Insurance Age Sex Race / covb;
   ods output parameterestimates=Estimates covb=Covariances;
run;
ods select all;

proc mianalyze parms(classvar=classval)=Estimates
               covb(effectvar=stacking)=Covariances
               edf=25;
   class Treatment Insurance Sex Race;
   modeleffects Intercept Treatment Insurance Age Sex Race;
   ods output parameterestimates=ABBLogisticAnalysis;
run;

Although the survey design information was not directly used in the imputation, you must use the complete design information, including strata, clusters, and weights, to estimate the design variance within each imputed data set. The STRATA, CLUSTER, and WEIGHT statements in PROC SURVEYLOGISTIC specify the design information. However, separate logistic regression results from any single imputed data set should not be used for inference.

Degrees of freedom values for survey data are often much less than the number of observation units. In this example, there are 736 observation units, but there are 35 PSUs in 10 strata. The degrees of freedom for the Taylor series linearized variance estimator is 25 (35 – 10). You should specify the reduced degrees of freedom by using the EDF= option in PROC MIANALYZE. For more information, see the section EDF=number in Chapter 76: The MIANALYZE Procedure; also see Barnard and Rubin (1999).
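The effect of EDF=25 can be made explicit with the standard multiple imputation combining rules and the small-sample adjustment of Barnard and Rubin (1999). The following is a sketch in conventional notation rather than a transcription of the PROC MIANALYZE documentation. The symbols are introduced here for illustration: for a single parameter, $\hat{Q}_i$ and $\hat{U}_i$ denote the point estimate and its design-based variance estimate from imputation $i$, $m = 5$ is the number of imputations, and $\nu_0$ is the complete-data degrees of freedom.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2, \\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B, \qquad
\gamma = \frac{(1+m^{-1})\,B}{T}, \\
\nu_0 &= 35 - 10 = 25, \qquad
\nu_m = (m-1)\Bigl[1+\frac{\bar{U}}{(1+m^{-1})\,B}\Bigr]^{2}, \\
\nu_{\mathrm{obs}} &= (1-\gamma)\,\frac{\nu_0(\nu_0+1)}{\nu_0+3}, \qquad
\nu_m^{*} = \Bigl(\frac{1}{\nu_m}+\frac{1}{\nu_{\mathrm{obs}}}\Bigr)^{-1}.
\end{align*}
```

Under these formulas, $\nu_{\mathrm{obs}}$ cannot exceed $25 \cdot 26 / 28 \approx 23.2$, so the adjusted degrees of freedom $\nu_m^{*}$ for every parameter fall below that bound, regardless of how small the between-imputation variance $B$ is.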

The estimated regression parameters and their standard errors from a multiply imputed data set are shown in Output 110.1.3.

Output 110.1.3: Logistic Regression Analysis Using a Multiply Imputed Data Set

The MIANALYZE Procedure

Parameter Estimates (5 Imputations)
Parameter	Treatment	Insurance	Sex	Race	Estimate	Std Error	95% Confidence Limits		DF	Minimum	Maximum	Theta0	t for H0: Parameter=Theta0	Pr > |t|
Intercept					0.527080	0.232094	0.04656	1.007603	22.658	0.492600	0.574832	0	2.27	0.0330
Treatment	1				0.094937	0.153297	-0.22225	0.412121	22.916	0.078459	0.114233	0	0.62	0.5418
Insurance		1			-0.128475	0.144781	-0.42822	0.171268	22.671	-0.157661	-0.108998	0	-0.89	0.3842
Insurance		2			-0.038353	0.119353	-0.28536	0.208658	22.816	-0.059264	-0.024690	0	-0.32	0.7509
Age					0.001926	0.004635	-0.00767	0.011524	22.577	0.000858	0.002539	0	0.42	0.6817
Sex			0		-0.088823	0.088702	-0.27265	0.095000	22.279	-0.101860	-0.063631	0	-1.00	0.3274
Race				1	0.436564	0.156414	0.11307	0.760060	23.092	0.418673	0.443484	0	2.79	0.0104
Race				2	-0.111091	0.195368	-0.51509	0.292904	23.16	-0.118899	-0.096429	0	-0.57	0.5751
