The MI Procedure

Getting Started: MI Procedure

The Fitness data set, described in the chapter on the REG procedure, contains measurements of 31 individuals in a physical fitness course. See Chapter 97: The REG Procedure, for more information.

The Fitness1 data set is constructed from the Fitness data set and contains three variables: Oxygen, RunTime, and RunPulse. Some values have been set to missing, and the resulting data set has an arbitrary pattern of missingness in these three variables.

*---------------------Data on Physical Fitness-------------------------*
| These measurements were made on men involved in a physical fitness   |
| course at N.C. State University. Certain values have been set to     |
| missing and the resulting data set has an arbitrary missing pattern. |
| Only selected variables of                                           |
| Oxygen (intake rate, ml per kg body weight per minute),              |
| RunTime (time to run 1.5 miles in minutes),                          |
| RunPulse (heart rate while running) are used.                        |
*----------------------------------------------------------------------*;
data Fitness1;
   input Oxygen RunTime RunPulse @@;
   datalines;
44.609  11.37  178     45.313  10.07  185
54.297   8.65  156     59.571    .      .
49.874   9.22    .     44.811  11.63  176
  .     11.95  176          .  10.85    .
39.442  13.08  174     60.055   8.63  170
50.541    .      .     37.388  14.03  186
44.754  11.12  176     47.273    .      .
51.855  10.33  166     49.156   8.95  180
40.836  10.95  168     46.672  10.00    .
46.774  10.25    .     50.388  10.08  168
39.407  12.63  174     46.080  11.17  156
45.441   9.63  164       .      8.92    .
45.118  11.08    .     39.203  12.88  168
45.790  10.47  186     50.545   9.93  148
48.673   9.40  186     47.920  11.50  170
47.467  10.50  170
;

Suppose that the data are multivariate normally distributed and the missing data are missing at random (MAR). That is, the probability that an observation is missing can depend on the observed variable values of the individual, but not on the missing variable values of the individual. See the section Statistical Assumptions for Multiple Imputation for a detailed description of the MAR assumption.

The following statements invoke the MI procedure and impute missing values for the Fitness1 data set:

proc mi data=Fitness1 seed=501213 mu0=50 10 180 out=outmi;
   mcmc;
   var Oxygen RunTime RunPulse;
run;

The "Model Information" table in Figure 75.1 describes the method used in the multiple imputation process. By default, the MCMC statement uses the Markov chain Monte Carlo (MCMC) method with a single chain to create 25 imputations. The posterior mode (the parameter value with the highest observed-data posterior density) under a noninformative prior is computed by the expectation-maximization (EM) algorithm and is used as the starting value for the chain.

Figure 75.1: Model Information

The MI Procedure

Model Information
Data Set	WORK.FITNESS1
Method	MCMC
Multiple Imputation Chain	Single Chain
Initial Estimates for MCMC	EM Posterior Mode
Start	Starting Value
Prior	Jeffreys
Number of Imputations	25
Number of Burn-in Iterations	200
Number of Iterations	100
Seed for random number generator	501213

The MI procedure runs 200 burn-in iterations before the first imputation and 100 iterations between imputations. In a Markov chain, the state at the current iteration influences the state at the next iteration, so successive draws are correlated. The burn-in iterations at the beginning of each chain serve both to remove the chain's dependence on its starting value and to let the chain reach its stationary distribution. The between-imputation iterations in a single chain reduce the dependence between successive imputations.
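The defaults reported in Figure 75.1 can also be written out explicitly. The following statements are a sketch that assumes the NIMPUTE=, CHAIN=, NBITER=, NITER=, INITIAL=, and PRIOR= options correspond one-to-one to the rows of the "Model Information" table. Under that assumption they should request the same analysis as the preceding PROC MI step.

```sas
/* Sketch: the earlier PROC MI step with its default settings spelled out. */
/* Option values are taken from the "Model Information" table.            */
proc mi data=Fitness1 seed=501213 nimpute=25 mu0=50 10 180 out=outmi;
   mcmc chain=single      /* one chain for all imputations          */
        nbiter=200        /* burn-in iterations before imputation 1 */
        niter=100         /* iterations between imputations         */
        initial=em        /* start from the EM posterior mode       */
        prior=jeffreys;   /* noninformative prior                   */
   var Oxygen RunTime RunPulse;
run;
```

Writing the options out makes it easier to change a single setting later, for example lengthening the burn-in period if the chain appears slow to converge.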

The "Missing Data Patterns" table in Figure 75.2 lists distinct missing data patterns with their corresponding frequencies and percentages. An "X" means that the variable is observed in the corresponding group, and a "." means that the variable is missing. The table also displays group-specific variable means. The MI procedure sorts the data into groups based on whether the analysis variables are observed or missing. For a detailed description of missing data patterns, see the section Missing Data Patterns.

Figure 75.2: Missing Data Patterns

Missing Data Patterns
Group	Oxygen	RunTime	RunPulse	Freq	Percent	Group Mean: Oxygen	Group Mean: RunTime	Group Mean: RunPulse
1	X	X	X	21	67.74	46.353810	10.809524	171.666667
2	X	X	.	4	12.90	47.109500	10.137500	.
3	X	.	.	3	9.68	52.461667	.	.
4	.	X	X	1	3.23	.	11.950000	176.000000
5	.	X	.	2	6.45	.	9.885000	.
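If only the missing data patterns are of interest, the imputation step can be skipped. The following sketch assumes that specifying NIMPUTE=0 suppresses the imputations while still producing the "Model Information" and "Missing Data Patterns" tables.

```sas
/* Sketch: inspect the missing data patterns without imputing.  */
/* NIMPUTE=0 requests zero imputations, so no imputed values    */
/* are generated.                                               */
proc mi data=Fitness1 nimpute=0;
   var Oxygen RunTime RunPulse;
run;
```

Reviewing the patterns first is a quick way to see how many observations are complete (21 of 31 here) before choosing an imputation method.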

After the completion of m imputations (here, m = 25), the "Variance Information" table in Figure 75.3 displays the between-imputation variance, within-imputation variance, and total variance for combining complete-data inferences. It also displays the degrees of freedom for the total variance. The relative increase in variance due to missing values, the fraction of missing information, and the relative efficiency (in units of variance) for each variable are also displayed. A detailed description of these statistics is provided in the section Combining Inferences from Multiply Imputed Data Sets.

Figure 75.3: Variance Information

Variance Information (25 Imputations)
Variable	Between Variance	Within Variance	Total Variance	DF	Relative Increase in Variance	Fraction Missing Information	Relative Efficiency
Oxygen	0.037126	0.936472	0.975084	27.018	0.041231	0.039724	0.998414
RunTime	0.001317	0.065716	0.067086	27.593	0.020843	0.020451	0.999183
RunPulse	1.386290	3.394043	4.835784	18.43	0.424786	0.303282	0.988014
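The Oxygen row of Figure 75.3 can be checked by hand from its between-imputation and within-imputation variances. The following DATA step is a sketch that applies the usual combining rules for multiply imputed data. It assumes complete-data degrees of freedom of n - 1 = 30, which this section does not state, and all variable names are illustrative.

```sas
/* Sketch: recompute the Oxygen row of the "Variance Information" table. */
data _null_;
   m = 25;                        /* number of imputations               */
   n = 31;                        /* number of observations              */
   between = 0.037126;            /* between-imputation variance (table) */
   within  = 0.936472;            /* within-imputation variance (table)  */

   total   = within + (1 + 1/m)*between;       /* total variance         */
   rel_inc = (1 + 1/m)*between / within;       /* relative increase      */
   df_m    = (m - 1)*(1 + 1/rel_inc)**2;       /* unadjusted df          */
   fmi     = (rel_inc + 2/(df_m + 3)) / (rel_inc + 1); /* fraction of    */
                                               /* missing information    */
   rel_eff = 1 / (1 + fmi/m);                  /* relative efficiency    */

   /* Adjusted df; assumes complete-data df of n - 1 */
   df_0    = n - 1;
   df_obs  = (1 - (1 + 1/m)*between/total) * df_0*(df_0 + 1)/(df_0 + 3);
   df_adj  = 1 / (1/df_m + 1/df_obs);

   put total= rel_inc= fmi= rel_eff= df_adj=;
run;
```

Because the inputs are the rounded values printed in the table, the results written to the log should agree with the Oxygen row only up to rounding in the last displayed digit.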

The "Parameter Estimates" table in Figure 75.4 displays the estimated mean and standard error of the mean for each variable. The inferences are based on the t distribution. The table also displays a 95% confidence interval for the mean and a t statistic with the associated p-value for the hypothesis that the population mean is equal to the value specified with the MU0= option. A detailed description of these statistics is provided in the section Combining Inferences from Multiply Imputed Data Sets.

Figure 75.4: Parameter Estimates

Parameter Estimates (25 Imputations)
Variable	Mean	Std Error	95% Lower Confidence Limit	95% Upper Confidence Limit	DF	Minimum	Maximum	Mu0	t for H0: Mean=Mu0	Pr > |t|
Oxygen	47.100050	0.987463	45.0740	49.1261	27.018	46.774347	47.434726	50.000000	-2.94	0.0067
RunTime	10.564553	0.259010	10.0336	11.0955	27.593	10.472584	10.636629	10.000000	2.18	0.0380
RunPulse	171.490381	2.199042	166.8781	176.1027	18.43	169.175377	173.421951	180.000000	-3.87	0.0011
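The t statistic, p-value, and confidence limits in Figure 75.4 follow from the mean, standard error, and degrees of freedom. The following DATA step is a sketch that recomputes them for Oxygen by using the PROBT and TINV functions with the noninteger degrees of freedom shown in the table. The variable names are illustrative.

```sas
/* Sketch: recompute the Oxygen row of the "Parameter Estimates" table. */
data _null_;
   mean = 47.100050;              /* combined mean (table)          */
   se   = 0.987463;               /* standard error (table)         */
   df   = 27.018;                 /* degrees of freedom (table)     */
   mu0  = 50;                     /* null value from MU0= option    */

   t     = (mean - mu0) / se;               /* t for H0: Mean=Mu0   */
   p     = 2*(1 - probt(abs(t), df));       /* two-sided p-value    */
   tcrit = tinv(0.975, df);                 /* 97.5th t percentile  */
   lower = mean - tcrit*se;                 /* 95% lower limit      */
   upper = mean + tcrit*se;                 /* 95% upper limit      */

   put t= p= lower= upper=;
run;
```

The values written to the log should match the Oxygen row up to rounding of the table inputs.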

In addition to the output tables, the procedure also creates a data set with imputed values. The imputed data sets are stored in the Outmi data set, with the index variable _Imputation_ indicating the imputation numbers. The data set can now be analyzed using standard statistical procedures with _Imputation_ as a BY variable.
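As an illustration of that workflow, the following sketch fits a regression to each imputed data set and then combines the results with the MIANALYZE procedure. The regression model and the data set name outreg are hypothetical choices made only for this example.

```sas
/* Sketch: analyze each imputation separately, then combine. */
/* The model and the name outreg are illustrative.           */
proc reg data=outmi outest=outreg covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;               /* one fit per imputed data set */
run;

/* Combine the per-imputation estimates and covariance matrices */
proc mianalyze data=outreg;
   modeleffects Intercept RunTime RunPulse;
run;
```

The COVOUT option writes the covariance matrix of the estimates to the OUTEST= data set, which PROC MIANALYZE needs in order to compute the within-imputation variance.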

The following statements list the first 10 observations of data set Outmi:

proc print data=outmi (obs=10);
   title 'First 10 Observations of the Imputed Data Set';
run;

The table in Figure 75.5 shows that the precision of the imputed values differs from the precision of the observed values. You can use the ROUND= option to make the imputed values consistent with the observed values.

Figure 75.5: Imputed Data Set

First 10 Observations of the Imputed Data Set

Obs	_Imputation_	Oxygen	RunTime	RunPulse
1	1	44.6090	11.3700	178.000
2	1	45.3130	10.0700	185.000
3	1	54.2970	8.6500	156.000
4	1	59.5710	8.0747	155.925
5	1	49.8740	9.2200	176.837
6	1	44.8110	11.6300	176.000
7	1	42.8857	11.9500	176.000
8	1	46.9992	10.8500	173.099
9	1	39.4420	13.0800	174.000
10	1	60.0550	8.6300	170.000
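The ROUND= option can be sketched as follows. The rounding units are assumptions chosen to match the precision of the observed values (0.001 for Oxygen, 0.01 for RunTime, and 1 for RunPulse), listed in the order of the VAR statement. The output data set name outround is hypothetical.

```sas
/* Sketch: round imputed values to the precision of the observed data. */
/* Units follow the VAR statement order; outround is illustrative.     */
proc mi data=Fitness1 seed=501213 round=0.001 0.01 1 out=outround;
   mcmc;
   var Oxygen RunTime RunPulse;
run;

proc print data=outround (obs=10);
   title 'First 10 Observations with Rounded Imputed Values';
run;
```

With these units, an imputed RunPulse value such as 155.925 would be stored as a whole number, consistent with the observed heart rates.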