The MI Procedure

Statistical Assumptions for Multiple Imputation

The MI procedure assumes that the data are from a continuous multivariate distribution and contain missing values that can occur for any of the variables. It also assumes that the data are from a multivariate normal distribution when either the regression method or the MCMC method is used.

Suppose $\mb {Y}$ is the $n{\times }p$ matrix of complete data, which is not fully observed, and denote the observed part of $\mb {Y}$ by $\mb {Y}_{\mathit{obs}}$ and the missing part by $\mb {Y}_{\mathit{mis}}$. The MI and MIANALYZE procedures assume that the missing data are missing at random (MAR); that is, the probability that an observation is missing can depend on $\mb {Y}_{\mathit{obs}}$, but not on $\mb {Y}_{\mathit{mis}}$ (Rubin 1976; Rubin 1987, p. 53).

To be more precise, suppose that $\mb {R}$ is the $n{\times }p$ matrix of response indicators whose elements are zero or one depending on whether the corresponding elements of $\mb {Y}$ are missing or observed. Then the MAR assumption is that the distribution of $\mb {R}$ can depend on $\mb {Y}_{\mathit{obs}}$ but not on $\mb {Y}_{\mathit{mis}}$:

$\mr {pr}( \mb {R} | \mb {Y}_{\mathit{obs}}, \mb {Y}_{\mathit{mis}} ) = \mr {pr}( \mb {R} | \mb {Y}_{\mathit{obs}} )$
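As a small illustration of the response-indicator matrix, the following Python sketch builds $\mb {R}$ from a hypothetical $4{\times }3$ data matrix in which `None` marks a missing value. The data values and the function name `response_indicators` are assumptions made for this illustration; they are not part of the MI procedure.

```python
# Hypothetical 4 x 3 data matrix Y; None marks a missing value.
Y = [
    [1.2, 0.5, 3.1],
    [0.7, None, 2.4],
    [1.9, 1.1, None],
    [None, 0.3, None],
]


def response_indicators(data):
    """Return R with R[i][j] = 1 if data[i][j] is observed and 0 if it is missing."""
    return [[0 if value is None else 1 for value in row] for row in data]


R = response_indicators(Y)
for row in R:
    print(row)
```

Each row of `R` has the same length as the corresponding row of `Y`, so `R` is also $n{\times }p$.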

For example, consider a trivariate data set with variables $Y_{1}$ and $Y_{2}$ fully observed, and a variable $Y_{3}$ that has missing values. MAR assumes that the probability that $Y_{3}$ is missing for an individual can be related to the individual’s values of variables $Y_{1}$ and $Y_{2}$, but not to the individual’s value of $Y_{3}$. On the other hand, if a complete case and an incomplete case for $Y_{3}$ with exactly the same values for variables $Y_{1}$ and $Y_{2}$ have systematically different values of $Y_{3}$, then there exists a response bias for $Y_{3}$, and MAR is violated.

The MAR assumption is not the same as missing completely at random (MCAR), which is a special case of MAR. Under the MCAR assumption, the missing data values are a simple random sample of all data values; the missingness does not depend on the values of any variables in the data set.
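The difference between MCAR and MAR can be made concrete with a small simulation. The sketch below is a hypothetical setup, not part of the MI procedure: it generates trivariate normal-like data in which $Y_{3}$ depends on $Y_{1}$ and $Y_{2}$, and then deletes values of $Y_{3}$ either with a constant probability (MCAR) or with a probability that depends only on the fully observed $Y_{1}$ (MAR). The sample size, the missingness probabilities, the seed, and all function names are assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(2024)  # fixed seed so the sketch is reproducible
N = 20000          # assumed sample size


def simulate(mechanism):
    """Simulate (Y1, Y2, Y3) and delete Y3 values under the given mechanism.

    Returns a list of (y1, y3, observed) tuples; observed is False when
    the Y3 value would be missing in the analysis data set.
    """
    cases = []
    for _ in range(N):
        y1 = random.gauss(0.0, 1.0)
        y2 = random.gauss(0.0, 1.0)
        y3 = y1 + y2 + random.gauss(0.0, 1.0)  # Y3 depends on Y1 and Y2
        if mechanism == "MCAR":
            p_missing = 0.5                     # same for every case
        elif mechanism == "MAR":
            p_missing = 0.8 if y1 > 0 else 0.2  # depends only on observed Y1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        cases.append((y1, y3, random.random() >= p_missing))
    return cases


def complete_case_bias(cases):
    """Mean of the observed Y3 values minus the mean of all Y3 values."""
    observed = [y3 for _, y3, obs in cases if obs]
    everything = [y3 for _, y3, _ in cases]
    return mean(observed) - mean(everything)


def within_stratum_gap(cases):
    """Among cases with Y1 > 0: mean observed Y3 minus mean missing Y3."""
    observed = [y3 for y1, y3, obs in cases if y1 > 0 and obs]
    missing = [y3 for y1, y3, obs in cases if y1 > 0 and not obs]
    return mean(observed) - mean(missing)


mcar_bias = complete_case_bias(simulate("MCAR"))
mar_cases = simulate("MAR")
mar_bias = complete_case_bias(mar_cases)
mar_gap = within_stratum_gap(mar_cases)

print(f"MCAR complete-case bias in mean(Y3): {mcar_bias:.3f}")
print(f"MAR  complete-case bias in mean(Y3): {mar_bias:.3f}")
print(f"MAR  observed-vs-missing gap given Y1 > 0: {mar_gap:.3f}")
```

In this setup the complete-case mean of $Y_{3}$ stays close to the full-data mean under MCAR, whereas under MAR it is pulled downward because cases with large $Y_{1}$ (and hence large $Y_{3}$) are missing more often. Within a group of cases that share the same missingness probability, the observed and missing $Y_{3}$ values do not differ systematically, which is the sense in which MAR rules out a response bias once the observed variables are taken into account.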

Although the MAR assumption cannot be verified with the data and it can be questionable in some situations, the assumption becomes more plausible as more variables are included in the imputation model (Schafer 1997, pp. 27–28; van Buuren, Boshuizen, and Knook 1999, p. 687).

Furthermore, the MI and MIANALYZE procedures assume that the parameters $\btheta$ of the data model and the parameters $\bphi$ of the model for the missing-data indicators are distinct. That is, knowing the values of $\btheta$ does not provide any additional information about $\bphi$, and vice versa. If both the MAR and distinctness assumptions are satisfied, the missing-data mechanism is said to be ignorable (Rubin 1987, pp. 50–54; Schafer 1997, pp. 10–11).
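
The reason these two assumptions make the mechanism ignorable can be sketched with the standard factorization argument found in the cited references. The notation below, in which $\mr {pr}(\cdot | \cdot )$ denotes a conditional density and the missing-data indicators $\mb {R}$ are modeled given the complete data, is assumed for this sketch. The joint density of the observed data and the response indicators is obtained by integrating over the missing values:

$\mr {pr}( \mb {Y}_{\mathit{obs}}, \mb {R} | \btheta , \bphi ) = \int \mr {pr}( \mb {Y}_{\mathit{obs}}, \mb {Y}_{\mathit{mis}} | \btheta ) \, \mr {pr}( \mb {R} | \mb {Y}_{\mathit{obs}}, \mb {Y}_{\mathit{mis}}, \bphi ) \, d\mb {Y}_{\mathit{mis}}$

Under MAR, the second factor does not depend on $\mb {Y}_{\mathit{mis}}$ and can be moved outside the integral:

$\mr {pr}( \mb {Y}_{\mathit{obs}}, \mb {R} | \btheta , \bphi ) = \mr {pr}( \mb {R} | \mb {Y}_{\mathit{obs}}, \bphi ) \, \mr {pr}( \mb {Y}_{\mathit{obs}} | \btheta )$

Because $\btheta$ and $\bphi$ are distinct, the factor that involves $\bphi$ carries no information about $\btheta$, so inference about $\btheta$ can be based on $\mr {pr}( \mb {Y}_{\mathit{obs}} | \btheta )$ alone, without modeling the missing-data mechanism.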