FCS Methods for Data Sets with Arbitrary Missing Patterns
For a data set with an arbitrary missing data pattern, you can use fully conditional specification (FCS) methods to impute missing values for all variables, assuming the existence of a joint distribution for these variables (Brand 1999; van Buuren 2007). The FCS method involves two phases in each imputation: the preliminary filled-in phase followed by the imputation phase.
At the filled-in phase, the missing values for all variables are filled in sequentially over the variables taken one at a time. The missing values for each variable are filled in using the specified method, or the default method for the variable if a method is not specified, with preceding variables serving as the covariates. These filled-in values provide starting values for these missing values at the imputation phase.
At the imputation phase, the missing values are imputed sequentially over the variables, taken one at a time, at each iteration. The missing values for each variable are imputed using the specified method and covariates. The default method for the variable is used if a method is not specified, and the remaining variables are used as covariates if the set of covariates is not specified. After the number of iterations specified with the NBITER= option, the imputed values for each variable at the final iteration are used for the imputation.
You can use the ORDER= option to specify the ordering of variables in the filled-in and imputation phases. The ORDER=VAR option orders the variables as specified in the VAR statement, and the default ORDER=FREQ option orders the variables by the descending frequency counts of the variables. For example, with $p$ variables in the VAR statement, the variables $Y_1, Y_2, \ldots, Y_p$ (in that order) are used in the filled-in and imputation phases, where $Y_1, Y_2, \ldots, Y_p$ are either the variables listed in the VAR statement (in that order) if the ORDER=VAR option is used, or the variables sorted by the descending frequency counts of the variables if the ORDER=FREQ option is used.
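The two orderings can be illustrated with a minimal Python sketch. It is not PROC MI code: the function name `order_variables`, the use of `np.nan` to mark missing values, the reading of "frequency count" as the number of nonmissing values, and the tie-breaking rule (ties keep the listed order) are all assumptions made for illustration.

```python
import numpy as np

def order_variables(data, names, order="FREQ"):
    """Return variable names in the order used by both phases.

    data  : 2-D float array, one column per variable, np.nan marks missing.
    names : column names in the listed (VAR statement) order.
    """
    if order == "VAR":
        return list(names)
    if order != "FREQ":
        raise ValueError("order must be 'VAR' or 'FREQ'")
    # Assumed definition: frequency count = number of nonmissing values.
    counts = np.sum(~np.isnan(data), axis=0)
    # A stable sort on the negated counts gives descending order;
    # ties keep the listed order (an assumption of this sketch).
    idx = np.argsort(-counts, kind="stable")
    return [names[i] for i in idx]

data = np.array([
    [1.0,    np.nan, 3.0],
    [2.0,    np.nan, 1.0],
    [np.nan, 5.0,    2.0],
    [4.0,    6.0,    0.5],
])
print(order_variables(data, ["a", "b", "c"], order="FREQ"))  # ['c', 'a', 'b']
print(order_variables(data, ["a", "b", "c"], order="VAR"))   # ['a', 'b', 'c']
```

Here `c` has four nonmissing values, `a` has three, and `b` has two, so the frequency ordering places the most completely observed variable first.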
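The filled-in phase replaces missing values with filled-in values for each variable. That is, with $p$ variables $Y_1, Y_2, \ldots, Y_p$ (in that order), the missing values are filled in by using the sequence,

$$
\begin{aligned}
\theta_1^{(0)} &\sim P(\theta_1 \mid Y_{1(\mathrm{obs})}) \\
Y_{1(*)}^{(0)} &\sim P(Y_1 \mid \theta_1^{(0)}) \\
Y_1^{(0)} &= (Y_{1(\mathrm{obs})}, Y_{1(*)}^{(0)}) \\
&\;\;\vdots \\
\theta_p^{(0)} &\sim P(\theta_p \mid Y_1^{(0)}, \ldots, Y_{p-1}^{(0)}, Y_{p(\mathrm{obs})}) \\
Y_{p(*)}^{(0)} &\sim P(Y_p \mid \theta_p^{(0)}) \\
Y_p^{(0)} &= (Y_{p(\mathrm{obs})}, Y_{p(*)}^{(0)})
\end{aligned}
$$

where $Y_{j(\mathrm{obs})}$ is the set of observed $Y_j$ values, $Y_{j(*)}^{(0)}$ is the set of filled-in $Y_j$ values, $Y_j^{(0)}$ is the set of both observed and filled-in $Y_j$ values, and $\theta_j^{(0)}$ is the set of simulated parameters for the conditional distribution of $Y_j$ given variables $Y_1, Y_2, \ldots, Y_{j-1}$.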
For each variable $Y_j$ with missing values, the corresponding imputation method is used to fit the model with covariates $Y_1, Y_2, \ldots, Y_{j-1}$. The observed observations for $Y_j$, which might include observations with filled-in values for $Y_1, Y_2, \ldots, Y_{j-1}$, are used in the model fitting. With this resulting model, a new model is drawn and then used to impute missing values for $Y_j$.
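The fit, draw, and impute step for a single variable can be sketched in Python as follows. This is a minimal illustration that assumes a linear regression method for a continuous variable, with the new model drawn by sampling the residual variance from a scaled inverse chi-square distribution and the coefficients from a normal distribution centered at the least squares estimate. The function name `regression_draw_impute` and the use of `np.nan` for missing values are hypothetical choices, and the sketch is not the PROC MI implementation.

```python
import numpy as np

def regression_draw_impute(y, X, rng):
    """Impute missing entries of y from a regression on X.

    y : 1-D float array, np.nan marks missing values.
    X : 2-D covariate array with no missing values (observed or
        already filled in); it may have zero columns.
    """
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    if not miss.any():
        return y.copy()
    # Design matrix with an intercept column.
    D = np.column_stack([np.ones(len(y)), X])
    D_obs, y_obs = D[~miss], y[~miss]
    n, k = D_obs.shape
    # 1. Fit the model on the rows where y is observed.
    V = np.linalg.inv(D_obs.T @ D_obs)
    beta_hat = V @ D_obs.T @ y_obs
    resid = y_obs - D_obs @ beta_hat
    df = n - k  # must be positive
    sigma2_hat = resid @ resid / df
    # 2. Draw a new model: the variance first, then the coefficients.
    sigma2_star = sigma2_hat * df / rng.chisquare(df)
    L = np.linalg.cholesky(V)  # L @ L.T equals V
    beta_star = beta_hat + np.sqrt(sigma2_star) * (L @ rng.standard_normal(k))
    # 3. Impute the missing values from the drawn model, with noise.
    out = y.copy()
    noise = np.sqrt(sigma2_star) * rng.standard_normal(miss.sum())
    out[miss] = D[miss] @ beta_star + noise
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=50)
y[:10] = np.nan  # make the first ten responses missing
filled = regression_draw_impute(y, x.reshape(-1, 1), rng)
print(np.isnan(filled).sum())  # 0
```

Drawing new parameters before imputing, instead of reusing the fitted estimates directly, is what lets the filled-in values reflect the uncertainty in the fitted model.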
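The imputation phase replaces these filled-in values $Y_{j(*)}^{(0)}$ with imputed values for each variable sequentially at each iteration. That is, with $p$ variables $Y_1, Y_2, \ldots, Y_p$ (in that order), the missing values are imputed with the sequence at iteration $t+1$,

$$
\begin{aligned}
\theta_1^{(t+1)} &\sim P(\theta_1 \mid Y_{1(\mathrm{obs})}, Y_2^{(t)}, \ldots, Y_p^{(t)}) \\
Y_{1(*)}^{(t+1)} &\sim P(Y_1 \mid \theta_1^{(t+1)}) \\
Y_1^{(t+1)} &= (Y_{1(\mathrm{obs})}, Y_{1(*)}^{(t+1)}) \\
&\;\;\vdots \\
\theta_p^{(t+1)} &\sim P(\theta_p \mid Y_1^{(t+1)}, \ldots, Y_{p-1}^{(t+1)}, Y_{p(\mathrm{obs})}) \\
Y_{p(*)}^{(t+1)} &\sim P(Y_p \mid \theta_p^{(t+1)}) \\
Y_p^{(t+1)} &= (Y_{p(\mathrm{obs})}, Y_{p(*)}^{(t+1)})
\end{aligned}
$$

where $Y_{j(\mathrm{obs})}$ is the set of observed $Y_j$ values, $Y_{j(*)}^{(t+1)}$ is the set of imputed $Y_j$ values at iteration $t+1$, $Y_{j(*)}^{(t)}$ is the set of filled-in $Y_j$ values ($t=0$) or the set of imputed $Y_j$ values at iteration $t$ ($t>0$), $Y_j^{(t+1)}$ is the set of both observed and imputed $Y_j$ values at iteration $t+1$, and $\theta_j^{(t+1)}$ is the set of simulated parameters for the conditional distribution of $Y_j$ given covariates constructed from $Y_1, \ldots, Y_{j-1}, Y_{j+1}, \ldots, Y_p$.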
At each iteration, a specified model is fitted for each variable with missing values by using observed observations for that variable, which might include observations with imputed values for other variables. With this resulting model, a new model is drawn and then used to impute missing values for the imputed variable.
The steps are iterated long enough for the results to reliably simulate an approximately independent draw of the missing values for an imputed data set.
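The two phases can be put together in a short Python sketch. It assumes that every variable is continuous and imputed with a linear regression draw, that the columns are already in the desired order, and that each variable has enough observed values to fit its model. The names `draw_impute` and `fcs_impute` are hypothetical, the `nbiter` argument only mirrors the role of the NBITER= option, and none of this is the PROC MI implementation.

```python
import numpy as np

def draw_impute(y, X, rng):
    """Regression stand-in: fit on observed y, draw parameters, impute."""
    miss = np.isnan(y)
    D = np.column_stack([np.ones(len(y)), X])   # intercept plus covariates
    Do, yo = D[~miss], y[~miss]
    V = np.linalg.inv(Do.T @ Do)
    b = V @ Do.T @ yo                           # least squares fit
    df = Do.shape[0] - Do.shape[1]
    s2 = np.sum((yo - Do @ b) ** 2) / rng.chisquare(df)   # drawn variance
    b_star = b + np.sqrt(s2) * (np.linalg.cholesky(V) @ rng.standard_normal(len(b)))
    out = y.copy()
    out[miss] = D[miss] @ b_star + np.sqrt(s2) * rng.standard_normal(miss.sum())
    return out

def fcs_impute(data, nbiter=10, seed=0):
    """Return one completed copy of data (np.nan marks missing values)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    cur = data.copy()
    p = data.shape[1]
    # Filled-in phase: variable j uses only the preceding variables,
    # which are already complete, as covariates.
    for j in range(p):
        cur[:, j] = draw_impute(data[:, j], cur[:, :j], rng)
    # Imputation phase: at each iteration, re-impute every variable in turn,
    # using the current values of all remaining variables as covariates.
    for _ in range(nbiter):
        for j in range(p):
            others = np.delete(cur, j, axis=1)
            cur[:, j] = draw_impute(data[:, j], others, rng)
    return cur

# Three correlated variables with values removed at random in every column.
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 3))
full = np.column_stack([z[:, 0], z[:, 0] + 0.5 * z[:, 1], z[:, 1] - z[:, 2]])
data = full.copy()
data[rng.random(full.shape) < 0.2] = np.nan
completed = fcs_impute(data, nbiter=10, seed=42)
print(np.isnan(completed).any())  # False
```

Only the values from the final iteration are returned, and observed values are never altered. Calling `fcs_impute` several times with different seeds would give several imputed data sets under these assumptions.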
The imputation methods used in the filled-in and imputation phases are similar to the corresponding monotone methods for monotone missing data. You can use a regression method or a predictive mean matching method to impute missing values for a continuous variable, a logistic regression method for a classification variable with a binary or ordinal response, and a discriminant function method for a classification variable with a binary or nominal response. See the sections Monotone and FCS Regression Methods, Monotone and FCS Predictive Mean Matching Methods, Monotone and FCS Discriminant Function Methods, and Monotone and FCS Logistic Regression Methods for these methods.
The FCS method requires fewer iterations than the MCMC method (van Buuren and Oudshoorn 1999). Often, as few as 5 or 10 iterations are enough to produce satisfactory results (van Buuren and Oudshoorn 1999; Brand 1999).