SAS/STAT Software

Missing Data Analysis

Most SAS statistical procedures exclude observations with any missing variable values from the analysis. These observations are called incomplete cases. Although analyzing only complete cases has the advantage of simplicity, the information contained in the incomplete cases is lost. This approach also ignores possible systematic differences between the complete cases and the incomplete cases, and the resulting inference might not be applicable to the population of all cases, especially with a small number of complete cases.

Another strategy for handling missing data is multiple imputation, which replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. Each of the multiply imputed data sets is then analyzed by using standard procedures for complete data, and the results from these analyses are combined. No matter which complete-data analysis is used, the process of combining results from different data sets is essentially the same.
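
The combining step is usually described by Rubin's rules. As a sketch in notation that does not appear in the text above (assumed symbols: $\hat{Q}_i$ and $\hat{U}_i$ are the point estimate and its variance estimate from the $i$-th of $m$ imputed data sets), the pooled estimate and its total variance for a single parameter can be written as:

```latex
\begin{aligned}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, &
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i, \\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2, &
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B .
\end{aligned}
```

Here $\bar{U}$ is the within-imputation variance and $B$ is the between-imputation variance. The formulas use only the estimates and their variances, which is why the combining step does not depend on which complete-data analysis produced them.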

The SAS/STAT missing data analysis procedures include the following:

CALIS Procedure

The CALIS procedure fits structural equation models, which express relationships among a system of variables that can be either observed variables (manifest variables) or unobserved hypothetical variables (latent variables). PROC CALIS enables you to do the following:

  • estimate parameters and test hypotheses for constrained and unconstrained problems in the following:
    • multiple and multivariate linear regression
    • linear measurement-error models
    • path analysis and causal modeling
    • simultaneous equation models with reciprocal causation
    • exploratory and confirmatory factor analysis of any order
    • canonical correlation
    • a wide variety of other (non)linear latent variable models
  • estimate parameters by using the following criteria:
    • unweighted least squares
    • generalized least squares
    • weighted least squares
    • diagonally weighted least squares
    • maximum likelihood
    • full information maximum likelihood
    • maximum likelihood with Satorra-Bentler scaled model fit chi-square statistic and sandwich-type standard error estimation
    • robust estimation with maximum likelihood model evaluation
  • specify models using the following modeling languages:
    • FACTOR—supports the input of factor-variable relations
    • LINEQS—uses equations to describe variable relationships
    • LISMOD—utilizes LISREL model matrices for defining models
    • MSTRUCT—supports direct parameterization in the mean and covariance matrices
    • PATH—provides an intuitive causal path specification interface
    • RAM—utilizes the formulation of the reticular action model
    • REFMODEL—provides a quick way for model referencing and respecification
  • choose among the following optimization algorithms:
    • Levenberg-Marquardt algorithm (Moré 1978)
    • trust-region algorithm (Gay 1983)
    • Newton-Raphson algorithm with line search
    • ridge-stabilized Newton-Raphson algorithm
    • various quasi-Newton and dual quasi-Newton algorithms: Broyden-Fletcher-Goldfarb-Shanno and Davidon-Fletcher-Powell, including a sequential quadratic programming algorithm for processing nonlinear equality and inequality constraints
    • various conjugate gradient algorithms: automatic restart algorithm of Powell (1977), Fletcher-Reeves, Polak-Ribiere, and conjugate descent algorithm of Fletcher (1980)
  • use the following methods to automatically generate initial values for the optimization process:
    • two-stage least squares estimation
    • instrumental variable factor analysis
    • approximate factor analysis
    • ordinary least squares estimation
    • McDonald's (McDonald and Hartmann 1992) method
  • formulate general equality and inequality constraints by using programming statements
  • specify free unnamed parameters in all models
  • analyze linear dependencies in the information matrix (approximate Hessian matrix) that might be helpful in detecting unidentified models
  • perform multiple-group analysis. Groups can also be fitted by multiple models simultaneously.
  • specify linear and nonlinear equality and inequality constraints on the parameters with several different statements, depending on the type of input
  • compute Lagrange multiplier test indices for simple constant and equality parameter constraints and for active boundary constraints
  • produce a SAS data set that contains information about the optimal parameter estimates (parameter estimates, gradient, Hessian, projected Hessian and Hessian of Lagrange function for constrained optimization, the information matrix, and standard errors)
  • produce a SAS data set that contains residuals and, for exploratory factor analysis, the rotated and unrotated factor loadings
  • perform analysis of multiple samples with equal sample size by analyzing a moment supermatrix that contains the individual moment matrices as block diagonal submatrices
  • perform residual analysis at the case level or observation level
  • perform direct robust estimation based on residual weighting or two-stage robust estimation based on analyzing the robust mean and covariance matrices
  • input the model fit information of the customized baseline model of your choice. PROC CALIS then computes various fit indices (mainly the incremental fit indices) based on your customized model fit rather than the fit of the default uncorrelatedness model.
  • compute weighted covariances or correlations
  • create a SAS data set that corresponds to any output table
  • perform BY group processing, which enables you to obtain separate analyses on grouped observations
  • automatically create the following types of graphs:
    • distribution of residuals in the moment matrices
    • case-level residual diagnostics
    • path diagrams
For further details, see CALIS Procedure

GEE Procedure

The GEE procedure fits generalized linear models for longitudinal data by using the generalized estimating equations (GEE) estimation method of Liang and Zeger (1986). The GEE method fits a marginal model to longitudinal data and is commonly used to analyze longitudinal data when the population-average effect is of interest. The following are highlights of the GEE procedure's features:

  • performs weighted GEE estimation when missing data are missing at random (MAR)
  • supports the following response variable distributions:
    • binomial
    • gamma
    • inverse Gaussian
    • negative binomial
    • normal
    • Poisson
    • multinomial
  • supports the following link functions:
    • complementary log-log
    • identity
    • log
    • logit
    • probit
    • reciprocal
    • power with exponent -2
  • supports the following correlation structures:
    • first-order autoregressive
    • exchangeable
    • independent
    • m-dependent
    • unstructured
    • fixed (user-specified)
  • performs alternating logistic regression analysis for ordinal and binary data
  • supports ESTIMATE, LSMEANS, and OUTPUT statements
  • creates a SAS data set that corresponds to any output table
  • automatically creates graphs by using ODS Graphics
For further details, see GEE Procedure
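
To make two of the correlation structures listed above concrete, the following Python sketch builds exchangeable and first-order autoregressive working correlation matrices for a cluster of n repeated measurements. It is an illustration only, not SAS code: the function names and the parameter `alpha` are assumptions made for this example, and the sketch shows only the shape of the matrices, not how PROC GEE estimates the correlation parameter.

```python
def exchangeable(n, alpha):
    """n x n working correlation: every pair of distinct times has correlation alpha."""
    return [[1.0 if j == k else alpha for k in range(n)] for j in range(n)]


def ar1(n, alpha):
    """n x n first-order autoregressive working correlation: alpha ** |j - k|."""
    return [[alpha ** abs(j - k) for k in range(n)] for j in range(n)]


# Hypothetical cluster of three repeated measurements.
for row in ar1(3, 0.5):
    print(row)
# [1.0, 0.5, 0.25]
# [0.5, 1.0, 0.5]
# [0.25, 0.5, 1.0]
```

Under the autoregressive structure the correlation decays with the distance between measurement times, whereas the exchangeable structure treats all pairs within a cluster alike.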

MCMC Procedure

The MCMC procedure is a general-purpose Markov chain Monte Carlo (MCMC) simulation procedure that is designed to fit a wide range of Bayesian models. PROC MCMC enables you to do the following:

  • specify a likelihood function for the data, prior distributions for the parameters, and hyperprior distributions if you are fitting hierarchical models
  • obtain samples from the corresponding posterior distributions, produce summary and diagnostic statistics, and save the posterior samples in an output data set that can be used for further analysis
  • analyze data that have any likelihood, prior, or hyperprior as long as these functions are programmable using the SAS DATA step functions
  • enter parameters into a model linearly or in any nonlinear functional form
  • fit dynamic linear models, state space models, autoregressive models, or other models that have a conditionally dependent structure on either the random-effects parameters or the response variable
  • fit models that contain differential equations or models that require integration
  • use an adaptive blocked random-walk Metropolis algorithm that uses a normal or t proposal distribution by default
  • use a Hamiltonian Monte Carlo algorithm with a fixed step size and predetermined number of steps
  • use a No-U-Turn sampler with the Hamiltonian algorithm
  • create a user-defined sampler as an alternative to the default algorithms
  • create a data set that contains random samples from the posterior predictive distribution of the response variable
  • perform BY group processing, which enables you to obtain separate analyses on grouped observations
  • take advantage of multiple processors
  • create a SAS data set that corresponds to any output table
  • automatically create graphs by using ODS Graphics
For further details, see MCMC Procedure
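
The accept/reject step at the core of a random-walk Metropolis sampler can be sketched in a few lines of Python. This is a plain, non-adaptive, single-parameter version written for illustration; it is not the adaptive blocked algorithm that PROC MCMC uses by default, and it is not SAS code. The model (normal data with known standard deviation and a normal prior on the mean), the function names, the data values, and the tuning values are all assumptions made for this example.

```python
import math
import random


def log_posterior(mu, data, sigma=1.0, prior_sd=10.0):
    """Log posterior (up to a constant) of a normal mean with known sigma."""
    log_lik = -sum((y - mu) ** 2 for y in data) / (2 * sigma ** 2)
    log_prior = -mu ** 2 / (2 * prior_sd ** 2)   # prior: mu ~ N(0, prior_sd^2)
    return log_lik + log_prior


def random_walk_metropolis(data, n_iter=20000, burn_in=2000, step=0.5, seed=1):
    """Return post-burn-in draws of mu from a random-walk Metropolis chain."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    draws = []
    for i in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)      # symmetric normal proposal
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        if i >= burn_in:
            draws.append(mu)
    return draws


# Hypothetical data, for illustration only.
data = [4.8, 5.1, 5.3, 4.9, 5.2, 5.0, 4.7, 5.4, 5.1, 4.9]
draws = random_walk_metropolis(data)
posterior_mean = sum(draws) / len(draws)
print(round(posterior_mean, 2))
```

For this conjugate model the posterior mean is available in closed form, so the sampled mean can be checked against it; in realistic models no such check exists, which is why diagnostic statistics matter.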

MI Procedure

The MI procedure is a multiple imputation procedure that creates multiply imputed data sets for incomplete p-dimensional multivariate data. It uses methods that incorporate appropriate variability across the m imputations. The imputation method of choice depends on the patterns of missingness in the data and the type of the imputed variable. The following are highlights of the MI procedure's features:

  • creates multiply imputed data sets for incomplete multivariate data
  • applies parametric and nonparametric methods for data sets with monotone missing values:
    • continuous variables:
      • regression method
      • predictive mean matching method
      • propensity score method
    • classification variables:
      • logistic regression method
      • discriminant function method
  • enables you to specify a multivariate imputation that uses fully conditional specification (FCS) methods
  • applies a Markov chain Monte Carlo (MCMC) method for data sets with arbitrary missing patterns
  • uses the EM algorithm to compute the MLE for (μ,Σ), the means and covariance matrix, of a multivariate normal distribution from the input data set with missing values
  • provides the following transformation methods for data that do not satisfy the assumption of multivariate normality:
    • Box-Cox
    • exponential
    • logarithmic
    • logit
    • power
  • performs sensitivity analysis by generating multiple imputations for different scenarios under the assumption that the data are missing not at random
  • performs BY group processing, which enables you to obtain separate analyses on grouped observations
  • creates a SAS data set that corresponds to any output table
  • automatically creates graphs by using ODS Graphics
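
Because the choice of imputation method depends on the pattern of missingness, it can help to check whether a data set has a monotone missing pattern for a given variable order. A pattern is monotone when, for every observation, a missing value in one variable implies that all subsequent variables are also missing. The Python sketch below illustrates that definition; it is not SAS code, and the function name and the use of `None` as the missing-value marker are assumptions made for this example.

```python
def is_monotone(rows):
    """Return True if missingness is monotone in the given column order.

    rows: list of equal-length sequences; None marks a missing value.
    A row fits a monotone pattern if, once a value is missing,
    every later value in that row is missing too.
    """
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False   # an observed value follows a missing one
    return True


# Hypothetical data with three variables per observation.
monotone = [[1, 2, 3], [1, 2, None], [1, None, None]]
arbitrary = [[1, None, 3], [1, 2, None]]
print(is_monotone(monotone))    # True
print(is_monotone(arbitrary))   # False
```

The result depends on the column order, so a data set that fails the check in one order might pass it after the variables are reordered.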

After the m complete data sets are analyzed by using standard SAS procedures, the MIANALYZE procedure can be used to generate valid statistical inferences about the parameters of interest by combining the results from the m analyses.

For further details, see MI Procedure


MIANALYZE Procedure

The MIANALYZE procedure combines the results of the analyses of imputations and generates valid statistical inferences.

A companion procedure, PROC MI, creates multiply imputed data sets for incomplete multivariate data. It uses methods that incorporate appropriate variability across the m imputations.

The analyses of imputations are obtained by using standard SAS procedures (such as PROC REG) for complete data. No matter which complete-data analysis is used, the process of combining results from different imputed data sets is essentially the same and results in valid statistical inferences that properly reflect the uncertainty due to missing values. The MIANALYZE procedure combines these results to derive valid inferences.

The MIANALYZE procedure reads the parameter estimates and the associated standard errors or covariance matrices that the standard statistical procedure computes for each imputed data set. It then derives valid univariate inferences for these parameters. With an additional assumption about the population between-imputation and within-imputation covariance matrices, multivariate inferences based on Wald tests can also be derived.
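
The univariate combining step can be illustrated with a short Python sketch of Rubin's rules for one scalar parameter. This is not SAS code and makes no claim to reproduce PROC MIANALYZE output; the function name, the dictionary keys, and the input numbers are assumptions made for this example. It takes one point estimate and one squared standard error per imputed data set and returns the pooled estimate, its standard error, and the degrees of freedom of the t reference distribution.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed data sets (Rubin's rules).

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    q_bar = mean(estimates)               # pooled point estimate
    u_bar = mean(variances)               # within-imputation variance
    b = variance(estimates)               # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b       # total variance
    # Degrees of freedom; infinite when all estimates agree exactly.
    if b == 0:
        df = math.inf
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return {"estimate": q_bar, "std_err": math.sqrt(total), "df": df}


# Hypothetical results from m = 5 imputed data sets (illustrative numbers only).
est = [1.0, 1.2, 0.8, 1.1, 0.9]
se = [0.2, 0.2, 0.2, 0.2, 0.2]
pooled = pool_rubin(est, [s ** 2 for s in se])
print(pooled)
```

With these inputs the within-imputation variance is 0.04 and the between-imputation variance is 0.025, so the pooled standard error is about 0.265, larger than any single-imputation standard error of 0.2. The difference reflects the extra uncertainty due to the missing values.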

For some parameters of interest, you can use TEST statements to test linear hypotheses about the parameters. For other parameters, estimates and associated covariance matrices are not straightforward to compute with standard SAS statistical procedures, so special techniques are required. Examples include correlation coefficients between two variables and ratios of variable means.
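
For correlation coefficients, one commonly used special technique is to combine the estimates on Fisher's z scale, where the sampling distribution is approximately normal with variance 1/(n - 3), and then transform the pooled value back. The Python sketch below illustrates that idea under stated assumptions: every imputed data set has the same number of observations n, and the function name, dictionary keys, and input values are hypothetical. It is not SAS code and is not presented as the procedure's own implementation.

```python
import math
from statistics import mean, variance


def pool_correlation(correlations, n):
    """Pool sample correlations from m imputed data sets on Fisher's z scale.

    correlations: one sample correlation per imputed data set
    n: number of observations in each imputed data set (assumed equal)
    """
    m = len(correlations)
    if m < 2:
        raise ValueError("need at least two imputations")
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = [math.atanh(r) for r in correlations]   # Fisher's z transformation
    within = 1.0 / (n - 3)                      # approximate variance of each z
    between = variance(z)                       # between-imputation variance
    total = within + (1 + 1 / m) * between      # total variance on the z scale
    z_bar = mean(z)
    return {
        "z": z_bar,
        "z_std_err": math.sqrt(total),
        "r": math.tanh(z_bar),                  # pooled correlation, back-transformed
    }


# Hypothetical correlations from m = 5 imputed data sets of 103 observations each.
rs = [0.50, 0.55, 0.45, 0.52, 0.48]
result = pool_correlation(rs, n=103)
print(result)
```

Pooling on the z scale rather than averaging the correlations directly keeps the normal approximation reasonable when the correlations are far from zero.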

For further details, see MIANALYZE Procedure


SURVEYIMPUTE Procedure

The SURVEYIMPUTE procedure imputes missing values of an item in a data set by replacing them with observed values from the same item. The principles by which the imputation is performed are particularly useful for survey data. The following are highlights of the SURVEYIMPUTE procedure's features:

  • performs fully efficient fractional hot-deck imputation
  • performs traditional hot-deck imputation with the following donor selection methods:
    • approximate Bayesian bootstrap
    • simple random samples without replacement
    • simple random samples with replacement
    • probability proportional to respondent weights with replacement
  • computes imputation-adjusted replicate weights
  • computes imputation-adjusted balanced repeated replication (BRR) weights
  • computes imputation-adjusted jackknife weights
  • provides a CELLS statement that names the variables that identify the imputation cells
  • imputes variables jointly or independently for the fully efficient fractional imputation method
  • creates a SAS data set that contains the imputed data
For further details, see SURVEYIMPUTE Procedure
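
The basic idea of traditional hot-deck imputation within imputation cells can be sketched in Python as follows. Each missing value of an item is replaced by an observed value of the same item from a donor in the same cell, with donors chosen by simple random sampling with replacement. This is an unweighted illustration only, not SAS code: the function name, the record layout, the use of `None` for missing values, and the example data are assumptions, and the sketch does not compute replicate weights or perform fractional imputation.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, item, cell_vars, seed=0):
    """Fill missing `item` values with donor values from the same imputation cell.

    records: list of dicts; a missing item is represented by None
    item: key of the variable to impute
    cell_vars: keys whose combined values define the imputation cells
    Donors are selected by simple random sampling with replacement.
    """
    rng = random.Random(seed)
    donors = defaultdict(list)
    for rec in records:
        if rec[item] is not None:
            donors[tuple(rec[v] for v in cell_vars)].append(rec[item])

    imputed = []
    for rec in records:
        new_rec = dict(rec)                    # leave the input records unchanged
        if new_rec[item] is None:
            pool = donors.get(tuple(rec[v] for v in cell_vars))
            if pool:                           # stays missing if the cell has no donor
                new_rec[item] = rng.choice(pool)
        imputed.append(new_rec)
    return imputed


# Hypothetical survey records; region defines the imputation cells.
survey = [
    {"region": "N", "income": 40},
    {"region": "N", "income": None},
    {"region": "N", "income": 55},
    {"region": "S", "income": 30},
    {"region": "S", "income": None},
]
completed = hot_deck_impute(survey, "income", ["region"], seed=1)
print(completed)
```

Restricting donors to the same cell means that an imputed value always comes from a respondent who shares the cell-defining characteristics of the nonrespondent.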