Enhancements in SAS/STAT® 14.1 Software


SAS/STAT 14.1 introduces two new procedures and adds new features to many existing analyses. This release is available with the third maintenance release for Base SAS® 9.4.

Missing Survey Data: Imputations

Nonresponse is a common problem in surveys. The resulting estimators suffer from nonresponse bias if the nonrespondents differ from the respondents, and complete-case estimators (which use only the observed units) can also be less precise. Imputation can reduce nonresponse bias and, by producing a complete imputed data set, yields consistent analyses across procedures.

The SURVEYIMPUTE procedure imputes missing values of an item in a sample survey by replacing them with observed values from the same item. Imputation methods include single and multiple hot-deck imputation and fully efficient fractional imputation (FEFI). Donor selection techniques include simple random selection with or without replacement, probability proportional to weights selection, and approximate Bayesian bootstrap selection. When you use FEFI, PROC SURVEYIMPUTE produces replicate weights that appropriately account for the imputation. You can use these replicate weights in any survey analysis procedure to correctly estimate the variance of an estimator that uses the imputed data.
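As a minimal sketch of how this might look (the data set, variable names, and imputation cells here are hypothetical, not from the release notes), a single hot-deck imputation with donors selected by simple random sampling without replacement could be specified as:

```sas
/* Hypothetical example: impute missing values of Income by single
   hot-deck imputation, drawing donors from the same Region cell. */
proc surveyimpute data=MySurvey method=hotdeck(selection=srswor) seed=1234;
   var Income;              /* item with missing values          */
   cells Region;            /* donors come from the same cell    */
   weight SamplingWeight;   /* survey sampling weights           */
   output out=Imputed;      /* data set with imputed values      */
run;
```

With METHOD=FEFI instead, the OUTPUT data set would also carry the replicate weights that the analysis procedures need to account for the imputation.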

Big Data Modeling

The new GAMPL procedure is a high-performance procedure that fits generalized additive models by penalized likelihood estimation. Based on low-rank regression splines, these models are powerful tools for nonparametric regression and smoothing. Generalized additive models are extensions of generalized linear models. They relax the linearity assumption in generalized linear models by allowing spline terms in order to characterize nonlinear dependency structures. Each spline term is constructed by the thin-plate regression spline technique. A roughness penalty is applied to each spline term by a smoothing parameter that controls the balance between goodness of fit and the roughness of the spline curve. PROC GAMPL fits models for standard distributions in the exponential family, such as normal, Poisson, and gamma distributions.
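A sketch of the syntax described above (data set and variable names are hypothetical) for a Poisson model that combines a linear term with two spline terms:

```sas
/* Hypothetical example: Poisson generalized additive model with
   one parametric (linear) term and two thin-plate regression
   spline terms, each with its own smoothing parameter.         */
proc gampl data=MyData;
   model Count = param(Treatment) spline(Age) spline(Dose) / dist=poisson;
run;
```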

Classification and Regression Trees

Classification and regression trees construct predictive models; classification trees predict a categorical response while regression trees predict a continuous response. Tree models partition the data into segments called nodes by applying splitting rules, which assign an observation to a node based on the value of one of the predictors. The partitioning is done recursively, starting with the root node that contains all the data, continuing down to the terminal nodes, which are called leaves. The resulting tree model typically fits the training data well, but might not necessarily fit new data well. To prevent overfitting, a pruning method can be applied to find a smaller subtree that balances the goals of fitting both the training data and new data. The subtree that best accomplishes this is determined by using validation data or cross validation. The partitioning can be represented graphically with a decision tree, which provides an accessible interpretation of the resulting model.

The HPSPLIT procedure creates a classification or regression tree model. It is a high-performance procedure, which means that it can run in distributed mode with a SAS High-Performance Statistics product license; otherwise, it runs in single-machine mode. It provides choices of algorithms for both tree growth and pruning, a variety of options for handling missing values, whole and partial tree plots, cross validation plots, and ROC curves. It also produces an output data set with node and leaf assignments, predicted levels and posterior probabilities for a classification tree, and predicted response values for a regression tree.
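As a sketch (the data set, variables, and specific option values below are hypothetical), growing a classification tree, pruning it by cost complexity with cross validation, and scoring the input data might look like:

```sas
/* Hypothetical example: classification tree for a binary response,
   pruned by cost complexity using 10-fold cross validation.       */
proc hpsplit data=MyData cvmethod=random(10) seed=1234;
   class Default Region JobType;
   model Default = Region JobType Income DebtRatio;
   grow entropy;            /* splitting criterion for tree growth  */
   prune costcomplexity;    /* select the best-fitting subtree      */
   output out=Scored;       /* node/leaf assignments and predictions */
run;
```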


One of the many exciting performance improvements in the 14.1 release is the new FASTQUAD option in the GLIMMIX procedure, which enables you to fit multilevel models that have been computationally infeasible in the past.

In many applications, data are observed on nested units (for example, students nested within classrooms within schools). The marginal distribution of a subject’s data in such a model is represented by a multidimensional integral. The FASTQUAD option in PROC GLIMMIX implements the multilevel quadrature algorithm of Pinheiro and Chao (2006). This algorithm reduces the single integral over many dimensions to a sum of integrals, each with fewer dimensions. With this option, you can now apply GLIMMIX to multilevel models that would have been too large or too slow to fit previously, and often do so in just seconds.
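A sketch of such a multilevel fit (data set, variables, and the number of quadrature points are hypothetical) might look like:

```sas
/* Hypothetical example: three-level logistic model (students within
   classrooms within schools) fit by adaptive quadrature, with the
   FASTQUAD suboption enabling the multilevel quadrature algorithm. */
proc glimmix data=MyData method=quad(fastquad qpoints=5);
   class School Classroom;
   model Pass(event='1') = StudyHours / dist=binary link=logit;
   random intercept / subject=School;             /* level 3 */
   random intercept / subject=Classroom(School);  /* level 2 */
run;
```

Without FASTQUAD, the quadrature grid grows with the total number of random effects across both levels; the Pinheiro and Chao algorithm keeps the integration dimension manageable.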

Bayesian Analysis

The MCMC procedure has been updated with new sampling algorithms for continuous parameters: Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS). These algorithms use Hamiltonian dynamics to make distant proposals, which can dramatically improve sampling efficiency: fewer draws are needed to achieve the same accuracy.

PROC MCMC now supports models that require lagging and leading variables, enabling you to easily fit models such as dynamic linear models, state space models, and autoregressive models. An ordinary differential equation solver and a general integration function have also been added, which enable the procedure to fit models that contain differential equations (for example, pharmacokinetic models) or models that require integration (for example, marginal likelihood models). And last but not least, the PREDDIST statement in PROC MCMC now supports prediction from a marginalized random-effects model, which enables more realistic and useful prediction from many models.
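As a sketch of invoking the new samplers (the data set, priors, and in particular the exact option syntax are our assumptions, not taken from the release notes), a simple regression sampled with NUTS might look like:

```sas
/* Hypothetical example: simple linear regression with the
   No-U-Turn Sampler; the ALGORITHM= option name is assumed. */
proc mcmc data=MyData nmc=5000 seed=1234 algorithm=nuts;
   parms beta0 0 beta1 0 sigma2 1;
   prior beta:  ~ normal(0, var=1e4);
   prior sigma2 ~ igamma(0.01, scale=0.01);
   model y ~ normal(beta0 + beta1*x, var=sigma2);
run;
```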

Other Enhancements

For More Information

SAS/STAT 14.1 is now available. For complete information about all SAS/STAT releases, see the documentation at
