Predictive Modeling

The term "predictive modeling" refers to the practice of fitting models primarily for the purpose of predicting out-of-sample outcomes rather than for performing statistical inference. There are procedures included in this category that are capable of fitting a wide variety of models, including the following:

The SAS/STAT predictive modeling procedures include the following:

ADAPTIVEREG Procedure


The ADAPTIVEREG procedure fits multivariate adaptive regression splines. The method is a nonparametric regression technique that combines regression splines with model selection methods. It does not assume parametric model forms and does not require specification of knot values for constructing regression spline terms. Instead, it constructs spline basis functions in an adaptive way by automatically selecting appropriate knot values for different variables, and it obtains reduced models by applying model selection techniques. The procedure enables you to do the following:

  • specify classification variables with ordering options
  • partition your data into training, validation, and testing roles
  • specify the distribution family used in the model
  • specify the link function in the model
  • specify an offset
  • specify the maximum number of basis functions that can be used in the final model
  • specify the maximum interaction levels for effects that could potentially enter the model
  • specify the incremental penalty for increasing the number of variables in the model
  • specify the effects to be included in the final model
  • request an additive model for which only main effects are included in the fitted model
  • specify the parameter that controls the number of knots considered for each variable
  • force effects in the final model or restrict variables in linear forms
  • specify options for fast forward selection
  • perform leave-one-out and k-fold cross validation
  • produce a graphical representation of the selection process, model fit, functional components, and fit diagnostics
  • create an output data set that contains predicted values and residuals
  • create an output data set that contains the design matrix of formed basis functions
  • specify multiple SCORE statements, which create new SAS data sets that contain predicted values and residuals
  • perform BY group processing to obtain separate analyses on grouped observations
  • automatically produce graphs by using ODS Graphics
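
As a minimal sketch, a typical call might look as follows; the data set WORK.TRAIN and the variables Y and X1-X5 are hypothetical placeholders:

  proc adaptivereg data=work.train plots=all;
     model y = x1-x5;                           /* knots and basis functions are selected adaptively */
     output out=work.pred predicted=p_y residual=r_y;   /* predicted values and residuals */
  run;
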
For further details, see ADAPTIVEREG Procedure

GLMSELECT Procedure


The GLMSELECT procedure performs effect selection in the framework of general linear models. A variety of model selection methods are available, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). The procedure offers extensive capabilities for customizing the selection with a wide variety of selection and stopping criteria, from traditional and computationally efficient significance-level-based criteria to more computationally intensive validation-based criteria. The procedure also provides graphical summaries of the selection search. The following are highlights of the GLMSELECT procedure's features:

Model Specification

  • supports different parameterizations for classification effects
  • supports any degree of interaction (crossed effects) and nested effects
  • supports hierarchy among effects
  • supports partitioning of data into training, validation, and testing roles
  • supports constructed effects including spline and multimember effects

Display and Output

  • produces graphical representation of the selection process
  • produces output data sets that contain predicted values and residuals
  • produces an output data set that contains the design matrix
  • produces macro variables that contain selected models
  • supports parallel processing of BY groups
  • supports multiple SCORE statements

Selection Control

  • provides multiple effect selection methods including the following:
    • forward selection
    • backward elimination
    • stepwise regression
    • least angle regression (LAR)
    • least absolute shrinkage and selection operator (LASSO)
    • group LASSO
    • elastic net
    • hybrid versions of LAR and LASSO
  • enables selection from a very large number of effects (tens of thousands)
  • supports safe screening and sure independence screening methods
  • offers selection of individual levels of classification effects
  • provides effect selection based on a variety of selection criteria
  • provides stopping rules based on a variety of model evaluation criteria
  • provides leave-one-out and k-fold cross validation
  • supports data resampling and model averaging
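
For example, a minimal sketch of a LASSO fit chosen by five-fold cross validation might look as follows; the data set WORK.TRAIN and the variables Y, C1, and X1-X10 are hypothetical:

  proc glmselect data=work.train plots=coefficients;
     class c1;                                   /* classification effect */
     model y = c1 x1-x10
           / selection=lasso(choose=cv stop=none)
             cvmethod=random(5);                 /* five-fold cross validation */
     output out=work.pred predicted=p_y;
  run;
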
For further details, see GLMSELECT Procedure

HPLOGISTIC Procedure


The HPLOGISTIC procedure is a high-performance statistical procedure that fits logistic regression models for binary, binomial, and multinomial data. The following are highlights of the HPLOGISTIC procedure's features:

  • provides model-building syntax with the CLASS and effect-based MODEL statements
  • provides response-variable options as in the LOGISTIC procedure
  • performs maximum likelihood estimation
  • provides multiple link functions
  • provides cumulative link models for ordinal data and generalized logit modeling for unordered multinomial data
  • enables model building (variable selection) through the SELECTION statement
  • provides a WEIGHT statement for weighted analysis
  • provides a FREQ statement for grouped analysis
  • provides an OUTPUT statement to produce a data set with predicted probabilities and other observationwise statistics
  • enables you to run in distributed mode on a cluster of machines that distribute the data and the computations
  • enables you to run in single-machine mode on the server where SAS is installed
  • exploits all the available cores and concurrent threads, regardless of execution mode
  • performs parallel reads of input data and parallel writes of output data when the data source is the appliance database
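
A minimal sketch of a call that fits and scores a binary-response model follows; the data set WORK.TRAIN and the variables BAD, C1, and X1-X5 are hypothetical:

  proc hplogistic data=work.train;
     class c1;
     model bad(event='1') = c1 x1-x5;    /* model the probability that bad=1 */
     selection method=backward;          /* variable selection */
     output out=work.scored predicted=p_bad;
  run;
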
For further details, see HPLOGISTIC Procedure

HPGENSELECT Procedure


The HPGENSELECT procedure is a high-performance procedure that provides model fitting and model building for generalized linear models. The following are highlights of the HPGENSELECT procedure's features:

  • fits models for standard distributions in the exponential family
  • fits multinomial models for ordinal and nominal responses
  • fits zero-inflated Poisson and negative binomial models for count data
  • supports the following link functions:
    • complementary log-log
    • generalized logit
    • identity
    • reciprocal
    • reciprocal squared
    • logarithm
    • logit
    • log-log
    • probit
  • provides forward, backward, and stepwise variable selection
  • supports the following selection criteria:
    • AIC (Akaike's information criterion)
    • AICC (small-sample bias corrected version of Akaike's information criterion)
    • BIC (Schwarz Bayesian criterion)
  • writes SAS DATA step code for computing predicted values of the fitted model either to a file or to a catalog entry
  • creates a data set that contains observationwise statistics that are computed after the model is fitted
  • enables you to specify how observations in the input data set are to be logically partitioned into disjoint subsets for model training, validation, and testing
  • performs weighted estimation
  • runs in either single-machine mode or distributed mode
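
A minimal sketch of a Poisson regression with stepwise selection follows; the data set WORK.CLAIMS, the variables NCLAIMS and X1-X5, and the file name score.sas are hypothetical:

  proc hpgenselect data=work.claims;
     model nclaims = x1-x5 / dist=poisson link=log;   /* log-linear model for count data */
     selection method=stepwise details=all;
     output out=work.pred predicted=p_claims;
     code file='score.sas';                           /* write DATA step scoring code */
  run;
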
For further details, see HPGENSELECT Procedure

HPREG Procedure


The HPREG procedure is a high-performance procedure that provides model fitting and model building for ordinary least squares linear models. The supported models are standard general linear models with independently and identically distributed errors; they can contain main effects that involve both continuous and classification variables, as well as interaction effects of these variables. The procedure offers extensive capabilities for customizing the model selection with a wide variety of selection and stopping criteria, from traditional and computationally efficient significance-level-based criteria to more computationally intensive validation-based criteria. The following are highlights of the HPREG procedure's features:

  • supports GLM and reference parameterization for classification effects
  • supports any degree of interaction (crossed effects) and nested effects
  • supports hierarchy among effects
  • supports partitioning of data into training, validation, and testing roles
  • supports a FREQ statement for grouped analysis
  • supports a WEIGHT statement for weighted analysis
  • provides multiple effect-selection methods
  • enables selection from a very large number of effects (tens of thousands)
  • offers selection of individual levels of classification effects
  • provides effect selection based on a variety of selection criteria
  • provides stopping rules based on a variety of model evaluation criteria
  • supports stopping and selection rules based on external validation and leave-one-out cross validation
  • produces output data sets that contain predicted values, residuals, studentized residuals, confidence limits, and influence statistics
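
A minimal sketch that combines data partitioning with stepwise selection follows; the data set WORK.TRAIN and the variables Y, C1, and X1-X10 are hypothetical:

  proc hpreg data=work.train;
     partition fraction(validate=0.3);   /* hold out 30% of the data for validation */
     class c1;
     model y = c1 x1-x10;
     selection method=stepwise;
     output out=work.pred predicted=p_y residual=r_y;
  run;
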
For further details, see HPREG Procedure

PLS Procedure


The PLS procedure fits models by using any one of a number of linear predictive methods including partial least squares (PLS). Ordinary least squares regression, as implemented in SAS/STAT procedures such as PROC GLM and PROC REG, has the single goal of minimizing sample response prediction error, seeking linear functions of the predictors that explain as much variation in each response as possible. The techniques implemented in the PLS procedure have the additional goal of accounting for variation in the predictors, under the assumption that directions in the predictor space that are well sampled should provide better prediction for new observations when the predictors are highly correlated. All of the techniques implemented in the PLS procedure work by extracting successive linear combinations of the predictors, called factors (also called components, latent vectors, or latent variables), which optimally address one or both of these two goals—explaining response variation and explaining predictor variation. In particular, the method of partial least squares balances the two objectives, seeking factors that explain both response and predictor variation. The following are highlights of the PLS procedure's features:

  • implements the following techniques:
    • principal components regression, which extracts factors to explain as much predictor sample variation as possible
    • reduced rank regression, which extracts factors to explain as much response variation as possible. This technique, also known as (maximum) redundancy analysis, differs from multivariate linear regression only when there are multiple responses.
    • partial least squares regression, which balances the two objectives of explaining response variation and explaining predictor variation. Two different formulations for partial least squares are available: the original predictive method of Wold (1966) and the SIMPLS method of de Jong (1993).
  • enables you to choose the number of extracted factors by cross validation
  • enables you to use the general linear modeling approach of the GLM procedure to specify a model for your design, allowing for general polynomial effects as well as classification or ANOVA effects
  • enables you to save the fitted model in a data set and apply it to new data by using the SCORE procedure
  • performs BY group processing, which enables you to obtain separate analyses on grouped observations
  • creates an output data set to receive quantities that can be computed for every input observation, such as extracted factors and predicted values
  • automatically creates graphs by using ODS Graphics
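
A minimal sketch of a partial least squares fit, with the number of factors chosen by leave-one-out cross validation, follows; the data set WORK.SPECTRA and the variables Y and X1-X100 are hypothetical:

  proc pls data=work.spectra cv=one;     /* leave-one-out cross validation */
     model y = x1-x100;
     output out=work.scores predicted=p_y xscore=xscr;   /* predictions and extracted factors */
  run;
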
For further details, see PLS Procedure

TRANSREG Procedure


The TRANSREG (transformation regression) procedure fits linear models, optionally with smooth, spline, Box-Cox, and other nonlinear transformations of the variables. The following are highlights of the TRANSREG procedure's features:

  • enables you to fit linear models including:
    • ordinary regression and ANOVA
    • metric and nonmetric conjoint analysis (Green and Wind 1975; de Leeuw, Young, and Takane 1976)
    • linear models with Box-Cox (1964) transformations of the dependent variables
    • regression with a smooth (Reinsch 1967), spline (de Boor 1978; van Rijckevorsel 1982), monotone spline (Winsberg and Ramsay 1980), or penalized B-spline (Eilers and Marx 1996) fit function
    • metric and nonmetric vector and ideal point preference mapping (Carroll 1972)
    • simple, multiple, and multivariate regression with variable transformations (Young, de Leeuw, and Takane 1976; Winsberg and Ramsay 1980; Breiman and Friedman 1985)
    • redundancy analysis (Stewart and Love 1968) with variable transformations (Israels 1984)
    • canonical correlation analysis with variable transformations (van der Burg and de Leeuw 1983)
    • response surface regression (Meyers 1976; Khuri and Cornell 1987) with variable transformations
  • enables you to use a data set that can contain variables measured on nominal, ordinal, interval, and ratio scales; you can specify any mix of these variable types for the dependent and independent variables
  • enables you to transform nominal variables by scoring the categories to minimize squared error (Fisher 1938), or to treat nominal variables as classification variables
  • enables you to transform ordinal variables by monotonically scoring the ordered categories so that order is weakly preserved (adjacent categories can be merged) and squared error is minimized. Ties can be optimally untied or left tied (Kruskal 1964). Ordinal variables can also be transformed to ranks.
  • enables you to transform interval and ratio scale of measurement variables linearly or nonlinearly with spline (de Boor 1978; van Rijckevorsel 1982), monotone spline (Winsberg and Ramsay 1980), penalized B-spline (Eilers and Marx 1996), smooth (Reinsch 1967), or Box-Cox (Box and Cox 1964) transformations. In addition, logarithmic, exponential, power, logit, and inverse trigonometric sine transformations are available.
  • fits a curve through a scatter plot or fits multiple curves, one for each level of a classification variable
  • enables you to constrain the functions to be parallel or monotone or have the same intercept
  • enables you to code experimental designs and classification variables prior to their use in other analyses
  • performs weighted estimation
  • generates output data sets that include the following:
    • ANOVA results
    • regression tables
    • conjoint analysis part-worth utilities
    • coefficients
    • marginal means
    • original and transformed variables, predicted values, residuals, scores, and more
  • performs BY group processing, which enables you to obtain separate analyses on grouped observations
  • automatically creates graphs by using ODS Graphics
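
A minimal sketch that combines a Box-Cox response transformation with a spline fit follows; the data set WORK.SALES and the variables SALES, PRICE, and REGION are hypothetical:

  proc transreg data=work.sales;
     model boxcox(sales) = spline(price / nknots=3) class(region);
     output out=work.trans predicted residuals;   /* transformed variables, predictions, residuals */
  run;
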
For further details, see TRANSREG Procedure