Predictive Modeling
The term "predictive modeling" refers to the practice of fitting models primarily for the purpose of predicting out-of-sample
outcomes rather than for performing statistical inference. The procedures in this category can fit a wide variety of models.
The SAS/STAT predictive modeling procedures include the following:
ADAPTIVEREG Procedure
The ADAPTIVEREG procedure fits multivariate adaptive regression splines, a nonparametric regression technique
that combines regression splines with model selection methods. It does not assume parametric model forms and does not
require specification of knot values for constructing regression spline terms. Instead, it constructs spline basis functions
in an adaptive way by automatically selecting appropriate knot values for different variables and obtains reduced models by
applying model selection techniques.
The procedure enables you to do the following:
 specify classification variables with ordering options
 partition your data into training, validation, and testing roles
 specify the distribution family used in the model
 specify the link function in the model
 specify an offset
 specify the maximum number of basis functions that can be used in the final model
 specify the maximum interaction levels for effects that could potentially enter the model
 specify the incremental penalty for increasing the number of variables in the model
 specify the effects to be included in the final model
 request an additive model for which only main effects are included in the fitted model
 specify the parameter that controls the number of knots considered for each variable

 force effects in the final model or restrict variables in linear forms
 specify options for fast forward selection
 perform leave-one-out and k-fold cross validation
 produce a graphical representation of the selection process, model fit, functional components, and fit diagnostics
 create an output data set that contains predicted values and residuals
 create an output data set that contains the design matrix of formed basis functions
 specify multiple SCORE statements, which create new SAS data sets that contain predicted values and residuals
 perform BY group processing to obtain separate analyses on grouped observations
 automatically produce graphs by using ODS Graphics

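The statements above can be combined in a typical invocation. The following is a minimal sketch; the data set WORK.SAMPLE and the variables y, x1, x2, and c are hypothetical:

```sas
/* Minimal sketch of an ADAPTIVEREG call; the data set WORK.SAMPLE
   and the variables y, x1, x2, and c are hypothetical. */
proc adaptivereg data=work.sample plots=all;
   class c;                        /* classification variable          */
   model y = x1 x2 c / additive;   /* main-effects (additive) model    */
   output out=pred predicted;      /* predicted values for each row    */
run;
```
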
For further details, see
ADAPTIVEREG Procedure
GLMSELECT Procedure
The GLMSELECT procedure performs effect selection in the framework of general linear models.
A variety of model selection methods are available, including the LASSO method of Tibshirani (1996) and the related
LAR method of Efron et al. (2004). The procedure offers extensive capabilities for customizing the selection with a
wide variety of selection and stopping criteria, from traditional and computationally efficient significance-level-based
criteria to more computationally intensive validation-based criteria. The procedure also provides graphical summaries of the selection search.
The following are highlights of the GLMSELECT procedure's features:
Model Specification
 supports different parameterizations for classification effects
 supports any degree of interaction (crossed effects) and nested effects
 supports hierarchy among effects
 supports partitioning of data into training, validation, and testing roles
 supports constructed effects including spline and multimember effects
Display and Output
 produces graphical representation of the selection process
 produces output data sets that contain predicted values and residuals
 produces an output data set that contains the design matrix
 produces macro variables that contain selected models
 supports parallel processing of BY groups
 supports multiple SCORE statements

Selection Control
 provides multiple effect selection methods including the following:
 forward selection
 backward elimination
 stepwise regression
 least angle regression (LAR)
 least absolute shrinkage and selection operator (LASSO)
 group LASSO
 elastic net
 hybrid versions of LAR and LASSO
 enables selection from a very large number of effects (tens of thousands)
 supports safe screening and sure independence screening methods
 offers selection of individual levels of classification effects
 provides effect selection based on a variety of selection criteria
 provides stopping rules based on a variety of model evaluation criteria
 provides leave-one-out and k-fold cross validation
 supports data resampling and model averaging

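A typical GLMSELECT invocation combines a selection method with a choice criterion. The following is a minimal sketch of LASSO selection with the final model chosen by cross validation; the data set WORK.SAMPLE and the variables are hypothetical:

```sas
/* Minimal sketch: LASSO selection, final model chosen by cross
   validation. Data set and variables are hypothetical. */
proc glmselect data=work.sample plots=all;
   class c;                                            /* classification effect */
   model y = x1-x10 c / selection=lasso(choose=cv stop=none);
run;
```
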
For further details, see
GLMSELECT Procedure
HPLOGISTIC Procedure
The HPLOGISTIC procedure is a high-performance statistical procedure that fits logistic regression models for binary, binomial, and multinomial data.
The following are highlights of the HPLOGISTIC procedure's features:
 provides model-building syntax with the CLASS and effect-based MODEL statements
 provides response-variable options as in the LOGISTIC procedure
 performs maximum likelihood estimation
 provides multiple link functions
 provides cumulative link models for ordinal data and generalized logit modeling for unordered multinomial data
 enables model building (variable selection) through the SELECTION statement
 provides a WEIGHT statement for weighted analysis

 provides a FREQ statement for grouped analysis
 provides an OUTPUT statement to produce a data set with predicted probabilities and other observation-wise statistics
 enables you to run in distributed mode on a cluster of machines that distribute the data and the computations
 enables you to run in singlemachine mode on the server where SAS is installed
 exploits all the available cores and concurrent threads, regardless of execution mode
 performs parallel reads of input data and parallel writes of output data when the data source is the appliance database

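The following is a minimal sketch of a binary logistic fit with backward selection; the data set WORK.SAMPLE and the variables y, x1, x2, and c are hypothetical:

```sas
/* Minimal sketch: binary logistic regression with backward
   selection. Data set and variables are hypothetical. */
proc hplogistic data=work.sample;
   class c;
   model y(event='1') = x1 x2 c;   /* model Pr(y = '1')        */
   selection method=backward;
   output out=scores pred=p_hat;   /* predicted probabilities  */
run;
```
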
For further details, see
HPLOGISTIC Procedure
HPGENSELECT Procedure
The HPGENSELECT procedure is a high-performance procedure that provides model fitting and model building for generalized linear models.
The following are highlights of the HPGENSELECT procedure's features:
 fits models for standard distributions in the exponential family
 fits multinomial models for ordinal and nominal responses
 fits zero-inflated Poisson and negative binomial models for count data
 supports the following link functions:
 complementary log-log
 generalized logit
 identity
 reciprocal
 reciprocal squared
 logarithm
 logit
 log-log
 probit
 provides forward, backward, and stepwise variable selection

 supports the following selection criteria:
 AIC (Akaike's information criterion)
 AICC (small-sample bias-corrected version of Akaike's information criterion)
 BIC (Schwarz Bayesian criterion)
 writes SAS DATA step code for computing predicted values of the fitted model either to a file or to a catalog entry
 creates a data set that contains observation-wise statistics that are computed after the model is fitted
 enables you to specify how observations in the input data set are to be logically partitioned into disjoint subsets for model training, validation, and testing
 performs weighted estimation
 runs in either singlemachine mode or distributed mode

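The following is a minimal sketch of a Poisson regression with stepwise selection and DATA step scoring code; the data set, variables, and output file name are hypothetical:

```sas
/* Minimal sketch: Poisson regression with stepwise selection.
   Data set, variables, and file name are hypothetical. */
proc hpgenselect data=work.sample;
   model y = x1 x2 x3 / distribution=poisson link=log;
   selection method=stepwise;
   code file="scorecode.sas";   /* DATA step scoring code */
run;
```
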
For further details, see
HPGENSELECT Procedure
HPREG Procedure
The HPREG procedure is a high-performance procedure that provides model fitting and model building for ordinary linear least squares models.
The models supported are standard independently and identically distributed general linear models, which can contain main effects that consist
of both continuous and classification variables and interaction effects of these variables. The procedure offers extensive capabilities for
customizing the model selection with a wide variety of selection and stopping criteria, from traditional and computationally efficient
significance-level-based criteria to more computationally intensive validation-based criteria.
The following are highlights of the HPREG procedure's features:
 supports GLM and reference parameterization for classification effects
 supports any degree of interaction (crossed effects) and nested effects
 supports hierarchy among effects
 supports partitioning of data into training, validation, and testing roles
 supports a FREQ statement for grouped analysis
 supports a WEIGHT statement for weighted analysis
 provides multiple effect-selection methods

 enables selection from a very large number of effects (tens of thousands)
 offers selection of individual levels of classification effects
 provides effect selection based on a variety of selection criteria
 provides stopping rules based on a variety of model evaluation criteria
 supports stopping and selection rules based on external validation and leave-one-out cross validation
 produces output data sets that contain predicted values, residuals, studentized residuals, confidence limits, and influence statistics

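The following is a minimal sketch of stepwise selection with a validation partition; the data set WORK.SAMPLE and the variables are hypothetical:

```sas
/* Minimal sketch: stepwise selection, with the chosen model
   determined by validation data. Data set and variables are
   hypothetical. */
proc hpreg data=work.sample;
   partition fraction(validate=0.3 test=0.2);   /* random partitioning */
   class c;
   model y = x1-x20 c;
   selection method=stepwise(choose=validate);
run;
```
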
For further details, see
HPREG Procedure
PLS Procedure
The PLS procedure fits models by using any one of a number of linear predictive methods including partial least squares (PLS).
Ordinary least squares regression, as implemented in SAS/STAT procedures such as PROC GLM and PROC REG, has the single goal of
minimizing sample response prediction error, seeking linear functions of the predictors that explain as much variation in each response as possible.
The techniques implemented in the PLS procedure have the additional goal of accounting for variation in the predictors, under the
assumption that directions in the predictor space that are well sampled should provide better prediction for new observations when
the predictors are highly correlated. All of the techniques implemented in the PLS procedure work by extracting successive linear
combinations of the predictors, called factors (also called components, latent vectors, or latent variables), which optimally address
one or both of these two goals—explaining response variation and explaining predictor variation. In particular, the method of partial
least squares balances the two objectives, seeking factors that explain both response and predictor variation.
The following are highlights of the PLS procedure's features:
 implements the following techniques:
 principal components regression, which extracts factors to explain as much predictor sample variation as possible
 reduced rank regression, which extracts factors to explain as much response variation as possible.
This technique, also known as (maximum) redundancy analysis, differs from multivariate linear regression only when there are multiple responses.
 partial least squares regression, which balances the two objectives of explaining response variation and explaining predictor variation.
Two different formulations for partial least squares are available: the original predictive method of Wold (1966) and the SIMPLS method of de Jong (1993).

 enables you to choose the number of extracted factors by cross validation
 enables you to use the general linear modeling approach of the GLM procedure to specify a model for
your design, allowing for general polynomial effects as well as classification or ANOVA effects
 enables you to save the fitted model in a data set and apply it to new data by using the SCORE procedure
 performs BY group processing, which enables you to obtain separate analyses on grouped observations
 creates an output data set to receive quantities that can be computed for every input observation, such as extracted factors and predicted values
 automatically creates graphs by using ODS Graphics

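The following is a minimal sketch of a partial least squares fit with the number of factors chosen by leave-one-out cross validation; the data set and variables are hypothetical:

```sas
/* Minimal sketch: partial least squares with the number of
   extracted factors chosen by leave-one-out cross validation.
   Data set and variables are hypothetical. */
proc pls data=work.sample method=pls cv=one;
   model y1 y2 = x1-x7;                       /* two responses       */
   output out=outpls predicted=yhat1 yhat2;   /* predicted values    */
run;
```
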
For further details, see
PLS Procedure
TRANSREG Procedure
The TRANSREG (transformation regression) procedure fits linear models, optionally with smooth, spline, Box-Cox, and other nonlinear
transformations of the variables. The following are highlights of the TRANSREG procedure's features:
 enables you to fit linear models including:
 ordinary regression and ANOVA
 metric and nonmetric conjoint analysis (Green and Wind 1975; de Leeuw, Young, and Takane 1976)
 linear models with Box-Cox (Box and Cox 1964) transformations of the dependent variables
 regression with a smooth (Reinsch 1967), spline (de Boor 1978; van Rijckevorsel 1982),
monotone spline (Winsberg and Ramsay 1980), or penalized B-spline (Eilers and Marx 1996)
fit function
 metric and nonmetric vector and ideal point preference mapping (Carroll 1972)
 simple, multiple, and multivariate regression with variable transformations (Young,
de Leeuw, and Takane 1976; Winsberg and Ramsay 1980; Breiman and Friedman 1985)
 redundancy analysis (Stewart and Love 1968) with variable transformations (Israels 1984)
 canonical correlation analysis with variable transformations (van der Burg and de Leeuw 1983)
 response surface regression (Meyers 1976; Khuri and Cornell 1987) with variable transformations
 enables you to use a data set that can contain variables measured on nominal, ordinal, interval, and ratio scales;
you can specify any mix of these variable types for the dependent and independent variables
 enables you to transform nominal variables by scoring the categories to minimize squared error
(Fisher 1938), or to treat nominal variables as classification variables

 enables you to transform ordinal variables by monotonically scoring the ordered categories so that order is
weakly preserved (adjacent categories can be merged) and squared error is minimized. Ties
can be optimally untied or left tied (Kruskal 1964). Ordinal variables can also be transformed
to ranks.
 enables you to transform interval and ratio scale of measurement variables linearly or nonlinearly with spline
(de Boor 1978; van Rijckevorsel 1982), monotone spline (Winsberg and Ramsay 1980),
penalized B-spline (Eilers and Marx 1996), smooth (Reinsch 1967), or Box-Cox (Box and
Cox 1964) transformations. In addition, logarithmic, exponential, power, logit, and inverse
trigonometric sine transformations are available.
 fits a curve through a scatter plot or multiple curves, one for each level of a classification variable
 enables you to constrain the functions to be parallel or monotone or have the same intercept
 enables you to code experimental designs and classification variables prior to their use in other analyses
 performs weighted estimation
 generates output data sets including
 ANOVA results
 regression tables
 conjoint analysis part-worth utilities
 coefficients
 marginal means
 original and transformed variables, predicted values, residuals, scores, and more
 performs BY group processing, which enables you to obtain separate analyses on grouped observations
 automatically creates graphs by using ODS Graphics

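The following is a minimal sketch of a Box-Cox model with a spline-transformed predictor and a classification variable; the data set and variables are hypothetical:

```sas
/* Minimal sketch: Box-Cox response transformation with a spline
   expansion of a predictor. Data set and variables are
   hypothetical. */
proc transreg data=work.sample;
   model boxcox(y) = spline(x1 / nknots=3) class(c);
   output out=trans predicted residuals;   /* transformed data plus fit */
run;
```
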
For further details, see
TRANSREG Procedure