Model Selection

Model selection in this context refers to searching for the best subset of explanatory variables to include in your model. Many authors caution against the use of "automatic variable selection" methods and describe pitfalls that plague many such methods. However, careful and informed use of variable selection methods has its place in modern data analysis.

Suppose you want to model the relationship between a response variable and a set of p potential explanatory variables, but you have reason to believe that some of the potential explanatory variables are redundant or irrelevant. Because each variable can be either included or excluded, there are 2^p possible models from which to choose, and the challenge is to find the best subset of explanatory variables to include in your model. Solving this problem requires two things: choosing a criterion by which the candidate models can be ranked, and searching through the model space to find the candidate models that exhibit optimal values of the chosen criterion.
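To make the size of the search concrete, the following minimal Python sketch enumerates all 2^p subsets for a toy problem with p = 3 and ranks them by a criterion. It is an illustration only, not how any SAS/STAT procedure is implemented. The data, the variable names x1 to x3, the helper functions, and the criterion n log(RSS/n) + 2k (a Gaussian AIC up to an additive constant) are all assumptions made for this example.

```python
from itertools import combinations
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least squares fit: intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def rank_all_subsets(predictors, y):
    """Fit every one of the 2^p subsets and sort by the criterion (smaller is better)."""
    n = len(y)
    names = sorted(predictors)
    ranked = []
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            k = len(subset) + 1  # number of parameters, including the intercept
            fit_rss = rss([predictors[name] for name in subset], y)
            ranked.append((n * math.log(fit_rss / n) + 2 * k, subset))
    ranked.sort()
    return ranked

# Hypothetical toy data: the response depends on x1 only; x2 and x3 are irrelevant.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [5, 3, 8, 1, 7, 2, 6, 4],
    "x3": [3, 7, 2, 6, 1, 8, 4, 5],
}
response = [2.3, 3.8, 6.1, 7.6, 10.2, 11.9, 14.4, 15.7]

ranked = rank_all_subsets(predictors, response)
for criterion, subset in ranked:
    print(f"{criterion:8.2f}  {subset}")
```

Exhaustive enumeration is feasible here because p is tiny. The number of candidate models doubles with every added variable, which is why practical methods search the model space selectively instead of visiting every subset.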

Below are highlights of the capabilities of the SAS/STAT procedures that perform model selection:

GLMSELECT Procedure


The GLMSELECT procedure performs effect selection in the framework of general linear models. A variety of model selection methods are available, including the LASSO method of Tibshirani (1996) and the related LAR method of Efron et al. (2004). The procedure offers extensive capabilities for customizing the selection with a wide variety of selection and stopping criteria, from traditional and computationally efficient significance-level-based criteria to more computationally intensive validation-based criteria. The procedure also provides graphical summaries of the selection search. The following are highlights of the GLMSELECT procedure's features:

Model Specification

  • supports different parameterizations for classification effects
  • supports any degree of interaction (crossed effects) and nested effects
  • supports hierarchy among effects
  • supports partitioning of data into training, validation, and testing roles
  • supports constructed effects including spline and multimember effects

Display and Output

  • produces graphical representation of the selection process
  • produces output data sets that contain predicted values and residuals
  • produces an output data set that contains the design matrix
  • produces macro variables that contain selected models
  • supports parallel processing of BY groups
  • supports multiple SCORE statements

Selection Control

  • provides multiple effect selection methods including the following:
    • forward selection
    • backward elimination
    • stepwise regression
    • least angle regression (LAR)
    • least absolute shrinkage and selection operator (LASSO)
    • group LASSO
    • elastic net
    • hybrid versions of LAR and LASSO
  • enables selection from a very large number of effects (tens of thousands)
  • supports safe screening and sure independence screening methods
  • offers selection of individual levels of classification effects
  • provides effect selection based on a variety of selection criteria
  • provides stopping rules based on a variety of model evaluation criteria
  • provides leave-one-out and k-fold cross validation
  • supports data resampling and model averaging

For further details, see GLMSELECT Procedure.
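The idea behind the k-fold and leave-one-out cross validation mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure's implementation: the interleaved fold assignment, the function names, and the toy data are all hypothetical choices made for this example.

```python
def kfold_splits(n, k):
    """Yield (train, validation) index lists for k folds; k == n is leave-one-out."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    # Interleaved assignment: fold i holds observations i, i + k, i + 2k, ...
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        yield [i for i in range(n) if i not in held], held_out

def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def fit_mean(xs, ys):
    """Intercept-only model, expressed as a line with zero slope."""
    return sum(ys) / len(ys), 0.0

def cv_error(xs, ys, fit, k):
    """Sum of squared prediction errors on the held-out folds."""
    total = 0.0
    for train, valid in kfold_splits(len(xs), k):
        intercept, slope = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - (intercept + slope * xs[i])) ** 2 for i in valid)
    return total

# Hypothetical toy data with a strong linear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.3, 3.8, 6.1, 7.6, 10.2, 11.9, 14.4, 15.7]

for k in (4, len(xs)):  # 4-fold, then leave-one-out
    print(k, cv_error(xs, ys, fit_mean, k), cv_error(xs, ys, fit_line, k))
```

Because every observation is predicted by a model that never saw it, the held-out error penalizes overfitting and can serve either as a selection criterion or as a stopping rule when candidate models are compared.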

HPGENSELECT Procedure


The HPGENSELECT procedure is a high-performance procedure that provides model fitting and model building for generalized linear models. The following are highlights of the HPGENSELECT procedure's features:

  • fits models for standard distributions in the exponential family
  • fits multinomial models for ordinal and nominal responses
  • fits zero-inflated Poisson and negative binomial models for count data
  • supports the following link functions:
    • complementary log-log
    • generalized logit
    • identity
    • reciprocal
    • reciprocal squared
    • logarithm
    • logit
    • log-log
    • probit
  • provides forward, backward, and stepwise variable selection
  • supports the following selection criteria:
    • AIC (Akaike's information criterion)
    • AICC (small-sample bias corrected version of Akaike's information criterion)
    • BIC (Schwarz Bayesian criterion)
  • writes SAS DATA step code for computing predicted values of the fitted model either to a file or to a catalog entry
  • creates a data set that contains observationwise statistics that are computed after the model is fitted
  • enables you to specify how observations in the input data set are to be logically partitioned into disjoint subsets for model training, validation, and testing
  • performs weighted estimation
  • runs in either single-machine mode or distributed mode

For further details, see HPGENSELECT Procedure.
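The three selection criteria listed above can be written as short functions of the maximized log likelihood, the number of parameters k, and the number of observations n. The sketch below uses the common textbook definitions; the exact formulas that the procedure applies (for example, how frequencies or weights enter n) may differ, and the two candidate models and their log-likelihood values are hypothetical numbers chosen only for illustration.

```python
import math

def aic(loglik, k):
    """Akaike's information criterion: -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n):
    """Small-sample corrected AIC: -2 log L + 2kn / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICC requires n > k + 1")
    return -2.0 * loglik + 2.0 * k * n / (n - k - 1)

def bic(loglik, k, n):
    """Schwarz Bayesian criterion: -2 log L + k log(n)."""
    return -2.0 * loglik + k * math.log(n)

# Hypothetical candidates: name -> (maximized log likelihood, parameter count).
candidates = {"small": (-120.0, 3), "large": (-118.5, 6)}
n_obs = 50

for name, (loglik, k) in candidates.items():
    print(name, aic(loglik, k), aicc(loglik, k, n_obs), bic(loglik, k, n_obs))

# Smaller values are better under all three criteria.
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n_obs))
print("preferred by BIC:", best_by_bic)
```

All three criteria trade goodness of fit against model size. They differ only in the penalty: BIC penalizes each extra parameter by log(n) instead of 2, so it tends to prefer smaller models as n grows, and AICC approaches AIC when n is large relative to k.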

QUANTSELECT Procedure


The QUANTSELECT procedure performs effect selection in the framework of quantile regression. A variety of effect selection methods are available, including greedy methods and penalty methods. PROC QUANTSELECT offers extensive capabilities for customizing the effect selection process, with a variety of criteria for selecting candidate effects, stopping the selection, and choosing the final model. It also provides graphical summaries of the effect selection process. The following are highlights of the QUANTSELECT procedure's features:

  • supports the following model specifications:
    • interaction (crossed) effects and nested effects
    • constructed effects such as regression splines
    • hierarchy among effects
    • partitioning of data into training, validation, and testing roles
  • provides the following selection controls:
    • multiple methods for effect selection
    • selection for quantile process and single quantile levels
    • selection of individual or grouped effects
    • selection based on a variety of selection criteria
    • stopping rules based on a variety of model evaluation criteria
  • provides graphical representations of the selection process
  • provides output data sets that contain predicted values and residuals
  • provides an output data set that contains the parameter estimates from a quantile process regression
  • provides an output data set that contains the design matrix
  • provides macro variables that contain selected effects

For further details, see QUANTSELECT Procedure.
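Quantile regression replaces the squared-error loss of ordinary regression with the check loss, which weights positive and negative residuals asymmetrically according to the quantile level tau. The following minimal Python sketch shows the loss and confirms, for a constant-only model, that minimizing it recovers a sample quantile. The function names and data are hypothetical, and the brute-force search over observed values is an assumption made to keep the example short; it is not how the procedure fits models.

```python
def check_loss(residual, tau):
    """Check loss rho_tau(r) = r * (tau - 1[r < 0]) for a quantile level 0 < tau < 1."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def total_check_loss(ys, fitted, tau):
    """Sum of check losses over all observations."""
    return sum(check_loss(y - f, tau) for y, f in zip(ys, fitted))

def best_constant(ys, tau):
    """Constant fit with the smallest total check loss, searched over observed values."""
    return min(sorted(set(ys)),
               key=lambda c: total_check_loss(ys, [c] * len(ys), tau))

# Hypothetical data with one large outlier.
ys = [1, 2, 3, 4, 100]

for tau in (0.25, 0.5, 0.9):
    print(tau, best_constant(ys, tau))
```

At tau = 0.5 the check loss is half the absolute error, so the best constant is the median, which the outlier barely affects. Other values of tau shift the fit toward lower or upper quantiles, which is what allows effects to be selected separately for different parts of the response distribution.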