Model Selection Methods

Statistical model selection (or model building) involves forming a model from a set of candidate regressor variables so that the model fits the data well without overfitting. A model is overfit when it contains too many unimportant regressor variables and is therefore molded too closely to one particular data set. As a result, overfit models have unstable regression coefficients and are likely to predict new data poorly. Guided, numerical variable selection methods offer one approach to building models when many potential regressor variables are available for inclusion in a regression model.

Both the REG and GLMSELECT procedures provide extensive options for model selection in ordinary linear regression models.[1] PROC GLMSELECT provides the most modern and flexible options: it offers more selection methods and criteria than PROC REG, and it also supports CLASS variables. For more information about PROC GLMSELECT, see Chapter 47: The GLMSELECT Procedure. For more information about PROC REG, see Chapter 83: The REG Procedure.

SAS/STAT procedures provide the following model selection options for regression models:

NONE

performs no model selection. This method uses the full model given in the MODEL statement to fit the model. This selection method is available in the GLMSELECT, LOGISTIC, PHREG, QUANTSELECT, and REG procedures.

FORWARD

uses a forward-selection algorithm to select variables. This method starts with no variables in the model and adds variables one by one to the model. At each step, the variable that is added is the one that most improves the fit of the model. You can also specify groups of variables to treat as a unit during the selection process. An option enables you to specify the criterion for inclusion. This selection method is available in the GLMSELECT, LOGISTIC, PHREG, QUANTSELECT, and REG procedures.
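The forward-selection loop described above can be sketched in a few lines of plain Python. This is an illustration only, not the SAS implementation: the function names `sse` and `forward_select`, the toy data, and the F-to-enter threshold `f_enter=4.0` are all assumptions made for this example, and the threshold merely stands in for whatever inclusion criterion you specify in the procedure.

```python
from itertools import product

def sse(columns, y):
    """Residual sum of squares from regressing y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] * n] + [list(c) for c in columns]
    k = len(X)
    # Augmented normal equations [X'X | X'y]
    M = [[sum(p * q for p, q in zip(X[i], X[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(X[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(k):
            if r != i:
                f = M[r][i]
                M[r] = [vr - f * vi for vr, vi in zip(M[r], M[i])]
    beta = [row[k] for row in M]
    fitted = [sum(b * X[j][t] for j, b in enumerate(beta)) for t in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def forward_select(data, y, f_enter=4.0):
    """Add one variable at a time; stop when the best candidate's F is too small."""
    n = len(y)
    selected, remaining = [], list(data)
    current = sse([], y)  # intercept-only model
    while remaining:
        # SSE of the current model plus each candidate in turn
        trial = {v: sse([data[s] for s in selected + [v]], y) for v in remaining}
        best = min(trial, key=trial.get)  # largest improvement in fit
        df_resid = n - len(selected) - 2  # intercept + selected + candidate
        if df_resid <= 0:
            break
        f_stat = (current - trial[best]) / (trial[best] / df_resid)
        if f_stat < f_enter:
            break
        selected.append(best)
        remaining.remove(best)
        current = trial[best]
    return selected

# Made-up toy data: a 2^4 factorial layout; x1 is a noisy proxy for x2 + x3,
# and the hidden factor d stands in for noise.
rows = list(product([-1.0, 1.0], repeat=4))
a, b, c, d = (list(col) for col in zip(*rows))
data = {"x1": [p + q + 0.5 * r for p, q, r in zip(a, b, c)], "x2": a, "x3": b}
y = [10 + 2 * p + 1.5 * q + 0.2 * s for p, q, s in zip(a, b, d)]

print(forward_select(data, y))
```

On this toy data the proxy variable `x1` enters first and is never reconsidered, which is the behavior that distinguishes pure forward selection from the stepwise method.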

BACKWARD

uses a backward-elimination algorithm to select variables. This method starts with a full model and eliminates variables one by one from the model. At each step, the variable that makes the smallest contribution to the model is deleted. You can also specify groups of variables to treat as a unit during the selection process. An option enables you to specify the criterion for exclusion. This selection method is available in the GLMSELECT, LOGISTIC, PHREG, QUANTSELECT, and REG procedures.
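A minimal sketch of backward elimination, under the same caveats as any toy example: the names `sse` and `backward_eliminate`, the made-up data, and the F-to-stay threshold `f_stay=4.0` are assumptions for illustration and do not reproduce the exclusion criteria that the SAS procedures offer.

```python
from itertools import product

def sse(columns, y):
    """Residual sum of squares from regressing y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] * n] + [list(c) for c in columns]
    k = len(X)
    # Augmented normal equations [X'X | X'y]
    M = [[sum(p * q for p, q in zip(X[i], X[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(X[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(k):
            if r != i:
                f = M[r][i]
                M[r] = [vr - f * vi for vr, vi in zip(M[r], M[i])]
    beta = [row[k] for row in M]
    fitted = [sum(b * X[j][t] for j, b in enumerate(beta)) for t in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def backward_eliminate(data, y, f_stay=4.0):
    """Start from the full model; drop the weakest variable until all pass f_stay."""
    n = len(y)
    kept = list(data)
    while kept:
        full = sse([data[v] for v in kept], y)
        mse = full / (n - len(kept) - 1)
        # Partial F statistic for removing each variable from the current model
        f_stats = {v: (sse([data[s] for s in kept if s != v], y) - full) / mse
                   for v in kept}
        worst = min(f_stats, key=f_stats.get)  # smallest contribution
        if f_stats[worst] >= f_stay:
            break
        kept.remove(worst)
    return kept

# Made-up toy data: a 2^4 factorial layout; x1 is a noisy proxy for x2 + x3,
# and the hidden factor d stands in for noise.
rows = list(product([-1.0, 1.0], repeat=4))
a, b, c, d = (list(col) for col in zip(*rows))
data = {"x1": [p + q + 0.5 * r for p, q, r in zip(a, b, c)], "x2": a, "x3": b}
y = [10 + 2 * p + 1.5 * q + 0.2 * s for p, q, s in zip(a, b, d)]

print(backward_eliminate(data, y))
```

Because the full model already contains `x2` and `x3`, the proxy `x1` contributes essentially nothing and is the only variable eliminated.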

STEPWISE

uses a stepwise-regression algorithm to select variables; it combines the forward-selection and backward-elimination steps. This method is a modification of the forward-selection method in that variables already in the model do not necessarily stay there. You can also specify groups of variables to treat as a unit during the selection process. Again, options enable you to specify criteria for entering the model and for remaining in the model. This selection method is available in the GLMSELECT, LOGISTIC, PHREG, QUANTSELECT, and REG procedures.
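The way a variable can enter and later leave the model is easiest to see in a small sketch. The code below is a simplified illustration, not the algorithm of any SAS procedure: `sse`, `stepwise_select`, the thresholds `f_enter` and `f_stay`, the `max_steps` guard, and the data are all hypothetical choices for this example.

```python
from itertools import product

def sse(columns, y):
    """Residual sum of squares from regressing y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] * n] + [list(c) for c in columns]
    k = len(X)
    # Augmented normal equations [X'X | X'y]
    M = [[sum(p * q for p, q in zip(X[i], X[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(X[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(k):
            if r != i:
                f = M[r][i]
                M[r] = [vr - f * vi for vr, vi in zip(M[r], M[i])]
    beta = [row[k] for row in M]
    fitted = [sum(b * X[j][t] for j, b in enumerate(beta)) for t in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def stepwise_select(data, y, f_enter=4.0, f_stay=4.0, max_steps=20):
    """Alternate a forward step with backward checks on variables already in the model."""
    n = len(y)
    model, history = [], []

    def fit(names):
        return sse([data[v] for v in names], y)

    for _ in range(max_steps):  # guard against cycling
        candidates = [v for v in data if v not in model]
        if not candidates:
            break
        # Forward step: try the candidate that most improves the fit
        trial = {v: fit(model + [v]) for v in candidates}
        best = min(trial, key=trial.get)
        mse = trial[best] / (n - len(model) - 2)
        if (fit(model) - trial[best]) / mse < f_enter:
            break
        model.append(best)
        history.append(("enter", best))
        # Backward step: drop variables that no longer earn their place
        while len(model) > 1:
            full = fit(model)
            mse = full / (n - len(model) - 1)
            f_stats = {v: (fit([s for s in model if s != v]) - full) / mse
                       for v in model}
            worst = min(f_stats, key=f_stats.get)
            if f_stats[worst] >= f_stay:
                break
            model.remove(worst)
            history.append(("remove", worst))
    return model, history

# Made-up toy data: x1 is a noisy proxy for x2 + x3; d stands in for noise.
rows = list(product([-1.0, 1.0], repeat=4))
a, b, c, d = (list(col) for col in zip(*rows))
data = {"x1": [p + q + 0.5 * r for p, q, r in zip(a, b, c)], "x2": a, "x3": b}
y = [10 + 2 * p + 1.5 * q + 0.2 * s for p, q, s in zip(a, b, d)]

model, history = stepwise_select(data, y)
print(model)
print(history)
```

In this example `x1` enters first because it is the best single predictor, but it is removed once `x2` and `x3` are both in the model and make it redundant.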

LASSO

adds and deletes parameters by using a version of ordinary least squares in which the sum of the absolute values of the regression coefficients is constrained. If the model contains CLASS variables, then these CLASS variables are split. This selection method is available in the GLMSELECT and QUANTSELECT procedures.
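In its standard textbook form, the constrained least squares problem that this description refers to can be written as follows. The notation is an assumption made for this sketch: $y$ is the response vector, $X$ the matrix of $m$ regressors, $\beta$ the coefficient vector, and $t \ge 0$ the bound on the coefficients.

```latex
\[
  \min_{\beta} \; \lVert y - X\beta \rVert_2^2
  \quad \text{subject to} \quad
  \sum_{j=1}^{m} \lvert \beta_j \rvert \le t
\]
```

For a sufficiently small bound $t$, some coefficients are exactly zero, which is how the constraint performs variable selection; as $t$ increases, parameters enter (and can leave) the model.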

LAR

uses least angle regression to select variables. Like forward selection, this method starts with no effects in the model and adds effects. The parameter estimates at any step are shrunk when compared to the corresponding least squares estimates. If the model contains CLASS variables, then these CLASS variables are split. This selection method is available in PROC GLMSELECT.

MAXR

uses maximum R-square improvement to select models. This method tries to find the best one-variable model, the best two-variable model, and so on. The MAXR method differs from the STEPWISE method in that it evaluates many more models: it considers all possible variable exchanges before making any exchange. The STEPWISE method might remove the worst variable without considering what adding the best remaining variable might accomplish, whereas MAXR makes that comparison before acting. Consequently, model building based on the maximum R-square improvement usually takes much longer than stepwise model building. This selection method is available in PROC REG.

MINR

uses minimum R-square improvement to select models. This method closely resembles MAXR, but the switch that is chosen is the one that produces the smallest increase in R square. This selection method is available in PROC REG.

RSQUARE

selects a specified number of models that have the highest R square in each of a range of model sizes. This selection method is available in PROC REG.
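The idea of ranking models by R square within each model size can be illustrated with a brute-force search over all subsets. This is a sketch under stated assumptions, not how PROC REG performs the search: the names `sse` and `best_subsets`, the `top` parameter, and the data are invented for the example, and exhaustive enumeration is practical only for a small number of regressors.

```python
from itertools import combinations, product

def sse(columns, y):
    """Residual sum of squares from regressing y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] * n] + [list(c) for c in columns]
    k = len(X)
    # Augmented normal equations [X'X | X'y]
    M = [[sum(p * q for p, q in zip(X[i], X[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(X[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(k):
            if r != i:
                f = M[r][i]
                M[r] = [vr - f * vi for vr, vi in zip(M[r], M[i])]
    beta = [row[k] for row in M]
    fitted = [sum(b * X[j][t] for j, b in enumerate(beta)) for t in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def best_subsets(data, y, top=1):
    """For each model size, return the `top` subsets with the highest R square."""
    total = sse([], y)  # corrected total sum of squares
    result = {}
    for size in range(1, len(data) + 1):
        scored = [(names, 1.0 - sse([data[v] for v in names], y) / total)
                  for names in combinations(data, size)]
        scored.sort(key=lambda item: item[1], reverse=True)
        result[size] = scored[:top]
    return result

# Made-up toy data: x1 is a noisy proxy for x2 + x3; d stands in for noise.
rows = list(product([-1.0, 1.0], repeat=4))
a, b, c, d = (list(col) for col in zip(*rows))
data = {"x1": [p + q + 0.5 * r for p, q, r in zip(a, b, c)], "x2": a, "x3": b}
y = [10 + 2 * p + 1.5 * q + 0.2 * s for p, q, s in zip(a, b, d)]

for size, models in best_subsets(data, y, top=2).items():
    for names, r2 in models:
        print(size, names, round(r2, 4))
```

Here the best one-variable model uses `x1`, while the best two-variable model uses `x2` and `x3`, so the best models of different sizes need not be nested.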

CP

selects a specified number of models that have the lowest $C_p$ within a range of model sizes. This selection method is available in PROC REG.
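For reference, the usual definition of Mallows' statistic is shown below. The symbols are notation assumed for this sketch: $n$ is the number of observations, $p$ is the number of parameters in the candidate model (including the intercept), $SSE_p$ is that model's error sum of squares, and $s^2$ is the mean squared error of the full model.

```latex
\[
  C_p = \frac{SSE_p}{s^2} - (n - 2p)
\]
```

A candidate model with little bias has $C_p$ close to $p$, so small values near $p$ are preferred.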

ADJRSQ

selects a specified number of models that have the highest adjusted R square within a range of model sizes. This selection method is available in PROC REG.
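For a model that includes an intercept, the adjusted R square is commonly written as shown below. The notation is assumed for this sketch: $n$ is the number of observations and $p$ is the number of parameters in the model, including the intercept.

```latex
\[
  R^2_{\mathrm{adj}} = 1 - \frac{(n - 1)\,(1 - R^2)}{n - p}
\]
```

Unlike R square, this quantity can decrease when a regressor that adds little explanatory power is included, which makes it usable for comparing models of different sizes.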

SCORE

selects models based on a branch-and-bound algorithm that finds a specified number of models that have the highest likelihood score for all possible model sizes, from one up to a model that contains all the explanatory effects. This selection method is available in PROC LOGISTIC and PROC PHREG.



[1] The QUANTSELECT, PHREG, and LOGISTIC procedures provide model selection for quantile, proportional hazards, and logistic regression, respectively.