Setting the Model Options :: SAS(R) Studio 3.5: Task Reference Guide

Choosing a Model

With these options, you can specify the complexity level of the model that you want to build. The modeling methods are in a hierarchy: the intermediate method includes basic and intermediate models, and the advanced method includes basic, intermediate, and advanced models.

The models that you create using the basic method will probably run faster than the models that you create using the intermediate method, but the basic method also might create a less accurate model. The same is often true when you compare the models that you create with the intermediate and advanced methods.

SAS Enterprise Miner modeling functions are executed when you run the SAS Rapid Predictive Modeler. The modeling functions that the software runs depend on the selected modeling method.

Modeling Methods

You can choose from these modeling methods:

Basic

The basic method samples the data only if you have a rare target event, and then partitions the data by using the target as a stratification variable. Next, the basic method performs a one-level variable selection step. The input variables that were selected are then binned according to the strength of their relationship to the target and passed to a forward stepwise regression model.
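The partitioning step can be illustrated with a minimal Python sketch. It is not the SAS implementation: the function name `stratified_partition`, the 70/30 split, and the fixed seed are assumptions made for illustration, and the rare-event sampling step is left out.

```python
import random
from collections import defaultdict

def stratified_partition(rows, target_key, train_fraction=0.7, seed=42):
    """Split rows into (train, validation), keeping target proportions similar.

    train_fraction and seed are illustrative assumptions, not SAS defaults.
    """
    rng = random.Random(seed)
    # Group the rows by target level so that each level is split separately.
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[target_key]].append(row)

    train, validation = [], []
    for level_rows in by_level.values():
        shuffled = level_rows[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        validation.extend(shuffled[cut:])
    return train, validation

# Made-up data: 10 events and 90 non-events.
rows = [{"id": i, "target": 1 if i < 10 else 0} for i in range(100)]
train, validation = stratified_partition(rows, "target")
print(len(train), len(validation))
```

Because each target level is split on its own, the event rate in the training and validation partitions stays close to the event rate in the full data.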

Intermediate

The intermediate method is an extension of the basic method. It performs several variable selection techniques, followed by multiple variable transformations. A decision tree, a stepwise regression, and a logistic regression are used as modeling techniques. Variable interactions are represented using the node variable that was exported from a decision tree. The intermediate method also runs the basic method, and then chooses the best performing model.

Advanced

The advanced method is an extension of the intermediate method and includes a neural network model, an advanced regression analysis, and ensemble models. The advanced method also runs the intermediate and basic methods, and then chooses the best performing model.
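The nesting of the three methods can be summarized in a short Python sketch. The names `CANDIDATES_BY_METHOD` and `choose_champion` are hypothetical, and the use of a validation error as the performance measure is an assumption; the text above says only that the best performing model is chosen.

```python
# Hypothetical names for illustration; this is not a SAS API.
CANDIDATES_BY_METHOD = {
    "basic": ["basic"],
    "intermediate": ["basic", "intermediate"],
    "advanced": ["basic", "intermediate", "advanced"],
}

def choose_champion(method, validation_error):
    """Return the candidate with the lowest error among those the method runs."""
    candidates = CANDIDATES_BY_METHOD[method]
    return min(candidates, key=lambda name: validation_error[name])

# Made-up validation errors, for illustration only.
errors = {"basic": 0.21, "intermediate": 0.18, "advanced": 0.19}
print(choose_champion("advanced", errors))
```

In this made-up case the advanced method returns the intermediate model, because a more complex method does not always produce the best performing model.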

Understanding the Models for the SAS Rapid Predictive Modeler

The SAS Rapid Predictive Modeler provides you with basic, intermediate, and advanced models. The models increase in sophistication and complexity.

The basic model is a simple regression analysis.
The intermediate model includes a more sophisticated analysis, plus the analysis from the basic model, and chooses the better model.
The advanced model includes an even more sophisticated analysis, plus the analyses from the basic and intermediate models, and chooses the best model.

Basic

The basic model performs a series of three data mining operations.

Variable Selection: The basic model chooses the top 100 variables for modeling.
Transformation: The basic model performs an Optimal Binning transformation on the top 100 variables selected for modeling. The Optimal Binning transformation compensates for missing variable values, so missing value imputation is not performed.
Modeling: The basic model uses a forward regression model. The forward regression model adds one variable at a time to the linear equation in a stepwise process, and stops when the contributions of the remaining variables are insignificant. In this way, the forward regression model seeks to exclude variables with no predictive ability (or variables that are highly correlated with other predictor variables) from the analysis.
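The forward selection loop in the Modeling step can be sketched in Python as follows. This is a simplified illustration, not the SAS procedure: the scoring function is supplied by the caller, and the `min_gain` threshold is an assumed stand-in for the significance test. The toy score and its contribution values are made up.

```python
def forward_select(candidates, score, min_gain=0.01):
    """Greedy forward selection.

    candidates: variable names to consider.
    score: callable that takes a list of names and returns a fit score
           (higher is better), for example a validation R-square.
    min_gain: assumed stopping threshold, standing in for a significance test.
    """
    selected = []
    remaining = list(candidates)
    best_score = score(selected)
    while remaining:
        # Try each remaining variable and keep the best single addition.
        trial_scores = {name: score(selected + [name]) for name in remaining}
        best_name = max(trial_scores, key=trial_scores.get)
        gain = trial_scores[best_name] - best_score
        if gain < min_gain:
            break  # no remaining variable contributes enough
        selected.append(best_name)
        remaining.remove(best_name)
        best_score = trial_scores[best_name]
    return selected

# Toy score with made-up contributions, for illustration only.
CONTRIBUTION = {"income": 0.40, "age": 0.25, "noise": 0.001}

def toy_score(names):
    total = sum(CONTRIBUTION.get(name, 0.0) for name in names)
    # "income_copy" is highly correlated with "income": it helps only
    # when "income" is not already in the model.
    if "income_copy" in names and "income" not in names:
        total += 0.39
    return total

print(forward_select(["age", "noise", "income", "income_copy"], toy_score))
```

In this toy run, the uninformative variable and the redundant copy are both left out, which mirrors the two kinds of exclusion described in the Modeling step.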

Intermediate

The intermediate model performs a series of seven data mining operations.

Variable Selection: The intermediate model chooses the top 200 variables for modeling.
Transformation: The intermediate model performs a best power transformation on the 200 variables that were selected for modeling. The best power transformations are a subset of the general class of transformations that are known as Box-Cox transformations. The best power transformation evaluates a subset of exponential power transformations, and then chooses the transformation that has the best results for the specified criterion.
Imputation: The intermediate model performs an imputation to replace missing values with the mean value of each variable. The imputation operation also creates indicator variables that enable observations that contain imputed values to be identified.
Variable Selection: The intermediate model uses the chi-square and R-square criteria tests to remove variables that are not related to the target variable.
Union of Variable Selection Techniques: The intermediate model merges the set of variables that were selected by the chi-square and R-square criteria tests.
Modeling: The intermediate model submits the training data to three competing model algorithms. The models are a decision tree, a logistic regression, and a stepwise regression. In the case of the logistic regression model, the training data is first submitted to a decision tree that creates a NODE_ID variable that is passed as input to the regression model. The NODE_ID variable is created to enable variable interaction models.
Champion Model Selection: The intermediate model performs an analytic assessment of the predictive or classification performance of the competing models. The model that demonstrates the best predictive or classification performance is selected to perform the modeling analysis. Champion model selection for the intermediate model evaluates the performance of both the intermediate models and the basic model.
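The best power transformation in the Transformation step above can be illustrated with a small Python sketch. The candidate powers and the criterion (absolute Pearson correlation with the target) are assumptions chosen for illustration; the text does not state which powers or criterion the software uses. Inputs are assumed to be strictly positive.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def power_transform(x, power):
    """Return x**power; power 0 means the natural log, as in the Box-Cox family."""
    return math.log(x) if power == 0 else x ** power

def best_power(xs, ys, powers=(-2, -1, -0.5, 0, 0.5, 1, 2)):
    """Return the candidate power whose transform correlates most with ys."""
    def strength(power):
        transformed = [power_transform(x, power) for x in xs]
        return abs(pearson(transformed, ys))
    return max(powers, key=strength)

# Made-up data in which the target is the square of the input.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [x ** 2 for x in xs]
print(best_power(xs, ys))
```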

After the SAS Rapid Predictive Modeler chooses the intermediate champion model, it compares the predictive performance of the intermediate champion model to the basic model, and then chooses the better model as the result.
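The Imputation step listed above, mean replacement plus an indicator variable, can be sketched in Python as follows. The function name and the `M_` prefix for the indicator column are hypothetical choices for this example, and missing values are represented as `None`.

```python
def impute_mean_with_indicator(rows, column):
    """Replace missing (None) values in `column` with the mean of observed values.

    Adds an indicator column named "M_<column>" (a hypothetical naming choice)
    that is 1 for imputed rows and 0 otherwise. Returns new row dictionaries.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        raise ValueError(f"column {column!r} has no observed values")
    mean = sum(observed) / len(observed)
    flag = f"M_{column}"

    result = []
    for row in rows:
        missing = row[column] is None
        new_row = dict(row)
        new_row[column] = mean if missing else row[column]
        new_row[flag] = 1 if missing else 0
        result.append(new_row)
    return result

# Made-up data with one missing value.
rows = [{"age": 30.0}, {"age": None}, {"age": 50.0}]
imputed = impute_mean_with_indicator(rows, "age")
print(imputed[1])
```

Keeping the indicator column lets a later model treat the fact that a value was missing as a possible predictor in its own right.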

Advanced

The advanced model performs a series of seven data mining operations.

Variable Selection: The advanced model chooses the top 400 variables for modeling.
Transformation: The advanced model performs the multiple transformation algorithm on the 400 variables that were selected for modeling. The multiple transformation operation creates several variable transformations that are intended for use in later variable selections. Multiple transformations result in an increase in the number of input variables. Because of the increase in input variables, SAS Rapid Predictive Modeler selects the best 400 input variables from the output that was generated by the multiple transformation algorithm.
Imputation: The advanced model performs an imputation to replace missing values with the mean value of each variable. The imputation operation also creates indicator variables that enable the user to identify observations that contain imputed values.
Variable Selection: The advanced model uses the chi-square and R-square criteria tests to remove variables that are not related to the target variable. AOV16 variables are created during the R-square analysis.
Union of Variable Selection Techniques: The advanced model merges the set of variables that were selected by the chi-square and R-square criteria tests.
Modeling: The advanced model submits the training data to four competing model algorithms. The models are a decision tree model, a neural network model, a backward regression model, and an ensemble model. The neural network model conducts limited searches in order to find an optimal feed-forward network. Backward regression is a linear regression model that eliminates variables by removing one variable at a time until the R-square score drops significantly. The ensemble model creates new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor input models. The new ensemble model is then used to score new data. The ensemble model that you use in the advanced model is created from the output of the basic model, the champion model from the intermediate model, and the champion model from the advanced model.
Champion Model Selection: The advanced model performs an analytic assessment of the predictive or classification performance of the competing decision tree, neural network, and backward regression models. The model that demonstrates the best predictive or classification performance is then used as an input, along with the champion models from the basic and intermediate models, to create an ensemble model. The newly created advanced ensemble model, decision tree model, neural network model, and backward regression model are then analytically compared to select the best model from the sample space of all basic, intermediate, and advanced champion models.
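The way an ensemble combines the outputs of its predecessor models can be sketched in Python. Simple averaging is assumed here as the combination rule; the text above says only that posterior probabilities or predicted values are combined. The function name and the probability values are made up for illustration.

```python
def ensemble_average(model_outputs):
    """Average per-observation outputs across several models.

    model_outputs: a list of equal-length lists, one list per predecessor
    model. Each list holds posterior probabilities (class target) or
    predicted values (interval target) for the same observations.
    """
    if not model_outputs:
        raise ValueError("at least one model output is required")
    n_obs = len(model_outputs[0])
    if any(len(output) != n_obs for output in model_outputs):
        raise ValueError("all models must score the same observations")
    # zip(*...) groups the outputs by observation rather than by model.
    return [sum(values) / len(values) for values in zip(*model_outputs)]

# Made-up posterior probabilities from three predecessor models.
basic_model = [0.2, 0.9, 0.4]
intermediate_champion = [0.4, 0.7, 0.6]
advanced_champion = [0.3, 0.8, 0.5]
combined = ensemble_average([basic_model, intermediate_champion, advanced_champion])
print(combined)
```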

After the SAS Rapid Predictive Modeler selects the advanced champion model, it compares the predictive performance of that model to the champion models for the intermediate and basic models, and then chooses the best performing champion model as the result.