Sample 54866: Logistic model selection using area under curve (AUC) or R-square selection criteria

Contents: Purpose / History / Requirements / Usage / Details / Limitations

PURPOSE:
The SELECT macro performs model selection methods for categorical-response models that can be fit in PROC LOGISTIC. These include models using the logit, probit, cloglog, cumulative logit, or generalized logit links. The macro supports binary as well as ordinal and nominal multinomial models.

Standard model selection is done by choosing candidate effects for entry to or removal from the model according to their significance levels. After completion, the set of models selected at each step of this process is sorted on the selected criterion: AUC, R-square, max-rescaled R-square, AIC, or BIC. The requested number of best models on the selected criterion is then displayed.

HISTORY:
Version   Update Notes
1.0       Initial coding
REQUIREMENTS:
Base SAS® and SAS/STAT® software are required.
USAGE:
Follow the instructions in the Downloads tab of this sample to save the SELECT macro definition. Replace the text within quotation marks in the following statement with the location of the SELECT macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the SELECT macro and make it available for use:
   %inc "<location of your file containing the SELECT macro>";

Following this statement, you can call the SELECT macro. See the Results tab for examples.

The following parameters are required when using the SELECT macro:

response=variable
Specifies the name of the response variable to be modeled. The variable can be binary or multinomial with ordered or unordered levels. The response variable should not be continuous. Response variable options, such as EVENT= or ORDER= specified in parentheses, can follow the response variable name. For example, RESPONSE=Y(EVENT="1").
model=effect-list
Specifies the set of candidate effects from which a final model is selected based on their significance levels. Any type of list allowed in the MODEL statement in PROC LOGISTIC can be used, such as: VAR:, X1-X10, A--K, A|B|C@2.
class=variable-list and options
Required only if some variables in the candidate effects in the MODEL= option are categorical variables. Specify categorical variables and variable options using the same syntax as the CLASS statement in PROC LOGISTIC. For example, CLASS=A B(REF="1") / PARAM=REF.
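
As a minimal sketch of a call that uses only the required parameters, assume a hypothetical most recently created data set with a binary response Y, a categorical predictor TRT, and continuous predictors X1-X5. These names are illustrative and do not come from the sample. Because DATA= is omitted, the last-created data set is used.

```sas
/* Define the SELECT macro (replace the placeholder with your file location) */
%inc "<location of your file containing the SELECT macro>";

/* Hypothetical variables: Y (binary response), TRT (categorical), X1-X5 (continuous). */
/* Stepwise selection with models ordered by AUC, since both are the defaults.        */
%SELECT(response=y(event="1"),
        model=trt x1-x5,
        class=trt(ref="1") / param=ref)
```

CLASS= is needed here only because TRT is categorical. With purely continuous candidate effects, RESPONSE= and MODEL= are sufficient.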

The following parameters are optional:

data=data-set-name
Specifies the data set containing the response variable and variables in the candidate effects. The default is the last-created data set.
method=selection-method
Specifies the model selection method to use: FORWARD, BACKWARD, or STEPWISE. The SCORE method in PROC LOGISTIC is not available. METHOD=STEPWISE is the default.
choose=selection-criterion
Specifies the criterion statistic to order the selected models: AUC, RSQUARE, RSQUARE_RESCALED, AIC, or BIC. CHOOSE=AUC is the default. The AUC criterion is not available when LINK=GLOGIT.
best=number|MAX
Specifies the number of selected models to display that are best on the CHOOSE= criterion. Number must be zero or a positive integer. BEST=MAX is the default, which displays all selected models.
include=number
Specifies how many of the candidate effects to include in every model during model selection. INCLUDE=number requests that the first number effects listed in the MODEL= option be included in every model. The default is INCLUDE=0.
link=LOGIT|PROBIT|CLOGLOG|GLOGIT
Specifies the link function. The default is LINK=LOGIT. For an ordinal response, the default LINK=LOGIT uses cumulative logits. For a nominal (unordered) response, specify LINK=GLOGIT.
sle=number
Specifies the significance level for entering an effect into the model in the FORWARD or STEPWISE method. Number should be between 0 and 1, inclusive. The default is SLE=0.05. The SLE= option has no effect when METHOD=BACKWARD.
sls=number
Specifies the significance level for an effect to stay in the model in a backward elimination step. Number should be between 0 and 1, inclusive. The default is SLS=0.05. The SLS= option has no effect when METHOD=FORWARD.
modelopts=option-list
Specifies any additional options that can appear following a slash in the MODEL statement in PROC LOGISTIC. An option that is available as a SELECT macro parameter takes precedence over the same option specified in MODELOPTS=, even when the macro parameter is not specified and its default applies. For example, specifying MODELOPTS=INCLUDE=3 does not change the default INCLUDE=0. To change it, specify INCLUDE=3 as a macro parameter.
details=SUMMARY|ALL
Requests display of only the model selection summary table from PROC LOGISTIC (DETAILS=SUMMARY), or the display of all results produced by the DETAILS option of PROC LOGISTIC (DETAILS=ALL). If omitted, only the final list of models sorted on the CHOOSE= criterion is displayed and all model selection results from PROC LOGISTIC are suppressed.
out=data-set-name
Specifies the name of the final data set containing the list of selected models sorted by the CHOOSE= criterion. By default, OUT=BestMods.
binwidth=value
Specifies the size of the bins on the predicted probabilities to be used when computing the area under the ROC curve (AUC) for binary and ordinal response models. Value can be 0 to 1, inclusive (except for ordinal response models where value must be greater than zero). Smaller values of BINWIDTH= provide more precise estimates of the AUC but require more time and memory. The default is BINWIDTH=1E-6.
freq=variable
Specifies a variable that contains the frequency of occurrence of each observation. Each observation is treated as if it appears n times, where n is the value of the FREQ= variable for the observation. If it is not an integer, the frequency value is truncated to an integer. If the frequency value is less than 1 or missing, the observation is not used in the model fitting. When FREQ= is not specified, each observation is assigned a frequency of 1.
weight=variable
Specifies a variable that weights each observation in the input data set by its value on the WEIGHT= variable. Unlike the FREQ= variable, the values of the WEIGHT= variable can be nonintegral and are not truncated. Observations with negative, zero, or missing values for the WEIGHT variable are not used in the model fitting. When WEIGHT= is not specified, each observation is assigned a weight of 1.
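
The optional parameters can be combined as in the following sketch. The data set TRIAL, the variables Y, AGE, and X1-X10, and the output data set name TopModels are hypothetical and chosen only for illustration.

```sas
/* Backward elimination with AGE (the first effect listed) kept in every model. */
/* The five best models on AIC are displayed and saved in WORK.TopModels.       */
%SELECT(data=trial,
        response=y(event="1"),
        model=age x1-x10,
        method=backward,
        sls=0.10,
        include=1,
        choose=aic,
        best=5,
        details=summary,
        out=TopModels)

/* Inspect the saved list of selected models */
proc print data=TopModels;
run;

/* Stepwise selection ordered by AUC with wider bins than the 1E-6 default. */
/* This trades some precision in the AUC estimate for less time and memory. */
%SELECT(data=trial,
        response=y(event="1"),
        model=age x1-x10,
        choose=auc,
        binwidth=1E-4)
```

SLE= is omitted from the first call because it has no effect when METHOD=BACKWARD.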

The version of the SELECT macro that you are using is displayed when you specify version (or any string) as the first argument. For example:

    %SELECT(version, response=y, model=x1-x10)
DETAILS:
Model selection is done in the same manner as PROC HPLOGISTIC when the CHOOSE= and SELECT=SL method options are specified with the forward, backward, or stepwise methods. That is, effects are added to or removed from the model according to their significance levels resulting in a set of models. The CHOOSE= criterion is then used to sort the models. In addition to the AIC and BIC criteria available in PROC HPLOGISTIC, the SELECT macro can also choose models using the area under the ROC curve (CHOOSE=AUC), the R-square statistic (CHOOSE=RSQUARE), or the max-rescaled R-square statistic (CHOOSE=RSQUARE_RESCALED).
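
For comparison, the following sketch pairs a PROC HPLOGISTIC step that uses significance-level selection with an AIC-based choice against the corresponding SELECT macro call. The data set TRIAL and the variables Y and X1-X3 are hypothetical, and the SELECTION statement is written from general knowledge of PROC HPLOGISTIC syntax rather than taken from this sample, so check it against the documentation for your release.

```sas
/* Hypothetical data: WORK.TRIAL with binary response Y and predictors X1-X3 */
proc hplogistic data=trial;
   model y(event="1") = x1 x2 x3;
   /* Add and remove effects by significance level; choose the final model by AIC */
   selection method=stepwise(select=sl choose=aic);
run;

/* Corresponding SELECT macro call: same selection method, models sorted on AIC */
%SELECT(data=trial,
        response=y(event="1"),
        model=x1 x2 x3,
        method=stepwise,
        choose=aic)
```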

Note that since the SELECT macro uses PROC LOGISTIC to fit models, differences from PROC HPLOGISTIC can occur due to differences in defaults. For example, PROC HPLOGISTIC does not require model hierarchy by default and uses a Newton-Raphson algorithm rather than the Fisher scoring method. Using the HIERARCHY=NONE and TECHNIQUE=NEWTON options in the MODEL statement of PROC LOGISTIC removes these differences.
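
Because MODELOPTS= passes options through to the MODEL statement of PROC LOGISTIC, the two options named above can be supplied in the macro call. The following is a sketch only, with a hypothetical data set TRIAL and hypothetical variables Y and X1-X10.

```sas
/* Pass HIERARCHY=NONE and TECHNIQUE=NEWTON to the PROC LOGISTIC MODEL statement */
/* so that these two defaults no longer differ from PROC HPLOGISTIC.             */
%SELECT(data=trial,
        response=y(event="1"),
        model=x1-x10,
        choose=bic,
        modelopts=hierarchy=none technique=newton)
```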

Macro updates:

The SELECT macro attempts to check for a later version of itself. If it is unable to do this (for example, because no active Internet connection is available), the macro issues the following message:

   SELECT: Unable to check for newer version

The computations performed by the macro are not affected by the appearance of this message.

LIMITATIONS:
The best subsets selection method (SELECTION=SCORE) available in PROC LOGISTIC is not available in the SELECT macro. The area under the ROC curve criterion (CHOOSE=AUC) is not available for nominal multinomial models (LINK=GLOGIT).



These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.