Usage Note 22605: Assessing the relative importance of effects in generalized linear models

After settling on a final model, it is often desirable to assess the relative importance of the predictors in the model. In ordinary linear regression, as done in the REG, GLM, and GLMSELECT procedures, two commonly used tools are standardized regression coefficients and partial correlations. Note that the p-values associated with the model parameters assess the evidence of a nonzero association with the response. They do not measure the strength of the association and so cannot be used to assess the relative importance of the predictors.

Unlike the regular model coefficients, the standardized coefficients can be directly compared to find the largest ones (in absolute value), which represent the most important predictors. They measure the change in the response, in standard deviation units, for a one standard deviation change in the predictor while the other predictors are held fixed. In the REG and GLMSELECT procedures, standardized coefficients are displayed by the STB option in the MODEL statement. Because PROC GLM does not provide an option to display standardized estimates, use PROC GLMSELECT instead. One problem is that each parameter in the model has its own standardized coefficient. This makes a categorical (CLASS) predictor difficult to assess, since multiple standardized coefficients are associated with it and they may vary considerably.
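As a minimal sketch of requesting standardized coefficients in PROC GLMSELECT, the following assumes the Sashelp.Cars sample data set, which is used here only for illustration and is not part of this note's example. Origin is a three-level CLASS variable, so it contributes more than one parameter, and the STB output should show a separate standardized coefficient for each. SELECTION=NONE fits the specified model without performing effect selection.

     /* Illustration only: Sashelp.Cars is an assumed sample data set */
     proc glmselect data=Sashelp.Cars;
        class Origin;
        model MPG_City = Origin Weight Horsepower / selection=none stb;
        run;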

The partial correlation assesses the importance of a predictor in the regression model while holding the other predictors constant. Of the variability in the response that is not explained by the other predictors, the proportion explained by the predictor in question is its squared partial correlation (SPC). In PROC REG, the SPC for each predictor is provided by the PCORR2 option in the MODEL statement. But again, there is an SPC for each parameter in the model rather than one for each model effect.
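A minimal sketch of requesting both statistics in PROC REG follows. It assumes the Sashelp.Class sample data set, which is used here only for illustration and is not part of this note's example.

     /* Illustration only: Sashelp.Class is an assumed sample data set */
     /* STB adds standardized estimates; PCORR2 adds squared partial   */
     /* correlations (Type II) to the parameter estimates table        */
     proc reg data=Sashelp.Class;
        model Weight = Height Age / stb pcorr2;
        run;
        quit;

Both predictors are continuous, so each has a single parameter and the two statistics can be compared directly across predictors.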

Additional measures are discussed by Thompson (2009) and Thomas et al. (2008) for the particular case of logistic models, though several could apply to other types of models.

Considering models beyond normal regression models, standardized coefficients and partial correlations are available in PROC LOGISTIC using the STB and PCORR options in the MODEL statement. Standardized estimates (though computed differently from PROC LOGISTIC) are also provided by the STDCOEF option in the MODEL statement of PROC GLIMMIX. As described above, these statistics are given for each parameter rather than a single value for each model effect. A variable importance measure based on generalized cross validation (GCV) is available for several generalized linear models in PROC ADAPTIVEREG. While this measure is associated with model effects rather than parameters, only a main effects model or a model with all two-way interactions can be assessed.

Unlike statistics that offer importance measures associated with model parameters, an estimate of the SPC for each model effect in many generalized linear models can be obtained using the RsquareV macro. The RsquareV macro computes the R_V² estimator proposed by Zhang (2017), which defines the variation in the response, and the variation left unexplained by the model, using the variance function of the response distribution. Zhang states that this estimator applies to any model based on well-defined mean and variance functions, such as generalized linear models based on likelihood or quasi-likelihood functions. The R_V² measure is equivalent to the SPC in the case of a normal regression model. The SPC for a predictor is obtained by comparing the full model to a submodel that excludes the predictor (or model effect) of interest.
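For reference, the estimator can be sketched as follows. This is a summary from memory of the definition in Zhang (2017), not a statement of what the RsquareV macro computes internally, and it should be checked against the paper. Here V is the variance function, y_i is the observed response, and the two fitted means come from the full model and from the comparison model (the intercept-only model for the overall R_V², or the submodel for an SPC). The symbol names are chosen for this sketch.

     d_V(a, b) = \left\{ \int_a^b \sqrt{1 + [V'(t)]^2}\; dt \right\}^2

     R_V^2 = 1 - \frac{\sum_i d_V\left(y_i, \hat\mu_i^{\mathrm{full}}\right)}
                      {\sum_i d_V\left(y_i, \hat\mu_i^{\mathrm{sub}}\right)}

For a normal response, V is constant, so V'(t) = 0 and d_V(a, b) = (a - b)^2. The ratio then reduces to the error sum of squares of the full model divided by that of the submodel, which is the usual squared partial correlation. For a binomial response, V(t) = t(1 - t) and V'(t) = 1 - 2t.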

The partial correlation provided by the PCORR option in PROC LOGISTIC is computed differently. For a given parameter in the model, it is defined using the Wald chi-square statistic for the test of the parameter and the log likelihood for the intercept-only model. For details, see the description of the PCORR option in the PROC LOGISTIC documentation. Since, like the standardized estimates, the partial correlations are computed for each model parameter, assessing the importance of a multi-parameter effect can be difficult.
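As a sketch of that definition, recalled from the PROC LOGISTIC documentation and to be confirmed there, the statistic for parameter i has the following form, where \chi_i^2 is the Wald chi-square for the parameter and \log L_0 is the log likelihood of the intercept-only model:

     r_i = \mathrm{sign}(\hat\beta_i)\,
           \sqrt{\frac{\chi_i^2 - 2}{-2 \log L_0}}
           \quad \text{if } \chi_i^2 \ge 2,
           \qquad r_i = 0 \text{ otherwise}

Under this form, any parameter whose Wald chi-square is less than 2 has a partial correlation of exactly zero, so the statistic cannot distinguish among weak effects.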

Example

The following uses the neuralgia study data in the example titled "Logistic Modeling with Categorical Predictors" in the PROC LOGISTIC documentation. The absence of pain is modeled as a function of four predictors: Treatment (two test treatments and placebo), Sex, Age, and Duration of pain before the study. In addition to these four predictors, the interaction of Treatment and Sex is included in the full model. The goal is to assess the relative importance of these five effects in the model. To do this, several statistics, including both the chi-square-based SPC and the SPC from the RsquareV macro, are computed for each effect.

For a binary response, the RsquareV macro requires the response variable to be a 0,1-coded numeric variable. The variable NoPain is created in the initial DATA step for use as the response in each model. Note that NoPain=1 indicates absence of pain.

     data Neur;
        set Neuralgia;
        NoPain = (Pain = "No");
        run;

To obtain SPCs for each model effect, begin by fitting the full model and saving the predicted probabilities (P_full) in a data set (Preds) using the OUTPUT statement. The RSQUARE, STB, and PCORR options are included to show the R² statistics, standardized parameter estimates, and chi-square-based SPCs that PROC LOGISTIC provides. To obtain the GCV-based importance measure, PROC ADAPTIVEREG then fits the main effects (additive) logistic model. This model matches the main effects model that PROC LOGISTIC fits if reference parameterization (PARAM=REF) is used. Note that this model does not include the single interaction, Treatment*Sex, that is included in the model fit by PROC LOGISTIC.

     proc logistic data=Neur;
        class Treatment Sex;
        model NoPain(event="1") = Treatment Sex Treatment*Sex Age Duration / rsquare stb pcorr;
        output out=Preds p=P_full;
        run;
     proc adaptivereg data=Neur;
        class Treatment Sex;
        model NoPain(event="1") = Treatment Sex Age Duration / dist=bin additive
              forwardonly linear=(Treatment Sex Age Duration);
        run;

Following the fit of the full model, a series of submodels is fitted. Each submodel excludes one model effect in turn and saves its predicted probabilities. Finally, the intercept-only model (no predictors) is fitted. Note that the input (DATA=) data set for each submodel fit is the Preds data set and that the output (OUTPUT OUT=) data set is also Preds. As a result, after all of the submodels are fitted, data set Preds contains the predicted probabilities from all models.

It is important to explicitly specify the response level whose probability you want to estimate in each of the logistic models. Since the response probability of interest is the probability of no pain (NoPain=1), the EVENT="1" option is specified in each model. If this is not done, the wrong probability could be estimated and the SPCs will not be correct.

     proc logistic data=Preds;
        class Treatment Sex;
        model NoPain(event="1") = Sex Treatment*Sex Age Duration;
        output out=Preds p=P_Trt;
        run;
     proc logistic data=Preds;
        class Treatment Sex;
        model NoPain(event="1") = Treatment Treatment*Sex Age Duration;
        output out=Preds p=P_Sex;
        run;
     proc logistic data=Preds;
        class Treatment Sex;
        model NoPain(event="1") = Treatment Sex Age Duration;
        output out=Preds p=P_TrtSex;
        run;
     proc logistic data=Preds;
        class Treatment Sex;
        model NoPain(event="1") = Treatment Sex Treatment*Sex Duration;
        output out=Preds p=P_Age;
        run;
     proc logistic data=Preds;
        class Treatment Sex;
        model NoPain(event="1") = Treatment Sex Treatment*Sex Age;
        output out=Preds p=P_Dur;
        run;
     proc logistic data=Preds;
        class Treatment Sex;
        model NoPain(event="1") = ;
        output out=Preds p=P_null;
        run;
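The submodel steps above differ only in the effect list and in the name of the saved probability variable, so they could instead be generated by a small macro. The following is a sketch only, meant as an alternative to the steps above rather than an addition to them. The macro name FitSub and its EFFECTS= and PVAR= parameters are hypothetical names introduced here for illustration. They are not part of the RsquareV macro or of SAS.

     /* Hypothetical helper: fit one model to Preds and add its   */
     /* predicted probabilities to Preds under the name in PVAR=  */
     %macro FitSub(effects=, pvar=);
        proc logistic data=Preds;
           class Treatment Sex;
           model NoPain(event="1") = &effects;
           output out=Preds p=&pvar;
           run;
     %mend FitSub;

     %FitSub(effects=Sex Treatment*Sex Age Duration,       pvar=P_Trt)
     %FitSub(effects=Treatment Treatment*Sex Age Duration, pvar=P_Sex)
     %FitSub(effects=Treatment Sex Age Duration,           pvar=P_TrtSex)
     %FitSub(effects=Treatment Sex Treatment*Sex Duration, pvar=P_Age)
     %FitSub(effects=Treatment Sex Treatment*Sex Age,      pvar=P_Dur)
     %FitSub(effects=,                                     pvar=P_null)

     /* Optional check: N and NMISS show whether any predictions   */
     /* are missing; MIN and MAX should lie between 0 and 1        */
     proc means data=Preds n nmiss min max;
        var P_full P_Trt P_Sex P_TrtSex P_Age P_Dur P_null;
        run;

Keeping the effect lists in one place makes it easier to confirm that each submodel drops exactly one effect from the full model.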

The model R_V², unadjusted and adjusted, for the full model is obtained from the first call of the RsquareV macro. In that call, the predicted probabilities from the full and intercept-only (null) models are specified. The total number of parameters estimated in the full model is 8 (the total of the DF column of the parameter estimates table, not shown). The intercept-only model has only one parameter.

Next, the SPC for each model effect is obtained by specifying the predicted probabilities of the full model and of the model that excludes each model effect. The number of parameters in each submodel is specified in NPARMSUB=.

     %RsquareV(data=Preds, response=NoPain, dist=binomial,
               pfull=P_full, psub=P_null,   nparmfull=8, nparmsub=1)
     %RsquareV(data=Preds, response=NoPain, dist=binomial,
               pfull=P_full, psub=P_Trt,    nparmfull=8, nparmsub=6)
     %RsquareV(data=Preds, response=NoPain, dist=binomial,
               pfull=P_full, psub=P_Sex,    nparmfull=8, nparmsub=7)
     %RsquareV(data=Preds, response=NoPain, dist=binomial,
               pfull=P_full, psub=P_TrtSex, nparmfull=8, nparmsub=6)
     %RsquareV(data=Preds, response=NoPain, dist=binomial,
               pfull=P_full, psub=P_Age,    nparmfull=8, nparmsub=7)
     %RsquareV(data=Preds, response=NoPain, dist=binomial,
               pfull=P_full, psub=P_Dur,    nparmfull=8, nparmsub=7)

The results of the first call of the RsquareV macro for the full model are shown below. These results and the results from the subsequent calls for each submodel are summarized in the table below.

R_v² and Penalized R_v²

Rsquare_v	Penalized Rsquare_v
0.49588	0.42801

These statements fit the full model and obtain a likelihood ratio Type 3 test for each model effect for comparison.

     proc genmod data=Neur;
        class Treatment Sex;
        model NoPain(event="1") = Treatment Sex Treatment*Sex Age Duration / d=b type3;
        run;

The following table summarizes the results. The full model is designated (T,S,TS,A,D) indicating that all effects, Treatment (T), Sex (S), Treatment*Sex (TS), Age (A), and Duration (D), are in the model. The designation for the submodel that excludes Treatment is (T|S,TS,A,D) and similarly for each effect that is excluded, in turn, from the model.

First, notice that the R² and adjusted R² statistics provided by the RSQUARE option in PROC LOGISTIC reasonably agree with the R_V² statistics for the full model. For the submodels evaluating the individual model effects, both the unadjusted and adjusted R_V² values suggest that Treatment is the most important effect in the model, followed by Age, then Sex. Neither Duration nor the Treatment*Sex interaction is important.

The p-values from the Type 3 tests of the effects in the full model are in agreement, though as stated above, they assess the presence of effects, not their strengths. The standardized estimates and the PCORR SPCs are more difficult to compare since there are two parameters for each of Treatment and the interaction. But both statistics generally agree that Treatment, Sex, and Age are important while Duration and the interaction are not. Finally, the GCV-based measure also agrees with the above importance ordering, but note that the Treatment*Sex effect is not assessed. This measure is scaled so that the most important effect has value 100.

Model Comparison	Standardized Estimate(s) (STB)	Type 3 p-value	SPC (PCORR²)	R_V²	Adjusted R_V²	GCV Based	R²	Adjusted R²
(T,S,TS,A,D)	.	.	.	0.49588	0.42801	.	0.4222	0.5682
(T|S,TS,A,D)	0.3851 0.6786	<.0001	0.004630 0.037981	0.37086	0.34666	100	.	.
(S|T,TS,A,D)	0.5100	0.0120	0.040617	0.18402	0.16833	21.99	.	.
(TS|T,S,A,D)	-0.0913 0.0221	0.9324	0 0	0.008178	-0.02997	.	.	.
(A|T,S,TS,D)	-0.7689	0.0011	0.064714	0.24521	0.23069	61.83	.	.
(D|T,S,TS,A)	0.0335	0.8749	0	0.00174	-0.01746	-40.75	.	.
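The adjusted values in the table can be checked by hand. The following sketch rests on two assumptions that this note does not state: that the Neuralgia data set has n = 60 observations, and that the adjustment is the usual degrees-of-freedom penalty 1 - (1 - R_V²)(n - NPARMSUB)/(n - NPARMFULL). The formula is inferred from the reported numbers, not taken from the RsquareV macro documentation. The data set name AdjCheck and its variable names are introduced here for illustration.

     data AdjCheck;
        /* Assumptions: n = 60 observations; 8 parameters in full model */
        n = 60;
        NParmFull = 8;
        /* Excluded = effect(s) left out of the comparison model */
        input Excluded :$8. R2V NParmSub Reported;
        Recomputed = 1 - (1 - R2V) * (n - NParmSub) / (n - NParmFull);
        Difference = Recomputed - Reported;
        datalines;
     All      0.49588   1   0.42801
     Trt      0.37086   6   0.34666
     Sex      0.18402   7   0.16833
     TrtSex   0.008178  6  -0.02997
     Age      0.24521   7   0.23069
     Dur      0.00174   7  -0.01746
     ;

     proc print data=AdjCheck noobs;
        var Excluded R2V NParmSub Reported Recomputed Difference;
        format R2V Reported Recomputed Difference 9.5;
        run;

For the full model, for example, 1 - (1 - 0.49588)(59/52) is approximately 0.42802, which agrees with the reported 0.42801 to within rounding of the inputs. The penalty also explains why the adjusted values for Duration and the interaction are negative: their unadjusted values are too small to offset the extra parameters.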

_________

References:

Thomas, D. R., Zhu, P., Zumbo, B. D., and Dutta, S. (2008), "On Measuring the Relative Importance of Explanatory Variables in a Logistic Regression," Journal of Modern Applied Statistical Methods, 7:1, 21-38.

Thompson, D. (2009), "Ranking Predictors in Logistic Regression," Proceedings of the 2009 MidWest SAS® Users Group, Paper D10.

Zhang, D. (2017), "A Coefficient of Determination for Generalized Linear Models," The American Statistician, 71:4, 310-316.

Operating System and Release Information

Product Family	Product	System	SAS Release
			Reported	Fixed*
SAS System	SAS/STAT	All	n/a

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:	low
Topic:	SAS Reference ==> Procedures ==> LOGISTIC; Analytics ==> Categorical Data Analysis; Analytics ==> Regression; SAS Reference ==> Procedures ==> FMM; SAS Reference ==> Procedures ==> GAMPL; SAS Reference ==> Procedures ==> GENMOD; SAS Reference ==> Procedures ==> GLIMMIX; SAS Reference ==> Procedures ==> HPGENSELECT; SAS Reference ==> Procedures ==> HPLOGISTIC; SAS Reference ==> Procedures ==> ADAPTIVEREG

Date Modified:	2020-03-03 14:05:30
Date Created:	2002-12-16 10:56:39
