Usage Note 32471: Testing assumptions in logit, probit, Poisson and other generalized linear models

Equal Variances

Unlike in least squares estimation of normal-response models, variances are not assumed to be equal in the maximum likelihood estimation of logistic, Poisson, and other generalized linear models. For these models there is usually a known relationship between the mean and the variance such that the variance cannot be constant. In the Poisson model, for example, the mean and variance are equal. Consequently, the variance will be larger in populations in which the mean is larger. Because variances are not equal in these models, there is no need to test for this condition when fitting the model in a procedure such as GENMOD, GLIMMIX, LOGISTIC, or PROBIT, which is designed to fit such models. However, if there is more or less variability than expected under the response distribution then the data are said to be overdispersed or underdispersed (SAS Note 22630) and other models can be used.
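As a rough sketch of how dispersion might be checked and accommodated, the following statements fit a Poisson model and then refit it with a Pearson-based scale adjustment. The data set COUNTS and the variables Y and X are hypothetical names used only for illustration. In the first fit, a Pearson chi-square divided by its degrees of freedom (the Value/DF column of the goodness-of-fit table) that is well above 1 suggests overdispersion, and a value well below 1 suggests underdispersion.

      /* COUNTS, Y, and X are hypothetical names */
      proc genmod data=counts;
        model y = x / dist=poisson;
        run;

      /* Same model with standard errors rescaled by the Pearson dispersion estimate */
      proc genmod data=counts;
        model y = x / dist=poisson scale=pearson;
        run;

The SCALE=PEARSON option leaves the parameter estimates unchanged and adjusts only their standard errors. A negative binomial model (DIST=NEGBIN) is another commonly used alternative for overdispersed counts.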

Normal Errors (Residuals)

For generalized linear models in which the response distribution is not normal, the residuals from these models are also not normal. So, again there is no need to test for this condition when fitting the model in a procedure such as GENMOD, GLIMMIX, LOGISTIC, or PROBIT, which is designed to fit such models.

Independence

As in ordinary regression, independence of the observations is assumed. Autocorrelation or correlated clusters of observations might adversely affect the parameter variance estimates. No test of correlation is available. However, when the data are known to be correlated in clusters, the model can be fit using the Generalized Estimating Equations (GEE) method. The GEE method is available via the REPEATED statement in PROC GENMOD or (beginning in SAS 9.4M2) in PROC GEE. The GEE method employs a White-like sandwich variance estimator that accounts for the correlation.
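A minimal sketch of a GEE fit for clustered binary data follows. The data set CLUSTERED, the cluster identifier ID, and the variables Y and X are hypothetical, and the exchangeable working correlation (TYPE=EXCH) is only one possible choice.

      /* CLUSTERED, ID, Y, and X are hypothetical names */
      proc genmod data=clustered;
        class id;
        model y = x / dist=binomial link=logit;
        repeated subject=id / type=exch;   /* observations within ID treated as correlated */
        run;

The GEE parameter estimates table reports empirical (sandwich) standard errors by default, so inference remains reasonable even if the working correlation structure is not exactly right.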

Predictor Collinearity and Ill-Conditioned Information Matrix

As in normal-response models, collinearity (sometimes called multicollinearity) among the predictor variables in generalized linear models can cause the information matrix to become ill-conditioned, which adversely affects the precision of the estimated model parameters. Inflated standard errors for the model parameters can cause Wald-based tests to be nonsignificant when they otherwise would not be. Unlike in normal-response models, the existence of collinearity does not necessarily imply ill-conditioning in generalized linear models. Extremely large standard errors for one or more of the estimated parameters and large off-diagonal values in the parameter covariance matrix (COVB option) or correlation matrix (CORRB option) both suggest an ill-conditioned information matrix. However, these conditions can also happen for reasons not related to collinearity among the raw predictors.

In normal-response models, you are concerned about the condition of the information matrix, X'X, and this is directly affected by collinearity among the predictors (X). However, in generalized linear models, the information matrix you are concerned about is X'WX, where W is a diagonal matrix of weights that is determined by the fitting algorithm at each iteration. It is collinearity in the weighted predictors (W^½X) that directly affects the condition of the information matrix. Collinearity in the raw predictors (X) might not result in an ill-conditioned information matrix because the weights might reduce the effect of the collinearity. Conversely, the weights could cause the information matrix to become ill-conditioned even if the raw predictors are not collinear. Consequently, unlike in normal-response models, an assessment of collinearity in the raw predictors is not equivalent to an assessment of ill-conditioning. For generalized linear models fit using Fisher's scoring, the HESSWGT= option in the OUTPUT statement of PROC GENMOD provides the diagonal elements of the weight matrix, W. These can then be used as weights in a weighted regression in PROC REG. With its collinearity options specified, PROC REG then provides an assessment of the condition of the information matrix of the generalized linear model.

A Logistic Regression Example

The following statements fit a logistic model to the cancer remission data presented in the stepwise logistic regression example in the PROC LOGISTIC documentation (SAS Note 22930). For a logistic model fit using Fisher's scoring, W has values μ_i*(1-μ_i) along its diagonal, where μ_i is the binomial mean for the population defined by observation i. In the statements below, these weights are computed by the HESSWGT= option in PROC GENMOD and added to the OUTPUT OUT= data set. The SCORING=50 option ensures that GENMOD uses Fisher's scoring at each iteration (by default, GENMOD performs at most 50 iterations). The CORRB option is included to show the correlations among the parameter estimates of the fitted model. The ODS SELECT statement causes only the correlations and parameter estimates to be displayed.

      ods select CorrB ParameterEstimates;
      proc genmod data=remiss;
        model remiss = li temp cell / dist=binomial scoring=50 corrb;
        output out=out hesswgt=w;
        run;

Notice that several parameter estimates and their standard errors are quite large, suggesting that the information matrix might be ill-conditioned. Also, note the large correlation (-0.99) between the Intercept and TEMP parameter estimates (parameters 1 and 3).

Estimated Correlation Matrix
	Prm1	Prm2	Prm3	Prm4
Prm1	1.0000	0.6383	-0.9922	0.3563
Prm2	0.6383	1.0000	-0.6866	0.5039
Prm3	-0.9922	-0.6866	1.0000	-0.4676
Prm4	0.3563	0.5039	-0.4676	1.0000

Analysis Of Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Chi-Square	Pr > ChiSq
Intercept	1	-67.6339	56.8875	-179.131	43.8636	1.41	0.2345
li	1	-3.8671	1.7783	-7.3525	-0.3817	4.73	0.0297
temp	1	82.0737	61.7124	-38.8803	203.0278	1.77	0.1835
cell	1	-9.6521	7.7511	-24.8440	5.5397	1.55	0.2130
Scale	0	1.0000	0.0000	1.0000	1.0000

To assess the condition of the logistic model's information matrix, a weighted regression is done in PROC REG using the HESSWGT= values as weights and including the collinearity options COLLIN and COLLINOINT. With the WEIGHT statement, the collinearity options in PROC REG assess the information matrix from the final iteration of PROC GENMOD. The ODS SELECT statement displays only the collinearity diagnostics since all other results from PROC REG should be ignored. Note that any response variable, including random values, could be used since the response values are not involved in assessing the information matrix.

      ods select CollinDiag CollinDiagNoInt;
      proc reg data=out;
        weight w;
        model remiss = li temp cell / collin collinoint;
        run; quit;

The large final condition index (315) indicates that collinearity exists among the weighted predictors. The variation proportions associated with this large condition index suggest that TEMP is collinear with the intercept. In the collinearity results adjusted for the intercept (from the COLLINOINT option), the small condition indices suggest that there is no collinearity other than with the intercept.

Collinearity Diagnostics
			Proportion of Variation
Number	Eigenvalue	Condition Index	Intercept	li	temp	cell
1	3.90340	1.00000	0.00000553	0.00343	0.00000476	0.00032611
2	0.09274	6.48769	0.00005881	0.45683	0.00004243	0.00689
3	0.00382	31.95783	0.00488	0.09542	0.00295	0.81660
4	0.00003930	315.13660	0.99506	0.44433	0.99701	0.17619

Collinearity Diagnostics (intercept adjusted)
			Proportion of Variation
Number	Eigenvalue	Condition Index	li	temp	cell
1	1.59890	1.00000	0.16031	0.14518	0.00720
2	1.14592	1.18123	0.01703	0.06562	0.50885
3	0.25518	2.50313	0.82266	0.78919	0.48394
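To make explicit what the weighted COLLIN analysis measures, the following PROC IML sketch forms X'WX from the OUT data set created by the PROC GENMOD step, rescales it to have a unit diagonal, and computes condition indices as the square root of the ratio of the largest eigenvalue to each eigenvalue. It assumes that SAS/IML is available, and the matrix names are illustrative. Under those assumptions, the printed values should be close to those in the first collinearity table above.

      proc iml;
        use out;                             /* OUT= data set from PROC GENMOD */
        read all var {li temp cell} into X;
        read all var {w} into wgt;           /* HESSWGT= values */
        close out;
        X   = j(nrow(X), 1, 1) || X;         /* prepend the intercept column */
        XWX = X` * (wgt # X);                /* information matrix X'WX */
        d   = sqrt(vecdiag(XWX));
        S   = XWX / (d * d`);                /* rescale to a unit diagonal */
        S   = (S + S`) / 2;                  /* guard against rounding asymmetry */
        ev  = eigval(S);                     /* eigenvalues */
        condIndex = sqrt(max(ev) / ev);      /* condition indices */
        print ev condIndex;
        quit;

Because the matrix is rescaled to a unit diagonal, the result does not depend on the overall scale of the weights, only on their relative sizes.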

In logistic models, separation (SAS Note 22599) can also cause large parameter estimates and standard errors. To determine whether separation is the issue, use PROC LOGISTIC to fit the model. By default, PROC LOGISTIC checks for separation and will display notes in the SAS log and in the displayed results if separation is detected. For this model, PROC LOGISTIC does not detect separation, so the problem appears to be one of collinearity.
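The separation check could be run with statements such as the following. The EVENT='1' response option is an assumption about which response level is of interest; it does not affect whether separation is detected.

      /* Refit in PROC LOGISTIC only to use its separation check */
      proc logistic data=remiss;
        model remiss(event='1') = li temp cell;
        run;

If separation were present, the Model Convergence Status table would report complete or quasi-complete separation of data points, and a corresponding message would appear in the log.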

The following PROC REG step assesses the collinearity of the raw predictors. Again, any response variable could be used and only the collinearity diagnostics in the results are relevant.

      ods select CollinDiag CollinDiagNoInt;
      proc reg data=remiss;
        model remiss = li temp cell / collin collinoint;
        run; quit;

The COLLIN results again indicate collinearity between the raw TEMP predictor and the intercept. Large condition indices for both X'WX (315) and X'X (190) indicate collinearity among both the weighted and raw predictors. Assessments of both information matrices point to TEMP being collinear with the intercept. This similarity in the collinearity of the two matrices suggests that the collinearity in the raw predictors could be driving the collinearity in the weighted predictors.

Collinearity Diagnostics
			Proportion of Variation
Number	Eigenvalue	Condition Index	Intercept	li	temp	cell
1	3.84282	1.00000	0.00001411	0.01011	0.00001401	0.00259
2	0.12948	5.44792	0.00013645	0.97928	0.00013864	0.02006
3	0.02760	11.79925	0.00124	0.00311	0.00119	0.96945
4	0.00010558	190.77648	0.99861	0.00750	0.99866	0.00790

Collinearity Diagnostics (intercept adjusted)
			Proportion of Variation
Number	Eigenvalue	Condition Index	li	temp	cell
1	1.19888	1.00000	0.32845	0.04069	0.42725
2	1.04623	1.07047	0.19685	0.71894	0.01760
3	0.75488	1.26023	0.47470	0.24038	0.55515

One approach to this problem is to use a penalty-based selection method that can reduce the inflation of the parameter estimates and simultaneously find the important predictors of the response. This is illustrated for these data in Example 3 of SAS Note 60240. However, the following analysis suggests another solution. It shows that the range of TEMP in these data is very restricted, varying only from 0.98 to 1.038, and that its standard deviation is much smaller than those of the other predictors. The tiny amount of variability in TEMP relative to its mean is behind its collinearity with the intercept.

      proc means data=remiss;
        run;

Variable	Mean	Std Dev	Minimum	Maximum
remiss	0.3333333	0.4803845	0	1.0000000
cell	0.8814815	0.1866445	0.2000000	1.0000000
smear	0.6351852	0.2140519	0.3200000	0.9700000
infil	0.5707407	0.2375666	0.0800000	0.9200000
li	1.0037037	0.4677947	0.4000000	1.9000000
blast	0.6888519	0.5358045	0	2.0640000
temp	0.9970000	0.0148609	0.9800000	1.0380000

By rescaling the predictors, the collinearity with the intercept can be removed. In these statements, PROC STANDARD creates the data set STD containing the rescaled predictors. The S=1 option scales the predictors to have standard deviations of 1.

      proc standard data=remiss s=1 out=std; 
        var li temp cell; 
        run;

When the logistic model is refit in GENMOD using the rescaled predictors in the STD data set, the parameter estimates and standard errors no longer exhibit any inflation. The correlation between the intercept and CELL parameter estimates is still noticeably large (-0.91), but the results below indicate no problem with ill-conditioning.

Estimated Correlation Matrix
	Prm1	Prm2	Prm3	Prm4
Prm1	1.0000	-0.6924	0.4089	-0.9070
Prm2	-0.6924	1.0000	-0.6866	0.5039
Prm3	0.4089	-0.6866	1.0000	-0.4676
Prm4	-0.9070	0.5039	-0.4676	1.0000

Analysis Of Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Chi-Square	Pr > ChiSq
Intercept	1	3.9917	2.2477	-0.4136	8.3971	3.15	0.0757
li	1	-1.8090	0.8319	-3.4394	-0.1786	4.73	0.0297
temp	1	1.2197	0.9171	-0.5778	3.0172	1.77	0.1835
cell	1	-1.8015	1.4467	-4.6370	1.0340	1.55	0.2130
Scale	0	1.0000	0.0000	1.0000	1.0000

When the assessment of X'WX is repeated using PROC REG on the rescaled data, the largest condition index is now small (10.7), indicating no collinearity among the weighted predictors.
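The statements for the refit and for this reassessment are not shown in this note. A sketch that mirrors the earlier steps might look like the following, where OUTSTD is a hypothetical name for the output data set that holds the new HESSWGT= values.

      /* Refit on the rescaled predictors and save the new weights */
      ods select CorrB ParameterEstimates;
      proc genmod data=std;
        model remiss = li temp cell / dist=binomial scoring=50 corrb;
        output out=outstd hesswgt=w;
        run;

      /* Weighted regression to assess the condition of X'WX */
      ods select CollinDiag CollinDiagNoInt;
      proc reg data=outstd;
        weight w;
        model remiss = li temp cell / collin collinoint;
        run; quit;

The weights must come from the refit rather than from the original fit, because W depends on the fitted means of the model being assessed.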

Collinearity Diagnostics
			Proportion of Variation
Number	Eigenvalue	Condition Index	Intercept	li	temp	cell
1	3.35716	1.00000	0.00434	0.01124	0.01787	0.00577
2	0.41871	2.83158	0.02451	0.03529	0.29398	0.04597
3	0.19486	4.15072	0.00631	0.41454	0.39181	0.05887
4	0.02927	10.71054	0.96485	0.53893	0.29634	0.88939

Collinearity Diagnostics (intercept adjusted)
			Proportion of Variation
Number	Eigenvalue	Condition Index	li	temp	cell
1	1.59890	1.00000	0.16031	0.14518	0.00720
2	1.14592	1.18123	0.01703	0.06562	0.50885
3	0.25518	2.50313	0.82266	0.78919	0.48394

The largest condition index of X'X is also small (3.8), indicating no collinearity among the rescaled raw predictors.
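A sketch of the corresponding unweighted step, assuming the STD data set created by PROC STANDARD above, follows. As before, the response variable plays no role in the collinearity diagnostics.

      /* Unweighted regression on the rescaled predictors to assess X'X */
      ods select CollinDiag CollinDiagNoInt;
      proc reg data=std;
        model remiss = li temp cell / collin collinoint;
        run; quit;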

Collinearity Diagnostics
			Proportion of Variation
Number	Eigenvalue	Condition Index	Intercept	li	temp	cell
1	2.85226	1.00000	0.02955	0.03891	0.03858	0.04354
2	0.52483	2.33124	0.00200	0.30832	0.54770	0.05615
3	0.42468	2.59158	0.01989	0.29164	0.01730	0.85436
4	0.19823	3.79324	0.94856	0.36113	0.39643	0.04595

Collinearity Diagnostics (intercept adjusted)
			Proportion of Variation
Number	Eigenvalue	Condition Index	li	temp	cell
1	1.19888	1.00000	0.32845	0.04069	0.42725
2	1.04623	1.07047	0.19685	0.71894	0.01760
3	0.75488	1.26023	0.47470	0.24038	0.55515

References:

Belsley, D.A., Kuh, E., and Welsch, R.E. (1980), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, New York: John Wiley & Sons.

Lesaffre, E. and Marx, B.D. (1993), "Collinearity in Generalized Linear Regression," Comm. Stat. Theory and Meth., 22(7), 1933-1952.

Muller, K.E. and Fetterman, B.A. (2002), Regression and ANOVA: An Integrated Approach Using SAS Software, Cary, NC: SAS Institute Inc.

Segerstedt, B. and Nyquist, H. (1992), "On the conditioning problem in generalized linear models," Journal of Applied Statistics, 19(4), 513-526.

Type:	Usage Note
Topic:	Analytics ==> Regression; SAS Reference ==> Procedures ==> PROBIT; SAS Reference ==> Procedures ==> REG; SAS Reference ==> Procedures ==> GENMOD; SAS Reference ==> Procedures ==> LOGISTIC

Date Modified:	2023-03-02 14:45:33
Date Created:	2008-06-17 16:05:23
