The HPLOGISTIC Procedure

Model Fit and Assessment Statistics

Information Criteria

The calculation of the information criteria uses the following formulas, where p denotes the number of effective parameters in the candidate model, F denotes the sum of frequencies used, and l is the log likelihood evaluated at the converged estimates:

\begin{align*}
\mr{AIC}  & = -2 l + 2p \\
\mr{AICC} & = \left\{ \begin{array}{ll} -2 l + 2 p F/(F-p-1) &  \text {when } F > p+2 \\ -2 l + 2 p (p+2) &  \text {otherwise} \end{array}\right. \\
\mr{BIC}  & = -2 l + p \log (F)
\end{align*}

If you do not specify a FREQ statement, F equals n, the number of observations used.
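For illustration, the following DATA step is a minimal sketch that evaluates these formulas; the values assigned to l, p, and F are placeholders rather than output from an actual fit:

data ic;
   l = -245.3;   /* log likelihood at the converged estimates */
   p = 4;        /* number of effective parameters            */
   F = 200;      /* sum of frequencies used                   */

   AIC = -2*l + 2*p;
   if F > p + 2 then AICC = -2*l + 2*p*F/(F - p - 1);
   else              AICC = -2*l + 2*p*(p + 2);
   BIC = -2*l + p*log(F);   /* LOG is the natural logarithm in SAS */
run;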

Generalized Coefficient of Determination

The goal of a coefficient of determination, also known as an R-square measure, is to express the agreement between a stipulated model and the data in terms of variation in the data that is explained by the model. In linear models, the R-square measure is based on residual sums of squares; because these are additive, a measure bounded between 0 and 1 is easily derived.

In more general models where parameters are estimated by the maximum likelihood principle, Cox and Snell (1989, pp. 208–209) and Magee (1990) proposed the following generalization of the coefficient of determination:

\[  R^2 = 1 - \biggl \{ \frac{L(\bm {0})}{L({\widehat{\bbeta }})}\biggr \} ^{\frac{2}{n}}  \]

Here, $L(\bm {0})$ is the likelihood of the intercept-only model, $L({\widehat{\bbeta }})$ is the likelihood of the specified model, and n denotes the number of observations used in the analysis. This number is adjusted for frequencies if a FREQ statement is present and is based on the trials variable for binomial models.

As discussed in Nagelkerke (1991), this generalized R-square measure has properties similar to the coefficient of determination in linear models. If the model effects do not contribute to the analysis, $L({\widehat{\bbeta }})$ approaches $L(\bm {0})$ and $R^2$ approaches zero.

However, $R^2$ does not have an upper limit of 1. Nagelkerke suggested a rescaled generalized coefficient of determination, $\tilde{R}^2$, which achieves an upper limit of 1 by dividing $R^2$ by its maximum value:

\begin{align*}  R_{\max }^2 & = 1 - \{ L(\bm {0})\} ^{\frac{2}{n}} \\ \tilde{R}^2 & = \frac{R^2}{R_{\max }^2} \\ \end{align*}

Another measure from McFadden (1974) is also bounded by 0 and 1:

\[  R_ M^2 = 1-\biggl (\frac{\log L({\widehat{\bbeta }})}{\log L(\bm {0})}\biggr )  \]

If you specify the RSQUARE option in the MODEL statement, the HPLOGISTIC procedure computes $R^2$ and $\tilde{R}^2$. All three measures are computed for each data role when you specify a PARTITION statement.
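As a sketch, the following statements request the measures and then reproduce them by hand from the intercept-only and full-model log likelihoods; the data set Work.mydata, the variables y, x1, and x2, and all numeric values are assumptions for illustration:

proc hplogistic data=mydata;
   model y(event='1') = x1 x2 / rsquare;
run;

data rsq;
   l0 = -350.3;   /* log L(0): intercept-only log likelihood       */
   l1 = -290.7;   /* log L(beta-hat): log likelihood of the model  */
   n  = 500;      /* number of observations used                   */

   R2      = 1 - exp((2/n)*(l0 - l1));   /* generalized R-square   */
   R2max   = 1 - exp((2/n)*l0);          /* maximum value of R2    */
   R2tilde = R2/R2max;                   /* Nagelkerke rescaling   */
   R2McF   = 1 - l1/l0;                  /* McFadden measure       */
run;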

These measures are most useful for comparing competing models that are not necessarily nested—that is, models that cannot be reduced to one another by simple constraints on the parameter space. Larger values of the measures indicate better models.

Classification Table and ROC Curves

For binary response data, the response Y is either an event or a nonevent; let Y take the value 1 for an event and 2 for a nonevent. From the fitted model, a predicted event probability $\hat{\pi }_ i$ can be computed for each observation i. If the predicted event probability equals or exceeds some cutpoint value $z \in [0,1]$, the observation is classified as an event; otherwise, it is classified as a nonevent. Suppose $n_1$ of n individuals experience an event, such as a disease, and the remaining $n_2=n-n_1$ individuals are nonevents. Let $D\! =\! 1$ indicate that an observation is classified as an event and $D\! =\! 2$ indicate that it is classified as a nonevent. The $2\times 2$ decision matrix in Table 9.6 is obtained by cross-classifying the observed and predicted responses, where $n_{ij}$ is the number of observations that are observed to have $Y\! =\! i$ and are classified as $D\! =\! j$.

Table 9.6: Decision Matrix

\[  \begin{array}{lccc}  & D\! =\! 1 \  (\hat{\pi }\ge z) & D\! =\! 2 \  (\hat{\pi }<z) & \text {Total} \\ Y\! =\! 1 \text { (event)} & n_{11} & n_{12} & n_1 \\ Y\! =\! 2 \text { (nonevent)} & n_{21} & n_{22} & n_2 \end{array}  \]


In the decision matrix, the number of true positives, $n_{11}$, is the number of event observations that are correctly classified as events; the number of false positives, $n_{21}$, is the number of nonevent observations that are incorrectly classified as events; the number of false negatives, $n_{12}$, is the number of event observations that are incorrectly classified as nonevents; and the number of true negatives, $n_{22}$, is the number of nonevent observations that are correctly classified as nonevents. The following statistics are computed from the preceding decision matrix:

Table 9.7: Statistics from the Decision Matrix with Cutpoint z

\[  \begin{array}{lll}
\text {Statistic} & \text {Equation} & \text {OUTROC Column} \\
\text {Cutpoint} & z & \text {ProbLevel} \\
\text {Number of true positives} & n_{11} & \text {TruePos} \\
\text {Number of true negatives} & n_{22} & \text {TrueNeg} \\
\text {Number of false positives} & n_{21} & \text {FalsePos} \\
\text {Number of false negatives} & n_{12} & \text {FalseNeg} \\
\text {Sensitivity} & n_{11}/n_1 & \text {TPF (true positive fraction)} \\
1-\text {specificity} & n_{21}/n_2 & \text {FPF (false positive fraction)} \\
\text {Correct classification rate} & (n_{11}+n_{22})/n & \text {PercentCorrect (PC)} \\
\text {Misclassification rate} & 1-\text {PC} & \\
\text {Positive predictive value} & n_{11}/(n_{11}+n_{21}) & \text {PPV} \\
\text {Negative predictive value} & n_{22}/(n_{12}+n_{22}) & \text {NPV} \\
\end{array}  \]


The accuracy of the classification is measured by its ability to predict events and nonevents correctly. Sensitivity (TPF, true positive fraction) is the proportion of event responses that are predicted to be events. Specificity (1–FPF, true negative fraction) is the proportion of nonevent responses that are predicted to be nonevents.

You can also measure accuracy by how well the classification predicts the response. The positive predictive value (PPV, 1–false positive rate) is the proportion of observations classified as events that are observed to be events. The negative predictive value (NPV, 1–false negative rate) is the proportion of observations classified as nonevents that are observed to be nonevents. The correct classification rate (PC) is the proportion of all observations that are correctly classified.
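The following DATA step sketch evaluates the statistics in Table 9.7 for an assumed decision matrix; the four cell counts are placeholders:

data ctable_stats;
   n11 = 80;  n12 = 20;    /* events:    true positives, false negatives */
   n21 = 30;  n22 = 170;   /* nonevents: false positives, true negatives */
   n1 = n11 + n12;  n2 = n21 + n22;  n = n1 + n2;

   TPF  = n11/n1;            /* sensitivity                  */
   FPF  = n21/n2;            /* 1 - specificity              */
   PC   = (n11 + n22)/n;     /* correct classification rate  */
   Misc = 1 - PC;            /* misclassification rate       */
   PPV  = n11/(n11 + n21);   /* positive predictive value    */
   NPV  = n22/(n12 + n22);   /* negative predictive value    */
run;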

If you also specify a PRIOR= option, then PROC HPLOGISTIC uses Bayes’ theorem to modify the PPV, NPV, and PC as follows. Results of the classification are represented by two conditional probabilities: sensitivity, ${\Pr }(D\! =\! 1|Y\! =\! 1)= \frac{n_{11}}{n_1}$, and one minus the specificity, ${\Pr }(D\! =\! 1|Y\! =\! 2)= \frac{n_{21}}{n_2}$.

If the prevalence of the disease in the population, ${\Pr }(Y\! =\! 1)$, is provided by the value of the PRIOR= option, then the PPV, NPV, and PC are given by Fleiss (1981, pp. 4–5) as follows:

\begin{align*}
\textnormal{PPV} & = {\Pr }(Y\! =\! 1|D\! =\! 1) = \frac{{\Pr }(Y\! =\! 1)\, {\Pr }(D\! =\! 1|Y\! =\! 1)}{{\Pr }(D\! =\! 1|Y\! =\! 2) + {\Pr }(Y\! =\! 1)[{\Pr }(D\! =\! 1|Y\! =\! 1) - {\Pr }(D\! =\! 1|Y\! =\! 2)]} \\
\textnormal{NPV} & = {\Pr }(Y\! =\! 2|D\! =\! 2) = \frac{[1-{\Pr }(D\! =\! 1|Y\! =\! 2)]\, [1-{\Pr }(Y\! =\! 1)]}{1-{\Pr }(D\! =\! 1|Y\! =\! 2) - {\Pr }(Y\! =\! 1)[{\Pr }(D\! =\! 1|Y\! =\! 1) - {\Pr }(D\! =\! 1|Y\! =\! 2)]} \\
\textnormal{PC} & = {\Pr }(Y\! =\! 1 \cap D\! =\! 1) + {\Pr }(Y\! =\! 2 \cap D\! =\! 2) \\
& = {\Pr }(D\! =\! 1|Y\! =\! 1)\, {\Pr }(Y\! =\! 1)+{\Pr }(D\! =\! 2|Y\! =\! 2)[1-{\Pr }(Y\! =\! 1)]
\end{align*}

If you do not specify the PRIOR= option, then PROC HPLOGISTIC uses the sample proportion of diseased individuals; that is, ${\Pr }(Y\! =\! 1)=n_1/n$. In such a case, the preceding values reduce to those in Table 9.7. Note that for a stratified sampling situation in which $n_1$ and $n_2$ are chosen a priori, $n_1/n$ is not a desirable estimate of ${\Pr }(Y\! =\! 1)$, so you should specify a PRIOR= option.
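The following sketch applies the Fleiss formulas directly; the prior prevalence and the two conditional probabilities are assumed placeholder values:

data bayes_rates;
   prior = 0.02;   /* Pr(Y=1): population prevalence, from PRIOR= */
   sens  = 0.80;   /* Pr(D=1|Y=1) = n11/n1                        */
   fpf   = 0.15;   /* Pr(D=1|Y=2) = n21/n2                        */

   PPV = prior*sens / (fpf + prior*(sens - fpf));
   NPV = (1 - fpf)*(1 - prior) / (1 - fpf - prior*(sens - fpf));
   PC  = sens*prior + (1 - fpf)*(1 - prior);
run;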

PROC HPLOGISTIC constructs the data for a receiver operating characteristic (ROC) curve by initially binning the predicted probabilities as discussed in the section The Hosmer-Lemeshow Goodness-of-Fit Test, moving the cutpoint from 0 to 1 along the bin boundaries (so that the cutpoints correspond to the predicted probabilities), and selecting the cutpoints at which the decision matrix changes. The CTABLE option produces a table that contains these cutpoints and the corresponding statistics in Table 9.7. You can output this table to a SAS data set by specifying the CTABLE= option (see Table 9.7 for the column names), and you can display the ROC curve by using the SGPLOT procedure, as shown in Example 9.2.
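For example, the following sketch assumes a data set Work.mydata with a binary response y and predictors x1 and x2, and it assumes that the CTABLE= option names the output data set as described above; the SGPLOT step then plots the true positive fraction against the false positive fraction:

proc hplogistic data=mydata;
   model y(event='1') = x1 x2 / ctable=rocstats;
run;

proc sgplot data=rocstats;
   series x=FPF y=TPF;                                  /* ROC curve       */
   lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash); /* chance diagonal */
   xaxis label="False positive fraction (FPF)";
   yaxis label="True positive fraction (TPF)";
run;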

The area under the ROC curve (AUC), as determined by the trapezoidal rule, is given by the concordance index $C$, which is described in the section Association Statistics.

For more information about the topics in this section, see Pepe (2003).

The "Partition Fit Statistics" table displays the misclassification rate, true positive fraction, true negative fraction, and AUC according to their roles. If you have a polytomous response, then instead of classifying according to a cutpoint, PROC HPLOGISTIC classifies the observation into the lowest response level (which has the largest predicted probability for that observation) and similarly computes a true response-level fraction.

Association Statistics

If you specify the ASSOCIATION option in the MODEL statement, PROC HPLOGISTIC displays measures of association between predicted probabilities and observed responses for binary or binomial response models. These measures assess the predictive ability of a model.

Consider the n pairs of observations that have different responses. Let $n_ c$ be the number of pairs in which the observation that has the lower ordered response value (the event) also has the higher predicted event probability, let $n_ d$ be the number of pairs in which the observation that has the lower ordered response value has the lower predicted event probability, and let $n_ t=n-n_ c-n_ d$ be the number of remaining (tied) pairs. Let N be the sum of observation frequencies in the data. Then the following statistics are reported:

\[  \begin{array}{lcl} \text {concordance index } C \text { (AUC)} & =& (n_ c+0.5n_ t)/n \\ \text {Somers’ } D \text { (Gini coefficient) } & =& (n_ c-n_ d)/n \\ \text {Goodman-Kruskal gamma } & =& (n_ c-n_ d)/(n_ c+n_ d) \\ \text {Kendall’s tau-}a & =& (n_ c-n_ d)/(0.5N(N-1)) \end{array}  \]

Classification of the pairs is carried out by initially binning the predicted probabilities as discussed in the section The Hosmer-Lemeshow Goodness-of-Fit Test. The concordance index, C, is an estimate of the AUC, which is the area under the receiver operating characteristic (ROC) curve. If there are no ties, then Somers’ D (Gini’s coefficient) = 2C–1.
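The following toy sketch counts the pairs directly (without the binning step) by forming all event-nonevent pairs in PROC SQL; the data set and its values are contrived for illustration:

data toy;
   input y phat;   /* y: 1=event, 2=nonevent; phat: predicted event probability */
   datalines;
1 0.9
1 0.6
1 0.4
2 0.7
2 0.3
2 0.2
;

proc sql noprint;
   select sum(a.phat > b.phat),   /* concordant pairs */
          sum(a.phat < b.phat),   /* discordant pairs */
          count(*)                /* all pairs        */
      into :nc, :nd, :npairs
      from toy as a, toy as b
      where a.y = 1 and b.y = 2;
quit;

data assoc;
   nc = &nc;  nd = &nd;  npairs = &npairs;
   nt = npairs - nc - nd;                 /* tied pairs              */
   N  = 6;                                /* sum of frequencies      */
   C       = (nc + 0.5*nt)/npairs;        /* concordance index (AUC) */
   SomersD = (nc - nd)/npairs;            /* Somers' D (Gini)        */
   Gamma   = (nc - nd)/(nc + nd);         /* Goodman-Kruskal gamma   */
   TauA    = (nc - nd)/(0.5*N*(N - 1));   /* Kendall's tau-a         */
run;

For these toy values, $n_ c=7$, $n_ d=2$, and $n_ t=0$, so $C=7/9\approx 0.78$ and Somers' $D = 5/9 = 2C-1$.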

If you specify a PARTITION statement, then PROC HPLOGISTIC displays the AUC and Somers’ D in the "Association" and "Partition Fit Statistics" tables according to their roles.

Average Squared Error

The average squared error, ASE, is the average of the squared differences between the responses and the predictions. When you have a discrete number of response levels, the ASE is modified as shown in Table 9.8 (Brier, 1950; Murphy, 1973); it is also called the Brier score or Brier reliability:

Table 9.8: Average Square Error Computations

\[  \begin{array}{ll}
\text {Response Type} & \text {ASE (Brier Score)} \\
\text {Polytomous} & \frac{1}{F}\sum _ i f_ i \sum _ j (y_{ij}-\widehat{\pi }_{ij})^2 \\
\text {Binary} & \frac{1}{F}\sum _ i f_ i \bigl ( y_ i (1-\widehat{\pi }_ i)^2 + (1-y_ i)\widehat{\pi }_ i^2 \bigr ) \\
\text {Binomial} & \frac{1}{F}\sum _ i f_ i (r_ i/t_ i-\widehat{\pi }_ i)^2 \\
\end{array}  \]


In Table 9.8, $f_ i$ is the frequency of the ith observation and $F=\sum _ if_ i$; $r_ i$ is the number of events and $t_ i$ is the number of trials in binomial response models; and $y_ i=1$ for events and $y_ i=0$ for nonevents in binary response models. For polytomous response models, $y_{ij}=1$ if the ith observation has response level j (and $y_{ij}=0$ otherwise), and $\widehat{\pi }_{ij}$ is the model-predicted probability of response level j for observation i.
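As a sketch of the binary-response computation, suppose a data set Work.preds contains the response y (1 for an event, 0 for a nonevent), the predicted event probability phat, and the frequency f for each observation; these names are assumptions for illustration:

data sq;
   set preds;
   wsq = f * ( y*(1 - phat)**2 + (1 - y)*phat**2 );   /* f_i times squared error */
run;

proc means data=sq noprint;
   var wsq f;
   output out=asedata sum(wsq f)=sumwsq F;
run;

data asedata;
   set asedata;
   ASE = sumwsq / F;   /* average squared error (Brier score) */
run;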

Mean Difference

For a binary response model, write the mean of the model-predicted probabilities of the event (Y=1) observations as $\overline{X}_1 = \frac{\sum _{i=1}^ n(\widehat{\pi }_ i | y_ i=1)}{n_1}$ and of the nonevent (Y=2) observations as $\overline{X}_2 = \frac{\sum _{i=1}^ n(\widehat{\pi }_ i | y_ i=2)}{n_2}$. The mean difference is $\overline{X}_1-\overline{X}_2$, which Tjur (2009) relates to other R-square measures and calls the coefficient of discrimination because it measures the model’s ability to distinguish between the event and nonevent distributions. This mean difference is also the $d’$ or $\Delta m$ statistic (with unit standard error) that is discussed in the signal detection literature (McNicol, 2005).
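A minimal sketch of this computation follows, assuming a data set Work.preds that contains the observed response y (1 for an event, 2 for a nonevent) and the predicted event probability phat:

proc means data=preds noprint nway;
   class y;
   var phat;
   output out=classmeans mean=xbar;   /* mean predicted probability by class */
run;

proc transpose data=classmeans out=wide prefix=xbar;
   id y;
   var xbar;
run;

data tjur;
   set wide;
   d = xbar1 - xbar2;   /* coefficient of discrimination (mean difference) */
run;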