The HPSEVERITY Procedure

Statistics of Fit

PROC HPSEVERITY computes and reports various statistics of fit to indicate how well the estimated model fits the data. The statistics belong to two categories: likelihood-based statistics and EDF-based statistics. Neg2LogLike, AIC, AICC, and BIC are likelihood-based statistics, and KS, AD, and CvM are EDF-based statistics.

In the distributed mode of execution, in which data are distributed across the grid nodes, the EDF estimates are computed by using the local data. The EDF-based statistics are computed by using these local EDF estimates. The reported value of each EDF-based statistic is an average of the values of the statistic that are computed by all the grid nodes where the data reside. Also, for large data sets, in both single-machine and distributed modes of execution, the EDF estimates are computed by using a fraction of the input data that is governed by either the INITSAMPLE option or the default sample size. Because of this nature of computing the EDF estimates, the EDF-based statistics of fit are an approximation of the values that would have been computed if the entire input data set were used for computing the EDF estimates. So the values that are reported for EDF-based statistics should be used only for comparing different models. The reported values should not be interpreted as true estimates of the corresponding statistics.

The likelihood-based statistics are reported for the entire input data in both single-machine and distributed modes of execution.

The following subsections provide definitions of each category of statistics.

Likelihood-Based Statistics of Fit

Let $y_ i, i = 1, \dotsc , N$, denote the response variable values. Let L be the likelihood as defined in the section Likelihood Function. Let p denote the number of model parameters that are estimated. Note that $p = p_ d + (k-k_ r)$, where $p_ d$ is the number of distribution parameters, k is the number of all regression parameters, and $k_ r$ is the number of regression parameters that are found to be linearly dependent (redundant) on other regression parameters. Given this notation, the likelihood-based statistics are defined as follows:

Neg2LogLike

The log likelihood is reported as

\[  \text {Neg2LogLike} = -2 \log (L)  \]

The multiplying factor $-2$ makes it easy to compare it to the other likelihood-based statistics. A model that has a smaller value of Neg2LogLike is deemed better.

AIC

Akaike’s information criterion (AIC) is defined as

\[  \text {AIC} = -2 \log (L) + 2 p  \]

A model that has a smaller AIC value is deemed better.

AICC

The corrected Akaike’s information criterion (AICC) is defined as

\[  \text {AICC} = -2 \log (L) + \frac{2 N p}{N - p - 1}  \]

A model that has a smaller AICC value is deemed better. It corrects the finite-sample bias that AIC has when N is small compared to p. AICC is related to AIC as

\[  \text {AICC} = \text {AIC} + \frac{2p(p+1)}{N-p-1}  \]

As N becomes large compared to p, AICC converges to AIC. AICC is usually recommended over AIC as a model selection criterion.

BIC

The Schwarz Bayesian information criterion (BIC) is defined as

\[  \text {BIC} = -2 \log (L) + p \log (N)  \]

A model that has a smaller BIC value is deemed better.

EDF-Based Statistics

This class of statistics is based on the difference between the estimate of the cumulative distribution function (CDF) and the estimate of the empirical distribution function (EDF). A model that has a smaller value of the chosen EDF-based statistic is deemed better.

Let $y_ i, i = 1, \dotsc , N$ denote the sample of N values of the response variable. Let $r_ i = \sum _{j=1}^{N} I[y_ j \leq y_ i]$ denote the number of observations with a value less than or equal to $y_ i$, where I is an indicator function. Let $F_ n(y_ i)$ denote the EDF estimate that is computed by using the method that you specify in the EMPIRICALCDF= option. Let $Z_ i = \hat{F}(y_ i)$ denote the estimate of the CDF. Let $F_ n(Z_ i)$ denote the EDF estimate of $Z_ i$ values that are computed using the same method that is used to compute the EDF of $y_ i$ values. Using the probability integral transformation, if $F(y)$ is the true distribution of the random variable Y, then the random variable $Z = F(y)$ is uniformly distributed between 0 and 1 (D’Agostino and Stephens, 1986, Ch. 4). Thus, comparing $F_ n(y_ i)$ with $\hat{F}(y_ i)$ is equivalent to comparing $F_ n(Z_ i)$ with $\hat{F}(Z_ i) = Z_ i$ (uniform distribution).

Note the following two points regarding which CDF estimates are used for computing the test statistics:

  • If you specify regression effects, then the CDF estimates $Z_ i$ that are used for computing the EDF test statistics are from a mixture distribution. See the section CDF and PDF Estimates with Regression Effects for more information.

  • If the EDF estimates are conditional because of the truncation information, then each unconditional estimate $Z_ i$ is converted to a conditional estimate using the method described in the section Truncation and Conditional CDF Estimates.

In the following, it is assumed that $Z_ i$ denotes an appropriate estimate of the CDF if you specify any truncation or regression effects. Given this, the EDF-based statistics of fit are defined as follows:

KS

The Kolmogorov-Smirnov (KS) statistic computes the largest vertical distance between the CDF and the EDF. It is formally defined as follows:

\[  \text {KS} = \sup _ y | F_ n(y) - F(y) |  \]

If the STANDARD method is used to compute the EDF, then the following formula is used:

\begin{align*}  D^+ & = \text {max}_ i (\frac{r_ i}{N} - Z_ i) \\ D^- & = \text {max}_ i (Z_ i - \frac{r_{i-1}}{N}) \\ \text {KS} & = \sqrt {N} \;  \text {max}(D^+, D^-) + \frac{0.19}{\sqrt {N}} \end{align*}

Note that $r_0$ is assumed to be 0.

If the method used to compute the EDF is any method other than the STANDARD method, then the following formula is used:

\begin{align*}  D^+ & = \text {max}_ i (F_ n(Z_ i) - Z_ i), \text { if } F_ n(Z_ i) \geq Z_ i \\ D^- & = \text {max}_ i (Z_ i - F_ n(Z_ i)), \text { if } F_ n(Z_ i) < Z_ i \\ \text {KS} & = \sqrt {N} \;  \text {max}(D^+, D^-) + \frac{0.19}{\sqrt {N}} \end{align*}
AD

The Anderson-Darling (AD) statistic is a quadratic EDF statistic that is proportional to the expected value of the weighted squared difference between the EDF and CDF. It is formally defined as follows:

\[  \text {AD} = N \int _{-\infty }^{\infty } \frac{(F_ n(y) - F(y))^2}{F(y) (1-F(y))} dF(y)  \]

If the STANDARD method is used to compute the EDF, then the following formula is used:

\[  \text {AD} = -N - \frac{1}{N} \sum _{i=1}^{N} \left[ (2 r_ i - 1) \log (Z_ i) + (2 N + 1 - 2 r_ i) \log (1-Z_ i) \right]  \]

If the method used to compute the EDF is any method other than the STANDARD method, then the statistic can be computed by using the following two pieces of information:

  • If the EDF estimates are computed using the KAPLANMEIER or MODIFIEDKM methods, then EDF is a step function such that the estimate $F_ n(z)$ is a constant equal to $F_ n(Z_{i-1})$ in interval $[Z_{i-1}, Z_ i]$. If the EDF estimates are computed using the TURNBULL method, then there are two types of intervals: one in which the EDF curve is constant and the other in which the EDF curve is theoretically undefined. For computational purposes, it is assumed that the EDF curve is linear for the latter type of the interval. For each method, the EDF estimate $F_ n(y)$ at y can be written as

    \[  F_ n(z) = F_ n(Z_{i-1}) + S_ i (z - Z_{i-1}), \text { for } z \in [Z_{i-1},Z_ i]  \]

    where $S_ i$ is the slope of the line defined as

    \[  S_ i = \frac{F_ n(Z_{i})-F_ n(Z_{i-1})}{Z_{i}-Z_{i-1}}  \]

    For the KAPLANMEIER or MODIFIEDKM method, $S_ i=0$ in each interval.

  • Using the probability integral transform $z=F(y)$, the formula simplifies to

    \[  \text {AD} = N \int _{-\infty }^{\infty } \frac{(F_ n(z) - z)^2}{z (1-z)} dz  \]

The computation formula can then be derived from the following approximation:

\begin{align*}  \text {AD} & = N \sum _{i=1}^{K+1} \int _{Z_{i-1}}^{Z_ i} \frac{(F_ n(z) - z)^2}{z (1-z)} dz \\ & = N \sum _{i=1}^{K+1} \int _{Z_{i-1}}^{Z_ i} \frac{(F_ n(Z_{i-1}) + S_ i (z - Z_{i-1}) - z)^2}{z (1-z)} dz \\ & = N \sum _{i=1}^{K+1} \int _{Z_{i-1}}^{Z_ i} \frac{(P_ i - Q_ i z)^2}{z (1-z)} dz \end{align*}

where $P_ i = F_ n(Z_{i-1}) - S_ i Z_{i-1}$, $Q_ i = 1-S_ i$, and K is the number of points at which the EDF estimate are computed. For the TURNBULL method, $K=2k$ for some k.

Assuming $Z_0 = 0$, $Z_{K+1} = 1$, $F_ n(0) = 0$, and $F_ n(Z_ K) = 1$ yields the following computation formula:

\begin{equation*} \begin{split}  \text {AD} = &  - N (Z_1 + \log (1-Z_1) + \log (Z_ K) + (1-Z_ K)) \\ &  + N \sum _{i=2}^{K} \left[ P_ i^2 A_ i - (Q_ i-P_ i)^2 B_ i - Q_ i^2 C_ i \right] \end{split}\end{equation*}

where $A_ i = \log (Z_ i) - \log (Z_{i-1})$, $B_ i = \log (1-Z_ i) - \log (1-Z_{i-1})$, and $C_ i=Z_ i-Z_{i-1}$.

If EDF estimates are computed using the KAPLANMEIER or MODIFIEDKM method, then $P_ i = F_ n(Z_{i-1})$ and $Q_ i=1$, which simplifies the formula as

\begin{equation*} \begin{split}  \text {AD} = &  - N (1 + \log (1-Z_1) + \log (Z_ K)) \\ &  + N \sum _{i=2}^{K} \left[ F_ n(Z_{i-1})^2 A_ i - (1-F_ n(Z_{i-1}))^2 B_ i \right] \end{split}\end{equation*}
CvM

The Cramér-von Mises (CvM) statistic is a quadratic EDF statistic that is proportional to the expected value of the squared difference between the EDF and CDF. It is formally defined as follows:

\[  \text {CvM} = N \int _{-\infty }^{\infty } (F_ n(y) - F(y))^2 dF(y)  \]

If the STANDARD method is used to compute the EDF, then the following formula is used:

\[  \text {CvM} = \frac{1}{12 N} + \sum _{i=1}^{N} \left( Z_ i - \frac{(2 r_ i - 1)}{2N}\right)^2  \]

If the method used to compute the EDF is any method other than the STANDARD method, then the statistic can be computed by using the following two pieces of information:

  • As described previously for the AD statistic, the EDF estimates are assumed to be piecewise linear such that the estimate $F_ n(y)$ at y is

    \[  F_ n(z) = F_ n(Z_{i-1}) + S_ i (z - Z_{i-1}), \text { for } z \in [Z_{i-1},Z_ i]  \]

    where $S_ i$ is the slope of the line defined as

    \[  S_ i = \frac{F_ n(Z_{i})-F_ n(Z_{i-1})}{Z_{i}-Z_{i-1}}  \]

    For the KAPLANMEIER or MODIFIEDKM method, $S_ i=0$ in each interval.

  • Using the probability integral transform $z=F(y)$, the formula simplifies to:

    \[  \text {CvM} = N \int _{-\infty }^{\infty } (F_ n(z) - z)^2 dz  \]

The computation formula can then be derived from the following approximation:

\begin{align*}  \text {CvM} & = N \sum _{i=1}^{K+1} \int _{Z_{i-1}}^{Z_ i} (F_ n(z) - z)^2 dz \\ & = N \sum _{i=1}^{K+1} \int _{Z_{i-1}}^{Z_ i} (F_ n(Z_{i-1}) + S_ i (z - Z_{i-1}) - z)^2 dz \\ & = N \sum _{i=1}^{K+1} \int _{Z_{i-1}}^{Z_ i} (P_ i - Q_ i z)^2 dz \end{align*}

where $P_ i = F_ n(Z_{i-1}) - S_ i Z_{i-1}$, $Q_ i = 1-S_ i$, and K is the number of points at which the EDF estimate are computed. For the TURNBULL method, $K=2k$ for some k.

Assuming $Z_0 = 0$, $Z_{K+1} = 1$, and $F_ n(0) = 0$ yields the following computation formula:

\[  \text {CvM} = N \frac{Z_1^3}{3} + N \sum _{i=2}^{K+1} \left[ P_ i^2 A_ i - P_ i Q_ i B_ i - \frac{Q_ i^2}{3} C_ i \right]  \]

where $A_ i = Z_ i - Z_{i-1}$, $B_ i = Z_ i^2 - Z_{i-1}^2$, and $C_ i=Z_ i^3-Z_{i-1}^3$.

If EDF estimates are computed using the KAPLANMEIER or MODIFIEDKM method, then $P_ i = F_ n(Z_{i-1})$ and $Q_ i=1$, which simplifies the formula as

\[  \text {CvM} = \frac{N}{3} + N \sum _{i=2}^{K+1} \left[ F_ n(Z_{i-1})^2 (Z_ i-Z_{i-1}) - F_ n(Z_{i-1}) (Z_ i^2-Z_{i-1}^2) \right]  \]

which is similar to the formula proposed by Koziol and Green (1976).