The LIFEREG Procedure

Probability Plotting

Subsections:

Complete and Right-Censored Data
Arbitrarily Censored Data
Nonparametric Confidence Intervals
Parametric Confidence Intervals

Probability plots are useful tools for the display and analysis of lifetime data. Probability plots use an inverse distribution scale so that a cumulative distribution function (CDF) plots as a straight line. A nonparametric estimate of the CDF of the lifetime data will plot approximately as a straight line, thus providing a visual assessment of goodness of fit.

You can use the PROBPLOT statement in PROC LIFEREG to create probability plots of data that are complete, right censored, interval censored, or a combination of censoring types (arbitrarily censored). A line representing the maximum likelihood fit from the MODEL statement and pointwise parametric confidence bands for the cumulative probabilities are also included in the plot.

A random variable Y belongs to a location-scale family of distributions if its CDF F is of the form

$\Pr \{ Y \leq y\} = F(y) = G\left(\frac{y-\mu }{\sigma }\right)$

where $\mu$ is the location parameter and $\sigma$ is the scale parameter. Here, G is a CDF that cannot depend on any unknown parameters, and G is the CDF of Y if $\mu =0$ and $\sigma =1$ . For example, if Y is a normal random variable with mean $\mu$ and standard deviation $\sigma$ ,

$G(u) = \Phi (u) = \int _{-\infty }^{u}\frac{1}{\sqrt {2\pi }}\exp \left(-\frac{u^{2}}{2}\right) \, du$

and

$F(y) = \Phi \left(\frac{y-\mu }{\sigma }\right)$

The normal, extreme-value, and logistic distributions are location-scale models. The three-parameter gamma distribution is a location-scale model if the shape parameter $\delta$ is fixed. If T has a lognormal, Weibull, or log-logistic distribution, then $\log (T)$ has a distribution that is a location-scale model. These distributions are said to be of type log-location-scale. Probability plots are constructed for lognormal, Weibull, and log-logistic distributions by using $\log (T)$ instead of T in the plots.

Let $y_{(1)} \le y_{(2)} \le \ldots \le y_{(n)}$ be ordered observations of a random sample with distribution function $F(y)$ . A probability plot is a plot of the points $y_{(i)}$ against $m_{i}=G^{-1}(a_{i})$ , where $a_{i}=\hat{F}(y_{i})$ is an estimate of the CDF $F(y_{(i)})=G\left(\frac{y_{(i)}-\mu }{\sigma }\right)$ . The nonparametric CDF estimates $a_{i}$ are sometimes called plotting positions. The axis on which the points $m_{i}$ are plotted is usually labeled with a probability scale (the scale of $a_{i}$ ).

If F is one of the location-scale distributions, then y is the lifetime; otherwise, the log of the lifetime is used to transform the distribution to a location-scale model.

If the data actually have the stated distribution, then $\hat{F} \approx F$ ,

$m_{i}=G^{-1}(\hat{F}(y_{i}))\approx G^{-1}\left(G\left(\frac{y_{(i)}-\mu }{\sigma }\right)\right)= \frac{y_{(i)}-\mu }{\sigma }$

and points $(y_{(i)}, m_{i})$ should fall approximately in a straight line.

There are several ways to compute the nonparametric CDF estimates used in probability plots from lifetime data. These are discussed in the next two sections.

Complete and Right-Censored Data

The censoring times must be taken into account when you compute plotting positions for right-censored data. The modified Kaplan-Meier method described in the following section is the default method for computing nonparametric CDF estimates for display on probability plots. See Abernethy (1996), Meeker and Escobar (1998), and Nelson (1982) for discussions of the methods described in the following sections.

Expected Ranks, Kaplan-Meier, and Modified Kaplan-Meier Methods

Let $y_{(1)} \le y_{(2)} \le \ldots \le y_{(n)}$ be ordered observations of a random sample including failure times and censor times. Order the data in increasing order. Label all the data with reverse ranks $r_{i}$ , with $r_{1}=n, \ldots , r_{n}=1$ . For the lifetime (not censoring time) corresponding to reverse rank $r_{i}$ , compute the survival function estimate

$S_{i} = \left[\frac{r_{i}}{r_{i}+1}\right]S_{i-1}$

with $S_{0}=1$ . The expected rank plotting position is computed as $a_{i}=1-S_{i}$ . The option PPOS=EXPRANK specifies the expected rank plotting position.

For the Kaplan-Meier method,

$S_{i} = \left[\frac{r_{i}-1}{r_{i}}\right]S_{i-1}$

The Kaplan-Meier plotting position is then computed as $a^\prime _{i}=1-S_{i}$ . The option PPOS=KM specifies the Kaplan-Meier plotting position.

For the modified Kaplan-Meier method, use

$S^\prime _{i} = \frac{S_{i} + S_{i-1}}{2}$

where $S_{i}$ is computed from the Kaplan-Meier formula with $S_{0}=1$ . The plotting position is then computed as $a^{\prime \prime }_{i}=1-S^\prime _{i}$ . The option PPOS=MKM specifies the modified Kaplan-Meier plotting position. If the PPOS option is not specified, the modified Kaplan-Meier plotting position is used as the default method.

For complete samples, $a_{i}=i/(n+1)$ for the expected rank method, $a^\prime _{i}=i/n$ for the Kaplan-Meier method, and $a^{\prime \prime }_{i}=(i-0.5)/n$ for the modified Kaplan-Meier method. If the largest observation is a failure for the Kaplan-Meier estimator, then $F_{n}=1$ and the point is not plotted.

Median Ranks

Let $y_{(1)} \le y_{(2)} \le \ldots \le y_{(n)}$ be ordered observations of a random sample including failure times and censor times. A failure order number $j_{i}$ is assigned to the ith failure: $j_{i}=j_{i-1}+\Delta$ , where $j_{0}=0$ . The increment $\Delta$ is initially 1 and is modified when a censoring time is encountered in the ordered sample. The new increment is computed as

$\Delta = \frac{(n+1) - \mbox{ previous failure order number }}{1 + \mbox{ number of items beyond previous censored item }}$

The plotting position is computed for the ith failure time as

$a_{i} = \frac{j_{i} - 0.3}{n + 0.4}$

For complete samples, the failure order number $j_{i}$ is equal to i, the order of the failure in the sample. In this case, the preceding equation for $a_{i}$ is an approximation of the median plotting position computed as the median of the ith-order statistic from the uniform distribution on (0, 1). In the censored case, $j_{i}$ is not necessarily an integer, but the preceding equation still provides an approximation to the median plotting position. The PPOS=MEDRANK option specifies the median rank plotting position.

Arbitrarily Censored Data

The LIFEREG procedure can create probability plots for data that consist of combinations of exact, left-censored, right-censored, and interval-censored lifetimes—that is, arbitrarily censored data. The LIFEREG procedure uses an iterative algorithm developed by Turnbull (1976) to compute a nonparametric maximum likelihood estimate of the cumulative distribution function for the data. Since the technique is maximum likelihood, standard errors of the cumulative probability estimates are computed from the inverse of the associated Fisher information matrix. This algorithm is an example of the expectation-maximization (EM) algorithm. The default initial estimate assigns equal probabilities to each interval. You can specify different initial values with the PROBLIST= option. Convergence is determined if the change in the log likelihood between two successive iterations is less than delta, where the default value of delta is $10^{-8}$ . You can specify a different value for delta with the TOLLIKE= option. Iterations will be terminated if the algorithm does not converge after a fixed number of iterations. The default maximum number of iterations is 1000. Some data might require more iterations for convergence. You can specify the maximum allowed number of iterations with the MAXITEM= option in the PROBPLOT statement. The iteration history of the log likelihood is displayed if you specify the ITPRINTEM option. The iteration history of the estimated interval probabilities are also displayed if you specify both options ITPRINTEM and PRINTPROBS.

If an interval probability is smaller than a tolerance ( $10^{-6}$ by default) after convergence, the probability is set to zero, the interval probabilities are renormalized so that they add to one, and iterations are restarted. Usually the algorithm converges in just a few more iterations. You can change the default value of the tolerance with the TOLPROB= option. You can specify the NOPOLISH option to avoid setting small probabilities to zero and restarting the algorithm.

If you specify the ITPRINTEM option, a table summarizing the Turnbull estimate of the interval probabilities is displayed. The columns labeled "Reduced Gradient" and "Lagrange Multiplier" are used in checking final convergence of the maximum likelihood estimate. The Lagrange multipliers must all be greater than or equal to zero, or the solution is not maximum likelihood. See Gentleman and Geyer (1994) for more details of the convergence checking. Also see Meeker and Escobar (1998, Chapter 3) for more information.

See Example 69.6 for an illustration.

Nonparametric Confidence Intervals

You can use the PPOUT option in the PROBPLOT statement to create a table containing the nonparametric CDF estimates computed by the selected method, Kaplan-Meier CDF estimates, standard errors of the Kaplan-Meier estimator, and nonparametric confidence limits for the CDF. The confidence limits are either pointwise or simultaneous, depending on the value of the NPINTERVALS= option in the PROBPLOT statement. The method used in the LIFEREG procedure for computation of approximate pointwise and simultaneous confidence intervals for cumulative failure probabilities relies on the Kaplan-Meier estimator of the cumulative distribution function of failure time and approximate standard deviation of the Kaplan-Meier estimator. For the case of arbitrarily censored data, the Turnbull algorithm, discussed previously, provides an extension of the Kaplan-Meier estimator. Both the Kaplan-Meier and the Turnbull estimators provide an estimate of the standard error of the CDF estimator, $\mr{se}_{\hat{F}}$ , that is used in computing confidence intervals.

Pointwise Confidence Intervals

Approximate $(1-\alpha )100\%$ pointwise confidence intervals are computed as in Meeker and Escobar (1998, Section 3.6) as

$[ F_ L,\; \; F_ U ] = \left[ \frac{\hat{F}}{\hat{F} + ( 1-\hat{F})w}, \; \; \frac{\hat{F}}{\hat{F} + ( 1-\hat{F})/w} \right]$

where

$w = \exp \left[\frac{z_{1-\alpha /2}\mr{se}_{\hat{F}}}{(\hat{F}(1-\hat{F}))} \right]$

where $z_ p$ is the pth quantile of the standard normal distribution.

Simultaneous Confidence Intervals

Approximate $(1-\alpha )100\%$ simultaneous confidence bands valid over the lifetime interval $(t_ a, t_ b)$ are computed as the "Equal Precision" case of Nair (1984) and Meeker and Escobar (1998, Section 3.8) as

$[ F_ L,\; \; F_ U ] = \left[ \frac{\hat{F}}{\hat{F} + ( 1-\hat{F})w}, \; \; \frac{\hat{F}}{\hat{F} + ( 1-\hat{F})/w} \right]$

where

$w = \exp \left[\frac{e_{a,b,1-\alpha /2}\mr{se}_{\hat{F}}}{(\hat{F}(1-\hat{F}))}\right]$

where the factor $x = e_{a,b,1-\alpha /2}$ is the solution of

$x\exp (-x^2/2)\log \left[\frac{(1-a)b}{(1-b)a}\right]/\sqrt {8\pi } = \alpha /2$

The time interval $(t_ a, t_ b)$ over which the bands are valid depends in a complicated way on the constants a and b defined in Nair (1984), $0 < a < b < 1$ . The constants a and b are chosen by default so that the confidence bands are valid between the lowest and highest times corresponding to failures in the case of multiply censored data, or to the lowest and highest intervals for which probabilities are computed for arbitrarily censored data. You can optionally specify a and b directly with the NPINTERVALS=SIMULTANEOUS(a, b) option in the PROBPLOT statement.

Parametric Confidence Intervals

Pointwise parametric confidence bands are displayed in a probability plot, unless you specify the NOCONF option in the PROBPLOT statement. Two kinds of confidence intervals are available for display in a probability plot: confidence limits for the estimated cumulative distribution function (CDF) and confidence limits for estimated distribution percentiles.

Confidence Limits for the Estimated CDF

If the distribution is of type log-location-scale, let $y=\log (t)$ where t is the value of time at which the confidence limits are to be computed. If the distribution is of type location-scale, let y be the value at which you want to evaluate confidence limits for the estimated CDF $\hat{F}(y)$ . Let

$\hat{u} = \frac{\mb{y}-\mb{x}^\prime \hat{\bbeta }}{\hat{\sigma }}$

where the column vector $\mb{x}$ of covariate values is determined by the rules summarized in the section XDATA= Data Set. If an offset variable is specified, the mean of the offset variable values is included in $\mb{x}^\prime \bbeta$ .

The CDF estimate is given by

$\hat{F}(y) = G(\hat{u})$

where G is the baseline distribution. The approximate standard error of $\hat{F}(y)$ is computed as in Meeker and Escobar (1998, Section 8.4.3) as

$\mr{SE}_{\hat{F}}=\frac{g(\hat{u})}{\hat{\sigma }} \left[ \mr{Var}(\mb{x}^\prime \hat{\bbeta }) + 2\hat{u}\mr{Cov}(\mb{x}^\prime \hat{\bbeta }, \hat{\sigma }) + \hat{u}^2\mr{Var}(\hat{\sigma })\right]^\frac {1}{2}$

where g is the probability density function corresponding to G. Two-sided $(1-\alpha )\times 100\%$ confidence limits are given by

$[F_ L,\; \; F_ U] = \left[\frac{\hat{F}}{\hat{F}+(1-\hat{F})\times w},\; \; \frac{\hat{F}}{\hat{F}+(1-\hat{F})/ w} \right]$

where

$w = \exp \left[\frac{z_{1-\alpha /2}\mr{SE}_{\hat{F}}}{\hat{F}(1-\hat{F})} \right]$

and $z_ p$ is the $p\times 100$ percentile of the standard normal distribution. The quantities $\mr{Var}(\mb{x}^\prime \hat{\bbeta })$ , $\mr{Cov}(\mb{x}^\prime \hat{\bbeta }, \hat{\sigma })$ , and $\mr{Var}(\hat{\sigma })$ are computed based on the covariance matrix of the estimated parameter vector $(\hat{\bbeta }, \hat{\sigma } )$ .

Confidence Limits for Percentiles

If the HCL option is specified in the PROBPLOT statement, confidence limits based on estimated distribution percentiles instead of the default CDF limits are displayed in the probability plot.

For location-scale distributions, the estimated $p\times 100$ percentile of the distribution F is given by

$y_ p = \mb{x}^\prime \hat{\bbeta } + G^{-1}(p)\hat{\sigma }$

where G is the baseline distribution and the column vector $\mb{x}$ of covariate values is determined by the rules summarized in the section XDATA= Data Set. The standard error of $y_ p$ is estimated by $\mr{SE}_ y = z^\prime \Sigma z$ where $\mb{z}=(\mb{x}^\prime , G^{-1}(p))^\prime$ and $\Sigma$ is the covariance matrix of the parameter estimates $(\hat{\bbeta }^\prime , \hat{\sigma })^\prime$ . Two-sided $(1-\alpha )\times 100\%$ confidence limits for $y_ p$ are given by

$[y_ L,\; \; y_ U] = [ y_ p - z_{1-\alpha /2}\mr{SE}_ y,\; \; y_ p + z_{1-\alpha /2}\mr{SE}_ y]$

For distributions of type log-location-scale, the confidence limits are computed as

$[t_ L=\exp (y_ L),\; \; t_ U=\exp (y_ U)]$

For example, if T has the Weibull distribution, G is the standardized extreme value distribution, $[y_ L,\; \; y_ U]$ are confidence limits for the $p\times 100$ percentile of the extreme value distribution for $\log (T)$ , and $[t_ L=\exp (y_ L),\; \; t_ U=\exp (y_ U)]$ are confidence limits for the $p\times 100$ percentile of the Weibull distribution for T.