The HPSEVERITY Procedure

Empirical Distribution Function Estimation Methods

The empirical distribution function (EDF) is a nonparametric estimate of the cumulative distribution function (CDF) of the distribution. PROC HPSEVERITY computes EDF estimates for two purposes: to send the estimates to a distribution’s PARMINIT subroutine in order to initialize the distribution parameters, and to compute the EDF-based statistics of fit.

To reduce the time that it takes to compute the EDF estimates, you can use the INITSAMPLE option to specify that only a fraction of the input data be used. If you do not specify the INITSAMPLE option and the data set has more than 10,000 valid observations, then a uniform random sample of at most 10,000 observations is used for EDF estimation.

In the distributed mode of execution, in which data are distributed across the grid nodes, the EDF estimates are computed on each node by using the portion of the input data that is located on that node. These local EDF estimates are an approximation of the global EDF estimates, which would have been computed by using the entire input data set. PROC HPSEVERITY does not compute global EDF estimates. Let $X$ denote a quantity that depends on the EDF estimates. $X$ can be either an EDF-based initial value of a distribution parameter or an EDF-based statistic of fit. PROC HPSEVERITY estimates $X$ as follows: First, each grid node $k$ computes an estimate $X_ k$ by using the local EDF estimates that are computed on that node. Then, the estimate $\hat{X}$ of $X$ is computed as an average of all the $X_ k$ values; that is, $\hat{X} = \frac{1}{K} \sum _{k=1}^{K} X_ k$, where $K$ denotes the total number of nodes where the data reside.
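The averaging scheme can be sketched as follows. This is an illustrative Python sketch, not SAS code; the data partition and the estimator function are hypothetical stand-ins for the per-node computation of $X_ k$.

```python
# Sketch: approximate a global EDF-based quantity X by averaging the
# per-node estimates X_k, as described above. The partition and the
# estimator are hypothetical.

def estimate_distributed(partitions, estimator):
    """Average the per-node estimates X_k over the K nodes."""
    estimates = [estimator(part) for part in partitions]
    return sum(estimates) / len(estimates)

# Each "node" holds a slice of the data; here X is a sample mean.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean_of_means = estimate_distributed(partitions, lambda d: sum(d) / len(d))
```

Note that the average of per-node estimates generally differs from the estimate computed on the pooled data (here, the pooled mean is 3.5), which is why the distributed estimates are an approximation.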

This section describes the methods that are used for computing EDF estimates.

Notation

Let there be a set of $N$ observations, each containing a quintuplet of values $(y_ i, t^ l_ i, t^ r_ i, c^ r_ i, c^ l_ i), i=1, \dotsc , N$, where $y_ i$ is the value of the response variable, $t^ l_ i$ is the value of the left-truncation threshold, $t^ r_ i$ is the value of the right-truncation threshold, $c^ r_ i$ is the value of the right-censoring limit, and $c^ l_ i$ is the value of the left-censoring limit.

If an observation is not left-truncated, then $t^ l_ i = \tau ^ l$, where $\tau ^ l$ is the smallest value in the support of the distribution; so $F(t^ l_ i)=0$. If an observation is not right-truncated, then $t^ r_ i = \tau _ h$, where $\tau _ h$ is the largest value in the support of the distribution; so $F(t^ r_ i)=1$. If an observation is not right-censored, then $c^ r_ i = \tau ^ l$; so $F(c^ r_ i)=0$. If an observation is not left-censored, then $c^ l_ i = \tau _ h$; so $F(c^ l_ i)=1$.

Let $w_ i$ denote the weight associated with the $i$th observation. If you specify the WEIGHT statement, then $w_ i$ is the normalized value of the weight variable; otherwise, it is set to 1. The weights are normalized such that they sum up to $N$.

An indicator function $I[e]$ takes a value of 1 or 0 if the expression $e$ is true or false, respectively.

Estimation Methods

If the response variable is subject to both left-censoring and right-censoring effects and if you do not specify the EMPIRICALCDF=NOTURNBULL method, then PROC HPSEVERITY uses Turnbull's method. This section describes methods other than Turnbull's method. For Turnbull's method, see the section Turnbull's EDF Estimation Method.

The method descriptions assume that all observations are either uncensored or right-censored; that is, each observation is of the form $(y_ i, t^ l_ i, t^ r_ i, \tau ^ l, \tau _ h)$ or $(y_ i, t^ l_ i, t^ r_ i, c^ r_ i, \tau _ h)$.

If all observations are either uncensored or left-censored, then each observation is of the form $(y_ i, t^ l_ i, t^ r_ i, \tau ^ l, c^ l_ i)$. It is converted to an observation $(-y_ i, -t^ r_ i, -t^ l_ i, -c^ l_ i, \tau _ h)$; that is, the signs of all the response variable values are reversed, the new left-truncation threshold is equal to the negative of the original right-truncation threshold, the new right-truncation threshold is equal to the negative of the original left-truncation threshold, and the negative of the original left-censoring limit becomes the new right-censoring limit. With this transformation, each observation is either uncensored or right-censored. The methods described for handling uncensored or right-censored data are now applicable. After the EDF estimates are computed, the observations are transformed back to the original form and the EDF estimates are adjusted such that $F_ n(y_ i) = 1-F_ n(-y_ i-)$, where $F_ n(-y_ i-)$ denotes the EDF estimate at a value slightly less than the transformed value $-y_ i$.
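The sign-reversal transformation can be sketched as follows. This is an illustrative Python sketch, not SAS code; TAU_H stands in for the support's upper bound $\tau _ h$.

```python
import math

TAU_H = math.inf  # stands in for tau_h, the upper support bound (assumed)

def reflect(obs):
    """Map (y, t_l, t_r, c_r, c_l) to (-y, -t_r, -t_l, -c_l, tau_h),
    so an uncensored or left-censored observation becomes an
    uncensored or right-censored one."""
    y, t_l, t_r, _c_r, c_l = obs
    return (-y, -t_r, -t_l, -c_l, TAU_H)

# A left-censored observation with limit c_l = 5 and truncation (1, 10]:
reflected = reflect((5.0, 1.0, 10.0, 0.0, 5.0))
```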

Further, a set of uncensored or right-censored observations can be converted to a set of observations of the form $(y_ i, t^ l_ i, t^ r_ i, \delta _ i)$, where $\delta _ i$ is the indicator of right-censoring. $\delta _ i = 0$ indicates a right-censored observation, in which case $y_ i$ is assumed to record the right-censoring limit $c^ r_ i$. $\delta _ i = 1$ indicates an uncensored observation, and $y_ i$ records the exact observed value. In other words, $\delta _ i = I[Y_ i \leq C^ r_ i]$ and $y_ i = \text {min}(Y_ i, C^ r_ i)$, where $Y_ i$ and $C^ r_ i$ denote the underlying response and right-censoring variables, respectively.
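The reduction to the censoring-indicator form can be sketched as follows (illustrative Python, not SAS code):

```python
def apply_right_censoring(y_true, c_r):
    """Record min(Y, C^r) together with delta = I[Y <= C^r]:
    delta = 1 means uncensored, delta = 0 means right-censored."""
    delta = 1 if y_true <= c_r else 0
    return min(y_true, c_r), delta

# The value is observed exactly (3.0 <= 5.0), so delta = 1:
uncensored = apply_right_censoring(3.0, 5.0)
# Only the censoring limit is observed (7.0 > 5.0), so delta = 0:
censored = apply_right_censoring(7.0, 5.0)
```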

Given this notation, the EDF is estimated as

\[  F_ n(y) = \left\{  \begin{array}{ll} 0 &  \text { if } y < y^{(1)} \\ \hat{F}_ n(y^{(k)}) &  \text { if } y^{(k)} \leq y < y^{(k+1)}, k = 1, \dotsc , N-1 \\ \hat{F}_ n(y^{(N)}) &  \text { if } y^{(N)} \leq y \end{array} \right.  \]

where $y^{(k)}$ denotes the $k$th order statistic of the set $\lbrace y_ i \rbrace $ and $\hat{F}_ n(y^{(k)})$ is the estimate computed at that value. The definition of $\hat{F}_ n$ depends on the estimation method. You can specify a particular method or let PROC HPSEVERITY choose an appropriate method by using the EMPIRICALCDF= option in the PROC HPSEVERITY statement. Each method computes $\hat{F}_ n$ as follows:

NOTURNBULL

This is the default method. First, censored observations, if any, are processed as follows:

  • An observation that is left-censored but not right-censored is converted to an uncensored observation $(y_ i^ u, t^ l_ i, t^ r_ i, \tau ^ l, \tau _ h)$, where $y_ i^ u = c^ l_ i/2$.

  • An observation that is both left-censored and right-censored is converted to an uncensored observation $(y_ i^ u, t^ l_ i, t^ r_ i, \tau ^ l, \tau _ h)$, where $y_ i^ u = (c^ r_ i + c^ l_ i)/2$.

  • An observation that is right-censored but not left-censored is left unchanged.

If the processed set of observations contains any truncated or right-censored observations, the KAPLANMEIER method is used. Otherwise, the STANDARD method is used.

The observations are modified only for the purpose of computing the EDF estimates. The original censoring information is used by the parameter estimation process.
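The preprocessing step of the NOTURNBULL method can be sketched as follows. This is an illustrative Python sketch, not SAS code; the encoding of "not censored" via the support bounds follows the Notation section, and the lower support bound is assumed to be 0.

```python
import math

TAU_L, TAU_H = 0.0, math.inf  # assumed support bounds tau^l and tau_h

def noturnbull_preprocess(y, c_r, c_l):
    """Return the response value used for EDF estimation and a flag
    indicating whether the result is treated as uncensored."""
    right_censored = c_r != TAU_L  # c_r = tau^l encodes "not right-censored"
    left_censored = c_l != TAU_H   # c_l = tau_h encodes "not left-censored"
    if left_censored and not right_censored:
        return c_l / 2.0, True            # replaced by c_l / 2
    if left_censored and right_censored:
        return (c_r + c_l) / 2.0, True    # midpoint of (c_r, c_l]
    return y, not right_censored          # unchanged otherwise
```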

STANDARD

If you do not specify any censoring or truncation information, then this method is chosen. It is the standard way of computing the EDF. The EDF estimate at observation $i$ is computed as follows:

\[  \hat{F}_ n(y_ i) = \frac{1}{N} {\displaystyle \sum _{j=1}^{N} I[y_ j \leq y_ i]}  \]
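The STANDARD estimator can be sketched as follows (illustrative Python, not SAS code; unit weights are assumed):

```python
def standard_edf(y_values):
    """F_hat_n(y_i) = (1/N) * #{j : y_j <= y_i}, computed at each
    observation (unit weights assumed)."""
    n = len(y_values)
    return [sum(1 for y_j in y_values if y_j <= y_i) / n
            for y_i in y_values]

estimates = standard_edf([3.0, 1.0, 2.0, 2.0])
```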
KAPLANMEIER

This method is chosen when you specify at least one form of censoring or truncation. The Kaplan-Meier (KM) estimator, also known as the product-limit estimator, was first introduced by Kaplan and Meier (1958) for censored data. Lynden-Bell (1971) derived a similar estimator for left-truncated data. PROC HPSEVERITY uses the definition that combines both censoring and truncation information (Klein and Moeschberger, 1997; Lai and Ying, 1991).

The EDF estimate at observation $i$ is computed as

\[  \hat{F}_ n(y_ i) = 1 - {\displaystyle \prod _{\tau \leq y_ i} \left(1 - \frac{n_\tau }{R_ n(\tau )} \right)}  \]

where $n_\tau $ and $R_ n(\tau )$ are defined as follows:

  • $n_\tau = \sum _{k=1}^{N} w_ k \cdot I[y_ k = \tau \text { and } \tau \leq t^ r_ k \text { and } \delta _ k = 1]$, which is the number of uncensored observations ($\delta _ k = 1$) for which the response variable value is equal to $\tau $ and $\tau $ is observable according to the right-truncation threshold of that observation ($\tau \leq t^ r_ k$).

  • $R_ n(\tau ) = \sum _{k=1}^{N} w_ k \cdot I[y_ k \geq \tau > t^ l_ k]$, which is the size (cardinality) of the risk set at $\tau $. The term risk set has its origins in survival analysis; it contains the events that are at risk of failure at a given time, $\tau $. In other words, it contains the events that have survived up to time $\tau $ and might fail at or after $\tau $. For PROC HPSEVERITY, time is equivalent to the magnitude of the event and failure is equivalent to an uncensored and observable event, where observable means it satisfies the truncation thresholds.
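The product-limit computation with these definitions can be sketched as follows. This is an illustrative Python sketch, not SAS code; observations are in the $(y_ i, t^ l_ i, t^ r_ i, \delta _ i)$ form described earlier, and unit weights are used unless weights are given.

```python
import math

def kaplan_meier_edf(obs, weights=None):
    """obs: tuples (y, t_l, t_r, delta) as defined above.
    Returns {tau: F_hat_n(tau)} at the distinct uncensored values tau."""
    if weights is None:
        weights = [1.0] * len(obs)
    taus = sorted({y for (y, _, _, d) in obs if d == 1})
    survival, edf = 1.0, {}
    for tau in taus:
        # n_tau: weighted count of uncensored, observable events at tau
        n_tau = sum(w for w, (y, _, t_r, d) in zip(weights, obs)
                    if d == 1 and y == tau and tau <= t_r)
        # R_n(tau): weighted size of the risk set at tau
        risk = sum(w for w, (y, t_l, _, _) in zip(weights, obs)
                   if y >= tau > t_l)
        if risk > 0:
            survival *= 1.0 - n_tau / risk
        edf[tau] = 1.0 - survival
    return edf

# Three observations, the last right-censored; no truncation:
example = kaplan_meier_edf([(1.0, 0.0, math.inf, 1),
                            (2.0, 0.0, math.inf, 1),
                            (3.0, 0.0, math.inf, 0)])
```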

MODIFIEDKM

The product-limit estimator that the KAPLANMEIER method uses does not work well if the size of the risk set becomes very small. For right-censored data, the size can become small towards the right tail. For left-truncated data, the size can become small at the left tail and can remain so for the entire range of data. Lai and Ying (1991) demonstrated this behavior and proposed a modification to the estimator that ignores the effects of small risk set sizes.

The EDF estimate at observation $i$ is computed as

\[  \hat{F}_ n(y_ i) = 1 - {\displaystyle \prod _{\tau \leq y_ i} \left(1 - \frac{n_\tau }{R_ n(\tau )} \cdot I[R_ n(\tau ) \geq c N^\alpha ] \right)}  \]

where the definitions of $n_\tau $ and $R_ n(\tau )$ are identical to those used for the KAPLANMEIER method described previously.

You can specify the values of $c$ and $\alpha $ by using the C= and ALPHA= options. If you do not specify a value for $c$, the default value used is $c=1$. If you do not specify a value for $\alpha $, the default value used is $\alpha =0.5$.

As an alternative, you can also specify an absolute lower bound, say $L$, on the risk set size by using the RSLB= option, in which case $I[R_ n(\tau ) \geq c N^\alpha ]$ is replaced by $I[R_ n(\tau ) \geq L]$ in the definition.
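The per-point factor of the modified estimator can be sketched as follows (illustrative Python, not SAS code; the parameter names mirror the C=, ALPHA=, and RSLB= options):

```python
def modified_km_factor(n_tau, risk, n_total, c=1.0, alpha=0.5, rslb=None):
    """Factor (1 - n_tau / R_n(tau) * I[R_n(tau) >= bound]) of the
    modified product-limit estimator; the bound is c * N**alpha by
    default, or the absolute lower bound when rslb is given."""
    bound = rslb if rslb is not None else c * n_total ** alpha
    if risk >= bound:
        return 1.0 - n_tau / risk
    return 1.0  # indicator is 0: the term is ignored

# With N = 100 and defaults, the bound is 1 * 100**0.5 = 10:
small_risk = modified_km_factor(1.0, 5.0, 100)   # risk set too small
large_risk = modified_km_factor(1.0, 20.0, 100)  # factor 1 - 1/20
```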

Turnbull’s EDF Estimation Method

If the response variable is subject to both left-censoring and right-censoring effects and if you do not specify the NOTURNBULL method, then the HPSEVERITY procedure uses a method proposed by Turnbull (1976) to compute the nonparametric estimates of the cumulative distribution function. The original Turnbull’s method is modified using the suggestions made by Frydman (1994) when truncation effects are present.

Let the input data consist of $N$ observations in the form of quintuplets of values $(y_ i, t^ l_ i, t^ r_ i, c^ r_ i, c^ l_ i), i=1, \dotsc , N$ with notation described in the section Notation. For each observation, let $A_ i = (c^ r_ i, c^ l_ i]$ be the censoring interval; that is, the response variable value is known to lie in the interval $A_ i$, but the exact value is not known. If an observation is uncensored, then $A_ i = (y_ i-\epsilon , y_ i]$ for any arbitrarily small value of $\epsilon > 0$. If an observation is censored, then the value $y_ i$ is ignored. Similarly, for each observation, let $B_ i = (t^ l_ i, t^ r_ i]$ be the truncation interval; that is, the observation is drawn from a truncated (conditional) distribution $F(y, B_ i) = P(Y \leq y | Y \in B_ i)$.

Two sets, $L$ and $R$, are formed using $A_ i$ and $B_ i$ as follows:

\begin{align*}  L & = \lbrace c^ r_ i, 1 \leq i \leq N \rbrace \cup \lbrace t^ r_ i, 1 \leq i \leq N \rbrace \\ R & = \lbrace c^ l_ i, 1 \leq i \leq N \rbrace \cup \lbrace t^ l_ i, 1 \leq i \leq N \rbrace \end{align*}

The sets $L$ and $R$ represent the left endpoints and right endpoints, respectively. A set of disjoint intervals $C_ j = [q_ j, p_ j]$, $1 \leq j \leq M$, is formed such that $q_ j \in L$ and $p_ j \in R$ and $q_ j \leq p_ j$ and $p_{j} < q_{j+1}$. The value of $M$ depends on the nature of the censoring and truncation intervals in the input data. Turnbull (1976) showed that the maximum likelihood estimate (MLE) of the EDF can increase only inside the intervals $C_ j$. In other words, the MLE is constant in the interval $(p_{j}, q_{j+1})$. The likelihood is independent of the behavior of $F_ n$ inside any of the intervals $C_ j$. Let $s_ j$ denote the increase in $F_ n$ inside an interval $C_ j$. Then, the EDF estimate is as follows:
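One standard way to form the disjoint intervals $C_ j$ is to scan the merged endpoints in sorted order and emit an interval whenever an element of $L$ is immediately followed by an element of $R$ (the "innermost intervals"). An illustrative Python sketch (not SAS code):

```python
def turnbull_intervals(lefts, rights):
    """Return the disjoint intervals [q_j, p_j] with q_j in L, p_j in R,
    and no other endpoint strictly between q_j and p_j."""
    # Tag left endpoints with 0 and right endpoints with 1 so that a
    # left endpoint sorts before a right endpoint at the same value.
    points = sorted([(q, 0) for q in set(lefts)] +
                    [(p, 1) for p in set(rights)])
    intervals, last_left = [], None
    for value, kind in points:
        if kind == 0:
            last_left = value  # remember the most recent left endpoint
        elif last_left is not None:
            intervals.append((last_left, value))
            last_left = None
    return intervals

# Censoring intervals (1, 3] and (2, 4] give L = {1, 2}, R = {3, 4}:
example = turnbull_intervals([1.0, 2.0], [3.0, 4.0])
```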

\[  F_ n(y) = \left\{  \begin{array}{ll} 0 &  \text { if } y < q_1 \\ \sum _{k=1}^{j} s_ k &  \text { if } p_ j < y < q_{j+1}, 1 \leq j \leq M-1 \\ 1 &  \text { if } y > p_ M \end{array} \right.  \]

PROC HPSEVERITY computes the estimates $F_ n(p_ j+) = F_ n(q_{j+1}-) = \sum _{k=1}^{j} s_ k$ at points $p_ j$ and $q_{j+1}$ and computes $F_ n(q_1-) = 0$ at point $q_1$, where $F_ n(x+)$ denotes the limiting estimate as $x$ is approached from values larger than $x$ and $F_ n(x-)$ denotes the limiting estimate as $x$ is approached from values smaller than $x$.

PROC HPSEVERITY uses the expectation-maximization (EM) algorithm proposed by Turnbull (1976), who referred to the algorithm as the self-consistency algorithm. By default, the algorithm runs until one of the following criteria is met:

  • Relative-error criterion: The maximum relative error between the two consecutive estimates of $s_ j$ falls below a threshold $\epsilon $. If $l$ indicates an index of the current iteration, then this can be formally stated as

    \[  \arg \max _{1 \leq j \leq M} \left\lbrace \frac{|s_ j^{l} - s_ j^{l-1}|}{s_ j^{l-1}} \right\rbrace \leq \epsilon  \]

    You can control the value of $\epsilon $ by specifying the EPS= suboption of the EDF=TURNBULL option in the PROC HPSEVERITY statement. The default value is 1.0E–8.

  • Maximum-iteration criterion: The number of iterations exceeds an upper limit that you specify for the MAXITER= suboption of the EDF=TURNBULL option in the PROC HPSEVERITY statement. The default number of maximum iterations is $500$.

The self-consistent estimates obtained in this manner might not be maximum likelihood estimates. Gentleman and Geyer (1994) suggested the use of the Kuhn-Tucker conditions for the maximum likelihood problem to ensure that the estimates are MLE. If you specify the ENSUREMLE suboption of the EDF=TURNBULL option in the PROC HPSEVERITY statement, then PROC HPSEVERITY computes the Kuhn-Tucker conditions at the end of each iteration to determine whether the estimates $\lbrace s_ j \rbrace $ are MLE. If you do not specify any truncation effects, then the Kuhn-Tucker conditions derived by Gentleman and Geyer (1994) are used. If you specify any truncation effects, then PROC HPSEVERITY uses modified Kuhn-Tucker conditions that account for the truncation effects.

An integral part of checking the conditions is to determine whether an estimate $s_ j$ is zero or whether an estimate of the Lagrange multiplier or the reduced gradient associated with the estimate $s_ j$ is zero. PROC HPSEVERITY declares these values to be zero if they are less than or equal to a threshold $\delta $. You can control the value of $\delta $ by specifying the ZEROPROB= suboption of the EDF=TURNBULL option in the PROC HPSEVERITY statement. The default value is 1.0E–8. The algorithm continues until the Kuhn-Tucker conditions are satisfied or the number of iterations exceeds the upper limit. The relative-error criterion stated previously is not used when you specify the ENSUREMLE option.
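The self-consistency iteration, without truncation effects and without the ENSUREMLE check, can be sketched as follows. This is an illustrative Python sketch, not SAS code; alpha[i][j] indicates whether interval $C_ j$ lies inside the censoring interval $A_ i$ of observation $i$.

```python
def turnbull_em(alpha, eps=1.0e-8, maxiter=500):
    """Self-consistency (EM) iteration for the increases s_j.
    alpha[i][j] = 1 if C_j is contained in A_i, else 0.
    Stops on the relative-error or maximum-iteration criterion."""
    n_obs, m = len(alpha), len(alpha[0])
    s = [1.0 / m] * m  # uniform starting values
    for _ in range(maxiter):
        new = [0.0] * m
        for row in alpha:
            denom = sum(a * s_j for a, s_j in zip(row, s))
            for j, a in enumerate(row):
                if a:
                    new[j] += s[j] / denom  # expected mass in C_j
        new = [v / n_obs for v in new]
        rel_err = max(abs(n_j - s_j) / s_j
                      for n_j, s_j in zip(new, s) if s_j > 0)
        s = new
        if rel_err <= eps:
            break
    return s

# Two observations whose censoring intervals contain only C_1,
# and one whose interval contains only C_2:
example = turnbull_em([[1, 0], [1, 0], [0, 1]])
```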

EDF Estimates and Truncation

If you specify truncation, then the estimate $\hat{F}_ n(y)$ that is computed by any method other than the STANDARD method is a conditional estimate. In other words, $\hat{F}_ n(y) = \Pr (Y \leq y | \tau _ G < Y \leq \tau _ H )$, where $G$ and $H$ denote the (unknown) distribution functions of the left-truncation threshold variable $T^ l$ and the right-truncation threshold variable $T^ r$, respectively, $\tau _ G$ denotes the smallest left-truncation threshold with a nonzero cumulative probability, and $\tau _ H$ denotes the largest right-truncation threshold with a nonzero cumulative probability. Formally, $\tau _ G = \inf \lbrace s: G(s) > 0 \rbrace $ and $\tau _ H = \sup \lbrace s: H(s) > 0 \rbrace $. For computational purposes, PROC HPSEVERITY estimates $\tau _ G$ and $\tau _ H$ by $t^ l_{\text {min}}$ and $t^ r_{\text {max}}$, respectively, defined as

\begin{align*}  t^ l_{\text {min}} & = \text {min} \{  t^ l_ k : 1 \leq k \leq N \}  \\ t^ r_{\text {max}} & = \text {max} \{  t^ r_ k : 1 \leq k \leq N \}  \end{align*}

These estimates of $t^ l_{\text {min}}$ and $t^ r_{\text {max}}$ are used to compute the conditional estimates of the CDF as described in the section Truncation and Conditional CDF Estimates.

Supplying EDF Estimates to Functions and Subroutines

The parameter initialization subroutines in distribution models and some predefined utility functions require EDF estimates. For more information, see the sections Defining a Severity Distribution Model with the FCMP Procedure and Predefined Utility Functions.

PROC HPSEVERITY supplies the EDF estimates to these subroutines and functions by using two arrays, x and F, the dimension of each array, and the type of the EDF estimates. The type identifies how the EDF estimates are computed and stored. The types are as follows:

Type 1

specifies that EDF estimates are computed using the STANDARD method; that is, the data used for estimation are neither censored nor truncated.

Type 2

specifies that EDF estimates are computed using the KAPLANMEIER method; that is, the data used for estimation are subject to at least one form of truncation or censoring.

Type 3

specifies that EDF estimates are computed using the TURNBULL method; that is, the data used for estimation are subject to both left- and right-censoring. The data might or might not be truncated.

For Types 1 and 2, the EDF estimates are stored in arrays x and F of dimension N such that the following holds:

\[  F_ n(y) = \left\{  \begin{array}{ll} 0 &  \text { if } y < x[1] \\ F[k] &  \text { if } x[k] \leq y < x[k+1], k = 1, \dotsc , N-1 \\ F[N] &  \text { if } x[N] \leq y \end{array} \right.  \]

where $[k]$ denotes the $k$th element of the array ([1] denotes the first element of the array).
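A Type 1 or Type 2 lookup can be sketched as follows (illustrative Python, not SAS code; 0-based indexing replaces the 1-based indexing of the text):

```python
import bisect

def edf_lookup(y, x, F):
    """Evaluate the step-function EDF stored in arrays x and F."""
    k = bisect.bisect_right(x, y)  # number of x[j] <= y
    return 0.0 if k == 0 else F[k - 1]

x = [1.0, 2.0, 4.0]
F = [0.25, 0.50, 1.00]
```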

For Type 3, the EDF estimates are stored in arrays x and F of dimension N such that the following holds:

\[  F_ n(y) = \left\{  \begin{array}{ll} 0 &  \text { if } y < x[1] \\ \text {undefined} &  \text { if } x[2k-1] \leq y < x[2k], k = 1, \dotsc , (N-1)/2 \\ F[2k] = F[2k+1] &  \text { if } x[2k] \leq y < x[2k+1], k = 1, \dotsc , (N-1)/2 \\ F[N] &  \text { if } x[N] \leq y \end{array} \right.  \]

Although the behavior of EDF is theoretically undefined for the interval $[x[2k-1],x[2k])$, for computational purposes, all predefined functions and subroutines assume that the EDF increases linearly from $F[2k-1]$ to $F[2k]$ in that interval if $x[2k-1] < x[2k]$. If $x[2k-1]=x[2k]$, which can happen when the EDF is estimated from a combination of uncensored and interval-censored data, the predefined functions and subroutines assume that $F_ n(x[2k-1]) = F_ n(x[2k]) = F[2k]$.
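The Type 3 lookup, including the linear-interpolation convention just described, can be sketched as follows. This is an illustrative Python sketch, not SAS code; 0-based indexing replaces the text's 1-based indexing, so the interpolated intervals start at even 0-based positions.

```python
import bisect

def edf_type3_lookup(y, x, F):
    """Evaluate the Type 3 EDF arrays, interpolating linearly over the
    theoretically undefined intervals."""
    if y < x[0]:
        return 0.0
    if y >= x[-1]:
        return F[-1]
    k = bisect.bisect_right(x, y) - 1  # x[k] <= y < x[k+1], 0-based
    if k % 2 == 0:
        # [x[2k-1], x[2k]) in 1-based terms: interpolate F[k] -> F[k+1]
        if x[k + 1] == x[k]:
            return F[k + 1]
        return F[k] + (F[k + 1] - F[k]) * (y - x[k]) / (x[k + 1] - x[k])
    return F[k]  # constant interval [x[2k], x[2k+1])

# x = [q1, p1, q2, p2]; F holds 0 at q1, then the cumulative s_j sums:
x = [1.0, 2.0, 3.0, 4.0]
F = [0.0, 0.4, 0.4, 1.0]
```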