PROC SEVERITY: Empirical Distribution Function Estimation Methods

The empirical distribution function (EDF) is a nonparametric estimate of the cumulative distribution function (CDF) of the distribution. PROC SEVERITY uses EDF estimates to compute the EDF-based statistics and to provide a nonparametric estimate of the CDF to the PARMINIT subroutine.

Let there be a set of $N$ observations, each containing a triplet of values $(y_i, t_i, \delta_i)$, $i = 1, \dots, N$, where $y_i$ is the value of the response variable, $t_i$ is the value of the left-truncation threshold, and $\delta_i$ is the indicator of right-censoring. A missing value for $t_i$ indicates no left-truncation. $\delta_i = 0$ indicates a right-censored observation, in which case $y_i$ is assumed to record the right-censoring limit $c_i$. $\delta_i = 1$ indicates an uncensored observation.

In the following definitions, an indicator function $I[e]$ is used, which takes a value of 1 or 0 if the expression $e$ is true or false, respectively.

Given this notation, the EDF is estimated as follows:

$F_n(y) = \begin{cases} 0 & \text{if } y < y^{(1)} \\ \hat{F}_n(y^{(k)}) & \text{if } y^{(k)} \le y < y^{(k+1)},\ k = 1, \dots, N-1 \\ \hat{F}_n(y^{(N)}) & \text{if } y^{(N)} \le y \end{cases}$

where $y^{(k)}$ denotes the $k$th order statistic of the set $\{y_i\}$ and $\hat{F}_n(y^{(k)})$ is the estimate computed at that value. The definition of $\hat{F}_n$ depends on the estimation method. You can specify a particular method or let PROC SEVERITY choose an appropriate method by using the EMPIRICALCDF= option in the MODEL statement. Each method computes $\hat{F}_n$ as follows:

STANDARD

This method is the standard way of computing the EDF. The EDF estimate at observation $i$ is computed as follows:

$\hat{F}_n(y_i) = \frac{1}{N} \sum_{j=1}^{N} I[y_j \le y_i]$

This method ignores any censoring and truncation information, even if it is specified. When no censoring or truncation information is specified, this is the default method chosen.
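The STANDARD estimate is simply the fraction of observations that do not exceed a given value. The following Python sketch illustrates that idea; it is not PROC SEVERITY code, and the names `standard_edf` and `losses` are hypothetical.

```python
from bisect import bisect_right


def standard_edf(values):
    """Return a step function F(y) = (number of values <= y) / N."""
    ordered = sorted(values)
    n = len(ordered)

    def edf(y):
        # bisect_right counts how many sorted values are <= y.
        return bisect_right(ordered, y) / n

    return edf


# Illustrative data only; censoring and truncation are ignored by design.
losses = [3.0, 1.0, 4.0, 1.0, 5.0]
F = standard_edf(losses)
```

Because the returned function is a step function, it is 0 below the smallest value, constant between consecutive order statistics, and 1 at and above the largest value.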

KAPLANMEIER

This method is suitable primarily when left-truncation or right-censoring is specified. The Kaplan-Meier (KM) estimator, also known as the product-limit estimator, was first introduced by Kaplan and Meier (1958) for censored data. Lynden-Bell (1971) derived a similar estimator for left-truncated data. PROC SEVERITY uses the definition that combines both censoring and truncation information (Klein and Moeschberger 1997, Lai and Ying 1991).

The EDF estimate at observation $i$ is computed as

$\hat{F}_n(y_i) = 1 - \prod_{\tau \le y_i} \left( 1 - \frac{n(\tau)}{R_n(\tau)} \right)$

where $n(\tau)$ and $R_n(\tau)$ are defined as follows:

$n(\tau) = \sum_{k=1}^{N} I[y_k = \tau \text{ and } \delta_k = 1]$, which is the number of uncensored observations with response variable value equal to $\tau$.
$R_n(\tau) = \sum_{k=1}^{N} I[y_k \ge \tau > t_k]$, which is the size (cardinality) of the risk set at $\tau$. The term risk set has its origins in survival analysis; it contains the events that are at risk of failure at a given time, $\tau$. In other words, it contains the events that have survived up to time $\tau$ and might fail at or after $\tau$. For PROC SEVERITY, time is equivalent to the magnitude of the event and failure is equivalent to an uncensored and observable event, where observable means it satisfies the left-truncation threshold.

If you specify either right-censoring or left-truncation and do not explicitly specify a method of computing EDF, then this is the default method used.
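The product-limit computation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure's implementation: the function name `kaplan_meier_edf` is hypothetical, a missing truncation threshold is represented by `None`, and every truncated observation is assumed to have a response value above its threshold so that the risk set is never empty at an uncensored value.

```python
def kaplan_meier_edf(y, t, delta):
    """Product-limit EDF at each distinct uncensored response value.

    y: response values
    t: left-truncation thresholds (None means no left-truncation)
    delta: 1 for uncensored, 0 for right-censored
    Returns (tau, F_hat(tau)) pairs in increasing order of tau.
    """
    taus = sorted({yk for yk, dk in zip(y, delta) if dk == 1})
    survival = 1.0
    estimates = []
    for tau in taus:
        # Number of uncensored observations with response equal to tau.
        n_tau = sum(1 for yk, dk in zip(y, delta) if yk == tau and dk == 1)
        # Risk set size: response >= tau and threshold (if any) below tau.
        risk = sum(
            1 for yk, tk in zip(y, t)
            if yk >= tau and (tk is None or tk < tau)
        )
        survival *= 1.0 - n_tau / risk
        estimates.append((tau, 1.0 - survival))
    return estimates


# Illustrative data: the third observation is right-censored at 3.
y = [2, 3, 3, 5, 7]
t = [None, 1, None, 2, 4]
delta = [1, 1, 0, 1, 1]
km_estimates = kaplan_meier_edf(y, t, delta)
```

In this example the risk set sizes at 2, 3, 5, and 7 are 3, 3, 2, and 1, so the estimates are 1/3, 5/9, 7/9, and 1.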

MODIFIEDKM

The product-limit estimator used by the KAPLANMEIER method does not work well if the risk set size becomes very small. This can happen toward the right tail for right-censored data and at the left tail for left-truncated data, and the effect can propagate to the entire range of the data, as demonstrated by Lai and Ying (1991). They proposed a modification to the estimator that ignores the effects due to small risk set sizes.

The EDF estimate at observation $i$ is computed as

$\hat{F}_n(y_i) = 1 - \prod_{\tau \le y_i} \left( 1 - \frac{n(\tau)}{R_n(\tau)} \cdot I[R_n(\tau) \ge c N^{\alpha}] \right)$

where the definitions of $n(\tau)$ and $R_n(\tau)$ are identical to those used for the KAPLANMEIER method described previously.

You can specify the values of $c$ and $\alpha$ by using the C= and ALPHA= options. If you do not specify a value for $c$, the default value used is $c = 1$. If you do not specify a value for $\alpha$, the default value used is $\alpha = 0.5$.

As an alternative, you can also specify an absolute lower bound, say $L$, on the risk set size by using the RSLB= option, in which case $c N^{\alpha}$ is replaced by $L$ in the definition.
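The following Python sketch shows one way to read the modification: a factor of the product is skipped whenever the risk set is smaller than a lower bound. It assumes that the bound takes the form c * N**alpha, where N is the number of observations, unless an absolute bound is supplied. The function name `modified_km_edf` and its arguments are hypothetical and only mirror the C=, ALPHA=, and RSLB= options; the values passed in the example are illustrative.

```python
def modified_km_edf(y, t, delta, c, alpha, rslb=None):
    """Product-limit EDF that ignores factors with a small risk set.

    y, t, delta: responses, thresholds (None = none), censoring flags
    c, alpha: the risk set lower bound is c * N**alpha (assumed form)
    rslb: optional absolute lower bound that replaces c * N**alpha
    Returns (tau, F_hat(tau)) pairs in increasing order of tau.
    """
    n_obs = len(y)
    bound = rslb if rslb is not None else c * n_obs ** alpha
    taus = sorted({yk for yk, dk in zip(y, delta) if dk == 1})
    survival = 1.0
    estimates = []
    for tau in taus:
        n_tau = sum(1 for yk, dk in zip(y, delta) if yk == tau and dk == 1)
        risk = sum(
            1 for yk, tk in zip(y, t)
            if yk >= tau and (tk is None or tk < tau)
        )
        # Only risk sets at or above the bound contribute a factor.
        if risk >= bound:
            survival *= 1.0 - n_tau / risk
        estimates.append((tau, 1.0 - survival))
    return estimates


y = [2, 3, 3, 5, 7]
t = [None, 1, None, 2, 4]
delta = [1, 1, 0, 1, 1]
mkm_estimates = modified_km_edf(y, t, delta, c=1.0, alpha=0.5)
mkm_rslb = modified_km_edf(y, t, delta, c=1.0, alpha=0.5, rslb=1)
```

With five observations, c = 1, and alpha = 0.5, the bound is about 2.24, so the factors at 5 and 7 (risk set sizes 2 and 1) are skipped and the estimate stays at 5/9 in the right tail. With an absolute bound of 1, every factor is kept and the result matches the unmodified product-limit estimate.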

EDF Estimates and Left-Truncation

If left-truncation is specified without the probability of observability, the estimate $\hat{F}_n(y)$ computed by the KAPLANMEIER and MODIFIEDKM methods is a conditional estimate. In other words, $\hat{F}_n(y) = \Pr(Y \le y \mid Y > \tau_G)$, where $G$ denotes the (unknown) distribution function of $t_i$ and $\tau_G = \inf\{s : G(s) > 0\}$. That is, $\tau_G$ is the smallest threshold with a nonzero cumulative probability. For computational purposes, PROC SEVERITY computes $\tau_G$ as $\min\{t_k : 1 \le k \le N\}$.

If left-truncation is specified with the probability of observability $p$, then PROC SEVERITY uses the additional information provided by $p$ to compute an unconditional estimate of the EDF. In particular, for each left-truncated observation $i$ with response variable value $y_i$ and truncation threshold $t_i$, an observation $j$ is added with weight $w_j = (1 - p)/p$ and $y_j = t_i$. Each added observation is assumed to be uncensored; that is, $\delta_j = 1$. Weight on each original observation $i$ is assumed to be 1; that is, $w_i = 1$. Let $N^a$ denote the number of observations in this appended set of observations. Then, the specified EDF method is used by assuming no left-truncation.

For the KAPLANMEIER and MODIFIEDKM methods, definitions of $n(\tau)$ and $R_n(\tau)$ are modified to account for the weights on the observations. $n(\tau)$ is now defined as $\sum_{k=1}^{N^a} w_k I[y_k = \tau \text{ and } \delta_k = 1]$, and $R_n(\tau)$ is defined as $\sum_{k=1}^{N^a} w_k I[y_k \ge \tau]$. From the definition of $R_n(\tau)$, note that each observation in the appended set is assumed to be observed; that is, the left-truncation information is not used, because it was used along with $p$ to add the observations. The estimate that is obtained using this method is an unconditional estimate of the EDF.
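The appended-set construction can be sketched in Python as follows. This is an illustration under explicit assumptions, not the procedure's implementation: each added observation is placed at the truncation threshold of the original observation and carries weight (1 - p) / p, and the weighted product-limit form is used on the appended set. The function name `unconditional_km_edf` is hypothetical.

```python
def unconditional_km_edf(y, t, delta, p):
    """Weighted product-limit EDF on the appended observation set.

    y, t, delta: responses, thresholds (None = none), censoring flags
    p: probability of observability, 0 < p <= 1
    Returns (tau, F_hat(tau)) pairs in increasing order of tau.
    """
    # Original observations keep weight 1.
    values = list(y)
    flags = list(delta)
    weights = [1.0] * len(y)
    # Assumption: one uncensored pseudo-observation per truncated record,
    # placed at its threshold with weight (1 - p) / p.
    for tk in t:
        if tk is not None:
            values.append(tk)
            flags.append(1)
            weights.append((1.0 - p) / p)

    rows = list(zip(values, flags, weights))
    taus = sorted({v for v, d, _ in rows if d == 1})
    survival = 1.0
    estimates = []
    for tau in taus:
        # Weighted count of uncensored observations at tau.
        n_tau = sum(w for v, d, w in rows if v == tau and d == 1)
        # Weighted risk set; truncation thresholds are no longer used.
        risk = sum(w for v, _, w in rows if v >= tau)
        survival *= 1.0 - n_tau / risk
        estimates.append((tau, 1.0 - survival))
    return estimates


# Illustrative data: no censoring, two truncated observations, p = 0.8.
y = [2.0, 3.0, 5.0]
t = [None, 1.0, 2.0]
delta = [1, 1, 1]
unconditional = unconditional_km_edf(y, t, delta, p=0.8)
```

Because this example has no censoring, the result coincides with a weighted version of the standard EDF: the total weight is 3.5, and the cumulative weights at 1, 2, 3, and 5 are 0.25, 1.5, 2.5, and 3.5.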

Note: This procedure is experimental.
