PROC SEVERITY assumes the following model for the response variable $Y$:

$$Y \sim \mathcal{F}(\Theta)$$

where $\mathcal{F}$ is a continuous probability distribution with parameters $\Theta$. The model hypothesizes that the observed response is generated from a stochastic process that is governed by the distribution $\mathcal{F}$. This model is typically referred to as the error model. Given a representative input sample of response variable values, PROC SEVERITY estimates the model parameters for any distribution $\mathcal{F}$ and computes the statistics of fit for each model. This enables you to find the distribution that is most likely to generate the observed sample.
A set of predefined distributions is provided with the SEVERITY procedure. A summary of the distributions is provided in Table 23.2. For each distribution, the table lists the name of the distribution that should be used in the DIST statement, the parameters of the distribution along with their bounds, and the mathematical expressions for the probability density function (PDF) and cumulative distribution function (CDF) of the distribution.
All the predefined distributions, except LOGN and TWEEDIE, are parameterized such that their first parameter is the scale parameter. For LOGN, the first parameter is a log-transformed scale parameter. TWEEDIE does not have a scale parameter. The presence of a scale parameter or a log-transformed scale parameter enables you to use any of the predefined distributions, except TWEEDIE, as a candidate for estimating regression effects.
A distribution model is associated with each predefined distribution. You can also define your own distribution model, which is a set of functions and subroutines that you define by using the FCMP procedure. See the section Defining a Distribution Model with the FCMP Procedure for more information.
Table 23.2: Predefined SEVERITY Distributions
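| Name | Distribution | Parameters | PDF ($f$) and CDF ($F$) |
|---|---|---|---|
| BURR | Burr | $\theta > 0$, $\alpha > 0$, $\gamma > 0$ | $f(x) = \frac{\alpha \gamma z^{\gamma}}{x (1 + z^{\gamma})^{\alpha + 1}}$; $F(x) = 1 - \left(\frac{1}{1 + z^{\gamma}}\right)^{\alpha}$ |
| EXP | Exponential | $\theta > 0$ | $f(x) = \frac{1}{\theta} e^{-z}$; $F(x) = 1 - e^{-z}$ |
| GAMMA | Gamma | $\theta > 0$, $\alpha > 0$ | $f(x) = \frac{z^{\alpha} e^{-z}}{x \Gamma(\alpha)}$; $F(x) = \frac{\gamma(\alpha, z)}{\Gamma(\alpha)}$ |
| GPD | Generalized Pareto | $\theta > 0$, $\xi > 0$ | $f(x) = \frac{1}{\theta} (1 + \xi z)^{-1 - 1/\xi}$; $F(x) = 1 - (1 + \xi z)^{-1/\xi}$ |
| IGAUSS | Inverse Gaussian (Wald) | $\theta > 0$, $\alpha > 0$ | $f(x) = \frac{1}{\theta} \sqrt{\frac{\alpha}{2 \pi z^{3}}} \exp\left(\frac{-\alpha (z - 1)^{2}}{2 z}\right)$; $F(x) = \Phi\left((z - 1) \sqrt{\frac{\alpha}{z}}\right) + \Phi\left(-(z + 1) \sqrt{\frac{\alpha}{z}}\right) e^{2 \alpha}$ |
| LOGN | Lognormal | $\mu$ (no bounds), $\sigma > 0$ | $f(x) = \frac{1}{x \sigma \sqrt{2 \pi}} \exp\left(-\frac{1}{2} \left(\frac{\log(x) - \mu}{\sigma}\right)^{2}\right)$; $F(x) = \Phi\left(\frac{\log(x) - \mu}{\sigma}\right)$ |
| PARETO | Pareto | $\theta > 0$, $\alpha > 0$ | $f(x) = \frac{\alpha \theta^{\alpha}}{(x + \theta)^{\alpha + 1}}$; $F(x) = 1 - \left(\frac{\theta}{x + \theta}\right)^{\alpha}$ |
| TWEEDIE | Tweedie | $p > 1$, $\mu > 0$, $\phi > 0$ | $f(x) = a(x, \phi) \exp\left[\frac{1}{\phi} \left(\frac{x \mu^{1 - p}}{1 - p} - \kappa(\mu, p)\right)\right]$; $F(x) = \int_{0}^{x} f(t)\, dt$ |
| STWEEDIE | Scaled Tweedie | $\theta > 0$, $\lambda > 0$, $1 < p < 2$ | $f(x) = a(x, \theta, \lambda, p) \exp\left(-\frac{x}{\theta} - \lambda\right)$; $F(x) = \int_{0}^{x} f(t)\, dt$ |
| WEIBULL | Weibull | $\theta > 0$, $\tau > 0$ | $f(x) = \frac{1}{x} \tau z^{\tau} e^{-z^{\tau}}$; $F(x) = 1 - e^{-z^{\tau}}$ |

Notes:

1. $z = x / \theta$, wherever $z$ is used.
2. $\theta$ denotes the scale parameter for all the distributions. For LOGN, $\log(\theta) = \mu$.
3. Parameters are listed in the order in which they are defined in the distribution model.
4. $\gamma(a, b) = \int_{0}^{b} t^{a - 1} e^{-t}\, dt$ is the lower incomplete gamma function.
5. $\Phi(y) = \frac{1}{2} \left(1 + \operatorname{erf}\left(\frac{y}{\sqrt{2}}\right)\right)$ is the standard normal CDF.
6. See the section Tweedie Distributions for more information.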
Tweedie distributions are a special case of the exponential dispersion family (Jørgensen, 1987) with the property that the variance of the distribution is equal to $\phi \mu^{p}$, where $\mu$ is the mean of the distribution, $\phi$ is a dispersion parameter, and $p$ is an index parameter as discovered by Tweedie (1984). The distribution is defined for all values of $p$ except for values of $p$ in the open interval $(0, 1)$. Many important known distributions are special cases of Tweedie distributions, including the normal ($p = 0$), Poisson ($p = 1$), gamma ($p = 2$), and inverse Gaussian ($p = 3$) distributions. Apart from these special cases, the probability density function (PDF) of the Tweedie distribution does not have an analytic expression. For $p > 1$, it has the form (Dunn and Smyth 2005),

$$f(y; \mu, \phi, p) = a(y, \phi) \exp\left[\frac{1}{\phi} \left(\frac{y \mu^{1 - p}}{1 - p} - \kappa(\mu, p)\right)\right]$$

where $\kappa(\mu, p) = \mu^{2 - p} / (2 - p)$ for $p \neq 2$ and $\kappa(\mu, p) = \log(\mu)$ for $p = 2$. The function $a(y, \phi)$ does not have an analytical expression. It is typically evaluated using series expansion methods described in Dunn and Smyth (2005).
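For $1 < p < 2$, the Tweedie distribution is a compound Poisson-gamma mixture distribution, which is the distribution of $S$ defined as

$$S = \sum_{i=1}^{N} X_i$$

where $N \sim \text{Poisson}(\lambda)$ and $X_i \sim \text{gamma}(\alpha, \theta)$ are iid gamma random variables with shape parameter $\alpha$ and scale parameter $\theta$. At $S = 0$, the density is a probability mass that is governed by the Poisson distribution, and for values of $S > 0$, it is a mixture of gamma variates with Poisson mixing probability. The parameters $\lambda$, $\alpha$, and $\theta$ are related to the natural parameters $\mu$, $\phi$, and $p$ of the Tweedie distribution as

$$\lambda = \frac{\mu^{2 - p}}{\phi (2 - p)}$$

$$\alpha = \frac{2 - p}{p - 1}$$

$$\theta = \phi (p - 1) \mu^{p - 1}$$

The mean of a Tweedie distribution is positive for $p > 1$.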
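The mapping from the natural Tweedie parameters to the compound Poisson-gamma parameters can be checked numerically. The following is a minimal Python sketch, not part of PROC SEVERITY; the function name and the example values are hypothetical. It applies the three relationships above and confirms that the implied mean and variance equal $\mu$ and $\phi \mu^{p}$.

```python
def tweedie_to_compound_poisson_gamma(mu, phi, p):
    """Map natural Tweedie parameters to (lam, alpha, theta); needs 1 < p < 2."""
    if not (1.0 < p < 2.0):
        raise ValueError("the compound Poisson-gamma form requires 1 < p < 2")
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))    # Poisson mean
    alpha = (2.0 - p) / (p - 1.0)                # gamma shape
    theta = phi * (p - 1.0) * mu ** (p - 1.0)    # gamma scale
    return lam, alpha, theta


# Hypothetical example values chosen only for illustration.
mu, phi, p = 10.0, 2.0, 1.5
lam, alpha, theta = tweedie_to_compound_poisson_gamma(mu, phi, p)

# Moments of a compound Poisson sum: E[S] = E[N] E[X], Var[S] = E[N] E[X^2].
mean = lam * alpha * theta
variance = lam * alpha * (alpha + 1.0) * theta ** 2
```

With these inputs, `mean` reproduces `mu` and `variance` reproduces `phi * mu ** p` up to rounding.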
Two predefined versions of the Tweedie distribution are provided with the SEVERITY procedure. The first version, named TWEEDIE and defined for $p > 1$, has the natural parameterization with parameters $\mu$, $\phi$, and $p$. The second version, named STWEEDIE and defined for $1 < p < 2$, is the version with a scale parameter. It corresponds to the compound Poisson-gamma distribution with gamma scale parameter $\theta$, Poisson mean parameter $\lambda$, and the index parameter $p$. The index parameter determines the shape parameter $\alpha$ of the gamma distribution as

$$\alpha = \frac{2 - p}{p - 1}$$

The parameters $\theta$ and $\lambda$ of the STWEEDIE distribution are related to the parameters $\mu$ and $\phi$ of the TWEEDIE distribution as

$$\mu = \lambda \theta \alpha$$

$$\phi = \frac{(\lambda \theta \alpha)^{2 - p}}{\lambda (2 - p)} = \frac{\theta}{(p - 1) (\lambda \theta \alpha)^{p - 1}}$$
You can fit either version when there are no regression variables. Each version has its own merits. If you fit the TWEEDIE version, you have the direct estimate of the overall mean of the distribution. If you are interested in the most practical range of the index parameter, $1 < p < 2$, then you can fit the STWEEDIE version, which provides you direct estimates of the Poisson and gamma components that comprise the distribution (an estimate of the gamma shape parameter $\alpha$ is easily obtained from the estimate of $p$).
If you want to estimate the effect of exogenous (regression) variables on the distribution, then you must use the STWEEDIE version, because PROC SEVERITY requires a distribution to have a scale parameter in order to estimate regression effects. See the section Estimating Regression Effects for more information. The gamma scale parameter $\theta$ is the scale parameter of the STWEEDIE distribution. If you are interested in determining the effect of regression variables on the mean of the distribution, you can do so by first fitting the STWEEDIE distribution to determine the effect of the regression variables on the scale parameter $\theta$. Then, you can easily estimate how the mean of the distribution $\mu$ is affected by the regression variables using the relationship $\mu = c \theta$, where $c = \lambda \alpha = \lambda (2 - p) / (p - 1)$. The estimates of the regression parameters remain the same, whereas the estimate of the intercept parameter is adjusted by the estimates of the $\lambda$ and $p$ parameters.
The parameters are initialized by using the method of moments for all the distributions, except for the gamma and the Weibull distributions. For the gamma distribution, approximate maximum likelihood estimates are used. For the Weibull distribution, the method of percentile matching is used.
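Given $n$ observations of the severity value $y_i$ ($1 \leq i \leq n$), the estimate of the $k$th raw moment is denoted by $m_k$ and computed as

$$m_k = \frac{1}{n} \sum_{i=1}^{n} y_i^{k}$$

The $100p$th percentile is denoted by $\pi_p$ ($0 \leq p \leq 1$). By definition, $\pi_p$ satisfies

$$F(\pi_p -) \leq p \leq F(\pi_p)$$

where $F(\pi_p -) = \lim_{h \downarrow 0} F(\pi_p - h)$. PROC SEVERITY uses the following practical method of computing $\pi_p$. Let $\hat{F}_n(y)$ denote the empirical distribution function (EDF) estimate at a severity value $y$. Let $y_p^{-}$ and $y_p^{+}$ denote two consecutive values in the ascending sequence of $y$ values such that $\hat{F}_n(y_p^{-}) < p$ and $\hat{F}_n(y_p^{+}) \geq p$. Then, the estimate $\hat{\pi}_p$ is computed as

$$\hat{\pi}_p = y_p^{-} + \frac{p - \hat{F}_n(y_p^{-})}{\hat{F}_n(y_p^{+}) - \hat{F}_n(y_p^{-})} \left(y_p^{+} - y_p^{-}\right)$$

Let $\epsilon$ denote the smallest double-precision floating-point number such that $1 + \epsilon > 1$. This machine precision constant can be obtained by using the CONSTANT function in Base SAS software.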
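The raw-moment and percentile estimates can be illustrated with a short Python sketch. This is not the procedure's implementation: the function names are hypothetical, the EDF is assumed to be the standard one (fraction of observations less than or equal to a value, with no censoring or truncation adjustment), and returning the sample minimum when no smaller value exists is an added assumption.

```python
from bisect import bisect_right


def raw_moment(ys, k):
    """k-th raw sample moment: (1/n) * sum of y_i ** k."""
    return sum(y ** k for y in ys) / len(ys)


def edf_percentile(ys, p):
    """Interpolated estimate of the 100p-th percentile, for 0 <= p <= 1."""
    data = sorted(ys)
    n = len(data)
    distinct = sorted(set(data))
    # Standard EDF at each distinct value: fraction of observations <= value.
    edf = [bisect_right(data, v) / n for v in distinct]
    for j, f_hi in enumerate(edf):
        if f_hi >= p:
            if j == 0:
                # Assumption: no smaller value exists, so return the minimum.
                return distinct[0]
            f_lo, y_lo, y_hi = edf[j - 1], distinct[j - 1], distinct[j]
            # Linear interpolation between the two consecutive values.
            return y_lo + (p - f_lo) / (f_hi - f_lo) * (y_hi - y_lo)
    return distinct[-1]


sample = [1.0, 2.0, 3.0, 4.0]
m2 = raw_moment(sample, 2)          # (1 + 4 + 9 + 16) / 4
q60 = edf_percentile(sample, 0.6)   # interpolates between 2.0 and 3.0
```

For this sample the EDF values are 0.25, 0.5, 0.75, and 1, so the 60th percentile falls 40% of the way from 2.0 to 3.0.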
The details of how parameters are initialized for each predefined distribution are as follows:
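For the BURR distribution, the parameters are initialized by using the method of moments. The $k$th raw moment of the Burr distribution is:

$$E[X^{k}] = \frac{\theta^{k}\, \Gamma(1 + k/\gamma)\, \Gamma(\alpha - k/\gamma)}{\Gamma(\alpha)}, \quad -\gamma < k < \alpha \gamma$$

Three moment equations $E[X^{k}] = m_k$ ($k = 1, 2, 3$) need to be solved for initializing the three parameters of the distribution. In order to get an approximate closed-form solution, the second shape parameter $\hat{\gamma}$ is initialized to a value of 2. If $2 m_3 - 3 m_1 m_2 > 0$, then simplifying and solving the moment equations yields the following feasible set of initial values:

$$\hat{\theta} = \sqrt{\frac{m_2 m_3}{2 m_3 - 3 m_1 m_2}}, \quad \hat{\alpha} = 1 + \frac{m_3}{2 m_3 - 3 m_1 m_2}, \quad \hat{\gamma} = 2$$

If $2 m_3 - 3 m_1 m_2 < \epsilon$, then the parameters are initialized as follows:

$$\hat{\theta} = \sqrt{m_2}, \quad \hat{\alpha} = 2, \quad \hat{\gamma} = 2$$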
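The Burr initialization can be sketched in Python as follows. The function name is hypothetical, and `sys.float_info.epsilon` is used as an assumed stand-in for the SAS machine precision constant. With the second shape parameter fixed at 2, the second raw moment of the Burr distribution reduces to $\theta^{2} / (\alpha - 1)$, which gives a simple consistency check on the result.

```python
import math
import sys

# Assumption: Python's double-precision epsilon stands in for the SAS constant.
EPS = sys.float_info.epsilon


def burr_initial_values(ys):
    """Return (theta, alpha, gamma) from the first three raw moments."""
    n = len(ys)
    m1 = sum(ys) / n
    m2 = sum(y ** 2 for y in ys) / n
    m3 = sum(y ** 3 for y in ys) / n
    denom = 2.0 * m3 - 3.0 * m1 * m2
    if denom < EPS:
        # Fallback when the moment equations have no feasible solution.
        return math.sqrt(m2), 2.0, 2.0
    theta = math.sqrt(m2 * m3 / denom)
    alpha = 1.0 + m3 / denom
    return theta, alpha, 2.0


theta, alpha, gamma = burr_initial_values([1.0, 2.0, 3.0, 4.0, 10.0])
implied_m2 = theta ** 2 / (alpha - 1.0)   # should reproduce m2 = 26
```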
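For the EXP distribution, the parameter is initialized by using the method of moments. The $k$th raw moment of the exponential distribution is:

$$E[X^{k}] = \theta^{k}\, \Gamma(k + 1), \quad k > -1$$

Solving $E[X] = m_1$ yields the initial value of $\hat{\theta} = m_1$.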
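For the GAMMA distribution, the parameter $\alpha$ is initialized by using its approximate maximum likelihood (ML) estimate. For a set of $n$ iid observations $y_i$ ($1 \leq i \leq n$), drawn from a gamma distribution, the log likelihood, $l$, is defined as follows:

$$l = \sum_{i=1}^{n} \log\left(\frac{y_i^{\alpha - 1} e^{-y_i / \theta}}{\theta^{\alpha}\, \Gamma(\alpha)}\right)$$

$$\phantom{l} = (\alpha - 1) \sum_{i=1}^{n} \log(y_i) - \frac{1}{\theta} \sum_{i=1}^{n} y_i - n \alpha \log(\theta) - n \log(\Gamma(\alpha))$$

Using a shorter notation of $\sum$ to denote $\sum_{i=1}^{n}$ and solving the equation $\partial l / \partial \theta = 0$ yields the following ML estimate of $\theta$:

$$\hat{\theta} = \frac{\sum y_i}{n \alpha} = \frac{m_1}{\alpha}$$

Substituting this estimate in the expression of $l$ and simplifying gives

$$l = (\alpha - 1) \sum \log(y_i) - n \alpha - n \alpha \log(m_1) + n \alpha \log(\alpha) - n \log(\Gamma(\alpha))$$

Let $d$ be defined as follows:

$$d = \log(m_1) - \frac{1}{n} \sum \log(y_i)$$

Solving the equation $\partial l / \partial \alpha = 0$ yields the following expression in terms of the digamma function, $\psi(\alpha)$:

$$\log(\alpha) - \psi(\alpha) = d$$

The digamma function can be approximated as follows:

$$\hat{\psi}(\alpha) \approx \log(\alpha) - \frac{1}{\alpha} \left(0.5 + \frac{1}{12 \alpha + 2}\right)$$

This approximation is within 1.4% of the true value for all the values of $\alpha > 0$ except when $\alpha$ is arbitrarily close to the positive root of the digamma function (which is approximately 1.461632). Even for the values of $\alpha$ that are close to the positive root, the absolute error between true and approximate values is still acceptable ($|\hat{\psi}(\alpha) - \psi(\alpha)| < 0.005$ for $\alpha > 1$). Solving the equation that arises from this approximation yields the following estimate of $\alpha$:

$$\hat{\alpha} = \frac{3 - d + \sqrt{(d - 3)^{2} + 24 d}}{12 d}$$

If this approximate ML estimate is infeasible, then the method of moments is used. The $k$th raw moment of the gamma distribution is:

$$E[X^{k}] = \frac{\theta^{k}\, \Gamma(\alpha + k)}{\Gamma(\alpha)}, \quad k > -\alpha$$

Solving $E[X] = m_1$ and $E[X^{2}] = m_2$ yields the following initial value for $\alpha$:

$$\hat{\alpha} = \frac{m_1^{2}}{m_2 - m_1^{2}}$$

If $m_2 - m_1^{2} < \epsilon$ (almost zero sample variance), then $\alpha$ is initialized as follows:

$$\hat{\alpha} = 1$$

After computing the estimate of $\alpha$, the estimate of $\theta$ is computed as follows:

$$\hat{\theta} = \frac{m_1}{\hat{\alpha}}$$

Both the maximum likelihood method and the method of moments arrive at the same relationship between $\hat{\alpha}$ and $\hat{\theta}$.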
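The gamma initialization can be sketched in Python as follows. The function name is hypothetical. The text does not say when the approximate ML estimate counts as infeasible, so this sketch assumes that means the quantity $d = \log(m_1) - \frac{1}{n} \sum \log(y_i)$ is not safely positive. Substituting the digamma approximation into $\log(\alpha) - \psi(\alpha) = d$ gives the quadratic $6 d \alpha^{2} + (d - 3) \alpha - 1 = 0$, whose positive root is the closed-form estimate; the residual of that quadratic is used as a check.

```python
import math
import sys

# Assumption: Python's double-precision epsilon stands in for the SAS constant.
EPS = sys.float_info.epsilon


def gamma_initial_values(ys):
    """Return (theta, alpha): approximate ML first, method of moments otherwise."""
    n = len(ys)
    m1 = sum(ys) / n
    m2 = sum(y * y for y in ys) / n
    d = math.log(m1) - sum(math.log(y) for y in ys) / n
    if d > EPS:
        # Positive root of 6*d*alpha^2 + (d - 3)*alpha - 1 = 0.
        alpha = (3.0 - d + math.sqrt((d - 3.0) ** 2 + 24.0 * d)) / (12.0 * d)
    elif m2 - m1 * m1 < EPS:
        alpha = 1.0                        # almost zero sample variance
    else:
        alpha = m1 * m1 / (m2 - m1 * m1)   # method of moments
    return m1 / alpha, alpha


ys = [1.0, 2.0, 3.0, 4.0, 10.0]
theta, alpha = gamma_initial_values(ys)
d = math.log(sum(ys) / len(ys)) - sum(math.log(y) for y in ys) / len(ys)
residual = 6.0 * d * alpha ** 2 + (d - 3.0) * alpha - 1.0   # close to zero
```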
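For the GPD distribution, the parameters are initialized by using the method of moments. Notice that for $\xi > 0$, the CDF of the generalized Pareto distribution (GPD) is:

$$F(x) = 1 - \left(1 + \frac{\xi x}{\theta}\right)^{-1/\xi}$$

$$\phantom{F(x)} = 1 - \left(\frac{\theta / \xi}{x + \theta / \xi}\right)^{1/\xi}$$

This is equivalent to a Pareto distribution with scale parameter $\theta_1 = \theta / \xi$ and shape parameter $\alpha = 1 / \xi$. Using this relationship, the parameter initialization method used for the PARETO distribution is used to get the following initial values for the parameters of the GPD distribution:

$$\hat{\theta} = \frac{m_1 m_2}{2 (m_2 - m_1^{2})}, \quad \hat{\xi} = \frac{m_2 - 2 m_1^{2}}{2 (m_2 - m_1^{2})}$$

If $m_2 - m_1^{2} < \epsilon$ (almost zero sample variance) or $m_2 - 2 m_1^{2} < \epsilon$, then the parameters are initialized as follows:

$$\hat{\theta} = \frac{m_1}{2}, \quad \hat{\xi} = \frac{1}{2}$$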
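The link between the GPD and Pareto initial values can be verified numerically. The following Python sketch uses a hypothetical function name and assumes `sys.float_info.epsilon` as a stand-in for the SAS machine precision constant. It computes the GPD values and confirms that $\hat{\theta} / \hat{\xi}$ and $1 / \hat{\xi}$ equal the Pareto method-of-moments values $m_1 m_2 / (m_2 - 2 m_1^{2})$ and $2 (m_2 - m_1^{2}) / (m_2 - 2 m_1^{2})$.

```python
import sys

# Assumption: Python's double-precision epsilon stands in for the SAS constant.
EPS = sys.float_info.epsilon


def gpd_initial_values(m1, m2):
    """Return (theta, xi) for the GPD from the first two raw moments."""
    if m2 - m1 * m1 < EPS or m2 - 2.0 * m1 * m1 < EPS:
        return m1 / 2.0, 0.5
    theta = m1 * m2 / (2.0 * (m2 - m1 * m1))
    xi = (m2 - 2.0 * m1 * m1) / (2.0 * (m2 - m1 * m1))
    return theta, xi


# Hypothetical moments chosen so that both conditions are comfortably met.
m1, m2 = 1.0, 3.0
theta, xi = gpd_initial_values(m1, m2)

# Equivalent Pareto parameters implied by the GPD values.
pareto_scale = theta / xi   # expected: m1*m2 / (m2 - 2*m1^2)
pareto_shape = 1.0 / xi     # expected: 2*(m2 - m1^2) / (m2 - 2*m1^2)
```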
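For the IGAUSS distribution, the parameters are initialized by using the method of moments. Note that the standard parameterization of the inverse Gaussian distribution (also known as the Wald distribution), in terms of the location parameter $\mu$ and shape parameter $\lambda$, is as follows (Klugman, Panjer, and Willmot 1998, p. 583):

$$f(x) = \sqrt{\frac{\lambda}{2 \pi x^{3}}} \exp\left(\frac{-\lambda (x - \mu)^{2}}{2 \mu^{2} x}\right)$$

$$F(x) = \Phi\left(\left(\frac{x}{\mu} - 1\right) \sqrt{\frac{\lambda}{x}}\right) + \Phi\left(-\left(\frac{x}{\mu} + 1\right) \sqrt{\frac{\lambda}{x}}\right) \exp\left(\frac{2 \lambda}{\mu}\right)$$

For this parameterization, it is known that the mean is $E[X] = \mu$ and the variance is $\text{Var}[X] = \mu^{3} / \lambda$, which yields the second raw moment as $E[X^{2}] = \mu^{2} (1 + \mu / \lambda)$ (computed by using $E[X^{2}] = \text{Var}[X] + (E[X])^{2}$).
The predefined IGAUSS distribution in PROC SEVERITY uses the following alternate parameterization to allow the distribution to have a scale parameter, $\theta$:

$$f(x) = \frac{1}{\theta} \sqrt{\frac{\alpha}{2 \pi z^{3}}} \exp\left(\frac{-\alpha (z - 1)^{2}}{2 z}\right), \quad z = \frac{x}{\theta}$$

$$F(x) = \Phi\left((z - 1) \sqrt{\frac{\alpha}{z}}\right) + \Phi\left(-(z + 1) \sqrt{\frac{\alpha}{z}}\right) \exp(2 \alpha)$$

The parameters $\theta$ (scale) and $\alpha$ (shape) of this alternate form are related to the parameters $\mu$ and $\lambda$ of the preceding form such that $\theta = \mu$ and $\alpha = \lambda / \mu$. Using this relationship, the first and second raw moments of the IGAUSS distribution are:

$$E[X] = \theta$$

$$E[X^{2}] = \theta^{2} \left(1 + \frac{1}{\alpha}\right)$$

Solving $E[X] = m_1$ and $E[X^{2}] = m_2$ yields the following initial values:

$$\hat{\theta} = m_1, \quad \hat{\alpha} = \frac{m_1^{2}}{m_2 - m_1^{2}}$$

If $m_2 - m_1^{2} < \epsilon$ (almost zero sample variance), then the parameters are initialized as follows:

$$\hat{\theta} = m_1, \quad \hat{\alpha} = 1$$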
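A round-trip check illustrates the inverse Gaussian initialization. In the Python sketch below, the function name is hypothetical and `sys.float_info.epsilon` is an assumed stand-in for the SAS machine precision constant. The moments are generated from known values of the scale and shape parameters by using $E[X] = \theta$ and $E[X^{2}] = \theta^{2} (1 + 1/\alpha)$, and the initialization recovers those values.

```python
import sys

# Assumption: Python's double-precision epsilon stands in for the SAS constant.
EPS = sys.float_info.epsilon


def igauss_initial_values(m1, m2):
    """Return (theta, alpha) for the scale-parameterized inverse Gaussian."""
    if m2 - m1 * m1 < EPS:
        return m1, 1.0                     # almost zero sample variance
    return m1, m1 * m1 / (m2 - m1 * m1)


# Exact moments for the hypothetical values theta = 2 and alpha = 4.
true_theta, true_alpha = 2.0, 4.0
m1 = true_theta
m2 = true_theta ** 2 * (1.0 + 1.0 / true_alpha)

theta, alpha = igauss_initial_values(m1, m2)
```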
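For the LOGN distribution, the parameters are initialized by using the method of moments. The $k$th raw moment of the lognormal distribution is:

$$E[X^{k}] = \exp\left(k \mu + \frac{k^{2} \sigma^{2}}{2}\right)$$

Solving $E[X] = m_1$ and $E[X^{2}] = m_2$ yields the following initial values:

$$\hat{\mu} = 2 \log(m_1) - \frac{\log(m_2)}{2}, \quad \hat{\sigma} = \sqrt{\log(m_2) - 2 \log(m_1)}$$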
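The lognormal initial values can be checked with a round trip in Python. The function name is hypothetical, and clamping the quantity under the square root at zero is an added guard against rounding, not something the text specifies. Exact moments $m_1 = \exp(\mu + \sigma^{2}/2)$ and $m_2 = \exp(2 \mu + 2 \sigma^{2})$ are generated from known parameter values and then inverted.

```python
import math


def lognormal_initial_values(m1, m2):
    """Return (mu, sigma) from the first two raw moments."""
    mu = 2.0 * math.log(m1) - 0.5 * math.log(m2)
    # Added guard (an assumption): clamp tiny negative values caused by rounding.
    sigma = math.sqrt(max(0.0, math.log(m2) - 2.0 * math.log(m1)))
    return mu, sigma


# Exact moments for the hypothetical values mu = 0.5 and sigma = 0.8.
true_mu, true_sigma = 0.5, 0.8
m1 = math.exp(true_mu + true_sigma ** 2 / 2.0)
m2 = math.exp(2.0 * true_mu + 2.0 * true_sigma ** 2)

mu, sigma = lognormal_initial_values(m1, m2)
```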
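For the PARETO distribution, the parameters are initialized by using the method of moments. The $k$th raw moment of the Pareto distribution is:

$$E[X^{k}] = \frac{\theta^{k}\, \Gamma(k + 1)\, \Gamma(\alpha - k)}{\Gamma(\alpha)}, \quad -1 < k < \alpha$$

Solving $E[X] = m_1$ and $E[X^{2}] = m_2$ yields the following initial values:

$$\hat{\theta} = \frac{m_1 m_2}{m_2 - 2 m_1^{2}}, \quad \hat{\alpha} = \frac{2 (m_2 - m_1^{2})}{m_2 - 2 m_1^{2}}$$

If $m_2 - m_1^{2} < \epsilon$ (almost zero sample variance) or $m_2 - 2 m_1^{2} < \epsilon$, then the parameters are initialized as follows:

$$\hat{\theta} = m_1, \quad \hat{\alpha} = 2$$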
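As a worked check of the Pareto initialization, the Python sketch below computes the initial values and substitutes them back into the first two moment formulas, $E[X] = \theta / (\alpha - 1)$ and $E[X^{2}] = 2 \theta^{2} / ((\alpha - 1)(\alpha - 2))$. The function name and input moments are hypothetical, and `sys.float_info.epsilon` is an assumed stand-in for the SAS machine precision constant.

```python
import sys

# Assumption: Python's double-precision epsilon stands in for the SAS constant.
EPS = sys.float_info.epsilon


def pareto_initial_values(m1, m2):
    """Return (theta, alpha) for the Pareto from the first two raw moments."""
    if m2 - m1 * m1 < EPS or m2 - 2.0 * m1 * m1 < EPS:
        return m1, 2.0
    theta = m1 * m2 / (m2 - 2.0 * m1 * m1)
    alpha = 2.0 * (m2 - m1 * m1) / (m2 - 2.0 * m1 * m1)
    return theta, alpha


m1, m2 = 1.0, 3.0
theta, alpha = pareto_initial_values(m1, m2)

# Substitute back into the moment formulas; both should reproduce the inputs.
implied_m1 = theta / (alpha - 1.0)
implied_m2 = 2.0 * theta ** 2 / ((alpha - 1.0) * (alpha - 2.0))
```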
For the TWEEDIE distribution, the parameter $p$ is initialized by assuming that the sample is generated from a gamma distribution with shape parameter $\alpha$ and by computing $\hat{p} = (\hat{\alpha} + 2) / (\hat{\alpha} + 1)$. The initial value $\hat{\alpha}$ is obtained by using the method previously described for the GAMMA distribution. The parameter $\mu$ is the mean of the distribution. Hence, it is initialized to the sample mean as

$$\hat{\mu} = m_1$$

The variance of a Tweedie distribution is equal to $\phi \mu^{p}$. Thus, the sample variance is used to initialize the value of $\phi$ as

$$\hat{\phi} = \frac{m_2 - m_1^{2}}{m_1^{\hat{p}}}$$
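The TWEEDIE initialization can be sketched in Python as follows. The function name and input values are hypothetical, and the gamma shape estimate is taken as an input on the assumption that it was already computed by the GAMMA initialization method.

```python
def tweedie_initial_values(m1, m2, alpha_hat):
    """Return (mu, phi, p) from raw moments and a gamma shape estimate."""
    p = (alpha_hat + 2.0) / (alpha_hat + 1.0)   # inverts alpha = (2 - p)/(p - 1)
    mu = m1                                     # sample mean
    phi = (m2 - m1 * m1) / m1 ** p              # sample variance / mu^p
    return mu, phi, p


# Hypothetical inputs: m1 = 4, m2 = 26, and a gamma shape estimate of 1.
mu, phi, p = tweedie_initial_values(4.0, 26.0, 1.0)
implied_variance = phi * mu ** p                # should equal m2 - m1^2 = 10
```

Because the shape estimate is positive, the resulting index always lies strictly between 1 and 2.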
STWEEDIE is a compound Poisson-gamma mixture distribution with mean $\mu = \lambda \theta \alpha$, where $\alpha$ is the shape parameter of the gamma random variables in the mixture and the parameter $p$ is determined solely by $\alpha$. First, the parameter $p$ is initialized by assuming that the sample is generated from a gamma distribution with shape parameter $\alpha$ and by computing $\hat{p} = (\hat{\alpha} + 2) / (\hat{\alpha} + 1)$. The initial value $\hat{\alpha}$ is obtained by using the method previously described for the GAMMA distribution. As done for initializing the parameters of the TWEEDIE distribution, the sample mean and variance are used to compute the values $\hat{\mu}$ and $\hat{\phi}$ as

$$\hat{\mu} = m_1$$

$$\hat{\phi} = \frac{m_2 - m_1^{2}}{m_1^{\hat{p}}}$$

Based on the relationship between the parameters of TWEEDIE and STWEEDIE distributions described in the section Tweedie Distributions, values of $\theta$ and $\lambda$ are initialized as

$$\hat{\theta} = \hat{\phi}\, (\hat{p} - 1)\, \hat{\mu}^{\hat{p} - 1}$$

$$\hat{\lambda} = \frac{\hat{\mu}}{\hat{\theta} \hat{\alpha}}$$
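The STWEEDIE initialization can be sketched in Python in the same way. The function name and inputs are hypothetical, and the gamma shape estimate is assumed to come from the GAMMA initialization method. Because $\hat{p} - 1 = 1 / (\hat{\alpha} + 1)$, the two-step computation simplifies to $\hat{\theta} = (m_2 - m_1^{2}) / (m_1 (\hat{\alpha} + 1))$ and $\hat{\lambda} = m_1^{2} (\hat{\alpha} + 1) / ((m_2 - m_1^{2}) \hat{\alpha})$; the sketch compares both routes.

```python
def stweedie_initial_values(m1, m2, alpha_hat):
    """Return (theta, lam, p) from raw moments and a gamma shape estimate."""
    p = (alpha_hat + 2.0) / (alpha_hat + 1.0)
    mu = m1
    phi = (m2 - m1 * m1) / m1 ** p
    theta = phi * (p - 1.0) * mu ** (p - 1.0)   # gamma scale
    lam = mu / (theta * alpha_hat)              # Poisson mean
    return theta, lam, p


# Hypothetical inputs: m1 = 4, m2 = 26, and a gamma shape estimate of 1.
m1, m2, alpha_hat = 4.0, 26.0, 1.0
theta, lam, p = stweedie_initial_values(m1, m2, alpha_hat)

# Simplified closed forms for comparison.
theta_closed = (m2 - m1 * m1) / (m1 * (alpha_hat + 1.0))
lam_closed = m1 * m1 * (alpha_hat + 1.0) / ((m2 - m1 * m1) * alpha_hat)
```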
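For the WEIBULL distribution, the parameters are initialized by using the percentile matching method. Let $q1$ and $q3$ denote the estimates of the 25th and 75th percentiles, respectively. Using the formula for the CDF of the Weibull distribution, they can be written as

$$1 - \exp\left(-(q1 / \theta)^{\tau}\right) = 0.25$$

$$1 - \exp\left(-(q3 / \theta)^{\tau}\right) = 0.75$$

Simplifying and solving these two equations yields the following initial values:

$$\hat{\theta} = \exp\left(\frac{r \log(q1) - \log(q3)}{r - 1}\right), \quad \hat{\tau} = \frac{\log(\log(4))}{\log(q3) - \log(\hat{\theta})}$$

where $r = \log(\log(4)) / \log(\log(4/3))$. These initial values agree with those suggested in Klugman, Panjer, and Willmot (1998).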
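Percentile matching for the Weibull distribution can be checked with a round trip in Python. The function name and parameter values are hypothetical. Exact quartiles are generated from known parameters by inverting the CDF, $x = \theta (-\log(1 - p))^{1/\tau}$, and the initialization formulas recover those parameters.

```python
import math


def weibull_initial_values(q1, q3):
    """Return (theta, tau) by matching the 25th and 75th percentiles."""
    r = math.log(math.log(4.0)) / math.log(math.log(4.0 / 3.0))
    theta = math.exp((r * math.log(q1) - math.log(q3)) / (r - 1.0))
    tau = math.log(math.log(4.0)) / (math.log(q3) - math.log(theta))
    return theta, tau


# Exact quartiles for the hypothetical values theta = 2 and tau = 1.5.
true_theta, true_tau = 2.0, 1.5
q1 = true_theta * math.log(4.0 / 3.0) ** (1.0 / true_tau)
q3 = true_theta * math.log(4.0) ** (1.0 / true_tau)

theta, tau = weibull_initial_values(q1, q3)
```

In practice the quartiles would come from the sample percentile estimates rather than from known parameters.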
A summary of the initial values of all the parameters for all the predefined distributions is given in Table 23.3. The table also provides the names of the parameters to use in the INIT= option in the DIST statement if you want to provide a different initial value.
Table 23.3: Parameter Initialization for Predefined Distributions
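| Distribution | Parameter | Name for INIT option | Default Initial Value |
|---|---|---|---|
| BURR | $\theta$ | theta | $\sqrt{\frac{m_2 m_3}{2 m_3 - 3 m_1 m_2}}$ |
| | $\alpha$ | alpha | $1 + \frac{m_3}{2 m_3 - 3 m_1 m_2}$ |
| | $\gamma$ | gamma | $2$ |
| EXP | $\theta$ | theta | $m_1$ |
| GAMMA | $\theta$ | theta | $m_1 / \alpha$ |
| | $\alpha$ | alpha | $\frac{3 - d + \sqrt{(d - 3)^{2} + 24 d}}{12 d}$ |
| GPD | $\theta$ | theta | $\frac{m_1 m_2}{2 (m_2 - m_1^{2})}$ |
| | $\xi$ | xi | $\frac{m_2 - 2 m_1^{2}}{2 (m_2 - m_1^{2})}$ |
| IGAUSS | $\theta$ | theta | $m_1$ |
| | $\alpha$ | alpha | $\frac{m_1^{2}}{m_2 - m_1^{2}}$ |
| LOGN | $\mu$ | mu | $2 \log(m_1) - \frac{\log(m_2)}{2}$ |
| | $\sigma$ | sigma | $\sqrt{\log(m_2) - 2 \log(m_1)}$ |
| PARETO | $\theta$ | theta | $\frac{m_1 m_2}{m_2 - 2 m_1^{2}}$ |
| | $\alpha$ | alpha | $\frac{2 (m_2 - m_1^{2})}{m_2 - 2 m_1^{2}}$ |
| TWEEDIE | $\mu$ | mu | $m_1$ |
| | $\phi$ | phi | $\frac{m_2 - m_1^{2}}{m_1^{p}}$ |
| | $p$ | p | $\frac{\alpha + 2}{\alpha + 1}$ |
| | | | where $\alpha = \frac{3 - d + \sqrt{(d - 3)^{2} + 24 d}}{12 d}$ |
| STWEEDIE | $\theta$ | theta | $\frac{m_2 - m_1^{2}}{m_1 (\alpha + 1)}$ |
| | $\lambda$ | lambda | $\frac{m_1^{2} (\alpha + 1)}{(m_2 - m_1^{2})\, \alpha}$ |
| | $p$ | p | $\frac{\alpha + 2}{\alpha + 1}$ |
| | | | where $\alpha = \frac{3 - d + \sqrt{(d - 3)^{2} + 24 d}}{12 d}$ |
| WEIBULL | $\theta$ | theta | $\exp\left(\frac{r \log(q1) - \log(q3)}{r - 1}\right)$ |
| | $\tau$ | tau | $\frac{\log(\log(4))}{\log(q3) - \log(\hat{\theta})}$ |

Notes:

- $m_k$ denotes the $k$th raw moment.
- $d = \log(m_1) - \frac{1}{n} \sum \log(y_i)$.
- $q1$ and $q3$ denote the 25th and 75th percentiles, respectively.
- $r = \log(\log(4)) / \log(\log(4/3))$.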