24166 - Estimate parameters of various distributions for use in RAND, PDF, or related functions

Usage Note 24166: Estimate parameters of various distributions for use in RAND, PDF, or related functions

With a sample of data from a single population, you can estimate the parameter(s) of several distributions by fitting an intercept-only model in the GENMOD, GLIMMIX, FMM or other procedure. The parameters displayed by the procedure (intercept, scale, dispersion or other parameters) are the estimated parameters of the distribution you specified in the procedure (usually in the DIST= option). However, the form (parameterization) of the distribution might not be the same as appears in statistical literature and might not be the same as the parameterization used in other procedures or in the RAND, PDF, QUANTILE, or related functions available in the DATA step. The parameterization used by the modeling procedure is typically shown in the Details section of the procedure's documentation.

If predictors are specified in the model, then the settings of the predictors in the observations define multiple distributions of the type specified in the procedure with each distribution having a different mean. The predicted value for an observation (usually available from an OUTPUT statement) provides the estimated mean of that observation's distribution. Prediction intervals for future values in each distribution can be obtained from the QUANTILE function using the estimated distribution parameters. This can be done for the distributions below and several others using the GLMPI macro (SAS Note 69692).

The following examples illustrate using data from a single population to estimate the parameters of various theoretical distributions. The estimated parameters in the distribution parameterization used by the modeling procedure are then transformed as needed to those used in the parameterization used in the RAND, PDF, and related functions as described in the Base SAS^® documentation (SAS Note 22930).

Poisson

Suppose you have Poisson data values in variable Y, where the Poisson mass function is f(y) = λ^ye^-λ/y! as documented in the PDF and RAND functions. The mean and variance of the Poisson distribution equal λ. The Poisson distribution is a discrete distribution that is commonly used to model count data. Examples of using the Poisson distribution in modeling can be seen in SAS Note 56549.

The following statements fit the intercept-only Poisson model to the Y data. Since the default link function for the Poisson distribution is the log link, the model is log(λ) = intercept. Exponentiating the intercept provides an estimate of λ. The OUTPUT statement saves the mean estimate, along with 95% confidence limits, in data set OUT. The ODS OUTPUT statement saves the intercept estimate in data set PE. In the DATA step below, a point estimate for λ is computed from the intercept. The PDF function is then used to obtain an estimate of the probability that Y=0. The 90th percentile is estimated using the QUANTILE function. Finally, ten random observations are produced from the Poisson distribution with mean λ.

   proc genmod;
      model y = / dist=poisson;
      ods output parameterestimates=pe;
      output out=out p=lambda l=lower u=upper;
      run;
   proc print data=out(obs=1);
      var lambda lower upper;
      run;
   proc transpose data=pe out=tpe;
      var estimate;
      id parameter;
      run;
   data tpe;
      set;
      lambda = exp(intercept);
      mean = lambda;
      var = lambda;
      PrYeq0 = pdf("poisson",0,lambda);
      Pct90 = quantile("poisson",.90,lambda);
      do i=1 to 10; 
        RandomY = rand("poisson",lambda);
        output;
      end;
      run;
   proc print;
      run;

Negative Binomial

Suppose you have negative binomial data values in variable Y. The negative binomial mass function is f(y) = (^y+k-1C_k-1)p^k(1-p)^y as documented in the PDF and RAND functions, where (^y+k-1C_k-1) is the number of combinations of y+k-1 items taken k-1 at a time. The negative binomial distribution is a discrete distribution that is commonly used to model overdispersed count data. Examples of using the negative binomial distribution in modeling can be seen in SAS Note 56549.

The following PROC GENMOD step fits the intercept-only negative binomial model and estimates an intercept and dispersion parameter. As for the Poisson distribution above, the log is the default link function for this distribution. The model is log(mean) = intercept so exponentiating the intercept provides an estimate of the mean. The DATA step transforms the intercept and dispersion parameter estimates into the parameters (NB_p and NB_k) in the parameterization used by the RAND, PDF, and related functions. These parameters are then used to compute estimates of the mean (in two equivalent ways) and variance of the distribution. Estimates of the probability that Y=0 and the 90th percentile are produced by the PDF and QUANTILE functions. Ten random observations from the distribution are generated using the RAND function.

   proc genmod;
      model y = / dist=negbin;
      ods output parameterestimates=pe;
      run;
   proc transpose data=pe out=tpe;
      var estimate;
      id parameter;
      run;
   data tpe;
      set;
      NB_k = 1/dispersion;
      NB_p = 1/(1+exp(intercept)*dispersion);
      NB_mean_1 = exp(intercept);
      NB_mean_2 = nb_k*(1-nb_p)/nb_p;
      NB_var = nb_k*(1-nb_p)/nb_p**2;
      PrYeq0 = pdf("negbinomial",0,nb_p,nb_k);
      Pct90 = quantile("negbinomial",.90,nb_p,nb_k);
      do i=1 to 10; 
        RandomY = rand("negbinomial",nb_p,nb_k);
        output;
      end;
      run;
   proc print;
      run;

Geometric

The geometric mass function takes two similar forms: f(x) = p(1-p)^x-1, x=1,2,... as documented in the RAND function with mean 1/p, or f(y) = p(1-p)^y, y=0,1,2,... as documented in the PDF function with mean (1/p)-1. The variance of both distributions is (1-p)/p². Note that these distributions are shifted relative to one another with y = x-1. Also note that the RAND function generates values starting at 1 rather than 0, but support for the distribution used in the PDF function begins at 0. The geometric distribution is equivalent to the negative binomial when the negative binomial dispersion parameter, k, equals one. The GENMOD procedure uses yet another parameterization that includes 0 in its support.

The following statements generate values from each of the two forms of the geometric distribution with parameter p=0.5. Note that values from the zero-based form (variable Y) can be generated by subtracting 1 from values generated by RAND("geometric",p) as shown below or with RAND("negbinomial",p,1.).

   data GeoSamp;
      do n=1 to 10000;
        call streaminit(23422);
        x=rand('geometric',.5); *x=1,2,3,...;
        y=x-1;                  *y=0,1,2,...;
        output;
      end;
      run;

The DATA step below computes the true population mean, variance, probability of 1, and 90th percentile using the true population parameter, p=0.5. Since the PDF and QUANTILE functions use the parameterization for Y=X-1, the 90th percentile for X is obtained by adding 1 to the 90th of Y. Similarly, the probability that X=1 is obtained by computing the probability that Y=0. In the MEANS and FREQ steps that follow, estimates of the means, variances, and 90th percentiles as well as counts at each observed value are computed from the generated sample of data for both X and Y.

   data parms;
      p=0.5;
      Variable='X';
         Mean = 1/p;
         Var = (1-p)/p**2;
         PrEq1 = pdf("geometric",0,p); 
         Pct90 = quantile("geometric",.90,p)+1;
         output;
      Variable='Y';
         Mean = 1/p - 1;
         Var = (1-p)/p**2;
         PrEq1 = pdf("geometric",1,p); 
         Pct90 = quantile("geometric",.90,p);
         output;
      run;
   proc print data=parms;
      id Variable;
      var p mean var preq1 pct90;
      title "Geometric population statistics";
      run;
   proc means data=GeoSamp mean var min max p90;
      var x y;
      title "Geometric sample statistics";
      run;
   proc freq data=GeoSamp;
      tables x y;
      title "Geometric sample statistics";
      run;

Notice that the sample statistics computed from the 10,000 observations in the generated sample closely match the true population values.

Geometric population statistics

Variable	p	Mean	Var	PrEq1	Pct90
X	0.5	2	2	0.50	4
Y	0.5	1	2	0.25	3

Geometric sample statistics

The MEANS Procedure

x
y

2.0174000

1.0174000

2.0439016

1.0000000

14.0000000

13.0000000

4.0000000

3.0000000

Geometric sample statistics

x	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1	4982	49.82	4982	49.82
2	2485	24.85	7467	74.67
3	1233	12.33	8700	87.00
4	631	6.31	9331	93.31
5	324	3.24	9655	96.55
6	178	1.78	9833	98.33
7	84	0.84	9917	99.17
8	48	0.48	9965	99.65
9	21	0.21	9986	99.86
10	8	0.08	9994	99.94
11	4	0.04	9998	99.98
12	1	0.01	9999	99.99
14	1	0.01	10000	100.00

y	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	4982	49.82	4982	49.82
1	2485	24.85	7467	74.67
2	1233	12.33	8700	87.00
3	631	6.31	9331	93.31
4	324	3.24	9655	96.55
5	178	1.78	9833	98.33
6	84	0.84	9917	99.17
7	48	0.48	9965	99.65
8	21	0.21	9986	99.86
9	8	0.08	9994	99.94
10	4	0.04	9998	99.98
11	1	0.01	9999	99.99
13	1	0.01	10000	100.00

The following PROC GENMOD step fits the intercept-only geometric model to the X data. The log is the default link function for this distribution. As a result, the model is log(mean) = log(1/p) = intercept. Exponentiating the intercept provides an estimate of the mean (also available from the P= option in the OUTPUT statement). The reciprocal of the exponentiated intercept is an estimate of p. In the DATA step, the above statistics are computed using the estimate of p from GENMOD on the generated sample.

   proc genmod data=GeoSamp;
      model x = / dist=geometric;
      ods output parameterestimates=pe;
      run;
   proc transpose data=pe out=tpe;
      var estimate;
      id parameter;
      run;
   data Parms;
      set tpe;
      Variable = 'X';
      p = 1/exp(intercept);
      Mean = exp(intercept);
      Var = (1-p)/p**2;
      PrXeq1 = pdf("geometric",0,p); 
      Pct90 = quantile("geometric",.90,p)+1;
      run;
   proc print;
      id Variable;
      var intercept p mean var prxeq1 pct90;
      title "Estimated geometric statistics from sample data";
      title2 "X = 1,2,3, ...";
      run;

Notice that the estimates of the mean, variance, probability of 1 and 90th percentile for variable X all closely match the values computed by the MEANS and FREQ procedures and the true population values shown above.

Estimated geometric statistics from sample data

X = 1,2,3, ...

Variable	Intercept	p	Mean	Var	PrXeq1	Pct90
X	0.7018	0.49569	2.0174	2.05250	0.49569	4

The following PROC GENMOD step fits the intercept-only geometric model to the Y data. The model is the same as above, but since Y=X-1, the mean of Y is also shifted down and is now (1/p)-1. As a result, the geometric parameter, p, is now estimated by 1/(1+exp(intercept)). The DATA step again computes the above statistics using the estimate of p from GENMOD on the generated sample.

   proc genmod data=GeoSamp;
      model y = / dist=geometric;
      ods output parameterestimates=pe;
      run;
   proc transpose data=pe out=tpe;
      var estimate;
      id parameter;
      run;
   data Parms;
      set tpe;
      Variable = 'Y';
      p = 1/(1+exp(intercept));
      Mean = 1/p - 1;
      Var = (1-p)/p**2;
      PrYeq1 = pdf("geometric",1,p); 
      Pct90 = quantile("geometric",.90,p);
      run;
   proc print;
      id Variable;
      var intercept p mean var pryeq1 pct90;
      title "Estimated geometric statistics from sample data";
      title2 "Y = 0,1,2, ...";
      run;

As for variable X, the estimates of the mean, variance, probability of 1 and 90th percentile for variable Y all closely match the values computed by the MEANS and FREQ procedures and the true population values shown above.

Estimated geometric statistics from sample data

Y = 0,1,2, ...

Variable	Intercept	p	Mean	Var	PrYeq1	Pct90
Y	0.0173	0.49569	1.0174	2.05250	0.24998	3

Gamma

Examples of using the gamma distribution in modeling can be seen in SAS Note 60335. The gamma distribution for modeling positive continuous data and a modification of it are discussed in SAS Note 68202.

The most common parameterization for the gamma distribution is as a function of a shape parameter, a, and a scale parameter, λ, as shown in the description of the RAND('gamma') function. This is the parameterization used for the gamma distribution in PROC NLMIXED, which directly reports estimates for those parameters. The mean of the gamma distribution is aλ in this parameterization and its variance is aλ². Note that sometimes the scale parameter is β=1/λ.

   proc nlmixed;
      model y ~ gamma(a,lambda);
      run;

Estimates for the a and λ parameters can also be obtained from other procedures such as GENMOD or GLIMMIX, which use a slightly different parameterization for the gamma distribution as described in detail in SAS Note 33068. The following statements fit the intercept-only gamma model. The default link function for the gamma distribution is the inverse, but the log link is most commonly used. With the log link, the model is log(mean) = log(aλ) = intercept. In the DATA step below, the mean and variance are estimated using the intercept and scale parameters provided by GENMOD and using the a and λ parameters derived from them. The median (50th percentile) is then estimated using the QUANTILE function. The SDF function is also used to estimate the probability that Y>20. Note that the similar CDF function computes the probability of smaller values than specified rather than larger as provided by the SDF function. Finally, ten random observations are generated by the RAND function using the estimated gamma parameters.

   proc genmod;
      model y= / dist=gamma link=log;
      ods output parameterestimates=pe;
      run;
   proc transpose data=pe out=tpe;
      id parameter; var estimate;
      run;
   data gamma;
      set tpe;
      a = scale;
      lambda = exp(intercept)/scale;
      mean_alam = lambda*a;
      mean_munu = exp(intercept);
      var_alam = a*lambda**2;
      var_munu = exp(intercept)**2/scale;
      median = quantile('gamma',.5,a,lambda);
      PrGT20 = sdf("gamma",20,a,lambda);
      do i=1 to 10;
        RanGam = rand("gamma",a,lambda);
        output;
      end;
      run;
   proc print;
      run;

Generalized Poisson

Suppose you have generalized Poisson data values in variable Y. The generalized Poisson mass function is shown in the documentation of the PDF function and in the Details section of the FMM procedure documentation. Like the Poisson and negative binomial distributions, the generalized Poisson distribution is a discrete distribution that is commonly used to model count data and can accommodate under- and overdispersion in the data. Examples of using the generalized Poisson distribution can be seen in SAS Note 68202 and SAS Note 56549.

The following PROC FMM step fits the intercept-only generalized Poisson model and estimates an intercept and scale parameter. The log is the default link function for this distribution. The model is log(mean) = intercept so exponentiating the intercept provides an estimate of the mean. The DATA step below transforms the intercept and dispersion parameter estimates into the parameters (θ and η) in the parameterization used by the RAND, PDF, and related functions. These parameters are then used to compute estimates of the mean and variance of the distribution. Under this parameterization, the mean is θ/(1-η) and the variance is θ/(1-η)³. Estimates of the probability that Y=0, the probability that Y>15, and the median are produced by the SDF, PDF, and QUANTILE functions. Ten random observations from the distribution are generated using the RAND function.

   proc fmm;
      model y= / dist=genpoisson;
      ods output parameterestimates=pe;
      run;
   proc transpose data=pe out=tpe;
      id parameter; var estimate;
      run;
   data genpoi;
      set tpe;
      theta = exp(intercept)*exp(-scale); 
      eta = 1-exp(-scale);
      mean = exp(intercept);
      var = mean/(1-eta)**2;
      PrYeq0 = pdf("genpoisson",0,theta,eta);
      med = quantile("genpoisson",0.5,theta,eta);
      PrGT15 = sdf("genpoisson",15,theta,eta);
      do i=1 to 10;
        RanGP = rand("genpoisson",theta,eta);
        output;
      end;
      run;
   proc print;
      run;

Tweedie

Suppose you have Tweedie data values in variable Y. The Tweedie density function is shown in the documentation of the PDF function and in the Details section of the GENMOD procedure documentation. Details of the distribution are further discussed in "Tweedie Distribution for Generalized Linear Models" in the Details section of the GENMOD documentation. The Tweedie distribution is a continuous distribution for positive data that might also have a point mass at zero. Examples of using the generalized Poisson distribution can be seen in SAS Note 68202 and SAS Note 60335.

The following PROC GENMOD step fits the intercept-only Tweedie model and estimates an intercept and power (p) and dispersion (φ) parameters. The log is the default link function for this distribution. The model is log(mean) = intercept so exponentiating the intercept provides an estimate of the mean. These parameters can be used directly in the RAND, PDF, and related functions. The DATA step below uses these parameters to compute estimates of the mean and variance of the distribution. Estimates of the probability that Y>60 and the median are produced by the SDF and QUANTILE functions. Ten random observations from the distribution are generated using the RAND function. Some rounding of the power parameter is done to avoid computational problems in generating random values.

   proc genmod;
     model y= / dist=tweedie;
     ods output parameterestimates=pe;
     run;
   proc transpose data=pe out=tpe;
     id parameter; var estimate;
     run;
   data twee;
     set tpe;
     mean = exp(intercept);
     p = power;
     phi = dispersion;
     var = phi*mean**p;
     med = quantile("tweedie",0.5,p,mean,phi);
     PrGT60 = sdf("tweedie",60,p,mean,phi);
     do i=1 to 10;
       RanTW = rand("tweedie",round(p,.001),mean,phi);
       output;
     end;
     run;
   proc print;
      run;

Weibull

The Weibull distribution is often used to model failure time data. That can be done using the LIFEREG procedure, as illustrated below, but the NLMIXED procedure can also model Weibull data.

Suppose you have Weibull data values from a single population in variable Y. The most common parameterization for the Weibull distribution is as a function of a shape parameter, k, and a scale parameter, λ, as shown in the descriptions of the RAND('weibull') and PDF('weibull') functions. This parameterization is used in NLMIXED statements below. An alternative but equivalent parameterization, used in PROC LIFEREG, has λ = exp(xbeta), where xbeta is a model specification that defines one or more Weibull-distributed populations. When the model contains only an intercept, only a single population is defined as is appropriate for this scenario.

The following PROC NLMIXED step fits an intercept-only model to the data. In the DATA step that follows, the mean and variance are computed using appropriate formulas involving the k and λ parameters estimated by NLMIXED. The median (50th percentile) is then estimated using the QUANTILE function. The SDF function is also used to estimate the probability that Y>1.25. Note that the similar CDF function computes the probability of smaller values than specified rather than larger as provided by the SDF function. Finally, ten random observations are generated by the RAND function using the estimated Weibull parameters.

   proc nlmixed; 
      xbeta=b0;
      lambda=exp(xbeta);
      ll=log((k/lambda)*((y/lambda)**(k-1))*exp(-(y/lambda)**k)); 
      model y ~ general(ll); 
      estimate 'mean(y)' lambda*gamma(1+1/k); 
      estimate 'lambda' lambda; 
      estimate 'scale' 1/k;
      ods output parameterestimates=wpe;
      run; 
   proc transpose data=wpe out=twpe;
      id parameter; var estimate;
      run;
   data weibull;
      set twpe;
      lambda=exp(b0);
      mean = lambda*gamma(1+1/k);
      var = lambda**2*(gamma(1+2/k)-(gamma(1+1/k))**2);
      median = quantile('weibull',.5,k,lambda);
      PrGT1_25 = sdf("weibull",1.25,k,lambda);
      do i=1 to 10;
        RanWei = rand("weibull",k,lambda);
        output;
      end;
      run;
   proc print;
      run;

The same can be done using the LIFEREG procedure to estimate the Weibull distribution parameters and then compute the mean, variance, and other values of interest for each population. The following statements fit the same intercept-only model and save the model parameter estimates using an ODS OUTPUT statement. The table of parameter estimates includes estimates of the Weibull shape (k) and scale (λ) parameters. The evaluated model, xbeta, and an estimate of the response median, p50, are computed and saved by the OUTPUT statement. The xbeta value is an estimate of log(λ). Since the only model parameter is the intercept, all observations have the same xbeta and median values, so only one observation needs to be retained from this data set.

     proc lifereg;
        model y= / dist=weibull;
        output out=xb xbeta=loglambda p=p50;
        ods output parameterestimates=pe;
        run;
     data weibull;
        if _n_=1 then set pe(where=(parameter ? 'Shape') rename=(estimate=k));
        set xb(where=(i=1));  
        lambda=exp(loglambda);
        mean=lambda*gamma(1+1/k);
        variance=lambda**2*(gamma(1+2/k)-(gamma(1+1/k))**2);
        median = quantile('weibull',.5,k,lambda);
        PrGT1_25 = sdf("weibull",1.25,k,lambda);
        do i=1 to 10;
          RanWei = rand("weibull",k,lambda);
          output;
        end;
        run;
     proc print;
        run;

Operating System and Release Information

Product Family	Product	System	SAS Release
			Reported	Fixed*
SAS System	SAS/STAT	All	n/a

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:	low
Topic:	SAS Reference ==> Procedures ==> GENMOD Analytics ==> Distribution Analysis Analytics ==> Descriptive Statistics Analytics ==> Simulation SAS Reference ==> Procedures ==> NLMIXED SAS Reference ==> Functions ==> Probability ==> PDF SAS Reference ==> Functions ==> Probability ==> SDF SAS Reference ==> Quantile ==> Quantile ==> QUANTILE SAS Reference ==> Procedures ==> FMM SAS Reference ==> Procedures ==> GLIMMIX SAS Reference ==> Functions ==> Random Number ==> RAND SAS Reference ==> Functions ==> Probability ==> CDF SAS Reference ==> Functions ==> Probability ==> LOGCDF SAS Reference ==> Functions ==> Probability ==> LOGPDF SAS Reference ==> Functions ==> Probability ==> LOGSDF SAS Reference ==> Functions ==> Probability ==> POISSON SAS Reference ==> Functions ==> Probability ==> PROBGAM SAS Reference ==> Functions ==> Probability ==> PROBNEGB SAS Reference ==> Procedures ==> LIFEREG

Date Modified:	2022-12-03 08:16:18
Date Created:	2004-09-13 14:29:35

Support