There are several response distributions available when fitting generalized linear models (GLMs) in procedures such as GENMOD, GLIMMIX, and others. The distributions include the normal, Poisson, gamma, inverse Gaussian, Tweedie, and others. One important aspect of how these distributions differ is in the relationship between the mean and the variance. For the normal distribution, the variance is not related to the mean. The mean and variance are the same in the Poisson. In the gamma distribution, the variance is the square of the mean. The variance is the mean cubed in the inverse Gaussian. The Tweedie distribution includes a power parameter which defines a family of distributions that can have any of the above mean-variance relationships, as well as relationships in between.
When it comes to selecting a distribution to use when modeling a response, choosing a distribution that matches the observed mean-variance relationship is clearly important. Discussed below are some tools for finding a suitable response distribution if it is not already known. First, graphical methods using the observed data can be used to visually assess the nature of the relationship. Alternatively, assuming that the variance is proportional to a power of the mean, V(y)=φμp, then one can attempt to estimate p. One way to do this is using a modification of Park's test for heteroscedasticity. Manning and Mullahy (2001) discuss this modification of the test and its use. The idea is to use the predicted values and residuals from an initial log-linked GLM to model the variance as a function of the mean using another log-linked GLM of the form
ln(V(y)) = c + p*ln(μ) | (1) |
If the estimate of p is near 0, then this suggests that a normal response is reasonable. For p near 1, the response distribution is similar to the Poisson, but continuous rather than discrete. Note that PROC GENMOD does not require that the response be discrete in order to use the Poisson response distribution. The gamma distribution is suggested for p near 2, and the inverse Gaussian for p near 3. The method assumes independent observations, that the mean model is a log-linked model, that the response is positive with no censoring or truncation, and that exposures do not vary (as when using an offset). The variance is assumed to be a function of the covariates through the mean, not directly of the covariates.
An alternative to the modified Park method is to fit the desired model using a distribution which encompasses a range of powers. The Tweedie distribution assumes that the variance is proportional to a power, p, of the mean. For p=0, 1, 2, or 3, the Tweedie is equivalent to the above distributions. But it also accommodates noninteger powers so you do not have to choose between, say, the gamma and inverse Gaussian if p=2.5. Using the Tweedie distribution, PROC GENMOD can estimate the power for p>1.1. For more information on the Tweedie distribution, see "Tweedie Distribution For Generalized Linear Models" in the Details section of the GENMOD documentation.
In the following examples, the methods discussed above are illustrated using simulated data from known distributions and mean models and also using observed data.
The statements below produce a data set of values yij ~ gamma(a,λi), where a=e2, λi=e0.1xi, i=1, 2, ..., 10, j=1, 2, ..., 1000. The ten gamma-distributed populations correspond to the levels of predictor, xi=i. As shown in this note, the means of the populations are μi = aλi. The true log-linked model for the mean is then
log(μi) = 2 + 0.1xi | (2) |
The population variances are aλi2. Notice that the variances are proportional to the square of the means since V(y)= 1 a μ2. As a result, we expect to obtain an estimate of the power, p, near 2. In the following code, the use of the STREAMINIT function allows the random stream of gamma variates and subsequent analyses to be reproduced.
data P2; call streaminit(48945); keep id x y; a=exp(2); do i=1 to 10; do j=1 to 1000; x=i; lambda=exp(0.1*x); y=lambda*rand("gamma",a); id+1; output; end; end; run;
Graphical assessment
The mean-variance relationship can be assessed graphically using the following statements. For each of the Poisson, gamma, and inverse Gaussian distributions, the variance is proportional to a power of the mean with the square or reciprocal of the distribution's scale parameter as the proportionality constant. See "Generalized Linear Models Theory" in the Details section of the GENMOD documentation. To plot the mean-variance relationships under these distributions for comparison to the observed data, a model using each of the distributions is fit and the estimated scale parameter is obtained. Since the Poisson distribution does not include a scale parameter, one is added and estimated from the deviance using the SCALE=D option.
proc genmod data=P2; model y=x / dist=igaussian link=log; run; proc genmod data=P2; model y=x / dist=gamma link=log; run; proc genmod data=P2; model y=x / dist=poisson link=log scale=d; run;
The estimated variance for each distribution is then computed for each population using the observed mean and the estimated scale.
proc means data=P2 nway mean var min max; class x; var y; output out=MV2 mean=Mean var=Variance; run; data MV2; set MV2; v3=Mean**3 * 0.1128**2; * estimate inverse Gaussian variance; v2=Mean**2 * (1/7.4134); * estimate gamma variance; v1=Mean * 1.3226**2; * estimate Poisson-like variance; run;
Finally, these statements plot the observed means and variances and the relationships under the various distributions representing powers of the mean.
proc sort data=MV2; by Mean; run; proc sgplot data=MV2 cycleattrs noautolegend; loess y=Variance x=Mean / lineattrs=(thickness=2 pattern=solid) curvelabel="Observed"; spline y=v1 x=Mean / lineattrs=(pattern=5) curvelabel="p=1"; spline y=v2 x=Mean / lineattrs=(pattern=4) curvelabel="p=2"; spline y=v3 x=Mean / lineattrs=(pattern=2) curvelabel="p=3"; yaxis label="Variance"; run;
The observed mean-variance relationship, approximated with a smoothed, loess curve through the observed values, seems closest to the p=2 curve, suggesting the gamma distribution.
Modified Park method
The method begins by estimating an initial log-linked GLMNote1 to the data and saving the predicted values and raw residuals in a data set. The gamma distribution is used below for the initial model. To model the variance, the natural log of the predicted values and of the raw residuals are computed. PROC REG is then used to fit the variance model (1) aboveNote2. The power of the mean, p, is the estimated parameter for Lnp. The CLB option provides a confidence interval for p.
proc genmod data=P2; model y=x / dist=gamma link=log; output out=P2predres p=p resraw=rr; run; data P2predres; set P2predres; Lnp=log(p); Lnrr2=log(rr**2); run; proc reg data=P2predres plots=none; model Lnrr2 = Lnp / clb; run; quit;
The estimated power is 1.97 with a 95% confidence interval of (1.82, 2.12). This estimate is very near 2 as expected for gamma distributed data and is in agreement with the visual assessment above.
|
The final model then is the same as the initial gamma model above. Notice that the estimated parameters closely match the true log-linked mean model (2). The scale parameter, 7.41 = a = e2 as shown in this note.
|
Tweedie model
Since the graphical assessment above suggests that the power of the mean is at least 1.1, a GLM using the Tweedie distribution can be used to simultaneously estimate the power and fit the model.
proc genmod data=P2; model y=x / dist=tweedie link=log; run;
The estimated power is again very close to 2 – 1.96 with 95% confidence interval (1.89, 2.03) – and the estimated parameters are essentially identical to those from the gamma model above.
|
A data set is generated containing inverse Gaussian random values yij ~ IG(μi,σ), where μi=e1+x, x=0.2, 0.4, ..., 2, and σ=1. The true log-linked model for the mean is then log(μi)=1+xi. The population variances are σμi3, which is proportional to μ3 so the estimated power is expected to near 3.
Graphical assessment
As in the previous example, models are fit to estimate the Poisson, gamma, and inverse Gaussian scale parameters and a plot is produced of the observed population means and variances along with curves showing the expected mean-variance relationships under each of those distributions. The resulting plot correctly suggests that the inverse Gaussian distribution closely approximates the observed mean-variance relationship.
Modified Park method
Applying the modified Park method as shown in the previous example, the estimated power is 2.39 with 95% confidence interval (2.33, 2.45). This suggests that the distribution is somewhere between gamma and inverse Gaussian.
|
Given that the mean-variance plot points more strongly to the inverse Gaussian, that distribution is selected for the final model. The estimated model parameters closely match the true values.
|
Tweedie model
Since the power of the mean-variance relationship is clearly shown by the above plot and modified Park method to be greater than 1.1, it can be estimated and the data modeled by fitting a log-linked Tweedie model in PROC GENMOD. The power estimate, 2.98 with 95% confidence interval (2.95, 3.01), is very close to its true value. The estimated parameters of the model are also very close to their true values.
|
The sashelp.baseball data set contains salary and performance information for Major League Baseball players (excluding pitchers) who played at least one game in both the 1986 and 1987 seasons. The salaries are for the 1987 season (Sports Illustrated, April 20, 1987), and the performance measures are from the 1986 season (Collier Books, The 1987 Baseball Encyclopedia Update). Salaries will be modeled as a function of several predictors – League, Division, career runs (CrRuns), number of times at bat (nAtBat), number of hits (nHits), and number of outs (nOuts). An appropriate response distribution must be chosen for the log-linked model that will be fit.
Graphical assessment
Graphical assessment proceeds similarly to the first example above. First, the p=1, 2, and 3 models are fit to obtain their scale parameters for use in computing the variances under the three distributions.
proc genmod data=sashelp.baseball; class League Division; model Salary=League Division CrRuns nAtBat nHits nOuts / dist=igaussian link=log; run; proc genmod data=sashelp.baseball; class League Division; model Salary=League Division CrRuns nAtBat nHits nOuts / dist=gamma link=log; run; proc genmod data=sashelp.baseball; class League Division; model Salary=League Division CrRuns nAtBat nHits nOuts / dist=poisson link=log scale=d; run;
Since the continuous predictors have a large number of distinct values, they are each categorized into quintiles using PROC RANK before the means and variances of the resulting populations are computed and plotted. The observed mean-variance relationship was found not to be close to cubic, so only curves for p=1 and 2 are shown in the plot.
proc rank data=sashelp.baseball out=rankbb groups=5; var CrRuns nAtBat nHits nOuts; run; proc means data=rankbb nway mean var min max; class League Division CrRuns nAtBat nHits nOuts; var salary; output out=obsmv mean=Mean var=Variance; run; data obsmv; set obsmv; where Variance ne .; v3=Mean**3 * 0.0285**2; * estimate inverse Gaussian variance; v2=Mean**2 * (1/3.3558); * estimate gamma variance; v1=Mean * 12.0001**2; * estimate Poisson-like variance; run; proc sort data=obsmv; by Mean; run; proc sgplot data=obsmv cycleattrs noautolegend; loess y=Variance x=Mean / lineattrs=(thickness=2 pattern=solid) curvelabel="Observed"; spline y=v1 x=Mean / lineattrs=(pattern=5) curvelabel="p=1"; spline y=v2 x=Mean / lineattrs=(pattern=4) curvelabel="p=2"; yaxis label="Variance"; xaxis label="Mean"; run;
The plot suggests the mean-variance relationship is close to quadratic, so the gamma distribution seems reasonable.
Modified Park method
As before, an initial log-linked gamma model is used to obtain predicted and residual values which are then transformed and modeled to estimate the power of the mean-variance relationship.
proc genmod data=sashelp.baseball; class League Division; model Salary=League Division CrRuns nAtBat nHits nOuts / dist=g link=log; output out=bb p=p resraw=rr; run; data bb; set bb; Lnp=log(p); Lnrr2=log(rr**2); run; proc reg data=bb plots=none; model Lnrr2 = Lnp / clb; run; quit;
The estimated power, p=1.80 (1.45, 2.14), agrees that the relationship is approximately quadratic.
|
These statements fit the suggested gamma model and produce the following results.
proc genmod data=sashelp.baseball; class League Division; model Salary=League Division CrRuns nAtBat nHits nOuts / dist=g link=log; run;
|
Tweedie model
Since the power is clearly larger than 1.1, the Tweedie model can be used. The Tweedie model estimates the power to be p=2.17 (1.82, 2.5) also suggesting a gamma model.
proc genmod data=sashelp.baseball; class League Division; model Salary=League Division CrRuns nAtBat nHits nOuts / dist=tweedie link=log; run;
The Tweedie model with its estimated power is similar to the above gamma model.
|
__________
Note 1: Simulation results shown in Manning and Mullahy (2001) suggest that the distribution used for this tentative initial model is not critical. However, it might be advisable to try the method using other distributions in the initial model to see if the estimated power changes substantially.
Note 2: An alternative version of the modified Park method fits a log-linked model using GEE estimation (REPEATED statement in PROC GENMOD) to model the squared raw residuals from the initial model as a function of the log predicted values.
Reference
Manning, W.G., Mullahy, J. (2001). Estimating log models: to transform or not to transform? Journal of Health Economics. 20(4), 461-494.
Product Family | Product | System | SAS Release | |
Reported | Fixed* | |||
SAS System | SAS/STAT | z/OS | ||
z/OS 64-bit | ||||
OpenVMS VAX | ||||
Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
Microsoft Windows XP 64-bit Edition | ||||
Microsoft® Windows® for x64 | ||||
OS/2 | ||||
Microsoft Windows 8 Enterprise 32-bit | ||||
Microsoft Windows 8 Enterprise x64 | ||||
Microsoft Windows 8 Pro 32-bit | ||||
Microsoft Windows 8 Pro x64 | ||||
Microsoft Windows 8.1 Enterprise 32-bit | ||||
Microsoft Windows 8.1 Enterprise x64 | ||||
Microsoft Windows 8.1 Pro 32-bit | ||||
Microsoft Windows 8.1 Pro x64 | ||||
Microsoft Windows 10 | ||||
Microsoft Windows 95/98 | ||||
Microsoft Windows 2000 Advanced Server | ||||
Microsoft Windows 2000 Datacenter Server | ||||
Microsoft Windows 2000 Server | ||||
Microsoft Windows 2000 Professional | ||||
Microsoft Windows NT Workstation | ||||
Microsoft Windows Server 2003 Datacenter Edition | ||||
Microsoft Windows Server 2003 Enterprise Edition | ||||
Microsoft Windows Server 2003 Standard Edition | ||||
Microsoft Windows Server 2003 for x64 | ||||
Microsoft Windows Server 2008 | ||||
Microsoft Windows Server 2008 R2 | ||||
Microsoft Windows Server 2008 for x64 | ||||
Microsoft Windows Server 2012 Datacenter | ||||
Microsoft Windows Server 2012 R2 Datacenter | ||||
Microsoft Windows Server 2012 R2 Std | ||||
Microsoft Windows Server 2012 Std | ||||
Microsoft Windows XP Professional | ||||
Windows 7 Enterprise 32 bit | ||||
Windows 7 Enterprise x64 | ||||
Windows 7 Home Premium 32 bit | ||||
Windows 7 Home Premium x64 | ||||
Windows 7 Professional 32 bit | ||||
Windows 7 Professional x64 | ||||
Windows 7 Ultimate 32 bit | ||||
Windows 7 Ultimate x64 | ||||
Windows Millennium Edition (Me) | ||||
Windows Vista | ||||
Windows Vista for x64 | ||||
64-bit Enabled AIX | ||||
64-bit Enabled HP-UX | ||||
64-bit Enabled Solaris | ||||
ABI+ for Intel Architecture | ||||
AIX | ||||
HP-UX | ||||
HP-UX IPF | ||||
IRIX | ||||
Linux | ||||
Linux for x64 | ||||
Linux on Itanium | ||||
OpenVMS Alpha | ||||
OpenVMS on HP Integrity | ||||
Solaris | ||||
Solaris for x64 | ||||
Tru64 UNIX |
Type: | Usage Note |
Priority: | |
Topic: | Analytics ==> Regression SAS Reference ==> Procedures ==> GENMOD SAS Reference ==> Procedures ==> GLIMMIX SAS Reference ==> Procedures ==> HPGENSELECT |
Date Modified: | 2017-04-21 14:19:52 |
Date Created: | 2017-04-21 13:58:25 |