The GENMOD Procedure

What Is a Generalized Linear Model?

A traditional linear model is of the form

\[ y_ i = \mb{x}_ i^\prime \bbeta + \varepsilon _ i \]

where $y_ i$ is the response variable for the ith observation. The quantity $\mb{x}_ i$ is a column vector of covariates, or explanatory variables, for observation i that is known from the experimental setting and is considered to be fixed, or nonrandom. The vector of unknown coefficients $\bbeta $ is estimated by a least squares fit to the data $\mb{y}$. The $\varepsilon _ i$ are assumed to be independent, normal random variables with zero mean and constant variance. The expected value of $y_ i$, denoted by $\mu _ i$, is

\[ \mu _ i = \mb{x}_ i^\prime \bbeta \]
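As an illustration only, the least squares fit described above can be sketched in a few lines of Python for the special case of an intercept and a single covariate. The data values and the function name `least_squares` are hypothetical and are not part of the GENMOD procedure.

```python
# Hypothetical toy data: one covariate x and a response y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares(x, y):
    """Closed-form least squares fit of y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


b0, b1 = least_squares(x, y)

# Fitted means mu_i = x_i' beta, with x_i = (1, x_i)'.
mu = [b0 + b1 * xi for xi in x]

# Residuals y_i - mu_i play the role of the estimated errors.
residuals = [yi - mi for yi, mi in zip(y, mu)]
```

With an intercept in the model, the residuals from a least squares fit sum to zero, which is a quick check that the fit was computed correctly.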

Traditional linear models are used extensively in statistical data analysis, but they are not appropriate for some types of problems, such as the following:

  • It might not be reasonable to assume that data are normally distributed. For example, the normal distribution (which is continuous) might not be adequate for modeling counts or measured proportions that are considered to be discrete.

  • If the mean of the data is naturally restricted to a range of values, the traditional linear model might not be appropriate, since the linear predictor $\mb{x}_ i^\prime \bbeta $ can take on any value. For example, the mean of a measured proportion is between 0 and 1, but the linear predictor of the mean in a traditional linear model is not restricted to this range.

  • It might not be realistic to assume that the variance of the data is constant for all observations. For example, it is not unusual to observe data where the variance increases with the mean of the data.
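The range restriction in the second item can be made concrete with a small sketch. It assumes hypothetical proportion data and fits a straight line by least squares. Nothing here comes from the procedure itself.

```python
# Hypothetical observed proportions at five covariate values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
p = [0.05, 0.20, 0.50, 0.80, 0.95]

n = len(x)
x_bar = sum(x) / n
p_bar = sum(p) / n

# Least squares slope and intercept for p = intercept + slope * x.
slope = (sum((xi - x_bar) * (pi - p_bar) for xi, pi in zip(x, p))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = p_bar - slope * x_bar


def linear_mean(x_new):
    """Mean predicted by the traditional linear model at x_new."""
    return intercept + slope * x_new


# Predictions just outside the observed covariate range.
low = linear_mean(0.0)
high = linear_mean(6.0)

# The linear predictor is unbounded, so the "proportions" leave [0, 1].
out_of_range = (low < 0.0, high > 1.0)
```

Here the fitted line gives a negative mean at x = 0 and a mean greater than 1 at x = 6, neither of which is a valid proportion.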

A generalized linear model extends the traditional linear model and is therefore applicable to a wider range of data analysis problems. A generalized linear model consists of the following components:

  • The linear component is defined just as it is for traditional linear models:

    \[ \eta _ i = \mb{x}_ i^\prime \bbeta \]

  • A monotonic differentiable link function g describes how the expected value of $y_ i$ is related to the linear predictor $\eta _ i$:

    \[ g(\mu _ i) = \mb{x}_ i^\prime \bbeta \]

  • The response variables $y_ i$ are independent for $i = 1, 2, \ldots $ and have a probability distribution from an exponential family. This implies that the variance of the response depends on the mean $\mu _ i$ through a variance function V:

    \[ \mr{Var}(y_ i) = \frac{\phi V(\mu _ i)}{w_ i} \]

    where $\phi $ is a constant and $w_ i$ is a known weight for each observation. The dispersion parameter $\phi $ is either known (for example, for the binomial or Poisson distribution, $\phi = 1$) or must be estimated.
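The three components can be sketched in Python as follows. This is a minimal illustration, not the procedure's implementation. The function names are hypothetical, and the variance functions are the standard textbook forms for the Poisson, binomial, and normal distributions.

```python
import math


def linear_predictor(x_i, beta):
    """eta_i = x_i' beta for one observation."""
    return sum(xj * bj for xj, bj in zip(x_i, beta))


# Two common monotonic, differentiable links and their inverses.
def log_link(mu):
    return math.log(mu)


def inv_log_link(eta):
    return math.exp(eta)          # mean is always positive


def logit_link(mu):
    return math.log(mu / (1.0 - mu))


def inv_logit_link(eta):
    return 1.0 / (1.0 + math.exp(-eta))   # mean stays in (0, 1)


# Standard variance functions V(mu).
def v_poisson(mu):
    return mu


def v_binomial(mu):
    return mu * (1.0 - mu)


def v_normal(mu):
    return 1.0


def response_variance(mu, v, phi=1.0, w=1.0):
    """Var(y_i) = phi * V(mu_i) / w_i."""
    return phi * v(mu) / w


# Example: a hypothetical coefficient vector and covariate vector.
eta = linear_predictor([1.0, 2.0], [-1.0, 0.5])   # eta = 0.0
mu_binomial = inv_logit_link(eta)                 # mean on the (0, 1) scale
mu_poisson = inv_log_link(eta)                    # mean on the positive scale
```

For a binomial proportion based on 10 trials, for example, `response_variance(0.25, v_binomial, w=10.0)` uses the number of trials as the weight $w_ i$ with $\phi = 1$.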

See the section Response Probability Distributions for the form of a probability distribution from the exponential family of distributions.

As in the case of traditional linear models, fitted generalized linear models can be summarized through statistics such as parameter estimates, their standard errors, and goodness-of-fit statistics. You can also make statistical inference about the parameters by using confidence intervals and hypothesis tests. However, specific inference procedures are usually based on asymptotic considerations, since exact distribution theory is not available or is not practical for all generalized linear models.
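To illustrate what asymptotic inference can look like in practice, the following sketch fits a Poisson model with a log link by Fisher scoring and forms Wald confidence intervals from the inverse Fisher information. The data, the function name, and the fixed iteration count are assumptions made for this example, and the sketch is not a description of how the GENMOD procedure computes its results.

```python
import math

# Hypothetical count data with one covariate.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 3, 4, 9, 15]


def information(b0, b1):
    """Entries of the Fisher information X'WX, with W = diag(mu_i)."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    a = sum(mu)
    b = sum(m * xi for m, xi in zip(mu, x))
    c = sum(m * xi * xi for m, xi in zip(mu, x))
    return mu, a, b, c


def fit_poisson_log(iterations=25):
    """Fisher scoring for log(mu_i) = b0 + b1 * x_i (dispersion phi = 1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start at the intercept-only fit
    for _ in range(iterations):
        mu, a, b, c = information(b0, b1)
        # Score vector X'(y - mu).
        s0 = sum(yi - m for yi, m in zip(y, mu))
        s1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))
        det = a * c - b * b
        # Update: beta <- beta + (X'WX)^{-1} * score.
        b0 += (c * s0 - b * s1) / det
        b1 += (a * s1 - b * s0) / det
    return b0, b1


b0, b1 = fit_poisson_log()

# Asymptotic covariance: inverse information at the final estimates.
mu_hat, a, b, c = information(b0, b1)
det = a * c - b * b
se_b0 = math.sqrt(c / det)
se_b1 = math.sqrt(a / det)

# Wald 95% confidence interval for the slope (normal quantile 1.96).
z = 1.96
ci_b1 = (b1 - z * se_b1, b1 + z * se_b1)

# Score at the estimates; it should be numerically zero at the maximum.
score0 = sum(yi - m for yi, m in zip(y, mu_hat))
score1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu_hat, x))
```

The interval is valid only approximately, because the normal distribution of the estimates holds in large samples. With five observations, as here, it serves purely as a demonstration of the mechanics.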