The ALLELE Procedure

Frequency Estimates

A marker locus $M$ can have a series of alleles $M_u$, $u=1,\ldots,k$. A sample of $n$ individuals can therefore have several different genotypes at the locus, with $n_{uv}$ individuals of genotype $M_u/M_v$. The number $n_u$ of copies of allele $M_u$ can be found directly by summation: $n_u = 2n_{uu}+\sum_{v\neq u} n_{uv}$. The sample frequencies are written as $\tilde{p}_u=n_u/(2n)$ and $\tilde{P}_{uv}=n_{uv}/n$. The $\tilde{P}_{uv}$ are unbiased maximum likelihood estimates (MLEs) of the population proportions $P_{uv}$.
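
The counting rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the procedure's implementation; the function name `allele_frequencies`, the dictionary layout, and the example counts are assumptions made for this sketch.

```python
from collections import Counter

def allele_frequencies(genotype_counts):
    """Hypothetical helper: genotype_counts maps an unordered allele
    pair (u, v) to n_uv, the number of individuals with that genotype."""
    n = sum(genotype_counts.values())  # number of individuals in the sample
    allele_counts = Counter()
    for (u, v), n_uv in genotype_counts.items():
        # A homozygote u/u adds two copies of u (both lines hit the same key);
        # a heterozygote u/v adds one copy of each allele.
        allele_counts[u] += n_uv
        allele_counts[v] += n_uv
    p = {u: c / (2 * n) for u, c in allele_counts.items()}   # allele frequencies
    P = {g: c / n for g, c in genotype_counts.items()}       # genotype frequencies
    return n, dict(allele_counts), p, P

# Made-up counts for a locus with two alleles, A and B.
counts = {("A", "A"): 30, ("A", "B"): 50, ("B", "B"): 20}
n, n_u, p, P = allele_frequencies(counts)
```

With these illustrative counts, allele A has $2(30)+50=110$ copies among $2n=200$, so its sample frequency is 0.55.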

The variance of the sample allele frequency $\tilde{p}_ u$ is calculated as

\[  \mbox{Var}(\tilde{p}_ u) = \frac{1}{2n}(p_ u + P_{uu} - 2p_ u^2)  \]

and can be estimated by replacing $p_u$ and $P_{uu}$ with their sample values $\tilde{p}_u$ and $\tilde{P}_{uu}$. The variance of the sample genotype frequency $\tilde{P}_{uv}$ is not generally calculated; instead, an MLE of the Hardy-Weinberg disequilibrium (HWD) coefficient $d_{uv}$ for alleles $M_u$ and $M_v$ is calculated as

\[  \hat{d}_{uv} = \left\{  \begin{array}{ll} \tilde{P}_{uu}- \tilde{p}_u^2, &  u=v \\ \tilde{p}_u \tilde{p}_v - \frac{1}{2} \tilde{P}_{uv}, &  u \neq v \end{array} \right.  \]

and the MLE’s variance is estimated using one of the following formulas, depending on whether the two alleles are the same or different:

\[
\begin{aligned}
\mbox{Var}(\hat{d}_{uu}) &= \frac{1}{n}\Big[ \tilde{p}_u^2(1-\tilde{p}_u)^2 + (1-2\tilde{p}_u)^2 \hat{d}_{uu} - \hat{d}_{uu}^2 \Big] \\
\mbox{Var}(\hat{d}_{uv}) &= \frac{1}{2n}\Big\{ \tilde{p}_u \tilde{p}_v(1-\tilde{p}_u)(1-\tilde{p}_v) + \sum_{w\neq u,v} (\tilde{p}^2_u \hat{d}_{vw} + \tilde{p}^2_v \hat{d}_{uw}) \\
&\qquad -\left[(1-\tilde{p}_u-\tilde{p}_v)^2 - 2(\tilde{p}_u-\tilde{p}_v)^2\right]\hat{d}_{uv} + \tilde{p}^2_u\tilde{p}^2_v - 2\hat{d}^2_{uv} \Big\}
\end{aligned}
\]
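
As a rough check on the formulas above, the following Python sketch evaluates the allele frequency variance, the HWD coefficient estimates, and their variance estimates from sample frequencies. The function names, the dictionary inputs, and the numeric values are hypothetical choices for illustration; the code transcribes the formulas term by term and is not taken from the procedure itself.

```python
import math

def hwd_coefficient(p, P, u, v):
    """MLE of d_uv from allele frequencies p and genotype frequencies P."""
    if u == v:
        return P.get((u, u), 0.0) - p[u] ** 2
    # A heterozygote may be stored under either key order; accept both.
    P_uv = P.get((u, v), 0.0) + P.get((v, u), 0.0)
    return p[u] * p[v] - 0.5 * P_uv

def var_allele_freq(p, P, u, n):
    """Estimated Var(p_u) = (p_u + P_uu - 2 p_u^2) / (2n)."""
    return (p[u] + P.get((u, u), 0.0) - 2 * p[u] ** 2) / (2 * n)

def var_d_same(p, P, u, n):
    """Estimated variance of d_uu (both alleles the same)."""
    d = hwd_coefficient(p, P, u, u)
    return (p[u] ** 2 * (1 - p[u]) ** 2 + (1 - 2 * p[u]) ** 2 * d - d ** 2) / n

def var_d_diff(p, P, u, v, n):
    """Estimated variance of d_uv for two different alleles u and v."""
    d_uv = hwd_coefficient(p, P, u, v)
    # Sum over every allele w other than u and v.
    others = sum(
        p[u] ** 2 * hwd_coefficient(p, P, v, w)
        + p[v] ** 2 * hwd_coefficient(p, P, u, w)
        for w in p if w not in (u, v)
    )
    bracket = (1 - p[u] - p[v]) ** 2 - 2 * (p[u] - p[v]) ** 2
    return (
        p[u] * p[v] * (1 - p[u]) * (1 - p[v]) + others
        - bracket * d_uv + p[u] ** 2 * p[v] ** 2 - 2 * d_uv ** 2
    ) / (2 * n)

# Made-up sample of n = 100 individuals at a locus with alleles A and B.
n = 100
p = {"A": 0.55, "B": 0.45}
P = {("A", "A"): 0.30, ("A", "B"): 0.50, ("B", "B"): 0.20}

se_p_A = math.sqrt(var_allele_freq(p, P, "A", n))   # standard error of p_A
d_AA = hwd_coefficient(p, P, "A", "A")
d_AB = hwd_coefficient(p, P, "A", "B")
se_d_AA = math.sqrt(var_d_same(p, P, "A", n))
se_d_AB = math.sqrt(var_d_diff(p, P, "A", "B", n))
```

With only two alleles the coefficients $\hat{d}_{AA}$ and $\hat{d}_{AB}$ coincide, and in this example the two variance formulas give the same value, which is a convenient sanity check on a transcription of the formulas.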

The standard error, the square root of the variance, is reported for the sample allele frequencies and the disequilibrium coefficient estimates. When the BOOTSTRAP= option of the PROC ALLELE statement is specified, bootstrap confidence intervals are formed by resampling individuals from the data set and are reported for these estimates, with the $100(1-\alpha )$% confidence level given by the ALPHA=$\alpha $ option (or $\alpha =0.05$ by default).
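
The idea of resampling individuals can be illustrated with a small Python sketch. The text above does not say how the interval endpoints are derived from the bootstrap replicates, so the percentile method used here is an assumption, as are the function name `bootstrap_ci`, its parameters, and the example data; it should not be read as the procedure's algorithm.

```python
import random

def bootstrap_ci(genotypes, allele, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed method) for the frequency
    of `allele`. `genotypes` holds one (allele, allele) tuple per individual."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    n = len(genotypes)

    def freq(sample):
        # Copies of the allele divided by the 2n allele copies in the sample.
        return sum(g.count(allele) for g in sample) / (2 * len(sample))

    # Resample n individuals with replacement, n_boot times.
    stats = sorted(freq(rng.choices(genotypes, k=n)) for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Made-up sample: 30 A/A, 50 A/B, and 20 B/B individuals.
sample = [("A", "A")] * 30 + [("A", "B")] * 50 + [("B", "B")] * 20
lower, upper = bootstrap_ci(sample, "A", n_boot=1000, alpha=0.05)
```

Resampling whole individuals rather than single alleles keeps each genotype intact, so any departure from Hardy-Weinberg proportions in the data carries through to the replicates.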