The ALLELE Procedure

Linkage Disequilibrium (LD)

The set of genetic material that an individual receives from each parent contains an allele at every locus, and statements can be made about these allelic combinations, or haplotypes. The probability $p_{uv}$ (called the gametic or haplotype frequency) that an individual receives the haplotype $M_uN_v$ for marker loci M and N can be compared to the product $p_up_v$ of the probabilities that each allele is received. The difference is the linkage, or gametic, disequilibrium (LD) coefficient $D_{uv}$ for those two alleles: $D_{uv} = p_{uv} - p_up_v$. The amount of linkage disequilibrium is generally expected to decrease as the distance between the two loci increases, but many other factors can affect disequilibrium. There can even be disequilibrium between alleles at loci that are located on different chromosomes.

When the WITH statement is omitted, the tests and measures described in this section are calculated only for pairs of markers that are at most $d$ markers apart (or $d$ units apart, in the unit used in the LOCATION variable of the NDATA= data set), where $d$ is the value specified in the MAXDIST= option of the PROC ALLELE statement (50 by default). Otherwise, all pairs of markers containing one marker from the VAR statement and one from the WITH statement are examined.
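As a minimal sketch of the definition $D_{uv} = p_{uv} - p_up_v$, the following Python fragment computes the coefficient for every allele pair from a list of phased haplotypes. The function name `ld_coefficients` and the toy data are illustrative assumptions; they are not part of PROC ALLELE.

```python
from collections import Counter


def ld_coefficients(haplotypes):
    """Return {(u, v): D_uv} for a list of (allele_at_M, allele_at_N) haplotypes."""
    total = len(haplotypes)                      # number of haplotypes observed
    hap_counts = Counter(haplotypes)             # counts of each M_u N_v haplotype
    m_counts = Counter(u for u, _ in haplotypes)  # allele counts at locus M
    n_counts = Counter(v for _, v in haplotypes)  # allele counts at locus N
    return {
        (u, v): hap_counts[(u, v)] / total
        - (m_counts[u] / total) * (n_counts[v] / total)
        for u in m_counts
        for v in n_counts
    }


# Toy data (hypothetical): 10 haplotypes at two biallelic markers.
haps = (
    [("M1", "N1")] * 4 + [("M1", "N2")] * 1
    + [("M2", "N1")] * 1 + [("M2", "N2")] * 4
)
D = ld_coefficients(haps)
```

In this toy sample each allele has frequency 0.5 and the haplotype $M_1N_1$ has frequency 0.4, so `D[("M1", "N1")]` is 0.4 - 0.25 = 0.15 (up to floating-point rounding).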

Table 2.1 displays how the HAPLO= option of the PROC ALLELE statement interacts with the linkage disequilibrium calculations. These calculations are discussed in more detail in the following two sections.

Table 2.1: Interaction of HAPLO= Option with LD Calculations

| HAPLO= Option | LD Test Statistic | LD Exact Test | Estimate of Haplotype Freq |
|---|---|---|---|
| GIVEN | $\tilde{D}_{uv}$ | Permutes alleles to form new 2-locus haplotypes | Observed freq, $\tilde{p}_{uv}$ |
| EST | $\hat{D}_{uv}$ | Not performed | Estimated freq, $\hat{p}_{uv}$ |
| NONE | $\tilde{\Delta }_{uv}$ | Permutes alleles to form new 2-locus genotypes | Composite freq, $\tilde{p}_{uv}^*$ |
| NONEHWD | $\tilde{\Delta }_{uv}$ | Permutes genotypes to form new 2-locus genotypes | Composite freq, $\tilde{p}_{uv}^*$ |


Tests

When haplotypes are known, include the HAPLO=GIVEN option in the PROC ALLELE statement so that the linkage disequilibrium is computed directly by substituting the observed frequencies $\tilde{p}_{uv}$, $\tilde{p}_ u$, and $\tilde{p}_ v$ into the equation for $D_{uv}$ in the preceding section. This yields the MLE, $\tilde{D}_{uv}$, of the LD coefficient between a pair of alleles at different markers. PROC ALLELE calculates an overall chi-square statistic to test the hypothesis that all of the $D_{uv}$’s between two markers are zero as follows:

\[  X_ T^2 = \sum _{u=1}^ k \sum _{v=1}^ l \frac{(2n)\tilde{D}_{uv}^2}{\tilde{p}_ u\tilde{p}_ v}  \]

which has $(k-1)(l-1)$ degrees of freedom for markers with $k$ and $l$ alleles, respectively. Here $2n$ is the number of haplotypes observed in a sample of $n$ individuals.
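The chi-square statistic for known haplotypes can be sketched in Python as follows. The function name `ld_chi_square` and the input layout (a table of haplotype counts) are assumptions made for illustration, and the sketch assumes every allele is observed at least once so that no denominator is zero.

```python
def ld_chi_square(hap_table):
    """hap_table[u][v] = count of haplotype M_u N_v. Returns (X2, df)."""
    total = sum(sum(row) for row in hap_table)             # 2n haplotypes
    row_freq = [sum(row) / total for row in hap_table]      # allele freqs at M
    col_freq = [sum(col) / total for col in zip(*hap_table)]  # allele freqs at N
    x2 = 0.0
    for u, row in enumerate(hap_table):
        for v, count in enumerate(row):
            # Observed haplotype frequency minus product of allele frequencies.
            d_uv = count / total - row_freq[u] * col_freq[v]
            x2 += total * d_uv ** 2 / (row_freq[u] * col_freq[v])
    df = (len(row_freq) - 1) * (len(col_freq) - 1)
    return x2, df


# Hypothetical counts for two biallelic markers (10 haplotypes).
x2, df = ld_chi_square([[4, 1], [1, 4]])
```

For this table each of the four terms equals $10 \times 0.15^2 / 0.25 = 0.9$, so the statistic is 3.6 on 1 degree of freedom.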

There is also a Monte Carlo estimate of the exact test available when haplotypes are known. An estimate of the exact $p$-value for testing the hypothesis in the preceding paragraph can be calculated by conditioning on the allele counts as with the permutation version of the exact test for HWE. The conditional probability of the haplotype counts is then

\[  T = \frac{ \prod _ u n_ u! \prod _ v n_ v!}{(2n)! \prod _{u,v} n_{uv}!}  \]

and the significance level is again obtained by permuting the alleles at one locus to form $2n$ new two-locus haplotypes. You can specify the number of permutations in the PERMS= option and the seed used to randomly permute the data in the SEED= option of the PROC ALLELE statement.
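The permutation procedure can be sketched in Python as shown below. This is a minimal illustration, not PROC ALLELE's implementation: the function names are hypothetical, the `perms` and `seed` arguments only mimic the roles of the PERMS= and SEED= options, and the rule of counting a permuted sample when its probability $T$ is not larger than the observed one is assumed to carry over from the other exact tests described in this section.

```python
import math
import random
from collections import Counter


def log_t(m_alleles, n_alleles):
    """Log of T when m_alleles[i] and n_alleles[i] form haplotype i."""
    total = len(m_alleles)  # 2n haplotypes
    value = -math.lgamma(total + 1)
    value += sum(math.lgamma(c + 1) for c in Counter(m_alleles).values())
    value += sum(math.lgamma(c + 1) for c in Counter(n_alleles).values())
    value -= sum(
        math.lgamma(c + 1) for c in Counter(zip(m_alleles, n_alleles)).values()
    )
    return value


def mc_exact_ld(m_alleles, n_alleles, perms=1000, seed=1):
    """Monte Carlo estimate of the exact p-value for LD with known haplotypes."""
    rng = random.Random(seed)
    observed = log_t(m_alleles, n_alleles)
    shuffled = list(n_alleles)
    hits = 0
    for _ in range(perms):
        rng.shuffle(shuffled)  # permute alleles at one locus only
        # Small tolerance guards against floating-point ties.
        if log_t(m_alleles, shuffled) <= observed + 1e-9:
            hits += 1
    return hits / perms
```

Working on the log scale with `math.lgamma` avoids overflow in the factorials for realistic sample sizes.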

When the HAPLO=EST option is specified so that haplotype frequencies are estimated, $D_{uv}$ is estimated using $\hat{D}_{uv}=\hat{p}_{uv} - \tilde{p}_ u\tilde{p}_ v$, where $\hat{p}_{uv}$ is the MLE of $p_{uv}$ assuming HWE. The estimate $\hat{p}_{uv}$ is calculated according to the method described by Weir and Cockerham (1979). Again, a chi-square test statistic can be calculated to test the hypothesis that all of the $D_{uv}$’s between a pair of markers are zero as

\[  X_ T^2 = \sum _{u=1}^ k \sum _{v=1}^ l \frac{n\hat{D}_{uv}^2}{\tilde{p}_ u\tilde{p}_ v}  \]

which has $(k-1)(l-1)$ degrees of freedom for markers with $k$ and $l$ alleles, respectively. No exact test is available when haplotype frequencies are estimated.

The HAPLO=NONE and HAPLO=NONEHWD options indicate that haplotypes are unknown and that $\hat{D}_{uv}$ should not be used in the tests for LD between pairs of markers. Instead of using the estimated haplotype frequencies, which assume HWE, a test can be formed using the composite linkage disequilibrium (CLD) coefficient $\Delta _{uv}$, which does not require this assumption and uses only allele and two-locus genotype frequencies. The MLE $\tilde{\Delta }_{uv}$ of $\Delta _{uv}$ can be calculated as described by Weir (1979), and a chi-square statistic that tests whether all $\Delta _{uv}$’s between a pair of markers are zero can be formed as follows:

\[  X_ T^2 = \sum _{u=1}^ k \sum _{v=1}^ l \frac{n\tilde{\Delta }_{uv}^2}{\tilde{p}_ u \tilde{p}_ v}  \]

which has $(k-1)(l-1)$ degrees of freedom for markers with $k$ and $l$ alleles, respectively. This statistic is used when HAPLO=NONE is specified. When each marker in the pair being analyzed is biallelic, a correction in this test statistic for departures from HWE can be requested with the HAPLO=NONEHWD option. The 1 df chi-square statistic is then represented as

\[  X_ T^2 = \frac{n\tilde{\Delta }_{uv}^2}{[\tilde{p}_ u (1-\tilde{p}_ u) + \hat{d}_{uu}][\tilde{p}_ v (1-\tilde{p}_ v) + \hat{d}_{vv}]}  \]

with $ u = v = 1$.
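For a pair of biallelic markers, the two CLD statistics can be sketched in Python as follows. The text above does not spell out $\tilde{\Delta }_{uv}$ or $\hat{d}_{uu}$, so this sketch assumes the usual textbook forms: $\tilde{\Delta } = n_{AB}/n - 2\tilde{p}\tilde{q}$ with $n_{AB} = 2n_{AABB} + n_{AABb} + n_{AaBB} + n_{AaBb}/2$, and the Hardy-Weinberg disequilibrium coefficient $\hat{d}_{uu} = \tilde{P}_{uu} - \tilde{p}_u^2$. The function name and input layout are also hypothetical, and any small-sample adjustments that PROC ALLELE might apply are not reproduced.

```python
def cld_chi_square(g, hwd_correction=False):
    """g[i][j] = number of individuals carrying i copies of allele M_1
    and j copies of allele N_1 (i, j in 0..2). Returns (delta, X2)."""
    n = sum(sum(row) for row in g)
    p = sum(i * sum(g[i]) for i in range(3)) / (2 * n)  # freq of M_1
    q = sum(j * g[i][j] for i in range(3) for j in range(3)) / (2 * n)  # freq of N_1
    # Assumed composite count: double heterozygotes contribute one half.
    n_ab = 2 * g[2][2] + g[2][1] + g[1][2] + 0.5 * g[1][1]
    delta = n_ab / n - 2 * p * q
    var_m = p * (1 - p)
    var_n = q * (1 - q)
    if hwd_correction:
        # Assumed HWD coefficients: homozygote frequency minus squared allele freq.
        var_m += sum(g[2]) / n - p ** 2
        var_n += sum(row[2] for row in g) / n - q ** 2
    # Without the correction this equals the double sum over u and v of
    # n * delta_uv**2 / (p_u * p_v) in the biallelic case.
    return delta, n * delta ** 2 / (var_m * var_n)


# Hypothetical sample in which every individual is homozygous at both loci.
all_homozygous = [[8, 0, 0], [0, 0, 0], [0, 0, 8]]
plain = cld_chi_square(all_homozygous)
corrected = cld_chi_square(all_homozygous, hwd_correction=True)
```

In this extreme sample the uncorrected statistic is 64 while the corrected statistic is 16, which illustrates how a strong departure from HWE can inflate the uncorrected test.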

Permutation versions of exact tests for CLD are given by Zaykin, Zhivotovsky, and Weir (1995), either assuming HWE or accounting for departures from HWE. The conditional probability of the two-locus genotypes given the one-locus alleles assuming HWE is

\[  T = \frac{n! \prod _ r n_ r! \prod _ u n_ u! \prod _{r,s,u,v}2^{n_{rsuv}H_{rsuv}} }{[(2n)!]^2 \prod _{r,s,u,v}n_{rsuv}!}  \]

where $n_{rsuv}$ is the count of $M_ rM_ sN_ uN_ v$ genotypes, $n_{r}$ and $n_{u}$ are the counts of $M_ r$ and $N_ u$ alleles, respectively, and $H_{rsuv}$ represents the number of loci that are heterozygous for genotype $M_ rM_ sN_ uN_ v$ (0, 1, or 2). An estimate of the exact significance level is obtained by permuting the alleles at both of the loci and counting a permuted sample toward the $p$-value when its probability $T$ is not larger than for the observed sample.
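A minimal Python sketch of this test follows. It evaluates $T$ on the log scale, reading the squared factorial in the denominator as $[(2n)!]^2$, and estimates the $p$-value by reshuffling the alleles at both loci. All function names, and the `perms` and `seed` arguments, are hypothetical and are not taken from PROC ALLELE.

```python
import math
import random
from collections import Counter


def log_t_hwe(m_genos, n_genos):
    """m_genos[i] and n_genos[i] are the allele pairs of individual i."""
    n = len(m_genos)
    m_sorted = [tuple(sorted(g)) for g in m_genos]
    n_sorted = [tuple(sorted(g)) for g in n_genos]
    # Sum over genotypes of n_rsuv * H_rsuv = total number of heterozygous loci.
    het_total = sum(a != b for a, b in m_sorted) + sum(a != b for a, b in n_sorted)
    value = math.lgamma(n + 1) - 2 * math.lgamma(2 * n + 1)
    value += het_total * math.log(2)
    for genos in (m_sorted, n_sorted):
        allele_counts = Counter(a for g in genos for a in g)
        value += sum(math.lgamma(c + 1) for c in allele_counts.values())
    two_locus = Counter(zip(m_sorted, n_sorted))
    value -= sum(math.lgamma(c + 1) for c in two_locus.values())
    return value


def reshuffle(genos, rng):
    """Permute the 2n alleles at one locus and re-pair them into n genotypes."""
    alleles = [a for g in genos for a in g]
    rng.shuffle(alleles)
    return list(zip(alleles[0::2], alleles[1::2]))


def mc_exact_cld(m_genos, n_genos, perms=1000, seed=1):
    rng = random.Random(seed)
    observed = log_t_hwe(m_genos, n_genos)
    hits = 0
    for _ in range(perms):
        permuted = log_t_hwe(reshuffle(m_genos, rng), reshuffle(n_genos, rng))
        if permuted <= observed + 1e-9:  # T not larger than observed
            hits += 1
    return hits / perms
```

As a check on the formula, two individuals who are both double heterozygotes give $T = 2 \cdot 4 \cdot 4 \cdot 16 / (576 \cdot 2) = 4/9$, which matches the product of the single-locus probabilities $2/3 \times 2/3$.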

When departures from HWE are accounted for, the conditional probability of the two-locus genotypes given the one-locus genotypes is

\[  T_{HWD} = \frac{\prod _{r,s} n_{rs}! \prod _{u,v} n_{uv}!}{n! \prod _{r,s,u,v}n_{rsuv}!}  \]

with $n_{rs}$ and $n_{uv}$ as the counts of $M_ r/M_ s$ and $N_ u/N_ v$ genotypes, respectively. An estimate of the exact significance level is obtained by permuting the genotypes at one of the loci and calculating the probability $T_{HWD}$ for each permuted sample. When HAPLO=NONEHWD is specified, the $p$-value is reported as the proportion of samples that have a $T_{HWD}$ less than or equal to the one from the original sample. Note: $T_{HWD}$ can be used for multiallelic markers, while the formula for the chi-square statistic cannot. When HAPLO=NONEHWD, the chi-square statistic and asymptotic $p$-value that are reported for a marker with more than two alleles do not account for departures from HWE; however, the estimate of the exact $p$-value does make this adjustment as expected.
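The version that accounts for departures from HWE can be sketched in the same way. Here whole genotypes at one locus are permuted across individuals, so the one-locus genotype counts stay fixed. The function names and the `perms` and `seed` arguments are hypothetical and only mirror the roles of the PERMS= and SEED= options.

```python
import math
import random
from collections import Counter


def log_t_hwd(m_genos, n_genos):
    """Log of T_HWD; m_genos[i] and n_genos[i] belong to individual i."""
    n = len(m_genos)
    m_sorted = [tuple(sorted(g)) for g in m_genos]
    n_sorted = [tuple(sorted(g)) for g in n_genos]
    value = -math.lgamma(n + 1)
    value += sum(math.lgamma(c + 1) for c in Counter(m_sorted).values())
    value += sum(math.lgamma(c + 1) for c in Counter(n_sorted).values())
    two_locus = Counter(zip(m_sorted, n_sorted))
    value -= sum(math.lgamma(c + 1) for c in two_locus.values())
    return value


def mc_exact_cld_hwd(m_genos, n_genos, perms=1000, seed=1):
    """Proportion of permuted samples with T_HWD <= the observed T_HWD."""
    rng = random.Random(seed)
    observed = log_t_hwd(m_genos, n_genos)
    shuffled = list(n_genos)
    hits = 0
    for _ in range(perms):
        rng.shuffle(shuffled)  # permute genotypes at one locus only
        if log_t_hwd(m_genos, shuffled) <= observed + 1e-9:
            hits += 1
    return hits / perms
```

Because genotypes are treated as indivisible units, nothing in this sketch depends on the markers being biallelic.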

Measures

PROC ALLELE offers several linkage disequilibrium measures that can be calculated for each pair of alleles $M_u$ and $N_v$ located at loci M and N, respectively. Devlin and Risch (1995) discuss the correlation coefficient $r$, the population attributable risk $\delta $, Lewontin’s $D'$, the proportional difference $d$, and Yule’s $Q$. Morton et al. (2001) define $\rho $ and its information $K_{\rho }$, which is calculated under the null hypothesis that $D=0$ and is also included in the Linkage Disequilibrium Measures table when the RHO option is specified. Because these measures are designed for biallelic markers, they are calculated for each allele at locus M with each allele at locus N, with all other alleles at each locus combined to represent one allele. Thus for each allele $M_u$ in turn, $\tilde{p}_1$ is used as the frequency of allele $M_u$, and $\tilde{p}_2$ represents the frequency of not $M_u$; similarly for each $N_v$ in turn, $\tilde{q}_1$ represents the frequency of allele $N_v$, and $\tilde{q}_2$ represents the frequency of not $N_v$. All measures have the same numerator, $D=p_{11} p_{22} - p_{12} p_{21}$, the LD coefficient, which can be directly estimated using the observed haplotype frequencies $\tilde{p}_{uv}$ when HAPLO=GIVEN, or estimated using the MLEs of the haplotype frequencies $\hat{p}_{uv}$ assuming HWE when HAPLO=EST. The computations for the measures are as follows:

\begin{eqnarray*}
r &  = &  \frac{ D }{ (p_1 p_2 q_1 q_2)^{1/2} } \\
\delta &  = &  \frac{ D }{q_1 p_{22} } \\
D' &  = &  \frac{ D }{D_{\max }}, \mbox{ }D_{\max } = \left\{ \begin{array}{ll} \min (p_1 q_2, q_1 p_2), &  D > 0\\ \min (p_1 q_1, q_2 p_2), &  D < 0 \end{array} \right. \\
d &  = &  \frac{D }{q_1 q_2 } \\
\rho &  = &  \frac{ D }{\mbox{denom}}, \mbox{ denom} = \left\{ \begin{array}{ll} \min (p_1,p_2)\times \max (q_1,q_2), &  \min (p_1,p_2)\le \min (q_1,q_2)\\ \min (q_1,q_2)\times \max (p_1,p_2), &  \min (p_1,p_2)>\min (q_1,q_2) \end{array} \right. \\
Q &  = &  \frac{ D }{p_{11}p_{22} + p_{12} p_{21}}
\end{eqnarray*}

with estimates of the measures calculated by replacing parameters with their appropriate estimates. Under the option HAPLO=NONE (the default) or HAPLO=NONEHWD, the numerator $D$ can be replaced by the CLD coefficient $\Delta $, described in the preceding section, for the measures $r$ and $D'$. In place of the preceding formula for the denominator of $D'$, the bounds used for $\Delta $ ($\Delta _{\max }$) are those given by Hamilton and Cole (2004) and Zaykin (2004). The denominator of the correlation coefficient $r$ is adjusted for departures from HWE when HAPLO=NONEHWD in the same manner as the corresponding chi-square statistic, so that $r = \Delta _{uv} / \{ [p_ u (1-p_ u) + d_{uu}][q_ v (1-q_ v) + d_{vv}]\} ^{1/2}$. The measures $\delta $, $d$, $\rho $, and $Q$ cannot be calculated under either of these two options. The information $K_{\rho }$ is estimated by $n Q(1-R)/(R(1-Q))$, where in this formula $Q=\min (p_1,p_2,q_1,q_2)$ (not Yule’s $Q$) and $R$ is the smaller allele frequency ($\min (p_1,p_2)$ or $\min (q_1,q_2)$) at the locus not used for $Q$.
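The haplotype-based measures can be sketched in Python as follows for a single pair of (possibly collapsed) biallelic markers. The function name `ld_measures`, the returned dictionary keys, and the treatment of `n` as a sample-size argument passed straight into the $K_{\rho }$ formula are assumptions made for illustration. The sketch also assumes that all four haplotype frequencies are positive, so that no denominator is zero, and it sets $D'$ to zero when $D = 0$, a case the formulas leave unspecified.

```python
import math


def ld_measures(p11, p12, p21, p22, n):
    """p_ij = frequency of haplotype (M allele i, N allele j), i, j in {1, 2}."""
    p1, p2 = p11 + p12, p21 + p22  # M_u and not-M_u
    q1, q2 = p11 + p21, p12 + p22  # N_v and not-N_v
    d_coef = p11 * p22 - p12 * p21
    if d_coef > 0:
        d_prime = d_coef / min(p1 * q2, q1 * p2)
    elif d_coef < 0:
        d_prime = d_coef / min(p1 * q1, q2 * p2)
    else:
        d_prime = 0.0  # assumption: D' taken as 0 when D = 0
    if min(p1, p2) <= min(q1, q2):
        rho_denom = min(p1, p2) * max(q1, q2)
        q_min, r_min = min(p1, p2), min(q1, q2)
    else:
        rho_denom = min(q1, q2) * max(p1, p2)
        q_min, r_min = min(q1, q2), min(p1, p2)
    return {
        "r": d_coef / math.sqrt(p1 * p2 * q1 * q2),
        "delta": d_coef / (q1 * p22),
        "d_prime": d_prime,
        "d": d_coef / (q1 * q2),
        "rho": d_coef / rho_denom,
        "yule_q": d_coef / (p11 * p22 + p12 * p21),
        # Here q_min and r_min play the roles of Q and R in the K_rho formula.
        "k_rho": n * q_min * (1 - r_min) / (r_min * (1 - q_min)),
    }


# Hypothetical haplotype frequencies with all allele frequencies equal to 0.5.
measures = ld_measures(0.4, 0.1, 0.1, 0.4, n=100)
```

With these frequencies $D = 0.15$, so $r$, $D'$, $d$, and $\rho $ all equal 0.6, $\delta = 0.75$, and Yule's $Q = 0.15/0.17 \approx 0.882$.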