The HTSNP Procedure (Experimental)

Statistical Computations

Diversity

Let $f_1,\ldots,f_n$ represent the proportional frequencies of the $n$ unique $M$-locus haplotypes in the input data set. The locus or allelic diversity $D_1,\ldots,D_M$ for the individual loci and the overall haplotype diversity $D$ can be calculated as

$\displaystyle D_m = \sum_{i=1}^n \sum_{j=1}^n f_i f_j \, \mbox{I}(h_{im} \neq h_{jm})$
$\displaystyle D = \sum_{m=1}^M D_m$

where $h_{im}$ is the allele of the $i$th haplotype observed at the $m$th locus and the indicator function $\mbox{I}()$ equals 1 when its argument is true and 0 otherwise (Clayton 2002).
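The diversity formulas above can be sketched in Python as follows. This is an illustration only, not the procedure's implementation: the function name `locus_diversity`, the representation of haplotypes as strings with one character per locus, and the three-haplotype example data are all assumptions made for this sketch.

```python
from itertools import product

def locus_diversity(haplotypes, freqs):
    """Return ([D_1, ..., D_M], D) for haplotypes with frequencies freqs.

    haplotypes: sequences of equal length M (hypothetical input format)
    freqs:      proportional frequencies f_1, ..., f_n summing to 1
    """
    M = len(haplotypes[0])
    D_m = []
    for m in range(M):
        total = 0.0
        # Double sum over all ordered pairs (i, j) of haplotypes
        for (hi, fi), (hj, fj) in product(zip(haplotypes, freqs), repeat=2):
            if hi[m] != hj[m]:          # indicator I(h_im != h_jm)
                total += fi * fj
        D_m.append(total)
    return D_m, sum(D_m)

# Made-up example: three 3-locus haplotypes
haps = ["AAG", "ATG", "CTC"]
freqs = [0.5, 0.3, 0.2]
D_m, D = locus_diversity(haps, freqs)
# D_m is approximately [0.32, 0.5, 0.32] and D is approximately 1.14
```

For a biallelic locus, $D_m$ reduces to $2p(1-p)$, where $p$ is the frequency of either allele, which is why the first locus (allele frequencies 0.8 and 0.2) gives 0.32.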

Based on a selected subset of SNPs, the observed haplotypes can be partitioned into distinct groups. Let $\mathcal{T}_t$ represent the set of haplotypes in group $t=1,\ldots,T$, where each set contains all haplotypes that have identical alleles at the selected loci. Following Clayton (2002), the residual diversity is calculated by summing the within-group diversity over the groups, again both for the individual loci and over all haplotypes:

$\displaystyle R_m = \sum_{t=1}^T \sum_{i\in \mathcal{T}_t} \sum_{j\in \mathcal{T}_t} f_i f_j \, \mbox{I}(h_{im} \neq h_{jm})$
$\displaystyle R = \sum_{m=1}^M R_m$

where $m=1,\ldots,M$. Note that $R_m = 0$ if locus $m$ is one of the selected SNPs, because all haplotypes within a group then carry the same allele at that locus.
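The partitioning and within-group summation can be sketched in Python as follows. The function name `residual_diversity`, the string representation of haplotypes, and the example data are assumptions made for illustration; the sketch is not the procedure's own code.

```python
from collections import defaultdict

def residual_diversity(haplotypes, freqs, selected):
    """Return ([R_1, ..., R_M], R) given 0-based indices of the selected loci."""
    M = len(haplotypes[0])
    # Partition haplotype indices into groups that share alleles at the selected loci
    groups = defaultdict(list)
    for idx, hap in enumerate(haplotypes):
        key = tuple(hap[s] for s in selected)
        groups[key].append(idx)
    R_m = [0.0] * M
    for members in groups.values():
        for m in range(M):
            # Within-group double sum over ordered pairs (i, j)
            for i in members:
                for j in members:
                    if haplotypes[i][m] != haplotypes[j][m]:
                        R_m[m] += freqs[i] * freqs[j]
    return R_m, sum(R_m)

# Made-up example: select only the first locus
haps = ["AAG", "ATG", "CTC"]
freqs = [0.5, 0.3, 0.2]
R_m, R = residual_diversity(haps, freqs, selected=[0])
# R_m[0] is 0 because locus 0 is selected; R is approximately 0.3
```

In this example the first locus splits the haplotypes into the groups {AAG, ATG} and {CTC}. The only remaining within-group difference is at the second locus, which contributes $2 \times 0.5 \times 0.3 = 0.3$.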

Evaluating Sets of htSNPs

One of two criteria for finding the optimal set of htSNPs can be selected with the CRITERION= option. Using the diversity measures previously defined, the proportion of diversity explained (PDE) by a candidate SNP set can be calculated to evaluate the goodness of the set (Clayton 2002):

$\mbox{PDE}=1-\frac{R}{D}$

The higher (that is, closer to 1) the value of PDE is, the better the set of htSNPs is for explaining the diversity among the haplotypes.
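As a rough illustration, PDE can be computed and used to rank candidate sets as in the following Python sketch. The function names, the input format, and the exhaustive search over all subsets of a fixed size are assumptions made for this example; the source does not state how the procedure searches candidate sets.

```python
from itertools import combinations

def pairwise_diversity(haplotypes, freqs, selected=()):
    """Diversity summed over loci, restricted to pairs that agree at the selected loci.

    With selected=() every pair is in one group, so the result is D;
    otherwise the result is the residual diversity R.
    """
    total = 0.0
    for hi, fi in zip(haplotypes, freqs):
        for hj, fj in zip(haplotypes, freqs):
            # Two haplotypes are in the same group if they match at all selected loci
            if all(hi[s] == hj[s] for s in selected):
                total += fi * fj * sum(a != b for a, b in zip(hi, hj))
    return total

def pde(haplotypes, freqs, selected):
    """Proportion of diversity explained: 1 - R / D."""
    D = pairwise_diversity(haplotypes, freqs)
    R = pairwise_diversity(haplotypes, freqs, selected)
    return 1.0 - R / D

def best_set_by_pde(haplotypes, freqs, size):
    """Exhaustive search (an assumption of this sketch) over locus subsets of a given size."""
    M = len(haplotypes[0])
    return max(combinations(range(M), size),
               key=lambda s: pde(haplotypes, freqs, s))

# Made-up example data
haps = ["AAG", "ATG", "CTC"]
freqs = [0.5, 0.3, 0.2]
best_single = best_set_by_pde(haps, freqs, size=1)   # the second locus, index 1
```

In this example, selecting loci 0 and 1 together distinguishes all three haplotypes, so $R=0$ and PDE equals 1. When several sets tie, `max` returns the first one it encounters.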

Alternatively, the approach of Stram et al. (2003) is implemented when CRITERION=RSQH. For these computations, define $\delta_h(\mathcal{H}_i)$ to be the actual number of copies of haplotype $h$ that an individual with the $M$-locus haplotype pair $\mathcal{H}_i$ (usually unknown) and genotype $G_i$ carries. Assuming Hardy-Weinberg equilibrium, this can be estimated as

$E[\delta _ h(\mathcal{H}_ i)|G_ i] = \frac{ \sum _{j\in H_ i} \delta _ h (h_ j, h_ j^{ci})f_ j f_ j^{ci}}{\sum _{j\in H_ i}f_ j f_ j^{ci}}$

where $H_i$ is the set of haplotype pairs, $h_j$ and its complement $h_j^{ci}$, compatible with genotype $G_i$. Then $R^2_h$ can be defined as follows for each haplotype $h$:

$R^2_ h = \frac{\mathrm{Var}\{ E[\delta _ h(\mathcal{H}_ i)|G_ i]\} }{2f_ h(1-f_ h)} = \frac{\sum _ i\{ [E(\delta _ h (\mathcal{H}_ i)|G_ i)]^2 \Pr (G_ i)\} -4f_ h^2}{2f_ h(1-f_ h)}$

with $G_i$ representing each possible genotype at the selected SNPs and $\Pr(G_i)=\sum_{j\in H_i} f_j f_j^{ci}$. The set of SNPs with the highest (that is, closest to 1) value of $\min_h R^2_h$ is selected as the best set of htSNPs, because it optimizes the predictability of the common haplotypes (Stram et al. 2003).
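The $R^2_h$ computation can be sketched in Python as follows. Several choices here are assumptions made for illustration rather than details given in the source: the function name `r2h`, the string representation of haplotypes, the use of ordered haplotype pairs so that the $\Pr(G_i)$ values sum to 1 under Hardy-Weinberg equilibrium, and taking the minimum over all haplotypes rather than over a chosen subset of common ones.

```python
from collections import defaultdict

def r2h(haplotypes, freqs, selected):
    """Return a list with R^2_h for each haplotype h, given selected locus indices."""
    n = len(haplotypes)
    pr_g = defaultdict(float)                    # Pr(G_i) for each genotype
    num = defaultdict(lambda: [0.0] * n)         # sum of delta_h * f_j * f_k per genotype
    # Enumerate ordered haplotype pairs (j, k); under HWE the pair has probability f_j * f_k
    for j in range(n):
        for k in range(n):
            # Unphased genotype at the selected loci: an unordered allele pair per locus
            g = tuple(tuple(sorted((haplotypes[j][s], haplotypes[k][s])))
                      for s in selected)
            w = freqs[j] * freqs[k]
            pr_g[g] += w
            num[g][j] += w                       # one copy of haplotype j
            num[g][k] += w                       # one copy of haplotype k (two if j == k)
    result = []
    for h in range(n):
        # sum_i E[delta_h | G_i]^2 * Pr(G_i), with E[delta_h | G_i] = num / Pr(G_i)
        second_moment = sum(num[g][h] ** 2 / pr_g[g] for g in pr_g)
        variance = second_moment - 4.0 * freqs[h] ** 2
        result.append(variance / (2.0 * freqs[h] * (1.0 - freqs[h])))
    return result

# Made-up example: select only the second locus (index 1)
haps = ["AAG", "ATG", "CTC"]
freqs = [0.5, 0.3, 0.2]
values = r2h(haps, freqs, selected=[1])
criterion = min(values)                          # value compared across candidate sets
```

In this example the first haplotype is the only one carrying allele A at the selected locus, so its copy number is fully determined by the genotype and its $R^2_h$ is 1. The other two haplotypes share allele T and are predicted less well, which lowers $\min_h R^2_h$.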