The HTSNP Procedure (Experimental)

Example 8.1 Using the HAPLOTYPE and HTSNP Procedures Together

Before using PROC HTSNP, you might need to run PROC HAPLOTYPE (see Chapter 7: The HAPLOTYPE Procedure, for more details) if you have data with unknown phase in order to estimate the haplotype frequencies. This example demonstrates how output from PROC HAPLOTYPE can be manipulated to be in the appropriate form for an input data set for PROC HTSNP.

The following data set contains 150 individuals with genotypes at 13 SNPs that were simulated to mimic the frequencies of SNPs in the CASP8 gene (Johnson et al., 2001).

data casp8;
   input id (m1-m13) ($);
   datalines;
1 T/T T/T A/G G/G C/G A/G A/G G/C C/C G/G A/A A/G A/C
2 G/T T/T A/G T/G C/G G/G G/G C/C C/C G/G A/A A/G C/C
3 T/T C/T G/A G/G C/C G/A G/G C/G C/C G/G A/A A/A C/A
4 T/T T/C G/G G/G C/C G/G G/G C/C C/C G/G A/A A/A C/C

   ... more lines ...   

150 T/T T/T A/A G/G C/C A/A A/G G/G C/C G/G A/A A/A A/A
;

The following code can be used to first estimate haplotype frequencies by using the EM algorithm, and then to identify the haplotype tag SNPs.

ods output haplotypefreq=freqout(keep=haplotype freq);
proc haplotype data=casp8 genocol cutoff=0.0075;
   var m1-m13;
run;

data hapfreq;
   set freqout;
   array m{13} $ 1;
   do i = 1 to 13;
    m{i} = substr(haplotype, 2*i-1, 1);
   end;
   drop haplotype i;
run;

proc htsnp data=hapfreq size=4 method=sa best=5 cutoff=0.05
           seed=123 nosumm;
   var m1-m13;
   freq freq;
run;

The ODS statement is used to create a data set from the Haplotype Frequencies ODS table, which is displayed in its table form in Output 8.1.1.

Output 8.1.1: ODS Table Containing Haplotype Frequencies

The HAPLOTYPE Procedure

Haplotype Frequencies
Number Haplotype Freq Standard
Error
95% Confidence Limits
1 G-T-A-T-C-G-G-C-C-G-A-A-C 0.01988 0.00807 0.00406 0.03570
2 T-C-G-G-C-G-G-C-C-G-A-A-C 0.09173 0.01669 0.05902 0.12445
3 T-T-A-G-C-A-A-G-C-G-A-A-A 0.16666 0.02155 0.12442 0.20890
4 T-T-A-G-C-A-G-G-C-G-A-A-A 0.05667 0.01337 0.03046 0.08287
5 T-T-A-G-C-A-G-G-C-G-G-A-A 0.03663 0.01086 0.01534 0.05793
6 T-T-G-G-C-A-G-G-C-G-A-A-A 0.01579 0.00721 0.00166 0.02992
7 T-T-G-G-C-G-G-C-C-G-A-A-C 0.40576 0.02840 0.35011 0.46142
8 T-T-G-G-C-G-G-G-T-C-A-A-A 0.02667 0.00932 0.00841 0.04493
9 T-T-G-G-C-G-G-G-T-G-A-A-A 0.00861 0.00534 0.00000 0.01908
10 T-T-G-G-G-G-G-C-C-G-A-G-C 0.16250 0.02133 0.12069 0.20432


With this table in the form of a SAS data set, the preceding DATA step code can be used to convert it to an input data set for PROC HTSNP, using the estimated frequencies from PROC HAPLOTYPE as the FREQ variable. In this example, the simulated annealing search method is specified for finding the best sets of size four. The htSNP Evaluation table that is created by PROC HTSNP is displayed in Output 8.1.2 to show the best five sets of SNPs that were selected.

Output 8.1.2: Candidate Sets of htSNPs from PROC HTSNP

The HTSNP Procedure

htSNP Evaluation
Rank HTSNP1 HTSNP2 HTSNP3 HTSNP4 PDE
1 m2 m5 m7 m13 1.0000
1 m2 m7 m8 m12 1.0000
1 m2 m5 m7 m8 1.0000
1 m2 m7 m12 m13 1.0000
1 m2 m5 m6 m7 1.0000


Note that the last selection shown in Output 8.1.2 matches the set of htSNPs found by Johnson et al. (2001).