Before using PROC HTSNP, you might need to run PROC HAPLOTYPE (see Chapter 7: The HAPLOTYPE Procedure, for more details) if you have data with unknown phase in order to estimate the haplotype frequencies. This example demonstrates how output from PROC HAPLOTYPE can be manipulated to be in the appropriate form for an input data set for PROC HTSNP.
The following data set contains 150 individuals with genotypes at 13 SNPs that were simulated to mimic the frequencies of SNPs in the CASP8 gene (Johnson et al., 2001).
data casp8; input id (m1-m13) ($); datalines; 1 T/T T/T A/G G/G C/G A/G A/G G/C C/C G/G A/A A/G A/C 2 G/T T/T A/G T/G C/G G/G G/G C/C C/C G/G A/A A/G C/C 3 T/T C/T G/A G/G C/C G/A G/G C/G C/C G/G A/A A/A C/A 4 T/T T/C G/G G/G C/C G/G G/G C/C C/C G/G A/A A/A C/C ... more lines ... 150 T/T T/T A/A G/G C/C A/A A/G G/G C/C G/G A/A A/A A/A ;
The following code can be used to first estimate haplotype frequencies by using the EM algorithm, and then to identify the haplotype tag SNPs.
ods output haplotypefreq=freqout(keep=haplotype freq); proc haplotype data=casp8 genocol cutoff=0.0075; var m1-m13; run; data hapfreq; set freqout; array m{13} $ 1; do i = 1 to 13; m{i} = substr(haplotype, 2*i-1, 1); end; drop haplotype i; run; proc htsnp data=hapfreq size=4 method=sa best=5 cutoff=0.05 seed=123 nosumm; var m1-m13; freq freq; run;
The ODS statement is used to create a data set from the “Haplotype Frequencies” ODS table, which is displayed in its table form in Output 8.1.1.
Output 8.1.1: ODS Table Containing Haplotype Frequencies
Haplotype Frequencies | |||||
---|---|---|---|---|---|
Number | Haplotype | Freq | Standard Error |
95% Confidence Limits | |
1 | G-T-A-T-C-G-G-C-C-G-A-A-C | 0.01988 | 0.00807 | 0.00406 | 0.03570 |
2 | T-C-G-G-C-G-G-C-C-G-A-A-C | 0.09173 | 0.01669 | 0.05902 | 0.12445 |
3 | T-T-A-G-C-A-A-G-C-G-A-A-A | 0.16666 | 0.02155 | 0.12442 | 0.20890 |
4 | T-T-A-G-C-A-G-G-C-G-A-A-A | 0.05667 | 0.01337 | 0.03046 | 0.08287 |
5 | T-T-A-G-C-A-G-G-C-G-G-A-A | 0.03663 | 0.01086 | 0.01534 | 0.05793 |
6 | T-T-G-G-C-A-G-G-C-G-A-A-A | 0.01579 | 0.00721 | 0.00166 | 0.02992 |
7 | T-T-G-G-C-G-G-C-C-G-A-A-C | 0.40576 | 0.02840 | 0.35011 | 0.46142 |
8 | T-T-G-G-C-G-G-G-T-C-A-A-A | 0.02667 | 0.00932 | 0.00841 | 0.04493 |
9 | T-T-G-G-C-G-G-G-T-G-A-A-A | 0.00861 | 0.00534 | 0.00000 | 0.01908 |
10 | T-T-G-G-G-G-G-C-C-G-A-G-C | 0.16250 | 0.02133 | 0.12069 | 0.20432 |
With this table in the form of a SAS data set, the preceding DATA step code can be used to convert it to an input data set for PROC HTSNP, using the estimated frequencies from PROC HAPLOTYPE as the FREQ variable. In this example, the simulated annealing search method is specified for finding the best sets of size four. The “htSNP Evaluation” table that is created by PROC HTSNP is displayed in Output 8.1.2 to show the best five sets of SNPs that were selected.
Output 8.1.2: Candidate Sets of htSNPs from PROC HTSNP
htSNP Evaluation | |||||
---|---|---|---|---|---|
Rank | HTSNP1 | HTSNP2 | HTSNP3 | HTSNP4 | PDE |
1 | m2 | m5 | m7 | m13 | 1.0000 |
1 | m2 | m7 | m8 | m12 | 1.0000 |
1 | m2 | m5 | m7 | m8 | 1.0000 |
1 | m2 | m7 | m12 | m13 | 1.0000 |
1 | m2 | m5 | m6 | m7 | 1.0000 |
Note that the last selection shown in Output 8.1.2 matches the set of htSNPs found by Johnson et al. (2001).