Suppose you have genotyped 25 individuals at five markers. You want to examine some basic properties of these markers, such as whether they are in Hardy-Weinberg equilibrium (HWE), how many alleles each has, what genotypes appear in the data, and whether there is linkage disequilibrium between any pairs of markers. You have ten columns of data, with the first two columns containing the two alleles at the first marker, the next two columns containing the two alleles at the second marker, and so on. There is one row per individual. You input your data as follows:
data markers;
   input (a1-a10) ($);
   datalines;
B B A B B B A A B B
A A B B A B A B C C
B B A A B B B B A C
A B A B A B A B A B
A A A B A B B B C C
B B A A A B A B C C
A B B B A B A A A B
A B A A A A A A A A
B B A A A A A B B B
A B A B A B B B A C
A A A B A A A B B C
B B A B A B A B A C
A B B B A A A B A C
B B B B A A A A A B
A B A A A B A A C C
A B A A A B A B C C
B B A A A A A B A A
A A A B A A A B A B
A B A A A A B B C C
A A A A A A A A B B
A B B B A A A A C C
A B A B A B A A B B
B B A B A B A A A C
A B A A A B A B A C
A B B B B B A B B B
;
You can now use PROC ALLELE to examine the frequencies of alleles and genotypes in your data and to see whether these frequencies occur in the proportions you would expect. The following statements perform the analysis you want:
proc allele data=markers outstat=ld prefix=Marker
            perms=10000 boot=1000 seed=123;
   var a1-a10;
run;

proc print data=ld;
run;
This analysis uses 10,000 permutations to approximate an exact p-value for the HWE test and 1,000 bootstrap samples to obtain confidence intervals for the allele frequencies and the one-locus Hardy-Weinberg disequilibrium (HWD) coefficients. The starting seed for the random number generator is 123. The PREFIX= option requests that the five markers be named Marker1–Marker5. Because the BOOTSTRAP= option is specified (as BOOT=1000) and the ALPHA= option is omitted, 95% confidence intervals are calculated by default.
All five markers are included in the analysis since the ten variables containing the alleles for those five markers were specified in the VAR statement.
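The permutation approach to the exact HWE test can be sketched outside SAS. The following Python sketch is an illustration under stated assumptions, not the procedure's actual implementation: it assumes the test statistic is the probability of the observed genotype counts given the allele counts, and that the p-value is the share of allele shufflings whose probability is no larger than the observed one. The function names `config_weight` and `permutation_hwe_pvalue` are hypothetical, the Marker5 genotype counts were tallied by hand from columns a9 and a10 of the input data, and only 2,000 permutations are used to keep the run short. Python's random number generator differs from the one SAS uses, so seed 123 is not expected to reproduce the SAS value exactly.

```python
import random
from collections import Counter
from fractions import Fraction
from math import factorial


def config_weight(alleles):
    """Relative probability of a genotype sample, given its allele counts.

    Consecutive pairs in `alleles` form one individual's genotype. Factors
    shared by every permutation are dropped, leaving 2**H / prod(n_uv!),
    where H is the number of heterozygotes and n_uv are the genotype counts.
    """
    genotypes = Counter(
        tuple(sorted(alleles[i:i + 2])) for i in range(0, len(alleles), 2)
    )
    heterozygotes = sum(n for (u, v), n in genotypes.items() if u != v)
    denominator = 1
    for n in genotypes.values():
        denominator *= factorial(n)
    return Fraction(2 ** heterozygotes, denominator)


def permutation_hwe_pvalue(alleles, perms, seed):
    """Share of shuffled samples that are at least as unlikely as the data."""
    rng = random.Random(seed)
    observed = config_weight(alleles)
    pool = list(alleles)
    hits = 0
    for _ in range(perms):
        rng.shuffle(pool)  # re-pair alleles at random across individuals
        if config_weight(pool) <= observed:
            hits += 1
    return hits / perms


# Marker5 genotype counts (assumption: tallied from columns a9 and a10).
marker5 = {"AA": 2, "AB": 4, "AC": 6, "BB": 5, "BC": 1, "CC": 7}
alleles = [a for genotype, n in marker5.items() for _ in range(n) for a in genotype]

p_value = permutation_hwe_pvalue(alleles, perms=2000, seed=123)
print(f"approximate exact p-value for Marker5: {p_value:.4f}")
```

Exact fractions are used for the comparison so that ties between equally likely configurations are not broken by floating-point rounding.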
The marker data can alternatively be read in as columns of genotypes instead of columns of alleles by using the GENOCOL and DELIMITER= options in the PROC ALLELE statement, with just one column per marker. The following DATA step and SAS code could be used to produce the same output by using data in this alternative format:
data markers;
   input (g1-g5) ($);
   datalines;
B/B A/B B/B A/A B/B
A/A B/B A/B A/B C/C
B/B A/A B/B B/B A/C

   ... more lines ...

B/B A/B A/B A/A A/C
A/B A/A A/B A/B A/C
A/B B/B B/B A/B B/B
;
proc allele data=markers outstat=ld prefix=Marker
            perms=10000 boot=1000 seed=123
            genocol delimiter='/';
   var g1-g5;
run;

proc print data=ld;
run;
Note that the DELIMITER= option, which indicates the character or string that separates the alleles that compose a genotype, could have been omitted in this example since '/' is the default.
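To make the relationship between the two input layouts concrete, the short Python sketch below splits one row of genotype-format data on the delimiter and recovers the corresponding row of allele-format data. It only illustrates the data layout and does not describe how PROC ALLELE parses its input; the helper name `split_genotype` is hypothetical.

```python
def split_genotype(genotype, delimiter="/"):
    """Split a genotype string such as 'A/B' into its two alleles."""
    first, second = genotype.split(delimiter)
    return first, second


# First data row of the genotype-format example (variables g1-g5).
genotype_row = "B/B A/B B/B A/A B/B".split()

# Flatten to ten allele values, matching variables a1-a10.
alleles = [allele for g in genotype_row for allele in split_genotype(g)]
print(" ".join(alleles))  # prints: B B A B B B A A B B
```

The printed line is identical to the first data row of the allele-format DATA step.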
The results from the analysis are shown in Figure 3.1 through Figure 3.4.
Figure 3.1: Marker Summary for the ALLELE Procedure
Locus | Number of Indiv | Number of Alleles | Polymorph Info Content | Heterozygosity | Allelic Diversity | HWE Chi-Square | HWE DF | HWE Pr > ChiSq | HWE Prob Exact
---|---|---|---|---|---|---|---|---|---|
Marker1 | 25 | 2 | 0.3714 | 0.4800 | 0.4928 | 0.0169 | 1 | 0.8967 | 1.0000
Marker2 | 25 | 2 | 0.3685 | 0.3600 | 0.4872 | 1.7041 | 1 | 0.1918 | 0.2262
Marker3 | 25 | 2 | 0.3546 | 0.4800 | 0.4608 | 0.0434 | 1 | 0.8350 | 1.0000
Marker4 | 25 | 2 | 0.3648 | 0.4800 | 0.4800 | 0.0000 | 1 | 1.0000 | 1.0000
Marker5 | 25 | 3 | 0.5817 | 0.4400 | 0.6552 | 9.3537 | 3 | 0.0249 | 0.0106
Figure 3.1 displays information about the five markers. From this output, you can conclude that Marker5 is the only marker showing a significant departure from HWE at the 0.05 level (chi-square p-value 0.0249, exact p-value 0.0106).
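The summary measures in Figure 3.1 can be checked by hand from the genotype counts listed in Figure 3.3. The Python sketch below uses the usual textbook definitions, which are assumed here rather than taken from the procedure's documentation: heterozygosity as the observed share of heterozygotes, allelic diversity as one minus the sum of squared allele frequencies, polymorphism information content as allelic diversity minus the sum over allele pairs of 2p_u²p_v², and the HWE statistic as a Pearson chi-square of observed against expected genotype counts. The function name `marker_summary` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def marker_summary(genotype_counts):
    """Summary measures for one marker from counts keyed like 'AA', 'AB'."""
    n = sum(genotype_counts.values())
    allele_counts = Counter()
    for genotype, count in genotype_counts.items():
        for allele in genotype:
            allele_counts[allele] += count
    freq = {a: c / (2 * n) for a, c in allele_counts.items()}

    heterozygosity = sum(
        c for (u, v), c in genotype_counts.items() if u != v
    ) / n
    diversity = 1 - sum(p * p for p in freq.values())
    pic = diversity - sum(
        2 * freq[u] ** 2 * freq[v] ** 2 for u, v in combinations(freq, 2)
    )

    # Pearson chi-square: observed vs. HWE-expected genotype counts.
    chi_square = 0.0
    alleles = sorted(freq)
    for i, u in enumerate(alleles):
        for v in alleles[i:]:
            proportion = freq[u] ** 2 if u == v else 2 * freq[u] * freq[v]
            expected = n * proportion
            observed = genotype_counts.get(u + v, 0)
            chi_square += (observed - expected) ** 2 / expected

    return {
        "heterozygosity": heterozygosity,
        "diversity": diversity,
        "pic": pic,
        "chi_square": chi_square,
    }


marker1 = marker_summary({"AA": 5, "AB": 12, "BB": 8})
marker5 = marker_summary(
    {"AA": 2, "AB": 4, "AC": 6, "BB": 5, "BC": 1, "CC": 7}
)
for name, stats in (("Marker1", marker1), ("Marker5", marker5)):
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Under these definitions the rounded values agree with the Marker1 and Marker5 rows of Figure 3.1, which suggests (but does not prove) that the procedure uses the same formulas.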
Figure 3.2: Allele Frequencies for the ALLELE Procedure
Locus | Allele | Count | Frequency | Standard Error | Lower 95% Confidence Limit | Upper 95% Confidence Limit
---|---|---|---|---|---|---|
Marker1 | A | 22 | 0.4400 | 0.0711 | 0.3000 | 0.5800
Marker1 | B | 28 | 0.5600 | 0.0711 | 0.4200 | 0.7000
Marker2 | A | 29 | 0.5800 | 0.0784 | 0.4200 | 0.7400
Marker2 | B | 21 | 0.4200 | 0.0784 | 0.2600 | 0.5800
Marker3 | A | 32 | 0.6400 | 0.0665 | 0.5200 | 0.7600
Marker3 | B | 18 | 0.3600 | 0.0665 | 0.2400 | 0.4800
Marker4 | A | 30 | 0.6000 | 0.0693 | 0.4600 | 0.7400
Marker4 | B | 20 | 0.4000 | 0.0693 | 0.2600 | 0.5400
Marker5 | A | 14 | 0.2800 | 0.0637 | 0.1400 | 0.4200
Marker5 | B | 15 | 0.3000 | 0.0800 | 0.1600 | 0.4600
Marker5 | C | 21 | 0.4200 | 0.0833 | 0.2800 | 0.6000
Figure 3.2 displays the allele frequencies for each marker with their standard errors and the lower and upper limits of the 95% confidence interval.
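The counts, frequencies, and standard errors in Figure 3.2 can be reproduced from genotype counts. The Python sketch below assumes that the standard error comes from the multinomial sampling variance of genotypes, Var(p_u) = (p_u + P_uu − 2p_u²) / (2n), where P_uu is the frequency of the u/u homozygote and n is the number of individuals. That formula is an assumption chosen because it matches the displayed values; the function name `allele_frequency_table` is hypothetical. The bootstrap confidence limits are not reproduced here because they depend on the random resampling.

```python
from math import sqrt


def allele_frequency_table(genotype_counts):
    """Rows of (allele, count, frequency, standard error) for one marker."""
    n = sum(genotype_counts.values())
    alleles = sorted({a for genotype in genotype_counts for a in genotype})
    rows = []
    for allele in alleles:
        # 'AA'.count('A') is 2, 'AB'.count('A') is 1, 'BB'.count('A') is 0.
        count = sum(g.count(allele) * c for g, c in genotype_counts.items())
        p = count / (2 * n)
        homozygote_freq = genotype_counts.get(allele + allele, 0) / n
        std_err = sqrt((p + homozygote_freq - 2 * p * p) / (2 * n))
        rows.append((allele, count, p, std_err))
    return rows


# Marker5 genotype counts as listed in Figure 3.3.
marker5_rows = allele_frequency_table(
    {"AA": 2, "AB": 4, "AC": 6, "BB": 5, "BC": 1, "CC": 7}
)
for allele, count, p, std_err in marker5_rows:
    print(f"{allele}  {count:2d}  {p:.4f}  {std_err:.4f}")
```

When a marker is exactly in HWE, as Marker4 is in this sample, the formula reduces to the familiar binomial form p(1 − p) / (2n).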
Figure 3.3: Genotype Frequencies for the ALLELE Procedure
Locus | Genotype | Count | Frequency | HWD Coeff | Standard Error | Lower 95% Confidence Limit | Upper 95% Confidence Limit
---|---|---|---|---|---|---|---|
Marker1 | A/A | 5 | 0.2000 | 0.0064 | 0.0493 | -0.0916 | 0.0956
Marker1 | A/B | 12 | 0.4800 | 0.0064 | 0.0493 | -0.0916 | 0.0956
Marker1 | B/B | 8 | 0.3200 | 0.0064 | 0.0493 | -0.0916 | 0.0956
Marker2 | A/A | 10 | 0.4000 | 0.0636 | 0.0477 | -0.0336 | 0.1484
Marker2 | A/B | 9 | 0.3600 | 0.0636 | 0.0477 | -0.0336 | 0.1484
Marker2 | B/B | 6 | 0.2400 | 0.0636 | 0.0477 | -0.0336 | 0.1484
Marker3 | A/A | 10 | 0.4000 | -0.0096 | 0.0457 | -0.1044 | 0.0800
Marker3 | A/B | 12 | 0.4800 | -0.0096 | 0.0457 | -0.1044 | 0.0800
Marker3 | B/B | 3 | 0.1200 | -0.0096 | 0.0457 | -0.1044 | 0.0800
Marker4 | A/A | 9 | 0.3600 | 0.0000 | 0.0480 | -0.0916 | 0.0864
Marker4 | A/B | 12 | 0.4800 | 0.0000 | 0.0480 | -0.0916 | 0.0864
Marker4 | B/B | 4 | 0.1600 | 0.0000 | 0.0480 | -0.0916 | 0.0864
Marker5 | A/A | 2 | 0.0800 | 0.0016 | 0.0405 | -0.0756 | 0.0816
Marker5 | A/B | 4 | 0.1600 | 0.0040 | 0.0337 | -0.0664 | 0.0636
Marker5 | A/C | 6 | 0.2400 | -0.0024 | 0.0380 | -0.0736 | 0.0680
Marker5 | B/B | 5 | 0.2000 | 0.1100 | 0.0445 | 0.0144 | 0.1884
Marker5 | B/C | 1 | 0.0400 | 0.1060 | 0.0282 | 0.0440 | 0.1564
Marker5 | C/C | 7 | 0.2800 | 0.1036 | 0.0453 | 0.0096 | 0.1884
Figure 3.3 displays the genotype frequencies for each marker with the associated disequilibrium coefficient, its standard error, and the 95% confidence limits.
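The HWD coefficients in Figure 3.3 are consistent with a simple definition that can be verified by hand: for a homozygote u/u, the coefficient is P_uu − p_u², and for a heterozygote u/v it is p_u p_v − P_uv / 2, where P denotes an observed genotype frequency and p an allele frequency. The Python sketch below applies that definition to Marker5. The definition is inferred from the displayed values rather than quoted from the procedure's documentation, and the function name `hwd_coefficients` is hypothetical.

```python
def hwd_coefficients(genotype_counts):
    """Hardy-Weinberg disequilibrium coefficient for every genotype."""
    n = sum(genotype_counts.values())
    alleles = sorted({a for genotype in genotype_counts for a in genotype})
    freq = {
        a: sum(g.count(a) * c for g, c in genotype_counts.items()) / (2 * n)
        for a in alleles
    }
    coefficients = {}
    for i, u in enumerate(alleles):
        for v in alleles[i:]:
            genotype_freq = genotype_counts.get(u + v, 0) / n
            if u == v:
                # Excess of homozygotes relative to HWE.
                coefficients[u + v] = genotype_freq - freq[u] ** 2
            else:
                # Deficit of heterozygotes relative to HWE.
                coefficients[u + v] = freq[u] * freq[v] - genotype_freq / 2
    return coefficients


marker5_hwd = hwd_coefficients(
    {"AA": 2, "AB": 4, "AC": 6, "BB": 5, "BC": 1, "CC": 7}
)
for genotype, d in marker5_hwd.items():
    print(f"{genotype[0]}/{genotype[1]}  {d:+.4f}")
```

For a marker with two alleles, the three coefficients computed this way are equal, which is why each two-allele marker in Figure 3.3 shows the same value on all three of its rows.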
Figure 3.4: Testing for Disequilibrium Using the ALLELE Procedure
Obs | Locus1 | Locus2 | NIndiv | Distance | Test | ChiSq | DF | ProbChi | ProbEx |
---|---|---|---|---|---|---|---|---|---|
1 | Marker1 | Marker1 | 25 | 0 | HWE | 0.01687 | 1 | 0.89667 | 1.0000 |
2 | Marker1 | Marker2 | 25 | 1 | LD | 1.05799 | 1 | 0.30367 | 0.4882 |
3 | Marker1 | Marker3 | 25 | 2 | LD | 1.42074 | 1 | 0.23328 | 0.8544 |
4 | Marker1 | Marker4 | 25 | 3 | LD | 0.33144 | 1 | 0.56481 | 0.9885 |
5 | Marker1 | Marker5 | 25 | 4 | LD | 2.29785 | 2 | 0.31698 | 0.0940 |
6 | Marker2 | Marker2 | 25 | 0 | HWE | 1.70412 | 1 | 0.19175 | 0.2262 |
7 | Marker2 | Marker3 | 25 | 1 | LD | 0.13798 | 1 | 0.71030 | 0.5096 |
8 | Marker2 | Marker4 | 25 | 2 | LD | 1.34100 | 1 | 0.24686 | 0.6455 |
9 | Marker2 | Marker5 | 25 | 3 | LD | 1.13574 | 2 | 0.56673 | 0.0126 |
10 | Marker3 | Marker3 | 25 | 0 | HWE | 0.04340 | 1 | 0.83497 | 1.0000 |
11 | Marker3 | Marker4 | 25 | 1 | LD | 0.46296 | 1 | 0.49624 | 0.9712 |
12 | Marker3 | Marker5 | 25 | 2 | LD | 0.95899 | 2 | 0.61909 | 0.0261 |
13 | Marker4 | Marker4 | 25 | 0 | HWE | 0.00000 | 1 | 1.00000 | 1.0000 |
14 | Marker4 | Marker5 | 25 | 1 | LD | 6.16071 | 2 | 0.04594 | 0.1281 |
15 | Marker5 | Marker5 | 25 | 0 | HWE | 9.35374 | 3 | 0.02494 | 0.0106 |
Figure 3.4 displays the output data set created using the OUTSTAT= option of the PROC ALLELE statement. This data set contains the statistics for testing individual markers for HWE and marker pairs for linkage disequilibrium.
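The ProbChi column in Figure 3.4 can be recovered from the ChiSq and DF columns as the upper-tail probability of a chi-square distribution. The Python sketch below uses closed-form expressions for 1, 2, and 3 degrees of freedom, the only values that occur in this example, so that no statistics library is needed. The function name `chi_square_upper_tail` is hypothetical. Because the inputs are the rounded ChiSq values from the figure, agreement is expected only to about four decimal places.

```python
from math import erfc, exp, pi, sqrt


def chi_square_upper_tail(x, df):
    """P(X > x) for a chi-square variable with 1, 2, or 3 degrees of freedom."""
    if df == 1:
        return erfc(sqrt(x / 2))
    if df == 2:
        return exp(-x / 2)
    if df == 3:
        return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
    raise ValueError("this sketch only handles df = 1, 2, or 3")


# (Locus1, Locus2, ChiSq, DF) taken from rows 1, 5, 14, and 15 of Figure 3.4.
rows = [
    ("Marker1", "Marker1", 0.01687, 1),
    ("Marker1", "Marker5", 2.29785, 2),
    ("Marker4", "Marker5", 6.16071, 2),
    ("Marker5", "Marker5", 9.35374, 3),
]
prob_chi = [chi_square_upper_tail(chi_sq, df) for _, _, chi_sq, df in rows]
for (locus1, locus2, _, _), p in zip(rows, prob_chi):
    print(f"{locus1} x {locus2}: {p:.5f}")
```

The DF values in the figure are consistent with k(k − 1)/2 for an HWE test at a marker with k alleles and (k1 − 1)(k2 − 1) for an LD test between markers with k1 and k2 alleles, although the source does not state these formulas.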