The ALLELE Procedure

Example

Suppose you have genotyped 25 individuals at five markers. You want to examine some basic properties of these markers, such as whether they are in Hardy-Weinberg equilibrium (HWE), how many alleles each has, what genotypes appear in the data, and whether there is linkage disequilibrium between any pairs of markers. You have ten columns of data: the first two columns contain the pair of alleles at the first marker, the next two columns contain the pair of alleles at the second marker, and so on. There is one row per individual. You input your data as follows:

data markers;
   input (a1-a10) ($);
   datalines;
B  B  A  B  B  B  A  A  B  B
A  A  B  B  A  B  A  B  C  C
B  B  A  A  B  B  B  B  A  C
A  B  A  B  A  B  A  B  A  B
A  A  A  B  A  B  B  B  C  C
B  B  A  A  A  B  A  B  C  C
A  B  B  B  A  B  A  A  A  B
A  B  A  A  A  A  A  A  A  A
B  B  A  A  A  A  A  B  B  B
A  B  A  B  A  B  B  B  A  C
A  A  A  B  A  A  A  B  B  C
B  B  A  B  A  B  A  B  A  C
A  B  B  B  A  A  A  B  A  C
B  B  B  B  A  A  A  A  A  B
A  B  A  A  A  B  A  A  C  C
A  B  A  A  A  B  A  B  C  C
B  B  A  A  A  A  A  B  A  A
A  A  A  B  A  A  A  B  A  B
A  B  A  A  A  A  B  B  C  C
A  A  A  A  A  A  A  A  B  B
A  B  B  B  A  A  A  A  C  C
A  B  A  B  A  B  A  A  B  B
B  B  A  B  A  B  A  A  A  C
A  B  A  A  A  B  A  B  A  C
A  B  B  B  B  B  A  B  B  B
;

You can now use PROC ALLELE to examine the frequencies of alleles and genotypes in your data and to see whether these frequencies occur in the proportions you would expect. The following statements perform the analysis you want:

proc allele data=markers outstat=ld prefix=Marker
            perms=10000 boot=1000 seed=123;
   var a1-a10;
run; 

proc print data=ld;
run;

This analysis uses 10,000 permutations to approximate an exact p-value for the HWE test and 1,000 bootstrap samples to obtain confidence intervals for the allele frequencies and the one-locus Hardy-Weinberg disequilibrium (HWD) coefficients. The starting seed for the random number generator is 123. The PREFIX= option requests that the five markers be named Marker1 through Marker5. Because the BOOTSTRAP= option is specified but the ALPHA= option is omitted, 95% confidence intervals are calculated by default.
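If a different confidence level is wanted, the ALPHA= option can be given explicitly. The following variation is a minimal sketch; it assumes that ALPHA= takes the significance level, so that the default of 95% corresponds to ALPHA=0.05 and ALPHA=0.10 would request 90% confidence limits. All other options are unchanged from the statements above.

```sas
/* Sketch: same analysis, but requesting 90% bootstrap confidence limits. */
/* Assumption: ALPHA= is the significance level (0.05 gives 95% limits).  */
proc allele data=markers prefix=Marker
            perms=10000 boot=1000 seed=123 alpha=0.10;
   var a1-a10;
run;
```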

All five markers are included in the analysis since the ten variables containing the alleles for those five markers were specified in the VAR statement.

The marker data can alternatively be read in as columns of genotypes instead of columns of alleles by using the GENOCOL and DELIMITER= options in the PROC ALLELE statement, with just one column per marker. The following DATA step and PROC ALLELE statements produce the same output from data in this alternative format:

data markers;
   input (g1-g5) ($);
   datalines;                                                    
B/B  A/B  B/B  A/A  B/B
A/A  B/B  A/B  A/B  C/C
B/B  A/A  B/B  B/B  A/C

   ... more lines ...   

B/B  A/B  A/B  A/A  A/C
A/B  A/A  A/B  A/B  A/C
A/B  B/B  B/B  A/B  B/B
;
proc allele data=markers outstat=ld prefix=Marker
            perms=10000 boot=1000 seed=123 genocol delimiter='/';
   var g1-g5;
run; 
proc print data=ld;
run;

Note that the DELIMITER= option, which indicates the character or string that separates the alleles that compose a genotype, could have been omitted in this example since '/' is the default.
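If the data already exist in the two-columns-per-marker layout, the genotype layout does not have to be typed in again. The following DATA step is a sketch of one way to derive it; the output data set name markers_geno and the loop index m are hypothetical, and the step assumes that markers is the allele-format data set with variables a1-a10.

```sas
/* Sketch: build one genotype column per marker from pairs of allele columns. */
data markers_geno;                 /* hypothetical output data set name */
   set markers;                    /* allele-format input: a1-a10       */
   array a{10} $ a1-a10;           /* existing allele columns           */
   array g{5}  $ g1-g5;            /* new genotype columns              */
   do m = 1 to 5;
      /* join the two alleles at marker m with the '/' delimiter */
      g{m} = catx('/', a{2*m - 1}, a{2*m});
   end;
   keep g1-g5;
run;
```

The resulting data set could then be analyzed with the GENOCOL option and `var g1-g5;` as in the statements above.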

The results from the analysis are shown in Figure 3.1 through Figure 3.4.

Figure 3.1: Marker Summary for the ALLELE Procedure

The ALLELE Procedure

Marker Summary

         Number   Number   Polymorph                             Test for HWE
         of       of       Info                       Allelic    Chi-                       Prob
Locus    Indiv    Alleles  Content    Heterozygosity  Diversity  Square   DF   Pr > ChiSq   Exact
Marker1  25       2        0.3714     0.4800          0.4928     0.0169   1    0.8967       1.0000
Marker2  25       2        0.3685     0.3600          0.4872     1.7041   1    0.1918       0.2262
Marker3  25       2        0.3546     0.4800          0.4608     0.0434   1    0.8350       1.0000
Marker4  25       2        0.3648     0.4800          0.4800     0.0000   1    1.0000       1.0000
Marker5  25       3        0.5817     0.4400          0.6552     9.3537   3    0.0249       0.0106


Figure 3.1 displays summary information about the five markers. From this output, you can conclude that Marker5 is the only marker showing a significant departure from HWE at the 0.05 level (chi-square p-value 0.0249, exact p-value 0.0106).
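The Marker1 row of Figure 3.1 can be checked by hand from the genotype counts that appear in Figure 3.3 (5 A/A, 12 A/B, 8 B/B). The DATA step below is a sketch that uses the standard textbook definitions of heterozygosity, allelic diversity, polymorphism information content, and the Pearson chi-square test for HWE; that PROC ALLELE uses exactly these formulas is an assumption, although they do reproduce the reported values for this marker. All variable names are illustrative.

```sas
/* Sketch: hand check of the Marker1 summary statistics. */
data _null_;
   n = 25; nAA = 5; nAB = 12; nBB = 8;      /* genotype counts for Marker1 */

   pA = (2*nAA + nAB) / (2*n);              /* frequency of allele A */
   pB = 1 - pA;                             /* frequency of allele B */

   het       = nAB / n;                     /* observed heterozygosity */
   diversity = 1 - (pA**2 + pB**2);         /* allelic diversity */
   pic       = diversity - 2*pA**2*pB**2;   /* polymorphism information content */

   /* Pearson chi-square: observed versus HWE-expected genotype counts */
   eAA = n*pA**2;  eAB = n*2*pA*pB;  eBB = n*pB**2;
   chisq = (nAA - eAA)**2/eAA + (nAB - eAB)**2/eAB + (nBB - eBB)**2/eBB;
   pvalue = 1 - probchi(chisq, 1);          /* 1 DF for a two-allele marker */

   put pic= 6.4 het= 6.4 diversity= 6.4 chisq= 6.4 pvalue= 6.4;
run;
```

The values written to the log should agree with the Marker1 row: 0.3714, 0.4800, 0.4928, 0.0169, and 0.8967.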

Figure 3.2: Allele Frequencies for the ALLELE Procedure

Allele Frequencies

                                   Standard  95% Confidence
Locus    Allele  Count  Frequency  Error     Limits
Marker1  A       22     0.4400     0.0711    0.3000  0.5800
         B       28     0.5600     0.0711    0.4200  0.7000
Marker2  A       29     0.5800     0.0784    0.4200  0.7400
         B       21     0.4200     0.0784    0.2600  0.5800
Marker3  A       32     0.6400     0.0665    0.5200  0.7600
         B       18     0.3600     0.0665    0.2400  0.4800
Marker4  A       30     0.6000     0.0693    0.4600  0.7400
         B       20     0.4000     0.0693    0.2600  0.5400
Marker5  A       14     0.2800     0.0637    0.1400  0.4200
         B       15     0.3000     0.0800    0.1600  0.4600
         C       21     0.4200     0.0833    0.2800  0.6000


Figure 3.2 displays the allele frequencies for each marker with their standard errors and the lower and upper limits of the 95% confidence interval.
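The Count and Frequency columns can be cross-checked without PROC ALLELE by stacking the allele columns and tabulating them. The following is a sketch under stated assumptions: markers is the allele-format data set (a1-a10), and the names alleles_long, locus, allele, and m are hypothetical. It does not reproduce the standard errors or the bootstrap confidence limits.

```sas
/* Sketch: one row per allele copy, labeled by marker. */
data alleles_long;                    /* hypothetical data set name   */
   set markers;                       /* allele-format input: a1-a10  */
   array a{10} $ a1-a10;
   length locus $ 8 allele $ 8;
   do m = 1 to 5;
      locus  = cats('Marker', m);
      allele = a{2*m - 1}; output;    /* first allele at marker m  */
      allele = a{2*m};     output;    /* second allele at marker m */
   end;
   keep locus allele;
run;

/* Cell frequencies are allele counts; row percentages are the */
/* allele frequencies within each marker, expressed in percent. */
proc freq data=alleles_long;
   tables locus*allele / nocol nopercent;
run;
```

For Marker1, for example, the table should show 22 copies of A and 28 copies of B, that is, row percentages of 44 and 56.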

Figure 3.3: Genotype Frequencies for the ALLELE Procedure

Genotype Frequencies

                                      HWD     Standard   95% Confidence
Locus    Genotype  Count  Frequency   Coeff   Error      Limits
Marker1  A/A       5      0.2000      0.0064  0.0493    -0.0916   0.0956
         A/B       12     0.4800      0.0064  0.0493    -0.0916   0.0956
         B/B       8      0.3200      0.0064  0.0493    -0.0916   0.0956
Marker2  A/A       10     0.4000      0.0636  0.0477    -0.0336   0.1484
         A/B       9      0.3600      0.0636  0.0477    -0.0336   0.1484
         B/B       6      0.2400      0.0636  0.0477    -0.0336   0.1484
Marker3  A/A       10     0.4000     -0.0096  0.0457    -0.1044   0.0800
         A/B       12     0.4800     -0.0096  0.0457    -0.1044   0.0800
         B/B       3      0.1200     -0.0096  0.0457    -0.1044   0.0800
Marker4  A/A       9      0.3600      0.0000  0.0480    -0.0916   0.0864
         A/B       12     0.4800      0.0000  0.0480    -0.0916   0.0864
         B/B       4      0.1600      0.0000  0.0480    -0.0916   0.0864
Marker5  A/A       2      0.0800      0.0016  0.0405    -0.0756   0.0816
         A/B       4      0.1600      0.0040  0.0337    -0.0664   0.0636
         A/C       6      0.2400     -0.0024  0.0380    -0.0736   0.0680
         B/B       5      0.2000      0.1100  0.0445     0.0144   0.1884
         B/C       1      0.0400      0.1060  0.0282     0.0440   0.1564
         C/C       7      0.2800      0.1036  0.0453     0.0096   0.1884


Figure 3.3 displays the genotype frequencies for each marker with the associated disequilibrium coefficient, its standard error, and the 95% confidence limits.
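For a marker with more than two alleles, each genotype has its own disequilibrium coefficient. The DATA step below is a sketch that recomputes the Marker5 coefficients from the counts in Figure 3.3. The definitions used (observed frequency minus squared allele frequency for homozygotes, product of allele frequencies minus half the observed frequency for heterozygotes) and their sign conventions are inferred from the values in the figure, not stated in the text, so treat them as an assumption. Variable names are illustrative.

```sas
/* Sketch: hand check of the Marker5 HWD coefficients. */
data _null_;
   n = 25;
   nAA = 2; nAB = 4; nAC = 6; nBB = 5; nBC = 1; nCC = 7;   /* genotype counts */

   /* allele frequencies from genotype counts */
   pA = (2*nAA + nAB + nAC) / (2*n);
   pB = (2*nBB + nAB + nBC) / (2*n);
   pC = (2*nCC + nAC + nBC) / (2*n);

   /* homozygotes: observed frequency minus squared allele frequency */
   dAA = nAA/n - pA**2;
   dBB = nBB/n - pB**2;
   dCC = nCC/n - pC**2;

   /* heterozygotes: product of allele frequencies minus half the observed frequency */
   dAB = pA*pB - (nAB/n)/2;
   dAC = pA*pC - (nAC/n)/2;
   dBC = pB*pC - (nBC/n)/2;

   put dAA= 7.4 dAB= 7.4 dAC= 7.4 dBB= 7.4 dBC= 7.4 dCC= 7.4;
run;
```

The log values should agree with the HWD Coeff column for Marker5: 0.0016, 0.0040, -0.0024, 0.1100, 0.1060, and 0.1036.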

Figure 3.4: Testing for Disequilibrium Using the ALLELE Procedure

Obs Locus1 Locus2 NIndiv Distance Test ChiSq DF ProbChi ProbEx
1 Marker1 Marker1 25 0 HWE 0.01687 1 0.89667 1.0000
2 Marker1 Marker2 25 1 LD 1.05799 1 0.30367 0.4882
3 Marker1 Marker3 25 2 LD 1.42074 1 0.23328 0.8544
4 Marker1 Marker4 25 3 LD 0.33144 1 0.56481 0.9885
5 Marker1 Marker5 25 4 LD 2.29785 2 0.31698 0.0940
6 Marker2 Marker2 25 0 HWE 1.70412 1 0.19175 0.2262
7 Marker2 Marker3 25 1 LD 0.13798 1 0.71030 0.5096
8 Marker2 Marker4 25 2 LD 1.34100 1 0.24686 0.6455
9 Marker2 Marker5 25 3 LD 1.13574 2 0.56673 0.0126
10 Marker3 Marker3 25 0 HWE 0.04340 1 0.83497 1.0000
11 Marker3 Marker4 25 1 LD 0.46296 1 0.49624 0.9712
12 Marker3 Marker5 25 2 LD 0.95899 2 0.61909 0.0261
13 Marker4 Marker4 25 0 HWE 0.00000 1 1.00000 1.0000
14 Marker4 Marker5 25 1 LD 6.16071 2 0.04594 0.1281
15 Marker5 Marker5 25 0 HWE 9.35374 3 0.02494 0.0106


Figure 3.4 displays the output data set created using the OUTSTAT= option of the PROC ALLELE statement. This data set contains the statistics for testing individual markers for HWE and marker pairs for linkage disequilibrium.
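Because the OUTSTAT= data set is an ordinary SAS data set, it can be subset like any other. The following sketch lists only the marker pairs whose linkage disequilibrium chi-square test is significant at the 0.05 level; it relies on the variable names and the Test values ('HWE', 'LD') shown in Figure 3.4, and the 0.05 cutoff is an arbitrary choice for illustration.

```sas
/* Sketch: list marker pairs with a significant LD chi-square test. */
proc print data=ld noobs;
   where Test = 'LD' and ProbChi < 0.05;     /* keep LD rows only */
   var Locus1 Locus2 ChiSq DF ProbChi ProbEx;
run;
```

With the values shown in Figure 3.4, only the Marker4 and Marker5 pair (ProbChi 0.04594) satisfies this condition.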