The ALLELE Procedure

Example

Suppose you have genotyped 25 individuals at five markers. You want to examine some basic properties of these markers, such as whether they are in Hardy-Weinberg equilibrium (HWE), how many alleles each has, what genotypes appear in the data, and whether there is linkage disequilibrium (LD) between any pairs of markers. You have ten columns of data: the first two columns contain the pair of alleles at the first marker, the next two columns contain the pair of alleles at the second marker, and so on. There is one row per individual. You input your data as follows:

data markers;
   input (a1-a10) ($);
   datalines;
B  B  A  B  B  B  A  A  B  B
A  A  B  B  A  B  A  B  C  C
B  B  A  A  B  B  B  B  A  C
A  B  A  B  A  B  A  B  A  B
A  A  A  B  A  B  B  B  C  C
B  B  A  A  A  B  A  B  C  C
A  B  B  B  A  B  A  A  A  B
A  B  A  A  A  A  A  A  A  A
B  B  A  A  A  A  A  B  B  B
A  B  A  B  A  B  B  B  A  C
A  A  A  B  A  A  A  B  B  C
B  B  A  B  A  B  A  B  A  C
A  B  B  B  A  A  A  B  A  C
B  B  B  B  A  A  A  A  A  B
A  B  A  A  A  B  A  A  C  C
A  B  A  A  A  B  A  B  C  C
B  B  A  A  A  A  A  B  A  A
A  A  A  B  A  A  A  B  A  B
A  B  A  A  A  A  B  B  C  C
A  A  A  A  A  A  A  A  B  B
A  B  B  B  A  A  A  A  C  C
A  B  A  B  A  B  A  A  B  B
B  B  A  B  A  B  A  A  A  C
A  B  A  A  A  B  A  B  A  C
A  B  B  B  B  B  A  B  B  B
;

You can now use PROC ALLELE to examine the frequencies of alleles and genotypes in your data and to see whether these frequencies occur in the proportions you would expect. The following statements perform the analysis:

proc allele data=markers outstat=ld prefix=Marker
            perms=10000 boot=1000 seed=123;
   var a1-a10;
run; 

proc print data=ld;
run;

This analysis uses 10,000 permutations to approximate an exact p-value for the HWE test, as well as 1,000 bootstrap samples to obtain the confidence intervals for the allele frequencies and one-locus Hardy-Weinberg disequilibrium (HWD) coefficients. The starting seed for the random number generator is 123. The PREFIX= option requests that the five markers be named Marker1–Marker5. Since the BOOTSTRAP= option is specified but the ALPHA= option is omitted, 95% confidence intervals are calculated by default.

All five markers are included in the analysis since the ten variables containing the alleles for those five markers were specified in the VAR statement.
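If a confidence level other than the default is wanted, the ALPHA= option mentioned above can be supplied explicitly. The following is a minimal sketch, assuming that ALPHA=0.10 requests 90% bootstrap confidence limits; the value 0.10 and the output data set name ld90 are illustrative choices that are not part of the original example.

```sas
/* Sketch: same analysis as above, but with ALPHA=0.10 (illustrative  */
/* value) in place of the default, so that 90% limits are requested.  */
/* LD90 is a hypothetical output data set name.                       */
proc allele data=markers outstat=ld90 prefix=Marker
            perms=10000 boot=1000 seed=123 alpha=0.10;
   var a1-a10;
run;
```

Because the seed is unchanged, only the reported confidence limits would be expected to differ from the default run.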

The marker data can alternatively be read in as columns of genotypes instead of columns of alleles by using the GENOCOL and DELIMITER= options in the PROC ALLELE statement, with just one column per marker. The following DATA step and PROC ALLELE statements could be used to produce the same output from data in this alternative format:

data markers;
   input (g1-g5) ($);
   datalines;                                                    
B/B  A/B  B/B  A/A  B/B
A/A  B/B  A/B  A/B  C/C

   ... more lines ...   

B/B  A/B  A/B  A/A  A/C
A/B  A/A  A/B  A/B  A/C
A/B  B/B  B/B  A/B  B/B
;

proc allele data=markers outstat=ld prefix=Marker
            perms=10000 boot=1000 seed=123 genocol delimiter='/';
   var g1-g5;
run;

proc print data=ld;
run;

Note that the DELIMITER= option, which indicates the character or string that separates the alleles that compose a genotype, could have been omitted in this example since '/' is the default.
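If the data already exist in the allele-column layout, the genotype-column layout need not be typed in again. The following DATA step is a sketch of one way to derive it, assuming that MARKERS still refers to the allele-format data set with variables a1-a10; the data set name markers_geno and the loop index m are hypothetical names introduced for illustration.

```sas
/* Sketch: build genotype columns g1-g5 from allele columns a1-a10.  */
/* Assumes MARKERS is the allele-format data set (a1-a10).           */
/* MARKERS_GENO is a hypothetical name for the converted data set.   */
data markers_geno;
   set markers;
   array a{10} $ a1-a10;      /* two allele columns per marker       */
   array g{5}  $ 3 g1-g5;     /* one genotype column per marker      */
   do m = 1 to 5;
      /* join the two alleles of marker m with the '/' delimiter */
      g{m} = catx('/', a{2*m-1}, a{2*m});
   end;
   keep g1-g5;
run;
```

The resulting data set could then be passed to PROC ALLELE with the GENOCOL option and VAR g1-g5, as in the statements above.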

The results from the analysis are shown in Figure 3.1 through Figure 3.4.

Figure 3.1 Marker Summary for the ALLELE Procedure

The ALLELE Procedure

Marker Summary
						Test for HWE
Locus	Number of Indiv	Number of Alleles	Polymorph Info Content	Heterozygosity	Allelic Diversity	Chi-Square	DF	Pr > ChiSq	Prob Exact
Marker1	25	2	0.3714	0.4800	0.4928	0.0169	1	0.8967	1.0000
Marker2	25	2	0.3685	0.3600	0.4872	1.7041	1	0.1918	0.2262
Marker3	25	2	0.3546	0.4800	0.4608	0.0434	1	0.8350	1.0000
Marker4	25	2	0.3648	0.4800	0.4800	0.0000	1	1.0000	1.0000
Marker5	25	3	0.5817	0.4400	0.6552	9.3537	3	0.0249	0.0106

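Figure 3.1 displays information about the five markers. From this output, you can conclude that Marker5 is the only marker showing a significant departure from HWE at the 0.05 level: both its chi-square p-value (0.0249) and its permutation-based exact p-value (0.0106) fall below 0.05, whereas the p-values for the other four markers are all above 0.19.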
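The summary measures in Figure 3.1 can be checked by hand for a two-allele marker. The following DATA step is a sketch for Marker1, using the genotype counts reported in Figure 3.3 (5 A/A, 12 A/B, 8 B/B). It assumes the usual definitions: heterozygosity as the observed proportion of heterozygotes, allelic diversity as one minus the sum of squared allele frequencies, and PIC as allelic diversity minus twice the product of the squared allele frequencies. The variable names are hypothetical.

```sas
/* Sketch: recompute the Marker1 summary measures from genotype counts. */
/* Counts are taken from Figure 3.3; variable names are illustrative.   */
data _null_;
   nAA = 5; nAB = 12; nBB = 8;
   n  = nAA + nAB + nBB;                 /* number of individuals       */
   pA = (2*nAA + nAB) / (2*n);           /* frequency of allele A       */
   pB = 1 - pA;                          /* frequency of allele B       */
   het       = nAB / n;                  /* observed heterozygosity     */
   diversity = 1 - (pA**2 + pB**2);      /* allelic diversity           */
   pic       = diversity - 2 * pA**2 * pB**2;   /* two-allele PIC       */
   put het= 6.4 diversity= 6.4 pic= 6.4; /* write the values to the log */
run;
```

The values written to the log can be compared with the Marker1 row of Figure 3.1.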

Figure 3.2 Allele Frequencies for the ALLELE Procedure

Allele Frequencies
Locus	Allele	Count	Frequency	Standard Error	95% Confidence Limits
Marker1	A	22	0.4400	0.0711	0.3000	0.5800
	B	28	0.5600	0.0711	0.4200	0.7000
Marker2	A	29	0.5800	0.0784	0.4200	0.7400
	B	21	0.4200	0.0784	0.2600	0.5800
Marker3	A	32	0.6400	0.0665	0.5200	0.7600
	B	18	0.3600	0.0665	0.2400	0.4800
Marker4	A	30	0.6000	0.0693	0.4600	0.7400
	B	20	0.4000	0.0693	0.2600	0.5400
Marker5	A	14	0.2800	0.0637	0.1400	0.4200
	B	15	0.3000	0.0800	0.1600	0.4600
	C	21	0.4200	0.0833	0.2800	0.6000

Figure 3.2 displays the allele frequencies for each marker with their standard errors and the lower and upper limits of the 95% confidence interval.

Figure 3.3 Genotype Frequencies for the ALLELE Procedure

Genotype Frequencies
Locus	Genotype	Count	Frequency	HWD Coeff	Standard Error	95% Confidence Limits
Marker1	A/A	5	0.2000	0.0064	0.0493	-0.0916	0.0956
	A/B	12	0.4800	0.0064	0.0493	-0.0916	0.0956
	B/B	8	0.3200	0.0064	0.0493	-0.0916	0.0956
Marker2	A/A	10	0.4000	0.0636	0.0477	-0.0336	0.1484
	A/B	9	0.3600	0.0636	0.0477	-0.0336	0.1484
	B/B	6	0.2400	0.0636	0.0477	-0.0336	0.1484
Marker3	A/A	10	0.4000	-0.0096	0.0457	-0.1044	0.0800
	A/B	12	0.4800	-0.0096	0.0457	-0.1044	0.0800
	B/B	3	0.1200	-0.0096	0.0457	-0.1044	0.0800
Marker4	A/A	9	0.3600	0.0000	0.0480	-0.0916	0.0864
	A/B	12	0.4800	0.0000	0.0480	-0.0916	0.0864
	B/B	4	0.1600	0.0000	0.0480	-0.0916	0.0864
Marker5	A/A	2	0.0800	0.0016	0.0405	-0.0756	0.0816
	A/B	4	0.1600	0.0040	0.0337	-0.0664	0.0636
	A/C	6	0.2400	-0.0024	0.0380	-0.0736	0.0680
	B/B	5	0.2000	0.1100	0.0445	0.0144	0.1884
	B/C	1	0.0400	0.1060	0.0282	0.0440	0.1564
	C/C	7	0.2800	0.1036	0.0453	0.0096	0.1884

Figure 3.3 displays the genotype frequencies for each marker with the associated disequilibrium coefficient, its standard error, and the 95% confidence limits.
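For a marker with two alleles, the HWD coefficient and the chi-square test of HWE follow directly from the genotype counts. The following DATA step is a sketch for Marker1, assuming the coefficient is the observed A/A frequency minus the squared frequency of allele A and that the test is the Pearson chi-square with one degree of freedom; the variable names are hypothetical.

```sas
/* Sketch: HWD coefficient and chi-square HWE test for Marker1.        */
/* Counts are taken from Figure 3.3; variable names are illustrative.  */
data _null_;
   nAA = 5; nAB = 12; nBB = 8;
   n  = nAA + nAB + nBB;
   pA = (2*nAA + nAB) / (2*n);           /* frequency of allele A       */
   pB = 1 - pA;
   dAA = nAA/n - pA**2;                  /* HWD coefficient             */
   eAA = n*pA**2;                        /* expected counts under HWE   */
   eAB = 2*n*pA*pB;
   eBB = n*pB**2;
   chisq  = (nAA-eAA)**2/eAA + (nAB-eAB)**2/eAB + (nBB-eBB)**2/eBB;
   pvalue = 1 - probchi(chisq, 1);       /* upper-tail probability, 1 DF */
   put dAA= 7.4 chisq= 7.4 pvalue= 7.4;  /* write the values to the log  */
run;
```

The results can be compared with the HWD Coeff column for Marker1 in Figure 3.3 and with the Chi-Square and Pr > ChiSq columns in Figure 3.1. The bootstrap standard errors and confidence limits are not reproduced by this calculation.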

Figure 3.4 Testing for Disequilibrium Using the ALLELE Procedure

Obs	Locus1	Locus2	NIndiv	Distance	Test	ChiSq	DF	ProbChi	ProbEx
1	Marker1	Marker1	25	0	HWE	0.01687	1	0.89667	1.0000
2	Marker1	Marker2	25	1	LD	1.05799	1	0.30367	0.4882
3	Marker1	Marker3	25	2	LD	1.42074	1	0.23328	0.8544
4	Marker1	Marker4	25	3	LD	0.33144	1	0.56481	0.9885
5	Marker1	Marker5	25	4	LD	2.29785	2	0.31698	0.0940
6	Marker2	Marker2	25	0	HWE	1.70412	1	0.19175	0.2262
7	Marker2	Marker3	25	1	LD	0.13798	1	0.71030	0.5096
8	Marker2	Marker4	25	2	LD	1.34100	1	0.24686	0.6455
9	Marker2	Marker5	25	3	LD	1.13574	2	0.56673	0.0126
10	Marker3	Marker3	25	0	HWE	0.04340	1	0.83497	1.0000
11	Marker3	Marker4	25	1	LD	0.46296	1	0.49624	0.9712
12	Marker3	Marker5	25	2	LD	0.95899	2	0.61909	0.0261
13	Marker4	Marker4	25	0	HWE	0.00000	1	1.00000	1.0000
14	Marker4	Marker5	25	1	LD	6.16071	2	0.04594	0.1281
15	Marker5	Marker5	25	0	HWE	9.35374	3	0.02494	0.0106

Figure 3.4 displays the output data set created using the OUTSTAT= option of the PROC ALLELE statement. This data set contains the statistics for testing individual markers for HWE and marker pairs for linkage disequilibrium.
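Because the OUTSTAT= data set is an ordinary SAS data set, it can be subset like any other. The following sketch keeps only the marker pairs whose LD test has an exact p-value below 0.05. It uses the variable names shown in Figure 3.4; the data set name ld_flagged and the 0.05 cutoff are illustrative assumptions.

```sas
/* Sketch: keep LD tests with a small exact p-value.                    */
/* LD_FLAGGED is a hypothetical name; 0.05 is an illustrative cutoff.   */
data ld_flagged;
   set ld;
   /* the lower bound excludes missing values, which sort below numbers */
   where Test = 'LD' and . < ProbEx < 0.05;
run;

proc print data=ld_flagged;
run;
```

With the values in Figure 3.4, this subset would contain the Marker2 and Marker5 pair and the Marker3 and Marker5 pair.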