Example 5.2 Analyzing Data in the Tall-Skinny Format

This example demonstrates how data in the tall-skinny format can be analyzed using PROC CASECONTROL with the TALL, MARKER=, and INDIV= options. It uses the same data as the "Getting Started" example, arranged in this alternative format.


data talldata;
   input affected $ id snpname $ allele1 allele2;
   datalines;
 N  1 Marker1 1 1
 N  2 Marker1 1 1
 N  3 Marker1 2 1
 N  4 Marker1 2 2
 N  5 Marker1 1 1
 N  6 Marker1 2 1
 N  7 Marker1 1 1
 N  8 Marker1 2 2
 N  9 Marker1 2 1
 N 10 Marker1 2 1
 N 11 Marker1 2 1
 N 12 Marker1 2 2
 N 13 Marker1 2 1
 N 14 Marker1 2 1
 N 15 Marker1 2 2
 N 16 Marker1 1 1
 N 17 Marker1 1 1
 N 18 Marker1 2 1
 A 19 Marker1 2 1
 A 20 Marker1 2 1
 A 21 Marker1 2 2
 A 22 Marker1 1 2
 A 24 Marker1 1 1
 A 25 Marker1 2 1
 A 26 Marker1 2 1
 A 27 Marker1 2 1
 A 28 Marker1 2 1
 A 29 Marker1 1 1
 A 30 Marker1 2 1
 A 31 Marker1 2 2
 A 32 Marker1 1 1
 A 33 Marker1 1 1
 A 34 Marker1 2 2
 N  1 Marker2 2 2
 N  2 Marker2 1 1

   ... more lines ...   

 A 29 Marker8 2 2
 A 30 Marker8 1 2
 A 31 Marker8 2 2
 A 32 Marker8 2 2
 A 33 Marker8 2 2
 A 34 Marker8 1 2
;

Note that all marker alleles are contained in two columns, and that separate identifier columns name the marker and the individual for each row. The data set is sorted first by marker ID and then by individual ID. One advantage of this format is that it places no restriction on the number of markers analyzed: unlike the number of columns, the number of rows in a SAS data set is not limited. The following code can be used to analyze this data set:

proc casecontrol data=talldata tall marker=snpname indiv=id;
   var allele1 allele2;
   trait affected;
run;
 
proc print; 
    format ProbAllele ProbGenotype ProbTrend pvalue6.5;
    format ChiSqAllele ChiSqGenotype ChiSqTrend 6.3;
run;

Running this code on the tall-skinny data produces the same output as the "Getting Started" example, shown in Figure 5.1.
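
If the genotypes are instead stored with one row per individual and two allele columns per marker, a DATA step along the following lines could restructure them into the tall-skinny layout. This is only a minimal sketch under stated assumptions: the data set names widedata and tallsketch, the variables a1-a4 and m, and the three sample rows are hypothetical and are not taken from the "Getting Started" example. The final PROC SORT step puts the rows in the order described above, by marker and then by individual.

/* Hypothetical wide-format input: two markers, two allele columns each */
data widedata;
   input affected $ id a1-a4;
   datalines;
 N  1 1 1 2 2
 N  2 1 1 1 1
 A  3 2 1 1 2
;

/* Write one output row per individual per marker */
data tallsketch;
   set widedata;
   array alleles{4} a1-a4;
   length snpname $ 8;
   do m = 1 to 2;
      snpname = cats('Marker', m);    /* Marker1, Marker2 */
      allele1 = alleles{2*m - 1};     /* first allele of marker m */
      allele2 = alleles{2*m};         /* second allele of marker m */
      output;
   end;
   keep affected id snpname allele1 allele2;
run;

/* Order the rows by marker ID, then by individual ID */
proc sort data=tallsketch;
   by snpname id;
run;

The resulting data set has the same five columns as talldata, so it could be passed to PROC CASECONTROL with the same TALL, MARKER=, and INDIV= options.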