The CASECONTROL Procedure

Statistical Computations

Biallelic Markers

PROC CASECONTROL offers three statistics to test for an association between a biallelic marker and a binary variable, typically affection status of a particular disease. Table 5.1 displays the quantities that are used for the three case-control tests for biallelic markers (Sasieni 1997).

Table 5.1: Genotype Distribution for Case-Control Sample

 

Number of $M_1$ Alleles

 
 

0

1

2

Total

Case

$r_0$

$r_1$

$r_2$

$R$

Control

$s_0$

$s_1$

$s_2$

$S$

Total

$n_0$

$n_1$

$n_2$

$N$


The three statistical methods for testing a marker for association with a disease locus are Armitage’s trend test (1955), the allele case-control test, and the genotype case-control test. The trend test and allele case-control test are most useful when there is an additive allele effect on the disease susceptibility. When Hardy-Weinberg equilibrium (HWE) holds in the combined sample of cases and controls, these statistics are approximately equal and have an asymptotic $\chi ^2_1$ distribution. However, if the assumption of HWE in the combined sample is violated, then the variance for the allele case-control statistic is incorrect; only the trend test remains valid under this violation. The statistics for the trend and allele case-control test, respectively, are given by Sasieni (1997) as

$\displaystyle  X^2_ T  $
$\displaystyle  =  $
$\displaystyle  \frac{N[N(r_1+2r_2)-R(n_1+2n_2)]^2}{R(N-R)[N(n_1+4n_2)-(n_1+2n_2)^2]}  $
$\displaystyle X^2_ A  $
$\displaystyle  =  $
$\displaystyle  \frac{2N[2N(r_1+2r_2) - 2R(n_1+2n_2)]^2}{(2R)2(N-R)[2N(n_1+2n_2)-(n_1+2n_2)^2]}  $

Devlin and Roeder (1999) describe a genomic control method that adjusts the trend test statistic for correlation between alleles from members of the same subpopulation. Assuming the variance inflation factor $\lambda $ is constant across the genome, it can be estimated by $\hat{\lambda }=\max ([\mbox{median}(X_1,\ldots ,X_ m)/0.675]^2, 1)$, where $X_ i=X_ T$ for the $i$th biallelic marker, $i=1,\ldots ,m$ (Devlin and Roeder 1999; Bacanu, Devlin, and Roeder 2000). The adjusted trend statistic, $X^2_{T_ a}=X^2_ T/\hat{\lambda }$, is approximately distributed as $\chi ^2_1$. This variance correction is made to biallelic markers when the VIF option is specified in the PROC statement. By default, any biallelic markers that are specified in the VAR statement are used in computing $\hat{\lambda }$. Alternatively, the NULLSNPS= option can be used to specify biallelic markers other than those in the VAR statement to be used to calculate $\hat{\lambda }$. This enables markers that are assumed to have no effect on disease susceptibility or to not be in linkage disequilibrium with a disease-susceptibility locus to be used in calculating the inflation factor (Bacanu, Devlin, and Roeder 2000).

If dominance effects of alleles are also suspected to contribute to disease susceptibility, the genotype case-control test can be used. The standard 2$\times $3 contingency table analysis is used to form the $\chi ^2_2$ statistic for the genotype case-control test as

\[  X^2_ G = \sum _{i=0}^2 \left[\frac{(Nr_ i - Rn_ i)^2}{NRn_ i} + \frac{(Ns_ i-Sn_ i)^2}{NSn_ i}\right]  \]

which tests for both additive and dominance (nonadditive) allelic effects (Nielsen and Weir 1999).

When the OR option is specified in the PROC CASECONTROL statement, odds ratios for biallelic markers are calculated based on the 2$\times $2 table of allele-by-trait counts. Using the values given in Table 5.1 to form the cell counts $a=2r_2 + r_1$, $b=2s_0 + s_1$, $c=2r_0 + r_1$, and $d=2s_2+s_1$, the odds ratio can be estimated as $\hat{\theta } = ab/(cd)$. The asymptotic $(1-\alpha )$% confidence limits for the estimated odds ratio $\hat{\theta }$ are

\[  (\hat{\theta }\cdot \exp {(-z\sqrt {v})}, \hat{\theta }\cdot \exp {(z\sqrt {v})})  \]

where

\[  v = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}  \]

and $z$ is the $100(1-\alpha /2)$ percentile in the standard normal distribution. If any of the four cell frequencies are zero, the limits are not computed. The order of rows and columns is determined by the formatted values of the alleles and trait. Also note that if there are no heterozygous genotypes, $2v$ is used in place of $v$ in the formula for the confidence limits so that each individual is counted only once. This provides the correct limits when combining the heterozygous genotype with a homozygous genotype to obtain odds ratios for dominant or recessive disease models (see Example 5.3).

Multiallelic Markers

When there are multiple alleles of interest at a marker, the same three tests can be performed, except that Devlin and Roeder’s genomic control adjustment is not applied to any markers with more than two alleles. To construct the test statistic for the multiallelic trend test for a marker with $k$ alleles (Slager and Schaid 2001), the $p \times (k-1)$ matrix $\mathbf{X}$ is created such that each element $X_{iu}$ represents the number of times the $M_ u$ allele appears in the $i$th genotype, $i=1,\ldots ,p$ and $u=1,\ldots ,k-1$, where $p=k(k+1)/2$, the number of possible genotypes. Vectors $\mathbf{r}$ and $\mathbf{s}$ of length $p$ contain the genotype counts for the cases and controls, respectively, and $\phi =R/N$, the proportion of cases in the sample. The multiallelic trend test statistic can then be expressed as $\mathbf{U}’[\mbox{Var}(\mathbf{U})]^{-1}\mathbf{U}$, where the vector $\mathbf{U}=\mathbf{X}’[(1-\phi )\mathbf{r}-\phi \mathbf{s}]$. $\mbox{Var}(\mathbf{U})$ is calculated under the assumption of independent (or unrelated) subjects in the sample by using $\mbox{Var}(\mathbf{r})$ and $\mbox{Var}(\mathbf{s})$. These matrices contain elements $\sigma _{ii}=Rn_ i(N-n_ i)/N^2$ and $\sigma _{ij}=-Rn_ i n_ j/N^2$, where $i,j=1,\ldots ,p$ (the $R$ is replaced by $S$ for $\mbox{Var}(\mathbf{s})$). This statistic has an asymptotic $\chi ^2_{k-1}$ distribution.

Another way to test for additive allele effects at the disease or trait locus is the allele case-control test, executed using a contingency table analysis similar to the genotype case-control test described in the preceding section, assuming HWE (Nielsen and Weir 1999). For a marker with $k$ alleles, a 2$\times k$ contingency table is formed with one row for cases, one for controls, and a column for each allele. The $\chi ^2_{k-1}$ statistic is formed by summing $(O-E)^2/E$ over all cells in the table, where $O$ is the observed count for the cell and $E$ is the expected count, the cell’s column total multiplied by $R/N$ (or $S/N$) for a cell in the case (or control) row.

The genotypic case-control test statistic is calculated in a similar manner, with columns now representing the $p$ observed genotype classes instead of alleles. Significance of this test statistic based on the $\chi ^2_{p-1}$ distribution indicates dominance and/or additive allelic effects on the disease or trait (Nielsen and Weir 1999).

Stratified Analysis

A stratified case-control test can be performed to adjust for categorical covariates, such as gender or treatment; to analyze a sample from a matched or nested case-control design; or to accommodate the analysis of X-linked markers. The generalized Cochran-Mantel-Haenszel (CMH) test statistic given by Agresti (1990) can be used to test whether there is an association between the trait and marker alleles or genotypes in any of the strata, still with the same chi-square distribution and degrees of freedom as the test statistic from the nonstratified analysis. For the allele and genotype tests, which are based on contingency tables, the statistic is formed with the following quantities that use observed cell counts $c_{ijh}$ from the $i$th row (corresponding to one of the two trait categories), $j$th out of $J$ columns (corresponding to the $j$th allele or genotype), and $h$th stratum:

\[  \mathbf{c}_ h = (c_{11h},c_{12h},\ldots ,c_{1,J-1,h})’  \]
\[  \mathbf{e}_ h = (c_{1+h}c_{+1h},\ldots ,c_{1+h}c_{+,J-1,h})’/c_{++h}  \]
\[  \mbox{Cov}(c_{1jh},c_{1jh}) = \frac{c_{1+h}(c_{++h}-c_{1+h})c_{+jh}(\delta _{jj}c_{++h}-c_{+jh})}{c^2_{++h}(c_{++h}-1)}  \]

with covariance matrix $\mathbf{V}_ h$ of $\mathbf{c}_ h$ comprising these covariance terms for all $h$ and $j,j’=1,\ldots ,J-1$ and $\delta _{jj}=1$ when $j=j’$ and 0 otherwise. Note that cell counts for $i=2$ are omitted from the vectors and matrix since they are completely dependent on the cell counts from the first row and column totals. For the stratified trend test, which is based on the Mantel score test of conditional independence (Agresti 1990), a trend test vector $\mathbf{U}_ h$ and the covariance matrix $\mathbf{V}_ h=c_{++h}\mathrm{Var}(\mathbf{U}_ h)/(c_{++h}-1)$ are calculated within each stratum with $\mathbf{U}_ h$ and $\mathrm{Var}(\mathbf{U}_ h)$ defined as in the previous section for the multiallelic trend test. All three test statistics can then be represented as

\[  X^2_ M = \mathbf{S}’\mathbf{V}^{-1}\mathbf{S}  \]

with a $\chi ^2_{J-1}$ distribution under the null hypothesis, where $J$ represents the number of genotypes for the genotype test or the number of alleles for the allele and trend tests,

\[  \mathbf{S} = \left\{  \begin{array}{ll} \sum _ h (\mathbf{c}_ h - \mathbf{e}_ h), &  \mbox{genotype and allele tests} \\ \sum _ h \mathbf{U}_ h, &  \mbox{trend test} \end{array} \right.  \]

and $\mathbf{V}=\sum _ h \mathbf{V}_ h$.

The Mantel-Haenszel estimate of the common odds ratio across strata (Agresti 1990) for biallelic markers is reported when the STRATA statement is used along with the OR option in the PROC CASECONTROL statement. For a contingency table with two columns representing the two alleles at a marker, the estimate in terms of the observed cell counts is

\[  \hat{\theta }_{\mbox{MH}} = \frac{\sum _ h (c_{11h}c_{22h}/c_{++h})}{\sum _ h (c_{12h}c_{21h}/c_{++h})}  \]

The asymptotic $(1-\alpha )$% confidence limits for the estimate of the odds ratio $\hat{\theta }_{\mbox{MH}}$ are again given by Agresti (1990) as

\[  (\hat{\theta }_{\mbox{MH}}\cdot \exp {(-z\sqrt {v})}, \hat{\theta }_{\mbox{MH}}\cdot \exp {(z\sqrt {v})})  \]

now with

$\displaystyle  v  $
$\displaystyle  =  $
$\displaystyle  \frac{\sum _ h (c_{11h}+c_{22h})(c_{11h}c_{22h})/c^2_{++h}}{2(\sum _ h c_{11h}c_{22h}/c_{++h})^2}  $
$\displaystyle  $
$\displaystyle  +  $
$\displaystyle  \frac{\sum _ h [(c_{11h}+c_{22h})(c_{12h}c_{21h})+(c_{12h}+c_{21h})(c_{11h}c_{22h})]/c^2_{++h}}{2(\sum _ h c_{11h}c_{22h}/c_{++h})(\sum _ h c_{12h}c_{21h}/c_{++h})}  $
$\displaystyle  $
$\displaystyle  +  $
$\displaystyle  \frac{\sum _ h (c_{12h}+c_{21h})(c_{12h}c_{21h})/c^2_{++h}}{2(\sum _ h c_{12h}c_{21h}/c_{++h})^2}  $

Again, if all of the strata contain no heterozygous genotypes, $v$ is replaced by $2v$ in the confidence limits formula.

Permutation Tests

By default, the $p$-values from the $\chi ^2$ distribution with the appropriate degrees of freedom are reported for all three case-control tests. However, if the PERMS= option is specified in the PROC CASECONTROL statement, then Monte Carlo estimates of exact $p$-values are computed instead using the permutation procedure. For the genotype and trend tests, new samples of individuals are formed by permuting the trait value of the individuals in the sample; permutations for the allele test treat the two marker alleles per individual as separate observations each with the same trait, and the trait value is then permuted across these observations. If there are any STRATA variables, permutations are performed within each stratum. For $p$ permutations, the exact $p$-value is estimated as the proportion of times the chi-square statistic from one of the $p$ new samples is equal to or exceeds the chi-square statistic from the original sample (Westfall and Young 1993).