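Suppose you test $m$ null hypotheses, $H_{01},\ldots,H_{0m}$, and obtain the p-values $p_1,\ldots,p_m$. Denote the ordered p-values as $p_{(1)} \le \cdots \le p_{(m)}$ and order the tests appropriately: $H_{0(1)},\ldots,H_{0(m)}$. Suppose you know $m_0$ of the null hypotheses are true and $m_1 = m - m_0$ are false. Let $R$ indicate the number of null hypotheses rejected by the tests, where $V$ of these are incorrectly rejected (that is, $V$ tests are Type I errors) and $R - V$ are correctly rejected (so $m_1 - R + V$ tests are Type II errors). This information is summarized in the following table:

|               | Null is Rejected | Null is Not Rejected | Total |
|---------------|------------------|----------------------|-------|
| Null is True  | $V$              | $m_0 - V$            | $m_0$ |
| Null is False | $R - V$          | $m_1 - R + V$        | $m_1$ |
| Total         | $R$              | $m - R$              | $m$   |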
The familywise error rate (FWE) is the overall Type I error rate for all the comparisons (possibly under some restrictions); that is, it is the maximum probability of incorrectly rejecting one or more null hypotheses:
$$\mathrm{FWE} = \Pr(V > 0)$$
The FWE is also known as the maximum experimentwise error rate (MEER), as discussed in the section Pairwise Comparisons of Chapter 42: The GLM Procedure.
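The false discovery rate (FDR) is the expected proportion of incorrectly rejected hypotheses among all rejected hypotheses:
$$\mathrm{FDR} = E\left(\frac{V}{R}\right) = E\left(\frac{V}{R} \,\middle|\, R > 0\right)\Pr(R > 0), \quad \text{where } \frac{V}{R} = 0 \text{ when } V = R = 0$$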
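Under the overall null hypothesis (all the null hypotheses are true), FDR = FWE, since $V = R$ gives $E(V/R) = 1 \cdot \Pr(V/R = 1) = \Pr(V > 0)$. Otherwise, FDR is never greater than FWE, so an FWE-controlling adjustment also controls the FDR; the converse does not hold. Another definition used is the positive false discovery rate:
$$\mathrm{pFDR} = E\left(\frac{V}{R} \,\middle|\, R > 0\right)$$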
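The three error rates can be made concrete with a small Python sketch (not PROC MULTTEST code). It assumes you already have a list of simulated experiments, each summarized by the pair (V, R); the function name `error_rates` and the toy data are hypothetical.

```python
def error_rates(outcomes):
    """Estimate FWE, FDR, and pFDR from (V, R) pairs, one pair per experiment."""
    n = len(outcomes)
    # FWE: proportion of experiments with at least one false rejection.
    fwe = sum(1 for v, r in outcomes if v > 0) / n
    # FDR: mean of V/R, with V/R taken as 0 when nothing is rejected.
    fdr = sum((v / r if r > 0 else 0.0) for v, r in outcomes) / n
    # pFDR: mean of V/R restricted to experiments with at least one rejection.
    rejecting = [(v, r) for v, r in outcomes if r > 0]
    pfdr = (sum(v / r for v, r in rejecting) / len(rejecting)
            if rejecting else float("nan"))
    return fwe, fdr, pfdr


# Four toy experiments: (false rejections, total rejections).
outcomes = [(0, 0), (1, 4), (0, 3), (2, 2)]
fwe, fdr, pfdr = error_rates(outcomes)
print(fwe, fdr, pfdr)
```

In this toy input the estimated FDR does not exceed the estimated FWE, and the pFDR is at least as large as the FDR because it conditions on experiments that reject something.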
The p-value adjustment methods discussed in the following sections attempt to correct the raw p-values while controlling either the FWE or the FDR. Note that the methods might impose some restrictions in order to achieve this; restrictions are discussed along with the methods in the following sections. Discussions and comparisons of some of these methods are given in Dmitrienko et al. (2005), Dudoit, Shaffer, and Boldrick (2003), Westfall et al. (1999), and Brown and Russell (1997).
PROC MULTTEST provides several p-value adjustments to control the familywise error rate. Single-step adjustment methods are computed without reference to the other hypothesis tests under consideration. The available single-step methods are the Bonferroni and Šidák adjustments, which are simple functions of the raw p-values that try to distribute the significance level $\alpha$ across all the tests, and the bootstrap and permutation resampling adjustments, which require the raw data. The Bonferroni and Šidák methods are calculated from the permutation distributions when exact permutation tests are used with the CA or Peto test.
Stepwise tests, or sequentially rejective tests, order the hypotheses in step-up (least significant to most significant) or step-down fashion, then sequentially determine acceptance or rejection of the nulls. These tests are more powerful than the single-step tests, and they do not always require you to perform every test. However, PROC MULTTEST still adjusts every p-value. PROC MULTTEST provides the following stepwise p-value adjustments: step-down Bonferroni (Holm), step-down Šidák, step-down bootstrap and permutation resampling, Hochberg’s (1988) step-up, Hommel’s (1988), Fisher’s combination method, and the Stouffer-Liptak combination method. Adaptive versions of Holm’s step-down Bonferroni and Hochberg’s step-up Bonferroni methods, which require an estimate of the number of true null hypotheses, are also available.
Liu (1996) shows that all single-step and stepwise tests based on marginal p-values can be used to construct a closed test (Marcus, Peritz, and Gabriel, 1976; Dmitrienko et al., 2005). Closed testing methods not only control the familywise error rate at size $\alpha$, but are also more powerful than the tests on which they are based. Westfall and Wolfinger (2000) note that several of the methods available in PROC MULTTEST are closed, namely the step-down methods, Hommel’s method, and Fisher’s combination; see that reference for conditions and exceptions.
All methods except the resampling methods are calculated by simple functions of the raw p-values or marginal permutation distributions; the permutation and bootstrap adjustments require the raw data. Because the resampling techniques incorporate distributional and correlational structures, they tend to be less conservative than the other methods.
When a resampling (bootstrap or permutation) method is used with only one test, the adjusted p-value is the bootstrap or permutation p-value for that test, with no adjustment for multiplicity, as described by Westfall and Soper (1994).
The Bonferroni p-value for test $i$, $i = 1,\ldots,m$, is simply $\tilde{p}_i = m p_i$. If the adjusted p-value exceeds 1, it is set to 1. The Bonferroni test is conservative but always controls the familywise error rate.
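A minimal Python sketch of the single-step Bonferroni rule (multiply each raw p-value by the number of tests and cap at 1) is shown here for illustration; the function name `bonferroni` is hypothetical and this is not the procedure's own code.

```python
def bonferroni(pvalues):
    """Single-step Bonferroni: multiply each raw p-value by m, capped at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


adjusted = bonferroni([0.01, 0.04, 0.50])
print(adjusted)  # the last value is capped at 1
```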
If the unadjusted p-values are computed by using exact permutation distributions, then the Bonferroni adjustment for $p_i$ is $\tilde{p}_i = \sum_{j=1}^{m} p_j^*$, where $p_j^*$ is the largest p-value from the permutation distribution of test $j$ satisfying $p_j^* \le p_i$, or 0 if all permutational p-values of test $j$ are greater than $p_i$. These adjustments are much less conservative than the ordinary Bonferroni adjustments because they incorporate the discrete distributional characteristics. However, they remain conservative in that they do not incorporate correlation structures between multiple contrasts and multiple variables (Westfall and Wolfinger, 1997).
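The discrete version can be sketched in Python as follows. The sketch assumes that, for each test, you can list every p-value attainable under its permutation distribution; the name `discrete_bonferroni`, the `attainable` argument, and the toy supports are hypothetical.

```python
def discrete_bonferroni(raw_pvalues, attainable):
    """Bonferroni adjustment using each test's discrete permutation distribution.

    attainable[j] lists every p-value that test j can attain under its
    permutation distribution (an assumed input for this illustration).
    """
    adjusted = []
    for p_i in raw_pvalues:
        total = 0.0
        for support in attainable:
            # Largest attainable p-value of this test not exceeding p_i, else 0.
            candidates = [q for q in support if q <= p_i]
            total += max(candidates) if candidates else 0.0
        adjusted.append(min(1.0, total))
    return adjusted


attainable = [[0.02, 0.10, 0.40, 1.0], [0.05, 0.25, 1.0]]
adjusted = discrete_bonferroni([0.02, 0.25], attainable)
print(adjusted)
```

For these toy supports the adjusted values are smaller than the ordinary Bonferroni values (2 × 0.02 and 2 × 0.25), because a test that cannot attain a p-value as small as 0.02 contributes nothing to the first sum.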
A technique slightly less conservative than Bonferroni is the Šidák p-value (Šidák, 1967), which is $\tilde{p}_i = 1 - (1 - p_i)^m$. It is exact when all of the p-values are uniformly distributed and independent, and it is conservative when the test statistics satisfy the positive orthant dependence condition (Holland and Copenhaver, 1987).
If the unadjusted p-values are computed by using exact permutation distributions, then the Šidák adjustment for $p_i$ is $\tilde{p}_i = 1 - \prod_{j=1}^{m} (1 - p_j^*)$, where the $p_j^*$ are as described previously. These adjustments are less conservative than the corresponding Bonferroni adjustments, but they do not incorporate correlation structures between multiple contrasts and multiple variables (Westfall and Wolfinger, 1997).
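Both Šidák forms can be sketched in Python under the same assumptions as before: the function names and the `attainable` input (each test's attainable permutation p-values) are hypothetical, and the code only illustrates the formulas.

```python
def sidak(pvalues):
    """Single-step Sidak: 1 - (1 - p)^m for each raw p-value."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def discrete_sidak(raw_pvalues, attainable):
    """Sidak adjustment built from each test's attainable permutation p-values."""
    adjusted = []
    for p_i in raw_pvalues:
        product = 1.0
        for support in attainable:
            # Largest attainable p-value of this test not exceeding p_i, else 0.
            candidates = [q for q in support if q <= p_i]
            p_star = max(candidates) if candidates else 0.0
            product *= 1.0 - p_star
        adjusted.append(1.0 - product)
    return adjusted


ordinary = sidak([0.01, 0.04])
attainable = [[0.02, 0.10, 0.40, 1.0], [0.05, 0.25, 1.0]]
exact = discrete_sidak([0.02, 0.25], attainable)
print(ordinary, exact)
```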
The bootstrap method creates pseudo-data sets by sampling observations with replacement from each within-stratum pool of observations. An entire data set is thus created, and p-values for all tests are computed on this pseudo-data set. A counter records whether the minimum p-value from the pseudo-data set is less than or equal to the actual p-value for each base test. (If there are $m$ tests, then there are $m$ such counters.) This process is repeated a large number of times, and the proportion of resampled data sets where the minimum pseudo-p-value is less than or equal to an actual p-value is the adjusted p-value reported by PROC MULTTEST. The algorithms are described in Westfall and Young (1993).
In the case of continuous data, the pooling of the groups is not likely to recreate the shape of the null hypothesis distribution, since the pooled data are likely to be multimodal. For this reason, PROC MULTTEST automatically mean-centers all continuous variables prior to resampling. Such mean-centering is akin to resampling residuals in a regression analysis, as discussed by Freedman (1981). You can specify the NOCENTER option if you do not want to center the data.
The bootstrap method implicitly incorporates all sources of correlation, from both the multiple contrasts and the multivariate structure. The adjusted p-values incorporate all correlations and distributional characteristics. This method always provides weak control of the familywise error rate, and it provides strong control when the subset pivotality condition holds; that is, for any subset of the null hypotheses, the joint distribution of the p-values for the subset is identical to that under the complete null (Westfall and Young, 1993).
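The counter logic of the single-step resampling adjustment can be sketched in Python. The sketch assumes the pseudo-p-values for every resample have already been computed by some test of your choice; the function name `single_step_minp` and the toy numbers are hypothetical.

```python
def single_step_minp(observed, pseudo_sets):
    """Single-step min-p adjustment from precomputed resampled p-values.

    observed    : raw p-values of the m tests
    pseudo_sets : one list of m pseudo-p-values per resampled data set
    """
    counters = [0] * len(observed)       # one counter per test
    for pseudo in pseudo_sets:
        smallest = min(pseudo)           # minimum pseudo-p-value of this resample
        for i, p in enumerate(observed):
            if smallest <= p:
                counters[i] += 1
    return [c / len(pseudo_sets) for c in counters]


observed = [0.01, 0.20]
pseudo_sets = [[0.30, 0.50], [0.005, 0.90], [0.15, 0.60], [0.40, 0.02]]
adjusted = single_step_minp(observed, pseudo_sets)
print(adjusted)
```

A real analysis would use far more than four resamples; the tiny input only makes the counting easy to follow by hand.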
The permutation-style-adjusted p-values are computed in identical fashion as the bootstrap-adjusted p-values, with the exception that the within-stratum resampling is performed without replacement instead of with replacement. This produces a rerandomization analysis such as in Brown and Fears (1981) and Heyse and Rom (1988). In the spirit of rerandomization analyses, the continuous variables are not centered prior to resampling. This default can be overridden by using the CENTER option.
The permutation method implicitly incorporates all sources of correlation, from both the multiple contrasts and the multivariate structure. The adjusted p-values incorporate all correlations and distributional characteristics. This method always provides weak control of the familywise error rate, and it provides strong control of the familywise error rate under the subset pivotality condition, as described in the preceding section.
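The difference between the two resampling schemes can be sketched in Python for a single stratum and a single continuous variable. This is an illustration under assumptions, not the procedure's algorithm: the function name `resample_stratum` is hypothetical, and mean-centering is assumed here to mean subtracting each group's own mean, in the spirit of resampling residuals.

```python
import random


def resample_stratum(groups, rng, with_replacement, center=None):
    """Resample one stratum; groups is a list of non-empty lists of observations.

    with_replacement=True  mimics the bootstrap scheme,
    with_replacement=False mimics the permutation scheme.
    center defaults to the scheme's usual choice (assumed: bootstrap centers,
    permutation does not); centering subtracts each group's own mean.
    """
    if center is None:
        center = with_replacement
    if center:
        groups = [[x - sum(g) / len(g) for x in g] for g in groups]
    pool = [x for g in groups for x in g]          # within-stratum pool
    if with_replacement:
        drawn = [rng.choice(pool) for _ in pool]
    else:
        drawn = rng.sample(pool, len(pool))        # a random permutation
    resampled, start = [], 0
    for g in groups:                               # keep the original group sizes
        resampled.append(drawn[start:start + len(g)])
        start += len(g)
    return resampled


rng = random.Random(0)
groups = [[1.0, 2.0, 3.0], [10.0, 11.0]]
permuted = resample_stratum(groups, rng, with_replacement=False)
bootstrapped = resample_stratum(groups, rng, with_replacement=True)
print(permuted, bootstrapped)
```

The permutation draw rearranges the original observations among the groups, while the bootstrap draw can repeat or omit centered observations.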
Step-down testing is available for the Bonferroni, Šidák, bootstrap, and permutation methods. The benefit of using step-down methods is that the tests are made more powerful (smaller adjusted p-values) while, in most cases, maintaining strong control of the familywise error rate. The step-down method was pioneered by Holm (1979) and further developed by Shaffer (1986), Holland and Copenhaver (1987), and Hochberg and Tamhane (1987).
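The Bonferroni step-down (Holm) p-values are obtained from
$$\tilde{p}_{(i)} = \begin{cases} m\,p_{(1)} & \text{for } i = 1 \\ \max\left(\tilde{p}_{(i-1)},\ (m - i + 1)\,p_{(i)}\right) & \text{for } i = 2,\ldots,m \end{cases}$$
As always, if any adjusted p-value exceeds 1, it is set to 1.
The Šidák step-down p-values are determined similarly:
$$\tilde{p}_{(i)} = \begin{cases} 1 - (1 - p_{(1)})^{m} & \text{for } i = 1 \\ \max\left(\tilde{p}_{(i-1)},\ 1 - (1 - p_{(i)})^{m - i + 1}\right) & \text{for } i = 2,\ldots,m \end{cases}$$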
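Both step-down adjustments follow one pattern: the ith-smallest p-value is adjusted as if only m − i + 1 tests remained, and a running maximum keeps the adjusted values monotone. A Python sketch of that pattern follows; the function names are hypothetical and the code is illustrative only.

```python
def step_down(pvalues, single_step):
    """Generic step-down adjustment.

    single_step(p, k) returns the single-step adjustment of p among k tests.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        k = m - rank                 # m - i + 1 tests remain at the ith step
        running_max = max(running_max, min(1.0, single_step(pvalues[idx], k)))
        adjusted[idx] = running_max  # enforce monotonicity
    return adjusted


def holm(pvalues):
    return step_down(pvalues, lambda p, k: k * p)


def sidak_step_down(pvalues):
    return step_down(pvalues, lambda p, k: 1.0 - (1.0 - p) ** k)


raw = [0.04, 0.01, 0.03]
holm_adj = holm(raw)
sidak_adj = sidak_step_down(raw)
print(holm_adj, sidak_adj)
```

The results are returned in the original order of the tests rather than in sorted order.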
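Step-down Bonferroni adjustments that use exact tests are defined as
$$\tilde{p}_{(i)} = \begin{cases} \sum_{j=1}^{m} p_j^* & \text{for } i = 1 \\ \max\left(\tilde{p}_{(i-1)},\ \sum_{j=i}^{m} p_j^*\right) & \text{for } i = 2,\ldots,m \end{cases}$$
where the $p_j^*$ are defined as before. Note that $p_j^*$ is taken from the permutation distribution corresponding to the jth-smallest unadjusted p-value. Also, any $\tilde{p}_{(i)}$ greater than 1.0 is reduced to 1.0.
Step-down Šidák adjustments for exact tests are defined analogously by substituting $1 - \prod_{j=i}^{m} (1 - p_j^*)$ for $\sum_{j=i}^{m} p_j^*$.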
The resampling-style step-down methods are analogous to the preceding step-down methods; the most extreme p-value is adjusted according to all $m$ tests, the second-most extreme p-value is adjusted according to $(m - 1)$ tests, and so on. The difference is that all correlational and distributional characteristics are incorporated when you use resampling methods. More specifically, assuming the same ordering of p-values as discussed previously, the resampling-style step-down-adjusted p-value for test $i$ is the probability that the minimum pseudo-p-value of tests $i,\ldots,m$ is less than or equal to $p_{(i)}$.
This probability is evaluated by using Monte Carlo simulation, as are the previously described resampling-style-adjusted p-values. In fact, the computations for step-down-adjusted p-values are essentially no more time-consuming than the computations for the non-step-down-adjusted p-values. After the Monte Carlo simulation, the step-down-adjusted p-values are corrected to ensure monotonicity; this correction leaves the first adjusted p-value alone, then corrects the remaining ones as needed. The step-down method approximately controls the familywise error rate, and it is described in more detail by Westfall and Young (1993), Westfall et al. (1999), and Westfall and Wolfinger (2000).
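The step-down resampling computation, including the monotonicity correction, can be sketched in Python. As with the single-step sketch, the pseudo-p-values are assumed to be precomputed, and the name `step_down_minp` and the toy data are hypothetical.

```python
def step_down_minp(observed, pseudo_sets):
    """Step-down min-p adjustment from precomputed resampled p-values."""
    m = len(observed)
    order = sorted(range(m), key=lambda i: observed[i])   # ascending raw p-values
    n_resamples = len(pseudo_sets)
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        remaining = order[rank:]          # tests i, ..., m in the ordered notation
        hits = sum(
            1 for pseudo in pseudo_sets
            if min(pseudo[j] for j in remaining) <= observed[idx]
        )
        # Monotonicity correction: never fall below the previous adjusted value.
        running_max = max(running_max, hits / n_resamples)
        adjusted[idx] = running_max
    return adjusted


observed = [0.01, 0.20]
pseudo_sets = [[0.30, 0.50], [0.005, 0.90], [0.15, 0.60], [0.40, 0.02]]
adjusted = step_down_minp(observed, pseudo_sets)
print(adjusted)
```

With these toy numbers the second test is compared only against its own pseudo-p-values, so its adjusted value is 0.25 rather than the 0.75 that counting the minimum over both tests would give.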
Hommel’s (1988) method is a closed testing procedure based on Simes’ test (Simes, 1986). The Simes p-value for a joint test of any set of $S$ hypotheses with p-values $p_{(1)} \le p_{(2)} \le \cdots \le p_{(S)}$ is $\min\left((S/1)\,p_{(1)}, (S/2)\,p_{(2)}, \ldots, (S/S)\,p_{(S)}\right)$. The Hommel-adjusted p-value for test $j$ is the maximum of all such Simes p-values, taken over all joint tests that include $j$ as one of their components.
Hochberg-adjusted p-values are always as large as or larger than Hommel-adjusted p-values. Sarkar and Chang (1997) show that Simes’ method is valid under independent or positively dependent p-values, so Hommel’s and Hochberg’s methods are also valid in such cases by the closure principle.
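The closure definition can be applied literally for a small number of tests. The Python sketch below enumerates every subset, so its cost grows as 2^m; it is meant only to illustrate the definition, and the function names are hypothetical.

```python
from itertools import combinations


def simes(pvalues):
    """Simes p-value for the joint test of a set of hypotheses."""
    s = len(pvalues)
    return min(s * p / rank for rank, p in enumerate(sorted(pvalues), start=1))


def hommel_closed(pvalues):
    """Hommel-adjusted p-values by brute-force closure (small m only)."""
    m = len(pvalues)
    adjusted = [0.0] * m
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            joint_p = simes([pvalues[i] for i in subset])
            for i in subset:
                # Maximum over all joint tests that include hypothesis i.
                adjusted[i] = max(adjusted[i], joint_p)
    return adjusted


adjusted = hommel_closed([0.01, 0.02, 0.03])
print(adjusted)
```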
Assuming p-values are independent and uniformly distributed under their respective null hypotheses, Hochberg (1988) demonstrates that Holm’s step-down adjustments control the familywise error rate even when calculated in step-up fashion. Since the adjusted p-values for Hochberg’s method are never larger than those for Holm’s method, the Hochberg method is more powerful. However, this improved power comes at the cost of having to make the assumption of independence. Hochberg’s method can be derived from Hommel’s (Liu, 1996), and is thus also derived from Simes’ test (Simes, 1986).
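The Hochberg-adjusted p-values are defined in reverse order of the step-down Bonferroni:
$$\tilde{p}_{(i)} = \begin{cases} p_{(m)} & \text{for } i = m \\ \min\left(\tilde{p}_{(i+1)},\ (m - i + 1)\,p_{(i)}\right) & \text{for } i = m-1,\ldots,1 \end{cases}$$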
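The step-up recursion starts at the largest p-value, multiplies the ith-smallest p-value by m − i + 1, and carries a running minimum downward. A Python sketch follows (the function name `hochberg` is hypothetical and the code is illustrative only):

```python
def hochberg(pvalues):
    """Hochberg step-up adjustment: Holm's multipliers applied from the largest p-value down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0                 # also caps the adjusted values at 1
    for rank, idx in enumerate(order):
        k = rank + 1                  # equals m - i + 1 for the ith-smallest p-value
        running_min = min(running_min, k * pvalues[idx])
        adjusted[idx] = running_min
    return adjusted


adjusted = hochberg([0.04, 0.01, 0.03])
print(adjusted)
```

For the same input, the step-down Bonferroni (Holm) recursion gives 0.06, 0.03, and 0.06, so the step-up values here are never larger.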
The FISHER_C option requests adjusted p-values by using closed tests, based on the idea of Fisher’s combination test. The Fisher combination test for a joint test of any set of $S$ hypotheses with p-values $p_1,\ldots,p_S$ uses the chi-square statistic $\chi^2 = -2\sum_{i=1}^{S} \log(p_i)$, with $2S$ degrees of freedom. The FISHER_C adjusted p-value for test $j$ is the maximum of all p-values for the combination tests, taken over all joint tests that include $j$ as one of their components. Independence of p-values is required for the validity of this method.
The STOUFFER option requests adjusted p-values by using closed tests, based on the Stouffer-Liptak combination test. The Stouffer combination joint test of any set of $S$ one-sided hypotheses with p-values $p_1,\ldots,p_S$ yields the p-value $1 - \Phi\left(\frac{1}{\sqrt{S}}\sum_{i=1}^{S} \Phi^{-1}(1 - p_i)\right)$, where $\Phi$ is the standard normal distribution function. The STOUFFER adjusted p-value for test $j$ is the maximum of all p-values for the combination tests, taken over all joint tests that include $j$ as one of their components.
Independence of the one-sided p-values is required for the validity of this method. Westfall (2005) shows that the Stouffer-Liptak adjustment might have more power than the Fisher combination and Simes’ adjustments when the test results reinforce each other.
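Both combination tests can be plugged into the same brute-force closure. The Python sketch below is illustrative only: the function names are hypothetical, the p-values are assumed to lie strictly between 0 and 1, and the subset enumeration is practical only for a handful of tests. The chi-square tail probability uses the closed form that exists for even degrees of freedom.

```python
import math
from itertools import combinations
from statistics import NormalDist


def fisher_joint(pvalues):
    """Fisher combination p-value: upper tail of a chi-square with 2S df."""
    s = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)       # chi-square statistic / 2
    # For even df 2S: P(X > x) = exp(-x/2) * sum_{k<S} (x/2)^k / k!
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(s))


def stouffer_joint(pvalues):
    """Stouffer-Liptak combination p-value for one-sided p-values."""
    std = NormalDist()
    z = sum(std.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1.0 - std.cdf(z)


def closed_adjust(pvalues, joint_test):
    """Adjusted p-value = max joint p-value over all subsets containing the test."""
    m = len(pvalues)
    adjusted = [0.0] * m
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            joint_p = joint_test([pvalues[i] for i in subset])
            for i in subset:
                adjusted[i] = max(adjusted[i], joint_p)
    return adjusted


raw = [0.01, 0.02, 0.03]
fisher_adj = closed_adjust(raw, fisher_joint)
stouffer_adj = closed_adjust(raw, stouffer_joint)
print(fisher_adj, stouffer_adj)
```

Because every single-hypothesis subset returns the raw p-value itself, the closed adjusted p-values can never fall below the raw ones.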
Adaptive adjustments modify the FWE- and FDR-controlling procedures by taking an estimate of the number $m_0$ or proportion $\pi_0$ of true null hypotheses into account. The adjusted p-values for Holm’s and Hochberg’s methods involve the number of unadjusted p-values larger than $p_{(i)}$, $m - i + 1$. So the minimal significance level at which the ith ordered p-value is rejected implies that the number of true null hypotheses is $m - i + 1$. However, if you know $m_0$, then you can replace $m - i + 1$ with $\min(m_0, m - i + 1)$, thereby obtaining more power while maintaining the original $\alpha$-level significance.
Since $m_0$ is unknown, there are several methods used to estimate the value; see the NTRUENULL= option for more information. The estimation method described by Hochberg and Benjamini (1990) considers the graph of $1 - p_{(i)}$ versus $i$, where the $p_{(i)}$ are the ordered p-values of your tests. See Output 61.6.4 for an example. If all null hypotheses are actually true ($m_0 = m$), then the p-values behave like a sample from a uniform distribution and this graph should be a straight line through the origin. However, if points in the upper-right corner of this plot do not follow the initial trend, then some of these null hypotheses are probably false and $0 \le m_0 < m$.
The ADAPTIVEHOLM option uses this estimate of $m_0$ to adjust the step-down Bonferroni method while the ADAPTIVEHOCHBERG option adjusts the step-up Bonferroni method. Both of these methods are due to Hochberg and Benjamini (1990). When $m_0$ is known, these procedures control the familywise error rate in the same manner as their nonadaptive versions but with more power; however, since $m_0$ must be estimated, the FWE control is only approximate. The ADAPTIVEFDR and PFDR options also use $\hat{m}_0$, and are described in the following section.
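The adjusted p-values for the ADAPTIVEHOLM method are computed by
$$\tilde{p}_{(i)} = \begin{cases} \min(m, \hat{m}_0)\,p_{(1)} & \text{for } i = 1 \\ \max\left[\tilde{p}_{(i-1)},\ \min(m - i + 1, \hat{m}_0)\,p_{(i)}\right] & \text{for } i = 2,\ldots,m \end{cases}$$
The adjusted p-values for the ADAPTIVEHOCHBERG method are computed by
$$\tilde{p}_{(i)} = \begin{cases} \min(1, \hat{m}_0)\,p_{(m)} & \text{for } i = m \\ \min\left[\tilde{p}_{(i+1)},\ \min(m - i + 1, \hat{m}_0)\,p_{(i)}\right] & \text{for } i = m-1,\ldots,1 \end{cases}$$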
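The adaptive versions differ from the ordinary Holm and Hochberg recursions only in that each multiplier m − i + 1 is capped at the estimated number of true nulls. A Python sketch follows; the estimate `m0_hat` is assumed to be supplied by you (the sketch does not implement any estimation method), and the function names are hypothetical.

```python
def adaptive_holm(pvalues, m0_hat):
    """Step-down Bonferroni with multipliers capped at the estimate m0_hat."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])            # ascending
    adjusted, running_max = [0.0] * m, 0.0
    for rank, idx in enumerate(order):
        k = min(m - rank, m0_hat)                                 # min(m - i + 1, m0_hat)
        running_max = max(running_max, min(1.0, k * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted


def adaptive_hochberg(pvalues, m0_hat):
    """Step-up Bonferroni with multipliers capped at the estimate m0_hat."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted, running_min = [0.0] * m, 1.0
    for rank, idx in enumerate(order):
        k = min(rank + 1, m0_hat)                                 # min(m - i + 1, m0_hat)
        running_min = min(running_min, k * pvalues[idx])
        adjusted[idx] = running_min
    return adjusted


raw = [0.04, 0.01, 0.03]
holm_adj = adaptive_holm(raw, m0_hat=2)
hochberg_adj = adaptive_hochberg(raw, m0_hat=2)
print(holm_adj, hochberg_adj)
```

Setting `m0_hat` equal to the number of tests reproduces the nonadaptive adjustments.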
Methods that control the false discovery rate (FDR) were described by Benjamini and Hochberg (1995). These adjustments do not necessarily control the familywise error rate (FWE). However, FDR-controlling methods are more powerful and more liberal, and hence reject more null hypotheses, than adjustments protecting the FWE. FDR-controlling methods are often used when you have a large number of null hypotheses. To control the FDR, Benjamini and Hochberg’s (1995) linear step-up method is provided, as well as an adaptive version, a dependence version, and bootstrap and permutation resampling versions. Storey’s (2002) pFDR methods are also provided.
The FDR option requests p-values that control the “false discovery rate” described by Benjamini and Hochberg (1995). These linear step-up adjustments are potentially much less conservative than the Hochberg adjustments.
The FDR-adjusted p-values are defined in step-up fashion, like the Hochberg adjustments, but with less conservative multipliers:
$$\tilde{p}_{(i)} = \begin{cases} p_{(m)} & \text{for } i = m \\ \min\left(\tilde{p}_{(i+1)},\ \frac{m}{i}\,p_{(i)}\right) & \text{for } i = m-1,\ldots,1 \end{cases}$$
The FDR method is guaranteed to control the false discovery rate at level $\le \frac{m_0}{m}\alpha \le \alpha$ when you have independent p-values that are uniformly distributed under their respective null hypotheses. Benjamini and Yekutieli (2001) show that the false discovery rate is also controlled at level $\le \frac{m_0}{m}\alpha$ when the positive regression dependent condition holds on the set of the true null hypotheses, and they provide several examples where this condition is true.
Note: The positive regression dependent condition on the set of the true null hypotheses holds if the joint distribution of the test statistics $X = (X_1,\ldots,X_m)$ for the null hypotheses $H_{01},\ldots,H_{0m}$ satisfies: $\Pr(X \in A \mid X_i = x)$ is nondecreasing in $x$ for each $X_i$ where $H_{0i}$ is true, for any increasing set $A$. The set $A$ is increasing if $x \in A$ and $y \ge x$ implies $y \in A$.
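The linear step-up adjustment has the same shape as the Hochberg recursion, except that the ith-smallest p-value is multiplied by m/i. A Python sketch follows (the function name `fdr_step_up` is hypothetical and the code is illustrative only):

```python
def fdr_step_up(pvalues):
    """Benjamini-Hochberg linear step-up adjusted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted, running_min = [0.0] * m, 1.0
    for rank, idx in enumerate(order):
        i = m - rank                                   # 1-based ascending rank
        running_min = min(running_min, m * pvalues[idx] / i)
        adjusted[idx] = running_min
    return adjusted


adjusted = fdr_step_up([0.01, 0.02, 0.03, 0.50])
print(adjusted)
```

For this input the three small p-values all adjust to about 0.04, whereas the Hochberg multipliers would give 0.04, 0.06, and 0.06.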
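The DEPENDENTFDR option requests a false discovery rate controlling method that is always valid for p-values under any kind of dependency (Benjamini and Yekutieli, 2001), but is thus quite conservative. Let $\gamma = \sum_{i=1}^{m} \frac{1}{i}$. The DEPENDENTFDR procedure always controls the false discovery rate at level $\le \frac{m_0}{m}\alpha$. The adjusted p-values are computed as
$$\tilde{p}_{(i)} = \begin{cases} \gamma\,p_{(m)} & \text{for } i = m \\ \min\left(\tilde{p}_{(i+1)},\ \gamma\,\frac{m}{i}\,p_{(i)}\right) & \text{for } i = m-1,\ldots,1 \end{cases}$$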
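The dependence version inflates each linear step-up multiplier by the harmonic sum of 1/i over the m tests. A Python sketch follows; the function name `dependent_fdr` is hypothetical, and capping the result at 1 is an assumption made here for consistency with the other adjustments.

```python
def dependent_fdr(pvalues):
    """Benjamini-Yekutieli adjusted p-values, valid under arbitrary dependence."""
    m = len(pvalues)
    gamma = sum(1.0 / i for i in range(1, m + 1))      # harmonic sum
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted, running_min = [0.0] * m, 1.0             # start at 1 to cap values
    for rank, idx in enumerate(order):
        i = m - rank                                   # 1-based ascending rank
        running_min = min(running_min, gamma * m * pvalues[idx] / i)
        adjusted[idx] = running_min
    return adjusted


adjusted = dependent_fdr([0.01, 0.02, 0.03, 0.50])
print(adjusted)
```

With four tests the harmonic sum is 25/12, so each value is roughly twice the corresponding linear step-up adjustment before capping.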
Bootstrap and permutation resampling methods to control the false discovery rate are available with the FDRBOOT and FDRPERM options (Yekutieli and Benjamini, 1999). These methods approximately control the false discovery rate when the subset pivotality condition holds, as discussed in the section Bootstrap, and when the p-values corresponding to the true null hypotheses are independent of those for the false null hypotheses.
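The resampling methodology for the BOOTSTRAP and PERMUTATION methods is used to create $B$ resamples. For the $b$th resample, let $R_b^*(p)$ denote the number of p-values that are less than or equal to the observed p-value $p$. Let $r_\beta^*(p)$ be the $100(1 - \beta)$ quantile of $\{R_1^*(p),\ldots,R_B^*(p)\}$, and let $r(p)$ be the number of observed p-values less than or equal to $p$. Compute one of the following estimators:
local estimator
$$Q_1(p) = \begin{cases} \dfrac{1}{B}\sum_{b=1}^{B} \dfrac{R_b^*(p)}{R_b^*(p) + r(p) - pm} & \text{if } r(p) - r_\beta^*(p) \ge pm \\ \#\{R_b^*(p) \ge 1\}/B & \text{otherwise} \end{cases}$$
upper limit estimator
$$Q_\beta(p) = \begin{cases} \sup_{x \in [0,p]} \dfrac{1}{B}\sum_{b=1}^{B} \dfrac{R_b^*(x)}{R_b^*(x) + r(x) - r_\beta^*(x)} & \text{if } r(x) - r_\beta^*(x) \ge 0 \\ \#\{R_b^*(x) \ge 1\}/B & \text{otherwise} \end{cases}$$
where $m$ is the number of tests and $B$ is the number of resamples. Then for $Q = Q_1$ or $Q_\beta$, the adjusted p-values are computed as
$$\tilde{p}_{(i)} = \begin{cases} Q(p_{(m)}) & \text{for } i = m \\ \min\left(\tilde{p}_{(i+1)},\ Q(p_{(i)})\right) & \text{for } i = m-1,\ldots,1 \end{cases}$$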
Since the FDR method controls the false discovery rate at $\le \frac{m_0}{m}\alpha \le \alpha$, knowledge of $m_0$ allows improvement of the power of the adjustment while still maintaining control of the false discovery rate. The ADAPTIVEFDR option requests adaptive adjusted p-values for approximate control of the false discovery rate, as discussed in Benjamini and Hochberg (2000). See the section Adaptive Adjustments for more details. These adaptive adjustments are also defined in step-up fashion but use an estimate $\hat{m}_0$ of the number of true null hypotheses:
$$\tilde{p}_{(i)} = \begin{cases} \frac{\hat{m}_0}{m}\,p_{(m)} & \text{for } i = m \\ \min\left(\tilde{p}_{(i+1)},\ \frac{\hat{m}_0}{i}\,p_{(i)}\right) & \text{for } i = m-1,\ldots,1 \end{cases}$$
Since $\hat{m}_0 \le m$, the larger p-values are adjusted down. This means that, as defined, controlling the false discovery rate enables you to reject these tests at a level less than the observed p-value. However, by default, this reduction is prevented with an additional restriction: $\tilde{p}_{(i)} = \max\{\tilde{p}_{(i)}, p_{(i)}\}$.
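The adaptive step-up adjustment replaces the multiplier m/i with (estimated number of true nulls)/i and, by default, never lets an adjusted value fall below the raw p-value. A Python sketch follows; `m0_hat` is assumed to be supplied, and the names `adaptive_fdr` and `prevent_reduction` are hypothetical.

```python
def adaptive_fdr(pvalues, m0_hat, prevent_reduction=True):
    """Adaptive linear step-up adjustment using an estimate of the true nulls."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted, running_min = [0.0] * m, 1.0
    for rank, idx in enumerate(order):
        i = m - rank                                   # 1-based ascending rank
        running_min = min(running_min, m0_hat * pvalues[idx] / i)
        value = running_min
        if prevent_reduction:
            # Default restriction: never report less than the raw p-value.
            value = max(value, pvalues[idx])
        adjusted[idx] = value
    return adjusted


raw = [0.01, 0.02, 0.03, 0.50]
restricted = adaptive_fdr(raw, m0_hat=2)
unrestricted = adaptive_fdr(raw, m0_hat=2, prevent_reduction=False)
print(restricted, unrestricted)
```

In the unrestricted output the largest p-value, 0.50, is adjusted down to 0.25, which is the reduction that the default restriction prevents.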
To use this adjustment, Benjamini and Hochberg (2000) suggest first specifying the FDR option; if at least one test is rejected at your level, then apply the ADAPTIVEFDR adjustment. Alternatively, Benjamini, Krieger, and Yekutieli (2006) apply the FDR adjustment at level $\alpha/(1 + \alpha)$, then specify the resulting number of true hypotheses with the NTRUENULL= option and apply the ADAPTIVEFDR adjustment; they show that this two-stage linear step-up procedure controls the false discovery rate at level $\alpha$ for independent test statistics.
The PFDR option computes the “q-values” (Storey, 2002; Storey, Taylor, and Siegmund, 2004), which are adaptive adjusted p-values for strong control of the false discovery rate when the p-values corresponding to the true null hypotheses are independent and uniformly distributed. There are four versions of the PFDR available. Let $N(\lambda)$ be the number of observed p-values that are less than or equal to $\lambda$; let $m$ be the number of tests; let $f = 1$ if the FINITE option is specified, and otherwise set $f = 0$; and denote the estimated proportion of true null hypotheses by
$$\hat{\pi}_0(\lambda) = \frac{m - N(\lambda) + f}{(1 - \lambda)\,m}$$
The default estimate of FDR is
$$\widehat{\mathrm{FDR}}_\lambda(p) = \frac{\hat{\pi}_0(\lambda)\,p}{\max(N(p), 1)/m}$$
If you set $\lambda = 0$, then this is identical to the FDR adjustment.
The positive FDR is estimated by
$$\widehat{\mathrm{pFDR}}_\lambda(p) = \frac{\widehat{\mathrm{FDR}}_\lambda(p)}{1 - (1 - p)^m}$$
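The finite-sample versions of these two estimators for independent null p-values are given by
$$\widehat{\mathrm{FDR}}^*_\lambda(p) = \begin{cases} \dfrac{\hat{\pi}^*_0(\lambda)\,p}{\max(N(p), 1)/m} & \text{if } p \le \lambda \\ 1 & \text{if } p > \lambda \end{cases}$$
$$\widehat{\mathrm{pFDR}}^*_\lambda(p) = \frac{\widehat{\mathrm{FDR}}^*_\lambda(p)}{1 - (1 - p)^m}$$
where $\hat{\pi}^*_0(\lambda)$ denotes $\hat{\pi}_0(\lambda)$ computed with $f = 1$.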
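Finally, the adjusted p-values are computed as
$$\tilde{p}_{(i)} = \inf_{p \ge p_{(i)}} \widehat{\mathrm{FDR}}_\lambda(p), \quad i = 1,\ldots,m$$
with the pFDR or finite-sample estimator substituted when that version is requested.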
This method can produce adjusted p-values that are smaller than the raw p-values. This means that, as defined, controlling the false discovery rate enables you to reject these tests at a level less than the observed p-value. However, by default, this reduction is prevented with an additional restriction: $\tilde{p}_{(i)} = \max\{\tilde{p}_{(i)}, p_{(i)}\}$.
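The q-value computation can be sketched in Python under stated assumptions. The function `pfdr_qvalues` and its arguments are hypothetical. The sketch takes the estimated proportion of true nulls to be (m − N(λ) + f) / ((1 − λ)m), takes the infimum over the observed p-values only, assumes raw p-values strictly greater than 0 and λ < 1, and caps the result at 1.

```python
def pfdr_qvalues(pvalues, lam, positive=False, finite=False, prevent_reduction=True):
    """Sketch of q-values from the estimated FDR (or pFDR) at tuning value lam."""
    m = len(pvalues)

    def count_le(x):
        # N(x): number of observed p-values less than or equal to x.
        return sum(1 for p in pvalues if p <= x)

    f = 1 if finite else 0
    pi0 = (m - count_le(lam) + f) / ((1.0 - lam) * m)   # estimated proportion of true nulls

    def estimate(p):
        if finite and p > lam:
            value = 1.0                                 # finite-sample rule beyond lam
        else:
            value = pi0 * p / (max(count_le(p), 1) / m)
        if positive:
            value /= 1.0 - (1.0 - p) ** m               # condition on at least one rejection
        return value

    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = float("inf")
    for idx in order:
        # Infimum of the estimate over observed p-values at least as large as this one.
        running_min = min(running_min, estimate(pvalues[idx]))
        q = min(1.0, running_min)
        adjusted[idx] = max(q, pvalues[idx]) if prevent_reduction else q
    return adjusted


raw = [0.01, 0.02, 0.03, 0.60]
q_default = pfdr_qvalues(raw, lam=0.5)
q_unrestricted = pfdr_qvalues(raw, lam=0.5, prevent_reduction=False)
q_lambda0 = pfdr_qvalues([0.01, 0.02, 0.03, 0.50], lam=0.0)
print(q_default, q_unrestricted, q_lambda0)
```

With the tuning value set to 0 and no p-values equal to 0, the estimated proportion of true nulls is 1 and the output matches the linear step-up FDR adjustment.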