13.2 The Challenge of Multiple Testing

In the previous section, we saw that rejecting H 0 if the p -value is below (say) 0 . 01 provides us with a simple way to control the Type I error for H 0 at level 0 . 01: if H 0 is true, then there is no more than a 1% probability that we will reject it. But now suppose that we wish to test m null hypotheses, H 01 , . . . , H 0 m . Will it do to simply reject all null hypotheses for which the corresponding p -value falls below (say) 0 . 01? Stated another way, if we reject all null hypotheses for which the p -value falls below 0 . 01, then how many Type I errors should we expect to make?

As a first step towards answering this question, consider a stockbroker who wishes to drum up new clients by convincing them of her trading

10There are parallels between Table 13.1 and Table 4.6, which has to do with the output of a binary classifier. In particular, recall from Table 4.6 that a false positive results from predicting a positive (non-null) label when the true label is in fact negative (null). This is closely related to a Type I error, which results from rejecting the null hypothesis when in fact the null hypothesis holds.

564 13. Multiple Testing

acumen. She tells 1,024 (1 , 024 = 2[10] ) potential new clients that she can correctly predict whether Apple’s stock price will increase or decrease for 10 days running. There are 2[10] possibilities for how Apple’s stock price might change over the course of these 10 days. Therefore, she emails each client one of these 2[10] possibilities. The vast majority of her potential clients will find that the stockbroker’s predictions are no better than chance (and many will find them to be even worse than chance). But a broken clock is right twice a day, and one of her potential clients will be really impressed to find that her predictions were correct for all 10 of the days! And so the stockbroker gains a new client.

What happened here? Does the stockbroker have any actual insight into whether Apple’s stock price will increase or decrease? No. How, then, did she manage to predict Apple’s stock price perfectly for 10 days running? The answer is that she made a lot of guesses, and one of them happened to be exactly right.

How does this relate to multiple testing? Suppose that we flip 1,024 fair coins[11] ten times each. Then we would expect (on average) one coin to come up all tails. (There’s a 1 / 2[10] = 1 / 1 , 024 chance that any single coin will come up all tails. So if we flip 1 , 024 coins, then we expect one coin to come up all tails, on average.) If one of our coins comes up all tails, then we might therefore conclude that this particular coin is not fair. In fact, a standard hypothesis test for the null hypothesis that this particular coin is fair would lead to a p -value below 0 . 002![12] But it would be incorrect to conclude that the coin is not fair: in fact, the null hypothesis holds, and we just happen to have gotten ten tails in a row by chance.

These examples illustrate the main challenge of multiple testing : when multiple testing a huge number of null hypotheses, we are bound to get some very testing small p -values by chance. If we make a decision about whether to reject each null hypothesis without accounting for the fact that we have performed a very large number of tests, then we may end up rejecting a great number of true null hypotheses — that is, making a large number of Type I errors.

How severe is the problem? Recall from the previous section that if we reject a single null hypothesis, H 0, if its p -value is less than, say, α = 0 . 01, then there is a 1% chance of making a false rejection if H 0 is in fact true. Now what if we test m null hypotheses, H 01 , . . . , H 0 m , all of which are true? There’s a 1% chance of rejecting any individual null hypothesis; therefore, we expect to falsely reject approximately 0 . 01 × m null hypotheses. If m = 10 , 000, then that means that we expect to falsely reject 100 null hypotheses by chance! That is a lot of Type I errors.

The crux of the issue is as follows: rejecting a null hypothesis if the p -value is below α controls the probability of falsely rejecting that null hypothesis at level α . However, if we do this for m null hypotheses, then the chance of falsely rejecting at least one of the m null hypotheses is quite a bit higher!

11A fair coin is one that has an equal chance of landing heads or tails.

12Recall that the p -value is the probability of observing data at least this extreme, under the null hypothesis. If the coin is fair, then the probability of observing at least ten tails is (1 / 2)[10] = 1 / 1 , 024 < 0 . 001. The p -value is therefore 2 / 1 , 024 < 0 . 002, since this is the probability of observing ten heads or ten tails.

13.3 The Family-Wise Error Rate 565

  13.3 The Family-Wise Error Rat
  _H_0 is True
_H_0 is False
Total
Reject _H_0
Do Not Reject _H_0
V
S
U
W
R
m −R
Total _m_0
_m −m_0
m

TABLE 13.2. A summary of the results of testing m null hypotheses. A given null hypothesis is either true or false, and a test of that null hypothesis can either reject or fail to reject it. In practice, the individual values of V , S, U , and W are unknown. However, we do have access to V + S = R and U + W = m − R, which are the numbers of null hypotheses rejected and not rejected, respectively.

We will investigate this issue in greater detail, and pose a solution to it, in Section 13.3.

서브목차