October 27th, 2008

Hypothesis Testing: The Basics

Say I hand you a coin. How would you tell if it’s fair? If you flipped it 100 times and it came up heads 51 times, what would you say? What if it came up heads 5 times, instead?

In the first case you’d be inclined to say the coin was fair and in the second case you’d be inclined to say it was biased towards tails. How certain are you? Or, even more specifically, how likely is it actually that the coin is fair in each case?

Hypothesis Testing

Questions like the ones above fall into a domain called hypothesis testing. Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment.

In the coin example the “experiment” was flipping the coin 100 times. There are two questions you can ask. One, assuming the coin was fair, how likely is it that you’d observe the results you did? Two, what is the likelihood that the coin is fair given the results you observed?

Of course, an experiment can be much more complex than coin flipping. Any situation where you’re taking a random sample of a population and measuring something about it is an experiment, and for our purposes this includes A/B testing.

Let’s focus on the coin flip example to understand the basics.

The Null Hypothesis

The most common type of hypothesis testing involves a null hypothesis. The null hypothesis, denoted H0, is a statement about the world which can plausibly account for the data you observe. Don’t read anything into the fact that it’s called the “null” hypothesis — it’s just the hypothesis we’re trying to test.

For example, “the coin is fair” is an example of a null hypothesis, as is “the coin is biased.” The important part is that the null hypothesis be able to be expressed in simple, mathematical terms. We’ll see how to express these statements mathematically in just a bit.

The main goal of hypothesis testing is to tell us whether we have enough evidence to reject the null hypothesis. In our case we want to know whether the coin is biased or not, so our null hypothesis should be “the coin is fair.” If we get enough evidence that contradicts this hypothesis, say, by flipping it 100 times and having it come up heads only once, then we can safely reject it.

All of this is perfectly quantifiable, of course. What constitutes “enough” and “safely” are all a matter of statistics.

The Statistics, Intuitively

So, we have a coin. Our null hypothesis is that this coin is fair. We flip it 100 times and it comes up heads 51 times. Do we know whether the coin is biased or not?

Our gut might say the coin is fair, or at least probably fair, but we can’t say for sure. The expected number of heads is 50 and 51 is quite close. But what if we flipped the coin 100,000 times and it came up heads 51,000 times? We see 51% heads both times, but in the second instance the coin is more likely to be biased.

Lack of evidence to the contrary is not evidence that the null hypothesis is true. Rather, it means that we don’t have sufficient evidence to conclude that the null hypothesis is false. The coin might actually have a 51% bias towards heads, after all.

If instead we saw 1 head in 100 flips, that would be another story. Intuitively we know that the chance of seeing this if the null hypothesis were true is so small that we would be comfortable rejecting the null hypothesis and declaring the coin to (probably) be biased.

Let’s quantify our intuition.

The Coin Flip

Formally the flip of a coin can be represented by a Bernoulli trial. A Bernoulli trial is a random variable X such that

Pr\left(X = 1\right) = 1 - Pr\left(X = 0\right) = 1 - q = p

That is, X takes on the value 1 (representing heads) with probability p, and 0 (representing tails) with probability q = 1 – p.[1]

Now, let’s say we have 100 coin flips. Let X_i represent the i-th coin flip. Then the random variable

Y = \sum_{i=1}^{100} X_i
counts the number of heads in the run of 100 coin flips.
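To make the definitions concrete, here is a minimal simulation sketch. The function names `flip` and `count_heads` and the seed are illustrative choices, not part of the original post.

```python
import random

def flip(p=0.5, rng=random):
    """One Bernoulli trial: 1 (heads) with probability p, otherwise 0 (tails)."""
    return 1 if rng.random() < p else 0

def count_heads(n_flips=100, p=0.5, rng=random):
    """Y = X_1 + ... + X_n, i.e. the number of heads in n_flips flips."""
    return sum(flip(p, rng) for _ in range(n_flips))

# A fixed seed (arbitrary) makes the run repeatable.
rng = random.Random(2008)
y = count_heads(100, 0.5, rng)
print(f"{y} heads out of 100 flips")
```

Running it a few times with different seeds gives a feel for how much the head count of a fair coin wanders around 50.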

The Statistics, Mathematically

Say you have a set of observations O and a null hypothesis H0. In the above coin example we were trying to calculate

P\left(O \mid H_0\right)
i.e., the probability that we observed what we did given the null hypothesis. If that probability is sufficiently small we’re confident concluding the null hypothesis is false.[2]

We can use whatever level of confidence we want before rejecting the null hypothesis, but most people choose 90%, 95%, or 99%. For example if we choose a 95% confidence level we reject the null hypothesis if

P\left(O \mid H_0\right) \le 1 - 0.95 = 0.05
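For a coin we can compute this probability exactly, without any approximation, using the binomial distribution. The sketch below assumes the usual convention that "what we observed" means "a result at least as far from the expected count as ours" (a two-sided test); the helper names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when Pr(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_tail(heads, n=100, p0=0.5):
    """Probability, under H_0, of a head count at least as far
    from the expected count n * p0 as the observed one."""
    expected = n * p0
    distance = abs(heads - expected)
    return sum(
        binom_pmf(k, n, p0)
        for k in range(n + 1)
        if abs(k - expected) >= distance
    )

for heads in (51, 5):
    prob = two_sided_tail(heads)
    decision = "reject H_0" if prob <= 0.05 else "do not reject H_0"
    print(f"{heads} heads: P = {prob:.3g} -> {decision}")
```

With 51 heads the probability is large, so there is nothing to reject; with 5 heads it is vanishingly small.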

The Central Limit Theorem is the main piece of math here. Briefly, the Central Limit Theorem says that the sum (or average) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed.

Remember our random variables from before? If we let

p = \frac{Y}{N}
then p is the proportion of heads in our sample of N = 100 coin flips. In our case, it is equal to 0.51, or 51%.

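But by the Central Limit Theorem we also know that p is approximately normally distributed. Its standard deviation depends on the true probability of heads, which we don’t know, so when testing a null hypothesis we plug in the hypothesized value p_0:

\sigma = \sqrt{\frac{p_0(1-p_0)}{N}}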
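For a fair coin and N = 100 the formula gives sqrt(0.5 × 0.5 / 100) = 0.05. A quick simulation can sanity-check that figure. This is only a sketch: the number of simulated experiments and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import pstdev

N = 100           # flips per experiment
TRIALS = 10_000   # number of simulated experiments (arbitrary)
rng = random.Random(42)

# Proportion of heads in each simulated run of N fair-coin flips.
proportions = [
    sum(rng.random() < 0.5 for _ in range(N)) / N
    for _ in range(TRIALS)
]

predicted = sqrt(0.5 * (1 - 0.5) / N)  # the formula, with a fair coin
observed = pstdev(proportions)         # spread seen in the simulation

print(f"predicted: {predicted:.4f}, simulated: {observed:.4f}")
```

The two numbers should agree to within a few thousandths, and a histogram of `proportions` would show the familiar bell shape.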

Wrapping It Up

Our null hypothesis is that the coin is fair. Mathematically we’re saying

H_0 : p_0 = 0.50

Here’s the normal curve:

A 95% level of confidence means we reject the null hypothesis if p falls outside the central 95% of the area under the normal curve. Looking at that chart we see that this corresponds to approximately 1.96 standard deviations on either side of the mean.
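If you’d rather compute the cutoff than read it off a chart, Python’s standard library can do it. This sketch assumes a two-sided test, so the leftover area is split evenly between the two tails; `critical_z` is a hypothetical helper name.

```python
from statistics import NormalDist

def critical_z(confidence):
    """Number of standard deviations that encloses the central
    `confidence` fraction of a standard normal curve."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: {critical_z(level):.3f} standard deviations")
```

The same function covers the other common confidence levels mentioned earlier.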

The so-called “z-score” tells us how many standard deviations away from the mean our sample is, and it’s calculated as

z = \frac{p-0.50}{\sqrt{\frac{0.50(1-0.50)}{N}}}

The numerator is “p – 0.50” because our null hypothesis is that p = 0.50. This measures how far the sample mean, p, diverges from the expected mean of a fair coin, 0.50.
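Here is the z-score formula as a small function, as a sketch. It also revisits the earlier intuition that 51% heads means something very different after 100,000 flips than after 100. The function name and signature are illustrative.

```python
from math import sqrt

def z_score(heads, flips, p0=0.5):
    """Standard deviations between the observed proportion of heads
    and the proportion p0 claimed by the null hypothesis."""
    p = heads / flips
    standard_deviation = sqrt(p0 * (1 - p0) / flips)
    return (p - p0) / standard_deviation

small = z_score(51, 100)
large = z_score(51_000, 100_000)
print(f"51 heads in 100 flips:            z = {small:.2f}")
print(f"51,000 heads in 100,000 flips:    z = {large:.2f}")
```

The proportion is 51% both times, but the larger sample shrinks the standard deviation, so the same 1-point gap sits far further out on the curve.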

The Data

Let’s say we flipped three coins 100 times each and got the following data.

Data for 100 Flips of a Coin
Coin      Flips   Pct. Heads   Z-score
Coin 1    100     51%          0.20
Coin 2    100     60%          2.00
Coin 3    100     75%          5.00

Using a 95% confidence level we’d conclude that Coin 2 and Coin 3 are biased using the techniques we’ve developed so far. Coin 2 is 2.00 standard deviations from the mean and Coin 3 is 5.00 standard deviations, both beyond the 95% threshold. Coin 1, at 0.20 standard deviations, is well within it.

When your test statistic meets the 95% confidence threshold we call it statistically significant.

This means that if the null hypothesis were true, there would be only a 5% chance of observing a result at least this extreme. Phrased another way, random variation alone would produce a result like yours only 5% of the time.
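To see how that plays out for the three coins, the sketch below turns each observed percentage into a two-sided probability under the null hypothesis. It uses the hypothesized 0.50 in the denominator, as in the z-score formula, and the normal approximation; `two_sided_p_value` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(p_hat, n, p0=0.5):
    """Normal-approximation probability, under H_0, of a sample
    proportion at least as far from p0 as p_hat."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

coins = [("Coin 1", 0.51), ("Coin 2", 0.60), ("Coin 3", 0.75)]
results = {name: two_sided_p_value(p_hat, 100) for name, p_hat in coins}

for name, p_value in results.items():
    verdict = "significant" if p_value <= 0.05 else "not significant"
    print(f"{name}: p-value = {p_value:.3g} ({verdict} at the 95% level)")
```

Coin 2 lands just under the 5% cutoff, which is a good reminder that "significant" is a threshold decision, not a measure of how large the bias is.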

Recap

Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment. You start by forming a null hypothesis, e.g., “this coin is fair,” and then calculate how likely your observations would be if the null hypothesis were true, that is, if only pure chance were at work.

The confidence level sets the threshold at which you reject the null hypothesis. At a 95% confidence level you reject the null hypothesis when a result at least as extreme as yours would occur no more than 5% of the time if it were true. This also means that when the null hypothesis really is true, you’ll wrongly reject it about 5% of the time because of random fluctuations.

The null hypothesis can be any mathematical statement and the test you use depends on both the underlying data and your null hypothesis. In our coin flipping example the underlying data approximated a normal distribution and we wanted to test whether the observed proportion of heads was different enough to be significant. In this case we were measuring the sample mean.

We can measure anything, though: the sample variance, correlation, etc. Different tests need to be used to determine whether these are statistically significant, as we’ll see in coming articles.

What’s Next?

Now that we understand the innards of hypothesis testing we can apply our knowledge to A/B tests to determine whether new features actually affect user behavior. Until then!

  1. Of course, 1 can represent either heads or tails so long as you’re consistent and 0 represents the opposite outcome.
  2. But remember, if that probability is not sufficiently small, that doesn’t mean the null hypothesis is true!
  • "This means that there's only a 5% chance that this data was due to pure chance rather than a real difference between a fair coin and the coin you flipped."

    Uh, I think you mean the other way around; there's a 5% chance that there's a real difference, and 95% that it was just a matter of luck. You're 95% sure that there is no difference.
  • Tordek,

    The sentence is awkward, but I think it's correct.

    P(O | H_0) ≤ 0.05 means that if the null hypothesis is correct then there's only a 5% chance of observing what you did.
  • Jaynes
    I'm confused about one thing. If the null hypothesis is that the coin is fair and there is insufficient evidence to reject this, but we cannot assert the null hypothesis, doesn't that mean we cannot conclude that the coin is fair?
  • You can never know if a coin is fair, at least not by taking a statistical approach. There will always be some level of variance in the process.

    Hypothesis testing lets us quantify that variance and see whether or not the observed results fall outside those bounds. If they do we can say with some level of confidence that the coin is biased.

    Let's say a coin lands on heads 51% of the time. If we flip a coin 100 times and get 51 heads it's impossible to tell whether that was the natural variance of a fair coin or the bias of a 51% coin.

    In reality a "fair coin" means a "fair enough coin." We'd have to flip a coin 10,000 times before we'd be able to detect a 51% bias for heads.
  • C
    I have a question;
    Why do we need to test the null hypothesis and not the experimental hypothesis?
  • C,

    What do you mean by "experimental hypothesis?" The only hypotheses involved are the null hypothesis and its negation, the alternative hypothesis.

    If the data is unlikely to have occurred under the null hypothesis, we accept the alternative hypothesis with some level of confidence — usually 95%.
  • C
    More specifically; why does a researcher try to disprove the null hypothesis rather than prove the research hypothesis? Why not just test the research hypothesis?
  • How would you prove the alternative hypothesis?
  • Let me put my question another way.

    You have a coin and don't know if it's fair. You flip it 100 times and it lands on heads 51 times.

    What can you say about the coin? Can you say it's fair? Can you say it's biased?

    More generally, it's difficult, if not impossible, to prove a hypothesis is correct. You can prove a hypothesis is false, however. So if you want to know whether a coin is biased you should see whether the data falsifies the opposite claim, viz., that the coin is fair.
  • I think that the "p" in the denominator for the "z score" should be the hypothesized value, in this case 0.50. Your table uses calculated values to find z.

    This text is from the referenced web site. (McClave and Sincich also define the formula that way.)
    "Analyze Sample Data
    Using sample data, find the test statistic and its associated P-Value.
    • Standard deviation. Compute the standard deviation (σ) of the sampling distribution.
    σ = sqrt[ P * ( 1 - P ) / n ]
    where P is the hypothesized value of population proportion in the null hypothesis, and n is the sample size."
  • You're right, I equivocated in my use of "p". At first I use it to mean the value of p under the null hypothesis, H_0: p = 0.50, but then use it to mean the measured value elsewhere.

    I'll fix it.
  • kayla
    i would like for this page to tell me the answer not give me examples. i need an answer not stupid examples.please thank you
  • The answer to what?
  • What a useful post here. Very informative for me..TQ friends...

    Cheers,
    Blog Review