Hypothesis Testing: The Basics
Say I hand you a coin. How would you tell if it's fair? If you flipped it 100 times and it came up heads 51 times, what would you say? What if it came up heads 5 times, instead?
In the first case you'd be inclined to say the coin was fair and in the second case you'd be inclined to say it was biased towards tails. How certain are you? Or, even more specifically, how likely is it actually that the coin is fair in each case?
Hypothesis Testing
Questions like the ones above fall into a domain called hypothesis testing. Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment.
In the coin example the "experiment" was flipping the coin 100 times. There are two questions you can ask. One: assuming the coin is fair, how likely is it that you'd observe the results you did? Two: given the results you observed, how likely is it that the coin is fair?
Of course, an experiment can be much more complex than coin flipping. Any situation where you're taking a random sample of a population and measuring something about it is an experiment, and for our purposes this includes A/B testing.
Let's focus on the coin flip example to understand the basics.
The Null Hypothesis
The most common type of hypothesis testing involves a null hypothesis. The null hypothesis, denoted H0, is a statement about the world which can plausibly account for the data you observe. Don't read anything into the fact that it's called the "null" hypothesis — it's just the hypothesis we're trying to test.
For example, "the coin is fair" is an example of a null hypothesis, as is "the coin is biased." The important part is that the null hypothesis be able to be expressed in simple, mathematical terms. We'll see how to express these statements mathematically in just a bit.
The main goal of hypothesis testing is to tell us whether we have enough evidence to reject the null hypothesis. In our case we want to know whether the coin is biased or not, so our null hypothesis should be "the coin is fair." If we get enough evidence that contradicts this hypothesis, say, by flipping it 100 times and having it come up heads only once, then we can safely reject it.
All of this is perfectly quantifiable, of course. What constitutes "enough" and "safely" are all a matter of statistics.
The Statistics, Intuitively
So, we have a coin. Our null hypothesis is that this coin is fair. We flip it 100 times and it comes up heads 51 times. Do we know whether the coin is biased or not?
Our gut might say the coin is fair, or at least probably fair, but we can't say for sure. The expected number of heads is 50 and 51 is quite close. But what if we flipped the coin 100,000 times and it came up heads 51,000 times? We see 51% heads both times, but in the second instance the coin is more likely to be biased.
Lack of evidence to the contrary is not evidence that the null hypothesis is true. Rather, it means that we don't have sufficient evidence to conclude that the null hypothesis is false. The coin might actually have a 51% bias towards heads, after all.
If instead we saw 1 head in 100 flips, that would be another story. Intuitively we know that the chance of seeing this if the null hypothesis were true is so small that we would be comfortable rejecting the null hypothesis and declaring the coin to (probably) be biased.
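To make this intuition concrete, here is a small simulation, just a sketch using NumPy (not part of the analysis that follows), estimating how often a fair coin would produce each of these outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000  # number of simulated experiments per scenario

# 51 or more heads in 100 flips of a fair coin: quite common.
heads_100 = rng.binomial(n=100, p=0.5, size=trials)
print((heads_100 >= 51).mean())      # roughly 0.46

# 51,000 or more heads in 100,000 flips: essentially never for a fair coin.
heads_100k = rng.binomial(n=100_000, p=0.5, size=trials)
print((heads_100k >= 51_000).mean()) # roughly 0.0 (the true probability is about 10^-10)

# Only 1 head (or none) in 100 flips: also essentially never.
print((heads_100 <= 1).mean())       # roughly 0.0
```

The same 51% proportion is unremarkable over 100 flips but wildly implausible over 100,000.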
Let's quantify our intuition.
The Coin Flip
Formally, the flip of a coin can be represented by a Bernoulli trial. A Bernoulli trial is a random variable X such that

P(X = 1) = p
P(X = 0) = 1 - p

That is, X takes on the value 1 (representing heads) with probability p, and 0 (representing tails) with probability 1 - p. (Of course, 1 can represent either heads or tails, so long as you're consistent and 0 represents the opposite outcome.)
Now, let's say we have 100 coin flips. Let Xi represent the ith coin flip. Then the random variable

p̂ = (X1 + X2 + ... + X100) / 100

is the proportion of heads observed over the 100 flips, i.e., the sample mean of the individual flips.
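As an illustrative sketch (the variable names are my own), here's how these 100 Bernoulli trials and the resulting proportion of heads could be simulated:

```python
import random

random.seed(1)   # for reproducibility
p = 0.5          # the coin's true probability of heads (unknown in practice)
n = 100          # number of flips

flips = [1 if random.random() < p else 0 for _ in range(n)]  # X1, ..., X100
p_hat = sum(flips) / n                                       # observed proportion of heads
print(p_hat)     # typically near 0.5 for a fair coin
```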
The Statistics, Mathematically
Say you have a set of observations O and a null hypothesis H0. In the above coin example we were trying to calculate

P(O | H0)

the probability of seeing the observations O if the null hypothesis H0 were true.
We can use whatever level of confidence we want before rejecting the null hypothesis, but most people choose 90%, 95%, or 99%. For example, if we choose a 95% confidence level we reject the null hypothesis if

P(O | H0) < 0.05

that is, if observations as extreme as ours would occur less than 5% of the time under the null hypothesis.
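As an illustration, here's a sketch that computes this probability exactly from the binomial distribution for two of the outcomes discussed earlier. The helper name and the two-sided convention ("at least as far from 50 heads as what we saw") are my own choices; the z-test developed below is a normal approximation to this calculation.

```python
from math import comb

def two_sided_tail(heads, flips, p0=0.5):
    """Probability, under H0: p = p0, of a result at least as far from the
    expected count as the observed number of heads."""
    expected = flips * p0
    distance = abs(heads - expected)
    return sum(
        comb(flips, k) * p0**k * (1 - p0)**(flips - k)
        for k in range(flips + 1)
        if abs(k - expected) >= distance
    )

print(two_sided_tail(51, 100))  # about 0.92, nowhere near 0.05: don't reject H0
print(two_sided_tail(1, 100))   # about 1.6e-28, far below 0.05: reject H0
```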
The Central Limit Theorem is the main piece of math here. Briefly, the Central Limit Theorem says that the average of a large number of independent, identically distributed random variables is approximately normally distributed, and the approximation gets better as the number of variables grows.
Remember our random variables from before? If we let p̂ be the average of our 100 flips, as above, then p̂ is exactly this kind of average of identically distributed random variables. By the central limit theorem, then, p̂ approximately follows a normal distribution, and we can estimate its standard deviation (its standard error) as

sqrt(p̂(1 - p̂) / n)

where n is the number of flips, 100 in our case.
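A quick simulation (again just a sketch) confirms that this estimate matches the spread of p̂ you actually see across many repeated 100-flip experiments with a fair coin:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 0.5

# One sample proportion p_hat per simulated 100-flip experiment.
p_hats = rng.binomial(n=n, p=p, size=50_000) / n

# In the simulation we know the true p, so we plug it into the formula;
# in practice you'd use the observed p_hat instead.
print(p_hats.std())              # empirical spread, roughly 0.05
print(np.sqrt(p * (1 - p) / n))  # sqrt(0.5 * 0.5 / 100) = 0.05
```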
Wrapping It Up
Our null hypothesis is that the coin is fair. Mathematically we're saying

p = 0.50

where p is the probability of the coin coming up heads.
Here's the normal curve:
A 95% level of confidence means we reject the null hypothesis if p̂ falls outside the central 95% of the area under the normal curve. Looking at that chart we see that this corresponds to approximately 1.96 standard deviations from the mean.
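If you'd rather compute that number than read it off a chart, SciPy's standard normal distribution gives it directly (a small sketch, assuming SciPy is available):

```python
from scipy.stats import norm

print(norm.ppf(0.975))                   # about 1.96: cuts off 2.5% in the upper tail
print(norm.cdf(1.96) - norm.cdf(-1.96))  # about 0.95: the central area within 1.96 sd
```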
The so-called "z-score" tells us how many standard deviations away from the mean our sample is, and it's calculated as

z = (p̂ - 0.50) / sqrt(p̂(1 - p̂) / n)

The numerator is "p̂ - 0.50" because our null hypothesis is that p = 0.50. It measures how far the sample mean, p̂, diverges from the expected mean of a fair coin, 0.50. The denominator is the estimated standard deviation of p̂ from the previous section.
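Here's a minimal sketch of that calculation; `z_score` is a hypothetical helper name, and the denominator uses the observed proportion, which is what produces the z-scores in the table below.

```python
from math import sqrt

def z_score(heads, flips, p0=0.5):
    p_hat = heads / flips
    se = sqrt(p_hat * (1 - p_hat) / flips)  # estimated standard deviation of p_hat
    return (p_hat - p0) / se

print(z_score(51, 100))  # about 0.20: well within what a fair coin produces
```

Some references plug the null value 0.50 into the denominator instead; with 100 flips the difference is small.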
The Data
Let's say we flipped three coins 100 times each and got the following data.
Data for 100 Flips of a Coin

| Coin   | Flips | Pct. Heads | Z-score |
|--------|-------|------------|---------|
| Coin 1 | 100   | 51%        | 0.20    |
| Coin 2 | 100   | 60%        | 2.04    |
| Coin 3 | 100   | 75%        | 5.77    |
Using a 95% confidence level and the techniques we've developed so far, we'd conclude that Coin 2 and Coin 3 are biased: Coin 2 is 2.04 standard deviations from the mean and Coin 3 is 5.77 standard deviations.
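For completeness, here is a self-contained sketch (repeating the hypothetical `z_score` helper from above) that reproduces the table's z-scores and applies the 1.96 cutoff:

```python
from math import sqrt

def z_score(heads, flips, p0=0.5):
    p_hat = heads / flips
    return (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / flips)

for name, heads in [("Coin 1", 51), ("Coin 2", 60), ("Coin 3", 75)]:
    z = z_score(heads, 100)
    verdict = "reject H0 (biased)" if abs(z) > 1.96 else "cannot reject H0"
    print(f"{name}: z = {z:.2f} -> {verdict}")

# Coin 1: z = 0.20 -> cannot reject H0
# Coin 2: z = 2.04 -> reject H0 (biased)
# Coin 3: z = 5.77 -> reject H0 (biased)
```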
When your test statistic meets the 95% confidence threshold we call it statistically significant.
This means that, if the null hypothesis were true, there would be less than a 5% chance of observing a result as extreme as yours. Phrased another way, random variation alone would rarely produce a difference this large.
Recap
Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment. You start by forming a null hypothesis, e.g., "this coin is fair," and then calculate the likelihood that your observations are due to pure chance rather than a real difference in the population.
The confidence level is the threshold at which you reject the null hypothesis. If observations as extreme as yours would occur less than 5% of the time given the null hypothesis, then you can reject it with 95% confidence. This also means there is up to a 5% chance that you're wrong and the difference is just due to random fluctuations.
The null hypothesis can be any mathematical statement, and the test you use depends on both the underlying data and your null hypothesis. In our coin flipping example the sample mean (the observed proportion of heads) approximated a normal distribution, and we wanted to test whether it differed from 0.50 by enough to be significant.
We can measure anything, though: the sample variance, correlation, etc. Different tests need to be used to determine whether these are statistically significant, as we'll see in coming articles.
What's Next?
Now that we understand the innards of hypothesis testing we can apply our knowledge to A/B tests to determine whether new features actually affect user behavior. Until then!