An Introduction to A/B Testing

by Jesse Farmer on Monday, January 19, 2009

A/B testing is one of the primary tools in any data-driven environment. You can think of it as a big cage match. Send in your champion versus several other challengers and out comes a victor.

Of course, on the web there's less blood and more statistics, but the principle remains the same: how do you know who will win unless you force them to fight to the death?

A/B Testing lets you compare several alternate versions of the same web page simultaneously and see which produces the best outcome, e.g., increased click-through, engagement, or any other metric of your choice.

Ok, What is A/B Testing, Really?

A/B Testing is a way of conducting an experiment where you compare the performance of a control group to that of one or more test groups. Each visitor is randomly assigned to a group, and each group receives a treatment that differs from the control in a single variable. Let's break that down.

First, you decide on an experiment. Maybe you're building a web application that forces users to register and you want to experiment on your landing page. You want to see if you can improve the percentage of people who register.

The conversion rate for your landing page is

\text{conversion rate} = \frac{\text{\# of visitors who register}}{\text{\# of total visitors}}

For example, if 100 people visit your landing page today and 20 of those people register then you have a conversion rate of 20%. All else being equal, the landing page with the higher conversion rate is better.

"All else being equal" is important here: if one of your landing pages promises free candy to people who register you might get a higher conversion rate, but the resulting users will have less long-term value once they realize you're a big fat liar. I'm also not going to talk about statistical significance, yet.
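The formula is simple enough to sketch in a few lines of Python. The function name `conversion_rate` is my own choice for illustration, not part of any library:

```python
def conversion_rate(registered: int, visitors: int) -> float:
    """Fraction of visitors who went on to register."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return registered / visitors


# The example from the text: 100 visitors, 20 of whom register.
rate = conversion_rate(20, 100)
print(f"{rate:.2%}")  # prints 20.00%
```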

Building Treatments

Once you know what you want to test you have to create treatments to test it. One of the treatments will be the control, i.e., your current landing page. The other treatments will be variations on that. Here are some things worth testing:

Layout. Move the registration forms around. Add fields, remove fields.
Headings. Add headings. Make them different colors. Change the copy.
Copy. Change the size, color, placement, and content of any text you have on the page.

You can have as many treatments as you want, but you get better data more quickly with fewer treatments. I rarely conduct A/B tests with more than four treatments.

Randomization Means Control

You can't just throw up one landing page on Friday and another landing page on Saturday and compare the conversion rates: there's no reason to believe that the conversion rate for users who visit on a Friday is the same as for users who visit on a Saturday. In fact, it probably isn't.

A/B testing solves this by running the experiment in parallel and randomly assigning a treatment to each person who visits. This controls for any time-sensitive variables and distributes the population proportionally across the treatments.
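Here is one possible way to do the random assignment, as a rough sketch rather than a recommended implementation. It hashes a visitor ID so the same visitor always sees the same treatment on repeat visits. The treatment names, the `visitor_id` parameter, and the `experiment` label are all hypothetical, and hashing is my own choice here; picking at random and remembering the result in a cookie would work too.

```python
import hashlib

# Hypothetical treatment names, for illustration only.
TREATMENTS = ["control", "variation_1", "variation_2"]


def assign_treatment(visitor_id: str, experiment: str = "landing-page") -> str:
    """Map a visitor to one treatment, with roughly equal odds for each."""
    # Including the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into an integer and pick a bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(TREATMENTS)
    return TREATMENTS[bucket]


print(assign_treatment("visitor-12345"))
```

Because the assignment depends only on the visitor ID and the experiment name, you don't have to store anything to keep a returning visitor in the same group.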

Let's look at an example data set.

An Example

Say we have a service called "Foobar" and we're conducting an experiment on our landing page. Our goal is to improve the conversion rate by at least 10%. When a new visitor arrives on the landing page we randomly assign them one of three treatments: the control, Treatment A, or Treatment B.

Let's also say these treatments involve the headline copy. For example, the control treatment's headline copy might be "Foobar is a great service! Sign up here." One of the experimental treatments might have "Foobar lets you stay in touch with family all across the country — easily."

You run the experiment for a few days and get the following data:

A/B Testing Example Data for the Foobar Service
Treatment	Visitors Treated	Visitors Registered	Conversion Rate
Control	1,406	356	25.32%
Treatment A	1,488	382	25.67%
Treatment B	1,392	425	30.53%

From the data above you'd conclude that Treatment B is the winner, but you have to be careful — if the conversion rates were closer or if your sample size were smaller you wouldn't be able to tell which treatment won. For example, can you say for certain that Treatment A is better than the control treatment, or could it just be due to chance?
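As a preview of how you might put a number on "due to chance", here is a minimal sketch of a two-proportion z-test applied to the control and Treatment B rows of the table. This is one standard technique for comparing two rates on large samples, not the only one, and the function name is my own. It uses only the Python standard library.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: what we'd estimate if both groups shared one true conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal at |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Control (356 of 1,406) versus Treatment B (425 of 1,392).
z, p = two_proportion_z_test(356, 1406, 425, 1392)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says a gap this large would be surprising if the two treatments really performed the same. The same function can be pointed at the Treatment A row to see how much weaker that evidence is.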

Sample Size Matters

The sample size of a treatment is the number of people who received that treatment. The larger the sample size the more certain you are that the sample's performance reflects the real performance of the treatment.

For example, what if the above data looked like this, instead?

A/B Testing Example Data for the Foobar Service
Treatment	Visitors Treated	Visitors Registered	Conversion Rate
Control	10	3	30.00%
Treatment A	12	6	50.00%
Treatment B	9	4	44.44%

Which treatment is the best, now? You might be inclined to say that Treatment A is the winner because it has the highest conversion rate. But this is akin to saying that you know a coin is biased because you flipped it three times and got all heads.

That might be unlikely, but it's not impossible: a fair coin comes up heads three times in a row one time in eight. The larger the sample size the more certain you are that the effects you're observing are from real differences in the treatments and not from pure chance. In fact, none of these results are statistically significant, i.e., differences this large could easily show up by chance even if the treatments performed identically.
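For samples this small, one way to check the claim is Fisher's exact test, which counts how many ways the registrations could have been split between two groups. The sketch below is my own minimal version using only the standard library, with a function name I made up. It is meant as an illustration, not a replacement for a proper statistics package.

```python
from math import comb


def fisher_exact_two_sided(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided Fisher's exact test p-value for two conversion counts."""
    total_conv = conv_a + conv_b
    total_n = n_a + n_b

    def weight(k: int) -> int:
        # Ways to put k of the conversions in group A and the rest in group B.
        return comb(total_conv, k) * comb(total_n - total_conv, n_a - k)

    lowest = max(0, n_a - (total_n - total_conv))
    highest = min(total_conv, n_a)
    observed = weight(conv_a)
    # Add up every split that is no more likely than the one we actually saw.
    extreme = sum(weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed)
    return extreme / comb(total_n, n_a)


# Control (3 of 10) against Treatment A (6 of 12) and Treatment B (4 of 9).
print(f"A vs control: p = {fisher_exact_two_sided(3, 10, 6, 12):.2f}")
print(f"B vs control: p = {fisher_exact_two_sided(3, 10, 4, 9):.2f}")
```

Both p-values come out far above the usual 0.05 cutoff, so neither difference would be hard to get by luck alone.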

Since sample size is per-treatment there are primarily two ways to increase it: use fewer treatments or run the experiment for longer.
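To see the trade-off, here is a back-of-the-envelope sketch. It assumes traffic is split evenly across treatments and arrives at a steady daily rate. The target of 1,400 visitors per treatment and the 700 visitors per day are made-up numbers, and `days_needed` is a hypothetical helper.

```python
import math


def days_needed(target_per_treatment: int, num_treatments: int, daily_visitors: int) -> int:
    """Days of traffic required for every treatment to reach its target sample size."""
    total_needed = target_per_treatment * num_treatments
    return math.ceil(total_needed / daily_visitors)


# Hypothetical numbers: 1,400 visitors per treatment, 700 visitors per day.
for num_treatments in (2, 3, 4):
    print(num_treatments, days_needed(1400, num_treatments, 700))
# prints:
# 2 4
# 3 6
# 4 8
```

Every extra treatment adds the same number of days, which is a big part of why fewer treatments get you answers faster.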

What's Next?

There's a lot more to cover when it comes to A/B testing. Here are a few topics I'll be writing about over the coming weeks:

Implementation: Once we understand what A/B testing is about, how do we implement it? Do different products require different implementations?
Statistical Significance: Once we have results from our A/B test, how can we quantify our level of certainty? How long do we have to run an experiment before we can be certain of the results?
Hypothesis Testing: What if we want to test more complex behavior? What if the data we get back can't be modeled as a simple percentage?
Best Practices: What is worth testing? How do you balance short-term and long-term goals in the context of testing?

That's it for today. Feel free to leave a comment and let me know what you want me to write about next. Cheers!