Two years ago, Wired breathlessly extolled the virtues of A/B testing (link). Many Web companies are at the forefront of this movement, running hundreds or thousands of tests daily. The reality is that most A/B tests fail.
A/B tests fail for many reasons. Typically, business leaders consider a test to have failed when the analysis fails to support their hypothesis: "We ran all these tests varying the color of the buttons, nothing significant ever surfaced, and it was all a waste of time!" For smaller websites, it may take weeks or even months to collect enough samples to read a test, so business managers are understandably upset when no action can be taken at its conclusion. It feels like waiting for a train that is running behind schedule.
A bad outcome, however, isn't the primary reason why A/B tests fail. The main ways in which they fail are:
1. Bad design (or no design);
2. Bad execution;
3. Bad measurement.
These issues are often ignored or dismissed. They may not even be noticed if the engineers running the tests have not taken a proper design-of-experiments class. But even though I earned an A in that class at school, it wasn't until I started running real-world experiments that I really learned the subject. This is an area in which theory and practice are both necessary.
The Facebook Data Science team has just launched an open platform for running online experiments, called PlanOut. It looks like a helpful tool for avoiding design and execution problems, and I highly recommend looking into how to integrate it with your website. An overview is here, a more technical paper (PDF) is also available, and there is a GitHub page.
The rest of this post gets into some technical, sausage-factory stuff, so be warned.
***
Bad design is when the experiment is set up in such a way that it cannot provide data to answer the research question. I will give just one example here; it is one of many, many ways to fail.
Let's say you want to test changing the text on your registration button from "Sign up Now" to "Join Free". You run a standard A/B test, randomizing your visitors into two paths, with everything else kept the same except the button text. After you present the test result, the business owner asks to look at geographical segments. That is when you realize that your visitors split 40% English and 60% non-English. Your site doesn't yet have translated pages, so everyone sees the English version. What's the problem?
The problem is noise in your data. The change in the button text is unlikely to influence visitors who don't read English, so when you average their results into the overall registration rate, they dilute the signal coming from English speakers. (And since you don't have translated pages, your registrations probably skew toward English speakers even though your traffic is mostly international.) A better design would have restricted the test to English-speaking countries. As it stands, you are at risk of making a false-negative error.
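To make the dilution concrete, here is a quick simulation sketch with made-up numbers (a 3% baseline registration rate, a 20% relative lift that only English speakers respond to, a 40/60 language split, 20,000 visitors per arm). None of these figures come from a real test; the point is only to compare how often each design detects the effect at the usual 5% level.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Made-up numbers for illustration only.
n = 20_000            # visitors per arm
p_base = 0.03         # baseline registration rate
lift = 0.20           # relative lift, assumed to apply to English speakers only
share_english = 0.40  # fraction of traffic that can read the new button text

def p_value(conv_a, conv_b, n_arm):
    """Two-sided pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (2 * n_arm)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n_arm)
    z = (conv_b - conv_a) / n_arm / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(restrict_to_english):
    """Simulate one test and return its p-value."""
    if restrict_to_english:
        n_arm = int(n * share_english)   # fewer visitors, undiluted effect
        p_treat = p_base * (1 + lift)
    else:
        n_arm = n                        # all visitors, diluted effect
        p_treat = (share_english * p_base * (1 + lift)
                   + (1 - share_english) * p_base)
    conv_a = rng.binomial(n_arm, p_base)
    conv_b = rng.binomial(n_arm, p_treat)
    return p_value(conv_a, conv_b, n_arm)

# Power = share of simulated tests reaching p < 0.05 under each design
for restrict in (False, True):
    power = np.mean([simulate(restrict) < 0.05 for _ in range(2000)])
    label = "English-only design" if restrict else "all-traffic design"
    print(f"{label}: detection rate {power:.2f}")
```

In runs like this, the restricted design detects the effect more often even though it works with fewer visitors, because the effect it is looking for has not been watered down.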
Examples of failed designs are endless; I will discuss others in future posts. Worse than a failed design is no design at all. The Facebook team quotes Ronald Fisher: "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of."
***
Worse than bad or no design is bad execution of a good design. The essence of designed experiments is controlling the system--making sure everything else is the same so that we can isolate the effect of the treatment being investigated. Let's just say websites are complex beasts, and bad execution is the norm rather than the exception.
Start with the experimental unit. How do you identify individuals? The simplest way is to use user ids, but those are only available when the user logs in. If the same user performs activities in both logged-in and logged-out states, you're out of luck. The next technology available is the cookie. But cookies can be cleared, and they do not persist across devices or browsers. There are also caching and other technical complications--solutions to other engineering problems that adversely affect A/B test execution.
At other times, your experiment may not involve individuals at all; it may involve impressions (page views). You might want to randomize the design of a particular web page regardless of who loads it, and there are no ready-made identifiers for impressions.
Next is randomization, which is at the core of the A/B testing methodology. Through the years, I have come across many formulas for computing random numbers on the fly, half of which are probably not truly random. We should have an industry standard for this.
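One pattern that avoids on-the-fly generators entirely is deterministic assignment: hash the unit id (whichever identifier you settled on above) together with an experiment-specific salt, and map the hash to an arm. As I understand it, this is what the PlanOut paper describes. Here is a minimal sketch of the idea; the function and salt names are mine, not from any library.

```python
import hashlib

def assign(unit_id, experiment_salt, arms=("A", "B"), weights=(0.5, 0.5)):
    """Deterministically map a unit (user id, cookie id, impression id)
    to a treatment arm by hashing it with an experiment-specific salt."""
    digest = hashlib.sha1(f"{experiment_salt}.{unit_id}".encode()).hexdigest()
    draw = int(digest[:15], 16) / 16**15   # top 60 bits as a uniform draw on [0, 1)
    cumulative = 0.0
    for arm, weight in zip(arms, weights):
        cumulative += weight
        if draw < cumulative:
            return arm
    return arms[-1]

# The same unit always lands in the same arm within an experiment...
assert assign("cookie_8f3a", "signup_button_text") == \
       assign("cookie_8f3a", "signup_button_text")
# ...while a different salt (a different experiment) assigns independently.
print(assign("cookie_8f3a", "signup_button_text"),
      assign("cookie_8f3a", "homepage_hero_image"))
```

The determinism also solves a practical execution headache: you don't need to store the assignment anywhere to show a returning visitor the same treatment.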
Another issue relates to understanding the structure of the website and its pathways. This is one of the toughest things to do for a large website. It's easy to unintentionally move some element of a page, not realizing that some visitors pass by that element on the way to the conversion page several steps later. Frequently, it's not that people are negligent--they simply didn't know such a pathway existed when the test was designed.
Further, if you run many simultaneous tests, it is extremely important to understand how one test might affect another. Needless to say, this issue is often swept under the rug. The typical reasoning is that if we randomize everything, we will be fine. This is a fallacy: it's only fine if none of your tests produce any significant effects, which I hope is not the case. If the inputs are random, the outputs are random only if the treatments make no difference!
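When two tests overlap on the same traffic, one basic diagnostic is to read the outcome over the joint assignment rather than each test in isolation: an interaction shows up as a treatment effect that changes depending on the other test's arm. A toy sketch with made-up data and column names:

```python
import pandas as pd

# Hypothetical experiment log: one row per unit, its arm in each test,
# and whether it converted. The data and names are made up.
log = pd.DataFrame({
    "button_test": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "layout_test": ["A", "B", "A", "B", "B", "A", "A", "B"],
    "converted":   [0,   1,   0,   1,   0,   1,   0,   0],
})

# Conversion rate in each cell of the joint design. If the button test's
# effect differs across the layout arms, the two tests are interacting.
cell_rates = log.pivot_table(index="button_test",
                             columns="layout_test",
                             values="converted",
                             aggfunc="mean")
print(cell_rates)
```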
***
The third problem is bad measurement, which is often hidden from view. The trickiest issue is that the data you are analyzing are not what you think they are.
For example, the experiment log may only tell you what treatment each unit was supposed to receive, not what it actually received; it is then considered acceptable to assume that the actual treatment matches the designed treatment. Or you may have the opposite problem: a log of the actual treatment but no record of what the design called for, which means you are assuming that the random assignment was properly administered. Needless to say, these assumptions are quite bold. Too bold for my taste.
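A cheap guard is to log both sides and reconcile them before the analysis: the treatment the design assigned, and the treatment the unit was actually exposed to. A minimal sketch with made-up records (the identifiers and field names are hypothetical):

```python
from collections import Counter

# Two hypothetical logs, keyed by unit id:
assigned = {"u1": "A", "u2": "B", "u3": "B", "u4": "A"}   # what the randomizer decided
served   = {"u1": "A", "u2": "A", "u3": "B"}              # what was actually shown

mismatches = Counter()
for unit, arm in assigned.items():
    actual = served.get(unit)
    if actual is None:
        mismatches["no_exposure_logged"] += 1       # e.g. u4: assigned but never served
    elif actual != arm:
        mismatches["served_wrong_treatment"] += 1   # e.g. u2: design says B, got A

print(mismatches)   # anything non-zero deserves investigation before reading the test
```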
***
So Facebook came to the rescue last week with the introduction of PlanOut, an internally grown platform for managing and executing online experiments. I have previously commended the Facebook Data Science team on the meticulous way they ran their experiments (link). I'm glad they are sharing their system with the rest of the world.
While I have not tried the system (we put together something similar at Vimeo, though less fully featured), I noted the following key components (a short sketch follows the list):
- abstracting the experimental parameters as a separate layer
- ability to target specific segments for testing
- ability to work with different levels of experimental units (user_ids, cookie_ids, ...)
- ability to set experimental fractions other than 50/50
- inclusion of standard random number generators
- integrated logging
- management interface for multiple simultaneous experiments to different segments
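To give a flavor of the first few items, here is roughly what an experiment definition looks like with PlanOut's Python API, adapted from the examples on its GitHub page. I have not run this myself, so treat it as a sketch rather than verified code.

```python
# Adapted from the examples in the PlanOut repo -- a sketch, not verified code.
from planout.experiment import SimpleExperiment
from planout.ops.random import UniformChoice, WeightedChoice

class SignupButtonExperiment(SimpleExperiment):
    def assign(self, params, userid):
        # The experimental parameters live here, abstracted away from page code.
        params.button_text = UniformChoice(
            choices=["Sign up Now", "Join Free"], unit=userid)
        # Fractions other than 50/50 are a one-line change.
        params.button_color = WeightedChoice(
            choices=["blue", "green"], weights=[0.8, 0.2], unit=userid)

exp = SignupButtonExperiment(userid=42)
print(exp.get("button_text"), exp.get("button_color"))
# SimpleExperiment also logs each exposure, which covers the logging item above.
```

Because the application only asks for parameter values, changing the split or the choices does not require touching the page logic.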
What I hope they will release in the future:
- establishing the experimental units
- quality-control charts
- monitoring reports
***
Why is it a test failure when you discover something? I would think that "hey, we don't have to worry about stupid color choices!" is a pretty good finding.
Posted by: Sherman Dorn | 04/16/2014 at 08:43 AM
@Sherman Dorn: You're right, but management may not see it that way.
The random number thing is generally less important than you might think. If the server looks at the time and serves option A when the second is odd and B when it's even, it's highly unlikely you'll get a systematic bias.
Also, I always say that A/B testing lets you pick the perfect shade of blue - but what if the best option was actually red?
Posted by: Tom West | 04/16/2014 at 09:03 AM
Tom: My point about random number generators is that there is no industry standard. I have come across a lot of fishy ones.
As for your method, I have always wanted to ask someone: where does the server get the time? Is there any chance that that query fails?
As for your tongue-in-cheek complaint, why not test red against blue?
Posted by: Kaiser | 04/16/2014 at 11:30 AM
I'm wondering why you didn't discuss existing A/B-testing tools, such as Visual Website Optimizer and Optimizely? I'm interested in the pros and cons of these tools from a data scientist's (?) standpoint.
Posted by: Sjors Peerdeman | 04/17/2014 at 04:10 AM
@Kaiser: For sure, if your server time fails, then the test doesn't work. But that's the same as saying if your random number generator fails, then the test doesn't work.
Yes, you could test red against blue... but my point was that A/B testing is often used to zoom in on a specific solution within narrow constraints, while the best answer may be outside those constraints.
Posted by: Tom West | 04/17/2014 at 03:26 PM
Sjors: I am a happy user of Optimizely. We use their SaaS platform and also have a homegrown solution similar to PlanOut. The Facebook solution is written for developers, while the Optimizely-style solution is created for business people. A SaaS solution has limitations on what tests can be run.
The issues I outlined above are not solved by having tools; in fact, I encountered some of them in tests that used such tools. What you need are brains - what I call numbersense. The Facebook solution provides structure and ingredients that help the analyst diagnose problems, but I'm not saying that deploying PlanOut will magically prevent those problems.
Tom: On testing large versus small changes, I think the more interesting debate is whether one should test a complete redesign in which dozens of changes are introduced all at once against the existing design.
Posted by: Kaiser | 04/17/2014 at 04:07 PM