In my HBR article about A/B testing (link), I described one of the key managerial problems related to A/B testing--the surplus of “positive” results that don’t quite seem to add up. In particular, I mentioned this issue:
When managers are reading hour-by-hour results, they will sometimes find large gaps between Groups A and B, and demand prompt reaction. Almost all such fluctuations result from temporary imbalance between the two groups, which gets corrected as new samples arrive.
Over the holidays, I paid a visit to the Optimizely team, and learned that they have been developing a solution to this problem. (Optimizely is one of the leading platforms for online A/B testing. They just made an announcement this week about a new feature they are calling “the New Stats Engine”.)
Optimizely also recognizes that their clients face a credibility crisis when the A/B testing tool returns too many “significant” results. Their new tool promises to reduce this false-positive problem. They tackle specifically two sources of the problem:
a) Many clients monitor A/B tests like horse races, and run tests to significance. This is sometimes known as “sampling to a foregone conclusion”.
b) Many clients run many (dozens to hundreds, I imagine) tests simultaneously; here, a test is any pairwise comparison of variations, comparison of variations within segments, or any comparison using multiple goals. This is the “multiple comparisons” problem.
Let me first explain why those are bad practices.
The classical hypothesis test is designed to work with fixed sample sizes, which should be determined prior to the start of the test. The testing protocol then allows up to a 5-percent probability of falsely concluding that there is an effect (That’s the same value as the significance level. This is not the same saying 5 percent of the positive results are false, but that’s a different article). However, if the analyst is peeking at the result multiple times during a test, then the analyst incurs a 5-percent false-positive chance, not once, but for every such peek. Thus, at the end of the test (when significance is reached), the probability of a false positive is much, much higher than five percent. It can be shown that every A/A test will reach significance eventually in this setting.
In a “multivariate” test, the analyst makes many pairwise comparisons, and each comparison is analogous to a peek of the data. Each comparison incurs a 5-percent false-positive chance so that across all of the comparisons within one test, the chance of seeing at least one false positive result is exponentially larger. There are many, many different ways to suffer a false positive (an error in comparison 1 only, in comparison 2 only, etc., in comparisons 1 and 2, in comparisons 1 and 3, etc.).
Now, if the multivariate test is being run to significance, you have a hydra of a head.
The Optimizely solution uses two key results from statistics:
a) A sequential testing framework is adopted, in which the analyst is presumed to be peeking at the results. The Bayesian analysis in most cases will not result in significance even if the sampling does not end--because of the skeptical prior. This line of research started in the 70s 40s with Wald.
b) All solutions to the multiple comparisons problem involves tightening the threshold of significance at the individual test level. Optimizely adopts the Benjamini-Hochberg approach to controlling the “false discovery rate,” (FDR) defined as the proportion of significant results that are in fact false. This line of research is from the 90s, and still very active. One advantage is that the FDR is an intuitive concept.
What this means for Optimizely clients is that your winning percentages (i.e., the proportion of tests returning significant results) will plunge! And before you despair, this is actually a great thing. Here’s why: In many testing programs, as I pointed out in the HBR article (link), there are too many “positive” findings, which means there are too many false positives. This is fine until the management starts asking you why those positive findings don’t show up in the corporate metrics.
If you currently rely on standard Optimizely reports to read test results, and run tests to significance, then the Stats Engine is surely a no-brainer.
In the next post, I have further thoughts for those customers who have more advanced protocols in place.
PS. This is Optimizely's official explanation of their changes on YouTube.