In Part 1, I covered the logic behind recent changes to the statistical analysis used in standard reports by Optimizely.
In Part 2, I ponder what this change means for more sophisticated customers--those who are following the proper protocols for classical design of experiments, such as running tests of predetermined sample sizes, adjusting for multiple comparisons, and constructing and analyzing multivariate tests using regression with interactions.
For this segment, the choice of sticking with the existing protocol or not depends on many factors, such as the decision-making culture and corporate priorities. No matter what you do, it is important to realize that improved analysis tools do not obviate careful planning and execution.
Let me start with my advice. Initially, keep running your tests to the usual fixed sample sizes. In essence, you ignore the stopping rule suggested by the Stats Engine. Over a series of tests, including some A/A tests, you can measure how likely those stopping rules would have correctly ended the tests (relative to the fixed-size testing protocol). This allows you to estimate the “time saving” achieved from sequential testing.
As I pointed out in last year’s presentation at the Optimizely Experience, the testing team should be concerned about what proportion of significant findings are correctly called, and what proportion of non-significant findings are incorrectly called. The “false discovery rate” is the flip side of the first quantity.
A testing program using fixed samples may face one of several problems:
a) Too few tests are called significant.
b) Too many tests are called significant.
c) It takes too long to call a test.
You need to figure out what is your biggest problem.
Conceptually, relative to a fixed-size test, a sequential test saves time if the true response rate differs from the design assumption substantially. If you’re testing on a web page for which the response rate is well-known and relatively stable, then there should be hardly any time saving on average. This is why I don’t recommend watching tests like a horse race, minute by minute. (As I said in Part 1, if you are watching a horse race, the Stats Engine will provide some sanity.)
The horizontal axis shows the sample size (at which Optimizely calls an end to the sequential test) as a ratio of the fixed sample size (by design). When this is 100%, the sequential test has the same length as the fixed-sample test. Because the true effect is substantially larger than expected, for a large proportion of tests, the sequential procedure calls for an “early” exit. However, there will be a small number of tests for which the sequential test will end much later than a fixed-sample test.
The line is mostly flat meaning there is equal probability of the test ending at any sample size, including sample sizes that are multiples of the fixed-sample requirement. This is the “price to pay” for doing sequential testing, i.e. multiple peeking. At the lower end of sample sizes, I expect a slight positive curve, because the Bayesian prior (assuming it is a skeptical prior) will prevent tests from being stopped “too early”.
[Thanks to Optimizely’s statistics team for entertaining my inquiries about this intuition.]
How important is saving time for your testing program? This depends on your readiness to move on. My experience is that unexpected time saving, say calling a winner one week before the test was supposed to end, frequently gets eaten up by the organization’s inability to move schedules around. Your IT or web developers may have other projects on their plates.
Further, if you tend to look at data by segments post-hoc, I don't think the current implementation supports that. If you know what segments you care about beforehand, then you can build those into the design.
Most importantly, please don’t fall into the trap of thinking that design and upfront planning become unimportant because of sequential testing and FDR. The design phase is very important in establishing expectations and facilitating communications within the organization.
I also recommend reading this post by Andrew Gelman on data-dependent stopping rules.