Ron Kohavi is on a mission to reform A/B testing culture within the big tech companies, and that's a good thing. On LinkedIn, he posted a note with his thoughts on "shipping flat". For the statistically-minded readers here, the term means launching the Test treatment even though the A/B test judged the difference between the Test and Control treatments to be statistically insignificant. (Admittedly, I heard this term for the first time from Kohavi, although I can attest that it is common industry practice to launch the Test because it is not worse than the Control.)
More concretely, let's say the engineering team developed a new file uploader for the website. When tested against the existing file uploader, it did not improve the performance metric, say the daily frequency of uploads by users.
In the old days of direct marketing, people would just reject the new variant. We had a notion of the "Champion". Challengers don't knock off the Champion unless there is definitive proof (statistical significance) that they are better. In sports, a world record is not beaten when it is merely equaled: the first record-holder retains the title until someone beats (not equals) the record.
Nowadays, there are plenty of arguments made to "ship flat". Kohavi laid out many of those arguments, and presented why they are flawed. I will discuss those in the next blog post (now available).
***
In this post, I want to respond to Kohavi's list of situations where "non-inferiority testing" is justified. These are situations in which he believes that tests can be designed with the hope of proving that the new treatment (Test) does no worse than the existing Control. In other words, one can accept that the Test and Control are a statistical tie.
I believe in most of the situations he outlined, non-inferiority is wishful thinking. It is better to acknowledge that those changes are expected to impact performance metrics negatively - the purpose of the A/B test in those scenarios is to quantify the extent of the damage.
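To make the distinction concrete, here is a minimal sketch of a non-inferiority test for a conversion-style metric, using the usual normal approximation for a difference in proportions. The function name, the margin, and the counts are all illustrative, not from Kohavi's note: the null hypothesis is that Test is worse than Control by at least the margin, so a small p-value is what lets you claim "no worse than".

```python
from statistics import NormalDist

def noninferiority_z_test(x_test, n_test, x_ctrl, n_ctrl, margin):
    """One-sided test of H0: p_test <= p_ctrl - margin.

    A small p-value supports the claim that Test is no worse than
    Control by more than `margin` (absolute difference in proportions).
    """
    p_t = x_test / n_test
    p_c = x_ctrl / n_ctrl
    # Unpooled standard error of the difference in proportions
    se = (p_t * (1 - p_t) / n_test + p_c * (1 - p_c) / n_ctrl) ** 0.5
    z = (p_t - p_c + margin) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical numbers: Test converts 10.1%, Control 10.0%,
# and we tolerate up to a 1-point absolute drop (margin = 0.01).
z, p = noninferiority_z_test(1010, 10000, 1000, 10000, margin=0.01)
```

Note that the conclusion hinges entirely on the margin you are willing to tolerate - with a margin of zero, a "flat" result would rarely clear the bar, which is why declaring non-inferiority is a design choice, not a free pass.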
The simplest case to explain this point of view is Kohavi's scenario (c): GDPR and CCPA are pro-privacy legislation that will almost surely hurt websites with advertising business models. For example, websites may have to ask for opt-in instead of opt-out. Ideally, the website would love to come up with something that would not reduce their advertising income. If they succeeded in doing so, then the legislation is proven toothless! The value of the A/B test here isn't to show no statistical difference, but to show management the likely magnitude of the harm.
Scenario (d), called "competitive parity", is the most interesting. In Kohavi's words:
The competition has the feature, and an important comparison is coming out by a third-party where our product will be missing the checkmark. Note that if the competition is a solid data-driven company, it is likely that they shipped this feature because it was positive to the OEC. A flat treatment hints at a poor implementation that could be improved.
I can even hear this argument in my head so it's great that Kohavi laid it out for discussion. I find it hard to accept it as a valid case for "shipping flat".
Here is what I don't like about the competitive parity argument:
- The competitor - the solid data-driven company - is also likely to have "shipped flat" due to any of the reasons listed by Kohavi in the same note. So the fact that they shipped the new feature does not mean it positively improved performance.
- Test results are usually highly specific to the context. Even if the new feature wins at the competitor's site, it is a big stretch to assume it would win at our own site, given our own other features, technologies, set of users, etc. Even within one's own site, a feature that worked on one page does not necessarily work on a different page.
- The performance metric used by the competitor to evaluate the test may not be the same as ours.
In a case like this, I might accept that we need that feature for strategic reasons - and the purpose of the A/B test is to measure its impact on our site. The test should be designed as a "two-tailed" test, meaning that we don't have an expectation of whether the new feature would be beneficial or harmful.
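For contrast with the non-inferiority setup, a two-tailed test of the same kind of metric looks like this - a sketch with illustrative numbers, using the standard pooled two-proportion z-test. There is no margin and no assumed direction; we simply ask whether Test and Control differ at all.

```python
from statistics import NormalDist

def two_tailed_z_test(x_test, n_test, x_ctrl, n_ctrl):
    """Two-sided two-proportion z-test of H0: p_test == p_ctrl.

    No directional expectation: a small p-value indicates a
    difference in either direction.
    """
    p_t = x_test / n_test
    p_c = x_ctrl / n_ctrl
    # Pooled proportion under the null of no difference
    p_pool = (x_test + x_ctrl) / (n_test + n_ctrl)
    se = (p_pool * (1 - p_pool) * (1 / n_test + 1 / n_ctrl)) ** 0.5
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: Test converts 9.8%, Control 10.0%.
z, p = two_tailed_z_test(980, 10000, 1000, 10000)
```

With these made-up counts the test does not reach significance, which is exactly the situation the post is about: the decision to ship then rests on strategy and on the estimated size of the effect, not on a significance stamp.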
P.S. [1/28/20] The second post is now up. This new post addresses the scenarios under which Kohavi advises against "shipping flat".
Perhaps it would be best to link to the post by Kohavi that you're referencing.
Posted by: Charles | 01/24/2020 at 10:09 AM
Charles: Since he posted on Linkedin, I am not sure he wants it in public yet. I will find out. If you're on Linkedin, you can find his post there.
Posted by: Kaiser | 01/24/2020 at 11:06 AM