Last week, I posted some comments on Ron Kohavi's note on A/B testing, which appeared on his Linkedin feed. Kohavi's chief concern is what to do when the A/B test comes back inconclusive, i.e. no significant difference between the Test and Control treatments. As any practitioner knows, most tests come back inconclusive (!)
Kohavi correctly points out that for a variety of reasons, people push for "shipping flat", i.e. adopting the Test treatment even though it did not outperform Control in the A/B test. His note carefully lays out these reasons and debunks most of them. When he publishes the final version, I'll post it here.
The first section deals with situations in which Kohavi would accept "shipping flat". He calls these "non-inferiority" scenarios. My response to those scenarios was posted last week. I'd prefer to call several of these "quantification" scenarios, in which the expected effect is negative, and the purpose of the A/B test is to estimate its magnitude.
This post delves into the second section, in which Kohavi explains why many of the popular reasons for "shipping flat" should be rejected. I am in general agreement with most of these points.
His very first point is the most important. If you learn nothing else from the note, here is the part that you should pay attention to:
Null hypothesis statistical significance testing only allows us to reject the null hypothesis (in this case commonly of no treatment effect). We cannot accept that a p-value above the threshold (commonly 0.05) implies that there is no treatment effect. It could very well be that the test is underpowered. The problem is exacerbated when the first iteration shows stat-sig negative, tweaks are made, and the treatment is iterated a few times until we have a non-stat sig iteration. In such cases, a replication run should be made with high power to at least confirm that the org is not p-hacking.
What he's saying is that A/B testing is about the signal-to-noise ratio. When this ratio is too small, our test is inconclusive. This could be due to a weak signal, too much noise, or both. If "too much noise" is suspected, the test should be re-run with a larger sample size ("higher power"). If it then works out as expected, i.e. the signal comes through once the noise is dampened, there will be no need to "ship flat".
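To make "higher power" concrete, here is a minimal sketch (not from Kohavi's note) of how one might size such a replication run. It assumes a two-proportion test; the baseline conversion rate and the minimum detectable lift are hypothetical numbers chosen for illustration.

# Minimal power-calculation sketch; baseline and lift are hypothetical.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # assumed Control conversion rate
lift = 0.005      # smallest absolute lift worth detecting
effect = proportion_effectsize(baseline + lift, baseline)  # Cohen's h

# Sample size per arm needed for 80% power at the usual 5% significance level
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.8, alternative='two-sided')
print(f"roughly {n_per_arm:,.0f} users per arm")

The point of the exercise is the direction of the relationship: the smaller the lift you need to detect, the larger the sample required, which is why an underpowered test so often comes back "flat".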
A couple of other points Kohavi made are worth emphasizing here.
The "ship flat or not" decision should be based on a cost-benefit analysis. It may be true that both Test and Control lead to indistinguishable results (say, similar advertising revenues) but switching to the Test treatment incurs additional costs (e.g. the need to apply new tags across the website). In this case, the clear answer is to reject the Test treatment.
The "sunk cost" fallacy may be in play. Sometimes, labor or other investments may have been expended in order to develop the Test treatment. In the example I gave in the last post, if the Test involves a new file uploader module on a website, that new uploader application would have to be developed in order to be testable. Economists would argue that these costs should be treated as "sunk" i.e. irrelevant by the start of the A/B test. In practice, though, such costs are used to justify shipping flat.
Next is the strategic argument: the CEO is committed to the strategy, and no test result would overturn it. In this case, the purpose of the A/B test has been wrongly specified. The true purpose is to quantify the impact of the switch in strategy, as opposed to determining which strategy is better.
The appropriate follow-on question is whether the CEO would change his/her mind upon learning that the strategic switch would incur implementation costs while generating no benefits, thus hurting profits. Few executives will knowingly make a decision that hurts profits while adding no revenues.
No amount of scientifically sound data can topple unsubstantiated speculation that some other, unmeasurable effect would overwhelm the observable outcomes. Examples given by Kohavi include (a) arguing that the Test treatment enables other features or strategies that could eventually generate positive results; (b) arguing that some other system, such as a machine learning model, would adapt to the Test treatment so that an unobserved interaction effect would materialize to justify it; (c) arguing that, while the Test treatment did no better on the key performance indicator, it should do better on some other, unmeasured indicator (say, long-term customer satisfaction).
In these arguments, the proponent attempts to use unobserved assumptions to override carefully measured data, and any data-driven decision-maker should reject them.
One problem with any type of experiment is that there has usually been considerable investment in the new treatment. It may be money, time, or prestige. Someone told me about going to talk to a company about the results of a medical trial which showed no effect. The end result was going to be that the R&D division would be shut down, as they only had one project. It was fairly hostile, as they had gone through the report trying to find anything that might be an error.
Posted by: Ken | 02/16/2020 at 03:25 AM
Ken: Yes, that's exactly why Kohavi wrote this note. This kind of scenario is very common. Typically, it's not as extreme as lots of people being fired because of the "flat" finding. I have had my fair share of such meetings. The fault-finding mission is pointless because (a) the nitpicking doesn't change the high-level result enough to matter, and (b) proclaiming the test botched does not cause a positive finding to materialize. It seems to me that the way the pharma industry is set up, with super-high stakes and pharmas allowed to run and analyze their own trials, provides a perverse incentive to cheat. I wonder why we allow that.
Posted by: Kaiser | 02/16/2020 at 12:03 PM