I don't speak often at marketing conferences, and that's because my message is not easy to take. For example, one of my talks is titled "The Accountability Paradox in Big Data Marketing." Google and other digital marketers claim that the ad-tech world is more measurable, and thus more accountable, than the old world of TV advertising - they claim that advertisers save money by going digital. The reality is not so. There has been some attention to this problem recently (for example, here) - but far from enough.
Let me illustrate the problems by describing my recent experience running ads on Facebook for Principal Analytics Prep, the analytics bootcamp I recently launched. For a small-time advertiser like us, Facebook offers a channel for reaching large numbers of people and building awareness of our new brand.
So far, the results from the ads have been satisfactory but not great. We are quite content with the effectiveness but wanted to run experiments to get a higher volume of "conversions." This past week, we ran an A/B test to see if different images result in more conversions. We designed a four-way split, so in reality an A/B/C/D test. One of the test cells (call it D) is the "champion," i.e. the image that had performed well prior to the test; the other images are new. We launched the test on a Friday.
Two days later, I checked the interim results. Only one of the test cells (A) had any responses. Surprisingly, test cell A had received about 90% of all "impressions." Said differently, test cell A received 10 times as many impressions as each of the other three cells. The other test cells were getting such a measly allocation that I lost all confidence in this test.
It turns out that an automated algorithm (what is now labeled A.I.) was behind this craziness. Apparently, this is a well-known problem among people who have tried to do so-called split testing on the Facebook Ads platform. See this paragraph from the AdEspresso blog:
This often results in an uneven distribution of the budget where some experiments will receive a lot of impressions and consume most of the budget leaving others under-tested. This is due to Facebook being over aggressive determining which ad is better and driving to it most of the Adset’s budget.
Then, one day later, I was shocked again when checking the interim report. Suddenly, test cell C got almost all the impressions - due to one conversion that had shown up overnight for the C image. Clearly, anyone using this split-testing feature is just fooling themselves.
***
This is a great example of interesting math that looks good on paper but fails spectacularly in practice. The algorithm driving this crazy behavior is most likely a multi-armed bandit. This method has traditionally been used to study casino (slot-machine) behavior, but some academics have recently written many papers arguing that it is suitable for A/B testing. The testing platform in Google Analytics used to do a similar thing - it might still do so, but I wouldn't know, because I avoid that one like the plague as well.
The problem setup is not difficult to understand: in traditional testing as developed by statisticians, you need a certain sample size to be confident that any difference observed between the A and B cells is "statistically significant." The analyst waits for the entire sample to be collected before making a judgment on the results. No one wants to wait, especially when the interim results are showing a direction in one's favor. This is true in business as in medicine. The pharmaceutical company running a clinical trial on a new drug it has spent gazillions to develop would love to declare the drug successful based on positive interim results. Why wait for the entire sample when the first part of the sample gives you the answer you want?
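To make the sample-size requirement concrete, here is a minimal sketch of the standard two-proportion calculation. The 1.0% baseline and 1.5% target conversion rates are made-up placeholders, not figures from my own campaign.

```python
# A sketch of the fixed-horizon sample-size calculation for a two-proportion
# z-test. The 1.0% baseline and 1.5% target rates are made-up placeholders.
from scipy.stats import norm

def n_per_cell(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate sample size per cell to detect p_a vs p_b."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # quantile for the desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return (z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2

print(round(n_per_cell(0.010, 0.015)))  # roughly 7,700 impressions per cell
```

At conversion rates this low, each cell needs thousands of impressions before the difference can be trusted - which is exactly why waiting feels so painful.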
So people come up with justifications for why one should stop a test early. They like to call this a game of "exploration versus exploitation." They claim that the statistical way of running testing is too focused on exploration; they claim that there is "lost opportunity" because statistical testing does not "exploit" interim results.
They further claim that multi-armed bandit algorithms solve this problem by optimally balancing exploration and exploitation (don't shoot me, I am only the messenger). In this setting, the allocation of units to treatments is allowed to change continuously in response to interim results: cells with higher interim response rates are allocated more of the future testing units, while cells with lower interim response rates are allocated fewer. The allocation keeps shifting throughout the test.
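To see how such an algorithm reallocates traffic, here is a sketch of Thompson sampling, one common bandit method. Facebook has not disclosed which algorithm its platform uses, and the impression and conversion counts below are made-up interim numbers, so treat this only as an illustration of the general idea.

```python
# A sketch of Thompson sampling, one common multi-armed bandit method.
# The impression and conversion counts are made-up interim results, not
# figures from the Facebook test described above.
import numpy as np

rng = np.random.default_rng(0)
impressions = np.array([400, 380, 390, 410])   # cells A, B, C, D
conversions = np.array([3, 0, 0, 1])

# Beta posterior for each cell's conversion rate; simulate which cell "wins"
# each posterior draw. The win frequency is the allocation share the bandit
# would give that cell for the next batch of impressions.
draws = rng.beta(1 + conversions, 1 + impressions - conversions,
                 size=(100_000, 4))
shares = np.bincount(draws.argmax(axis=1), minlength=4) / len(draws)
print(shares.round(2))   # the cell with three early conversions dominates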
***
When this paradigm is put into practice, it keeps running into all sorts of problems. One reality is that 80 to 90 percent of all test ideas make no difference, meaning that test version B on average performs just as well as test version A. There is nothing to "exploit." Any attempted exploitation is just swimming in the noise.
In practice, many tests using this automated algorithm produce absurd results. As AdEspresso pointed out, the algorithm is overly aggressive in shifting impressions to the current "winner." For my own test, which had very low impression counts, it made no sense to start changing allocation proportions after one or two days. Those shifts were driven by single-digit conversions off a small base of impressions - the algorithm was swimming around in the noise. Because of such aimless and wasteful "exploitation," it would have taken me much, much longer to collect enough samples on the other images to make a definitive call!
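A toy simulation makes the point. In the sketch below, all four images share the same assumed true conversion rate, so there is genuinely nothing to exploit, yet a Thompson-style allocation rule chases whichever cell happened to get a lucky early conversion. The daily budget and conversion rate are invented for illustration.

```python
# A toy simulation: four images with an identical (assumed) true conversion
# rate, and a Thompson-style rule reallocating a fixed daily budget.
import numpy as np

rng = np.random.default_rng(7)
true_rate = 0.005                  # the same for every image: nothing to exploit
imps = np.zeros(4)
convs = np.zeros(4)

for day in range(1, 8):
    # Sample each cell's posterior for every impression; serve to the maximum.
    draws = rng.beta(1 + convs, 1 + imps - convs, size=(500, 4))
    served = np.bincount(draws.argmax(axis=1), minlength=4)
    convs += rng.binomial(served, true_rate)   # conversions from the SAME rate
    imps += served
    print(f"day {day}: impressions {served}, total conversions {convs.astype(int)}")
```

Run it a few times with different seeds: the daily allocation lurches from cell to cell on the strength of one or two chance conversions, which is essentially what I watched happen in my own campaign.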
***
AdEspresso and others recommend a workaround. Instead of putting the four test images into one campaign, they recommend setting up four campaigns, each with one image, and splitting the advertising budget equally between these campaigns.
Since there is only one image in each campaign, you have effectively turned off the algorithm. When you split the budget equally, each campaign will get similar numbers of impressions.
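Note that with separate campaigns, the platform no longer declares a winner for you; you have to compare the cells yourself once the planned sample is in. Here is a minimal sketch of that comparison, with made-up counts, using a chi-square test of homogeneity.

```python
# With separate campaigns you do the comparison yourself at the end.
# The impression and conversion counts below are made-up placeholders.
import numpy as np
from scipy.stats import chi2_contingency

impressions = np.array([5000, 5100, 4950, 5050])
conversions = np.array([28, 35, 22, 30])

# 4x2 table: converted vs. not converted, one row per image.
table = np.column_stack([conversions, impressions - conversions])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")  # not significant here
```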
However, this workaround is also flawed. If you can spot what the issue is, say so in the comments!
Well said. Thank you for this insight. It sounds like these tools do not ensure an adequate number of samples prior to modifying the assignment probabilities in these sequential trials. To quote from the SAS documentation for the SEQDESIGN procedure, a well-designed experiment should acknowledge that
"the null hypothesis is more difficult to reject in the early stages than in the later stages. That is, the trial is rejected in the early stages only with overwhelming evidence because in these stages there might not be a sufficient number of responses for a reliable estimate of the treatment effect."
A second comment: not only do pharmaceutical companies want to stop trials early when a drug shows exceptional promise, but they also want to stop trials for which there is early evidence of possible harm to patients. Thus sequential trials serve an ethical role as well as an economic role.
Posted by: Rick Wicklin | 07/26/2017 at 09:09 AM
RW: On your second comment, while that is true, it is also true that clinical trials are designed primarily to measure benefits, and the ability to measure harm is weak. In the business world, I like to remind managers that if they want early stopping for benefit, there should be early stopping for harm too. That sometimes stops the request :)
Posted by: Kaiser | 07/26/2017 at 11:25 AM
Might this desire to make a decision early be an example of the heuristic fallacies explored by Kahneman and Tversky?
Posted by: Richard Ward | 07/26/2017 at 01:53 PM
In an experimental design a subject should see only one of the possible choices. The flaw with the multiple-campaign design is that the users may see more than one of the ads. Perhaps they see ad A, think about it, then see ad B and take the plunge. The A impression will be recorded as a miss and the B as a hit, when in fact they were not independent events.
Posted by: Kevin Christopher Henry | 07/26/2017 at 02:21 PM
"Unfortunately significance testing and hypothesis testing are of limited scientific value – they often ask the wrong question and almost always give the wrong answer. And they are widely misinterpreted [...] p tells us the probability of observing the data given that the null hypothesis is true. But most scientists think p tells them the probability the null hypothesis is true given their data." https://theconversation.com/why-hypothesis-and-significance-tests-ask-the-wrong-questions-11583
Surely this is a problem with the whole A/B testing paradigm!
Posted by: tc | 07/27/2017 at 03:46 AM
KCH: Yes that is one of the problems with the alternative design. In the multiple-campaign design, FB sees each campaign as separate and thus there is no tagging of users to keep them in the same test cell. Even after this is solved, there is still one other big problem.
tc: I may write about this more in the future. There are definitely issues with the significance testing framework but most of the critique of it is sloppy and poorly reasoned.
Posted by: Kaiser | 07/27/2017 at 10:13 AM
Kaiser: I would love to see a post or two around this subject.
The problems of significance testing will most likely be poorly understood by those using these tools, who are often trained as developers and UX/UI designers, not analysts or statisticians.
In my experience A/B testing is often used as a substitute for strategic decision making in start-ups.
Also, as an aside, all things being equal, A/B testing of images, button colours, messages, etc. should coalesce around the mean over time... in other words, all websites should start to look the same, not different. Therefore I feel there is little value in small digital startups doing A/B testing - just make your website look like Amazon, or Facebook, or Wikipedia, or HuffPost, or whatever other company runs millions of A/B tests all the time.
Posted by: tc | 07/27/2017 at 07:52 PM
I would argue that merely comparing multi-armed bandits and A/B testing is wrong. These two are different creatures, designed to be used in different scenarios.
Posted by: Dror Atariah | 07/28/2017 at 01:31 AM
Using the AdEspresso workaround, how does FB know that individuals are to be randomly assigned to one of those four conditions? That is, what does the assignment mechanism default to? Does it just fill each condition up sequentially? Without knowing more about the assignment mechanism in this new context, it seems likely that you'd violate the requirement that participants in the experiment have some probability of being assigned to any one of the conditions. (I'm thinking in terms of Holland's discussion of Rubin's causal model.)
Posted by: AnonAnon | 07/29/2017 at 04:29 PM
AA: That is the other flaw of the setup with separate campaigns. One of the most important prerequisites of A/B testing (or any statistical testing) is random assignment of treatment. FB does not promise that simultaneous campaigns to the same target population get randomized. (Tom Diettrich also pointed this out on Twitter: we have no way of telling FB that those campaigns are part of a test.)
Posted by: Kaiser | 07/31/2017 at 11:53 PM