Reader AR pointed me to this Fast Company article that examines the ethics of A/B testing.
The only way to comprehend this point of view is to think of A/B testing not as a scientific experiment but as a decision-making process that involves running an experiment. The researchers are unhappy that A/B tests could lend support to decisions that have an undesirable impact on society.
Two such examples are described:
- Two images are tested for a job ad. During the test, site visitors are shown one of the two images, selected at random. The winner of the test is an image that disproportionately attracts male applicants.
- Separate pricing tests are run in different zip codes. The "winning" prices at the conclusion of these tests are different for different zip codes. Because racial profiles differ by zip code, prices are in effect different for different races. Therefore, the test result leads to race-based discrimination.
There are two important questions to discuss here. First, what is the alternative to A/B testing? Is that method of decision-making better? Second, is the harm produced by the experiment itself, or by the decision made as a result?
Alternatives to A/B Testing
Consider the image test described above. Presumably, the test is run because someone believes that one of those two images might perform better at driving applicants. At most companies, a test sees the light of day after teams of people debate and prioritize testing ideas. If the test including a sexist image is run, then the team in charge of testing has approved it for some (possibly bad) reason.
If they didn't have A/B testing, how would they have decided which image to run? And if the image is not explicitly sexist - in other words, if the analyst had to analyze the data to learn that one image drove more male applicants - how would that insight be surfaced without running the other image? The alternative decision process may be even worse.
It is certainly true that automated A/B testing is risky - because no human beings are involved in turning test results into actions. The absence of humans is usually touted as a benefit by vendors of such testing tools. In this example, a human analyst reports on the test result, and includes the analysis by gender showing that while total applications increased, the winning image disproportionately attracted male applicants. The decision-makers can and should decide not to adopt the winning image based on that analysis. The A/B test revealed the bias but did not cause it.
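The kind of breakdown such an analyst might produce can be sketched in a few lines of Python. This is only an illustration: the counts are invented, the names `VariantResult`, `summarize`, and `max_share_gap` are hypothetical, the 10-point threshold is arbitrary, and recording gender as two categories is a simplification.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    """Outcome of one arm of the image test (all figures are made up)."""
    name: str
    visitors: int
    male_applicants: int
    female_applicants: int

    @property
    def applicants(self) -> int:
        return self.male_applicants + self.female_applicants

    @property
    def conversion_rate(self) -> float:
        return self.applicants / self.visitors

    @property
    def male_share(self) -> float:
        return self.male_applicants / self.applicants


def summarize(a: VariantResult, b: VariantResult, max_share_gap: float = 0.10) -> dict:
    """Pick the arm with the higher conversion rate, then check its gender mix.

    max_share_gap is an arbitrary, hypothetical threshold: if the winner's
    male share differs from the loser's by more than this, flag the result
    for human review instead of shipping it automatically.
    """
    winner, loser = (a, b) if a.conversion_rate >= b.conversion_rate else (b, a)
    gap = winner.male_share - loser.male_share
    return {
        "winner": winner.name,
        "male_share_gap": gap,
        "needs_human_review": abs(gap) > max_share_gap,
    }


image_a = VariantResult("image_a", visitors=10_000, male_applicants=260, female_applicants=240)
image_b = VariantResult("image_b", visitors=10_000, male_applicants=450, female_applicants=150)

report = summarize(image_a, image_b)
print(report)
```

With these invented numbers, `image_b` wins on total applications but is flagged for review because its applicants skew much more heavily male. The test surfaces the skew; whether to adopt the image remains a human decision.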
Even without the gender issue, such analysis and discussion of results is necessary. For example, ad clicks can be generated by placing ads near scroll bars to stimulate accidental clicking. Human analysts can report that clicks increased but only through accidental clicking. The decision-makers can and should decide not to implement the winning design.
Where Does the Harm Come From?
The other example is more far-fetched. I am reverse-engineering the pricing test as described. Given that the test led to different prices for different zip codes, the team must have run separate A/B tests stratified by zip code. Given the law of supply and demand, it might be the case that the winning price would be lower in poorer zip codes and higher in richer zip codes. This definitely results in price discrimination by zip code. But if the design team did not want price discrimination by zip code, such a test design would not have been approved, so the test itself isn't creating harm.
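To make the design point concrete, here is a minimal sketch comparing a stratified analysis with a pooled one. Everything in it is an assumption for illustration: the zip code labels, prices, and counts are invented, the helper names are hypothetical, and "revenue per visitor" is just one plausible way to pick a winning price.

```python
# results[zip_code][price] = (visitors, purchases); all numbers are invented.
results = {
    "zip_A": {9.99: (1000, 120), 14.99: (1000, 60)},
    "zip_B": {9.99: (1000, 130), 14.99: (1000, 110)},
}


def revenue_per_visitor(price: float, visitors: int, purchases: int) -> float:
    return price * purchases / visitors


def winning_prices_by_zip(results: dict) -> dict:
    """Stratified design: one separate winner per zip code."""
    return {
        zip_code: max(arms, key=lambda p: revenue_per_visitor(p, *arms[p]))
        for zip_code, arms in results.items()
    }


def winning_price_pooled(results: dict) -> float:
    """Pooled design: combine all zip codes and pick a single price."""
    totals: dict = {}
    for arms in results.values():
        for price, (visitors, purchases) in arms.items():
            v, p = totals.get(price, (0, 0))
            totals[price] = (v + visitors, p + purchases)
    return max(totals, key=lambda p: revenue_per_visitor(p, *totals[p]))


by_zip = winning_prices_by_zip(results)
pooled = winning_price_pooled(results)
print(by_zip)
print(pooled)
```

The same data yields different prices by zip code only when the analysis is stratified by zip code; the pooled analysis can only ever return one price. Which of the two gets run is decided when the test is designed, before any data arrive.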
Further, the charge of race-based price discrimination rests on the fact that zip codes are correlated with race. But almost all variables are correlated with race: age, income, education, what websites one visits, etc. So this standard leads to a ban on all segmentation and targeting policies. The only possible pricing policy would be one price for all.
***
In short, human supervision of A/B testing from design to interpretation is definitely needed. A/B tests provide a wealth of data to support decision-making. The biases highlighted by the Fast Company article are merely revealed by the testing - they are not caused by it.