Reader AR pointed me to this Fast Company article that examines the ethics of A/B testing.
The only way to make sense of the article's point of view is to think of A/B testing not as a scientific experiment but as a decision-making process that involves running an experiment. The researchers are unhappy that A/B tests can lend support to decisions that have an undesirable impact on society.
Two such examples are described:
- Two images are tested for a job ad. During the test, site visitors are shown one of the two images, selected at random (a minimal sketch of this random assignment follows the list). The winner of the test is the image that disproportionately drives male applicants.
- Separate pricing tests are run in different zip codes. The "winning" prices at the conclusion of these tests are different for different zip codes. Because racial profiles differ by zip code, prices are in effect different for different races. Therefore, the test result leads to race-based discrimination.
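For concreteness, here is a minimal sketch of the random-assignment step in the first example. The variant names, the hashing scheme, and the Python implementation are my own assumptions; the article says nothing about how the test was actually built.

```python
import hashlib

# Hypothetical setup: two job-ad images under test.
IMAGES = ["image_A", "image_B"]

def assign_variant(visitor_id: str) -> str:
    """Assign a visitor to one of the two images.

    Uses a stable hash of the visitor id so that a returning visitor
    always sees the same image for the duration of the test.
    """
    digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
    return IMAGES[int(digest, 16) % len(IMAGES)]

# Example: which image would a given visitor see?
print(assign_variant("visitor-12345"))
```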
There are two important questions to discuss here. First, what is the alternative to A/B testing? Is that method of decision-making better? Second, is the harm produced by the experiment itself, or by the decision made as a result?
Alternatives to A/B Testing
Consider the image test described above. Presumably, the test is run because someone believes that one of those two images might perform better at driving applicants. At most companies, a test sees the light of day only after teams of people debate and prioritize testing ideas. If a test that includes a sexist image is run, then the team in charge of testing has approved it for some (possibly bad) reason.
If they didn't have A/B testing, how would they have decided which image to run? And if the image is not explicitly sexist - in other words, if an analyst had to dig into the data to learn that one image drove more male applicants - how would that insight be surfaced without running both images in an experiment? The alternative decision process may be even worse.
It is certainly true that automated A/B testing is risky - because no human beings are involved in turning test results into actions. The absence of humans is usually touted as a benefit by vendors of such testing tools. In this example, a human analyst reports on the test result, and includes the analysis by gender showing that while total applications increased, the winning image disproportionately attracted male applicants. The decision-makers can and should decide not to adopt the winning image based on that analysis. The A/B test revealed the bias but did not cause it.
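To make that analyst step concrete, here is a minimal sketch of a segment-level readout, using made-up application counts and a standard chi-square test from SciPy; the variant names and numbers are purely illustrative, not figures from the article.

```python
from scipy.stats import chi2_contingency

# Hypothetical application counts per image variant, split by gender.
results = {
    "image_A": {"male": 310, "female": 290},   # incumbent image
    "image_B": {"male": 520, "female": 230},   # "winner" on total applications
}

for name, counts in results.items():
    total = counts["male"] + counts["female"]
    print(f"{name}: {total} applications, {counts['male'] / total:.0%} male")

# Does the gender mix differ between the two variants?
table = [[results["image_A"]["male"], results["image_A"]["female"]],
         [results["image_B"]["male"], results["image_B"]["female"]]]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"p-value for a shift in gender mix: {p_value:.4f}")

# Image B wins on volume but skews heavily male - exactly the kind of
# finding a human reviewer should weigh before adopting the "winner".
```

The point is that the same data that declares a winner also contains the segment breakdown; surfacing it is an analysis choice, not an extra experiment.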
Even without the gender issue, such analysis and discussion of results is necessary. For example, ad clicks can be generated by placing ads near scroll bars to stimulate accidental clicking. Human analysts can report that clicks increased but only through accidental clicking. The decision-makers can and should decide not to implement the winning design.
Where Does the Harm Come From?
The other example is more far-fetched. Reverse-engineering the pricing test as described: given that the test led to different prices in different zip codes, the company must have been running separate A/B tests stratified by zip code. Given the law of supply and demand, the winning price might well be lower in poorer zip codes and higher in richer ones. This definitely results in price discrimination by zip code. But if the design team did not want price discrimination by zip code, such a test design would not have been approved in the first place - so the test itself isn't creating the harm.
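As a rough sketch of what a test stratified by zip code would produce, here is an illustration with invented visitor and purchase counts; the zip code labels, prices, and the revenue-per-visitor decision rule are all assumptions, not details from the article.

```python
# Hypothetical stratified pricing test: each zip code runs its own A/B test,
# and the "winner" is the price with the higher revenue per visitor.
tests = {
    "zip_A": {"9.99": (1000, 180), "12.99": (1000, 150)},  # (visitors, purchases)
    "zip_B": {"9.99": (1000, 170), "12.99": (1000, 90)},
}

for zip_code, arms in tests.items():
    revenue_per_visitor = {
        price: purchases * float(price) / visitors
        for price, (visitors, purchases) in arms.items()
    }
    winner = max(revenue_per_visitor, key=revenue_per_visitor.get)
    print(f"{zip_code}: winning price ${winner} "
          f"(revenue per visitor ${revenue_per_visitor[winner]:.2f})")

# Because each stratum picks its own winner, the final prices can differ by
# zip code - and, since zip codes correlate with race, effectively by race.
```

Whether those different winning prices ever become the live prices is, again, a decision made after the test, not by the test.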
Further, the accusation of race-based price discrimination arises because zip codes are correlated with race. But almost all variables are correlated with race: age is correlated with race, and so are income, education, which websites one visits, and so on. This standard would therefore ban all segmentation and targeting policies. The only permissible pricing policy would be one price for all.
***
In short, human supervision of A/B testing from design to interpretation is definitely needed. A/B tests provide a wealth of data to support decision-making. The biases highlighted by the Fast Company article are merely revealed by the testing - they are not caused by it.
Should A/B testing have to be effective? Should A/B testing, or the people behind it, have to be ethical? What should "ethical" mean?
I answer yes to the first question. I confess I am not yet able to answer the last question.
If a company is interested in the number of applications, and if an A/B test shows that one choice yields the largest number of applications, then that choice is the most effective.
What if that choice yields a large number of male applications? It depends. Maybe the choice reflects a bias in the market. But maybe men are simply more interested in the jobs offered than women are. Nobody can force men and women to choose all jobs equally. A 50%-50% allocation is pure utopia. Nor would it be right. Men and women have to be on a par, not equal.
So, as long as A/B testing doesn't yield a very disproportionate male application rate, I'm not ready to condemn it.
But.
I strongly believe that choosing between two alternatives by the number of applications they yield is foolish. Ten motivated and competent applicants are better than a thousand unenthusiastic ones.
In other words, I question the premise (the first "if" I wrote above) that the endpoint is appropriate. Maybe its effects are unethical, but if so, it's because it's stupid.
These are my thoughts. Am I omitting something important?
Posted by: Antonio Rinaldi | 03/05/2019 at 04:32 AM
The postcode problem has been around for a lot longer than the internet. For example, when I moved to another area, my car insurance increased dramatically. The increase is due to the driving habits of the local population, but those habits are correlated with ethnicity. Since the decision is based on location, not ethnicity, it is assumed to be OK.
Posted by: Ken | 03/09/2019 at 10:47 PM