What a lucky day: I found time to catch up on some Gelman. He posted about the Facebook research ethics controversy, and I'm glad to see that he and I have pretty much the same attitude (my earlier post is here). It's a storm in a teacup.
Gelman makes two other points about the Facebook study--unrelated to the ethics--which are very important.
First, he said:
if we happen to see an effect of +0.02 in one particular place at one particular time, it could well be -0.02 somewhere else. Don’t get me wrong—I’m not saying that this finding is empty, just that we have to be careful about out-of-sample generalization.
This statement is a reaction to learning that the measured response in the Facebook study, that is, the change in "emotions" of users due to the manipulation of positive/negative newsfeed items, is tiny: on the order of a hundredth of a standard deviation. Put differently, if the sample size of the study (~700K) were smaller, the effect would have been indistinguishable from background noise.
Sadly, this type of thing happens in A/B testing a lot. On a website, it seems as if there is an inexhaustible supply of experimental units. If the test has not "reached" significance, most analysts just keep it running. This is silly in many ways, but the key issue is that if you need that many samples to reach significance, the measured effect size is guaranteed to be tiny, which also means that the business impact is tiny.
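To see why a hundredth-of-a-standard-deviation effect demands a sample in the hundreds of thousands, here is a back-of-the-envelope power calculation using the standard two-sample formula. This is a sketch, not anything from the study itself; the function name and the conventional defaults (5% significance, 80% power) are mine.

```python
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-sample z-test
    to detect a standardized effect size d (Cohen's d)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_power = NormalDist().inv_cdf(power)          # about 0.84
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

# a "respectable" effect vs. Facebook-sized effects
for d in (0.5, 0.1, 0.02, 0.01):
    print(f"d = {d}: about {n_per_group(d):,.0f} users per arm")
```

At d = 0.01 the formula asks for well over a hundred thousand users per arm, which is why only a study of this scale could detect such an effect at all.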
For websites where the typical user does not sign on, tests rely on cookies to track unique users. It is often ignored how poorly cookies map to unique users (that deserves a separate post). Suppose the same user shows up at different times as different cookies. Then the iid assumption is violated; the correlation between units causes effect sizes to be over-estimated. The longer one runs a test, the more likely the same user shows up as multiple cookies.
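To see how this bites, here is a toy simulation (all numbers are made up): each user appears as several near-duplicate cookies, there is no true treatment effect, and yet a naive test that treats every cookie as independent declares "significance" far more often than the nominal 5%. Chance differences that clear that bar are then reported as inflated effect sizes.

```python
import random
import statistics

random.seed(1)

def one_experiment(n_users=2000, cookies_per_user=3):
    """Simulate an A/B test with NO true effect, where each user
    shows up as several near-duplicate cookies."""
    obs_a, obs_b = [], []
    for _ in range(n_users):
        latent = random.gauss(0, 1)  # the user's underlying response
        arm = obs_a if random.random() < 0.5 else obs_b
        # each cookie is the same user plus a little measurement noise
        arm.extend(latent + random.gauss(0, 0.1)
                   for _ in range(cookies_per_user))
    # naive two-sample z-test treating every cookie as independent
    diff = statistics.mean(obs_a) - statistics.mean(obs_b)
    se = (statistics.variance(obs_a) / len(obs_a)
          + statistics.variance(obs_b) / len(obs_b)) ** 0.5
    return abs(diff / se) > 1.96  # "significant" at the 5% level

false_positives = sum(one_experiment() for _ in range(200)) / 200
print(f"false-positive rate: {false_positives:.2f} (nominal: 0.05)")
```

The naive standard error is too small because the cookies within a user are nearly identical; correcting for the clustering (or deduplicating users) removes the inflation.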
Gelman's other observation is that studies with insignificant effects, bolstered by massive samples, can clear p<0.05 and find their way into top journals. Add that to the pile of reasons why being published in a top journal is not necessarily a reason to trust a study.
On a separate note, I want to respond to a reader who asked me a question a while ago that I haven't answered. In one of my talks about the Netflix Prize, I remarked that the 10% targeted improvement was roughly equivalent to improving the accuracy of predicting the average rating by 1/10 of a star on the 5-star scale. Then I forgot how I derived that number; it turns out it was in an earlier version of the talk. The accuracy metric used was RMSE, which is on the same scale as the ratings data, i.e., the 5-star scale. The 10% improvement was roughly 0.1 on the RMSE scale, which is 1/10th of a star on the 5-star scale.
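The arithmetic can be spelled out in a few lines. The ratings and predictions below are hypothetical, just to show that RMSE carries the same units as the ratings; the baseline figure of roughly 0.95 for Netflix's own Cinematch system is from memory, so treat it as approximate.

```python
# Hypothetical star ratings and predictions on the 5-star scale
actual = [4, 3, 5, 2, 4]
pred = [3.8, 3.4, 4.5, 2.6, 3.9]

# RMSE has the same units as the ratings themselves (stars)
rmse = (sum((a - p) ** 2 for a, p in zip(actual, pred))
        / len(actual)) ** 0.5
print(f"RMSE of toy predictions: {rmse:.2f} stars")

baseline = 0.95          # Cinematch's RMSE, roughly (from memory)
target = 0.9 * baseline  # the Prize's 10% improvement target
print(f"10% improvement = {baseline - target:.3f} stars")
```

A 10% cut from an RMSE near 0.95 is about 0.095, hence the "1/10 of a star" shorthand.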
Pertinent to the tiny-effect issue discussed above, note that in the final phase of the Netflix Prize, several teams were just shy of the 10-percent threshold. The eventual winners improved their RMSE by 0.005 to get over it. That is one half of 1/100th of a star on the 5-star scale. And it took 10 months.