Quick quiz. Who wrote this?
I did not test the data for irregularities, which after this painful lesson, I will start doing regularly.
Of course, you won't know. But which of the following people is the most likely to have said such a thing?
A) A Stat 101 student
B) A data scientist working in industry during the first year of employment
C) A graduate student research assistant
D) An assistant professor in the publish-or-perish game
E) A senior tenured professor with multiple best-selling books and tens of thousands of citations
***
If you think the least likely is the most likely, then you'd be right. The answer is (E).
The person who said the sentence (in 2021) is Dan Ariely, currently a professor of psychology and behavioral economics at Duke, but probably known to my readers as the author of a series of behavioral-economics best-sellers, starting with Predictably Irrational. (You can find the quotation in this PDF.)
Andrew Gelman has documented a series of scandals in which several of Ariely's seminal studies have been called into question. The latest one involves data that Ariely obtained from an insurance company. He now claims to have had no role in collecting or processing the data.
Neither did the people who uncovered this potential research fraud. Hence, the ability to suss out data problems does not require first-hand involvement in collecting or processing data.
To read about how the fraud was detected, see here.
***
Next time you hear about a publication in a "peer-reviewed" scholarly journal, think about this case. It is entirely possible to publish data analysis in a peer-reviewed journal without "testing the data for irregularities". In fact, researchers with hundreds of peer-reviewed publications may have never once tested data for irregularities! Better late than never, right?
P.S. (1) If you need something light for the holidays, here is an advice column Andrew, hmm, wrote for the WSJ.
(2) One of Ariely's best-sellers is "The Honest Truth about Dishonesty". He is an expert on honesty.
Stuff like this makes me kind of miserable. I'm a PhD student and, unfortunately, being *inside* of academia has made me far more skeptical of scientific reporting than being on the outside ever did.
Anyway, I wanted to ask if you have any general advice on how to test for irregularities. I usually spend a lot of time creating and looking at visualisations, and sometimes irregularities pop out (always innocent errors) but I'm a bit paranoid that a clever enough fraudster (or an unlucky enough error) could make false data that otherwise looks fine in graphs.
If I remember correctly, one of the indicators of irregularity in the recent Ariely scandal was that some of the data were written in a different font. I admit I probably never would have spotted that if I were the analyst, which worries me, but combing through a spreadsheet for inconsistencies like that seems impractical. Do you have any advice on this matter?
Posted by: Anonny | 12/02/2021 at 07:00 AM
Anon: Great question. The link to Data Colada above gives full details on how they discovered the data problems. Just to be clear, the font stuff was needed to establish fraudulent behavior; it wasn't needed to learn that there is something wrong with the data. The former is causal inference; the latter is anomaly detection. Usually you start with anomaly detection. You may decide to drop the data that have issues and focus on other variables. If the problematic variables are really important, then you'd need to trace the source of the problem. Most problems are not caused by fraud but arise inadvertently because data collection was not designed by the analyst, or the data was collected for other purposes.
I think you're giving me the idea of writing some posts about looking for data anomalies. In this short response, I'll say it all starts with having a healthy degree of skepticism - assuming that there is something wrong with every dataset you touch, as opposed to assuming that there is nothing wrong. In other words, guilty until proven otherwise.
Then, over time, you accumulate knowledge about the common ways in which data go wrong. Just off the top of my head: wrong formats, invalid values, a missing-value indicator that is itself a valid value, time series with gaps, and so on.
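To make that list concrete, a first pass over a single table might look something like the sketch below (pandas used purely for illustration; the column names "amount" and "date" and the sentinel values 0 and 999 are made-up placeholders, not from any real dataset):

```python
import pandas as pd

def basic_integrity_report(df: pd.DataFrame) -> None:
    """Print crude counts of the common single-table problems listed above."""
    # Wrong formats: values that should be numeric but will not parse
    amount = pd.to_numeric(df["amount"], errors="coerce")
    print("unparseable amounts:", int(amount.isna().sum() - df["amount"].isna().sum()))

    # Invalid values: outside the range the data dictionary promises
    print("negative amounts:", int((amount < 0).sum()))

    # Missing-value indicator that is itself a valid value (e.g. 0 or 999)
    print("zero amounts:", int((amount == 0).sum()))
    print("amounts equal to 999:", int((amount == 999).sum()))

    # Time series with gaps: compare observed dates against a full daily calendar
    dates = pd.to_datetime(df["date"], errors="coerce").dt.normalize()
    calendar = pd.date_range(dates.min(), dates.max(), freq="D")
    print("missing days:", len(calendar.difference(dates.dropna().unique())))
```

Nothing fancy - the point is simply that every new dataset gets run through some checklist like this before any analysis.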
Lastly, I'd say correlation is your friend. After you run some statistics on the data, you might be able to form intuitive statements such as: "the count of X is 40, but by construction the count should be less than twice the number of rows, and twice the number of rows cannot be more than 35." The point is that these errors don't reveal themselves in the univariate distributions but in what those statistics imply about other variables or about the constraints on the data.
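Here's a rough sketch of what such cross-variable checks might look like in code (again pandas for illustration; the column names and the specific rules are placeholders for whatever constraints your data are supposed to satisfy):

```python
import pandas as pd

def check_cross_variable_constraints(df: pd.DataFrame) -> list:
    """Return descriptions of violated relationships between variables."""
    problems = []

    # Placeholder rule: a count column should not exceed twice an exposure column
    too_big = int((df["claim_count"] > 2 * df["policy_years"]).sum())
    if too_big:
        problems.append(f"{too_big} rows where claim_count > 2 * policy_years")

    # Placeholder rule: component columns should add up to the reported total
    mismatch = int((df["part_a"] + df["part_b"] != df["total"]).sum())
    if mismatch:
        problems.append(f"{mismatch} rows where part_a + part_b != total")

    return problems
```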
Posted by: Kaiser | 12/02/2021 at 10:50 AM
The first law of statistics should be "Trust no one". People will tell you all sorts of things about their study design, data collection and data checking, but it usually needs some sort of verification. One thing that often indicates fraud is that the data is too good. In most, but not all, analyses there are missing data. Almost all datasets have some outliers, though that is less likely if the data has already been cleaned.
The thing about the fraudulent paper is that it is not difficult, using R, to create a data set that would pass most checks.
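To make the "data is too good" point above concrete, a crude screen might look like the following (pandas here only as an example; the three-standard-deviation cutoff is an arbitrary illustrative choice, not a rule):

```python
import pandas as pd

def too_good_to_be_true_report(df: pd.DataFrame) -> None:
    """Flag columns that look suspiciously clean."""
    # Real-world data almost always have some missing values
    missing_share = df.isna().mean()
    print("columns with zero missing values:",
          list(missing_share[missing_share == 0].index))

    # Real-world numeric columns almost always have some outliers;
    # here, values more than 3 standard deviations from the mean
    for col in df.select_dtypes("number").columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        if int((z.abs() > 3).sum()) == 0:
            print(f"{col}: no values beyond 3 standard deviations (unusually tidy)")
```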
Posted by: Ken | 12/10/2021 at 09:53 PM