
Comments


Anonny

Stuff like this makes me kind of miserable. I'm a PhD student and, unfortunately, being *inside* of academia has made me far more skeptical of scientific reporting than being on the outside ever did.

Anyway, I wanted to ask if you have any general advice on how to test for irregularities. I usually spend a lot of time creating and looking at visualisations, and sometimes irregularities pop out (always innocent errors) but I'm a bit paranoid that a clever enough fraudster (or an unlucky enough error) could make false data that otherwise looks fine in graphs.

If I remember correctly, one of the indicators of irregularity in the recent Ariely scandal was that some of the data was written in a different font. I admit I probably never would have spotted that if I were the analyst, which worries me, but the idea of combing through a spreadsheet for inconsistencies like that seems impractical. Do you have any advice on this matter?

Kaiser

Anon: Great question. The link to Data Colada above gives full details on how they discovered the data problems. Just to be clear, the font stuff was needed to establish fraudulent behavior; it wasn't needed to learn that something was wrong with the data. The former is causal inference, the latter is anomaly detection. Usually you start with anomaly detection. You may decide to drop the data that have issues and focus on other variables. If the problematic variables are really important, then you'd need to trace the source of the problem. Most problems are not caused by fraud but arise inadvertently, because data collection was not designed by the analyst or because the data was collected for other purposes.

I think you're giving me the idea of writing some posts about looking for data anomalies. In this short response, I'll say this all starts with having a healthy degree of skepticism: assume that there is something wrong with every dataset you touch, as opposed to assuming that there is nothing wrong - i.e., guilty until proven otherwise.

Then, over time, you accumulate knowledge about the common ways in which data can go wrong. Just off the top of my head: wrong formats, invalid values, a missing-value indicator that is itself a valid value, gaps in time series data, and so on.
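To make that concrete, here is a rough sketch of what such checks might look like in R - the data frame df, the columns age and date, and the sentinel value -99 are all placeholders, not anything from a real dataset:

    # Column-level sanity checks on a hypothetical data frame df
    sapply(df, class)                             # wrong formats: a column that should be numeric but arrived as character
    summary(df$age)                               # invalid values: anything outside a plausible range?
    sum(df$age < 0 | df$age > 120, na.rm = TRUE)  # count of impossible ages
    table(df$age == -99)                          # a missing-value indicator that is itself a valid value
    all_days <- seq(min(df$date), max(df$date), by = "day")
    setdiff(as.character(all_days), as.character(df$date))  # gaps in a daily time series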

Lastly, I'd say correlation is your friend. After you run some statistics on the data, you should be able to form intuitive statements such as "the count of X is 40, but the constraints of the data say this count should be less than twice the number of rows, which cannot be larger than 35 - so something is off." The point is that errors often don't reveal themselves in the univariate distributions but in the implications of those statistics for other variables, or in violated constraints.
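For example, such cross-variable checks in R could look like the following - again, the data frame df and the columns n_events, start_date, end_date, subtotal and total are hypothetical:

    # Cross-variable constraint checks on a hypothetical data frame df
    sum(df$n_events) <= 2 * nrow(df)                  # a count capped by another part of the data
    sum(df$end_date < df$start_date, na.rm = TRUE)    # dates in the wrong order
    sum(df$subtotal > df$total, na.rm = TRUE)         # a part larger than its whole
    cor(df$subtotal, df$total, use = "complete.obs")  # a correlation that should be strongly positive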

Ken

The first law of statistics should be "Trust no one". People will tell you all sorts of things about their study design, data collection and data checking, but it usually needs some sort of verification. One thing that often indicates fraud is that the data is too good. In most, but not all, analyses there are missing data. Almost all datasets have some outliers, though that is less likely if the data has already been cleaned.
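For instance, a few rough "too good to be true" checks in R might be the following - the data frame df is again hypothetical, and the 3-standard-deviation cutoff is arbitrary:

    # Is the data suspiciously clean?
    colMeans(is.na(df))                     # share of missing values per column; real data usually has some
    num_cols <- df[sapply(df, is.numeric)]
    sapply(num_cols, function(x) sum(abs(scale(x)) > 3, na.rm = TRUE))  # rough outlier counts
    sum(duplicated(df))                     # duplicate rows sometimes signal copy-pasted observations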

The thing about the fraudulent paper is that it is not difficult, using R, to create a data set that would pass most checks.
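To illustrate, a few lines of R can simulate data that would sail through the kinds of checks above - every number here (sample size, correlation, missing rate) is made up:

    set.seed(1)
    n <- 200
    x <- rnorm(n, mean = 50, sd = 10)
    y <- 0.6 * x + rnorm(n, sd = 8)    # a plausible-looking relationship with x
    y[sample(n, 5)] <- NA              # a realistic sprinkle of missing values
    fake <- data.frame(x = x, y = y)
    colMeans(is.na(fake))                        # not suspiciously complete
    cor(fake$x, fake$y, use = "complete.obs")    # unremarkable correlation

So passing such checks is necessary but not sufficient evidence that data is genuine.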
