



The problem with economics is that the models used in econometrics are all wrong, so no matter how much data you have, it is not going to help with prediction. There is a basic assumption that people are rational and have full information, so if house prices rise by 15% in a year, there must be some rational reason. Those who bought an American house in 2005 are probably still looking for it.


In my experience (n=1), the type of people who set up the man-vs-machine dichotomy do so because they don't understand what the machine is doing and therefore don't trust it. It seems like magic to them, and is therefore scary.


It's fallacious to say that since "there are value choices all the way through, from construction to interpretation" that therefore all data are "cooked"--except in the most trivial sense that makes the question of 'cooked data' utterly pointless. Worse, it does damage to the possibility of distinguishing radically value-dependent interpretations from others. We can, in non-problematic cases, ascertain how the discretionary choices influence results so as to distinguish aspects of the source from the value choices. The choice of pounds in the U.S. is a "value choice" but does not prevent me from using a scale reading in pounds to ascertain (approximately) how much I weigh. This "subtracting out" need not be so for problematically biased tools. I discuss this on my blog. http://errorstatistics.com/2012/03/14/2752/


Mayo - Your blog post is provocative though I have a different take on the matter. In my post, I used "cooked" in order to speak to the people who prefer "raw" data. What I mean by "cooked" is "statistically adjusted" data, whether this is seasonal adjustments, smoothing, imputations, etc. I also include any data processing that changes "raw" data, even uncontroversial things like throwing out the occasional invalid values (such as an SAT score that shows up as 20,000). In my view, every such step reflects a subjective decision in the sense that different people can legitimately choose different ways to adjust the data, and these adjustments become part of the "model", something a lot of analysts don't seem to recognize. You're right that this formulation becomes trivial since this covers almost every statistical analysis. That's the point, and it's a reaction against those who regard any "adjustment" as a form of cheating.
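A minimal sketch (in Python, with invented scores) of the point about adjustments becoming part of the model: two defensible ways to handle that invalid SAT value of 20,000 yield different summary statistics, so the cleaning rule itself is a modeling choice.

```python
import statistics

# Hypothetical SAT scores with one clearly invalid entry.
scores = [1180, 1320, 990, 20000, 1450, 1210]

# Analyst A: drop any value outside a plausible range.
cleaned_a = [s for s in scores if 400 <= s <= 1600]

# Analyst B: replace the invalid value with the median of the valid ones.
valid_median = statistics.median(s for s in scores if 400 <= s <= 1600)
cleaned_b = [s if 400 <= s <= 1600 else valid_median for s in scores]

# Both choices are legitimate, yet they produce different means,
# so the cleaning step is effectively part of the model.
print(statistics.mean(cleaned_a))  # drop rule
print(statistics.mean(cleaned_b))  # impute rule
```

Neither analyst is "cheating" here; the results simply depend on a subjective decision made before any formal analysis begins.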

I work primarily with business and social science data, which means mostly observational data, conveniently sampled. Most of it is dirty. It is well known in Internet circles, for example, that if you set up multiple vendors to measure something as basic as the number of unique visitors to a website, the systems will produce widely varying statistics, perhaps 20 or 30 percent apart. Before I do anything, I already have to pick one of the data sources, which is a subjective value judgment.

I do not agree that we can measure "how much noise the observational scheme is likely to produce". This certainly isn't the case in my world. It is more likely that I can't even articulate the observational scheme because someone else not even in my organization collected the data, and no one is available to explain anything. (Try to find someone to explain the details of how the statistics on Google Analytics are compiled, and you'll understand what I mean.)

If your argument is that conditional on all the subjective elements, one can generate an objective measurement, then I have no disagreement except that it is also a trivial statement.

If your point is that we shouldn't lump all analyses into one bucket called "subjective", then we are on the same page as I agree that there are good and bad assumptions. I'd agree that analyses of designed experiments are more "objective" than analyses of observational data. The challenge is how to distinguish between good and bad assumptions, and I suppose your blog post represents your thoughts around this question.

Michael L.

This is a response to your use of the phrase "but experiments prove the effects exist." I've read a bit about the studies you're referring to and, as I recall, the findings were obtained using the analytical methods of "classical statistics." You of course know this, but these methods don't prove conclusions in the sense that "prove" is used in mathematics. I know this is tangential to the main thrust of your post. I only make this point because I think it's important for statisticians/data scientists and quantitative physical or social scientists (like me) not to be misleading about what statistical methods can and can't accomplish.


Michael: I accept your point about the word choice. My feeling is that "prove" should have a statistical meaning in addition to the theorem-proof meaning. Otherwise, it has no meaning in statistics, since even with randomized experiments we have not "proven" anything: there is always an error bar whenever we generalize. This is as true in classical as it is in modern statistics. From a common-usage perspective, almost everyone accepts that "smoking causes cancer", and most people accept that science has "proven" this causal link, yet most statisticians, including me, also accept that such "proof" is not proof in the theorem-proof sense. (And smoking-cancer is already a clearcut case.) On the other hand, I also agree that there is plenty of published research out there that does not "prove" anything, so we need language to distinguish good evidence from the merely publishable.
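An illustration of the error-bar point: even a clean randomized experiment yields an interval estimate, not a theorem-style proof. The counts below are invented; the interval uses the standard normal approximation for a difference in proportions.

```python
import math

# Hypothetical randomized experiment: treatment vs. control conversion counts.
treat_success, treat_n = 120, 1000
ctrl_success, ctrl_n = 90, 1000

p1, p2 = treat_success / treat_n, ctrl_success / ctrl_n
diff = p1 - p2

# Normal-approximation 95% interval for the difference in proportions.
se = math.sqrt(p1 * (1 - p1) / treat_n + p2 * (1 - p2) / ctrl_n)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

# Randomization licenses a causal reading, but the conclusion still
# comes with an error bar, never a mathematical certainty.
print(f"estimated lift: {diff:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

Here the interval excludes zero, so we would say the experiment supplies strong evidence of an effect, which is the statistical sense of "prove" discussed above.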

In this particular case, I learned about the priming studies from Kahneman's recent book. The whole concept sounded preposterous to me at first, but all the studies together have convinced me that the effect exists. So hopefully that clears the air for readers. (The implications of priming for all kinds of social science modeling should be talked about more!)

Jordan Goldmeier

I believe the problem we have with data (and information in general) is that we often treat it as a first principle. We give it the privileged status of a priori validity with little to no further scrutiny. Perhaps only a few years ago we were giving the same privilege to pundits, like Brooks. Following Kaiser's point, the backlash against big data is encouraging, but one wonders if Brooks's argument is simply an appeal from the old-guard punditry (people like him) to go back to how things used to be. In the last few years, punditry has tended away from experience-based opinion (often from former news reporters) and moved toward (ostensibly) data-driven opinion (think Karl Rove, Frank Luntz, Ezra Brooks).


Kaiser: I'm guessing that getting varying statistics on "something as basic as the number of unique visitors to the website" indicates a systematic difference, and thus one could adjust accordingly. However, I agree that if you cannot say what it is you're intending to measure, even approximately, then there may be no hope for "subtracting out" the influence of discretionary choices. But in that case, it is not scientific measurement, at least not yet.
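One hypothetical sketch of the "subtracting out" described here: if two vendors' unique-visitor counts differ by a roughly constant multiplicative factor, the ratio estimated from their overlapping history can calibrate one source onto the other's scale. All numbers below are made up.

```python
# Hypothetical daily unique-visitor counts from two vendors
# that disagree systematically (vendor B runs about 20% low).
vendor_a = [10200, 9800, 11000, 10500]
vendor_b = [8100, 7900, 8800, 8400]

# Estimate the systematic ratio from the overlapping history.
ratio = sum(vendor_a) / sum(vendor_b)

# Adjust a new vendor-B reading onto vendor A's scale.
new_b_reading = 8600
adjusted = new_b_reading * ratio
print(round(ratio, 3), round(adjusted))
```

Of course, this only works when the discrepancy really is systematic; if the two systems disagree erratically, or if no one can say what either is actually measuring, there is no stable factor to subtract out.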
