It is inevitable that all the hype around "Big Data" leads to a backlash. As someone who was working in "data science" before the term existed, I am happy to see widespread validation of the field but also concerned about over-promising and under-delivering. Several recent articles went overboard in criticizing data science -- while their points are sometimes valid, the tone of these pieces misses the mark. I'll discuss one of these articles in this post, and some others in the next few days.
***
Andrew Gelman has a beef with David Brooks over his New York Times column called "What Data Can't Do". (link) I will get to Brooks's critique soon--my overall feeling is that he created a bunch of sound bites, and could have benefited from interviewing people like Andrew and me, who are skeptical of Big Data claims but not maniacally dismissive.
The biggest issue with Brooks's column is the incessant use of the flawed man-versus-machine dichotomy. He warns: "It's foolish to swap the amazing machine in your skull for the crude machine on your desk." The machine he has in mind is the science-fictional, self-sufficient, intelligent computer, as opposed to the algorithmic, dumb-and-dumber computer as it has existed for many decades and still exists today. A more appropriate analogy for today's computer (and that of the foreseeable future) is a machine that the human brain creates to automate mechanical, repetitious tasks at scale. This machine cannot function without human piloting, so it's man versus man-plus-machine, not man versus machine.
I use such an analogy in Chapter 2 of Numbers Rule Your World, to compare and contrast the credit-scoring algorithmic paradigm with the manual underwriting paradigm of the past. The point is that there is more similarity than difference between the automated and the manual methods; the automated methods are faster, better able to handle multiple threads, and unfazed by individual bias.
***
A major blind spot in Brooks's column is that it ignores the work of Kahneman and Tversky, and other behavioral psychologists, who have shown convincingly that the human brain is subject to all kinds of biases and uses heuristics that lead to incorrect judgments.
A large body of work, for instance, points to the "priming" effect. Someone may walk into the supermarket and buy detergents just because he or she heard a story about cheating on the radio. Of course, people would deny such influences but experiments prove the effects exist. There's also the experiment that shows that subjects who are asked to hold a pencil in their mouth to activate "grin" muscles feel happier than those made to activate "growl" muscles.
It is comedic when Brooks tells us that "people are really good at telling stories that weave together multiple causes and multiple contexts... data... cannot match the explanatory suppleness of even a mediocre novel". I mean, does he care if the "stories" and "novels" lead to correct decisions? Or is he just in it for entertainment?
***
While I agree with some of Brooks's diagnosis of the problems with data-driven analyses, the alternative of not using data--or "using the brain," as he calls it--often does not produce a demonstrably better outcome.
Under "Big Data has trouble with big problems," he complained: "We've had huge debates over the best economic stimulus, with mountains of data, and as far as I know not a single major player in this debate has been persuaded by data to switch sides."
Macroeconomics has always been a field that suffers from a lack of data (in fact, a sample path of one). I don't know what "mountains of data" he's talking about, especially since things we're doing, like quantitative easing, have never been done before. Nor do I understand why the proof of utility is side-switching. He may be right that the economists have not switched sides, but this says more about the people--who, mind you, have embarrassing records when it comes to managing the economy--than about the data. Given that Brooks doesn't think the economic decisions were based on data, who should we blame for the many economic failures? The human brain?
"Data creates bigger haystack... and the needle we are looking for is still buried deep inside." This is definitely true. But why is data analysis the problem here? What's the alternative he has in mind? Most data analysts sooner or later realize that "data science" is as much art as science. We don't have to pick one or the other; we can have the best of both worlds.
***
Brooks made a really great point at the end of the piece, which I will paraphrase: any useful data is cooked. "The end result looks disinterested, but, in reality, there are value choices all the way through, from construction to interpretation." Instead of thinking about this as cause for concern, we should celebrate these "value choices" because they make the data more useful.
This brings me back to Gelman's reaction, in which he differentiates between good analysis and bad analysis. Except for the simplest problems, any good analysis uses cooked data, but an analysis using cooked data could be good or bad.
The problem with economics is that the models used in econometrics are all wrong. So it doesn't matter how much data you have; it is not going to help in predicting. There is a basic assumption that people are rational and have full information, so if house prices rise by 15% in a year there must be some rational reason. Those who bought an American house in 2005 are probably still looking for it.
Posted by: Ken | 02/21/2013 at 04:05 PM
In my experience (n=1) the type of people who set up the man vs machine dichotomy do so because they don't understand what the machine is doing and therefore don't trust it. It seems like magic to them, and therefore is scary.
Posted by: Rebecca | 02/21/2013 at 07:27 PM
It's fallacious to say that since "there are value choices all the way through, from construction to interpretation," all data are therefore "cooked"--except in the most trivial sense that makes the question of 'cooked data' utterly pointless. Worse, it does damage to the possibility of distinguishing radically value-dependent interpretations from others. We can, in non-problematic cases, ascertain how the discretionary choices influence results so as to distinguish aspects of the source from the value choices. The choice of pounds in the U.S. is a "value choice" but does not prevent me from using a scale reading in pounds to ascertain (approximately) how much I weigh. This "subtracting out" may not be possible for problematically biased tools. I discuss this on my blog. http://errorstatistics.com/2012/03/14/2752/
Posted by: Mayo | 02/21/2013 at 07:31 PM
Mayo - Your blog post is provocative, though I have a different take on the matter. In my post, I used "cooked" in order to speak to the people who prefer "raw" data. What I mean by "cooked" is "statistically adjusted" data, whether through seasonal adjustments, smoothing, imputations, etc. I also include any data processing that changes "raw" data, even uncontroversial things like throwing out the occasional invalid value (such as an SAT score that shows up as 20,000). In my view, every such step reflects a subjective decision in the sense that different people can legitimately choose different ways to adjust the data, and these adjustments become part of the "model", something a lot of analysts don't seem to recognize. You're right that this formulation becomes trivial since it covers almost every statistical analysis. That's the point, and it's a reaction against those who regard any "adjustment" as a form of cheating.
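To make this concrete, here is a minimal sketch of the kind of "cooking" I have in mind; the numbers, the score range, and the specific adjustments are all hypothetical choices, not a recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical raw series: monthly average SAT scores, with one invalid entry (20,000)
raw = pd.DataFrame({
    "month": pd.date_range("2012-01-01", periods=12, freq="MS"),
    "score": [1480, 1510, 20000, 1495, 1502, 1490, 1488, 1515, 1499, 1505, 1492, 1500],
})

cooked = raw.copy()
cooked["score"] = cooked["score"].astype(float)  # so the invalid value can be blanked out as NaN

# Adjustment 1: flag invalid values -- deciding what counts as "invalid" is itself a judgment call
cooked.loc[~cooked["score"].between(600, 2400), "score"] = np.nan

# Adjustment 2: impute the flagged value; linear interpolation is one of several defensible choices
cooked["score"] = cooked["score"].interpolate()

# Adjustment 3: smooth with a 3-month centered moving average; the window length is another choice
cooked["smoothed"] = cooked["score"].rolling(window=3, center=True).mean()

print(cooked)
```

Each of these steps could legitimately be done differently, or not at all, and whichever version I pick quietly becomes part of the model.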
I work primarily with business and social science data, which means mostly observational data that is conveniently sampled. Most of this data is dirty. It is well known in Internet circles, for example, that if you set up multiple vendors to measure something as basic as the number of unique visitors to a website, the systems will produce widely varying statistics, perhaps 20 or 30 percent apart. Before I do anything, I already have to pick one of the data sources, which is a subjective value judgment.
I do not agree that we can measure "how much noise the observational scheme is likely to produce". This certainly isn't the case in my world. It is more likely that I can't even articulate the observational scheme because someone else not even in my organization collected the data, and no one is available to explain anything. (Try to find someone to explain the details of how the statistics on Google Analytics are compiled, and you'll understand what I mean.)
If your argument is that conditional on all the subjective elements, one can generate an objective measurement, then I have no disagreement except that it is also a trivial statement.
If your point is that we shouldn't lump all analyses into one bucket called "subjective", then we are on the same page as I agree that there are good and bad assumptions. I'd agree that analyses of designed experiments are more "objective" than analyses of observational data. The challenge is how to distinguish between good and bad assumptions, and I suppose your blog post represents your thoughts around this question.
Posted by: Kaiser | 02/22/2013 at 12:10 AM
This is a response to your use of the phrase, "but experiments prove the effects exist." I've read a bit about the studies you're referring to and, as I recall, the findings were obtained using the analytical methods of "classical statistics." You of course know this but these methods don't prove conclusions in the sense that the term "prove" is used in mathematics. I know this is tangential to the main thrust of your post. I only make this point because I think it's important for statisticians/data scientists and quantitative physical or social scientists (like me) not to be misleading about what statistical methods can and can't accomplish.
Posted by: Michael L. | 02/24/2013 at 01:33 PM
Michael: I accept your point about the word choice. My feeling is that "prove" should have a statistical meaning in addition to the theorem-proof meaning. Otherwise, it has no meaning in statistics since even with randomized experiments, we have not "proven" anything... there is always an error bar whenever we generalize. This is as true in classical as it is in modern statistics. From a common-usage perspective, almost everyone accepts that "smoking causes cancer," and most people accept that science has "proven" this causal link, and yet most statisticians, including me, also accept that such "proof" is not proof in the theorem-proof sense. (And smoking-cancer is already a clearcut case.) On the other hand, I also agree that there is plenty of published research out there that does not "prove" anything, so we need language to distinguish the good evidence from the merely publishable.
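To show what I mean by the error bar, here is a minimal sketch with entirely made-up response rates and sample sizes: even a clean two-arm randomized experiment gives an interval estimate, not a theorem.

```python
import numpy as np

rng = np.random.default_rng(2013)

# Hypothetical randomized experiment with a binary outcome, 1,000 subjects per arm
treatment = rng.binomial(1, 0.55, size=1000)  # "true" response rate 55% (unknown in practice)
control = rng.binomial(1, 0.50, size=1000)    # "true" response rate 50% (unknown in practice)

# Estimated difference in response rates and its standard error
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))

# The generalization always carries an error bar: a 95% confidence interval, not a proof
print(f"estimated lift: {diff:.3f} +/- {1.96 * se:.3f}")
```

The interval shrinks with more data, but it never collapses into a theorem-style proof.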
In this particular case, I learned about the priming studies from Kahneman's recent book. The whole concept sounded preposterous to me at first, but all the studies together have convinced me that this effect exists. So hopefully that clears the air for readers. (The implications of priming for all kinds of social science modeling should be talked about more!)
Posted by: Kaiser | 02/25/2013 at 09:30 AM
I believe the problem we have with data (and information in general) is that we often treat it as first principles. We give it the privileged status of a priori validity with little to no further scrutiny. Perhaps only a few years ago we were giving the same privilege to pundits, like Brooks. Following Kaiser's point, the backlash against big data is encouraging, but one wonders if Brooks's argument is simply an appeal from the old guard punditry (people like him) to go back to how things used to be. In the last few years, punditry has tended away from experience-based opinion (often from former news reporters) and moved toward (ostensibly) data-driven opinion (think Karl Rove, Frank Luntz, Ezra Brooks).
Posted by: Jordan Goldmeier | 02/26/2013 at 01:36 PM
Kaiser: I'm guessing that getting varying statistics on "something as basic as the number of unique visitors to the website" indicates a systematic difference, and thus one could adjust accordingly. However, I agree that if you cannot say what it is you're intending to measure, even approximately, then there may be no hope for "subtracting out" the influence of discretionary choices. But in that case, it is not scientific measurement, at least not yet.
Posted by: Mayo | 02/28/2013 at 02:47 PM