The New York Times took over 1,000 words to tell us that Big Data won't change the economy (or is it the economists' profession?) ("Is Big Data an economic Big Dud?"). I'm less pessimistic. I think the collection of vast troves of observational data is ultimately beneficial, but only if (a) we set a high bar for analytics, such as requiring multiple corroborating data sources pointing to the same conclusion; (b) we initiate efforts to collect specific data through well-designed surveys or experiments, with the goal of providing a degree of quality control; and (c) we advance the methodologies of analyzing observational data.
The bar is not high, given the woeful state of research on the economy exposed by the Great Recession. Even worse are the "consensus forecasts" issued about business performance. But don't for a second fall into the trap of thinking that more data automatically lifts that low bar; quite the contrary, without the preconditions listed above, more data could push the bar even lower than it is now. That is one of the central concerns of Numbersense.
Now, reporters take note. Please don't repeat the following fallacies that pollute most articles written about Big Data.
The NYT starts by telling us how fast the volume of data is growing: "The astounding rate of growth would make any parent proud." Give me a break. The simplest way to generate tons of data is to introduce bugs and mistakes into code. I will give you two examples of what counts as "data" generation.
First, when you import data into a database, a log file is produced containing entries like "Import started.", "Read 50,000 entries.", "Warning: Found date in non-standard format", and so on, plus overhead entries recording the time of execution, the name of the software, and the like. Such log files are regarded as data, specifically "unstructured" data.
You can use, say, an ODBC connection to import an external file, but in many such arrangements, the database only processes a small number of rows per instruction, say 500 rows. So if your data comprises 20 million rows, the import requires 40,000 instructions for one upload. Even without errors or warnings, you've generated 1.2 to 2 million rows in the log file. If you have a column of "non-standard dates", you've generated at least 20 million rows of warnings. Your log file now has more rows than your data. You've truly contributed to the growth of data.
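The arithmetic can be checked in a few lines of Python. Note the assumption baked in: each instruction is taken to write 30 to 50 log entries (status lines, timings, software banners), which is the figure needed to reproduce the 1.2-to-2-million range; it is an illustration, not a measurement of any particular database.

```python
BATCH_SIZE = 500            # rows the database processes per instruction
N_ROWS = 20_000_000         # rows in the source file

# Ceiling division: number of instructions for one upload
n_batches = -(-N_ROWS // BATCH_SIZE)        # 40,000 instructions

# Assumption: 30 to 50 log entries per instruction (chosen to match
# the range quoted in the text, not a property of any real system)
routine_low = n_batches * 30                # 1,200,000 log rows
routine_high = n_batches * 50               # 2,000,000 log rows

# One warning line per row with a non-standard date (worst case)
warning_rows = N_ROWS

print(n_batches)                            # 40000
print(routine_low, routine_high)            # 1200000 2000000
print(warning_rows + routine_low > N_ROWS)  # True: log outgrows the data
```

Even at the low end of the assumed logging rate, adding one warning per bad date pushes the log file past the size of the data itself.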
The second example is really an analogy. I am reminded of college days when I accidentally printed machine code in the computer lab. The human-readable code might be 50 pages long, but one typo and the printer started churning out thousands of pages of gibberish. We did the sensible thing in those days: we threw out the gibberish.
The second fallacy is encapsulated in this statement: "the economy is, at best, in the doldrums and has stayed there during the latest surge in Web traffic." Behind this kind of blather is the presumption of a simple, direct link between data and "the economy".
Maturity in data-analytic thinking starts with recognizing that we are dealing with complex, multivariate, stochastic, dynamic systems. There will never be an F = ma or E = mc^2 for the "economy". What does the "economy" mean anyway? Is it GDP growth? Is it employment level? Is it inflation? Is it happiness? Is it social wellbeing?
Such statements show a shallow appreciation for the factors leading to the economic collapse. Chief among these is the housing bubble, which was promoted by bad public policy, and in many cases, blatant fraud. Data cannot solve such problems; we need political will, ethics, and enforcement of the law. In fact, as I argue in Numbersense, data worsens these problems as more data allows more theorists to cite "evidence".
The third fallacy is causation creep, something I discuss here frequently. This involves first acknowledging correlation is not causation, and then slipping in the causal interpretation anyway. For example, this sentence contradicts itself: "The overall economic trends are complex, but an argument could be made that the slowdown began around 2005 — just when Big Data began to make its appearance." If you believe the antecedent, then you have to conclude that the second clause is too simplistic; conversely, if you believe the second clause, then the antecedent is falsified.
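Causation creep is easy to manufacture with any two series that happen to trend over the same period. Here is a minimal sketch, using made-up numbers (the series, units, and noise levels are all invented for illustration): two quantities that share nothing but an upward time trend will still show a correlation close to 1.

```python
import random

random.seed(42)
years = range(2005, 2014)

# "Web traffic": grows steadily with noise (made-up units)
traffic = [100 + 50 * t + random.gauss(0, 20) for t, _ in enumerate(years)]
# An unrelated quantity that also happens to trend upward over the same years
unrelated = [5 + 2 * t + random.gauss(0, 1) for t, _ in enumerate(years)]

def corr(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Close to 1, yet the two series share nothing but a time trend
print(round(corr(traffic, unrelated), 2))
```

The shared trend does all the work; the correlation tells you nothing about whether one series drives the other, or whether either drives "the economy".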
The point of Chapters 6 and 7 (on economic data) in Numbersense is to raise awareness of the complexities of developing metrics for something as amorphous as the "economy". One then realizes the challenges of interpreting such metrics. To get the full argument, get a copy of the book today.