



I've only skimmed the two articles so far, but I find it amazing that this is posed as some sort of debate. I suppose it is, but if statisticians abandon big data to the computer scientists the field is headed toward academic irrelevance.

Statistics is an applied field after all. An applied field that shrinks from the difficult questions involved in being applied stops being relevant to that area of application.

The analogy I'm using lately is that a statistic (like Stan Musial's lifetime batting average) is an object, like a bridge. A user of that statistic needs to know a few things in order to use it (like, don't drive off the side of the bridge). To design a bridge, you need a civil engineer (another type of applied mathematician) who can make the decisions about proper construction.

Even in the case of a simple statistic like a batting average, there are questions of proper construction: do we include walks? what if you reach on an error? should doubles count more than singles? Proper construction of measurement is what applied statistics is all about -- not surprisingly, because a statistic is a measure.

Bridge technology has changed substantially over the past 300 years: there are new materials to make the bridges (no more iron trusses). There are new types of loads to be carried (the introduction of trains, the demands to carry heavy trucks, etc.).

While any individual engineer can (and should) decide that they are only going to deal with certain types of bridges, the field as a whole has had to broaden to deal with these challenges.

Similarly, individual statisticians can deal with big data or not, but the field as a whole shrinks its domain of relevance if the challenge of dealing with the new materials of big data is ignored. We will leave it to the computer scientists, the operations research practitioners, and the engineers.


I've now read David Walker's piece (the other side of the argument) twice. His objection to statisticians becoming involved in big data is that they will be used to help companies: "Big data is principally about taking more money off customers by (let us put it pejoratively) more effective snooping on their habits."

That's a narrow definition of what marketing is about, but certainly marketing is at least partly about that.

Having spent my entire professional career in statistical applications in marketing research, I can tell Walker that this ship sailed long ago. Marketing research is by no means the valedictorian of the statistical school, but I'm not quite sure why Walker seems to be contending that we should be expelled, or at least wants to pretend we don't exist.

Walker does have some good points. I'm fully on board with his McKinsey bashing, and he does make the vital point that "big data" and "open data" aren't the same thing at all.


I have not read the original article, so my comment is based on the excerpt.

Re automation: Reacting to every "significant" result as if it were a valid signal will lead to "tampering", as Deming would have put it. Given the number of tests which are or will be run, there will be many, many instances when a low-probability event will be judged "significant" and action will be taken. Often, the proper action is to take no action.
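
To put rough numbers on that point, here is a minimal Python sketch. It assumes the tests are independent and that every null hypothesis is true (no real signal anywhere), which is a simplification; the function names and the 5% threshold are illustrative choices, not anything taken from the article.

```python
import random


def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected count of 'significant' results when no real effect exists."""
    return num_tests * alpha


def prob_at_least_one(num_tests: int, alpha: float) -> float:
    """Chance that at least one of num_tests independent null tests is flagged."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positives(num_tests: int, alpha: float, seed: int = 0) -> int:
    """Simulate null tests: under the null, a p-value is uniform on [0, 1)."""
    rng = random.Random(seed)
    return sum(1 for _ in range(num_tests) if rng.random() < alpha)


if __name__ == "__main__":
    alpha = 0.05  # assumed significance threshold
    for n in (20, 1000, 10000):
        print(
            n,
            expected_false_positives(n, alpha),
            round(prob_at_least_one(n, alpha), 4),
            simulate_false_positives(n, alpha),
        )
```

Under these assumptions, an automated system running 1,000 tests at the 5% level would flag about 50 of them even if nothing had changed, and acting on each of those is the tampering Deming warned about.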


Can you please explain the calculation that produced "10% progress that won the million-dollar prize is roughly worth one-tenth of one star on the five-star rating scale"?


Say some more about why the marriage is pending.
Is that a change of definitions? Some of this is pretty old.

I was at Bell Labs in the 1970s/early 1980s, and Murray Hill Bldg 5 had statisticians who spent their time analyzing telephone records whose volumes certainly fit Big Data for the time, even if it wasn't called that. The Bell System tracked every trouble report, down to things like squirrel bites and gunshots, and later we did expert systems for rummaging the data and looking for patterns.

In the 1990s, Silicon Graphics was selling supercomputers to telcos and others, both for marketing analytics and fraud detection, both of which needed much statistical analysis.

For some history, see:

See especially slides 22-24.

The comments to this entry are closed.

Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.