


Rick Wicklin

Here's a humorous example of junk data. Some people attempt to estimate popularity based on the number of hits in a Google search. For example, some people record the number of hits for "R programming" and "SAS programming." Interestingly, documents such as "SAS(R) Programming Course Notes" and "Step-by-Step Programming with Base SAS(R) Software" appear in the results for "R programming"! The registered trademark symbol for a SAS publication makes the search engine think the document is about R.

Andy Palmer

Isn't this just another case of separating the signal from the noise? (in this case, the noise has been deliberately added)

With a large enough dataset, would fake entries have a way of identifying themselves (for example, via Benford's law)?
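Andy's Benford's-law idea can be sketched in a few lines. This is a hypothetical illustration, not anything from the discussion: it compares the leading-digit frequencies of a batch of numbers against the Benford distribution, on the premise that many naturally occurring datasets follow it while naively fabricated entries often do not. The function names are my own.

```python
import math
from collections import Counter

def leading_digit(n):
    """First significant digit of a positive integer."""
    n = abs(int(n))
    while n >= 10:
        n //= 10
    return n

def benford_deviation(values):
    """Total absolute deviation between the observed leading-digit
    frequencies of `values` and the Benford distribution.
    Near 0 means a close match; large values suggest the data may
    not be naturally occurring (or simply not Benford-like)."""
    digits = [leading_digit(v) for v in values if v]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        abs(counts.get(d, 0) / n - math.log10(1 + 1 / d))
        for d in range(1, 10)
    )

# Powers of 2 are known to follow Benford's law closely;
# a block of numbers that all start with the same digit does not.
genuine = [2 ** k for k in range(1, 1000)]
faked = list(range(5000, 6000))
```

The caveat from the thread still applies: plenty of legitimate data (e.g. bounded quantities like ages or prices in a narrow range) also fails Benford's law, so a large deviation is a prompt for inspection, not proof of fakery.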


hrbrmstr

I don't think "junk data" is limited to "big data". I recently performed a data quality analysis of something you would think would be "good data" at this point, computer/data breaches: http://l.rud.is/W9wCqi

Generating any type of real analysis from it requires more caveats than there would be text in the report. There are components of the data set that are good enough for use in real analytics. Like Andy said, we just need to figure out the best method for separating signal from noise.

Phil H

When you ask people a survey question, you know there is a gap between the answers you get and objective reality. Since you are again dealing with information people volunteer rather than information gathered on them objectively, you run into exactly the same sampling problem.

This is not a new thing; it is an old thing in a new dataset.

So you do what you always did: try to find a way to handle that gap, apply it as best you can, or just offer qualifiers on the results.


Rick: that's a good example. Related to that, search engines look for relevant matches, not exact matches, so even if "R" doesn't appear in "SAS(R)", I'd think some SAS results may be considered somewhat relevant.

Andy: It is certainly a kind of noise, but it is one that would be hard to put a distribution on unless you know how it is generated. It's easier if I know who CouchCachet is, but the challenge is that there may be a hundred other such outfits which the analyst may not know about. The "deliberate" part is also important.

Hrbrmstr: Good point, and great post. The kind of exploratory analysis you did there is so important to every analysis. The problem with Big Data is that if there are 1,000 variables, it's pretty difficult to do the type of exploration you demonstrated. But not doing it subjects the analysis to all kinds of potential errors.
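One crude mitigation for the 1,000-variable problem mentioned above is to automate a first-pass profile of every column and hand-inspect only the ones that look degenerate. A minimal sketch, with the function name and flags being my own invention (not anything from the post):

```python
def profile_columns(rows):
    """Given a list of row dicts, compute a quick per-column profile:
    missing rate, distinct-value count, and a flag for degenerate
    (constant) columns. A crude way to triage hundreds of variables
    before doing deeper exploratory analysis on the suspicious ones."""
    profile = {}
    n = len(rows)
    for col in rows[0].keys():
        values = [r.get(col) for r in rows]
        present = [v for v in values if v is not None]
        distinct = len(set(present))
        profile[col] = {
            "missing_rate": (n - len(present)) / n,
            "distinct": distinct,
            "constant": distinct <= 1,  # flag columns with no variation
        }
    return profile

# Toy example: column "b" is half missing, column "c" never varies.
rows = [
    {"a": 1, "b": None, "c": "x"},
    {"a": 2, "b": 3, "c": "x"},
]
report = profile_columns(rows)
```

Automated triage like this catches only gross problems (missingness, constants); it does not replace the kind of substantive exploration described in the linked post.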

Phil, Hrbrmstr & others: I think most data analysts realize that "Big Data" has not changed anything; all of the best practices of the old days still apply. I agree, but would also add that Big Data makes life much more difficult. When there are more dimensions and more cases, there is much more room for bad analyses; e.g., each one of those could be "faked".

Rod Lucas

I think that these data analysis software teams do so much good! They really can accomplish so much.



Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.