


Rick Wicklin

Here's a humorous example of junk data. Some people attempt to estimate popularity based on the number of hits in a Google search. For example, some people record the number of hits for "R programming" and "SAS programming." Interestingly, documents such as "SAS(R) Programming Course Notes" and "Step-by-Step Programming with Base SAS(R) Software" appear in the results for "R programming"! The registered trademark symbol for a SAS publication makes the search engine think the document is about R.
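To make the mechanism concrete, here is a minimal sketch of how a matcher that strips punctuation could produce this kind of false hit. It is an illustration only: `naive_tokens` and `naive_match` are hypothetical helpers, and real search engines rank by relevance in ways that are far more elaborate than this.

```python
import re

def naive_tokens(text):
    # Assumption: lowercase the text and split on anything that is not a
    # letter or digit, so "SAS(R)" becomes the two tokens "sas" and "r".
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def naive_match(query, title):
    # A title "matches" when it contains every token of the query.
    return naive_tokens(query) <= naive_tokens(title)

titles = [
    "SAS(R) Programming Course Notes",
    "Step-by-Step Programming with Base SAS(R) Software",
    "SAS Programming Course Notes",
]
for title in titles:
    print(naive_match("R programming", title), title)
```

Under this assumed tokenizer, the two titles carrying the "(R)" mark count as hits for "R programming", while the same title without the mark does not.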

Andy Palmer

Isn't this just another case of separating the signal from the noise? (In this case, the noise has been deliberately added.)

With a large enough dataset, would fake entries have a way of identifying themselves (for example, by violating Benford's law)?
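For readers unfamiliar with the idea, a Benford check compares the observed leading digits of a set of numbers with the expected frequencies log10(1 + 1/d). The sketch below is a rough illustration, not a fraud detector: the function names are made up for this example, it assumes nonzero integer values spanning several orders of magnitude, and passing or failing the check proves nothing by itself about whether entries are fake.

```python
import math
from collections import Counter

def benford_expected():
    # Expected share of each leading digit d under Benford's law.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(n):
    # Assumption: n is a nonzero integer (e.g. a count).
    return int(str(abs(n))[0])

def first_digit_chi2(values):
    # Pearson chi-square statistic of observed vs. expected leading digits.
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )

# Powers of 2 are a classic Benford-like sequence; the integers 100..999
# have uniformly distributed leading digits and are not.
benford_like = [2 ** k for k in range(1000)]
uniform_like = list(range(100, 1000))
print(first_digit_chi2(benford_like))
print(first_digit_chi2(uniform_like))
```

With 8 degrees of freedom, the 5% critical value of the chi-square distribution is about 15.51, so a statistic far above that suggests the leading digits do not follow Benford's law. Whether that indicates fakery depends entirely on how the data were generated.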


I don't think "junk data" is limited to "big data". I recently performed a data quality analysis of something you would think would be "good data" at this point, computer/data breaches: http://l.rud.is/W9wCqi

Generating any type of real analysis from it requires more caveats than there would be text in the report. There are components of the data set that are good enough for use in real analytics. Like Andy said, we just need to figure out the best method for separating signal from noise.

Phil H

When you ask people a survey question, you know there is a gap between the answers you get and objective reality. Since you are again dealing with information people volunteer rather than information gathered on them objectively, you run into exactly the same sampling problem.

This is not a new thing; this is an old thing in a new dataset.

So you do what you always did: try to find a way to handle that gap, and apply it as best you can, or just offer qualifiers on the results.


Rick: that's a good example. Related to that, search engines look for relevant matches, not exact matches, so even if R doesn't appear in "SAS(R)", I'd think some SAS results may still be considered somewhat relevant.

Andy: It is certainly a kind of noise, but it is one that would be hard to put a distribution on unless you know how it is generated. It's easier if I know who CouchCachet is, but the challenge is that there may be a hundred other such outfits which the analyst may not know about. The "deliberate" part is also important.

Hrbrmstr: Good point, and great post. The kind of exploratory analysis you did there is so important to every analysis. The problem with Big Data is that if there are 1,000 variables, it's pretty difficult to do the type of exploration you demonstrated. But not doing it subjects the analysis to all kinds of potential errors.

Phil, Hrbrmstr & others: I think most data analysts realize that "Big Data" has not changed anything. All of the best practices of the old days still apply. I agree but would also add that Big Data makes life much more difficult. When there are more dimensions and more cases, there is much more room for bad analyses, e.g. each one of those could be "faked".

Rod Lucas

I think that these data analysis software teams do so much good! They really can accomplish so much.

