One fundamental tenet in statistical thinking concerns the signal-to-noise ratio.
Given a mass of data, some portion of it (perhaps most of it) is "noise". Noise is nuisance. Imagine standing in an ear-splitting bar, trying to hear what your friend is saying from only a foot away. Noise covers up the signal; noise is the greatest enemy of the statistician.
Noise is everywhere. I didn't explicitly mention this in the book but noise is also everywhere in its pages. When I spoke in Chapter 2 of "sporadic" E. coli cases making it difficult to know if an outbreak is occurring, those sporadic cases were noise. When I explained in Chapter 5 why we ought not fear plane crashes happening in bunches, it was because statisticians showed the coincidence is noise, nothing to be alarmed about. (One way to think about statistical tests is that they evaluate the signal-to-noise ratio.)
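To make that last parenthetical concrete, here is a minimal sketch of one common test statistic, the one-sample t statistic, written as a signal divided by noise. The weekly case counts, the baseline of 5, and the function name `t_statistic` are all made up for illustration; none of them come from the book.

```python
import math
import statistics


def t_statistic(sample, null_mean=0.0):
    """One-sample t statistic, read as signal divided by noise."""
    n = len(sample)
    # Signal: how far the observed average sits from the expected baseline.
    signal = statistics.mean(sample) - null_mean
    # Noise: the standard error, i.e. how much the average wobbles by chance.
    noise = statistics.stdev(sample) / math.sqrt(n)
    return signal / noise


# Hypothetical weekly case counts (invented numbers, for illustration only).
baseline = 5.0
quiet_weeks = [4, 6, 5, 7, 3, 5, 6, 4]
outbreak_weeks = [9, 11, 10, 12, 8, 10, 11, 9]

print(t_statistic(quiet_weeks, baseline))
print(t_statistic(outbreak_weeks, baseline))
```

Both series have the same week-to-week scatter, so the noise term is identical. The first series averages exactly the baseline and the ratio is zero; the second averages twice the baseline and the ratio is large, which is the kind of gap a test would flag as signal.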
***
What motivated this post is Twitter. The Library of Congress plans to archive "every tweet". This generated a lot of buzz; the tech community saw this as a legitimization of Twitter.
I say blah, because there is really very little new knowledge to be found in Twitter. Almost everything is regurgitation, much of it would not survive scrutiny, and a lot of the tweets are "re-tweets". Here's Jeff Miller describing this issue, which he calls "Twitter's Garbage Problem".
In particular, because you either subscribe to someone's Twitter stream or not, you have to take in both the nonsense and the brilliant stuff. Jeff considers how one might implement noise-filtering schemes.
But he pretty much nailed the impossibility of noise filtering earlier in the post when he stated: "One person's garbage is another person's gold."
Think about "spam" for a second. A marketing email from Macy's is not always "spam" even if unsolicited - if you've been shopping for a sofa, and Macy's sends you an email offering you 50% off on their furniture collection, you will most likely not regard the email as spam. This is why spam filters are far from perfect. And Twitter filters have a similar challenge.
***
For Junk Charts readers, the signal-to-noise ratio is manifested in Tufte's data-ink ratio. The ink that shows data is the signal; the rest of the ink is the noise, roughly speaking.