The following articles discuss the behind-the-scenes process of preparing data for analysis. They point to the "garbage in, garbage out" problem: an analysis is only as good as the data behind it, so one should always be aware of the potential hazards.
"The murky world of student-loan statistics", Felix Salmon (link)
At the end of his post, Felix found it remarkable that the government would not have better access to the data. The same sentiment was expressed at a recent presentation by the data team at Bundle.com, in which they described the strenuous process by which they matched the names of merchants on credit card statements to a database of known merchants. One would think the credit card companies would be able to pass along the merchant identifiers, but they don't or can't.
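To give a feel for why this kind of matching is laborious, here is a minimal sketch of one common approach, approximate string matching. It is not Bundle.com's actual method, which the presentation summary does not specify. The merchant list, the `normalize` and `match_merchant` helpers, and the 0.6 similarity cutoff are all hypothetical, chosen only for illustration; the matching itself uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical database of known merchants (illustrative names only).
KNOWN_MERCHANTS = ["Starbucks", "Whole Foods Market", "Shell Oil", "Amazon"]


def normalize(raw):
    """Lowercase the text and drop digits and punctuation (store numbers, etc.)."""
    kept = "".join(ch if ch.isalpha() or ch == " " else " " for ch in raw.lower())
    return " ".join(kept.split())


def match_merchant(statement_text, cutoff=0.6):
    """Return the closest known merchant, or None if nothing is similar enough.

    The cutoff is an assumed threshold; real data would need tuning and
    manual review of borderline cases.
    """
    lookup = {normalize(name): name for name in KNOWN_MERCHANTS}
    hits = difflib.get_close_matches(
        normalize(statement_text), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


if __name__ == "__main__":
    for text in ["STARBUCKS 0423", "WHOLE FOODS MKT 10", "ZZZZ QQQQ"]:
        print(text, "->", match_merchant(text))
```

Even in this toy version, the result depends on the cleanup rules and the cutoff, which is exactly where errors can creep into the data before any analysis begins.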
"European debt: the big picture", Simon Johnson (link)
Simon points out that while the New York Times did a fantastic job with its visualization of the European debt linkages, one should notice what is missing from the chart: the murky world of derivatives. Without that information, we cannot know how exposed U.S. banks are to this potentially devastating problem.


I have some data that was previously analysed using a clustering algorithm. I suspect that one of the resulting classes consists of subjects with data entry errors. Unfortunately, the data is from 40 years ago, so it is impossible to check. A more robust clustering method removes that cluster, so I expect that I am right.
Posted by: Ken | 10/27/2011 at 05:09 PM