This is the promised second post in reaction to Phil's piece on Andrew's blog about dealing with dirty, complex climate data. In a prior post, I considered the issue of a perverse incentive in data processing, and showed how it also affects credit reporting and scoring.
***
At the end of his post, Phil surfaces a topic that will clearly irk some -- when there is a gap between the data and the model, should one fix the data or fix the model?
Since I have written about this topic here before, in a post called "False belief in true models" about predicting Olympic medals, you might guess that my first reaction was: fix the model, the data is reality! By contrast, Phil indicated that it is often prudent for climate scientists to fix the data to bring it closer to the model. How might one reconcile the two points of view?
The reason for my post on "false belief in true models" was my displeasure with many business and economics folks who talk incessantly about over- or under-performance relative to a "model". For example, the employment statistic is said to have done better than "expected" even when the growth in jobs did not keep up with population growth, meaning the nation was actually worse off. This type of statement is tantamount to saying the model is always true, but a statistical model can never be true.
Then, Phil brought up this scary prospect:
The models are close to being correct. In this case, gross discrepancies between data and models will indicate problems with the data. Fixing those problems will lead to data that are in better agreement with the models.
In effect, he is saying Data is Not Reality. Uh oh.
He explained further:
When the data are complicated ... then it's not necessarily a surprise to find problems with the data, and to find that when those problems are fixed, the result is better agreement with a model.
I fully understand what he means. The data environment faced by climate scientists is many orders of magnitude more complex than the one faced by, say, businesses. If I need to count the number of gadgets sold through a website, those transactions are recorded, and the data is relatively clean. Climate data is extremely hard to collect... Phil talked about thermometers installed on 3,000 undersea robots, for example. The errors, such as failing to notice that satellites had been moved closer to earth, are not easy to catch, since presumably the data analysts were not the ones who ordered the satellites to be repositioned.
***
In other words, adjusted data is reality; unadjusted data is not. I should have made this clear. In statistical analysis, the first step is to inspect the data and correct any errors. It is best if such data cleansing is completed before the analysis begins; what Phil appears to be saying is that some of these errors are so subtle that they come to light only when the data are compared to a reasonable model.
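To make this concrete, here is a minimal sketch of that kind of model-assisted screening, with made-up numbers: fit a deliberately simple model to a series of readings, then flag observations whose residuals are implausibly large so that a human can check whether they reflect a collection or processing error. The data, the linear-trend "model", and the threshold are all illustrative assumptions, not anyone's actual procedure.

```python
import numpy as np

# Illustrative monthly readings; in practice these would come from
# station or satellite records.
rng = np.random.default_rng(0)
months = np.arange(120)
readings = 0.01 * months + rng.normal(0, 0.1, size=months.size)
readings[60] += 1.5  # a subtle processing error, injected for illustration

# Fit a deliberately simple "reasonable model": a linear trend over time.
slope, intercept = np.polyfit(months, readings, deg=1)
residuals = readings - (slope * months + intercept)

# Flag readings that sit far outside the scatter the model expects.
# These are candidates for inspection, not automatic deletion.
threshold = 4 * residuals.std()
suspects = np.where(np.abs(residuals) > threshold)[0]
print("Check these observations by hand:", suspects)
```

The flag should trigger an inspection, not an automatic adjustment; the discrepancy could just as easily be the model's fault as the data's.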
Having read his thoughtful piece, I again feel that climate scientists are not giving themselves enough credit. I don't think it is correct to describe the data cleansing activity as "bringing the data closer to the model". He should instead describe it as correcting obvious errors in the data, or reducing measurement error.
But are the climate scientists always correcting obvious errors in the data and reducing measurement error when they "clean" the data? In many--probably most--cases, they are, but the mismanagement of some of the data and the politicking in the climate research community do raise doubts--in my mind, at least. When you combine opacity of methods and lack of reproducibility in climate data management with the sensitivity of climate models to their inputs--to say nothing of the incentives resulting from politicization--you're leaning pretty hard on the integrity and infallibility of the climate scientists.
Posted by: Mmanti | 04/07/2010 at 07:52 AM
Don Wheeler has made similar points many times in his books and articles. His advice, like yours above, is to first check the data for homogeneity (using process behavior charts). Coupled with this is his admonition that all outliers are evidence... though the evidence may point to problems in data collection rather than in the process being studied.
I think that this first step in data analysis is under-appreciated, even among the scientific community. Perhaps the techniques for checking data are not taught in a rigorous way, and as a result, everyone cleans their data out of necessity, but no one is comfortable defending the process.
Posted by: Tom | 04/10/2010 at 02:31 AM
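For readers who have not seen Wheeler's process behavior charts, here is a minimal sketch of the XmR (individuals and moving range) check Tom mentions, with made-up numbers; the 2.66 scaling of the average moving range is the standard constant for individuals charts, and a point outside the limits is treated as evidence worth investigating, not as a reading to be discarded.

```python
import numpy as np

# Illustrative measurements; in practice, readings from the process under study.
x = np.array([10.2, 10.4, 10.1, 10.3, 10.5, 13.0, 10.2, 10.4, 10.3, 10.1])

# Individuals (X) chart limits derived from the average moving range.
moving_range = np.abs(np.diff(x))
center = x.mean()
mr_bar = moving_range.mean()
upper = center + 2.66 * mr_bar   # standard constant for an XmR chart
lower = center - 2.66 * mr_bar

# Points outside the limits signal non-homogeneity: either the process
# shifted or something went wrong in data collection.
signals = np.where((x > upper) | (x < lower))[0]
print("Investigate observations:", signals)
```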