The coronavirus crisis has laid bare something that was at the core of my second book, Numbersense (link): the inutility of raw data.
Statisticians have sometimes been accused - by outsiders - of "manipulating" data. We call it statistical adjustment. The adjustment is designed to correct for biases or imperfections in the raw data. The adjusted data are easier to understand. The raw data, on the other hand, cannot be interpreted without attaching a list of caveats and assumptions.
I have written a bunch of posts on this issue, starting with the "Eight Unanswered Questions" post, written before the U.S. took the crisis seriously, in which I called attention to the issue of testing, calling it the elephant in the room. The mainstream media has finally caught on to this.
But that is not the only issue. There is insufficient attention to several of the other issues brought up in that post. For example, people like to say the death rate is over-estimated because we are under-counting mild cases (due to triage testing). But they forget that death rate and infection rate are tightly coupled. One cannot count mild cases in computing the death rate but exclude them when calculating the infection rate. When mild cases are finally added to the statistics, the infection rate will increase.
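To make the coupling concrete, here is a minimal sketch of the arithmetic in Python. Every number in it (the population, the deaths, the confirmed and uncounted cases) is hypothetical and chosen only for illustration; none of it comes from actual coronavirus data.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
population = 1_000_000
deaths = 100
confirmed_cases = 2_000        # cases found through triage testing
uncounted_mild_cases = 8_000   # assumed mild cases that were never tested


def rates(cases, deaths, population):
    """Return (death rate among cases, infection rate in the population)."""
    return deaths / cases, cases / population


before = rates(confirmed_cases, deaths, population)
after = rates(confirmed_cases + uncounted_mild_cases, deaths, population)

print(f"confirmed only:  death rate {before[0]:.1%}, infection rate {before[1]:.1%}")
print(f"with mild cases: death rate {after[0]:.1%}, infection rate {after[1]:.1%}")
```

With these made-up inputs, adding the mild cases cuts the death rate from 5.0% to 1.0% while lifting the infection rate from 0.2% to 1.0%. The same cases sit in the denominator of one rate and the numerator of the other, so one cannot move without the other.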
Also, I pointed out the need to look not just at counts of the dead but also intermediate metrics, such as hospitalizations, use of ICU/ventilators, and so on.
Further concerns are test accuracy and the noise coming from the backdrop of the common cold and influenza.
If you're not following my dataviz blog, I also want to point you to yesterday's post, where I outlined all the hidden factors that should inform our interpretation of case counts. The amount of testing is only the start. There are many other things to worry about.
For example, I explain why it is ridiculous to compare the number of tests done in South Korea and in the United States today. I heard this point made by a politician last night on CNN; somehow this comparison proved that the U.S. has finally equalled or surpassed South Korea's testing program. You can read about this and other factors here.
***
These are a few other old posts about statistical adjustments:
This post explains the idea of seasonal adjustments. Seasonal adjustments are extremely popular in economics data.
This post about age adjustment is very relevant to the current situation. Andrew Gelman has a wonderful example of how to do it.
This recent post explains survey weighting. Results from polls, surveys, and market research are never raw data. "Thirty percent of people say this and that" usually does not literally mean that thirty percent of the respondents to the poll said so.
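As a rough illustration of the survey weighting point, the sketch below applies post-stratification, one simple form of weighting, to a made-up poll. The groups, respondent counts, and population shares are all hypothetical, and real pollsters typically weight on many more variables than this.

```python
# Hypothetical poll in which respondents skew older than the population.
# group: (respondents, respondents answering "yes", share of the population)
groups = {
    "under 40": (200, 120, 0.50),
    "40 and over": (800, 240, 0.50),
}

total_respondents = sum(n for n, _, _ in groups.values())

# Raw figure: the literal share of respondents who said "yes".
raw_yes = sum(yes for _, yes, _ in groups.values()) / total_respondents

# Post-stratification: weight each group's "yes" rate by its population share.
weighted_yes = sum(share * yes / n for n, yes, share in groups.values())

print(f"raw: {raw_yes:.0%}, weighted: {weighted_yes:.0%}")
```

In this toy poll, 36% of respondents literally said "yes", but the weighted figure is 45%, because the under-represented younger group counts for more once the sample is rebalanced to match the population.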
In all these cases, the adjusted data allow straightforward interpretations. A public health warning may be necessary: