How to read this chart about coronavirus risk
Feb 03, 2020
In my just-published Long Read article at DataJournalism.com, I touched upon the subject of "How to Read this Chart".
Most data graphics do not come with directions for use because dataviz designers follow certain conventions. We do not need to tell you, for example, that time runs left to right on the horizontal axis (substitute right to left for those living in right-to-left countries). It's when we deviate from the norms that a "How to Read this Chart" box is called for.
***
A discussion over Twitter during the weekend on the following New York Times chart perfectly illustrates this issue. (The article is well worth reading to educate oneself on this red-hot public-health issue. I made some comments on the sister blog about the data a few days ago.)
Reading this chart, I quickly grasp that the horizontal axis is the speed of infection and the vertical axis represents the deadliness. Without being told, I used the axis labels (and some of you might notice the annotations with the arrows on the top right). But most people will likely miss - at a glance - that the vertical axis uses a log scale while the horizontal axis is linear (regular).
The effect of a log scale is to pull the large numbers toward the average while spreading the smaller numbers apart - when compared to a linear scale. So when we look at the top of the coronavirus box, it appears that this virus could be as deadly as SARS.
The height of the pink box is 3.9, while the gap between the top edge of the box and the SARS dot is 6. Yet our eyes tell us the top edge is closer to the SARS dot than it is to the bottom edge!
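To see why the eye is fooled, here is a minimal sketch comparing the two distances on a linear scale and on a log scale. The three values (0.1, 4.0 and 10.0) are assumptions chosen only to match the 3.9 and 6 quoted above; they are not read off the chart.

```python
import math

# Assumed values, chosen only so that the differences match the
# 3.9 and 6 quoted in the text; they are not taken from the chart.
box_bottom, box_top, sars = 0.1, 4.0, 10.0

# Distances as measured on a linear scale
linear_box = box_top - box_bottom   # height of the box
linear_gap = sars - box_top         # gap from box top to the SARS dot

# The same distances as drawn on a log (base 10) scale
log_box = math.log10(box_top) - math.log10(box_bottom)
log_gap = math.log10(sars) - math.log10(box_top)

print(f"linear: box = {linear_box:.1f}, gap = {linear_gap:.1f}")
print(f"log:    box = {log_box:.2f}, gap = {log_gap:.2f}")
```

Under these assumed values, the gap is the longer distance on the linear scale but by far the shorter one on the log scale, which is the reversal our eyes pick up.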
There is nothing inaccurate about this chart - the log scale introduces such distortion. The designer has to make a choice.
Indeed, there were two camps on Twitter, arguing for and against the log scale.
***
I use log scales a lot in analyzing data, but tend not to use log scales in a graph. It's almost a given that using the log scale requires a "How to Read this Chart" message. And the NY Times crew delivers!
Right below the chart is a paragraph:
To make this even more interesting, the horizontal axis is a hidden "log" scale. That's because infections spread exponentially. Even though the scale is not labeled "log", think as if the large values have been pulled toward the middle.
Here is an over-simplified way to see this. A disease that spreads at a rate of 15 people at a time is not 3 times worse than one that spreads 5 at a time. In the former case, the first sick person transmits it to 15, and then each of the 15 transmits it to 15 others; thus, after two steps, 241 people have been infected (225 + 15 + 1). In the latter case, it's 5x5 + 5 + 1 = 31 infections after two steps. So at this point, the number of infected is already about 8 times worse, not 3 times. And the gap keeps widening with each step.
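The same over-simplified arithmetic can be sketched in a few lines of Python. The function name total_infected is made up for this illustration, and the model assumes every sick person infects exactly the same number of new people at each step, which no real outbreak does.

```python
def total_infected(rate, steps):
    """Cumulative infections after `steps` rounds of transmission,
    counting the first sick person (the rate ** 0 term)."""
    return sum(rate ** k for k in range(steps + 1))

# Compare the two hypothetical diseases from the text, step by step
for steps in (1, 2, 3):
    fast = total_infected(15, steps)
    slow = total_infected(5, steps)
    print(f"after {steps} step(s): {fast} vs {slow}, ratio {fast / slow:.1f}")
```

At two steps this reproduces the 241 and 31 above, and adding a third step shows the ratio continuing to grow.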
P.S. See also my post on the sister blog that digs deeper into the metrics.
"There is nothing inaccurate about this chart" -- well having a zero on a log scale can't be right. They've since removed it, but now the dots on the bottom of the chart seem suspect.
Posted by: Xan Gregg | Feb 03, 2020 at 02:18 PM
The graph is pretty and easily understandable but the numbers it is based on are currently highly suspect as they are only from China, which has a vested interest in undercounting cases.
We should know more in a few weeks, as it hits (or doesn't hit) the western world.
Posted by: Liz | Feb 03, 2020 at 05:31 PM
Liz: I added the link to my other blog post, which addresses how to think about these numbers. And yes, I agree that given infections are happening outside China, what's happening to cases outside should be followed closely - that's the only way to resolve whether or not the Chinese data are underreported. We don't seem to have much information either on how severe the cases outside China are, how long those people have been sick, etc., but so far, it seems like almost all fatalities are in China.
Posted by: Kaiser | Feb 04, 2020 at 12:13 AM