The epidemic of simple comparisons

Another day, another sloppy chart featured on TV news, sent in by a Twitter user. This CNN graphic comes from Hugo K. by way of Kevin T.

And it's another opportunity to apply the self-sufficiency test.


Like before, I removed the data printed on the graphic. In reading this chart, we'd like to know the number of U.S. reported cases of coronavirus relative to China, and Italy relative to the U.S.

So, our eyes trace these invisible lines:


U.S. cases are roughly two-thirds of China's, while Italian cases are 90% of the U.S. total.

That's what the visual elements, the columns, are telling us. But it's fake news. Here is the chart with the data:


The counts of reported cases in all three countries were neck and neck around this time.

What this quick exercise shows is that anyone who correctly reads this chart is reading the data off the chart, and ignoring the contradictory message sent by the relative column heights. Thus, the visual elements are not self-sufficient in conveying the message.


In a Trifecta Checkup, I'd be most concerned about the D corner. The naive comparison of these case counts is an epidemic of its own. It sometimes leads to poor decisions that can exacerbate the public-health problems. See this post on my sister blog.

The difference in case counts between different countries (or regions or cities or locales) is not a direct measure of the difference in coronavirus spread in these places! This is because there are many often-unobserved factors that will explain most if not all of the differences.

After a lot of work by epidemiologists, medical researchers, statisticians and the like, we now realize that different places conduct different numbers of tests. No test, no positive. The U.S. has been slow to get testing ramped up.

Less understood is the effect of testing selection. Consider the U.S., where it is still hard to get tested: only those who meet a list of criteria are eligible. Imagine an alternative reality in which the U.S. conducted the same number of tests but, instead of selecting the people most likely to be infected, tested a random sample of the population. The incidence of the virus in a random sample is much lower than among the severely symptomatic; therefore, in this alternative reality, the number of positives would be lower despite an equal number of tests.
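This selection effect is easy to see in a minimal simulation. All of the numbers below (population size, prevalence, test count, and the 40% "enrichment" that targeting criteria achieve) are made up for illustration:

```python
import random

random.seed(0)

POPULATION = 1_000_000    # hypothetical population
INFECTED = 10_000         # hypothetical true infections (1% prevalence)
TESTS = 5_000             # same number of tests under both strategies

people = [1] * INFECTED + [0] * (POPULATION - INFECTED)

# Strategy A: test a random sample -- positives reflect the 1% base rate.
positives_random = sum(random.sample(people, TESTS))

# Strategy B: targeted testing -- assume the eligibility criteria produce
# a tested pool in which 40% are truly infected (an assumed enrichment).
positives_targeted = sum(1 for _ in range(TESTS) if random.random() < 0.40)

print(positives_random, positives_targeted)
```

With identical testing capacity, the targeted strategy reports roughly forty times as many positives, purely because of who gets tested.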

That's with an equal number of tests. If test kits are readily available, then a targeted (triage) testing strategy will under-count cases, since mild or asymptomatic infections escape attention. (See my Wired column for problems with triage.)

To complicate things even more, in most countries, the number of tests and the testing selection have changed over time so a cumulative count statistic obscures those differences.

Besides testing, there are a host of other factors that affect reported case counts. These are less talked about now but eventually will be.

Different places have different population sizes. A lot of cases in a big city and an equal number of cases in a small town do not signify equal severity: clearly, the situation in the latter is more serious.
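A quick per-capita calculation makes the point; the populations and case counts here are invented:

```python
# Equal case counts, unequal populations: per-capita rates differ hugely.
places = {
    "big city":   {"population": 8_000_000, "cases": 500},
    "small town": {"population": 20_000,    "cases": 500},
}

rates = {}
for name, p in places.items():
    rates[name] = p["cases"] / p["population"] * 100_000  # per 100,000
    print(f"{name}: {rates[name]:,.0f} cases per 100,000")
```

The same 500 cases translate to about 6 per 100,000 in the big city but 2,500 per 100,000 in the small town, a four-hundred-fold difference in severity that raw counts hide.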

Because the virus affects age groups differently, a direct comparison of the case counts without adjusting for age is also misleading. The number of deaths of 80-year-olds in a college town is low not because the chance of dying from COVID-19 is lower there than in a retirement community; it's low because 80-year-olds are a small proportion of the population.
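To see how much the age mix alone can move the numbers, here is a sketch with two hypothetical towns of equal size sharing identical age-specific fatality rates (all figures invented):

```python
# Same age-specific fatality rates, different age mixes (made-up numbers).
fatality = {"under_65": 0.001, "65_plus": 0.05}

towns = {
    "college town":         {"under_65": 48_000, "65_plus": 2_000},
    "retirement community": {"under_65": 5_000,  "65_plus": 45_000},
}

deaths = {}
for name, ages in towns.items():
    # Expected deaths = sum over age groups of (rate x group size).
    deaths[name] = sum(fatality[g] * ages[g] for g in ages)
    print(name, round(deaths[name]))
```

Both towns have 50,000 residents and face identical risks within each age group, yet the retirement community records over fifteen times as many deaths. Comparing raw death counts without age adjustment would wrongly suggest the virus is deadlier there.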

Next, the cumulative counts ignore which stage of the "epi curve" these countries are at. The following chart can replace most of the charts you're inundated with by the media:


(I found the chart here.)

An epi curve traces the timeline of a disease outbreak. Every location is expected to move through the stages, with cases reaching a peak, and eventually the number of newly recovered exceeding the number of newly infected.
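These stages can be sketched with a bare-bones SIR (susceptible-infected-recovered) model; the transmission and recovery rates below are assumed, not estimated from any real outbreak:

```python
# A minimal SIR model tracing one epi curve: active infections rise,
# peak, then decline as recoveries overtake new infections.
beta, gamma = 0.3, 0.1            # assumed daily transmission / recovery rates
S, I, R = 0.999, 0.001, 0.0       # fractions of the population
infections = []
for day in range(200):
    new_inf = beta * S * I        # new infections this day
    new_rec = gamma * I           # new recoveries this day
    S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
    infections.append(I)

peak_day = infections.index(max(infections))
print(peak_day, round(max(infections), 3))
```

Two locations sitting at different days on this curve will show wildly different cumulative counts even if their underlying epidemics are identical, which is exactly why stage-blind comparisons mislead.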

Notice that China, Italy and the U.S. occupy different stages of this curve.  It's only proper to compare the U.S. to China and Italy when they were at a similar early phase of their respective epi curves.

In addition, any cross-location comparison should account for how reliable the data sources are, and the different definitions of a "case" in different locations.


Finally, let's consider the Question posed by the graphic designer. It is the morbid one: which country has been hit the worst by the coronavirus?

This is a Type DV chart. It's got a reasonable question, but the data require a lot more work to adjust for the list of biases. The visual design is hampered by the common mistake of not starting columns at zero.
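The distortion from a non-zero baseline is mechanical and easy to quantify. The case counts and axis start below are hypothetical, chosen only to illustrate how "neck and neck" data can be drawn as lopsided columns:

```python
# How a non-zero baseline exaggerates differences between columns.
cases = {"China": 81_000, "US": 78_000, "Italy": 74_000}  # hypothetical
baseline = 60_000                  # axis starts here instead of at zero

top = max(cases.values())
for country, n in cases.items():
    true_ratio = n / top                             # what the data say
    drawn_ratio = (n - baseline) / (top - baseline)  # what the bars show
    print(f"{country}: true {true_ratio:.2f}, drawn {drawn_ratio:.2f}")
```

Here Italy's count is 91% of China's, but its column is drawn at only two-thirds of China's height. Any reader judging by column heights alone walks away with the wrong comparison.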


If we report it, it's a fact

David Leonhardt wrote in the NYT of a shocking incident of statistical abuse committed by Lou Dobbs and the CNN crew.

On several recent occasions, while commenting on the red-hot immigration issue, Lou and company remarked that "there had been 7,000 cases of leprosy in this country over the previous three years, far more than in the past".  (Leprosy is a disfiguring bacterial disease said to be prevalent among immigrants, particularly of Asian or Latin American origin.)

When asked about fact-checking, Lou reportedly said: "If we reported it, it's a fact."  A quick visit to the government's leprosy program website immediately reveals the time-series chart, shown on the left.  With annual counts at about 150 in the last 5 years or so, one is hard pressed to find the 7,000 alleged cases!

Furthermore, because this chart offers no basis for comparison, we fail to see that 150 cases out of a population of 300 million represent a minuscule risk.
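The arithmetic behind "minuscule" is worth making explicit:

```python
# 150 annual cases in a population of 300 million, expressed as a risk.
cases, population = 150, 300_000_000
risk = cases / population
print(f"annual risk: {risk:.1e}  (about 1 in {round(1 / risk):,})")
```

That is an annual risk on the order of one in two million, a number the raw count of 150 cases never communicates on its own.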

A slight downward trend is evident over the last 20 years or so; this record is even more impressive when we recall that the population grew during this period.  These points could be made clearer in multivariate plots.

Source: "Truth, Fiction and Lou Dobbs", New York Times, May 30, 2007; U.S. National Hansen's Disease web-site.


Finding dots

Erik W. alerted me to this CNN map that shows FBI statistics about the safety of American cities.  As Erik pointed out, this is prototypical chartjunk a la Tufte.  A lot of ink is used to depict 12 points of data (the top 3 cities in safety, crime, improvement and decline).

Imagine the reader trying to find the 3rd most improved city.  She either has to find all the blue dots and then figure out which is #3, or she needs to find all the #3 dots and figure out which is blue.  As they say, it's "hard work".  In fact, finding the dots among the forest of large text is hard work by itself!

How would I re-make this chart?

  • Highlight only the states containing data (California, Michigan, Missouri, Ohio, Georgia, New Jersey, New York); gray out all other states and their boundaries
  • Separate the states from the cities; write the state name only once for each state; reduce the font size
  • Instead of dots, use numbers.  So the most dangerous city (St Louis) gets a red "1", Oakland gets a purple "3", etc.
  • Remove Mexico, Canada and water from the map

The map gives the false impression that crime is relevant only along the coasts and the lakes, when in fact, the map is just saying that most cities in the U.S. are located along the coasts and the lakes.  Using such a map to depict city-level statistics creates distortion because cities are not evenly distributed across America.

Beyond that, what is the point of this map?  Is it merely a geography class telling us where each city is located?  How is it better than a simple table listing the cities in order?   

Reference: "U.S. City Safety Rankings", CNN, 2006.