Why some dataviz fail
Jun 16, 2023
Maxim Lisnic's recent post should delight my readers (link). Thanks Alek for the tip. Maxim argues that charts "deceive" not merely by using visual tricks but by a variety of other non-visual means.
This is also the reasoning behind my Trifecta Checkup framework which looks at a data visualization project holistically. There are lots of charts that are well designed and constructed but fail for other reasons. So I am in agreement with Maxim.
He analyzed "10,000 Twitter posts with data visualizations about COVID-19", and found that 84% are "misleading" while only 11% of the 84% "violate common design guidelines". I presume he created some kind of computer program to evaluate these 10,000 charts, and he compiled some fixed set of guidelines that are regarded as "common" practice.
***
Let's review Maxim's examples in the context of the Trifecta Checkup.
The first chart shows Covid cases in the U.S. in July and August of 2021 (presumably the time when the chart was published) compared to a year ago (prior to the vaccination campaign).
Maxim calls this cherry-picking. He's right - and this is a pet peeve of mine, even with all the peer-reviewed scientific research. In my paper on problems with observational studies (link), my coauthors and I call for a new way forward: researchers should put their model calculations up on a website which is updated as new data arrive, so that we can be sure that the conclusions they published apply generally to all periods of time, not just the time window chosen for the publication.
Looking at the pair of line charts, readers can quickly discover its purpose, so it does well on the Q(uestion) corner of the Trifecta. The cherry-picking relates to the link between the Question and the Data, showing that this chart suffers from subpar analysis.
In addition, I find that the chart also misleads visually: the two vertical scales are completely different. The scale on the left chart spans about 60,000 cases, while the scale on the right spans about double that amount.
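One way to avoid mismatched scales is to compute a single vertical range from both panels before drawing either. The snippet below is a minimal sketch of that idea; the function name `shared_y_limits`, the `step` rounding parameter, and the case counts are all made up for illustration and are not taken from the chart in question.

```python
import math


def shared_y_limits(panels, step=10_000):
    """Return one (low, high) axis range that fits every panel.

    `panels` is a list of numeric series, one per chart.
    The upper limit is the overall maximum rounded up to a multiple of `step`.
    """
    top = max(max(series) for series in panels)
    return 0, math.ceil(top / step) * step


if __name__ == "__main__":
    # Hypothetical weekly case counts, NOT the data behind the tweet.
    last_year = [20_000, 35_000, 60_000]
    this_year = [15_000, 70_000, 120_000]
    low, high = shared_y_limits([last_year, this_year])
    print(f"Use the range {low} to {high} on both charts")
```

Applying the same range to both charts means a line that sits twice as high really does represent twice as many cases.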
Thus, I'd call this a Type DV chart, offering opportunities to improve in two of the three corners.
***
The second chart cited by Maxim plots a time series of all-cause mortality rates (per 100,000 people) from 1999 to 2020 as columns.
The designer does a good job drawing our attention to one part of the data - that the average increase in all-cause mortality rate in 2020 over the previous five years was 15%. I also like the use of a different color for the pandemic year.
Then, the designer lost the plot. Instead of drawing a conclusion based on the highlighted part of the data, s/he pushed a story that the 2020 rate was about the same as the 2003 rate. If that was the main message, then instead of computing a 15% increase relative to the past five years, s/he should have shown how the 2003 and 2020 levels are the same!
On a closer look, there is a dashed teal line on the chart but the red line and text completely dominate our attention.
This chart is also Type DV. The intention of the designer is clear: the question is to put the jump in all-cause mortality rate in a historical context. The problem lies again with subpar analysis. In fact, if we take the two insights from the data, they both show how serious a problem Covid was at the time.
When the rate returned to the level of 2003, we effectively gave up, in a few months, all the gains made over 17 years.
Besides, a jump of 15% from one year to the next is highly significant when compared with all the other year-to-year changes shown on the chart.
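To judge whether a single jump stands out, one can list every year-to-year change and compare. Here is a rough sketch of that calculation; the function name `yearly_pct_changes` is my own, and the mortality rates are invented placeholders, not the values plotted on the chart.

```python
def yearly_pct_changes(rates):
    """Percent change from each value to the next one in the series."""
    return [
        (current - previous) / previous * 100
        for previous, current in zip(rates, rates[1:])
    ]


if __name__ == "__main__":
    # Invented rates per 100,000 for illustration only.
    rates = [800, 790, 795, 785, 900]
    for change in yearly_pct_changes(rates):
        print(f"{change:+.1f}%")
```

With real data, the final change can then be set against the spread of the earlier ones.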
***
The next section concerns a common misuse of charts to suggest causality when the data could only indicate correlation (and where the causal interpretation appears to be dubious). I may write a separate post about this vast topic in the future. Today, I just want to point out that this problem is acute with any Covid-19 research, including official ones.
***
I find the fourth section of Maxim's post to be less convincing. In the following example, the tweet includes two charts, one showing proportion of people vaccinated, and the other showing the case rate, in Iceland and Nigeria.
This data visualization is poor even on the V(isual) corner. The first chart includes lots of countries that are irrelevant to the comparison. It includes the unnecessary detail of fully versus partially vaccinated, unnecessary because the two countries selected are at two ends of the scale. The color coding is out of sync between the two charts.
Maxim's critique is:
The user fails to account, however, for the fact that Iceland had a much higher testing rate—roughly 200 times as high at the time of posting—making it unreasonable to compare the two countries.
And the section is titled "Issues with Data Validity". It's really not that simple.
First, while the differential testing rate is one factor that should be considered, this factor alone does not account for the entire gap. Second, this issue alone does not disqualify the data. Third, if testing rate differences should be used to invalidate this set of data, then all of the analyses put out by official sources lauding the success of vaccination should also be thrown out since there are vast differences in testing rates across all countries (and also across different time periods for the same country).
One typical workaround for differential testing rates is to look at deaths rather than cases. For the period of time plotted on the case curve, Nigeria's cumulative deaths per million is about 1/8th that of Iceland. The real problem is again in the Data analysis, and it is about how to interpret this data causally.
This example is yet another Type DV chart. I'd classify it under problems with "Causal Inference". "Data Validity" is definitely a real concern; I just don't find this example convincing.
***
The next section, titled "Failure to account for statistical nuance," is a strange one. The example is a chart that the CDC puts out showing the emergence of cases in a specific county, with cases classified by vaccination status. The chart shows that the vast majority of cases were found in people who were fully vaccinated. The person who tweeted concluded that vaccinated people are the "superspreaders". Maxim's objection to this interpretation is that most cases are in the fully vaccinated because most people are fully vaccinated.
I don't think it's right to criticize the original tweeter in this case. If by superspreader, we mean people who are infected and out there spreading the virus to others through contacts, then what the data say is exactly that most such people are fully vaccinated. In fact, one should be very surprised if the opposite were true.
Indeed, this insight has major public health implications. If the vaccine is indeed 90% effective at stopping cases, we should not be seeing that level of cases. And if the vaccine is only moderately effective, then we may not be able to achieve "herd immunity" status, as was the plan originally.
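The arithmetic behind this argument can be sketched in a few lines. The snippet assumes a deliberately simple model: vaccinated and unvaccinated people face the same exposure and are tested at the same rate, and effectiveness is the fractional reduction in the chance of becoming a case. The function name `vaccinated_share_of_cases` and the 90% coverage figure are hypothetical, chosen only to illustrate the reasoning; they are not numbers from the CDC chart.

```python
def vaccinated_share_of_cases(coverage, effectiveness):
    """Expected fraction of all cases that occur in vaccinated people.

    coverage: fraction of the population that is vaccinated (0 to 1).
    effectiveness: fractional reduction in case risk from the vaccine (0 to 1).
    Assumes equal exposure and equal testing in both groups.
    """
    if not 0 <= coverage <= 1 or not 0 <= effectiveness <= 1:
        raise ValueError("coverage and effectiveness must be between 0 and 1")
    vaccinated_cases = coverage * (1 - effectiveness)
    unvaccinated_cases = 1 - coverage
    total = vaccinated_cases + unvaccinated_cases
    if total == 0:
        raise ValueError("no cases expected under these inputs")
    return vaccinated_cases / total


if __name__ == "__main__":
    coverage = 0.9  # hypothetical: 90% of the county is vaccinated
    for effectiveness in (0.9, 0.5, 0.0):
        share = vaccinated_share_of_cases(coverage, effectiveness)
        print(f"effectiveness {effectiveness:.0%}: "
              f"{share:.0%} of cases are in the vaccinated")
```

Under these assumptions, a 90% effective vaccine with 90% coverage puts a bit under half of the cases in the vaccinated group, while a 50% effective vaccine puts roughly four-fifths there. So both readings hold: a high vaccinated share follows partly from high coverage, and a "vast majority" is hard to square with very high effectiveness against infection.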
I'd be reluctant to complain about this specific data visualization. It seems that the data allow different interpretations - some of which are contradictory but all of which are needed to draw a measured conclusion.
***
The last section on "misrepresentation of scientific results" could use a better example. I certainly agree with the message: that people have confirmation bias. I have been calling this "story-first thinking": people with a set story visualize only the data that support their preconception.
However, the example given is not that. The example shows a tweet that contains a chart from a scientific paper that apparently concludes that hydroxychloroquine helps treat Covid-19. Maxim adds that this study was subsequently retracted. If the tweet was sent prior to the retraction, then I don't think we can grumble about someone citing a peer-reviewed study published in The Lancet.
***
Overall, I like Maxim's message. In some cases, I think there are better examples.