As I am preparing another blog post about the pandemic, I came across the following data graphic, recently produced by the CDC for a vaccine advisory board meeting:
This is not an example of effective visual communications.
For one thing, readers are directed to scour the footnotes to figure out what's going on. If we ignore those for the moment, we see clusters of bubbles that have remained pretty stable from December 2020 to August 2021. The data concern some measure of Americans' intent to take the COVID-19 vaccine. That much we know.
There may have been a bit of an upward trend between January and May, although if you were shown the clusters for December, February and April, you'd think the trend's been pretty flat.
But those colors? What could they represent? You'd surely have to fish this one out of the footnotes. Specifically, this obtuse sentence: "Surveys with multiple time points are shown with the same color bubble for each time point." I had to read it several times. I think it simply means "Color represents the pollster."
Then it adds: "Surveys with only one time point are shown in gray." which simply means "All pollsters who have only one entry in the dataset are grouped together and shown in gray."
Another problem with this chart is over-plotting. Look at the July cluster. It's impossible to tell how many polls were conducted in July because the circles pile on top of one another.
The appearance of the flat trend is a result of two unfortunate decisions made by the designer. If I retained the chart form, I'd have produced something that looks like this:
The first design choice is to expand the vertical axis to range from 0% to 100%. This effectively squeezes all the bubbles into a small range.
The second design choice is to enlarge the bubbles causing copious amount of overlapping.
In particular, this decision blows up the Pew poll (big pink bubble) that contained 10 times the sample size of most of the other polls. The Pew outcome actually came in at 70% but the top of the pink bubble extends to over 80%. Because of this, the outlier poll of December 2020 - which surprisingly printed the highest number of all polls in the entire time window - no longer looks special.
Now, let's see what else we can do to enhance this chart.
I don't like how bubble size is used to encode the sample size. It creates a weird sensation for anyone who's familiar with sampling errors, and confidence regions. The Pew poll with 10 times the sample size is the most reliable poll of them all. Reliability means the error bars around the Pew poll outcome is the smallest of them all. I tend to think of the area around a point estimate as showing the sampling error so the Pew poll would be a dot, showing the high precision of that estimate.
But that won't work because larger bubbles catch more of the reader's attention. So, in the following version, all dots have the same size. I encode reliability in the opacity of the color. The darker dots are polls that are more reliable, that have larger sample sizes.
Two of the pollsters have more frequent polling than others. In this next version, I highlighted those two, which reveals the trend better.