Tongue in cheek but a master stroke

Andrew jumped on the Benford bandwagon to do a tongue-in-cheek analysis of numbers in Hollywood movies (link). The key graphic is this:

Gelman_hollywood_benford_2-1024x683

Benford's Law is frequently invoked to prove (or disprove) fraud with numbers by examining the distribution of first digits. Andrew extracted movies that contain numbers in their names - mostly, but not always, series of movies with sequels. The histogram above (gray columns) shows the number of movies with each first digit. The red line is the expected number if Benford's Law holds. As is typical of such analyses, the histogram is closely aligned with the red line, and therefore, he did not find any fraud.

I'll blog about my reservations about Benford-style analysis on the book blog later - one quick point is: as with any statistical analysis, we should say there is no statistical evidence of fraud (more precisely, of the kind of fraud that can be discovered using Benford's Law), which is different from saying there is no fraud.

***

Andrew also showed a small-multiples chart that breaks up the above chart by movie groups. I excerpted the top left section of the chart below:

Gelman_smallmultiples_benford

The genius in this graphic is easily missed.

Notice that the red lines (which are the expected values if Benford's Law holds) appear identical on every single plot. And then notice that the lines don't represent the same values.

It's great to have the red lines look the same everywhere because they represent the immutable Benford reference. Because the number of movies is so small, he's plotting counts instead of proportions. If you let the software decide on the best y-axis range for each plot, the red lines will look different on different charts!

You can find the trick in the R code from Gelman's blog.

First, the maximum value of each plot is set to the total number of observations. Then, the expected Benford proportions are converted into expected Benford counts. Each Benford count is then shown against an axis topping out at the total count, so what we are seeing, in relative terms, are the Benford proportions. Thus, every red line looks the same despite holding different values.
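To make the trick concrete, here is a minimal sketch in Python. It is not Gelman's R code; the function names and the panel sizes (5 and 40 movies) are hypothetical, chosen only for illustration. It shows that when the y-axis ceiling equals the panel's total count, the relative height of each expected count is just the Benford proportion, whatever the total.

```python
import math

def benford_proportions():
    """Expected share of each first digit 1..9 under Benford's Law."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]

def benford_panel(n_movies):
    """Expected counts for one panel, plus a y-axis ceiling set to the panel total."""
    expected_counts = [n_movies * p for p in benford_proportions()]
    y_max = n_movies
    return expected_counts, y_max

def relative_heights(n_movies):
    """Height of each expected count as a fraction of the axis ceiling."""
    expected_counts, y_max = benford_panel(n_movies)
    return [count / y_max for count in expected_counts]

# Two hypothetical panels with very different totals
small_panel = relative_heights(5)
large_panel = relative_heights(40)
```

The expected counts differ by a factor of eight between the two panels, yet `small_panel` and `large_panel` hold the same relative heights, which is why the red lines look identical.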

This is a master stroke.

 

 

 


A little stitch here, a great graphic is knitted

The Wall Street Journal used the following graphic to compare hurricanes Ida and Katrina (link to paywalled article).

Wsj_ida_katrina_hurricanes

This graphic illustrates the power of visual communications. Readers can learn a lot from it.

The paths of the storms can be compared. The geographical locations of the landfalls are shown. The strengthening of wind speeds as the hurricanes moved toward Louisiana is also displayed. Ida is clearly a lesser storm than Katrina: its wind speed never reached Category 5, and was generally lower at comparable time points.

The greatest feature of the WSJ graphic is how the designer stitches the two plots into one graphic. The anchors are two time points: when each storm attained enough wind speed to be classified as a hurricane (indicated by open dots), and when each storm made landfall in Louisiana. It is this little-noticed feature that makes it so easy to place each plot in context of the other.

Bravo!


Visually displaying multipliers

As I'm preparing a blog about another real-world study of Covid-19 vaccines, I came across the following chart (the chart title is mine).

React1_original

As background, this is the trend in Covid-19 cases in the U.K. in the last couple of months, courtesy of OurWorldinData.org.

Junkcharts_owid_uk_case_trend_july_august_2021

The React-1 Study sends swab kits to randomly selected people in England in order to assess the prevalence of Covid-19. Every month, there is a new round of returned swabs that are tested for Covid-19. This measurement method captures asymptomatic cases, although it probably misses severe and hospitalized cases. Despite some shortcomings, this is a far better way to measure cases than the hotch-potch assembling of variable-quality data submitted by different jurisdictions that has become the dominant source of our data.

Rounds 12 and 13 captured an inflection point in the pandemic in England. The period marked the beginning of the end of the belief that widespread vaccination would end the pandemic.

The chart I excerpted up top broke the data down by age groups. The column heights represent the estimated prevalence of Covid-19 during each round - described more precisely in the paper as "swab positivity." Based on the study's design, one may generalize the prevalence to the population at large. About 1.5% of those aged 13-24 in England are estimated to have had Covid-19 around the time of Round 13 (roughly early July).

The researchers came to the following conclusion:

We show that the third wave of infections in England was being driven primarily by the Delta variant in younger, unvaccinated people. This focus of infection offers considerable scope for interventions to reduce transmission among younger people, with knock-on benefits across the entire population... In our data, the highest prevalence of infection was among 12 to 24 year olds, raising the prospect that vaccinating more of this group by extending the UK programme to those aged 12 to 17 years could substantially reduce transmission potential in the autumn when levels of social mixing increase

***

Raise your hand if the graphics software you prefer dictates at least one default behavior you can't stand. I'm sure most hands are up in the air. No matter how much you love the software, there is always something the developer likes that you don't.

The first thing I did with today's chart is to get rid of all such default details.

Redo_react1_cleanup

For me, the bottom chart is cleaner and more inviting.

***

The researchers wanted readers to think in terms of Round 13 numbers as multiples of Round 12 numbers. In the text, they use statements such as:

weighted prevalence in round 13 was nine-fold higher in 13-17 year olds at 1.56% (1.25%, 1.95%) compared with 0.16% (0.08%, 0.31%) in round 12

It's not easy to perceive a nine-fold jump from the paired column chart, even though this chart form is better than several others. I added some subtle divisions inside each orange column in order to facilitate this task:

Redo_react1_multiples

I have recommended this before. I'm co-opting pictograms in constructing the column chart.

An alternative is to plot everything on an index scale although one would have to drop the prevalence numbers.
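As a rough sketch of the arithmetic behind both ideas, the Python below uses only the two prevalence figures quoted above for 13-17 year olds (0.16% in Round 12, 1.56% in Round 13). The function names and the base of 100 for the index are my own choices for illustration, not anything from the paper.

```python
def fold_change(later, earlier):
    """How many times larger the later value is than the earlier one."""
    return later / earlier

def to_index(later, earlier, base=100.0):
    """Express the later value on an index scale where the earlier value equals `base`."""
    return base * later / earlier

# Prevalence (%) for 13-17 year olds, as quoted from the paper
round12 = 0.16
round13 = 1.56

multiple = fold_change(round13, round12)   # about 9.75
whole_blocks = int(multiple)               # full Round 12-sized divisions inside the Round 13 column
leftover = multiple - whole_blocks         # partial division at the top of the column
index13 = to_index(round13, round12)       # about 975 when Round 12 = 100
```

The divisions inside the orange column correspond to `whole_blocks` plus the `leftover` fraction. On the index scale, every Round 12 column would sit at 100, which makes the multiples easy to read but hides the prevalence numbers themselves.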

***

The chart requires an additional piece of context to interpret properly. I added each age group's share of the population below the chart - just to illustrate this point, not to recommend it as a best practice.

Redo_react1_multiples_popshare

The researchers concluded that their data supported vaccinating 13-17 year olds because that group experienced the highest multiplier from Round 12 to Round 13. Notice that the 13-17 year old age group represents only 6 percent of England's population, and is the least populous age group shown on the chart.

The neighboring 18-24 age group experienced a 4.5 times jump in prevalence in Round 13 so this age group is doing much better than 13-17 year olds, right? Not really.

While the same infection rate was found in both age groups during this period, the slightly older age group accounted for 50% more cases -- and that's due to the larger share of population.
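The reasoning is that a group's contribution to total cases is its infection rate times its share of the population. A minimal sketch in Python, with two caveats: the 6% share for 13-17 year olds and the 1.56% prevalence come from the text, but the 9% share for 18-24 year olds is an assumption I backed out from the "50% more cases" statement, and the function name is hypothetical.

```python
def relative_cases(prevalence_pct, population_share):
    """Infections per person in the whole population contributed by one age group."""
    return prevalence_pct / 100 * population_share

# 13-17 year olds: 1.56% prevalence, 6% of England's population (from the text)
teens = relative_cases(1.56, 0.06)

# 18-24 year olds: same prevalence; the 9% population share is an ASSUMPTION
# implied by the "50% more cases" statement, not a figure quoted in the post
young_adults = relative_cases(1.56, 0.09)

ratio = young_adults / teens   # 1.5, i.e. 50% more cases at the same infection rate
```

With equal infection rates, the ratio of cases reduces to the ratio of population shares.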

A similar calculation shows that while the infection rate of people 24 and under is about 3 times higher than that of those 25 and over, both age groups suffered over 175,000 infections during the Round 13 time period (the difference between groups was < 4,000). So I don't agree that focusing on 13-17 year olds gives England the biggest bang for the buck: while they are the most likely to get infected, their cases account for only 14% of all infections. Almost half of the infections are in people 25 and over.

 


Working hard at clarity

As I am preparing another blog post about the pandemic, I came across the following data graphic, recently produced by the CDC for a vaccine advisory board meeting:

CDC_positivevaccineintent

This is not an example of effective visual communications.

***

For one thing, readers are directed to scour the footnotes to figure out what's going on. If we ignore those for the moment, we see clusters of bubbles that have remained pretty stable from December 2020 to August 2021. The data concern some measure of Americans' intent to take the COVID-19 vaccine. That much we know.

There may have been a bit of an upward trend between January and May, although if you were shown the clusters for December, February and April, you'd think the trend's been pretty flat. 

***

But those colors? What could they represent? You'd surely have to fish this one out of the footnotes. Specifically, this obscure sentence: "Surveys with multiple time points are shown with the same color bubble for each time point." I had to read it several times. I think it simply means "Color represents the pollster."

Then it adds: "Surveys with only one time point are shown in gray." which simply means "All pollsters who have only one entry in the dataset are grouped together and shown in gray."

Another problem with this chart is over-plotting. Look at the July cluster. It's impossible to tell how many polls were conducted in July because the circles pile on top of one another. 

***

The appearance of the flat trend is a result of two unfortunate decisions made by the designer. If I retained the chart form, I'd have produced something that looks like this:

Junkcharts_redo_cdcvaccineintent_sameform

The first design choice is to expand the vertical axis to range from 0% to 100%. This effectively squeezes all the bubbles into a small range.

Junkcharts_redo_cdcvaccineintent_startatzero

The second design choice is to enlarge the bubbles, causing a copious amount of overlap.

Junkcharts_redo_cdcvaccineintent_startatzero_bigdots

In particular, this decision blows up the Pew poll (big pink bubble), which had 10 times the sample size of most of the other polls. The Pew outcome actually came in at 70% but the top of the pink bubble extends to over 80%. Because of this, the outlier poll of December 2020 - which surprisingly printed the highest number of all polls in the entire time window - no longer looks special.

***

Now, let's see what else we can do to enhance this chart. 

I don't like how bubble size is used to encode the sample size. It creates a weird sensation for anyone who's familiar with sampling errors and confidence regions. The Pew poll, with 10 times the sample size, is the most reliable poll of them all. Reliability means the error bars around the Pew poll outcome are the smallest of them all. I tend to think of the area around a point estimate as showing the sampling error, so the Pew poll would be a dot, showing the high precision of that estimate.
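To put a number on "most reliable", here is a small sketch in Python using the textbook standard error of a proportion. It assumes simple random sampling, which real polls with weighting only approximate. The 70% outcome is the Pew figure mentioned above; the sample sizes of 1,000 and 10,000 are hypothetical, chosen only to reflect the 10-to-1 ratio.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

def margin_of_error(p, n, z=1.96):
    """Half-width of an approximate 95% confidence interval."""
    return z * standard_error(p, n)

# Hypothetical sample sizes; only the 10x ratio comes from the post
typical_poll = margin_of_error(0.70, 1_000)
large_poll = margin_of_error(0.70, 10_000)

shrink_factor = typical_poll / large_poll   # sqrt(10), roughly 3.16
```

Ten times the sample size shrinks the error bars by a factor of about 3.16, not 10, yet the bubble chart draws the more precise poll as the biggest blob.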

But that won't work because larger bubbles catch more of the reader's attention. So, in the following version, all dots have the same size. I encode reliability in the opacity of the color. The darker dots are the more reliable polls, that is, those with larger sample sizes.
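One way to sketch such an encoding in Python is shown below. This is an illustration, not the mapping I used for the chart: the function name, the log scale, and the opacity range of 0.25 to 1.0 are all assumptions.

```python
import math

def opacity_from_sample_size(n, n_min, n_max, lo=0.25, hi=1.0):
    """Map a sample size to an opacity in [lo, hi], on a log scale (an assumed choice).

    Larger samples get darker dots; the smallest sample stays visible at `lo`.
    """
    if n_max == n_min:
        return hi
    t = (math.log(n) - math.log(n_min)) / (math.log(n_max) - math.log(n_min))
    t = min(1.0, max(0.0, t))   # clamp values outside the observed range
    return lo + t * (hi - lo)

# Hypothetical sample sizes for three polls
alphas = [opacity_from_sample_size(n, 500, 10_000) for n in (500, 1_000, 10_000)]
```

The log scale keeps one very large poll from washing out all the others, and the floor on opacity keeps the smallest polls from disappearing.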

Junkcharts_redo_cdcvaccineintent_opacity

Two of the pollsters have more frequent polling than others. In this next version, I highlighted those two, which reveals the trend better.

Junkcharts_redo_cdcvaccineintent_opacitywithlines