Choosing between individuals and aggregates

Friend/reader Thomas B. alerted me to this paper that describes some of the key chart forms used by cancer researchers.

It strikes me that many of the "new" charts plot granular data at the individual level. This heatmap of gene expression shows one column per patient:

Jnci_genemap

This so-called swimmer plot shows one bar per patient:

Jnci_swimlanes

This spider plot shows the progression of individual patients over time. Key events are marked with symbols.

Jnci_spaghetti

These chart forms are distinguished from other ones that plot aggregated statistics: statistical averages, medians, subgroup averages, and so on.

One obvious limitation of such charts is their lack of scalability. The number of patients, the variability of the metric, and the timing of trends all drive up the amount of messiness.

I am left wondering what Question is being addressed by these plots. If we are concerned about treatment of an individual patient, then showing each line by itself would be clearer. If we are interested in the average trends of patients, then a chart that plots the overall average, or subgroup averages would be more accurate. If the interpretation of the individual's trend requires comparing with similar patients, then showing that individual's line against the subgroup average would be preferred.

When shown these charts of individual lines, readers are tempted to play the statistician - without using appropriate tools! Readers draw aggregate conclusions, performing the aggregation in their heads.

The authors of the paper note: "Spider plots only provide good visual qualitative assessment but do not allow for formal statistical inference." I agree with the second part. The first part is a fallacy - if the visual qualitative assessment is good enough, then no formal inference is necessary! The same argument is often made when people say they don't need advanced analysis because their simple analysis is "directionally accurate". When is something "directionally inaccurate"? How would one know?

Reference: Chia, Gedye, et al., "Current and Evolving Methods to Visualize Biological Data in Cancer Research," JNCI, 2016, 108(8). (link)

***

Meteorologists, whom I featured in the previous post, also have their own spider-like chart for hurricanes. They call it a spaghetti map:

Dorian_spaghetti

Compare this to the "cone of uncertainty" map that was featured in the prior post:

AL052019_5day_cone_with_line_and_wind

These two charts build upon the same dataset. The cone map, as we discussed, shows the range of probable paths of the storm center, based on all simulations of all acceptable models for projection. The spaghetti map shows selected individual simulations. Each line is the most likely trajectory of the storm center as predicted by a single simulation from a single model.

The problem is that each predictive model type has its own historical accuracy (known as "skill"), and so the lines carry different levels of importance. Further, it's not immediately clear whether all possible lines are drawn, so any reader drawing conclusions about, say, the envelope containing x percent of these lines is likely to be fooled. Eyeballing the "cone" that contains x percent of the lines is not trivial either. We tend to drift naturally toward aggregate statistical conclusions without the benefit of appropriate tools.
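To make the point concrete, here is a minimal sketch of how one might compute such an envelope from an ensemble of simulated tracks. The tracks below are randomly generated stand-ins, and the sketch treats every simulation as equally credible, which, as noted, real forecasters do not.

```python
import numpy as np

# Hypothetical ensemble: 40 simulated storm tracks, each giving the
# storm center's longitude at 12 forecast hours (made-up random walks).
rng = np.random.default_rng(0)
n_sims, n_hours = 40, 12
steps = rng.normal(loc=0.3, scale=0.5, size=(n_sims, n_hours))
tracks = -80 + np.cumsum(steps, axis=1)   # degrees longitude over time

# Envelope containing the central 80% of simulations at each forecast hour.
# This ignores model skill: every line is weighted equally.
lower = np.percentile(tracks, 10, axis=0)
upper = np.percentile(tracks, 90, axis=0)

for hour, (lo, hi) in enumerate(zip(lower, upper), start=1):
    print(f"hour {hour:2d}: 80% of tracks between {lo:.1f} and {hi:.1f} deg lon")
```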

Plots of individuals should be used to address the specific problem of assessing individuals.


Too much of a good thing

Several of us discussed this data visualization over twitter last week. The dataviz by Aero Data Lab is called “A Bird’s Eye View of Pharmaceutical Research and Development”. There is a separate discussion on STAT News.

Here is the top section of the chart:

Aerodatalab_research_top

We faced a number of hurdles in understanding this chart as there is so much going on. The size of the shapes is perhaps the first thing readers notice, followed by where the shapes are located along the horizontal (time) axis. After that, readers may see the color of the shapes, and finally, the different shapes (circles, triangles,...).

It would help to have a legend explaining the sizes, shapes and colors. These were explained within the text. The size encodes the number of test subjects in the clinical trials. The color encodes pharmaceutical companies, of which the graphic focuses on 10 major ones. Circles represent completed trials, crosses inside circles represent terminated trials, triangles represent trials that are still active and recruiting, and squares represent other statuses.
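To make the scheme concrete, here is a minimal matplotlib sketch of the same kind of encoding - size for enrollment, color for sponsor, marker shape for status - using a handful of invented trials (a plain X stands in for the cross-inside-circle symbol).

```python
import matplotlib.pyplot as plt

# Invented clinical trials: (start year, condition row, enrollment, company, status)
trials = [
    (2008, 0, 1200, "Pfizer",   "completed"),
    (2011, 0,  300, "Novartis", "terminated"),
    (2014, 1, 2500, "GSK",      "active"),
    (2016, 2,  800, "Pfizer",   "other"),
]

company_color = {"Pfizer": "tab:blue", "Novartis": "tab:orange", "GSK": "tab:green"}
status_marker = {"completed": "o", "terminated": "X", "active": "^", "other": "s"}

fig, ax = plt.subplots()
for year, row, n, company, status in trials:
    ax.scatter(year, row,
               s=n / 5,                        # size encodes number of subjects
               c=company_color[company],       # color encodes the sponsor
               marker=status_marker[status],   # shape encodes trial status
               alpha=0.6, edgecolors="black")

ax.set_yticks([0, 1, 2])
ax.set_yticklabels(["Condition A", "Condition B", "Condition C"])
ax.set_xlabel("Trial start year")
plt.show()
```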

The vertical axis presents another challenge. It shows the disease conditions being investigated. As a lay-person, I cannot comprehend the logic of the order. With over 800 conditions, it became impossible to find a particular condition. The search function on my browser skipped over the entire graphic. I believe the order is based on some established taxonomy.

***

In creating the alternative shown below, I stayed close to the original intent of the dataviz, retaining all the dimensions of the dataset. Instead of the fancy dot plot, I used an enhanced data table. The encoding methods reflect what I’d like my readers to notice first. The color shading reflects the size of each clinical trial. The pharmaceutical companies are represented by their first initials. The status of the trial is shown by a dot, a cross or a square.

Here is a sketch of this concept showing just the top 10 rows.

Redo_aero_pharmard

Certain conditions attracted much more investment. Certain pharmas are placing bets on cures for certain conditions. For example, Novartis is heavily into research on "Meningitis, meningococcal" while GSK has spent quite a bit on researching "bacterial infections."


Measles babies

Mona Chalabi has made this remarkable graphic to illustrate the effect of the anti-vaccine movement on measles cases in the U.S.: (link)

Monachalabi_measles

As a form of agitprop, the graphic seizes upon the fear engendered by the defacing red rash of the disease. And it's very effective in articulating its social message.

***

I wasn't able to find the data except for a specific year or two. So, this post is more inspired by the graphic than a direct response to it.

I think the left-side legend should say "1 case of measles in someone who was not vaccinated" (as opposed to 1 case of measles in aggregate).

The chart encodes the data in the density of the red dots. What does the density of the red dots signify? There are two possibilities: case counts or case rates.

2013 is a year for which I could find data. In 2013, the U.S. saw 187 cases of measles, only 4 of them in people who had been vaccinated. In other words, there were 49 times as many measles cases among the unvaccinated as among the vaccinated.

But note that about 90 percent of the population (using 13-17 year olds as a proxy) are vaccinated. The chance of getting measles in the unvaccinated is 0.8 per million, compared to 0.002 per million in the vaccinated - 422 times higher.
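Here is a back-of-envelope sketch of the counts-versus-rates reasoning. The population figure and the 90 percent coverage are assumptions for illustration; the exact multiples depend on the denominators used, which I don't have.

```python
# Rough sketch of the counts-vs-rates argument. The population figure and
# coverage rate are assumptions for illustration only.
total_cases = 187          # U.S. measles cases, 2013
vaccinated_cases = 4
unvaccinated_cases = total_cases - vaccinated_cases

population = 316_000_000   # approximate U.S. population, 2013
coverage = 0.90            # share assumed vaccinated

unvax_rate = unvaccinated_cases / (population * (1 - coverage))
vax_rate = vaccinated_cases / (population * coverage)

print(f"case count ratio: {unvaccinated_cases / vaccinated_cases:.0f}x")
print(f"case rate ratio:  {unvax_rate / vax_rate:.0f}x")
# The rate ratio is roughly nine times the count ratio, because the
# unvaccinated group is about one-ninth the size of the vaccinated group.
```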

The following chart shows the relative appearance of the dot densities. The bottom row, which compares the relative chance of getting measles, is the more appropriate metric, and it looks much worse.

Junkcharts_monachalabi_measles

***

Mona's Instagram has many other provocative graphics.

 


This map steals story-telling from the designer

Stolen drugs are a problem at federal VA hospitals, according to the following map.

Hospitals_losing_drugs

***

Let's evaluate this map from a Trifecta Checkup perspective.

VISUAL - Pass by a whisker. The chosen visual form of a map is standard for geographic data although the map snatches story-telling from our claws, just as people steal drugs from hospitals. Looking at the map, it's not clear what the message is. Is there one?

The 50 states plus DC are placed into five groups based on the reported number of incidents of theft. From the headline, it appears that the journalist conducted a Top 2 Box analysis, defining "significant" losses of drugs as 300 incidents or more. The visual design ignores this definition of "significance."

DATA - Fail. The map tells us where the VA hospitals are located. It doesn't tell us which states are most egregious in drug theft. To learn that, we need to compute a rate, based on the number of hospitals or patients or the amount of spending on drugs.
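Here is a sketch of the kind of adjustment I have in mind. The incident counts are loosely based on the map (e.g., at least 400 in Texas), while the hospital counts are invented for illustration.

```python
import pandas as pd

# Invented figures for illustration; a real analysis would use the
# reported theft incidents and the actual number of VA facilities per state.
df = pd.DataFrame({
    "state":     ["TX", "CA", "FL", "WY"],
    "incidents": [400,  350,  280,   15],
    "hospitals": [ 38,   45,   30,    2],
})

# Incidents per hospital is a fairer basis for comparison than raw counts.
df["incidents_per_hospital"] = df["incidents"] / df["hospitals"]
print(df.sort_values("incidents_per_hospital", ascending=False))
```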

Looking more carefully, it's not clear they used a Top 2 Box analysis either. I counted seven states with the highest level of theft, followed by another seven states with the second highest level of theft. So a cutoff of twelve states lands awkwardly in the middle of the second group.

QUESTION - Fail. Drug theft from hospitals is an interesting topic but the graphic does not provide a good answer to the question.

***

Even if we don't have data to compute a rate, the chart is a bit better if proportions are emphasized, rather than counts.

Redo_hospitaldrugloss
 

The proportions are most easily understood against a base of four quarters making up the whole. The first group is just over a quarter; the second group is exactly a quarter. The third group plus the first group roughly make up a half. The fourth and fifth groups together almost fill out a quarter.

In the original map, we are told about at least 400 incidents of theft in Texas but given no context to interpret this statistic. What proportion of the total thefts occur in Texas?

 

 


A chart Hans Rosling would have loved

I came across this chart on the OurWorldinData website, and it would have made the late Hans Rosling very happy.

MaxRoser_Two-centuries-World-as-100-people

If you ever attended one of Professor Rosling's talks, you'd know he was bitter that the amazing gains in public health worldwide (particularly in less developed nations) during the last few decades have gone largely unnoticed. This chart makes it clear: note especially the dramatic plunge in extreme poverty, the rise in vaccinations, the drop in child mortality, and the improvement in education and literacy, mostly achieved in the last few decades.

This set of charts has a simple but powerful message. It's the simplicity of execution that really helps readers get that powerful message.

The text labels on the left and right side of the charts are just perfect.

***

Little things that irk me:

I am not convinced by the liberal use of colors - I would make the "other" category of each chart consistently gray, for six colors in total. Admittedly, having different colors does make the chart more interesting to look at.

Even though the gridlines are muted, I still find them excessive.

There is a coding bug in the Vaccination chart right around 1960.

 


A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:

Statnews_physicianwages

The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data are not freely available. There is a claim that these data come from self-reports by 36,000 physicians.

I am not sure whether I trust this data. For example:

Stat_wagegapdoctor_1

Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that, they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second highest salary number comes from South Dakota. And then Idaho.  Also, these high-salary states are correlated with the lowest gender wage gaps.

I suspect that sample size is an issue. They do not report sample sizes at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S., so at that level, they have on average only about 90 respondents per MSA. When split by gender, the average sample size is below 50. Then, they are comparing differences, so we should see standard errors. And finally, they are making hundreds of such comparisons, for which some kind of multiple-comparisons correction is needed.
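A back-of-envelope calculation shows why this matters. Assuming a within-group salary standard deviation of $100K (my assumption, not theirs), the margin of error on a single MSA's wage gap dwarfs the gaps being reported:

```python
import math

n_physicians = 36_000
n_msas = 400
per_msa = n_physicians / n_msas          # ~90 respondents per MSA
per_gender = per_msa / 2                 # ~45 per gender per MSA

salary_sd = 100_000                      # assumed within-group SD (illustrative)

# Standard error of the difference between two group means
se_gap = math.sqrt(salary_sd**2 / per_gender + salary_sd**2 / per_gender)
margin = 1.96 * se_gap                   # approximate 95% margin of error

print(f"respondents per MSA:        {per_msa:.0f}")
print(f"respondents per gender:     {per_gender:.0f}")
print(f"std error of the wage gap:  ${se_gap:,.0f}")
print(f"95% margin of error:        ±${margin:,.0f}")
```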

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?

***

Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on either axis is a nice idea, well executed.

I don't see the point of drawing a circle inside a circle. The wage gap is already on the vertical axis, and the redundant representation in dual circles adds nothing to it. Because of this construct, the size of the bubbles now encodes the male average salary, drawing attention away from the gender gap, which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.

***

This is another instance of a dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and proceeds as if the dataset were complete. There is no indication of any concern about sample sizes, even as the analyst drills down to finer slices of the dataset. While there are other variables available, such as specialty, and other variables that can be merged in, such as local income levels, all of which may explain at least a portion of the gender wage gap, no attempt has been made to incorporate these factors. We are stuck with a bivariate analysis that does not control for anything else.
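For what it's worth, incorporating other factors is standard fare. Here is a minimal sketch using an ordinary least squares regression on invented data; the column names and effect sizes are made up, since the actual dataset is not public.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Randomly generated stand-in for the self-reported survey; every column
# and coefficient here is an assumption for illustration.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "gender": rng.choice(["female", "male"], size=n),
    "specialty": rng.choice(["cardiology", "pediatrics", "surgery"], size=n),
    "hours_worked": rng.normal(50, 8, size=n),
})
df["salary"] = (
    250_000
    + 20_000 * (df["gender"] == "male")
    + 3_000 * df["hours_worked"]
    + rng.normal(0, 40_000, size=n)
)

# Bivariate: gender alone.
bivariate = smf.ols("salary ~ gender", data=df).fit()
# Adjusted: controlling for specialty and hours worked.
adjusted = smf.ols("salary ~ gender + specialty + hours_worked", data=df).fit()

print("raw gap:     ", round(bivariate.params["gender[T.male]"]))
print("adjusted gap:", round(adjusted.params["gender[T.male]"]))
```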

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)

 

P.S. The Stat News article reports that the researchers at Doximity claimed that they controlled for "hours worked and other factors that might explain the wage gap." However, in Doximity's own report, there is no language confirming how they included the controls.

 


Depicting imbalance, straying from the standard chart

My friend Tonny M. sent me a tip about two pretty nice charts depicting the state of U.S. healthcare spending (link).

The first shows U.S. as an outlier:

FtotHealthExp_pC_USD_long-1-768x871

This chart is a replica, with some added details, of the Lane Kenworthy chart that I have praised here before. It remains one of the most impactful charts I have seen. The added time-series details allow us to see a divergence starting around 1980.

Lanekenworthy-250wi

The second chart shows the inequity of healthcare spending among Americans. The top 10% spenders consume about 6.5 times as much as the average while the bottom 16% do not spend anything at all.

Ourworldindata_nihcm-spending-concentration-titled

This chart form is standard for depicting imbalance in scientific publications. But the general public finds this chart difficult to interpret, mostly because both axes operate on a cumulative scale. Further, encoding inequity in the bend of the curve is not particularly intuitive.

So I tried out some other possibilities. Both alternatives are based on incremental, not cumulative, metrics. I take the spending of each of the ten groups (deciles) and work with those dollars. Also, I provide a reference point: the level of spending of each decile if the total were distributed evenly among all ten groups.
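Here is a minimal sketch of that transformation. The decile shares are placeholders (the 65% share of the top decile follows from the "6.5 times the average" figure quoted above; the rest are invented).

```python
# Made-up shares of total health spending by decile of spenders (top first).
# The actual figures come from the NIHCM chart, which is not reproduced here.
decile_share = [0.65, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02, 0.00, 0.00, 0.00]

even_share = 1 / len(decile_share)   # reference: spending spread evenly

# "Excess" (positive) or "deficient" (negative) spend relative to an even split.
excess = [share - even_share for share in decile_share]

for i, (share, diff) in enumerate(zip(decile_share, excess), start=1):
    print(f"decile {i:2d}: share {share:5.1%}, vs even split {diff:+6.1%}")
```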

The first alternative depicts the "excess" or "deficient" spend as column segments.

Redo_healthcarespend1

The second alternative shows the level of excess or deficient spending as slopes of lines. I am aiming for a bit more drama here.

Redo_healthcarespend2

Now, the interpretation of this chart is not simple. Since illness is not evenly spread out within the population, this distribution might just be the normal state of affairs. Nevertheless, this pattern can also result from the top spenders purchasing very expensive experimental treatments with little chance of success, for example.

 


NBC has a problem with bar lengths

Seems like reader Conor H. has found a pattern. He alerted us to the problem with bar lengths in the daily medals chart on NBC, which I blogged about the other day.

Through twitter (@andyn), I was sent the following, also courtesy of NBC:

Ejmaroun_nbczika

This one is much harder to understand.


Egregious chart brings back bad memories

My friend Alberto Cairo said it best: if you see bullshit, say "bullshit!"

He was very incensed by this egregious "infographic": (link to his post)

Aul_vs_pp

Emily Schuch provided a re-visualization:

Emilyschuch_pp

The new version provides a much richer story of how Planned Parenthood has shifted priorities over the last few years.

It also exposed how the AUL (Americans United for Life) organization distorted the story.

The designer extracted only two of the lines, so readers do not see that the category of services that really replaced the loss of cancer screening was STI/STD testing and treatment. This is a bit ironic given the other story that has circulated this week - the big jump in STDs among Americans (link).

Then, the designer placed the two lines on dual axes, which is a dead giveaway that something awful lies beneath.

Further, this designer dumped the data from intervening years, and drew a straight line from the first to the last year. The straight arrow misleads by pretending that there has been a linear trend, and that it would go on forever.

But the masterstroke is in the treatment of the axes. Let's look at the axes, one at a time:

The horizontal axis: Let me recap. The designer dumped all but the starting and ending years, and drew a straight line between the endpoints. While the data are no longer there, the axis labels are retained. So, our attention is drawn to an area of the chart that is void of data.

The vertical axes: Let me recap. The designer has two series of data with the same units (number of people served) and decided to plot each series on a different scale with dual axes. But readers are not supposed to notice the scales, so they do not show up on the chart.

To summarize, where there are no data, we have a set of functionless labels; where labels are needed to differentiate the scales, we have no axes.

***

This is a tried-and-true tactic employed by propagandists. The egregious chart brings back some bad memories.

Here is a long-ago post on dual axes.

Here is Thomas Friedman's use of the same trick.


More chart drama, and data aggregation

Robert Kosara posted a response to my previous post.

He raises an important issue in data visualization - the need to aggregate data, and not plot raw data. I have no objection to that point.

What I showed in my original post were two extremes. The bubble chart is high drama at the expense of data integrity. Readers cannot learn any of the following from that chart:

  • the shape of the growth and subsequent decline of the flu epidemic
  • the beginning and ending date of the epidemic
  • the peak of the epidemic*

* The peak can be inferred from the data label, although there appears to be at least one other circle of approximately equal size, which isn't labeled.

The column chart is low drama but high data integrity. To retain some dramatic element, I encoded the data redundantly in the color scale. I also emulated the original chart in labeling specific spikes.

The designer then simply has to choose a position between these two extremes. This will involve some smoothing or aggregation of the data. Robert showed a column chart with weekly aggregates, and in his view, his version is closer to the bubble chart.
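For anyone who wants to replicate the weekly rollup, here is a minimal pandas sketch; the daily series is randomly generated, standing in for the avian flu counts.

```python
import numpy as np
import pandas as pd

# Invented daily case counts standing in for the avian flu data.
days = pd.date_range("2015-03-01", periods=120, freq="D")
rng = np.random.default_rng(2)
daily = pd.Series(rng.poisson(lam=5, size=len(days)), index=days)

# Weekly aggregation: sum the daily counts within each week.
weekly = daily.resample("W").sum()

print(daily.head())
print(weekly.head())
```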

Robert's version indeed strikes a balance between drama and data integrity, and I am in favor of it. Here is the idea (I am responsible for the added color).

Kosara_avianflu2

***

Where I depart from Robert is how one reads a column chart such as the one I posted:

Redo_avianflu2

Robert thinks that readers will perceive each individual line separately, and in so doing, "details hide the story". When I look at a chart like this, I am drawn to the envelope of the columns. The lighter colors are chosen for the smaller spikes to push them into the background. What might be the problem are those data labels identifying specific spikes; they are a holdover from the original chart--I actually don't know why those specific dates are labeled.

***

In summary, the key takeaway is, as Robert puts it:

the point of this [dataset] is really not about individual days, it’s about the grand totals and the speed with which the outbreak happened.

We both agree that the weekly version is the best among these. I don't see how the reader can figure out grand totals and speed with which the outbreak happened by staring at those dramatic but overlapping bubbles.