Is the visual serving the question?

The following chart concerns California's bullet train project.


Now, look at the bubble chart at the bottom. Here it is - with all the data except the first number removed:


It is impossible to know how fast the four other train systems run after I removed the numbers. The only way a reader can comprehend this chart is to read the data inside the bubbles. This chart fails the "self-sufficiency test". The self-sufficiency test asks how much work the visual elements on the chart are doing to communicate the data; in this case, the bubbles do nothing at all.

Another problem: this chart buries its lede. The message is in the caption: how California's bullet train rates against other fast train systems. California's train speed of 220 mph is only mentioned in the text but not found in the visual.

Here is a chart that draws attention to the key message:


In a Trifecta checkup, we improved this chart by bringing the visual in sync with the central question of the chart.

McKinsey thinks the data world needs more dataviz talent

Note about last week: While not blogging, I delivered four lectures on three topics over five days: one on the use of data analytics in marketing for a marketing class at Temple; two on the interplay of analytics and data visualization, at Yeshiva and a JMP Webinar; and one on how to live during the Data Revolution at NYU.

This week, I'm back at blogging.

McKinsey publishes a report confirming what most of us already know or experience - the explosion of data jobs that just isn't stopping.

On page 5, it says something that is of interest to readers of this blog: "As data grows more complex, distilling it and bringing it to life through visualization is becoming critical to help make the results of data analyses digestible for decision makers. We estimate that demand for visualization grew roughly 50 percent annually from 2010 to 2015." (my bolding)

The report contains a number of unfortunate graphics. Here's one:


I applied my self-sufficiency test by removing the bottom row of data from the chart. Here is what happened to the second circle, representing the fraction of value realized by the U.S. health care industry.


What does the visual say? This is one of the questions in the Trifecta Checkup. We see three categories of things that should add up to 100 percent. With a little more effort, we find the two colored categories are each 10% while the white area is 80%. 

But that's not what the data say, because there is only one thing being measured: how much of the potential has already been realized. The two colors is an attempt to visualize the uncertainty of the estimated proportion, which in this case is described as 10 to 20 percent underneath the chart.

If we have to describe what the two colored sections represent: the dark green section is the lower bound of the estimate while the medium green section is the range of uncertainty. The edge between the two sections is the actual estimated proportion (assuming the uncertainty bound is symmetric around the estimate)!

A first attempt to fix this might be to use line segments instead of colored arcs. 


The middle diagram emphasizes the mid-point estimate while the right diagram, the range of estimates. Observe how differently these two diagrams appear from the original one shown on the left.

This design only works if the reader perceives the chart as a "racetrack" chart. You have to see the invisible vertical line at the top, which is the starting line, and measure how far around the track has the symbol gone. I have previously discussed why I don't like racetracks (for example, here and here).


Here is a sketch of another design:


The center figure will have to be moved and changed to a different shape. This design conveys the sense of a goal (at 100%) and how far one is along the path. The uncertainty is represented by wave-like elements that make the exact location of the pointer arrow appear as wavering.





Plotted performance guaranteed not to predict future performance

On my flight back from Lyon, I picked up a French magazine, and found the following chart:

French interest rates chart small

A quick visit to Bing Translate tells me that this chart illustrates the rates of return of different types of investments. The headline supposedly says "Only the risk pays". In many investment brochures, after presenting some glaringly optimistic projections of future returns, the vendor legally protects itself by proclaiming "Past performance does not guarantee future performance."



Two unusual decisions set this chart apart:

1. The tree ring imagery, which codes the data in the widths of concentric rings around a common core

2. The placement of larger numbers toward the middle, and smaller numbers in the periphery.

When a reader takes in the visual design of this chart, what is s/he drawn to?

The designer evidently hopes the reader will focus on comparing the widths of the rings (A), while ignoring the areas or the circumferences. I think it is more likely that the reader will see one of the following:

(B) the relative areas of the tree rings

(C) the areas of the full circles bounded by the circumferences

(D) the lengths of the outer rings

(E) the lengths of the inner rings

(F) the lengths of the "middle" rings (defined as the average of the outer and inner rings)

Here is a visualization of six ways to "see" what is on the French rates of return chart:


Recall the Trifecta Checkup (link). This is an example where "What does the visual say" and "What does the data say" may be at variance. In case (A), if the reader is seeing the ring widths, then those two aspects are in sync. In every other case, the two aspects are disconcordant. 

The level of distortion is visualized in the following chart:


Here, I normalized everything to the size of the SCPI data. The true data is presented by the ring width column, represented by the vertical stripes on the left. If the comparisons are not distorted, the other symbols should stay close to the vertical stripes. One notices there is always distortion in cases (B)-(F). This is primarily due to the placement of the large numbers near the center and the small numbers near the edge. In other words, the radius is inversely proportional to the data!

 The amount of distortion for most cases ranges from 2 to 6 times. 

While the "ring area" (B) version is least distorted on average, it is perhaps the worst of the six representations. The level of distortion is not a regular function of the size of the data. The "sicav monetaries" (smallest data) is the least distorted while the data of medium value are the most distorted.


To improve this chart, take a hint from the headline. Someone recognizes that there is a tradeoff between risk and return. The data series shown, which is an annualized return, only paints the return part of the relationship. 




No Latin honors for graphic design

Paw_honors_2018This chart appeared on a recent issue of Princeton Alumni Weekly.

If you read the sister blog, you'll be aware that at most universities in the United States, every student is above average! At Princeton,  47% of the graduating class earned "Latin" honors. The median student just missed graduating with honors so the honors graduate is just above average! The 47% number is actually lower than at some other peer schools - at one point, Harvard was giving 90% of its graduates Latin honors.

Side note: In researching this post, I also learned that in the Senior Survey for Harvard's Class of 2018, two-thirds of the respondents (response rate was about 50%) reported GPA to be 3.71 or above, and half reported 3.80 or above, which means their grade average is higher than A-.  Since Harvard does not give out A+, half of the graduates received As in almost every course they took, assuming no non-response bias.


Back to the chart. It's a simple chart but it's not getting a Latin honor.

Most readers of the magazine will not care about the decimal point. Just write 18.9% as 19%. Or even 20%.

The sequencing of the honor levels is backwards. Summa should be on top.


Warning: the remainder of this post is written for graphics die-hards. I go through a bunch of different charts, exploring some fine points.

People often complain that bar charts are boring. A trendy alternative when it comes to count or percentage data is the "pictogram."

Here are two versions of the pictogram. On the left, each percent point is shown as a dot. Then imagine each dot turned into a square, then remove all padding and lines, and you get the chart on the right, which is basically an area chart.


The area chart is actually worse than the original column chart. It's now much harder to judge the areas of irregularly-shaped pieces. You'd have to add data labels to assist the reader.

The 100 dots is appealing because the reader can count out the number of each type of honors. But I don't like visual designs that turn readers into bean-counters.

So I experimented with ways to simplify the counting. If counting is easier, then making comparisons is also easier.

Start with this observation: When asked to count a large number of objects, we group by 10s and 5s.

So, on the left chart below, I made connectors to form groups of 5 or 10 dots. I wonder if I should use different line widths to differentiate groups of five and groups of ten. But the human brain is very powerful: even when I use the same connector style, it's easy to see which is a 5 and which is a 10.


On the left chart, the organizing principles are to keep each connector to its own row, and within each category, to start with 10-group, then 5-group, then singletons. The anti-principle is to allow same-color dots to be separated. The reader should be able to figure out Summa = 10+3, Magna = 10+5+1, Cum Laude = 10+5+4.

The right chart is even more experimental. The anti-principle is to allow bending of the connectors. I also give up on using both 5- and 10-groups. By only using 5-groups, readers can rely on their instinct that anything connected (whether straight or bent) is a 5-group. This is powerful. It relieves the effort of counting while permitting the dots to be packed more tightly by respective color.

Further, I exploited symmetry to further reduce the counting effort. Symmetry is powerful as it removes duplicate effort. In the above chart, once the reader figured out how to read Magna, reading Cum Laude is simplified because the two categories share two straight connectors, and two bent connectors that are mirror images, so it's clear that Cum Laude is more than Magna by exactly three dots (percentage points).


Of course, if the message you want to convey is that roughly half the graduates earn honors, and those honors are split almost even by thirds, then the column chart is sufficient. If you do want to use a pictogram, spend some time thinking about how you can reduce the effort of the counting!






Two views of earthquake occurrence in the Bay Area

This article has a nice description of earthquake occurrence in the San Francisco Bay Area. A few quantities are of interest: when the next quake occurs, the size of the quake, the epicenter of the quake, etc. The data graphic included in the article fails the self-sufficiency test: the only way to read this chart is to read out the entire data set - in other words, the graphical details have no utility.


The article points out the clustering of earthquakes. In particular, there is a 68-year "quiet period" between 1911 and 1979, during which no quakes over 6.0 in size occurred. The author appears to have classified quakes into three groups: "Largest" which are those at 6.5 or over; "Smaller but damaging" which are those between 6.0 and 6.5; and those below 6.0 (not shown).

For a more standard and more effective visualization of this dataset, see this post on a related chart (about avian flu outbreaks). The post discusses a bubble chart versus a column chart. I prefer the column chart.

image from

This chart focuses on the timing of rare events. The time between events is not as easy to see. 

What if we want to focus on the "quiet years" between earthquakes? Here is a visualization that addresses the question: when will the next one hit us?




The downside of discouraging pie charts

It's no secret most dataviz experts do not like pie charts.

Our disdain for pie charts causes people to look for alternatives.

Sometimes, the alternative is worse. Witness:


This chart comes from the Spring 2018 issue of On Investing, the magazine for Charles Schwab customers.

It's not a pie chart.


I'm forced to say the pie chart is preferred.

The original chart fails the self-sufficiency test. Here is the 2007 chart with the data removed.


It's very hard to figure out how large are those pieces, so any reader trying to understand this chart will resort to reading the data, which means the visual representation does no work!

Or, you can use a dot plot.


This version emphasizes the change over time.


Beauty is in the eyes of the fishes

Reader Patrick S. sent in this old gem from Germany.


He said:

It displays the change in numbers of visitors to public pools in the German city of Hanover. The invisible y-axis seems to be, um, nonlinear, but at least it's monotonic, in contrast to the invisible x-axis.

There's a nice touch, though: The eyes of the fish are pie charts. Black: outdoor pools, white: indoor pools (as explained in the bottom left corner).

It's taken from a 1960 publication of the city of Hanover called *Hannover: Die Stadt in der wir leben*.

This is the kind of chart that Ed Tufte made (in)famous. The visual elements do not serve the data at all, except for the eyeballs. The design becomes a mere vessel for the data table. The reader who wants to know the growth rate of swimmers has to do a tank of work.

The eyeballs though.

I like the fact that these pie charts do not come with data labels. This part of the chart passes the self-sufficiency test. In fact, the eyeballs contain the most interesting story in this chart. In those four years, the visitors to public pools switched from mostly indoor pools to mostly outdoor pools. These eyeballs show that pie charts can be effective in specific situations.

Now, Hanover fishes are quite lucky to have free admission to the public pools!

When design goes awry

One can't accuse the following chart of lacking design. Strong is the evidence of departing from convention but the design decisions appear wayward. (The original link on Money here)



The donut chart (right) has nine sections. Eight of the sections (excepting A) have clearly all been bent out of shape. It turns out that section A does not have the right size either. The middle gray circle is not really in the middle, as seen below.


The bar charts (left) suffer from two ills. Firstly, the full width of the chart is at the 50 percent mark, so readers are forced to read the data labels to understand the data. Secondly, only the top two categories are shown, thus the size of the whole is lost. A stacked bar chart would serve better here.

Here is a bardot chart; the "dot" part of it makes it easier to see a Top 2 box analysis.


I explain the bardot chart here.


 PS. Here is Jamie's version (from the comment below):




The visual should be easier to read than your data

A reader sent this tip in some time ago and I lost track of who he/she is. This graphic looks deceptively complex.


What's complex is not the underlying analysis. The design is complex and so the decoding is complex.

The question of the graphic is a central concern of anyone who's retired: how long will one's savings last? There are two related metrics to describe the durability of the stash, and they are both present on this chart. The designer first presumes that one has saved $1 million for retirement. Then he/she computes how many years the savings will last. That, of course, depends on the cost of living, which naively can be expressed as a projected annual expenditure. The designer allows the cost of living to vary by state, which is the main source of variability in the computations. The time-based and dollar-based metrics are directly linked to one another via a formula.

The design encodes the time metric in a grid of dots, and the dollar-metric in the color of the dots. The expenditures are divided into eight segments, given eight colors from deep blue to deep pink.

Thirteen of those dots are invariable, appearing in every state. Readers are drawn into a ranking of the states, which is nothing but a ranking of costs of living. (We don't know, but presume, that the cost of living computation is appropriate for retirees, and not averaged.) This order obscures any spatial correlation. There are a few production errors in the first row in which the year and month numbers are misstated slightly; the numbers should be monotonically decreasing. In terms of years and months, the difference between many states is immaterial. The pictogram format is more popular than it deserves: only highly motivated readers will count individual dots. If readers are merely reading the printed text, which contains all the data encoded in the dots, then the graphic has failed the self-sufficiency principle - the visual elements are not doing any work.


In my version, I surface the spatial correlation using maps. The states are classified into sensible groups that allow a story to be told around the analysis. Three groups of states are identified and separately portrayed. The finer variations between states within each state group appear as shades.


Data visualization should make the underlying data easier to comprehend. It's a problem when the graphic is harder to decipher than the underlying dataset.




A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:


The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data is not freely available. There is a claim that this data come from self reports by 36,000 physicians.

I am not sure whether I trust this data. For example:


Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that, they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second highest salary number comes from South Dakota. And then Idaho.  Also, these high-salary states are correlated with the lowest gender wage gaps.

I suspect that sample size is an issue. They do not report sample size at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S. so at that level, on average, they have only 90 samples per MSA. When split by gender, the average sample size is less than 50. Then, they are comparing differences, so we should see the standard errors. And finally, they are making hundreds of such comparisons, for which some kind of multiple-comparisons correction is needed.

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?


Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values at either axes is a nice idea, well executed.

I don't see the point of drawing the circle inside a circle.  The wage gap is already on the vertical axis, and the redundant representation in dual circles adds nothing to it. Because of this construct, the size of the bubbles is now encoding the male average salary, taking attention away from the gender gap which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.


This is another instance of a dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and works as if the dataset is complete. There is no indication of any concern about sample sizes, after the analyst drills down to finer areas of the dataset. While there are other variables available, such as specialty, and other variables that can be merged in, such as income levels, all of which may explain at least a portion of the gender wage gap, no attempt has been made to incorporate other factors. We are stuck with a bivariate analysis that does not control for any other factors.

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)


P.S. The Stat News article reports that the researchers at Doximity claimed that they controlled for "hours worked and other factors that might explain the wage gap." However, in Doximity's own report, there is no language confirming how they included the controls.