The visual should be easier to read than your data

A reader sent in this tip some time ago, and I have lost track of who it was. This graphic looks deceptively complex.


What's complex is not the underlying analysis. The design is complex and so the decoding is complex.

The question of the graphic is a central concern of anyone who's retired: how long will one's savings last? There are two related metrics describing the durability of the stash, and both are present on this chart. The designer first presumes that one has saved $1 million for retirement, then computes how many years the savings will last. That, of course, depends on the cost of living, which can naively be expressed as a projected annual expenditure. The designer allows the cost of living to vary by state, which is the main source of variability in the computations. The time-based and dollar-based metrics are directly linked to one another via a formula.
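That formula can be sketched as a simple division, assuming (as the chart appears to) no investment returns or inflation; the $55,000 annual cost below is a made-up figure for illustration:

```python
def years_savings_last(savings, annual_cost):
    """Years and months a fixed savings pot lasts at a given annual cost,
    ignoring investment returns and inflation (an assumed simplification)."""
    years_float = savings / annual_cost
    years = int(years_float)
    months = round((years_float - years) * 12)
    if months == 12:  # rounding pushed us up to a full year
        years, months = years + 1, 0
    return years, months

# e.g. $1 million at a hypothetical $55,000 annual cost of living
print(years_savings_last(1_000_000, 55_000))  # -> (18, 2)
```

Since the savings amount is fixed at $1 million, the ranking by years is just the ranking by cost of living in reverse.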

The design encodes the time metric in a grid of dots, and the dollar-metric in the color of the dots. The expenditures are divided into eight segments, given eight colors from deep blue to deep pink.

Thirteen of those dots are invariable, appearing in every state. Readers are drawn into a ranking of the states, which is nothing but a ranking of costs of living. (We don't know, but presume, that the cost-of-living computation is appropriate for retirees, and not averaged.) This order obscures any spatial correlation. There are a few production errors in the first row in which the year and month numbers are misstated slightly; the numbers should be monotonically decreasing. In terms of years and months, the difference between many states is immaterial. The pictogram format is more popular than it deserves to be: only highly motivated readers will count individual dots. If readers are merely reading the printed text, which contains all the data encoded in the dots, then the graphic has failed the self-sufficiency principle: the visual elements are not doing any work.


In my version, I surface the spatial correlation using maps. The states are classified into sensible groups that allow a story to be told around the analysis. Three groups of states are identified and separately portrayed. The finer variations between states within each state group appear as shades.


Data visualization should make the underlying data easier to comprehend. It's a problem when the graphic is harder to decipher than the underlying dataset.




A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:


The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data is not freely available. There is a claim that these data come from self-reports by 36,000 physicians.

I am not sure whether I trust this data. For example:


Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that, they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second highest salary number comes from South Dakota. And then Idaho.  Also, these high-salary states are correlated with the lowest gender wage gaps.

I suspect that sample size is an issue. They do not report sample size at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S. so at that level, on average, they have only 90 samples per MSA. When split by gender, the average sample size is less than 50. Then, they are comparing differences, so we should see the standard errors. And finally, they are making hundreds of such comparisons, for which some kind of multiple-comparisons correction is needed.

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?


Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on either axis is a nice idea, well executed.

I don't see the point of drawing a circle inside a circle. The wage gap is already on the vertical axis, and the redundant dual-circle representation adds nothing to it. Because of this construct, the size of the bubbles now encodes the male average salary, drawing attention away from the gender gap, which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.


This is another instance of a dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and works as if the dataset is complete. There is no indication of any concern about sample sizes, after the analyst drills down to finer areas of the dataset. While there are other variables available, such as specialty, and other variables that can be merged in, such as income levels, all of which may explain at least a portion of the gender wage gap, no attempt has been made to incorporate other factors. We are stuck with a bivariate analysis that does not control for any other factors.

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)


P.S. The Stat News article reports that the researchers at Doximity claimed that they controlled for "hours worked and other factors that might explain the wage gap." However, in Doximity's own report, there is no language confirming how they included the controls.


Round things, square things

The following chart traces the flow of funds into AI (artificial intelligence) startups.


I found it on this webpage; it is attributed to the Financial Times.

Here, I apply the self-sufficiency test to show that the semicircles are playing no role in the visualization. When the numbers are removed, readers cannot understand the data at all. So the visual elements are toothless.


Actually, it's worse. The data got encoded in the diameters of the semicircles, but not the areas. Thus, anyone courageously computing the ratios of the areas finds their effort frustrated.
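The distortion is easy to quantify. With hypothetical values where one funding amount is twice the other, encoding the values in the diameters makes the visual (area) ratio four to one:

```python
import math

# If a value v is encoded in the *diameter* of a semicircle,
# the perceived area grows with v**2, exaggerating the data ratios.
def semicircle_area(diameter):
    return math.pi * (diameter / 2) ** 2 / 2

v1, v2 = 1.0, 2.0                                   # hypothetical amounts
print(v2 / v1)                                      # data ratio: 2.0
print(semicircle_area(v2) / semicircle_area(v1))    # area ratio: 4.0
```

A faithful area encoding would scale the diameters by the square root of the values instead.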

Here is a different view that preserves the layout:


The two data series in the original chart show the current round of funding and the total funds raised. In the junkcharts version, I decided to compare the new funds versus the previously-raised funds so that the total area represents the total funds raised.

Counting the Olympic medals

Reader Conor H. sent in this daily medals table at the NBC website:


He commented that the bars are not quite the right lengths: even though China and Russia both won five total medals that day, the bar for China is slightly shorter.

One issue with the stacked bar chart is that the reader's attention is drawn to the components rather than the whole. However, as in this case, the most important statistic is the total number of medals.

Here is a different view of the data:




A Tufte fix for austerity

Trish, who attended one of my recent data visualization workshops, submitted a link to the Illinois Atlas of Austerity.

Shown on the right is one of the charts included in the presentation.

This is an example of a chart that fails my self-sufficiency test.

There is no chance that readers are getting any information out of the graphical elements (the figurines of 100 people each).

Everyone who tries to learn something from this chart will be reading the data labels directly.

The entire dataset is printed on the chart itself. If you cover up the data, the chart becomes unreadable!


Here is a simple fix that resettles the figurines onto a bar chart:


Tufte would not be amused by this composition. The figurines are purely decorative.

This version is more likely to delight Tufte:


It is the edges of the bars in the bar chart that make all the difference.


Aside from the visual problems, there is also a data issue. They should have controlled for the size of the different programs.



Scorched by the heat in Arizona

Reader Jeffrey S. saw this graphic inside a Dec 2 tweet from the National Weather Service (NWS) in Phoenix, Arizona.


In a Trifecta checkup (link), I'd classify this as Type QV.

The problems with the visual design are numerous and legendary. The column chart where the heights of the columns are not proportional to the data. The unnecessary 3D effect. The lack of self-sufficiency (link). The distracting gridlines. The confusion of year labels that do not increment from left to right.

The more hidden but more serious issue with this chart is the framing of the question. The main message of the original chart is that the last two years have been the hottest two years in a long time. But it is difficult for readers to know if the differences of less than one degree from the first to the last column are meaningful since we are not shown the variability of the time series.

The green line asserts that 1981 to 2010 represents the "normal." It is unclear why that period is normal and the years from 2011 to 2015 are abnormal. Maybe they are using the word normal in a purely technical way to mean "average." If so, it is better just to say average.

For this data, I prefer to see the entire time series from 1981 to 2015, which allows readers to judge the variability as well as the trending of the average temperatures. In the following chart, I also label the five years with the highest average temperatures.


A data visualization that is invariant to the data

This map appeared in Princeton Alumni Weekly:


Here is another map I created:


If you think they look basically the same, you got the point. Now look at the data on the maps. The original map displays the proportion of graduates who ended up in different regions of the country. The second map displays the proportion of land mass in different regions of the country.

The point is that this visual design is not self-sufficient. If you cover up the data printed on the map, there is nothing else to see. Further, if you swap in other data series (anything at all), nothing on the map changes. Yes, this map is invariant to the data!

This means the only way to read this map is to read the data directly.


Maps also have another issue. The larger land areas draw the most attention. However, the sizes of the regions are in inverse proportion to the data being depicted. The smaller the values, the larger the areas on the map. This is the scatter plot of the proportion of graduates (the data) versus the proportion of land mass:



One quick fix is to use a continuous color scale. In this way, the colors encode the data. For example:


The dark color now draws attention to itself.

Of course, one should think twice before using a map.


One note of curiosity: Given the proximity to NYC, it is not surprising that NYC is the most popular destination for Princeton graduates. Strangely enough, given the way the regions are defined, a move from Princeton to New York is considered out of region. New Jersey is lumped with Pennsylvania, Maryland, Virginia, etc. into the Mid-Atlantic region while New York is considered Northeast.


Taking care of a German pie chart while enjoying German kuchen

I was enjoying this yummy piece of German cake the other day.


I started flipping through the recent issue of Stern magazine, and came across this German pie chart, which appears to present results from a poll. In particular, it draws attention to changes between the current and the prior poll, I think.



When a pie chart is used to handle data with more than three or four categories, we frequently encounter objects with a rainbow of colors, and a jumble of text labels. In this case, the order of the labels in the legend doesn't match the order of the pie sectors. 

In addition, such pie charts almost always fail the self-sufficiency test. All of the data are printed on the chart itself, inviting readers to ignore the visual presentation.

A bumps-style chart works well for this type of data. I tried something different here:


The challenge is to elegantly handle the current data plus the change from the last poll.


How to tell if your graphic is underpowered?

Some time ago, this chart showed up in a NYT Magazine (it's about sex):


In this composition, the visual element (the circles) has no utility. A self-sufficiency test makes this point clear.

All the data (four numbers) are printed on the original graphic. When removed, the reader loses all ability to understand the data.



Even when the first number is revealed, it is impossible to know the values of the others.

If one knows the second (and largest) pink circle represents 58 percent, it is still impossible to guess that the adjacent circle is 40 percent.

Even when both of those numbers are provided, it is still impossible to infer the rest without a calculation.

In order to understand this graphic, readers must look at the data labels.




I made a couple of other versions for comparison.

The first uses the pie chart, which is almost readable without the data labels. 


The second uses the bar chart, which requires only an axis.







Boxes or lines: showing the trend in US adoptions

Time used a pair of area charts (a form of treemap) to illustrate the trend in Americans adopting babies of foreign origin. The data consist of the number of babies labeled by country of birth in 1999 and in 2013.


This type of chart fails the self-sufficiency test. The entire dataset is faithfully reproduced on the printed page, and that is because readers cannot figure out the relative sizes with their own eyes. (Try imagining the charts without the numbers.)

This need to present all of the data creates an additional design challenge: how to place country names where several boxes crowd onto one another. The designer here adopts an expand-from-the-middle approach, which may take some getting used to.


In addition, the distance placed between the pair of dates is vast, which is not optimal for a graphic whose primary goal is to convey a trend.


Here is the Bumps-style chart. These charts are great except where the data are tightly clustered. Recently I have been experimenting with small-multiples as a way to split up the data, which alleviates the labeling challenge.


In this version, the countries are shown as four groups. The countries that show up as significant enough in each year to merit individual labels are shown in the middle, themselves split into two groups: those that have seen their share of adoptions increase versus those that have seen a decrease. The remaining countries show up in only one of the two years. Presumably this means in the other year, there were zero adoptions from those countries. (However, it is also possible that in the missing year, the numbers were so tiny that they were included in the "Rest of the World" category.)

I also switched to graphing shares of adoptions rather than number of adoptions. The total number of adoptions dropped drastically during that period. It is often the share, not the absolute numbers, that is of interest.
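The conversion from counts to shares is a simple normalization within each year. The counts below are made up for illustration:

```python
# Hypothetical adoption counts by country of origin for one year
counts = {"China": 4000, "Russia": 2000, "Rest of World": 4000}

# Normalize to shares so the two years are comparable
# even though total adoptions dropped drastically between them
total = sum(counts.values())
shares = {country: n / total for country, n in counts.items()}
print(shares)  # -> {'China': 0.4, 'Russia': 0.2, 'Rest of World': 0.4}
```

Plotting shares puts both years on the same 0-to-100% scale, so the chart answers "which countries gained or lost share" rather than being dominated by the overall decline.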