Diverging paths for rich and poor, infographically

Ray Vella (link) asked me to comment on a chart about regional wealth distribution, which I wrote about here. He also asked students in his NYU infographics class to create their own versions.

This effort caught my eye:


This work is creative, and I like the concept of using two staircases to illustrate the diverging fortunes of the two groups. This is worlds away from the original Economist chart.

The infographic does have a serious problem. In one of my dataviz talks, I talk about three qualifications of work called "data visualization." The first qualification is that the data visualization has to display the data. This is an example of an infographic that is invariant to the data.

Is it possible to salvage the concept? I tried. Here is an idea:


I abandoned the time axis so the data plotted are only for 2015, and the countries are shown horizontally from most to least equal. I'm sure there are ways to do it even better.

Infographics can be done while respecting the data. Ray is one of the designers who appreciate this. And thanks Ray for letting me blog about this.




Fifty-nine intersections supporting forty dots of data

My friend Ray V. asked how this chart can be improved:


Let's try to read this chart. The Economist is always the best at writing headlines, and this one is simple and to the point: the rich get richer. This is about inequality but not just inequality - the growth in inequality over time.

Each country has four dots, divided into two pairs. From the legend, we learn that the line represents the gap between the rich and the poor. But what is rich and what is poor? Looking at the sub-header, we learn that the population is divided by domicile, and the per-capita GDP of the poorest and richest regions are drawn. This is an indirect metric, and may or may not be good, depending on how many regions a country is divided into, the dispersion of incomes within each region, the distribution of population between regions, and so on.

Now, looking at the axis labels, it's pretty clear that the data depicted are not in dollars (or currency), despite the reference to GDP in the sub-header. The numbers represent indices, relative to the national average GDP per head. For many of the countries, the poorest region produces about half as much per-capita GDP as the richest region.

Back to the original question. A growing inequality would be represented by a longer line below a shorter line within each country. That is true in some of these countries. The exceptions are Sweden, Japan, and South Korea.

It doesn't jump out that the key task requires comparing the lengths of the two lines. Another issue is the outdated convention of breaking up a line (Britain) when the line is of extreme length - particularly unwise given that the length of the line encodes the key metric in the chart.

Further, it has a low data-ink ratio a la Tufte. The gridlines, reference lines, and data lines weave together in a complex pattern creating 59 intersections in a chart that contains only 36 numbers.


I decided to compute a simpler metric - the ratio of rich to poor. For example, in the UK, the richest area produces about 20 times as much GDP per capita as the poorest one in 2015. That is easier to understand than an index to the average region.
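The ratio can be computed straight from the two indexed values, because the national average appears in both and cancels out. Here is a rough sketch of that arithmetic in Python. The function name is my own, and the index values are made up for illustration, not read off the Economist chart.

```python
def rich_to_poor_ratio(richest_index: float, poorest_index: float) -> float:
    """Ratio of richest-region to poorest-region GDP per head.

    Both inputs are indices relative to the national average, so the
    national average cancels out in the division.
    """
    if poorest_index <= 0:
        raise ValueError("poorest_index must be positive")
    return richest_index / poorest_index


# Hypothetical values for illustration only (national average = 100).
example_ratio = rich_to_poor_ratio(150.0, 75.0)
print(f"Richest region produces {example_ratio:.1f} times the poorest region's GDP per head")
```

The result does not depend on whether the average is set to 100 or to 1, which is part of why the ratio is easier to explain than the index.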

I had fun making the following chart, although many standard forms like the Bumps chart (i.e. slopegraph) or paired columns and so on also work.


This chart is influenced by Ed Tufte, who spent a good number of pages in his first book advocating stripping even the standard column chart to its bare essence. The chart also acknowledges the power of design to draw attention.



PS. Sorry I counted incorrectly. The chart has 36 dots not 40. 

Confuse, confuses, confused, confusing

Via Twitter, @Stoltzmaniac sent me this chart, from the Economist (link to article):


There is simply too much going on on the right side of the chart. The designer seems not to be able to decide which metric is more important, the cumulative growth rate of vehicles in use from 2005 to 2014, or the vehicles per 1,000 people in 2014. So both sets of numbers are placed on the chart, regrettably in close proximity.

In the meantime, the other components of the chart, such as the gridlines and the red line indicating 2005 = 100, are only relevant to the cumulative vehicle growth metric. Perhaps noticing the imbalance, the designer then paints the other data series in rainbow-colored boxes, and prints the label for this data series in a big white box. This decision tilts the chart towards the vehicles per capita metric, as our eyes now cannot help but stare at the white box.


There are really three trends: the growth in population, the growth in vehicles, and the resultant growth in vehicles per capita. They can all be accommodated in a small-multiples setting, as follows:


There are some curious angular trends revealed here. The German population somehow dipped into negative territory around 2007-8 but since then has turned around. Nigeria's vehicle growth declined sharply after 2006 so that the density of vehicles has stabilized.
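The third trend follows arithmetically from the first two: when vehicles and population are both indexed to the same base year, the vehicles-per-capita index is simply their quotient. Below is a minimal sketch of that step, with a function name of my own and made-up index values rather than the Economist's figures.

```python
def per_capita_index(vehicle_index: float, population_index: float) -> float:
    """Index of vehicles per capita, on the same base as the inputs (base year = 100)."""
    if population_index <= 0:
        raise ValueError("population_index must be positive")
    return 100.0 * vehicle_index / population_index


# Hypothetical indices for one country in the final year (base year = 100).
vehicles_final = 180.0
population_final = 120.0
density_final = per_capita_index(vehicles_final, population_final)
print(f"Vehicles per capita index: {density_final:.0f}")
```

If vehicles grow no faster than population, the quotient stays flat, which is what a stabilized density looks like in the third panel.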


If Clinton and Trump go to dinner, do they sit face to face, or side by side?

One of my students tipped me to an August article in the Economist, published when last the media proclaimed Donald Trump's campaign in deep water. The headline said "Donald Trump's Media Advantage Falters."

Who would have known, judging from the chart that accompanies the article?


There is something very confusing about the red line, showing "Trump August 2015 = 1." The data are disaggregated by media channel, and yet the index is hitched to the total of all channels. It is also impossible to figure out how Clinton is doing relative to Trump in each channel.

Here is a small-multiples rendering that highlights the key comparisons:


Alternatively, one can plot the Clinton advantage versus Trump in each channel, like this:


One sees that Clinton has caught up in the last month (July 2016), primarily through more coverage by "online news."

Imagine Mr. Trump and Mrs. Clinton dining at a restaurant. Are they seated side by side (Economist) or face to face (junkcharts)?

Graphical inequity ruins the chart

This Economist chart has a great concept but I find it difficult to find the story: (link)


I am a fan of color-coding the text as they have done here so that part is good.

The journalist has this neat idea of comparing those who are apathetic ("don't care about whether Britain is in or out") and those who are passionate ("strongly prefer" that Britain is either in or out).

The chosen format suffers because of graphical inequity: the countries are sorted by decreasing apathy, which means it is challenging to figure out the degree of passion.

This chosen order is unrelated to the question at hand. One possible way of interpreting the chart is to compare individual countries against the European average. The journalist also recognizes this, and highlighted the Euro average.

The problem is that there are two different averages and no good way to decide whether a particular country is above or below average.

Here is my version of the chart:


The biggest change is to create the new metric: how many people say they really care about Brexit/Bremain for every person who says they don't care. In Britain, over four people really care for each one who doesn't while in Slovenia, you can only find fewer than half a person who really cares for each one who doesn't.
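The metric is a simple quotient of two survey shares, and sorting by it puts every country on a single scale. Here is a sketch in Python, assuming each country comes with one "strongly prefer" share and one "don't care" share. The country names and percentages are placeholders, not the poll results.

```python
def care_ratio(strongly_prefer_pct: float, dont_care_pct: float) -> float:
    """Number of people with a strong preference per person who does not care."""
    if dont_care_pct <= 0:
        raise ValueError("dont_care_pct must be positive")
    return strongly_prefer_pct / dont_care_pct


# Placeholder survey shares: (strongly prefer %, don't care %).
survey = {
    "Country A": (60.0, 15.0),
    "Country B": (20.0, 50.0),
}

ratios = {country: care_ratio(*shares) for country, shares in survey.items()}

# Sort from most passionate to most apathetic.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
for country, ratio in ranked:
    print(f"{country}: {ratio:.1f} who really care per person who doesn't")
```

A ratio above one means passion outweighs apathy, and a ratio below one means the opposite, so there is a single natural reference line instead of two averages.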



Confusion is not limited to complex dataviz

This chart looks simple and harmless but I find it disarming.


I usually love the cheeky titles in the Economist but this title is very destructive to the data visualization. The chart has nothing to do with credit scores. In fact, credit scoring is associated with consumers while countries have credit ratings.

Also, I am not a fan of the Economist way of labeling negative axes. The negative sign situated between 0 and 1 looks like a stray hyphen that the editor missed.

A line chart would have brought out the pattern more sharply:


The pairing of columns in the original chart signals that readers should compare GDP growth to population growth. A good point, since GDP scales with population.

Controlling for population size can be accomplished by the per-capita GDP growth rate.
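For anyone who wants to check the arithmetic, per-capita growth can be derived from the two growth rates already on the chart. This is a minimal sketch, with a function name of my own and made-up growth rates rather than Puerto Rico's actual figures.

```python
def per_capita_growth(gdp_growth: float, population_growth: float) -> float:
    """Growth rate of GDP per capita, given growth rates as decimals (0.02 = 2%).

    GDP per capita is GDP divided by population, so its growth factor is the
    quotient of the two growth factors.
    """
    if population_growth <= -1.0:
        raise ValueError("population_growth must be greater than -100%")
    return (1.0 + gdp_growth) / (1.0 + population_growth) - 1.0


# Hypothetical year: GDP shrinks by 1% while population shrinks by 2%.
example = per_capita_growth(-0.01, -0.02)
print(f"Per-capita GDP growth: {example:.2%}")
```

For small rates, the result is close to the simple difference between GDP growth and population growth, so a shrinking economy can still show rising per-capita output if the population shrinks faster.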


The last three years are clearly different. By this metric, different in a good way.

This chart creates a problem for the journalist. The article is about the deal to "save" Puerto Rico which some have criticized as colonial. Presumably, the territory has been in dire straits. There are plenty of metrics to illustrate this point but GDP growth is not one of them.

Batmen not as interesting as it seems

When this post appears, I will be on my way to Seattle. Maybe I will meet some of you there. You can still register here.

I held onto this tip from a reader for a while. I think it came from Twitter:


The Economist found a fun topic but what's up with the axis not starting at zero?

The height x weight gimmick seems cool but on second thought, weight is not the same as girth so it doesn't make much sense!

In the re-design, I use bubbles to indicate weight and vertical location to indicate height. The data aren't as interesting as one might think. All the actors pretty much stayed true to the comic-book ideal, with Adam West being the closest. I also changed the order of the actors.


I left out the Lego, as it creates a design challenge that does not justify the effort.



Raw data and the incurious

The following chart caught my eye when it appeared in the Wall Street Journal this month:


This is a laborious design; much sweat has been poured into it. It's a chart that requires the reader to spend time learning to read.

A major difficulty for any visualization of this dataset is keeping track of the two time scales. One scale, depicted horizontally, traces the dates of Fed meetings. These meetings seem to occur four times a year except in 2012. The other time scale is encoded in the colors, explained above the chart. This is the outlook by each Fed committee member of when he/she expects a rate hike to occur.

I find it challenging to understand the time scale in discrete colors. Given that time has an order, my expectation is that the colors should be ordered. Adding to this mess is the correlation between the two time scales. As time treads on, certain predictions become infeasible.

Part of the problem is the unexplained vertical scale. Eventually, I realize each cell is a committee member, and there are 19 members, although two or three routinely fail to submit their outlook in any given meeting.

Contrary to expectation, I don't think one can read across a row to see how a particular member changes his/her view over time. If each row tracked a single member, the patches of color would not hold together as neatly as they do.


After this struggle, all I wanted was some learning from this dataset. Here is what I came up with:


There is actually little of interest in the data. The most salient point is that a shift in view occurred back in September 2012 when enough members pushed back the year of rate hike that the median view moved from 2014 to 2015. Thereafter, there is a decidedly muted climb in support for the 2015 view.


This is an example in which plotting elemental data backfires. Raw data is the sanctuary of the incurious.



Circular but insufficient

One of my students analyzed the following Economist chart for her homework.


I was looking for it online, and found an interactive version that is a bit different (link). Here are three screen shots from the online version for years 2009, 2013 and 2018. The first and last snapshots correspond to the years depicted in the print version.


The online version is the self-sufficiency test for the print version. In testing self-sufficiency, we want to see if the visual elements (i.e. the circular sectors on the print version) pull their own weight. The quick answer is no. The reader can't tell how much sales are represented in each sector, nor can they reliably estimate the relative scales of print versus ebook (pink/red vs yellow/orange) or year-to-year growth rates.

As usual, when we see the entire data set printed on the chart itself, it is a giveaway that the visual elements are mere ornaments.

The online version does not have labels unless you hover over the hemispheres. But again it is a challenge to learn anything from the picture.

In the Trifecta checkup, this is a Type V chart.


This particular dataset is made for the bumps-style chart:





The missing Brazil effect, and BYOC charts

Announcement: I'm giving a free public lecture on telling and finding stories via data visualization at NYU on 7/15/2014. More information and registration here.


The Economist states the obvious, that the current World Cup is atypically high-scoring (or poorly defended, for anyone who's never been bothered by the goal count). They dubiously dub it the Brazil effect (link).

Perhaps in a sly vote of dissent, the graphic designer came up with this effort:


(Thanks to Arati for the tip.)

The list of problems with this chart is long but let's start with the absence of the host country and the absence of the current tournament, both conspiring against our ability to find an answer to the posed question: did Brazil make them do it?


Turns out that without 2014 on the chart, the only other year in which Brazil hosted a tournament was 1950. But 1950 is not even comparable to the modern era. In 1950, there was no knock-out stage. They had four groups in the group stage but divided into two groups of four, one group of three and one group of two. Then, four teams were selected to play a round-robin final stage. This format is so different from today's format that I find it silly to try to place them on the same chart.

These data simply provide no clue as to whether there is a Brazil effect.


The chosen design is a homework assignment for the fastidious reader. The histogram plots the absolute number of drawn matches. The number of matches played has tripled from 16 to 48 over those years so the absolute counts are highly misleading. It's worse than nothing because the accompanying article wants to make the point that we are seeing fewer draws this World Cup compared to the past. The visual presents exactly the opposite message! (Hint: Trifecta Checkup)

Unless you realize this is a homework assignment. You can take the row of numbers listed below the Cup years and compute the proportion of draws yourself. BYOC (Bring Your Own Calculator). Now, pay attention because you want to use the numbers in parentheses (the number of matches), not the first number (that of teams).

Further, don't get too distracted by the typos: in both 1982 and 1994, there were 24 teams playing, not 16 or 32. The number of matches (52 in each case) is correctly stated.


Wait, the designer provides the proportions at the bottom of the chart, via this device:


As usual, the bubble chart does a poor job conveying the data. I deliberately cropped out the data labels to demonstrate that the bubble element cannot stand on its own. This element fails my self-sufficiency test.


I find the legend challenging as well. The presentation should be flipped: look at the proportion of ties within each round, instead of looking at the overall proportion of ties and then breaking those ties by round.

The so-called "knockout round" has many formats over the years. In early years, there were often two round-robin stages, followed by a smaller knockout round. Presumably the second round-robin stage has been classified as "knockout stage".

Also notice the footnote, stating that third-place games are excluded from the histogram. This is exactly how I would do it too because the third-place match is a dead rubber, in which no rational team would want to play extra-time and a penalty shootout.

The trouble is inconsistency. The number of matches shown underneath the chart includes that third-place match so the homework assignment above actually has a further wrinkle: subtract one from the numbers in parentheses. The designer gets caught in this booby trap. The computed proportion of draws displayed at the bottom of the chart includes the third-place match, at odds with the histogram.
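To make the homework assignment concrete, here is a sketch of the calculation with the third-place adjustment built in. The function name and the sample numbers are my own placeholders, not figures taken from the chart.

```python
def draw_proportion(
    draws_excl_third_place: int,
    matches_listed: int,
    has_third_place_match: bool = True,
) -> float:
    """Proportion of drawn matches, on the same basis as the histogram.

    The histogram excludes the third-place match, so when the listed match
    count includes it, one match is subtracted from the denominator.
    """
    matches = matches_listed - 1 if has_third_place_match else matches_listed
    if matches <= 0:
        raise ValueError("there must be at least one match after the adjustment")
    if not 0 <= draws_excl_third_place <= matches:
        raise ValueError("draws must be between 0 and the number of matches")
    return draws_excl_third_place / matches


# Placeholder tournament: 10 draws, 52 matches listed (third-place match included).
consistent = draw_proportion(10, 52)
# What you get if you forget the adjustment and divide by the listed count.
unadjusted = 10 / 52
print(f"Adjusted: {consistent:.1%}, unadjusted: {unadjusted:.1%}")
```

The difference is small in any single year, but the point is consistency: the numerator and the denominator should count the same set of matches.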


Here is a revised version of the chart:



A few observations are in order:

  • The proportion of ties has been slowly declining over the last few Cups.
  • The drop in proportion of ties in 2014 is not drastic.
  • While the proportion of ties has dropped in the 2014 World Cup, the proportion of 0-0 ties has increased. (The gap between the two lines shows the ties with goals.)
  • In later rounds, since the 1980s, the proportion of ties has been fairly stable, between 20 and 35 percent.

Another reason for separate treatment is that the knockout stage had not started yet in 2014 when this chart was published. Instead of removing all of 2014, as the Economist did, I can include the group stage for 2014 but exclude 2014 from the knockout round analysis.

In the Trifecta Checkup, this is Type DV. The data do not address the question being posed, and the visual conveys the wrong impression.


Finally, there is one glaring gap in all of this. Some time ago (the football fans can fill in the exact timing), FIFA decided to award three points for a win instead of two. This was a deliberate effort to increase the point differential between winning and drawing, supposedly to reduce the chance of ties. Any time-series exploration of the frequency of ties would clearly have to look into this issue.