People flooded this chart presented without comment with lots of comments

The recent election in Italy has resulted in some dubious visual analytics. A reader sent me this Excel chart:

Italy_elections_RDC-M5S

In brief, an Italian politician (trained as a PhD economist) used the graph above to make a point that support of the populist Five Star party (M5S) is highly correlated with poverty - the number of people on RDC (basic income). "Senza commento" - no comment needed.

Except a lot of people noticed the idiocy of the chart, and ridiculed it.

The chart appeals to those readers who don't spend time understanding what's being plotted. They notice two lines that show similar "trends" which is a signal for high correlation.

It turns out the signal in the chart isn't found in the peaks and valleys of the "trends". It is tempting to observe that when the blue line peaks (Campania, Sicilia, Lazio, Piemonte, Lombardia), the orange line also pops.

But look at the vertical axis. He's plotting the number of people, rather than the proportion of people. Population varies widely between Italian provinces. The five mentioned above all have over 4 million residents, while the smaller ones such as Umbria, Molise, and Basilicata have under 1 million. Thus, so long as the number of people, not the proportion, is plotted, no matter what demographic metric is highlighted, we will see peaks in the most populous provinces.
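Here is a minimal sketch of that point in Python. The populations and per-capita rates are made up for illustration (they are not the Italian data). The two rates are deliberately set up to be negatively related, yet the raw counts still rise and fall together because both are scaled by population.

```python
# Hypothetical regions: (population, rate of metric A, rate of metric B).
# All numbers are invented; per capita, A and B are negatively related.
regions = {
    "Big1":   (10_000_000, 0.10, 0.22),
    "Big2":   (6_000_000, 0.12, 0.18),
    "Big3":   (5_000_000, 0.08, 0.20),
    "Small1": (900_000, 0.12, 0.20),
    "Small2": (600_000, 0.08, 0.22),
    "Small3": (300_000, 0.10, 0.18),
}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rates_a = [a for _, a, _ in regions.values()]
rates_b = [b for _, _, b in regions.values()]
counts_a = [pop * a for pop, a, _ in regions.values()]
counts_b = [pop * b for pop, _, b in regions.values()]

corr_rates = pearson(rates_a, rates_b)     # relationship per capita
corr_counts = pearson(counts_a, counts_b)  # relationship in raw counts

print(f"correlation of rates:  {corr_rates:.2f}")
print(f"correlation of counts: {corr_counts:.2f}")
```

With these invented numbers, the rates are negatively correlated while the counts are strongly positively correlated, which is the trap the chart falls into.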

***

The other issue with this line chart is that the "peaks" are completely contrived. That's because the items on the horizontal axis do not admit a natural order. This is NOT a time-series chart, for which there is a canonical order. The horizontal axis contains a set of provinces, which can be ordered in whatever way the designer wants.

The following shows how the appearance of the lines changes as I select different metrics by which to sort the provinces:

Redo_italianelections_m5srdc_1

This is the reason why many chart purists frown on people who use connected lines with categorical data. I don't like this hard rule, as my readers know. In this case, I have to agree the line chart is not appropriate.
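The dependence of the "peaks" on the sort order can be sketched in a few lines of Python. The category names and values are hypothetical, and `count_peaks` is just an illustrative helper, not anything used to make the charts above.

```python
# Hypothetical values for six categories (not the actual election data).
values = {"A": 5, "B": 40, "C": 12, "D": 33, "E": 8, "F": 21}

def count_peaks(series):
    """Count interior points that are higher than both neighbours."""
    return sum(
        1
        for i in range(1, len(series) - 1)
        if series[i - 1] < series[i] > series[i + 1]
    )

# Same data, two different orderings of the horizontal axis.
alphabetical = [values[k] for k in sorted(values)]
by_value = sorted(values.values(), reverse=True)

print(alphabetical, "->", count_peaks(alphabetical), "peaks")
print(by_value, "->", count_peaks(by_value), "peaks")
```

The same six numbers show two peaks in alphabetical order and none when sorted by value, so the peaks are a property of the ordering, not of the data.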

***

So, where is the signal on the line chart? It's in the ratio of the heights of the two values for each province.

Redo_italianelections_m5srdc_2

Here, we find something counter-intuitive. I've highlighted two of the peaks. In Sicilia, about the same number of people voted for Five Star as there are people who receive basic income. In Lombardia, more than twice the number of people voted for Five Star as there are people who receive basic income. 

Now, Lombardy is where Milan is, essentially the richest province in Italy while Sicily is one of the poorest. Could it be that Five Star actually outperformed their demographics in the richer provinces?
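The ratio being read off the chart is simple to compute. In this sketch the figures are illustrative only, chosen to match the rough ratios described above (about 1:1 in Sicilia, more than 2:1 in Lombardia); they are not the actual vote or RDC counts.

```python
# Illustrative figures only, picked to mimic the ratios described in the
# text. They are NOT the real election or RDC numbers.
data = {
    "Sicilia":   {"m5s_votes": 500_000, "rdc_recipients": 500_000},
    "Lombardia": {"m5s_votes": 450_000, "rdc_recipients": 200_000},
}

# Five Star votes per RDC recipient, by region.
ratios = {
    region: d["m5s_votes"] / d["rdc_recipients"] for region, d in data.items()
}

for region, ratio in ratios.items():
    print(f"{region}: {ratio:.2f} M5S votes per RDC recipient")
```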

***

Let's approach the politician's question systematically. He's trying to say that the Five Star movement appeals especially to poorer people. He's chosen basic income as a proxy for poverty (this is like people on welfare in the U.S.). Thus, he's divided the population into two groups: those on welfare, and those not.

What he needs is the relative proportions of votes for Five Star among these two subgroups. Say, Five Star garnered 30% of the votes among people on welfare, and 15% of the votes among people not on welfare, then we have a piece of evidence that Five Star differentially appeals to people on welfare. If the vote share is the same among these two subgroups, then Five Star's appeal does not vary with welfare.

The following diagram shows the analytical framework:

Redo_italianelections_m5srdc_3

What's the problem? He doesn't have the data needed to establish his thesis. He has the total number of Five Star voters (which is the sum of the two yellow boxes) and he has the total number of people on RDC (which is the dark orange box).

Redo_italianelections_m5srdc_4

As shown above, another intervening factor is the proportion of people who voted. It is conceivable that the propensity to vote also depends on one's wealth.
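To see why the two totals cannot settle the question, consider this small Python sketch. The region, its size, and the vote shares are all hypothetical, and for simplicity it assumes everyone votes, which sets aside the turnout issue just mentioned.

```python
# Hypothetical region: 1,000,000 voters, of whom 200,000 receive RDC.
# Assumption for simplicity: everyone votes (turnout is ignored).
N_RDC, N_NON_RDC = 200_000, 800_000

def totals(share_rdc, share_non_rdc):
    """Return what the chart's data reveals: (Five Star votes, RDC recipients)."""
    votes = round(N_RDC * share_rdc + N_NON_RDC * share_non_rdc)
    return votes, N_RDC

# Scenario 1: Five Star appeals much more to RDC recipients.
scenario_1 = totals(share_rdc=0.60, share_non_rdc=0.20)
# Scenario 2: Five Star appeals equally to both groups.
scenario_2 = totals(share_rdc=0.28, share_non_rdc=0.28)

print(scenario_1, scenario_2)
```

Both scenarios produce the same two totals, so data of this kind cannot tell a strong welfare effect apart from no effect at all.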

So, in this case, fixing the visual will not fix the problem. Finding better data is key.


Putting a final touch on Bloomberg's terrific chart of social movements

My friend Rhonda D. wins a prize for submitting a good chart. This is Bloomberg's take on the current Supreme Court case on gay marriage (link). Their designer places this movement in the context of prior social movements such as women's suffrage and inter-racial marriage.

Bloomberg_pace_socialchange

Previously, I mentioned New York Times' coverage using "tile maps." While the Times places geography front and center, Bloomberg prefers to highlight the time scale. (In the bottom section of Bloomberg's presentation, they use tile maps as well.)

These are the little things I love about the graphic shown above:

  • The very long time horizon really allows us to see our own lifetime as a small section of the history of the nation
  • The gray upper envelope showing the size of the union is essential background data presented subtly
  • The inclusion of "prohibition" representing a movement that failed (I wish they had included more examples of movements that did not succeed)
  • The open circle and arrow indicators to differentiate between ongoing and settled issues

They should have let the movements finish by connecting the open circles to the upper envelope. Like this:

Redo_bloomberg_pace_socialchange_added2

This makes the steepness of the lines jump out even more. In addition, it makes a distinction between the movements that succeeded and the movement that failed. (Prohibition was repealed in 1933. The line between 1920 and 1933 could be more granular if such data are available.)

 


Playing with orientation and style

I saw this nifty chart in the Wall Street Journal last week. The Post Office is competing with FedEx and UPS on pricing. The nice feature about this small dataset is that the story is very clear. In almost every setting, the old USPS prices were higher than those of FedEx and UPS, but have been reduced to below those levels.

Wsj-postage

Below are a couple of different looks. I like the vertical scale for prices better. Long-time readers will know I prefer the second version with lines.

Redo_wsjpostage1

Redo_wsjpostage2


Exquisite chart by-of-for academics

This chart published in Harvard Magazine has won my heart.

Harvard_proftime

It is well executed in many ways. The chart illustrates a study of time spent by assistant and associate professors. It focuses specifically on time spent working versus time spent on household chores. One of the obvious questions of the study is whether female professors are disadvantaged when they have family obligations.

The general visual framework is the profile chart. Four segments of professors are arranged left to right from single with no children to married, with children and both parents working or single parent. The chart makes these points clear:

  • Having children adds about 15-30 hours to time spent on household duties, per partner
  • Household duties are not evenly split by gender, with the expected bias. (Of course, this observation must be carefully vetted. The men and women are not married to each other, even on the right side of the chart. But I presume the usual interpretation should hold.)
  • Male professors with kids do spend more time on household chores than those without but not as much as female professors with kids

In the meantime, the amount of time spent working is about the same for all four segments, raising a side question: what other activities got displaced? The juxtaposition of the lines allows us to see that the displaced hours are almost 50 percent of the total time spent working! What did they do less of?

I especially like the explicit depiction and labeling of the "gender gap" (the orange vertical lines). Also, the use of median hours instead of average hours.

My one little complaint is that the designer forgot to tell us the hours are on a weekly basis (I'm guessing here). Just adding "per week" after "median hours" would have fixed this.

***

One simple chart cannot address all possible questions on such a complicated subject. I like the restraint the designer exercised in not saddling the chart with too many questions.

I will just mention one tricky statistical issue. Getting tenure and making babies are both activities that occur within some time window in a professor's life, if at all. So there is a survivorship bias. The professors who receive tenure drop out of the picture. If you are older, and still in the pool, you probably are less "accomplished" from the perspective of the tenure-granting process. The longer you stay in that pool, the more likely you will have gotten married and/or have children--thus, there is an age bias going from left to right, as well as a survivorship bias. This implies that the characteristics of the professors in the four groups are likely to be different not just on their marital and child-rearing statuses but also on age and probability of tenure.


Visualizing uneven distributions

Jeff, a reader of the blog, asks for comment on this blog post of his (link).

The highlight of the post is this chart, which shows an uneven distribution.

Chandoo_Did-you-just-chart_more-segmentation-Excel-2010

The message of the chart is that a large amount of donations (about 25%) came from the top 3 percent of donors. This is a long-tailed distribution, and quite typical of much data that have to do with financial matters. Thus, it is a general problem as many of us encounter this type of data.
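A statistic of the form "the top X% of donors give Y% of the money" comes from sorting and accumulating. Here is a minimal Python sketch; the donation amounts are invented to give a long tail, and `top_share` is a hypothetical helper, not something from Jeff's post.

```python
def top_share(amounts, top_fraction):
    """Share of the total contributed by the largest `top_fraction` of donors."""
    ranked = sorted(amounts, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # number of top donors
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical donations: 3 large donors, 17 medium, 80 small.
donations = [1000] * 3 + [200] * 17 + [70] * 80

share = top_share(donations, 0.03)
print(f"Top 3% of donors give {share:.0%} of the total")
```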

One of the insights from Jeff's post is that with some tricks, one can generate a chart that looks like the above using Excel. This is pretty impressive, and he credits Peltier for the pointer.

***

Now, let's see if there are other ways to present this data. One issue I have with the chart is that the most important statistics are found in the text labels. These are of the form: "X% of customers contribute Y% of revenues". So, in effect, there are two relevant data series, one for the share of people and one for the share of revenues.

The following is a stacked column chart:

Redo_chandoo1

Here, the information is primarily encoded in the dotted guide lines between the two columns. It has the advantage of showing both the absolute share of people as well as of revenues, plus showing the uneven distribution between the two data series.

But it is also less fun to look at. The advantage of the original chart is that one can imagine that all the donors are being lined up along the horizontal axis from those who gave the least to those who gave the most. That's a pretty powerful mental picture. The weakness of the original is that few of us can mentally tally up the strangely shaped areas to learn the share of revenues.

***

The next version is a kind of profile chart:

Redo_chandoo2

I like this one because it places the two data series on equal footing, and allows for efficient comparison of the two sets of proportions. It also has the feature of showing all the shares, just like the stacked columns.

 PS. Jeff has taken some of his readers' comments into account, and has evolved his original design to this one:

Chandoo_Did you just chart_Redux 3

I can see these changes:

  • Customers are ordered with the most important on the left and the least on the right. To me, a neutral change
  • The vertical axis is labelled "subscription value" instead of "How much do we get for each subscription". This is a slight improvement, using fewer words to convey the same point.
  • The breakpoints have been set differently to split the revenues into five so that each segment now accounts for exactly 20% of the revenues. I actually prefer the original segmentation -- that one visually picks out the breakpoints in the data, thus it is empirical rather than canonical. Look at the split between the gray and the yellow segments in the new chart. Does it make sense to split customers with the same subscription value into two groups?
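The last point, that fixed revenue-share breakpoints can cut through a group of identical customers, can be sketched in Python. The subscription values are hypothetical, and the segmenting rule here is my own simple assumption (cut at every 20% of cumulative revenue), not necessarily how Jeff built his chart.

```python
# Hypothetical subscription values, sorted from highest to lowest.
values = [50, 30, 20, 20, 20, 20, 10, 10, 10, 10]

def equal_revenue_segments(values, n_segments=5):
    """Assign each customer to a segment by cumulative revenue share."""
    total = sum(values)
    segments, running = [], 0
    for v in values:
        # The segment depends on the cumulative revenue *before* this
        # customer; integer arithmetic avoids floating-point edge cases.
        segments.append(min(running * n_segments // total, n_segments - 1))
        running += v
    return segments

segments = equal_revenue_segments(values)
for v, s in zip(values, segments):
    print(f"value {v:>3} -> segment {s}")
```

In this toy example the four customers paying 20 end up in two different segments, even though nothing distinguishes them.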

Seats half full or half empty

Kevin Drum shows the following graphic (link) to illustrate where the House stood on authorizing force in Syria.

What interests me is whether the semi-circle concept adds to the chart. It evokes the physical appearance of a chamber, presumably where such a debate has taken place -- although most televised hearings tend to exhibit lots of empty seats.

Kdrum_syria

The half-filled circles in particular do not sit well with me.

Here is a tree map of the same data.

Redo_drumsyria

Notice that legend boxes are unnecessary.

A pie chart with appropriate labeling acts similarly.

***

A profile chart produces mixed results:

Redo_drumyria_2b

This version has the advantage of stacking the voting variable. It doesn't do a good job describing future scenarios.


Hate the defaults

One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.

Here are the highlights of his piece.

***

He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.

Schwabish_bls1

 

Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.

Redo_schwabishbls1

 The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.

Redo_schwabishbls2

Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).

This version is considerably cleaner than the original.

***

I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.

Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.

 


Beautiful spider loses its way 2

A double post today.

In the previous post, I talked about NFL.com's visualization of football player statistics. In this post, I offer a few different views of the same data.

***

The first is a dot plot arranged in small multiples.

Redo_nflspider

Notice that I have indexed every metric against the league average. This is shown in the first panel. I use a red dot to warn readers that the direction of this metric is opposite to the others (left of center is a good thing!)
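For readers who want to try the indexing step themselves, here is a minimal Python sketch. The player names and statistics are hypothetical (not the NFL figures), and I am assuming the simplest form of index: each value divided by the league average, times 100.

```python
# Hypothetical per-player statistics (not the actual NFL figures).
players = {
    "Player A": {"pass_yards": 3000, "touchdowns": 20, "interceptions": 10},
    "Player B": {"pass_yards": 4000, "touchdowns": 30, "interceptions": 16},
    "Player C": {"pass_yards": 2000, "touchdowns": 10, "interceptions": 4},
}

metrics = ["pass_yards", "touchdowns", "interceptions"]

# League average of each metric across all players in the table.
league_avg = {
    m: sum(p[m] for p in players.values()) / len(players) for m in metrics
}

# An index of 100 means "exactly league average" on that metric.
# For interceptions, a value below 100 is the good direction.
indexed = {
    name: {m: 100 * stats[m] / league_avg[m] for m in metrics}
    for name, stats in players.items()
}

for name, idx in indexed.items():
    print(name, {m: round(v) for m, v in idx.items()})
```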

You can immediately make a bunch of observations:

  • Alex Smith was quite poor, except for interceptions.
  • Colin Kaepernick had similar passing statistics as Smith. His only advantage over Smith was the rushing.
  • Joe Flacco, as we noted before, is as average as it goes (except for rushing yards).
  • Tyrod Taylor is here to remind us that we have to be careful about backup players being included in the same analysis.

***

The second version is a heatmap.

This takes inspiration from the fact that any serious reader of the spider chart will be reading the eight spokes (dimensions) separately. Why not plot these neatly in columns and use color to help us find the best and worst?

Redo_nfl_stats2

Imagine this to be a large table with as many rows as there are quarterbacks. You will be able to locate the red (hot) zones quickly. You can also scan across a row to understand that player's performance relative to the average, on every metric.

I like this visualization best, primarily because it scales beautifully.

***

The final version is a profile chart, sometimes called a parallel coordinates plot. While I am an advocate of profile charts, they really only work when you have a small number of things to compare.

  Redo_nflspider3

 

 

 


More power brings more responsibility

Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)

Mlbsalaries

This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.

Sorting the bars by total salary would be a start.

The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.

Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.

***

This is the standard stacked bar chart showing the distribution of salary cap usage by team:

Tableau_mlbsalaries

 I have never understood the appeal of stacking data. It's not easy to compare the middle segments.

After quite a bit of work, I arrived at the following:

Redo_mlbsalaries

The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto, and these teams spread the wealth among the D, F, and M players while not spending much on goalie and "others". On the other hand, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.

Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.) DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.

My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
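The conversion from dollar amounts to proportions of each team's total cap is the step that makes teams comparable. A minimal Python sketch follows; the team names and amounts are hypothetical (not the MLS data), and `to_shares` is an illustrative helper.

```python
# Hypothetical cap spending by position (in $ thousands), not actual MLS data.
team_spend = {
    "Team X": {"D": 900, "F": 1000, "M": 800, "GK": 200, "Other": 100},
    "Team Y": {"D": 600, "F": 500, "M": 1200, "GK": 300, "Other": 400},
}

def to_shares(spend):
    """Convert dollar amounts to each position's share of the team's total."""
    total = sum(spend.values())
    return {pos: amount / total for pos, amount in spend.items()}

shares = {team: to_shares(spend) for team, spend in team_spend.items()}

for team, s in shares.items():
    print(team, {pos: f"{v:.0%}" for pos, v in s.items()})
```

Once every team is expressed as a set of shares that sum to one, teams with similar allocation patterns can be grouped regardless of how much they spend in total.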

***

I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.

Redo_mlbsalaries_bar

In modern software (I'm using JMP's Graph Builder here), it's only one click to go from line to bar, and one click to go to pie.

Redo_mlbsalaries_pie

 


Figuring out the location (of the data)

When we visualize data, we want to expose the information contained within, or to use the terminology Nate Silver popularized, to expose the signal and leave behind the noise.

When graphs are not done right, sometimes they manage to obscure the information.

Reader John H. found a confusing bar chart while studying a paper (link to PDF) in which the authors compared two algorithms used to determine the position of Wi-Fi access points under various settings.

Wcl_error

The first reaction might be that the researchers are telling us there is no information here. The most important variable on this chart is what they call "datanum", and it goes from left to right across the page. A casual glance across the page gives the idea that nothing much is going on.

Then you look at the row labels, and realize that this dataset is very well structured. The target variable (AP Position Error) is compared along four dimensions: datanum, the algorithm (WCL, or GPR+WCL), the number of access points, and the location of these access points (inner, boundary, all).

When the data has a nice structure, there should be better ways to visualize it.

John submitted a much improved version, which he created using ggplot2.

Redo_wcl_facetted-version

This is essentially a small multiples chart. The key differences between the two charts are:

  • Giving more dimensions a chance to shine
  • Spacing the "datanum" proportional to the sample size (we think "datanum" means the number of sample readings taken from each access point)
  • Using a profile chart, which also allows the y-axis to start from 2
  • One color versus six colors, and no chartjunk
  • Using fewer decimal points

When you read this chart, you finally realize that the experiment has yielded several insights:

  1. Increasing sample size does not affect aggregate WCL error rate but it does reduce the error rate of aggregate GPR+WCL.
  2. The improvement of GPR+WCL comes only from the inner access points.
  3. The WCL algorithm performs really well in inner access points but poorly in outer access points.
  4. The addition of GPR to the WCL algorithm improves the performance of outer access points but deteriorates the performance of inner access points. (In aggregate, it improves the performance... this is only because there are almost two outer access points to every inner access point.)

Now, I don't know anything about this position estimation problem. The chart leaves me wondering why they don't just use WCL on inner access points. The performance under that setting is far and away the best of all the tested settings.

***

The researchers described their metric as AP Position Error (2drms, 95% confidence). I'm not sure what they mean by that because when I see 95% confidence, I am expecting to see the confidence band around the point estimates being shown above.

And yet, the data table shows only point estimates -- in fact, estimates to two decimal places of precision. In statistics, the more precision you have, the less confidence.
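One way to read that last remark: for the same data, a narrower (more precise) interval goes with a lower confidence level. The Python sketch below illustrates this under a normal approximation. The standard error of 1.0 is hypothetical, and nothing here reproduces the paper's 2drms calculation.

```python
from statistics import NormalDist

def half_width(std_error, confidence):
    """Half-width of a normal-approximation interval at a confidence level."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * std_error

std_error = 1.0  # hypothetical standard error of some estimate

for confidence in (0.50, 0.80, 0.95, 0.99):
    width = half_width(std_error, confidence)
    print(f"{confidence:.0%} confidence: +/- {width:.2f}")
```

The interval widens as the confidence level rises, so a tightly stated estimate can only be backed with less confidence unless more data are collected.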