Putting a final touch on Bloomberg's terrific chart of social movements

My friend Rhonda D. wins a prize for submitting a good chart. This is Bloomberg's take on the current Supreme Court case on gay marriage (link). Their designer places this movement in the context of prior social movements such as women's suffrage and inter-racial marriage.

Bloomberg_pace_socialchange

Previously, I mentioned the New York Times' coverage using "tile maps." While the Times places geography front and center, Bloomberg prefers to highlight the time scale. (In the bottom section of Bloomberg's presentation, they use tile maps as well.)

These are the little things I love about the graphic shown above:

  • The very long time horizon really allows us to see our own lifetime as a small section of the history of the nation
  • The gray upper envelope showing the size of the union is essential background data presented subtly
  • The inclusion of "prohibition" representing a movement that failed (I wish they had included more examples of movements that do not succeed)
  • The open circle and arrow indicators to differentiate between ongoing and settled issues

They should have let the movements finish by connecting the open circles to the upper envelope. Like this:

Redo_bloomberg_pace_socialchange_added2

This makes the steepness of the lines jump out even more. In addition, it makes a distinction between the movements that succeeded and the movement that failed. (Prohibition was repealed in 1933. The line between 1920 and 1933 could be more granular if such data are available.)

 


Playing with orientation and style

I saw this nifty chart in the Wall Street Journal last week. The Post Office is competing with FedEx and UPS on pricing. The nice feature of this small dataset is that the story is very clear. In almost every setting, the old USPS prices were higher than those of FedEx and UPS, but have been reduced to below those levels.

Wsj-postage

Below are a couple of different looks. I like the vertical scale for prices better. Long-time readers will know I prefer the second version with lines.

Redo_wsjpostage1

Redo_wsjpostage2


Exquisite chart by-of-for academics

This chart published in Harvard Magazine has won my heart.

Harvard_proftime

It is well executed in many ways. The chart illustrates a study of time spent by assistant and associate professors. It focuses specifically on time spent working versus time spent on household chores. One of the obvious questions of the study is whether female professors are disadvantaged when they have family obligations.

The general visual framework is the profile chart. Four segments of professors are arranged left to right from single with no children to married, with children and both parents working or single parent. The chart makes these points clear:

  • Having children adds about 15-30 hours to time spent on household duties, per partner
  • Household duties are not evenly split by gender, with the expected bias. (Of course, this observation must be carefully vetted. The men and women are not married to each other, even on the right side of the chart. But I presume the usual interpretation should hold.)
  • Male professors with kids do spend more time on household chores than those without but not as much as female professors with kids

In the meantime, the amount of time spent working is about the same for all four segments, raising a side question: what other activities got displaced? The juxtaposition of the lines allows us to see that the displaced hours are almost 50 percent of the total time spent working! What did they do less of?

I especially like the explicit depiction and labeling of the "gender gap" (the orange vertical lines). Also, the use of median hours instead of average hours.

My one little complaint is that the designer forgot to tell us the hours are on a weekly basis (I'm guessing here). Just adding "per week" after "median hours" would have fixed this.

***

One simple chart cannot address all possible questions on such a complicated subject. I like the restraint the designer exercised in not saddling the chart with too many questions.

I will just mention one tricky statistical issue. Getting tenure and making babies are both activities that occur within some time window in a professor's life, if at all. So there is a survivorship bias. The professors who receive tenure drop out of the picture. If you are older, and still in the pool, you probably are less "accomplished" from the perspective of the tenure-granting process. The longer you stay in that pool, the more likely you will have gotten married and/or have children--thus, there is an age bias going from left to right, as well as a survivorship bias. This implies that the characteristics of the professors in the four groups are likely to be different not just on their marital and child-rearing statuses but also on age and probability of tenure.


Visualizing uneven distributions

Jeff, a reader of the blog, asks for comment on this blog post of his (link).

The highlight of the post is this chart, which shows an uneven distribution.

Chandoo_Did-you-just-chart_more-segmentation-Excel-2010

The message of the chart is that a large amount of donations (about 25%) came from the top 3 percent of donors. This is a long-tailed distribution, and quite typical of much data that have to do with financial matters. Thus, it is a general problem as many of us encounter this type of data.

One of the insights from Jeff's post is that with some tricks, one can generate a chart that looks like the above using Excel. This is pretty impressive, and he credits Peltier for the pointer.

***

Now, let's see if there are other ways to present this data. One issue I have with the chart is that the most important statistics are found in the text labels. These are of the form: "X% of customers contribute Y% of revenues". So, in effect, there are two relevant data series, one of the share of people and then the share of revenues.
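To make the two data series concrete, here is a minimal Python sketch that derives statements of the form "X% of customers contribute Y% of revenues" from a list of amounts. The function name `cumulative_shares` and the donation figures are invented for illustration; they are not taken from Jeff's post.

```python
def cumulative_shares(amounts):
    """Return (share_of_customers, share_of_revenue) pairs, largest givers first."""
    ordered = sorted(amounts, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    running = 0
    pairs = []
    for i, amount in enumerate(ordered, start=1):
        running += amount
        pairs.append((i / n, running / total))
    return pairs


# Hypothetical donations, made up for this example.
donations = [500, 200, 100, 50, 50, 25, 25, 20, 20, 10]
pairs = cumulative_shares(donations)
for people, revenue in pairs:
    print(f"top {people:.0%} of customers contribute {revenue:.0%} of revenues")
```

Each pair holds one point from each series, which is exactly what the alternative charts need to put side by side.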

The following is a stacked column chart:

Redo_chandoo1

Here, the information is primarily encoded in the dotted guide lines between the two columns. It has the advantage of showing both the absolute share of people as well as of revenues, plus showing the uneven distribution between the two data series.

But it is also less fun to look at. The advantage of the original chart is that one can imagine that all the donors are being lined up along the horizontal axis from those who gave the least to those who gave the most. That's a pretty powerful mental picture. The weakness of the original is that few of us can mentally tally up the strangely shaped areas to learn the share of revenues.

***

The next version is a kind of profile chart:

Redo_chandoo2

I like this one because it places the two data series on equal footing, and allows for efficient comparison of the two sets of proportions. It also has the feature of showing all the shares, just like the stacked columns.

PS. Jeff has taken some of his readers' comments into account, and has evolved his original design to this one:

Chandoo_Did you just chart_Redux 3

I can see these changes:

  • Customers are ordered with the most important on the left and the least on the right. To me, a neutral change.
  • The vertical axis is labelled "subscription value" instead of "How much do we get for each subscription". This is a slight improvement, using fewer words to convey the same point.
  • The breakpoints have been set differently to split the revenues into five so that each segment now accounts for exactly 20% of the revenues. I actually prefer the original segmentation -- that one visually picks out the breakpoints in the data, thus it is empirical rather than canonical. Look at the split between the gray and the yellow segments in the new chart. Does it make sense to split customers with the same subscription value into two groups?
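The concern about equal-revenue breakpoints can be illustrated with a small Python sketch. It assumes customers are sorted from highest to lowest value and that a cut is placed wherever cumulative revenue first reaches each 20% mark; the helper `equal_revenue_cuts` and the subscription values are hypothetical, not Jeff's data.

```python
def equal_revenue_cuts(values, n_segments=5):
    """Sort values descending and return (ordered, cuts), where each cut is the
    number of customers needed to reach the next 1/n_segments share of revenue."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    cuts = []
    running = 0
    target = 1
    for i, value in enumerate(ordered, start=1):
        running += value
        # Integer comparison avoids floating-point trouble at exact boundaries.
        while target < n_segments and running * n_segments >= target * total:
            cuts.append(i)
            target += 1
    return ordered, cuts


# Hypothetical subscription values: the last ten customers all pay the same amount.
values = [20, 10, 10, 5, 5, 5, 5] + [4] * 10
ordered, cuts = equal_revenue_cuts(values)

# A cut "splits a tie" when the customers on either side of it have the same value.
tie_splits = [c for c in cuts if c < len(ordered) and ordered[c - 1] == ordered[c]]
print(cuts, tie_splits)
```

In this made-up data the first three cuts fall at natural breaks in the values, while the last one lands in the middle of a group of identical customers, which is the situation questioned above.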

Seats half full or half empty

Kevin Drum shows the following graphic (link) to illustrate where the House stood on authorizing force in Syria.

What interests me is whether the semi-circle concept adds to the chart. It evokes the physical appearance of a chamber, presumably where such a debate has taken place -- although most televised hearings tend to exhibit lots of empty seats.

Kdrum_syria

The half-filled circles in particular do not sit well with me.

Here is a tree map of the same data.

Redo_drumsyria

Notice that legend boxes are unnecessary.

A pie chart with appropriate labeling acts similarly.

***

A profile chart produces mixed results:

Redo_drumyria_2b

This version has the advantage of stacking the voting variable. It doesn't do a good job describing future scenarios.


Hate the defaults

One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.

Here are the highlights of his piece.

***

He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.

Schwabish_bls1

 

Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.

Redo_schwabishbls1

The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.

Redo_schwabishbls2

Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).

This version is considerably cleaner than the original.

***

I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.

Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.

 


Beautiful spider loses its way 2

A double post today.

In the previous post, I talked about NFL.com's visualization of football player statistics. In this post, I offer a few different views of the same data.

***

The first is a dot plot arranged in small multiples.

Redo_nflspider

Notice that I have indexed every metric against the league average. This is shown in the first panel. I use a red dot to warn readers that the direction of this metric is opposite to the others (left of center is a good thing!).
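For readers who want to try this at home, here is a minimal Python sketch of indexing a metric against the group average. It assumes the index is a simple ratio to the mean, scaled so that 100 means "average"; the post does not spell out the exact formula, and the player labels and yardage figures below are made up.

```python
def index_to_average(values):
    """Express each value relative to the group mean, where 100 = average."""
    mean = sum(values.values()) / len(values)
    return {name: 100 * value / mean for name, value in values.items()}


# Made-up passing yards for three hypothetical quarterbacks.
passing_yards = {"QB A": 3000, "QB B": 4000, "QB C": 5000}
indexed = index_to_average(passing_yards)
print(indexed)
```

Once every metric is on the same "100 = average" scale, the dots from different metrics can share one axis. A metric where lower is better still needs a visual flag, since the index itself does not reverse its direction.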

You can immediately make a bunch of observations:

  • Alex Smith was quite poor, except for interceptions.
  • Colin Kaepernick had passing statistics similar to Smith's. His only advantage over Smith was the rushing.
  • Joe Flacco, as we noted before, is as average as it goes (except for rushing yards).
  • Tyrod Taylor is here to remind us that we have to be careful about backup players being included in the same analysis.

***

The second version is a heatmap.

This takes inspiration from the fact that any serious reader of the spider chart will be reading the eight spokes (dimensions) separately. Why not plot these neatly in columns and use color to help us find the best and worst?

Redo_nfl_stats2

Imagine this to be a large table with as many rows as there are quarterbacks. You will be able to locate the red (hot) zones quickly. You can also scan across a row to understand that player's performance relative to the average, on every metric.

I like this visualization best, primarily because it scales beautifully.

***

The final version is a profile chart, sometimes called a parallel coordinates plot. While I am an advocate of profile charts, they really only work when you have a small number of things to compare.

  Redo_nflspider3

 

 

 


More power brings more responsibility

Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)

Mlbsalaries

This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.

Sorting the bars by total salary would be a start.

The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.

Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.

***

This is the standard stacked bar chart showing the distribution of salary cap usage by team:

Tableau_mlbsalaries

I have never understood the appeal of stacking data. It's not easy to compare the middle segments.

After quite a bit of work, I arrived at the following:

Redo_mlbsalaries

The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto, and these teams spread the wealth among the D, F, and M players while not spending much on goalie and "others". On the other hand, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.

Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.) DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.

My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
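The conversion from cap figures to proportions is simple enough to sketch in a few lines of Python. The function name `cap_shares`, the position labels and the dollar amounts are assumptions made up for this example; they are not the actual MLS figures.

```python
def cap_shares(spend_by_position):
    """Convert a team's cap spending by position into proportions of its total."""
    total = sum(spend_by_position.values())
    return {pos: amount / total for pos, amount in spend_by_position.items()}


# Hypothetical cap figures (in thousands of dollars) for one made-up team.
team = {"D": 600, "F": 750, "M": 900, "GK": 150, "Other": 600}
shares = cap_shares(team)
for pos, share in shares.items():
    print(f"{pos}: {share:.0%}")
```

Working with proportions puts every team on the same scale, so the profiles can be compared and grouped even when the absolute amounts differ.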

***

I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.

Redo_mlbsalaries_bar

In modern software (I'm using JMP's Graph Builder here), it's only one click to go from line to bar, and one click to go to pie.

Redo_mlbsalaries_pie

 


Figuring out the location (of the data)

When we visualize data, we want to expose the information contained within, or to use the terminology Nate Silver popularized, to expose the signal and leave behind the noise.

When graphs are not done right, sometimes they manage to obscure the information.

Reader John H. found a confusing bar chart while studying a paper (link to PDF) in which the authors compared two algorithms used to determine the position of Wi-Fi access points under various settings.

Wcl_error

The first reaction is that maybe the researchers are telling us there is no information here. The most important variable on this chart is what they call "datanum", and it goes from left to right across the page. A casual glance across the page gives the idea that nothing much is going on.

Then you look at the row labels, and realize that this dataset is very well structured. The target variable (AP Position Error) is compared along four dimensions: datanum, the algorithm (WCL, or GPR+WCL), the number of access points, and the location of these access points (inner, boundary, all).

When the data has a nice structure, there should be better ways to visualize it.

John submitted a much improved version, which he created using ggplot2.

Redo_wcl_facetted-version

This is essentially a small multiples chart. The key differences between the two charts are:

  • Giving more dimensions a chance to shine
  • Spacing the "datanum" proportional to the sample size (we think "datanum" means the number of sample readings taken from each access point)
  • Using a profile chart, which also allows the y-axis to start from 2
  • One color versus six colors, and no chartjunk
  • Using fewer decimal points

When you read this chart, you finally realize that the experiment has yielded several insights:

  1. Increasing sample size does not affect aggregate WCL error rate but it does reduce the error rate of aggregate GPR+WCL.
  2. The improvement of GPR+WCL comes only from the inner access points.
  3. The WCL algorithm performs really well in inner access points but poorly in outer access points.
  4. The addition of GPR to the WCL algorithm improves the performance of outer access points but deteriorates the performance of inner access points. (In aggregate, it improves the performance... this is only because there are almost two outer access points to every inner access point.)
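The aggregate effect in the last point is a matter of weighting, which a short Python sketch can show. It assumes the aggregate error behaves like a count-weighted average of the inner and outer errors, which may not be how the paper computes it, and the error values are invented rather than read off the chart.

```python
def aggregate_error(inner_error, outer_error, n_inner, n_outer):
    """Average error over all access points, weighting each group by its count."""
    return (inner_error * n_inner + outer_error * n_outer) / (n_inner + n_outer)


# Hypothetical errors, with two outer access points for every inner one.
wcl = aggregate_error(inner_error=2.0, outer_error=8.0, n_inner=1, n_outer=2)
gpr_wcl = aggregate_error(inner_error=4.0, outer_error=5.0, n_inner=1, n_outer=2)
print(wcl, gpr_wcl)
```

In this made-up case the inner error doubles when GPR is added, yet the aggregate still falls, because the outer points carry twice the weight.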

Now, I don't know anything about this position estimation problem. The chart leaves me wondering why they don't just use WCL on inner access points. The performance under that setting is far and away the best of all the tested settings.

***

The researchers described their metric as AP Position Error (2drms, 95% confidence). I'm not sure what they mean by that because when I see 95% confidence, I am expecting to see the confidence band around the point estimates being shown above.

And yet, the data table shows only point estimates -- in fact, estimates to two decimal places of precision. In statistics, the more precision you have, the less confidence.


Visualization as an analysis tool

Visualizing data has many uses. We often explore how charts can be used to convey data insights and tell stories. We talk less on this blog about how slicing and dicing data helps us form impressions about the structure of the data sets we're analyzing.

I have been digging around some payroll employment data recently. (You can find the data at the Bureau of Labor Statistics website.) I find the following two charts quite instructive.

The first one surfaces one type of recurring pattern: there is a seasonal pattern running from January to December that repeats every year. I use a small-multiples setup, with each chartlet indexed by year.

Seasonalfactor_monthly_by yeargroup

The second chart shows a different kind of regularity: there is a cyclical pattern running from 2002 to 2012, no matter which month we're looking at. Again, we have a small-multiples setup, this time with each chartlet indexed by month of the year.

Unadj_yeartoyeartrend bymonth

This second chart is a simple form of "seasonal adjustment". The data used in this plot are unadjusted. The chart shows that there is a larger cyclical pattern during the period of 2002-2012 that affects every month of the year.

I already hear grumbling about using a line chart when there is no continuity from one dot to the next. In this chart, in fact, time runs left to right, top to bottom, then starts again at the first chartlet, and so on. This is a profile chart. As the name suggests, we should be focused on the shape of the line. It doesn't have to have physical meaning; we are only looking for regularity.
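The two small-multiples layouts are the same data grouped two ways. Here is a minimal Python sketch of that regrouping, assuming the series is stored as a dictionary keyed by (year, month); the function name `split_panels` and the values are invented for illustration and are not BLS data.

```python
from collections import defaultdict


def split_panels(series, panel_by):
    """Group a {(year, month): value} series into small-multiples panels.

    panel_by="year" gives one panel per year with months along the x-axis;
    panel_by="month" gives one panel per month with years along the x-axis.
    """
    if panel_by not in ("year", "month"):
        raise ValueError("panel_by must be 'year' or 'month'")
    panels = defaultdict(list)
    for (year, month), value in series.items():
        if panel_by == "year":
            panels[year].append((month, value))
        else:
            panels[month].append((year, value))
    # Sort the points within each panel so lines are drawn left to right.
    return {key: sorted(points) for key, points in panels.items()}


# Made-up values for two years and three months.
series = {
    (2002, 1): 98, (2002, 2): 99, (2002, 3): 100,
    (2003, 1): 97, (2003, 2): 98, (2003, 3): 99,
}
seasonal = split_panels(series, "year")   # the within-year pattern
trend = split_panels(series, "month")     # the year-to-year pattern, by month
print(seasonal[2002], trend[1])
```

Each panel is a list of (x, y) points that can be handed to whatever plotting tool is at hand.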

***

Statisticians love to find this kind of regular pattern because it is easy to describe. Of course, most data are much messier.