Carl Bialik used to be the Numbers Guy at Wall Street Journal - he's now with FiveThirtyEight. Apparently, he left a huge void. John Eppley sent me to this set of charts via Twitter.
This chart about Citibike is very disappointing.
Using the Trifecta checkup, I first notice that it addresses a stale question and produces a stale answer. The caption below the chart says "the peak times ... seem to be around 9 am and 6 pm." What a shock!
I sense a degree of meekness in usnig "seem to be". There is not much to inspire confidence in the data: rather than the full statistics which you'd think someone at Citibike has, the chart is based on "a two-day sample last autumn". The number of days is less concerning than the question of whether those two autumn days are representative of the year. Curious readers might want to know what data was collected, how it was collected, and the sample size.
Finally, the graph makes a mess of the data. While the black line appears to be data-rich, it is not. In fact, the blue dots might as well be randomly scattered and connected. As you can see from the annotations below, the scale of the chart makes no sense.
Plus, the execution is sloppy, with a missing data label.
The next chart is not much better.
The biggest howler is the choice of pie charts to illustrate three numbers that are not that different.
But I have to say the chart raises more questions than it answers. I am not an expert in pregnancy but doesn't a pregnant woman's weight include the weight of the baby she's carrying? So the more weight the woman gains, on average, the heavier is her baby. What a shock!
The last and maybe the least is this chart about basketball players in the playoff.
It's the dreaded bubble chart. The players are arranged in a perplexing order. I wonder if there is a natural numbering system for basketball positions (center = #1, etc.), like there is in soccer. Even if there is such a natural numbering system, I still question the decision to confound that system with a complicated ranking of current-year playoff players against all-time players.
Above all, the question being asked is uninteresting, and so the chart is uninformative. A more interesting question to me is whether the best players are playing in this year's playoff. To answer this question, the designer should be comparing only currently active players, and showing the all-time ranks of those players who are playing in the playoffs versus those who aren't.
Dual axes are almost always a bad idea. But there is one situation under which I'd use it.
Last week, Alberto Cairo (link) engaged in a Twitter/blogging debate about a chart that first appeared in Reuters concerning the state of the woman CEO in the Fortune 500 companies. Here is the chart under discussion:
This chart already is cleaner and more useful than the original original, which came from a research report from Catalyst (link):
Jonathan Keller re-made the Reuters chart as follows:
Then Chris Moore, responding to Cairo, created this view and also left some insightful comments:
What's at stake here? There are really three related topics of discussion.
First, there is the matter of the upper limit of the vertical axis. Three solutions were suggested: 100 percent, 50 percent, and 4 percent. (Cairo at one point suggested 25 percent, which can be wrapped into the 50 percent bucket.) In reality, this is an argument over which of two key messages should be emphasized. The first message is that women still comprises a pathetically small proportion of Fortune 500 CEOs. The second message is more hopeful, that the growth in this proportion has been quite rapid since 1995.
All versions of the chart actually display both messages. In the Reuters chart (as well as Moore and Cairo), the message about the absolute proportion of women is given as an annotation while the Keller and Voila versions extend the vertical axis, thus encoding this message directly to the chart. Conversely, the Keller and Voila versions deemphasize the growth in proportions, and so I'd have preferred to see a note about that growth when using their versions.
Voila selectes a 50% upper limit because the 50/50 split has an intuitive meaning in the context of gender balance. Because the resulting chart is so visually arresting, and so biased to one of the two key messages, I'd only consider it if the point of the display is to draw attention to the female deficit.
The second disagreement is in using absolute counts versus relative proportions. Moore chose absolute counts. I am in this camp as well. This is primarily because we are talking about Fortune 500 and the 500 number is an idee fixe. In Moore's version, I find the data labels distracting since all the numbers are small and insignificant.
Finally, the linkage between the absolute and the relative numbers also produces multiple solutions. Cairo's post pinpoints this issue. His solution is to include an inset pie chart with an arrow to explicitly link the two views. Moore likes the inset idea, but experimented with a donut chart or a partition in place of the pie chart. He also removes the explicit guiding arrow.
It turns out this dataset is perfectly made for the dual axes. The absolute counts and relative proportions are in one to one correspondence because it's really only one data series expressed twice. This happy situation leads to one line that can be cross-referenced on two axes, one side showing counts and the other side showing proportions. This is shown in my version below (the orange line).
In addition to having two axes, I have plotted two related data series. The second series (in red) shows the incremental change in the number of women CEOs from the previous year (also shown in both counts and proportions).
The first series (the same one everyone plotted) draws attention to the first message, that the growth rate of women CEOs is quite strong since 1995. The second series is a bit of a downer on that message, suggesting that from the absolute count perspective, the progress (only one or two additions per year) has been painfully slow, and not that impressive.
Thanks again to Alberto for making me aware of this discussion. This has been fun!
PS. I have left out the other chart and may return to it in a future post.
Kevin Drum shows the following graphic (link) to illustrate where the House stood on authorizing force in Syria.
What interests me is whether the semi-circle concept adds to the chart. It evokes the physical appearance of a chamber, presumably where such a debate has taken place -- although most televised hearings tend to exhibit lots of empty seats.
The half-filled circles in particular do not make peace with me.
Sometimes I wonder if I should just become a chart doctor. Andrew recently wrote that journals should have graphical editors. Businesses also need those, judging from this submission through Twitter (@francesdonald). Link is here.
You don't know whether to laugh or cry at this pie chart:
The author of the article complains that all the tall buildings around the world are cheats: vanity height is defined as the height above which the floors are unoccupied. The sample proportions aren't that different between countries, ranging from 13% to 19% (of the total heights). Why are they added together to make a whole?
The following boxplot illustrates both the average and the variation in vanity heights by region, and tells a more interesting story:
Recall that in a boxplot, the gray box contains the middle 50% of the data and the white line inside the box indicates the median value. UAE has a tendency to inflate the heights more while the other three regions are not much different.
The other graphic included in the same article is only marginally better, despite a much more attractive exterior:
This chart misrepresents the actual heights of the buildings. At first glance, I thought there must be a physical limit to the number of occupied floors since the grayed out sections are equal heights. If the decision has been made to focus on the vanity height, then just don't show the rest of the buildings.
Also, it's okay to assume a minimal intelligence on the part of readers - I mean, is there a need to repeat the "non-occupiable height" label 10 times? Similarly, the use of 10 sets of double asterisks is rather extravagant.
Reader Steph G. didn't like the effort by WRAL (North Carolina) to visualize the demographics of protestors in Raleigh. It sounds like the citizens of NC are making their voices heard. Maybe my friends in Raleigh can give us some background.
There are definitely problems with the choice of charts. But I rate this effort a solid B. In the Trifecta Checkup, they did a good job describing the central question, as well as compiled an appropriate dataset. I love it when people go out to collect the right data rather than use whatever they could grab. The issue was the execution of the charts.
The first was a map showing where the arrested protestors came from.
Maps are typically used to show geographical distribution. The chosen color scheme (two levels of green and gray) compresses the data so much that we learn almost nothing about distribution. I clicked on Wake County to learn that there were 178 arrests there. The neighboring Randolph County had only 1 arrest but you can't tell from the colors.
The next chart shows the trend of arrests over time. I like the general appearance (except for the shadows). The problem is the even spacing of the columns when the gaps between the arrests are uneven.
Here's a quick redo, with proper spacing:
The final set of charts is inspired. They compare the demographics of those arrested protestors against the average North Carolina resident. For example:
For categories like Age with quite a few levels, the pie chart isn't a good choice. It's also hard to compare across pie charts. A column or dot chart works better.
Reader omegatron came back with another shocking instance of a pie chart:
Here is the link to the AVERT organization in the U.K. that published the chart and several others.
For the umpteenth time, the pie chart plots proportions. All proportions are percentages but some percentages are not proportions. The data here would appear to be "rate of diagnosis" rather than proportion of diagnoses by age.
The data came from Table 3a of this CDC report (link), and they are clearly labelled "Rate". The footnote even disclosed that the "Rate" is measured per 100,000 people so they are being mislabeled as percentages.
Let's summarize. The percentages add up to much more than 100%, they are clearly not proportions, they are not even percentages, they are rates per 100,000.
omegatron even got confused by the colors. You'd think that the slices would be arranged by age group but no! The order of the slices is by size of the pie slices, with one exception--the lime green slice of 11.4%, which I cannot explain. In practice, this means the order goes from Under 13 to 13-14 to Over 65 to 60-64 to 50-54, etc.
A smarter use of color here would be to stick to one color while varying the tinge acccording to the rate of diagnosis. Using 13 colors for 13 age groups is distracting.
As a teacher, it's shocking that such pie charts continue to see the light of day. It's very disappointing, as I'd assume every teacher who teaches the pie chart will have pointed out the pitfalls. Why is this happening?
With this chart, I'm mostly baffled by the top corner of the Trifecta Checkup. What is the point of this data? If I understand the "per 100,000 population" definition, these rates are computed as the number of diagnosed divided by the population in each age group. So the diagnosis rate is a function of how many people in each age group are actually infected, and how effective is the diagnosis procedures, and whether that effectiveness varies with age. Plus, the completeness of reporting by age group (the footnote acknowledged that the mathematical model does not account for incomplete reporting. To call a spade a spade, that means the model assumes complete reporting.)
The rate of diagnosis can be low because the rate of infection is low or the proportion of the infected who gets diagnosed is low. I just can't conceive of a use of data that confound these factors.
A time series treatment would be interesting althought that addresses a different question.
Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)
This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.
Sorting the bars by total salary would be a start.
The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.
Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.
This is the standard stacked bar chart showing the distribution of salary cap usage by team:
I have never understood the appeal of stacking data. It's not easy to compare the middle segments.
After quite a bit of work, I arrived at the following:
The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportion of total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto, and these teams spread the wealth among the D, F, and M players while not spending much on goalie and "others". On the other hand, Groups 2 and 3, especially Group 3 allocated 30-45% of the cap on the midfield.
Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.) DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.
My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.
In a modern software (I'm using JMP's Graph Builder here), it's only one click to go from line to bar, and one click to go to pie.