Dot plots are under-valued, that's all
Aug 30, 2016
Bar charts are over-used and over-rated. Just casually, I found this example at US News:
Are you comparing bar widths? Or the printed data?
Here is a dot plot:
The Times did a great job making this graphic (this snapshot is just the top half):
A lot of information is packed into a small space. It's easy to compose the story in our heads. For example, Lee Chong Wei, the Malaysian badminton silver medalist, was briefly suspended for doping during 2015, and he had finished second twice before the doping incident.
They sorted the athletes according to the recency of the latest suspension. This is very smart as it helps make the chart readable. Other common orderings, such as alphabetical by last name, by sport, by age, or by number of medals, would result in a bit of a mess.
I'm curious about the athletes who also had doping suspensions but did not win any medals in 2016.
I caught a dose of Alberto Cairo the other day. He has a good post about various Brexit/Bremain maps.
The story started with an editor of The Spectator, who went on Twitter to claim that the map on the right is better than someone else's map on the left:
There are two levels at which we should discuss these maps: the scaling of the data, and the mapping of colors.
The raw data are percentages based on counts of voters so the scale is decimal. In general, we discretize the decimal data in order to improve comprehension. Discretizing means we lose granularity. This is often a good thing. The binary map on the left takes the discretization to its logical extreme. Every district is classified as either Brexit (> 50% in favor) or Bremain (> 50% opposed). The map on the right uses six total groups (so three subgroups of Brexit and three subgroups of Bremain).
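To make the discretization step concrete, here is a minimal Python sketch. The cut points for the six-group scheme and the district percentages are made up for illustration; the post does not say which breaks either map actually uses.

```python
from bisect import bisect_right

def discretize(leave_share, edges):
    """Return the index of the bin that a Leave percentage falls into.

    A share exactly on a cut point lands in the upper bin.
    """
    return bisect_right(edges, leave_share)

BINARY_EDGES = [50]                 # two groups: Bremain (0) vs Brexit (1)
SIX_EDGES = [40, 45, 50, 55, 60]    # six groups; hypothetical cut points

shares = [38.2, 47.9, 50.4, 61.3]   # made-up district results (% Leave)

binary = [discretize(s, BINARY_EDGES) for s in shares]
six = [discretize(s, SIX_EDGES) for s in shares]

print(binary)  # [0, 0, 1, 1]
print(six)     # [0, 2, 3, 5]
```

The binary scheme treats 50.4% and 61.3% as the same thing, while the six-group scheme keeps them apart. That is the granularity being traded away.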
Then we deal with mapping of numbers to colors. The difference between these two maps is the use of hues versus shades. The binary map uses two hues, which is probably most people's choice since we are representing two poles. The map on the right uses multiple shades of one hue. Alternatively, Alberto favors a "diverging" color scheme in which we use three shades of two hues.
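The difference between the single-hue and the diverging scheme can be sketched in a few lines of Python. The function names, the 5-point step, and the 35% floor are assumptions chosen for illustration, not the breaks used by any of the maps discussed.

```python
def sequential_shade(leave_share, low=35.0, step=5.0, n_shades=6):
    """One hue: shade 1 (lightest) to n_shades (darkest) as the Leave share rises."""
    shade = int((leave_share - low) // step) + 1
    return max(1, min(shade, n_shades))

def diverging_shade(leave_share, step=5.0, n_shades=3):
    """Two hues: the hue says which side won, the shade says by how much."""
    hue = "Brexit" if leave_share > 50 else "Bremain"
    shade = int(abs(leave_share - 50) // step) + 1
    return hue, min(shade, n_shades)

for share in (38.2, 47.9, 50.4, 61.3):   # made-up district results (% Leave)
    print(share, sequential_shade(share), diverging_shade(share))
```

In this sketch, districts at 47.9% and 50.4% get neighboring shades (3 and 4) of the same hue under the sequential scheme, so the reader has to work out which side of the line each falls on. Under the diverging scheme they get different hues at the lightest shade.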
The editor of The Spectator claims that his map is more "true to the data." In my view, his statement applies in these two senses: the higher granularity in the scaling, and also, the fact that there is only one data series ("share of vote for Brexit") and therefore only one color.
The second point relates to polarity of the scale. I wrote about this issue before - related to a satisfaction survey designed (not too well) by SurveyMonkey, one of the major online survey software services. In that case, I suggested that they use a bipolar instead of unipolar scale. I'd rather describe my mood as somewhat dissatisfied instead of a little bit satisfied.
I agree with Alberto here in favor of bipolarity. It's quite natural to underline the Brexit/Bremain divide.
***
Given what I just said, why complain about the binary map?
We agree with the editor that higher granularity improves comprehension. We just don't agree on how to add granularity. Alberto tells his readers he likes the New York Times version:
This is substantively the same map as The Spectator's, except for 8 groups instead of 6, and two hues instead of one.
Curiously enough, I gave basically the same advice to the Times regarding their maps showing U.S. Presidential primary results. I noted that their use of two hues with no shades in the Democratic race obscures the fact that none of the Democratic primaries was a winner-take-all contest. Adding shading based on delegate votes would make the map more "truthful."
That said, I don't believe that the two improvements by the Times are sufficient. Notice that the Brexit referendum is one-person, one-vote. Thus, all of the maps above have a built-in distortion as the sizes of the regions are based on (distorted) map areas, rather than populations. For instance, the area around London is heavily Bremain but appears very small on this map.
The Guardian has a cartogram (again, courtesy of Alberto's post) which addresses this problem. Note that there is a price to pay: the shape of Great Britain is barely recognizable. But the outsized influence of London is properly acknowledged.
This one has two hues and four shades. For me, it is most "truthful" because the sizes of the colored regions are properly mapped to the vote proportions.
Seems like reader Conor H. has found a pattern. He alerted us to the problem with bar lengths in the daily medals chart on NBC, which I blogged about the other day.
Through Twitter (@andyn), I was sent the following, also courtesy of NBC:
This one is much harder to understand.
Reader Conor H. sent in this daily medals table at the NBC website:
He commented that the bars are not quite the right lengths. So even though China and Russia both won five total medals that day, the bar for China is slightly shorter.
One issue with the stacked bar chart is that the reader's attention is drawn to the components rather than the whole. However, as in this case, the most important statistic is the total number of medals.
Here is a different view of the data:
In a comment to the previous post, Evan pointed to this Washington Post graphic: (link to article)
This chart doesn't render properly in Firefox, nor in Safari. But we can see the designer's intention. It has an added dimension of gender.
This wasn't the chart that caught my eye before. The one I saw has the shades of blue that I used - I basically used the same design with a different set of data.
The chart above can be deconstructed in a similar fashion. It represents a set of collapsed histograms - two histograms, one for each gender, on each row. In other words:
I don't think adding the gender variable adds much to the chart. (Note: the dataset I used did not have gender. I assigned gender randomly for illustrative purposes.)
The other day, a chart about the age distribution of Olympic athletes caught my attention. I found the chart on Google but didn't bookmark it, and now I can't retrieve it. In my mind's eye, the chart looks like this:
This chart has the form of a stacked bar chart but it really isn't. The data embedded in each bar segment aren't proportions; rather, they are counts of athletes along a standardized age scale. For example, the very long bar segment on the right side of the bar for alpine skiing does not indicate a large proportion of athletes in that 30-50 age group; it's the opposite: that part of the distribution is sparse, with an outlier at age 50.
The easiest way to understand this chart is to transform it to histograms.
In a histogram, the counts for different age groups are encoded in the heights of the columns. Instead, encode the counts in a color scale so that taller columns map to darker shades of blue. Then, collapse the columns to the same heights. Each stacked bar chart is really a collapsed histogram.
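The collapsing step can be sketched in Python as follows. The ages, the 5-year bins, and the 15-to-55 range are made up for illustration; each count becomes a shade between 0 (empty) and 1 (darkest) instead of a column height.

```python
def collapsed_histogram(ages, bin_width=5, lo=15, hi=55):
    """Count athletes per age bin, then express each count as a shade in [0, 1]."""
    n_bins = (hi - lo) // bin_width
    counts = [0] * n_bins
    for age in ages:
        if lo <= age < hi:
            counts[(age - lo) // bin_width] += 1
    peak = max(counts) or 1          # avoid dividing by zero when there is no data
    shades = [c / peak for c in counts]
    return counts, shades

# Made-up ages for one sport, including a single outlier at 50.
ages = [19, 22, 23, 24, 24, 26, 27, 28, 31, 50]
counts, shades = collapsed_histogram(ages)

print(counts)  # [1, 4, 3, 1, 0, 0, 0, 1]
print(shades)  # [0.25, 1.0, 0.75, 0.25, 0.0, 0.0, 0.0, 0.25]
```

Every bin would be drawn at the same height. Only the shade distinguishes the crowded 20-24 bin from the lone athlete at 50.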
***
The stacked bar chart reminds me of boxplots that are loved by statisticians.
In a boxplot, the box contains the middle 50% of the athletes in each sport (this directly maps to the dark blue bar segments from the chart above). Outlier values are plotted individually, which gives a bit more information about the sparsity of certain bar segments, such as the right side of alpine skiing.
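For readers who want to see how the box and the outliers are computed, here is a minimal sketch using Python's standard library. The ages are made up, and the 1.5 × IQR fence is the common Tukey convention, which I am assuming here; the post does not say which outlier rule the boxplot uses.

```python
from statistics import quantiles

def boxplot_stats(values, whisker=1.5):
    """Return the quartiles and the points lying beyond the whisker fences."""
    q1, median, q3 = quantiles(values, n=4)
    iqr = q3 - q1                          # the box spans the middle 50% of the data
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, median, q3, outliers

# Made-up ages for one sport, including a single athlete aged 50.
ages = [19, 22, 23, 24, 24, 26, 27, 28, 31, 50]
q1, median, q3, outliers = boxplot_stats(ages)

print(q1, median, q3)
print(outliers)  # [50]
```

The 50-year-old is plotted as an individual point. This is the extra information the boxplot gives about a sparse segment.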
The stacked bar chart can be considered a nicer-looking version of the boxplot.