Various ways of showing distributions
Aug 04, 2016
The other day, a chart about the age distribution of Olympic athletes caught my attention. I found the chart on Google but didn't bookmark it, and now I can't find it again. In my mind's eye, the chart looked like this:
This chart has the form of a stacked bar chart but it really isn't. The data embedded in each bar segment aren't proportions; rather, they are counts of athletes along a standardized age scale. For example, the very long bar segment on the right side of the bar for alpine skiing does not indicate a large proportion of athletes in that 30-50 age group; it's the opposite: that part of the distribution is sparse, with an outlier at age 50.
The easiest way to understand this chart is to transform it into histograms.
In a histogram, the counts for different age groups are encoded in the heights of the columns. Instead, encode the counts in a color scale so that taller columns map to darker shades of blue. Then, collapse the columns to the same height. Each stacked bar is really a collapsed histogram.
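The collapsing step can be sketched in a few lines of Python. This is only an illustration: the ages are made up (not real Olympic data), the function names `bin_counts` and `to_shades` are hypothetical, and text characters stand in for the shades of blue.

```python
def bin_counts(ages, start, stop, width):
    """Count integer ages falling in bins [start, start+width), ..., up to stop."""
    n_bins = (stop - start) // width
    counts = [0] * n_bins
    for age in ages:
        if start <= age < stop:
            counts[(age - start) // width] += 1
    return counts


def to_shades(counts, levels=" .:*#"):
    """Replace each column height with a shade: denser characters for larger counts."""
    top = max(counts) or 1  # avoid dividing by zero when every bin is empty
    steps = len(levels) - 1
    return "".join(levels[round(c / top * steps)] for c in counts)


# Made-up ages for one sport, with a lone 50-year-old at the far right.
ages = [18, 19, 22, 23, 23, 24, 24, 24, 25, 26, 27, 31, 50]

counts = bin_counts(ages, start=15, stop=55, width=5)  # the histogram heights
strip = to_shades(counts)                              # the collapsed histogram
print(counts)
print("[" + strip + "]")
```

Every cell in the strip has the same height; only the shade carries the count. The long, pale stretch before the last cell is the sparse part of the distribution, not a large group of athletes.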
***
The stacked bar chart reminds me of the boxplots that statisticians love.
In a boxplot, the box contains the middle 50% of the athletes in each sport (this directly maps to the dark blue bar segments from the chart above). Outlier values are plotted individually, which gives a bit more information about the sparsity of certain bar segments, such as the right side of alpine skiing.
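For readers who want to see what goes into the box, here is a minimal sketch using Python's standard `statistics` module. The ages are again made up, the function name `box_summary` is hypothetical, and the 1.5 × IQR rule for flagging outliers is the common Tukey convention, which I am assuming here; charting software may use a different rule.

```python
import statistics


def box_summary(ages, whisker=1.5):
    """Return the quartiles that bound the box and the points plotted as outliers."""
    # Quartiles: the box spans q1..q3, i.e. the middle 50% of the athletes.
    q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = sorted(a for a in ages if a < low_fence or a > high_fence)
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Made-up ages for one sport, with a lone 50-year-old.
ages = [18, 19, 22, 23, 23, 24, 24, 24, 25, 26, 27, 31, 50]
print(box_summary(ages))
```

With these made-up ages, the 50-year-old falls beyond the upper fence and would be drawn as an individual point, which is the extra information the boxplot gives about the sparse right side.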
The stacked bar chart can be considered a nicer-looking version of the boxplot.
This might be the original:
http://www.washingtonpost.com/wp-srv/special/sports/profiles-in-speed/age/sports-by-age.html
Posted by: Evan | Aug 05, 2016 at 08:34 AM
And then there's the violin plot: https://en.wikipedia.org/wiki/Violin_plot
Posted by: Chris Pudney | Aug 08, 2016 at 01:25 AM
Personally, I like the boxplots best. I think they're the easiest to understand, the fastest to take in at a glance, and the most informative for deeper review.
For a technical audience, this is a very familiar presentation that needs little explanation.
For a non-technical audience, a single explanatory set of labels would be appropriate, as was done on the top chart that spawned the discussion. A short explanation of how to interpret boxplots, like the one above, would also be good practice.
Posted by: Doug Dame | Aug 08, 2016 at 09:40 PM
Great post! Each chart has its own merits and limitations.
Maybe it's a good idea to use all of them together if space permits.
I made this combination chart, overlaying a boxplot on a histogram.
http://vizdiff.blogspot.com/2015/11/overlaying-histogram-with-box-and.html
Posted by: Alexander Mou | Aug 09, 2016 at 02:13 PM