The Times did a great job making this graphic (this snapshot is just the top half):
A lot of information is packed into a small space. It's easy to compose the story in our heads. For example, Lee Chong Wei, the Malaysian badminton silver medalist, was suspended for doping for a short time during 2015, and he was second twice before the doping incident.
They sorted the athletes by the recency of the latest suspension. This is smart, as it helps make the chart readable. Other common orderings, such as alphabetical by last name, by sport, by age, or by number of medals, would result in a bit of a mess.
I'm curious about the athletes who also had doping suspensions but did not win any medals in 2016.
The other day, a chart about the age distribution of Olympic athletes caught my attention. I found the chart on Google but didn't bookmark it, and now I can't retrieve it. In my mind's eye, the chart looks like this:
This chart has the form of a stacked bar chart but it really isn't. The data embedded in each bar segment aren't proportions; rather, they are counts of athletes along a standardized age scale. For example, the very long bar segment on the right side of the bar for alpine skiing does not indicate a large proportion of athletes in that 30-50 age group; it's the opposite: that part of the distribution is sparse, with an outlier at age 50.
The easiest way to understand this chart is to transform it into histograms.
In a histogram, the counts for different age groups are encoded in the heights of the columns. Instead, encode the counts in a color scale, so that taller columns map to darker shades of blue. Then collapse the columns to the same height. Each bar in the stacked bar chart is really a collapsed histogram.
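The collapsing step can be sketched in a few lines. This is a minimal, hypothetical illustration (the ages and the 4-level shade scale are made up, not taken from the original chart): bin the ages into counts, then map each bin's count to a shade level, darker for taller.

```python
# Sketch of the "collapsed histogram" idea: bin ages into counts,
# then map each bin's count to a shade index (darker = more athletes).
from collections import Counter

def collapse_histogram(ages, bin_width=5, n_shades=4):
    """Bin ages, then convert each bin's count into a shade level 1..n_shades."""
    bins = Counter((age // bin_width) * bin_width for age in ages)
    max_count = max(bins.values())
    # Taller columns map to darker shades; every non-empty bin gets at least shade 1.
    return {b: max(1, round(n_shades * c / max_count)) for b, c in sorted(bins.items())}

ages = [21, 22, 22, 23, 24, 24, 25, 26, 27, 29, 33, 50]  # hypothetical alpine skiers
shades = collapse_histogram(ages)
```

Notice how the lone 50-year-old gets the lightest shade, which is exactly the sparse right-hand segment discussed above.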
The stacked bar chart reminds me of the boxplot, beloved of statisticians.
In a boxplot, the box contains the middle 50% of the athletes in each sport (this directly maps to the dark blue bar segments from the chart above). Outlier values are plotted individually, which gives a bit more information about the sparsity of certain bar segments, such as the right side of alpine skiing.
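Here is a sketch of how a boxplot summarizes each sport's ages, using the conventional Tukey rule (points beyond 1.5 × IQR from the box are drawn individually as outliers). The data are the same made-up ages as above, not the chart's actual data.

```python
# Boxplot summary: the box spans the middle 50% (Q1 to Q3); points beyond
# 1.5 * IQR from the box are plotted individually as outliers (Tukey's rule).
import statistics

def boxplot_stats(ages):
    q1, q2, q3 = statistics.quantiles(ages, n=4)   # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # whisker limits
    outliers = [a for a in ages if a < lo or a > hi]
    return {"box": (q1, q3), "median": q2, "outliers": outliers}

ages = [21, 22, 22, 23, 24, 24, 25, 26, 27, 29, 33, 50]
stats = boxplot_stats(ages)
```

With these numbers, only the 50-year-old is flagged as an outlier, mirroring the sparse tail in the bar-segment version.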
The stacked bar chart can be considered a nicer-looking version of the boxplot.
My summer course on analytical methods is already at the midway point. I was doing some research on recommendation systems the other day, and came across the following chart:
Ouch. This is from Park et al. (2012), a survey of research papers on this subject. It's the 21st century, people. The column chart copies the older-generation Excel design made infamous by Tufte and since abandoned. Looking more closely, I suspect that the chart was hand-crafted, not made in Excel.
There are several challenges in reading this chart.
The gaps between columns are narrower than the columns themselves. Only in the last two years do all eight categories register. So a key task is to learn which column stands for which type of application. Having one's eyes flip back and forth between the columns and the legend below the chart is a big hassle. As readers, we tend to learn a shortcut: memorize the order of the categories (first column is Book, second column is Document, etc.). The inconsistent treatment of zero-valued columns thwarts this simple strategy.
The designer creates another obstacle by sorting the categories alphabetically. Shopping and Movies are two of the most important applications, and that message is buried.
The key to cleaning up this graphic is to bring the visual design closer to the question being addressed. The question of the chart is how interest in various applications has changed over time.
The answer is that applications are getting more diversified (the rise of the Other), and that Documents, Shopping and Movie applications were growing while research on Image, Music, TV Program and Book stagnated during the study period.
A reader didn't like this graphic in the Wall Street Journal:
One could turn every panel into a bar chart but unfortunately, the situation does not improve much. Some charts just can't be fixed by altering the visual design.
The chart is frustrating to read: typically, colors are used to signify objects that should be compared. Focus on the brown wedges for a moment: Basic EDA 46%, Data cleaning 31%, Machine learning 27%, etc. Those are proportions of respondents who said they spent 1 to 3 hours a day on the respective tasks. That is one weird way of describing time use. The people who spent 1 to 3 hours a day on EDA do not necessarily overlap with those who spent 1 to 3 hours a day on data cleaning. In addition, there is no summation formula that lets us know how any individual, or the average data scientist, spends his or her time during a typical day.
But none of this is the graphics designer's fault.
The trouble with the chart is in the D corner of the Trifecta checkup. The survey question was poorly posed. The data came from a study by O'Reilly Media. They asked questions of this form:
How much time did you spend on basic exploratory data analysis on average?
A. Less than 1 hour a week
B. 1 to 4 hours a week
C. 1 to 3 hours a day
D. 4 or more hours a day
It is not obvious that those four levels are collectively exhaustive. In fact, they aren't. One hour a day for five working days adds up to 5 hours a week. Those who spent between 4 and 5 hours a week have nowhere to go.
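The coverage gap can be verified with a little arithmetic. Assuming a 5-day working week (the same assumption as above), convert each answer band to a range of weekly hours and check which values fall in none of the bands:

```python
# Convert each survey option into an interval of weekly hours
# (assuming a 5-day working week) and test coverage.
DAYS_PER_WEEK = 5

bands = {
    "A": (0, 1),                                     # less than 1 hour a week
    "B": (1, 4),                                     # 1 to 4 hours a week
    "C": (1 * DAYS_PER_WEEK, 3 * DAYS_PER_WEEK),     # 1-3 hours a day -> 5-15 hours a week
    "D": (4 * DAYS_PER_WEEK, 24 * 7),                # 4 or more hours a day
}

def covered(weekly_hours):
    """True if some answer band contains this number of weekly hours."""
    return any(lo <= weekly_hours <= hi for lo, hi in bands.values())

# 4.5 hours a week (e.g., 54 minutes a day) falls between options B and C.
```

Anyone who spent 4.5 hours a week on the task has no honest answer to pick.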
Further, if one had access to the individual responses, it's likely that many respondents' reported hours, summed across tasks, would add up to too many hours or too few.
The panels are separate questions which bear no relationship to each other, even though the tasks are clearly related by the fact that there are only so many working hours in a day.
To fix this chart, one must first fix the data. To fix the data, one must ask the right questions.
A friend asked me to comment on the following chart:
Specifically, he points out the challenge of trying to convey both absolute and relative metrics for a given data series.
This chart presents projections of growth in the U.S. mobile display advertising market. It is specifically pointing out that the programmatic segment of this market is growing rapidly (visualized as the black columns).
The blue and red lines then make a mess of the situation. Even though both lines express percentages, they are reported on different scales: the red line represents growth rates while the blue line represents share of market.
Both of these metrics are relative metrics useful for interpreting the trend. The growth rates (red) interpret the dollar values on the basis of past values while the market shares (blue) interpret the dollar values on the basis of the total market.
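Both relative metrics are derived from the same dollar values, just with different denominators. A small sketch, with made-up numbers for programmatic spending and the total market (billions of dollars):

```python
# Hypothetical dollar values, in billions (not the chart's actual figures).
programmatic = [1.0, 2.0, 4.0, 6.5]
total_market = [10.0, 12.5, 15.0, 17.5]

# Growth rate: each year's value relative to the previous year's value.
growth = [p / q - 1 for p, q in zip(programmatic[1:], programmatic)]

# Market share: each year's value relative to the same year's total.
share = [p / t for p, t in zip(programmatic, total_market)]
```

Same numerator, different denominators; plotting the two on one percentage axis invites exactly the misreading described below.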
It is rarely a good idea to have multiple scales on the same canvas. Look at the blue line for a moment: it is shocking to find that the values depicted almost double from one end to the other, yet the line appears much too gentle.
In the makeover, I expressed everything in the same scale (billions of dollars). I used side-by-side charts (small multiples) to isolate each trend found in the data. Readers can look at each individual segment of the market, and then examine how the individual trends affect the total market.
One might argue that the stacked column chart by itself is sufficient. If there is a severe space limitation, I'd let go of the other two panels. However, having those panels makes the messages easier to obtain. This is particularly true of the steady growth assumption behind the programmatic spending trend (the orange columns).
Old-timer Chris P. sent me to this Bloomberg article about Vanguard ETFs and low-cost funds (link). The article itself is interesting, and I will discuss it on the sister blog some time in the future.
Chris is impressed with this table included with the article:
This table indeed presents the insight clearly. Those fund sectors in which Vanguard does not compete have much higher costs than the fund sectors in which Vanguard is a player. The author calls this the "Vanguard effect."
This is a case where finding a visual design to beat this table is hard.
For a certain type of audience, namely financial, the spreadsheet is like rice or pasta: you simply can't live without it. The Bloomberg spreadsheet goes one better: the bands of blue contrast with the white cells, neatly dividing the funds into two groups.
If you use spreadsheets a lot, you should definitely look into in-cell charts. Tufte's sparkline is perhaps the most famous example, but use your imagination. I also wish vendors would support in-cell charts more eagerly.
Here is a vision of what in-cell technology can do with the above spreadsheet. (The chart is generated in R.)
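To make the in-cell idea concrete, here is a toy text version (the expense ratios below are hypothetical, and a real implementation would draw proper graphics as in the R chart): each value becomes a bar of block characters, scaled to the largest value in the column.

```python
# Toy in-cell bar chart: render each value as a run of block characters,
# scaled so the largest value fills the full cell width.
def in_cell_bars(values, width=20):
    peak = max(values)
    return ["█" * round(width * v / peak) for v in values]

ratios = [0.12, 0.35, 0.95, 1.30]   # hypothetical fund expense ratios (%)
bars = in_cell_bars(ratios)
```

Dropped into a cell next to each number, even this crude version lets the eye compare magnitudes without reading the digits.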
It's called the MLB pipeline. The text at the top helpfully tells us what the chart is about: how the playoff teams in baseball are built. That's the good part.
It then took me half a day to understand what is going on below. There are four ways for a player to join a team: drafted or signed internationally (both considered homegrown), acquired through trades, or signed as free agents.
Each row is a type of player. You can look up which teams have exactly X players of a specific type. It gets harder if you want to know how many players team Y has of a given type. It is even harder if you don't know the logos of every team (e.g. Toronto Blue Jays).
Some fishy business is going on with the threesomes and foursomes. Here is the red threesome:
Didn't know baseball employs half a player. The green section has a different way to play threesomes:
The blue section takes inspiration from both and shows us a foursome:
I was stuck literally in the middle for quite a while:
Eventually, I realized that this is a summary of the first two sections on the page. I still don't understand why there is no gap between 11 and 14, or why the 14 and 15 arrows are twice as large as 9, 10 and 11 even though every arrow contains exactly one team.
The biggest problem in the above chart is the hidden base: each team's roster has a total of 25 players.
Here is a different view of the data:
With this chart, I want to emphasize two points: first, addressing the most interesting question of which team(s) emphasize which particular player acquisition tactic; second, providing the proper reference level to interpret the data.
Regarding the vertical reference lines: take the top left chart, about players arriving through trades. If every team placed equal emphasis on this tactic, each team would have the same number of traded players on the 25-person roster, approximately 11. This is clearly not the case. Several teams, especially the Cubs and Blue Jays, utilized trades more often than teams like the Mets and Royals.
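The reference level under "equal emphasis" is just the league-wide total divided by the number of teams. A sketch with made-up counts (the team names match the discussion, but the numbers are for illustration only):

```python
# Reference level for the vertical line: if every team emphasized trades
# equally, each would hold the league-average number of traded players.
roster_size = 25
traded_per_team = {"Cubs": 14, "Blue Jays": 13, "Mets": 7, "Royals": 8}

reference = sum(traded_per_team.values()) / len(traded_per_team)
above = [team for team, n in traded_per_team.items() if n > reference]
```

Teams to the right of the reference line lean on trades more than the league does on average; teams to the left lean on other acquisition routes.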
The reason for the infrequent posting is my travel schedule. I spent the past week in Seattle at JSM. This is an annual meeting of statisticians. I presented some work on fantasy football data that I started while writing Numbersense.
For my talk, I wanted to present the ubiquitous league table in a more useful way. The league table is a table of results and relevant statistics, at the team level, in a given sports league, usually ordered by the current winning percentage. Here is an example of ESPN's presentation of the NFL end-of-season league table from 2014.
If you want to know weekly results, you have to scroll to each team's section, and look at this format:
For the graph that I envisioned for the talk, I wanted to show the correlation between Points Scored and winning/losing. Needless to say, the existing format is not satisfactory. This format is especially poor if I want my readers to be able to compare across teams.
The graph that I ended up using is this one:
The teams are sorted by winning percentage. One thing should be pretty clear... raw Points Scored are only weakly associated with winning percentage. Especially in the middle of the Points distribution, other factors are at play in determining whether a team wins or loses.
The overlapping dots present a bit of a challenge. I went through a few other drafts before settling on this.
The same chart but with colored dots, and a legend:
Only one line of dots per team instead of two, and also requiring a legend:
Jittering is a popular solution for separating co-located dots, but the effect isn't very pleasing to my eye:
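Jittering in a nutshell: nudge each point by a small random offset so tied values no longer overplot. A minimal sketch (the points and the spread are made up; seeding makes the result reproducible):

```python
# Add a small uniform random offset to each value so co-located dots separate.
import random

def jitter(xs, spread=0.2, seed=0):
    rng = random.Random(seed)
    return [x + rng.uniform(-spread, spread) for x in xs]

points = [10, 10, 10, 24, 24]   # hypothetical Points Scored, with ties
jittered = jitter(points)
```

The offsets are noise added purely for display, which is precisely why the result can look untidy: the dot positions no longer encode the exact data values.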
Small multiples is another frequently prescribed solution. Here I separated the Wins and Losses in side-by-side panels. The legend can be removed.
As usual, sketching is one of the most important skills in data visualization; and you'd want to have a tool that makes sketching painless and quick.