What makes this work is that the picture of the running back serves a purpose here, in organizing the data. Contrast this to the airplane from Consumer Reports (link), which did a poor job of providing structure. An alternative of using a bar chart is clearly inferior and much less engaging.
I went ahead and experimented with it:
I fixed the self-sufficiency issue, always present when using bubble charts. In this case, I don't think it matters whether the readers know the exact number of injuries so I removed all of the data from the chart.
Here are three temptations that I did not implement:
Not include the legend
Not include the text labels, which are rendered redundant by the brilliant idea of using the running guy
I am mystified by the intention behind this chart, published in NYT Magazine (Sept 14, 2014).
It is not a data visualization since the circles were not placed to scale. The 650 and 660 should have been further to the right on a horizontal time scale. And if we were to take the radial time axis literally, the 390 circle would be closest to the center.
It is not a work of art. It doesn’t look particularly appealing. Sometimes, designers are inspired by imagery. The accompanying article concerns windshield wipers, and I’m not seeing the imagery.
The arrangement of the circles actually interferes with the reader’s comprehension. Here is a straightforward version of the data as a column chart.
Now, let’s turn it on its side, with time running vertically instead of horizontally (the convention).
Then, we need to invert convention once again by making the vertical axis run in reverse so that time runs from top to bottom, instead of bottom to top.
Finally, distort the frequency axis, replace the bars with circles, and you have essentially replicated the original.
The point is each step obscures the pattern more. In this case, following conventions makes a better chart.
I have a pet peeve about presenting partial data next to complete data, even if it is labeled correctly. On this chart, the number 390 cannot be compared against any of the other numbers because we are not even half way into the decade of the 2010s. Instead of plotting total number of patents per decade, it would have been more useful to plot number of patents per year in each decade. 43, 26, 65, 41, etc. For the 2010s, I am assuming they have data for 3.5 years.
A simple column chart looks like this:
The per-year view shows that the 2010s are unusual. Of course, I should add a footnote to the chart to make it clear that we only have partial data for the 2010s, and that the assumption behind the averaging is that the pace of patents will remain the same on average for the remainder of the decade.
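The per-year arithmetic can be sketched in a few lines. This is only an illustration: the function name is mine, the 3.5 years is the assumption stated above, and I am treating 650 as a full ten-year total.

```python
def patents_per_year(total_patents, years_observed):
    """Average number of patents per year over the observed span."""
    if years_observed <= 0:
        raise ValueError("years_observed must be positive")
    return total_patents / years_observed

# A complete decade: the total is spread over 10 years.
full_decade_rate = patents_per_year(650, 10)

# The 2010s: only about 3.5 years of data (an assumption, not a known fact).
partial_decade_rate = patents_per_year(390, 3.5)

print(f"full decade: {full_decade_rate:.1f} per year")
print(f"2010s so far: {partial_decade_rate:.1f} per year")
```

Dividing by the years actually observed is what puts the partial decade on the same footing as the complete ones.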
Carl Bialik used to be the Numbers Guy at Wall Street Journal - he's now with FiveThirtyEight. Apparently, he left a huge void. John Eppley sent me to this set of charts via Twitter.
This chart about Citibike is very disappointing.
Using the Trifecta checkup, I first notice that it addresses a stale question and produces a stale answer. The caption below the chart says "the peak times ... seem to be around 9 am and 6 pm." What a shock!
I sense a degree of meekness in using "seem to be". There is not much to inspire confidence in the data: rather than the full statistics which you'd think someone at Citibike has, the chart is based on "a two-day sample last autumn". The number of days is less concerning than the question of whether those two autumn days are representative of the year. Curious readers might want to know what data was collected, how it was collected, and the sample size.
Finally, the graph makes a mess of the data. While the black line appears to be data-rich, it is not. In fact, the blue dots might as well be randomly scattered and connected. As you can see from the annotations below, the scale of the chart makes no sense.
Plus, the execution is sloppy, with a missing data label.
The next chart is not much better.
The biggest howler is the choice of pie charts to illustrate three numbers that are not that different.
But I have to say the chart raises more questions than it answers. I am not an expert in pregnancy but doesn't a pregnant woman's weight include the weight of the baby she's carrying? So the more weight the woman gains, on average, the heavier is her baby. What a shock!
The last and maybe the least is this chart about basketball players in the playoffs.
It's the dreaded bubble chart. The players are arranged in a perplexing order. I wonder if there is a natural numbering system for basketball positions (center = #1, etc.), like there is in soccer. Even if there is such a natural numbering system, I still question the decision to confound that system with a complicated ranking of current-year playoff players against all-time players.
Above all, the question being asked is uninteresting, and so the chart is uninformative. A more interesting question to me is whether the best players are playing in this year's playoff. To answer this question, the designer should be comparing only currently active players, and showing the all-time ranks of those players who are playing in the playoffs versus those who aren't.
The question on the table is motivated by the extraordinary performance of a young baseball player, Mike Trout. The early success can be interpreted either as evidence of future potential or as evidence of a future drought. As an analogy, someone wins a lottery. You can argue that the odds are so low that winning again is impossible. Or you can argue that winning once indicates that this person is "lucky" and lucky people might win again.
The chart shows the proportion of players who performed even better after the initial success, given the age at which they first broke out. One way to read this chart is to mentally replace the bubbles with dots (or columns), and then interpret the size of the bubbles as the statistical significance of the corresponding probability estimate. The legend says number of players, which is the sample size, which governs the error bar associated with that particular number.
This bubble chart is no different from others: it is impossible to judge the relative sizes of bubbles. Even though the legend provides us two reference points (a nice enough idea on its own), it is still impossible to know, for example, what proportion of players did better later in life when they first peaked at age 24. The bubble for age 23 looks like it's exactly five players but I still cannot figure out how many players the adjacent bubble represents.
The designer should have just replaced each bubble with an error bar, and the chart would be instantly more readable. (I have another version of this at the end of the post.)
The rest of the design elements are clean and well-done, particularly the use of notes to point out interesting aspects of the data.
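To see how the number of players governs the error bar, here is a minimal sketch. It uses the Wilson score interval, which is my choice for small samples and not necessarily what the original designer would use, and the counts are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical: 2 of 4 players improved, versus 20 of 40.
small_lo, small_hi = wilson_interval(2, 4)
large_lo, large_hi = wilson_interval(20, 40)

print(f"n=4:  {small_lo:.2f} to {small_hi:.2f}")
print(f"n=40: {large_lo:.2f} to {large_hi:.2f}")
```

Both samples estimate the same 50 percent, but the interval for four players is far wider, which is the information the bubble sizes are trying to convey.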
From a Trifecta checkup perspective, I am uncertain about the nature of the data used to investigate the interesting question posed above.
Readers should note that the concepts of "early success" and "later success" are not universally defined. The author here selects two proxies. Reaching an early peak is equated to "batters first posting 15+ WAR over two seasons". Next, reversion to the mean is defined as not having a better two-year span subsequent to the aforementioned early peak.
Why two seasons? Why WAR and not a different metric? Why 15 as the cutoff? These are all design decisions made while working with the data.
One can make reasonable arguments to justify these choices. A bigger head-scratcher relates to the horizontal axis, which identifies the first time a player reaches his "early peak," as defined above. The way the above chart is set up, it is almost preordained to exhibit a negative slope. The older the player is when he reaches the first peak, the fewer years left in his playing career to try to emulate or surpass that feat.
This last point is nicely illustrated in the next chart of the article:
This chart is excellent on many levels. It's not clear, though, whether it says anything other than aging.
Near the end of the post, the author rightfully pointed out that "there’s not really enough data to demonstrate this effect". Going back to the first chart, it appears that no single bubble contains a double-digit count of players. So every sample size is between one and, say, seven. We should be wary of conclusions based on so little data.
It's always fun to find examples of the Law of Small Numbers, courtesy of Kahneman & Tversky.
Here is a sketch of how I might re-make the first chart (I made up data; see the note below).
While making this chart, I realize another issue with the original bubble chart. When the proportion of players improving on their early peak is zero percent, how many players did not make it is quite hidden. In the revised chart, this data is clearly seen (look at age 22).
Note: I wonder if I totally missed the point of the original chart.... I actually had trouble eyeballing the data so I ended up making up numbers. The bubble at age 22 looks like it should stand for 5 players and yet it sits at precisely 50%, which would map to 2.5 players. If I assume the 22 bubble to be 4 players, then I don't know what the 26 bubble is. If it is 4 players also, then the minimum non-zero proportion should have been 1/4, but the bubble clearly lies below 25%. If it is 3 players, the minimum non-zero proportion is 1/3, which should be at 33%.
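The arithmetic in this note can be checked mechanically. With n players in a bubble, the only possible proportions are 0/n, 1/n, ..., n/n. A small sketch (the function names are mine):

```python
from fractions import Fraction

def achievable_proportions(n_players):
    """All proportions k/n that a bubble of n players could sit at."""
    return [Fraction(k, n_players) for k in range(n_players + 1)]

def can_sit_at(n_players, proportion):
    """True if some whole number of players yields exactly this proportion."""
    return proportion in achievable_proportions(n_players)

def smallest_nonzero(n_players):
    """The lowest non-zero proportion possible with n players."""
    return Fraction(1, n_players)

half = Fraction(1, 2)
for n in range(1, 8):
    print(n, "players:", "50% possible" if can_sit_at(n, half) else "50% impossible",
          "| smallest non-zero:", smallest_nonzero(n))
```

A bubble of five players cannot sit at exactly 50%, and a bubble sitting below 25% needs at least five players, which is the contradiction I ran into.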
My twitter followers have been sending in several howlers.
Twitter (link) made a bunch of bold claims about its own influence by using the number of tweets about the Oscars as fodder. They also adopt the euphemism common to the digital marketing universe, the so-called "view", which, to their credit, they define as "how many times tweets are displayed to users". Yes, you read that right, displaying is the same as viewing in this world - and Twitter is just a follower, not a trend setter, here.
In the meantime, @wilte found this unfortunate donut chart, created by PWC in the Netherlands.
Both designers basically appropriated a graphical form and deprived it of data. In one, the designer threw the concept of scale to the wind. In the other, the designer dumped the law of total probability. In either case, the fundamental rationale for the particular graphical form is sacrificed.
Both are examples that fail our self-sufficiency test. This test says if a visual display cannot be understood unless the entire data set is printed on the chart, then why create a visual display? In both charts, if you block out the numbers, you are left with nothing!
The PWC chart was submitted by @graphomate, who also submitted the following KPMG chart:
The complaint was the total adding up to 101%. I'm not really bothered by this as it is a rounding issue. That said, I like to "hide" such rounding issues. I have never understood why it is necessary to display the imperfection. Flip a coin and remove the decimals from one of the categories!
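For those who prefer a rule to a coin flip, one common way to hide the rounding issue is the largest-remainder method: round everything down, then hand the leftover points to the categories with the biggest decimals. A minimal sketch, with made-up shares (not the KPMG figures):

```python
import math

def round_to_100(shares):
    """Round percentage shares to whole numbers that sum to exactly 100.

    Assumes the unrounded shares already sum to (about) 100.
    """
    floors = [math.floor(s) for s in shares]
    shortfall = 100 - sum(floors)
    # Give the leftover points to the largest fractional parts first.
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - floors[i],
                          reverse=True)
    for i in by_remainder[:shortfall]:
        floors[i] += 1
    return floors

# Hypothetical shares, for illustration only.
shares = [40.6, 30.7, 28.7]
naive = [round(s) for s in shares]
adjusted = round_to_100(shares)

print("naive rounding:", naive, "sum =", sum(naive))
print("largest remainder:", adjusted, "sum =", sum(adjusted))
```

The naive rounding adds up to 101; the adjusted version gives up a point on the category with the smallest decimal part.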
Jens M., a long-time reader, submits a good graphic! This small-multiples chart (via Quartz) compares the consumption of liquor from selected countries around the world, showing both the level of consumption and the change over time.
Ordered the countries by the most recent data point rather than alphabetically
Scale labels are found only on outer edge of the chart area, rather than one set per panel
Only used three labels for the 11 years on the plot
Did not overdo the vertical scale either
The nicest feature was the XL scale applied only to South Korea. This destroys the small-multiples principle but draws attention to the top left corner, where the designer wants our eyes to go. I would have used smaller fonts throughout.
Having done so much work to simplify the data and expose the patterns, it's time to look at whether we can add some complexity without going overboard. I'd suggest using a different color to draw attention to curves that are strangely shaped -- Ukraine comes to mind, so does Brazil.
I'd also consider adding the top liquor in each country... the writeup made a big deal out of the fact that most of the drinking in South Korea is of Soju.
One way to appreciate the greatness of the chart is to look at alternatives.
Here, the Economist tries the lazy approach of using a map: (link)
For one thing, they have to give up the time dimension.
A variation is a cartogram in which the physical size and shape of countries are mapped to the underlying data. Here's one on Worldmapper (link):
One problem with this transformation is what to do with missing data.
Wikipedia has a better map with variations of one color (link):
The Atlantic realizes that populations are not evenly distributed on the map so instead of coloring countries, they put bubbles on top of the map (link):
Unfortunately, they scaled the bubbles to the total consumption rather than the per-capita consumption. You guessed it, China gets the biggest bubble, much larger than anywhere else, but from a per-capita standpoint, China is behind many other countries depicted on the map.
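The difference between the two scalings is easy to demonstrate. The sketch below uses two invented countries with invented numbers; nothing here comes from the Atlantic's data.

```python
# Hypothetical countries and figures, for illustration only.
countries = {
    "Large country": {"total_litres": 6_000, "population": 1_000},
    "Small country": {"total_litres": 900, "population": 100},
}

def per_capita(stats):
    """Litres consumed per person."""
    return stats["total_litres"] / stats["population"]

def total(stats):
    """Litres consumed by the whole population."""
    return stats["total_litres"]

def biggest_bubble(data, metric):
    """Name of the country that gets the largest bubble under a metric."""
    return max(data, key=lambda name: metric(data[name]))

print("scaled by total:", biggest_bubble(countries, total))
print("scaled by per capita:", biggest_bubble(countries, per_capita))
```

Scaling by total mostly redraws the population map; scaling by per-capita consumption answers the question of who drinks more.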
PS. A note on submissions. I welcome submissions, especially if you have a good chart to offer. Please ping me if I don't reply within a few weeks. I may have just missed your email. Also, realize that submissions take even more time to research, since they are likely in an area I have little knowledge about, and mostly you sent them to me because you hope I'll research them. Sometimes I give up since it's taking too much time. If you ping me again, I'll let you know if I'm working on it.
The above does not apply to emails from people who are building traffic for their infographics.
PPS. Andrew Gelman chimes in with his take on small multiples.
At the NY Tech Meetup, Andrei Scheinkman showed off some work his team at Huffington Post did relating to gun violence in America.
The interactive version is here. The animation shows, day by day, where the victims of gun violence were located. The table below contains the details of each victim, and links to the news story covering the event.
What is not seen on the chart is even more impressive. Andrei described how they looked around for databases that would provide them the raw materials for creating this chart but no timely source exists. This means that a team of 15 (if I heard correctly) spent a month or so manually collecting all the data on a spreadsheet.
It's also the reason why they cannot continue the map indefinitely, as people have other things to do.
Andrei also contrasted this visualization with a text article that describes the state of gun violence in words. You guessed it, the visual presentation is hands-down more compelling.