The Times did a great job making this graphic (this snapshot is just the top half):
A lot of information is packed into a small space. It's easy to compose the story in our heads. For example, Lee Chong Wei, the Malaysian badminton silver medalist, was briefly suspended for doping during 2015, and he had finished second twice before the doping incident.
They sorted the athletes according to the recency of the latest suspension. This is very smart, as it helps make the chart readable. Other common orderings, such as alphabetically by last name, by sport, by age, or by number of medals, would result in a bit of a mess.
I'm curious about the athletes who also had doping suspensions but did not win any medals in 2016.
At first glance, this Wall Street Journal chart seems unlikely to impress as it breaks a number of "rules of thumb" frequently espoused by dataviz experts. The inconsistency of mixing a line chart and a dot plot. The overplotting of dots. The ten colors...
However, I actually like this effort. The discontinuity of chart forms nicely aligns with the split between the actual price movements on the left side and the projections on the right side.
The designer also placed the axis labels meticulously: monthly labels for the actual price movements and quarterly labels for the projections.
Even the ten colors are surprisingly manageable. I am not sure we need to label all those banks; maybe just the ones at the extremes. If we clear out some of these labels, we can make room for a median line.
How good are these oil price predictions? It is striking that every bank shown is predicting that oil prices have hit a bottom, and will start recovering in the next few quarters. Contrast this with the left side of the chart, where the line is basically just tumbling down.
Step back six months, to September 2015. The same chart looks like this:
Again, these analysts were calling a bottom in prices and predicting a steady rise over the next quarters.
The track record of these oil predictions is poor:
The median analyst predicted oil prices to reach $50 by Q1 of 2016. Instead, prices fell to $30.
Given this track record, it's shocking that these predictions are considered newsworthy. One wonders how these predictions are generated, and how the analysts justified ignoring the prevailing trend.
Old-timer Chris P. sent me to this Bloomberg article about Vanguard ETFs and low-cost funds (link). The article itself is interesting, and I will discuss it on the sister blog some time in the future.
Chris is impressed with this table included with the article:
This table indeed presents the insight clearly. Those fund sectors in which Vanguard does not compete have much higher costs than the fund sectors in which Vanguard is a player. The author calls this the "Vanguard effect."
This is a case where finding a visual design to beat this table is hard.
For a certain type of audience, namely the financial crowd, the spreadsheet is like rice or pasta; you simply can't live without it. The Bloomberg spreadsheet does one better: the bands of blue contrast with the white cells, neatly dividing the funds into two groups.
If you use spreadsheets a lot, you should definitely look into in-cell charts. Tufte's sparkline is perhaps the most famous example, but use your imagination. I also wish vendors would support in-cell charts more eagerly.
Here is a vision of what in-cell technology can do with the above spreadsheet. (The chart is generated in R.)
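To give a flavor of how such a graphic might be put together, here is a minimal base-R sketch of in-cell bars. The fund sectors and expense ratios are invented placeholders, not the Bloomberg data.

```r
# Sketch of in-cell bars in base R. The sectors and expense ratios
# below are invented placeholders, not the Bloomberg data.
funds <- data.frame(
  sector  = c("Large Blend", "High Yield Bond", "Real Estate", "Emerging Markets"),
  expense = c(0.15, 0.95, 1.10, 1.30)  # hypothetical expense ratios (%)
)

y <- nrow(funds):1  # one row per fund sector, listed top to bottom
plot(NULL, xlim = c(0, 2.4), ylim = c(0.5, nrow(funds) + 0.5),
     axes = FALSE, xlab = "", ylab = "")
text(0, y, funds$sector, adj = 0)                 # the "cell" text
rect(1.1, y - 0.3, 1.1 + funds$expense / 2, y + 0.3,
     col = "steelblue", border = NA)              # the bar inside each row
text(1.1 + funds$expense / 2 + 0.05, y,
     sprintf("%.2f%%", funds$expense), adj = 0)   # the value next to the bar
```

The idea is simply to treat each row of the plotting region as a table cell, and draw the bar and the number inside it.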
My talk at Parsons seemed like a success, based on the conversation it generated, and the fact that people stuck around till the end. One of my talking points is that one should not pick a tool before having a design.
Then, last night on Twitter, I found an example to illustrate this. Jim Fonseca tweeted about this chart from Business Insider: (link)
The style is clean and crisp, which I credit them for. Jim was not happy about the length of the columns. It seems that no matter how many times we repeat the start-at-zero rule, people continue to ignore it.
So here we go again. The 2015 column is about double the height of the 2013 column but 730 is nowhere near double the value of 617.
The standard remedy for this is to switch to a line chart, or a dot plot. Something like this can be quickly produced in any software:
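For instance, a dot plot along these lines takes a couple of lines of R, using only the two figures quoted above (the 2014 value is omitted since the text quotes only 2013 and 2015):

```r
# Quick dot plot of the two figures quoted above (2013: 617, 2015: 730).
# The crucial choice is the zero-based value axis.
penalties <- c("2013" = 617, "2015" = 730)
dotchart(penalties, xlim = c(0, 800), pch = 19, xlab = "Penalties called")
```

The names of the vector become the labels, and because a dot plot does not encode values in lengths, it does not demand a zero start the way columns do.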
Is this the best we can do?
Not if we are willing to free ourselves from the tool. Think about the message: NFL referees have been calling more penalties this year. Compared to what?
I want to leave readers no doubt as to what my message is. So I sketched this version:
This version cannot be produced directly from a tool (without contorting your body into various painful positions).
The lesson is: Make your design, then find a way to execute it.
I discussed the rose chart used in the Environmental Performance Index (EPI) report last week. This type of data is always challenging to visualize.
One should start with an objective. If the goal is a data dump, that is to say, all you want is to deliver the raw data in its full glory to the user, then you should just print a set of data tables. This has traditionally been the delivery mechanism of choice.
If, on the other hand, your interest is communicating insights, then you need to ask some interesting questions. One such question is how do different regions and/or countries compare with each other, not just in the overall index but also in the major sub-indices?
Learning to ask such a question requires first understanding the structure of the data. As described in the previous post, the EPI is a weighted average of a bunch of sub-indices. Each sub-index measures "distance to a target," which is then converted to a scale from 0 to 100. This formula all but guarantees that the aggregate EPI will be neither 0 nor 100: a country would have to score 100 on every sub-index to attain EPI perfection!
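To make that structure concrete, here is a small R sketch of the construction; the sub-indices, weights, and targets are invented for illustration, not the actual EPI inputs.

```r
# Sketch of the EPI construction described above: each sub-index is a
# "distance to target" rescaled to 0-100, and the EPI is a weighted
# average of the sub-indices. All numbers here are invented.
to_score <- function(value, worst, target) {
  100 * (value - worst) / (target - worst)  # 100 = target met, 0 = worst case
}

sub_scores <- c(air     = to_score(40, 0, 50),   # 80
                water   = to_score(65, 0, 100),  # 65
                habitat = to_score(45, 0, 50))   # 90
weights <- c(air = 0.4, water = 0.3, habitat = 0.3)

epi <- sum(weights * sub_scores)  # 0.4*80 + 0.3*65 + 0.3*90 = 78.5
```

The last line makes the point in the text: epi reaches 100 only if every sub-score is 100.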
Here is a design sketch to address the question posed above:
For a print version, I chose several reference countries listed at the bottom that span the range of common values. In the final product, hovering over a stripe should disclose a country and its EPI. Then the reader can construct comparisons of the type: "Thailand has a value of 53, which places it between Brazil and China."
The chart reveals a number of insights. Each region stakes out its territory within the EPI scale. There are no European countries with EPI lower than 45 while there are no South Asian countries with EPI higher than 50 or so. Within each region, the distribution is very wide, and particularly so in the East Asia and Pacific region. Europe is clearly the leading region, followed by North America.
The same format can be replicated for every sub-index.
This type of graph addresses a subset of the set of all possible questions and it does so in a clear way. Modesty in your goals often helps.
The reason for the infrequent posting is my travel schedule. I spent the past week in Seattle at JSM. This is an annual meeting of statisticians. I presented some work on fantasy football data that I started while writing Numbersense.
For my talk, I wanted to present the ubiquitous league table in a more useful way. The league table is a table of results and relevant statistics, at the team level, in a given sports league, usually ordered by the current winning percentage. Here is an example of ESPN's presentation of the NFL end-of-season league table from 2014.
If you want to know weekly results, you have to scroll to each team's section, and look at this format:
For the graph that I envisioned for the talk, I wanted to show the correlation between Points Scored and winning/losing. Needless to say, the existing format is not satisfactory. This format is especially poor if I want my readers to be able to compare across teams.
The graph that I ended up using is this one:
The teams are sorted by winning percentage. One thing should be pretty clear... the raw Points Scored are only weakly associated with winning percentage. Especially in the middle of the Points distribution, other factors are at play in determining whether a team wins or loses.
The overlapping dots present a bit of a challenge. I went through a few other drafts before settling on this.
The same chart but with colored dots, and a legend:
Only one line of dots per team instead of two, and also requiring a legend:
Jittering is a popular solution for separating co-located dots, but the effect isn't very pleasing to my eye:
Small multiples is another frequently prescribed solution. Here I separated the Wins and Losses in side-by-side panels. The legend can be removed.
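If you want to play with these variations yourself, here is a minimal ggplot2 sketch of the small-multiples draft, using invented game results rather than the 2014 NFL data.

```r
# Sketch of the small-multiples draft in ggplot2: one dot per game,
# side-by-side panels for Wins and Losses. The data are invented,
# not the 2014 NFL results.
library(ggplot2)

set.seed(1)
games <- data.frame(
  team    = rep(c("Team A", "Team B", "Team C"), each = 16),
  points  = round(rnorm(48, mean = 23, sd = 7)),
  outcome = sample(c("Win", "Loss"), 48, replace = TRUE)
)

ggplot(games, aes(x = points, y = team)) +
  geom_point(alpha = 0.6) +           # partial transparency eases overlap
  facet_wrap(~ outcome) +             # the side-by-side panels
  labs(x = "Points scored", y = NULL)
```

Swapping the facet for geom_point(position = position_jitter(height = 0.1)) reproduces the jittering draft.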
As usual, sketching is one of the most important skills in data visualization; and you'd want to have a tool that makes sketching painless and quick.
Via Twitter, Andrew B. (link) asked if I could comment on the following chart, published by PC Magazine as part of their ISP study. (link)
This chart is decent, although it can certainly be improved. Here is a better version:
A couple of little things are worth pointing out. The choice of red and green to indicate down and up speeds respectively is baffling. Red and green are loaded colors, which I often avoid. A red dot unfortunately signifies STOP, but ISP users would definitely not want to stop on their broadband superhighway!
In terms of plot symbols, up and down arrows are natural for this data.
Using the Trifecta checkup (link), I am most concerned about the D(ata) corner.
The first sign of trouble is the arbitrary construction of an "Index". This index isn't really an index because there is no reference level. The so-called index is really a weighted average of the download and upload speeds, with 80% weight given to the former. In effect, the download speeds carry even more weight because, in their original units, download speeds are multiples of the upload speeds.
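A quick back-of-the-envelope calculation shows the problem; the speeds here are hypothetical, not PC Magazine's measurements.

```r
# Why the 80/20 weighting effectively over-weights downloads even further:
# in original units, download speeds are multiples of upload speeds,
# so they dominate the weighted average. Speeds below are hypothetical.
download <- 60   # Mbps
upload   <- 12   # Mbps
index <- 0.8 * download + 0.2 * upload   # 48 + 2.4 = 50.4
0.8 * download / index                   # downloads supply ~95% of the index
```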
Besides, putting these ISPs side by side gives an impression that they are comparable things. But direct comparison here is an invitation to trouble. For example, Verizon is represented only by its FIOS division (fiber optics). We have Comcast and Cox which are cable providers. The geographical footprints of these providers are also different.
This is not a trivial matter. Midcontinent operates primarily in North and South Dakota. Some other provider may do better than Midcontinent on average but within those two states, the other provider may perform much worse.
Note that the data came from the Speedtest website (over 150,000 speed tests). In my OCCAM framework (link), this dataset is Observational, without Controls, seemingly Complete, and Adapted (from speed testing for technical support).
Here is the author's disclosure, which should cause concern:
We require at least 50 tests from unique IP addresses for any vendor to receive inclusion. That's why, despite a couple of years of operation, we still don't have information on Google Fiber (to name one such vendor). It simply doesn't have enough users who took our test in the past year.
So, the selection of providers is based on the frequency of Speedtest queries. Is that really a good way to select samples? The author presents one possible explanation for Google Fiber's absence - that it has too few users - without offering any evidence. In general, there are many reasons for such an absence. One might be that a provider is so good that few customers complain about speeds, and therefore they don't run speed tests. Another might be that a provider has a homegrown tool for measuring speeds. Or any number of other reasons. These reasons create biases in various directions, which makes the analysis confusing.
Think about your own behavior. When was the last time you ran a speed test? Did you use Speedtest.net? How did you hear about the site? For me, I was pointed to it by the tech support person at my ISP. Of course, the reason I called them was that I was experiencing speed issues with my connection.
Given the above, do you think the set of speed measurements used in this study gives us accurate estimates of the speeds delivered by ISPs?
While the research question is well worth answering, and the visual form is passable, it is hard to take the chart seriously because of how this data was collected.
Last week, I was quite bothered by this chart I produced using the Baby Name Voyager tool.
According to this chart, William has drastically declined in popularity over time. The name was 7 times more popular back in the 1880s compared to the 2010s. And yet, when I hovered over the chart, the rank of William in 2013 was 3. Apparently, William was the 3rd most popular boy name in 2013.
I wrote the nice people at the website and asked if there might be a data quality issue, and their response was:
The data in our Name Voyager tool is correct. While it may be puzzling, there are definitely less Williams in the recent years than there were in the past (1880s). Although the name is still widely popular, there are plenty of other baby names that parents are using. In the past, there were a limited amount of names that parents would choose, therefore more children had the same name.
What bothered me was that the rate had declined drastically while the number of births was increasing, so I was expecting William to drop in rank as well. But their explanation makes a lot of sense: if there is a much wider spread of names in recent times, the rank could indeed stay near the top. It was very nice of them to respond.
There are three ways to present this data series, as shown below. One can show the raw counts of William babies (orange line). One can show the popularity against total births (what Baby Name Wizard shows, blue line). One can show the rank of William relative to all other male baby names (green line). Consider how different these three lines look!
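For concreteness, here is how the three metrics relate, computed in R on invented counts for a single year (not the actual SSA data).

```r
# The three ways to present the series, computed on invented counts
# for one year (not the actual SSA data).
babies <- data.frame(
  name  = c("William", "James", "Liam", "Noah"),
  count = c(16500, 14000, 18000, 19000)   # hypothetical raw counts
)
total_births <- 2e6                        # hypothetical total male births

babies$per_million <- babies$count / total_births * 1e6  # the rate metric
babies$rank <- rank(-babies$count)                       # 1 = most popular
babies
```

The rate adjusts for the size of the birth cohort, while the rank depends on what every other name is doing; that is why the three lines can tell such different stories.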
The rate metric (per million births) adjusts for growth in total births. But the blue line is difficult to interpret alongside the orange line: in the period 1900 to 1950, the actual number of William babies went up while the blue line came down. The rank is also tough to interpret, especially in the 1970-2000 period, when it took a dive, a trend not visible in either the raw counts or the adjusted counts.
Adding to the difficulty is the use of the per-million metric. In the following chart, I show three different scales for popularity: per million, per 100,000, and per 100 (i.e. proportion). The raw count is shown up top.
All three blue lines are essentially the same, but how readers interpret the scales is quite another matter. The per-million-births metric is the worst of the lot. The chart shows values in the 20,000-25,000 range in the 1910s, yet the actual number of William babies was below 20,000 for a number of years: whenever total births fall below one million, the per-million rate exceeds the raw count, an open invitation to misreading. Switching to per-100K helps, but in this case, using the standard proportion (the bottom chart) is more natural.
The following scatter plot shows the strange relationship between the rate of births and the rank over time for William babies.
Up to the 1990s, there is an intuitive relationship: as the proportion of Williams among male babies declined, so did the rank of William. Then in the 1990s and beyond, the relationship flipped. The proportion of Williams among male babies continued to drop, but the rank of William actually recovered!