Lining up the dopers and their medals

The Times did a great job making this graphic (this snapshot is just the top half):

Nyt_olympicdopers_top

A lot of information is packed into a small space. It's easy to compose the story in our heads. For example, Lee Chong Wai, the Malaysian badminton silver medalist, was suspended for doping for a short time during 2015, and he was second twice before the doping incident.

They sorted the athletes according to the recency of the latest suspension. This is very smart as it helps make the chart readable. Other common ordering such as alphabetically by last name, by sport, by age, and by number of medals will result in a bit of a mess.

I'm curious about the athletes who also had doping suspensions but did not win any medals in 2016.


Treating absolute and relative data simultaneously

A friend asked me to comment on the following chart:

Mobileprogrammatic_chart

Specifically, he points out the challenge of trying to convey both absolute and relative metrics for a given data series.

This chart presents projections of growth in the U.S. mobile display advertising market. It is specifically pointing out that the programmatic segment of this market is growing rapidly (visualized as the black columns).

The blue and red lines then make a mess of the situation. Even though both of these lines espress percentages, they report to different scales. The red line represents growth rates while the blue line represents share of market.

Both of these metrics are relative metrics useful for interpreting the trend. The growth rates (red) interpret the dollar values on the basis of past values while the market shares (blue) interpret the dollar values on the basis of the total market.

It is rarely a good idea to have many scales on the same canvas. Looking at the blue line for the moment, it is shocking to find that the values depicted almost doubled from one end to the other end. The blue line appears much too gentle.

***

In the makeover, I expressed everything in the same scale (billions of dollars). I used side-by-side charts (small multiples) to isolate each trend that is found in the data. I allow readers to look at each individual segment of the market, and then examine how the individual trends affect the total market.

Redo_mobileprogrammatic_v2

One might argue that the stacked column chart by itself is sufficient. If there is a severe space limitation, I'd let go of the other two panels. However, having those panels makes the messages easier to obtain. This is particularly true of the steady growth assumption behind the programmatic spending trend (the orange columns).


Three short lessons on comparisons

I like this New York Times graphic illustrating the (over-the-top) reaction by the New York police to the Eric Garner-inspired civic protests during the holidays. This is a case where the data told a story that mere eyes and ears couldn't. The semi-strike was clear as day from the visualization.

There are three sections to the graphic, and each displays a different form of comparisons

The first chart is the most straightforward, comparing the number of summonses this year to that of the same time a year ago.

Nyt_nyc_summonses1


One could choose lines for both data series. The combination of one line and column also works. It creates a sensation that the columns should grow in height to meet last year's level. The traffic cops appear to have returned to work more quickly. That said, I don't care for the shades of brown/orange of the columns.

***

The second chart accommodates a more complex scenario, one in which the simple year-on-year comparison is regarded as misleading because the overall crime rate materially dropped from 2013 to 2014. In this scenario, a before-after comparison may be more valid.

Nyt_nyc_summonses2

The chart has multiple sections and I am only showing the section concerning summonses (The horizontal axis shows time, the first black column being the first ten months, and the other orange columns being individual months since then. The vertical axis is the percent change from a year ago.).

The chart shows that in the first ten months of 2014, before the semi-strike, the number of summonses issued was already slightly below the same period the year before. Through the dotted line, the reader is invited to compare this level of change against those in the ensuing months. How starkly did the summonses rate fell!

***

The final chart reveals yet another comparison. Geography is introduced here in the form of a proportional-symbol map.

Nyt_nyc_summonses3

Again, you can't miss the story: across every precinct, summonses have disappeared. This chart is very helpful to making the case that the observed drop is not natural.

 

 


An infographic showing up here for the right reason

Infographics do not have to be "data ornaments" (link). Once in a blue moon, someone finds the right balance of pictures and data. Here is a nice example from the Wall Street Journal, via ThumbsUpViz.

 

Thumbsupviz_wsj_footballinjuries

 

Link to the image

 

What makes this work is that the picture of the running back serves a purpose here, in organizing the data.  Contrast this to the airplane from Consumer Reports (link), which did a poor job of providing structure. An alternative of using a bar chart is clearly inferior and much less engaging.

Redowsjinjuries_bar

***

I went ahead and experimented with it:

Redo_wsj_nflinjuries

 

I fixed the self-sufficiency issue, always present when using bubble charts. In this case, I don't think it matters whether the readers know the exact number of injuries so I removed all of the data from the chart.

Here are  three temptations that I did not implement:

  • Not include the legend
  • Not include the text labels, which are rendered redundant by the brilliant idea of using the running guy
  • Hide the bar charts behind a mouseover effect.

 


Getting the basics right is half the battle

I was traveling quite a lot recently, and last week, read the Wall Street Journal cover to cover for the first time in a while. I am happy to report that there are many more data graphics than I remember of past editions.

The following chart illustrating findings of an FCC report on broadband speeds has a number of issues (a related blog post containing this chart can be found here):

Wsj_dsl_speeds

The biggest problem with the visual elements is the lack of linkage between the two components. The two charts should be connected: the one on the right presents ISP averages by the broadband technology while the one on the left presents individual ISP results. Evidently, the designer treats the two parts as separate.

If that was the intention, there are two decisions that create confusion for readers. First, the charts use two different but related scales. Just add 100% to the scale of the left chart and you get the scale of the right chart. There really is no need for two different scales.

Secondly, orange and blue are used in both charts but for different purposes. In the left chart, orange denotes all ISPs whose actual speeds were below their advertised speeds. In the right chart, orange denotes ISPs using DSL technology.

I also do not understand why some ISP names are bolded. The bolded companies include several cable providers (but not all), several DSL providers (but not all), one fiber provider and no satellite.

Lastly, I'd prefer they stick to one of "advertised" and "promised". I do like the axis labels, saying "faster than" and "slower".

***

One challenge of the data is that the FCC report (here) does not provide a mathematical linkage between the technology averages and the ISP data. We know that 91% for DSL is the average of the ISPs that use DSL as shown on the left of the chart, but we don't know the weights (relative popularity) of each ISP so we can't check the computation.

But if we think of the average by technology as a reference point to measure individual ISPs, we can still use the data, and more efficiently, such as in the following dot plot where the vertical lines indicate the appropriate technology average:

Redo_fccdslspeedwsj

(The cable section should have come before the DSL section but you get the idea.)

The key message of the chart, in my mind, is that DSL providers as a class over-promise and under-deliver.

In a Trifecta Checkup, this is a Type V chart.

 


A great visual of complicated schedules

Reader Joe D. tipped me about a nice visualization project by a pair of grad students at WPI (link). They displayed data about the Boston subway system (i.e. the T).

The project has many components, one of which is the visualization of the location of every train in the Boston T system on a given day. This results in a very tall chart, the top of which I clipped:

Mbta_viz_1

I recall that Tufte praised this type of chart in one of his books. It is indeed an exquisite design, attributed to Marey. It provides data on both time and space dimensions in a compact manner. The slope of each line is positively correlated with the velocity of the train (I use the word correlated because the distances between stations are not constant as portrayed in this chart). The authors acknowledge the influence of Tufte in their credits, and I recognize a couple of signatures:

  • For once, I like how they hide the names of the intermediate stations along each line while retaining the names of the key stations. Too often, modern charts banish all labels to hover-overs, which is a practice I dislike. When you move the mouse horizontally across the chart, you will see the names of the unnamed stations.
  • The text annotations on the right column are crucial to generating interest in this tall, busy chart. Without those hints, readers may get confused and lost in the tapestry of schedules. If you scroll to the middle, you find an instance of train delay caused by a disabled train. Even with the hints, I find that it takes time to comprehend what the notes are saying. This is definitely a chart that rewards patience.

Clicking on a particular schedule highlights that train, pushing all the other lines into the background. The side panel provides a different visual of the same data, using a schematic subway map.

Mbta_viz_2

 Notice that my mouse is hovering over the 6:11 am moment (represented by the horizontal guide on the right side). This generates a snapshot of the entire T system shown on the left. This map shows the momentary location of every train in the system at 6:11 am. The circled dot is the particular Red Line train I have clicked on before.

This is a master class in linking multiple charts and using interactivity wisely.

***

You may feel that the chart using the subway map is more intuitive and much easier to comprehend. It also becomes very attractive when the dots (i.e., trains) are animated and shown to move through the system. That is the image that project designers have blessed with the top position of their Github page.

However, the image above allows us to  see why the Marey diagram is the far superior representation of the data.

What are some of the questions you might want to answer with this dataset? (The Q of our Trifecta Checkup)

Perhaps figure out which trains were behind schedule on a given day. We can define behind-schedule as slower than the average train on the same route.

It is impossible to figure this out on the subway map. The static version presents a snapshot while the dynamic version has  moving dots, from which readers are challenged to estimate their velocities. The Marey diagram shows all of the other schedules, making it easier to find the late trains.

Another question you might ask is how a delay in one train propagates to other trains. Again, the subway map doesn't show this at all but the Marey diagram does - although here one can nitpick and say even the Marey diagram suffers from overcrowding.

***

On that last question, the project designers offer up an alternative Marey. Think of this as an indiced view. Each trip is indiced to its starting point. The following setting shows the morning rush hour compared to the rest of the day:

Mbta_viz_3

 I think they can utilize this display better if they did not show every single schedule but show the hourly average. Instead of letting readers play with the time scale, they should pre-compute the periods that are the most interesting, which according to the text, are the morning rush, afternoon rush, midday lull and evening lull.

The trouble with showing every line is that the density of lines is affected by the frequency of trains. The rush hours have more trains, causing the lines to be denser. The density gradient competes with the steepness of the lines for our attention, and completely overwhelms it.

***

There really is a lot to savor in this project. You should definitely spend some time reviewing it. Click here.

Also, there is still time to sign up for my NYU chart-making workshop, starting on Saturday. For more information, see here.


Advocacy graphics

Note: If you are here to read about Google Flu Trends, please see this roundup of the coverage. My blog is organized into two sections: the section you are on is about data visualization; the other section concerns Big Data and use of statistical thinking in daily life--click to go there. Or, you can follow me on Twitter which combines both feeds.

***

Because the visual medium is powerful, it is a favorite of advocates. Creating a chart for advocacy is tricky. One must strike the proper balance between education and messaging. The chart needs to present the policy position strongly and also enlighten the unconverted with useful information.

In my interview with MathBabe Cathy O'Neil (link), she points to this graphic by Pew that illustrates where death-penalty executions have been administered in the past two decades in the U.S. (link) Here is a screenshot of the geographic distribution for 2006:

Pew_deathpenalty

The chart is a variant of the CDC map of obesity, which I discussed years ago. At one level, the structure of the data is the same. Each state is evaluated on a particular metric (proportion obese, and number of executions) once a year. Both designers choose to roll through a sequence of small-multiple maps.

The key distinction is that the obesity map encodes the data in color while the executions map encodes data in the density of semi-transparent, overlapping dots, each dot representing a single execution.

Perhaps the idea is to combat one of the weaknesses of color encoding: humans don't have an instinctive sense of the mapping between a numerical scale and a color scale. If the color transitions from yellow to orange, how many more executions would that map to? By contrast, if you see 200 dots instead of 160, we know the difference is 40.

***

The switch to the dots aesthetic introduces a host of problems.

Density, as you recall from geometry class, is the count divided by the area. High density can be due to a lot of executions or a very small area. Look at Delaware (DE) versus Georgia (GA). The density of red appears similar but there have been far fewer executions in Delaware.

This is a serious mistake. By using dot density, the designer encourages readers to think in terms of area of each state but why should the number of executions be related to area? As Cathy pointed out, a more relevant reference point is the population of each state. An even cleverer reference point might be the number of criminals/convictions in each state.

Pew_deathpenalty_noteAnother design issue relates to the note at the bottom of the chart (shown on the right). Here, the designer is fighting against the reader's knowledge in his/her head. It is natural for a dot on a map to represent location and yet the spatial distribution of the dots here provide no information. Credit the designer for clarifying this in a footnote; but also let this be a warning that there are other visual representation that does not require such disclaimers.

***

I am confused by why dots appear but never disappear. It seems that the chart is plotting cumulative counts of executions from 1977, rather than the number of executions in each year, as the chart title suggests. (If you go to the Pew website, you find a version with "cumulative" in the title; when they produced the animated gif, they decided to simplify the title, which is a poor decision.)

It requires a quick visit to Wikipedia to learn that there was a break in executions in the 70s. This is a missed opportunity to educate readers about the context of this data. Similarly, a good chart presenting this data should distinguish between states that have banned the death penalty and states that have zero or low numbers of executions.

***

A great way to visualize this data is via a heatmap. Here, I whipped up a quick sketch (pardon the sideway text on the legend):

Executions_sketch

I forgot to add the footnote listing the states where the death penalty is banned. Also can add an axis labeling to the side histogram showing counts.

 

 


The graphical version of "to be seen"

In New York, there are many restaurants that serve mediocre food but which people go in to order to be seen. Here is the graphical equivalent, courtesy of Scientific American (link):

Sa-the-truth-about-chinas-patent-boom_2

 

This is an attractive chart, but from which one should not expect to learn much.

The labels are well placed and unintrusive. The colors are not too sharp.

The size of the font draws our attention to the percentages -- the proportion of patents granted to China that falls into the specified categories. These percentages pertain to the single stacked column chart.

Looking right to left, the reader notices that the stacked column chart is an extension of the rightmost edge of the "ink blot". The ink blot is a variant of the stacked area chart. The massive growth between 1985 and 2010 looks mighty impressive. But the reader must navigate the transition from relative numbers to absolute numbers because the ink blot chart uses the number of patents, not the relative proportion.

In fact, the switch to absolute numbers leaves a void. The reader needs to know the relative proportion from decades past in order to interpret that single column representing just the year 2010. As the chart stands, has there been a change in distribution over time? Your bet is as good as mine.

I have previously explained why the ink blot chart is a silly invention. The central axis is arbitrary and meaningless. It's challenging to judge the growth from one year to another year because the growth is split in half and moving in different directions. The reader is asked to measure the vertical height at two points in time, and mentally shift the two line segments onto an even plane.

The other obstacle to understanding the rate of growth is the choice of scale. The exponential growth in recent years causes the earlier years to look completely flat.

***

 Furthermore, the taxonomy of patents is hard to grasp. There are two dimensions: purely Chinese invention versus co-invention; and assignment to {chinese indigenous firms only, or multinational firms only, or either, or other types of organizations}.

Without reading the article itself, it's hard to understand what the point of this taxonomy is. It's hard to learn anything from looking at this chart.

But it's nice to look at. That's for sure.


Hate the defaults

One piece of  advice I give for those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.

Here are the highlights of his piece.

***

He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.

Schwabish_bls1

 

Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.

Redo_schwabishbls1

 The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.

Redo_schwabishbls2

Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvass. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels). 

This version is considerably cleaner than the original.

***

I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.

Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.

 


Experiments with multiple dimensions

Reader (and author) Bernard L. sends us to the Economist (link), where they walked through a few charts they sketched to show data relating to the types of projects that get funded on Kickstarter. The three metrics collected were total dollars raised, average dollars per project, and the success rate of different categories of projects.

Here's the published version, which is a set of bar charts, ranked by individual metrics, and linked by colors.

Econ_kickstarter1

This bar chart does the job. The only challenge is the large number of colors. But otherwise, it's not hard to see that fashion projects have the worst success rate and raised relatively little money overall although the average pledge amount tended to be higher than average.

The following chart used more of a Bumps chart aesthetic. It dropped the average pledge per project metric, which I think is a reasonable design choice. The variance in pledge amount is probably pretty high and thus the average may not be a good metric anyway. The Bumps format though suffers because there are too many categories and the two metrics are rather uncorrelated, resulting in a spider web. Instead of using colors as a link, this format uses explicit lines as links between the metrics.

Econ_kickstarter2

The following version combines features from both. It requires no colors. It drops the third metric, while adopting the bar chart format. The two charts retain the same order of categories so that one can read across to learn about both metrics.

Redo_crowdfund

 

PS. Readers want to see a scatter plot:

Redo_kickstarter2

The overall pattern is clearer on a scatter plot. When there are so many categories, it's a pain to put the data labels on the chart. It's odd that the amount pledged for games is the highest of the categories and yet it has among the lowest rate of being fully funded. Is this a sign of inefficiency?