Reorientation in the French election

The Financial Times has this chart up about voters for the National Front, which is Marine Le Pen's party.

FT_France_FN_C97-rl3WsAMb2aA

I find the chart very hard to decipher, even though I usually like the dot plot format.

The first thing to figure out is not visual. It's a definition of the data. The average voter represents those who voted in the 2015 regional election. The National Front voters are those who intended to vote in 2015, and these are sub-divided into "loyal" and "new" voters. All it takes for one to be "loyal" is to have voted for the National Front in 2012; all others are "new."

All of the above information you pick up primarily from the footnotes, combined with various parts of the title, and legend. Similarly, you also learn that FN is the acronym for National Front.

***

The following version is clearer:

Jc_ft_frenchnationalfront

The new version mostly just re-orients the original chart, turning it on its side. It's quite surprising how much better I feel about it. I think it's because the message is primarily about the relative ages, and in the original chart, aging is portrayed downwards, which is not natural.


Story within story, bar within bar

This Wall Street Journal offering caught my eye.

Wsj_gender_workforce_sm

It's the unusual way of displaying proportions.

Your first impression is to interpret the graphic as a bar chart. But it really is a bar within a bar: the crux of the matter - gender balance - is embedded in individual bars.

Instead of pie charts or stacked bar charts, we see stacked columns within each bar.

I see what the designer is attempting to accomplish. The first message is the sharp decline in gender equality at higher job titles. The next message is the sharp drop in the frequency of higher job titles.

This chart is a variant of the "Marimekko" chart (beloved by management consultants), also called the mosaic chart. The only difference is how the distribution of jobs in the work force is coded.

The Marimekko is easier to understand:

Redo_wsjgenderworkforce_mekko2

A key advantage of this version is to be found in the thin columns.

Here is another way to visualize this data, drawing attention to the gender gap.

Redo_wsjgenderworkforce_lines

In the other versions, the reader must do subtractions to figure out the size of the gaps.


Lining up the dopers and their medals

The Times did a great job making this graphic (this snapshot is just the top half):

Nyt_olympicdopers_top

A lot of information is packed into a small space. It's easy to compose the story in our heads. For example, Lee Chong Wei, the Malaysian badminton silver medalist, was suspended for doping for a short time during 2015, and he was second twice before the doping incident.

They sorted the athletes according to the recency of the latest suspension. This is very smart as it helps make the chart readable. Other common orderings, such as alphabetically by last name, by sport, by age, or by number of medals, would result in a bit of a mess.

I'm curious about the athletes who also had doping suspensions but did not win any medals in 2016.


Treating absolute and relative data simultaneously

A friend asked me to comment on the following chart:

Mobileprogrammatic_chart

Specifically, he points out the challenge of trying to convey both absolute and relative metrics for a given data series.

This chart presents projections of growth in the U.S. mobile display advertising market. It is specifically pointing out that the programmatic segment of this market is growing rapidly (visualized as the black columns).

The blue and red lines then make a mess of the situation. Even though both of these lines express percentages, they refer to different scales. The red line represents growth rates while the blue line represents share of market.

Both of these metrics are relative metrics useful for interpreting the trend. The growth rates (red) interpret the dollar values on the basis of past values while the market shares (blue) interpret the dollar values on the basis of the total market.
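To make the distinction between the two relative metrics concrete, here is a minimal sketch in Python. The dollar figures are invented for illustration and are not read off the chart; the function names are my own. Growth rates divide by the prior year's value, while market shares divide by the same year's total.

```python
# Hypothetical spending figures in billions of dollars (not from the chart).
programmatic = [2.0, 4.0, 6.0, 8.0]
total_market = [10.0, 14.0, 18.0, 22.0]


def growth_rates(values):
    """Year-over-year growth: each value relative to the previous one."""
    return [(curr - prev) / prev for prev, curr in zip(values, values[1:])]


def market_shares(segment, total):
    """Share of market: each segment value relative to the total that year."""
    return [s / t for s, t in zip(segment, total)]


growth = growth_rates(programmatic)
share = market_shares(programmatic, total_market)

# The first year has no prior value, so growth starts one year later.
for year, g in enumerate(growth, start=1):
    print(f"year {year}: growth {g:.0%}, share {share[year]:.0%}")
```

With these made-up numbers, the growth rate falls every year while the share rises every year. Both are percentages, but they have different bases, which is why they do not belong on one axis.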

It is rarely a good idea to have many scales on the same canvas. Looking at the blue line for the moment, it is shocking to find that the values depicted almost doubled from one end to the other end. The blue line appears much too gentle.

***

In the makeover, I expressed everything in the same scale (billions of dollars). I used side-by-side charts (small multiples) to isolate each trend that is found in the data. I allow readers to look at each individual segment of the market, and then examine how the individual trends affect the total market.

Redo_mobileprogrammatic_v2

One might argue that the stacked column chart by itself is sufficient. If there is a severe space limitation, I'd let go of the other two panels. However, having those panels makes the messages easier to obtain. This is particularly true of the steady growth assumption behind the programmatic spending trend (the orange columns).


Three short lessons on comparisons

I like this New York Times graphic illustrating the (over-the-top) reaction by the New York police to the Eric Garner-inspired civic protests during the holidays. This is a case where the data told a story that mere eyes and ears couldn't. The semi-strike was clear as day from the visualization.

There are three sections to the graphic, and each displays a different form of comparison.

The first chart is the most straightforward, comparing the number of summonses this year to that of the same time a year ago.

Nyt_nyc_summonses1


One could choose lines for both data series. The combination of one line and column also works. It creates a sensation that the columns should grow in height to meet last year's level. The traffic cops appear to have returned to work more quickly. That said, I don't care for the shades of brown/orange of the columns.

***

The second chart accommodates a more complex scenario, one in which the simple year-on-year comparison is regarded as misleading because the overall crime rate materially dropped from 2013 to 2014. In this scenario, a before-after comparison may be more valid.

Nyt_nyc_summonses2

The chart has multiple sections and I am only showing the section concerning summonses. (The horizontal axis shows time, the first black column being the first ten months, and the other orange columns being individual months since then. The vertical axis is the percent change from a year ago.)

The chart shows that in the first ten months of 2014, before the semi-strike, the number of summonses issued was already slightly below the same period the year before. Through the dotted line, the reader is invited to compare this level of change against those in the ensuing months. How starkly the summons rate fell!

***

The final chart reveals yet another comparison. Geography is introduced here in the form of a proportional-symbol map.

Nyt_nyc_summonses3

Again, you can't miss the story: across every precinct, summonses have disappeared. This chart is very helpful in making the case that the observed drop is not natural.

 

 


An infographic showing up here for the right reason

Infographics do not have to be "data ornaments" (link). Once in a blue moon, someone finds the right balance of pictures and data. Here is a nice example from the Wall Street Journal, via ThumbsUpViz.

 

Thumbsupviz_wsj_footballinjuries

 

Link to the image

 

What makes this work is that the picture of the running back serves a purpose here, in organizing the data. Contrast this to the airplane from Consumer Reports (link), which did a poor job of providing structure. The alternative of using a bar chart is clearly inferior and much less engaging.

Redowsjinjuries_bar

***

I went ahead and experimented with it:

Redo_wsj_nflinjuries

 

I fixed the self-sufficiency issue, always present when using bubble charts. In this case, I don't think it matters whether the readers know the exact number of injuries so I removed all of the data from the chart.

Here are three temptations that I did not implement:

  • Not include the legend
  • Not include the text labels, which are rendered redundant by the brilliant idea of using the running guy
  • Hide the bar charts behind a mouseover effect

 


Getting the basics right is half the battle

I was traveling quite a lot recently, and last week, read the Wall Street Journal cover to cover for the first time in a while. I am happy to report that there are many more data graphics than I remember from past editions.

The following chart illustrating findings of an FCC report on broadband speeds has a number of issues (a related blog post containing this chart can be found here):

Wsj_dsl_speeds

The biggest problem with the visual elements is the lack of linkage between the two components. The two charts should be connected: the one on the right presents ISP averages by the broadband technology while the one on the left presents individual ISP results. Evidently, the designer treats the two parts as separate.

If that was the intention, there are two decisions that create confusion for readers. First, the charts use two different but related scales. Just add 100% to the scale of the left chart and you get the scale of the right chart. There really is no need for two different scales.

Secondly, orange and blue are used in both charts but for different purposes. In the left chart, orange denotes all ISPs whose actual speeds were below their advertised speeds. In the right chart, orange denotes ISPs using DSL technology.

I also do not understand why some ISP names are bolded. The bolded companies include several cable providers (but not all), several DSL providers (but not all), one fiber provider and no satellite.

Lastly, I'd prefer they stick to one of "advertised" and "promised". I do like the axis labels, saying "faster than" and "slower".

***

One challenge of the data is that the FCC report (here) does not provide a mathematical linkage between the technology averages and the ISP data. We know that 91% for DSL is the average of the ISPs that use DSL as shown on the left of the chart, but we don't know the weights (relative popularity) of each ISP so we can't check the computation.
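Here is a small Python sketch of why the missing weights matter. The ISP names, speeds, and weights below are all hypothetical (none come from the FCC report); the point is only that the same set of ISP results can produce different technology averages depending on how the ISPs are weighted.

```python
# Hypothetical ISPs: (percent of advertised speed delivered, subscriber weight).
# Neither the speeds nor the weights come from the FCC report.
dsl_isps = {
    "ISP A": (85.0, 5.0),
    "ISP B": (95.0, 1.0),
    "ISP C": (99.0, 1.0),
}


def simple_average(isps):
    """Unweighted mean: every ISP counts equally."""
    return sum(speed for speed, _ in isps.values()) / len(isps)


def weighted_average(isps):
    """Mean weighted by each ISP's relative popularity."""
    total_weight = sum(weight for _, weight in isps.values())
    return sum(speed * weight for speed, weight in isps.values()) / total_weight


print(f"simple:   {simple_average(dsl_isps):.1f}%")
print(f"weighted: {weighted_average(dsl_isps):.1f}%")
```

In this invented case, the most popular ISP is also the slowest, so the weighted average sits well below the simple average. Without the real weights, neither number can be reconciled with the published 91%.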

But if we think of the average by technology as a reference point to measure individual ISPs, we can still use the data, and more efficiently, such as in the following dot plot where the vertical lines indicate the appropriate technology average:

Redo_fccdslspeedwsj

(The cable section should have come before the DSL section but you get the idea.)

The key message of the chart, in my mind, is that DSL providers as a class over-promise and under-deliver.

In a Trifecta Checkup, this is a Type V chart.

 


A great visual of complicated schedules

Reader Joe D. tipped me off about a nice visualization project by a pair of grad students at WPI (link). They displayed data about the Boston subway system (i.e. the T).

The project has many components, one of which is the visualization of the location of every train in the Boston T system on a given day. This results in a very tall chart, the top of which I clipped:

Mbta_viz_1

I recall that Tufte praised this type of chart in one of his books. It is indeed an exquisite design, attributed to Marey. It provides data on both time and space dimensions in a compact manner. The slope of each line is positively correlated with the velocity of the train (I use the word correlated because the distances between stations are not constant as portrayed in this chart). The authors acknowledge the influence of Tufte in their credits, and I recognize a couple of signatures:

  • For once, I like how they hide the names of the intermediate stations along each line while retaining the names of the key stations. Too often, modern charts banish all labels to hover-overs, which is a practice I dislike. When you move the mouse horizontally across the chart, you will see the names of the unnamed stations.
  • The text annotations on the right column are crucial to generating interest in this tall, busy chart. Without those hints, readers may get confused and lost in the tapestry of schedules. If you scroll to the middle, you find an instance of train delay caused by a disabled train. Even with the hints, I find that it takes time to comprehend what the notes are saying. This is definitely a chart that rewards patience.

Clicking on a particular schedule highlights that train, pushing all the other lines into the background. The side panel provides a different visual of the same data, using a schematic subway map.

Mbta_viz_2

Notice that my mouse is hovering over the 6:11 am moment (represented by the horizontal guide on the right side). This generates a snapshot of the entire T system shown on the left. This map shows the momentary location of every train in the system at 6:11 am. The circled dot is the particular Red Line train I clicked on earlier.

This is a master class in linking multiple charts and using interactivity wisely.

***

You may feel that the chart using the subway map is more intuitive and much easier to comprehend. It also becomes very attractive when the dots (i.e., trains) are animated and shown to move through the system. That is the image that project designers have blessed with the top position of their Github page.

However, the image above allows us to see why the Marey diagram is the far superior representation of the data.

What are some of the questions you might want to answer with this dataset? (The Q of our Trifecta Checkup)

Perhaps figure out which trains were behind schedule on a given day. We can define behind-schedule as slower than the average train on the same route.
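As a rough sketch of that definition in Python: the trip durations below are invented, and I am assuming each trip has already been reduced to a single end-to-end duration, which is a simplification of what the project's data contains.

```python
from statistics import mean

# Hypothetical end-to-end trip durations in minutes, keyed by route.
trips = {
    "Red": {"trip 1": 40, "trip 2": 42, "trip 3": 55},
    "Blue": {"trip 4": 20, "trip 5": 21, "trip 6": 19},
}


def behind_schedule(trips_by_route):
    """Return trips whose duration exceeds their route's average duration."""
    late = {}
    for route, durations in trips_by_route.items():
        route_average = mean(durations.values())
        late[route] = sorted(
            trip for trip, minutes in durations.items() if minutes > route_average
        )
    return late


print(behind_schedule(trips))
```

Note that under this definition some trains on every route will usually count as late, even on a good day. A stricter rule, such as a fixed number of minutes above the average, may be more useful in practice.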

It is impossible to figure this out on the subway map. The static version presents a snapshot while the dynamic version has moving dots, from which readers are challenged to estimate their velocities. The Marey diagram shows all of the other schedules, making it easier to find the late trains.

Another question you might ask is how a delay in one train propagates to other trains. Again, the subway map doesn't show this at all but the Marey diagram does - although here one can nitpick and say even the Marey diagram suffers from overcrowding.

***

On that last question, the project designers offer up an alternative Marey. Think of this as an indexed view. Each trip is indexed to its starting point. The following setting shows the morning rush hour compared to the rest of the day:
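For readers unfamiliar with indexing, here is a minimal Python sketch of the idea. The station names and times are hypothetical, and this is my own illustration rather than the project designers' code: every stop time is re-expressed as minutes elapsed since that trip's first stop.

```python
# Hypothetical trips: each is a list of (station, minutes after midnight).
trips = {
    "06:00 departure": [("A", 360), ("B", 365), ("C", 372)],
    "08:00 departure": [("A", 480), ("B", 488), ("C", 499)],
}


def index_to_start(stops):
    """Re-express each stop time as minutes elapsed since the trip began."""
    start = stops[0][1]
    return [(station, minutes - start) for station, minutes in stops]


indexed = {name: index_to_start(stops) for name, stops in trips.items()}

for name, stops in indexed.items():
    print(name, stops)
```

Once all trips start from zero, the lines overlap and a slower trip shows up as a line that takes longer to reach the last station.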

Mbta_viz_3

I think they could utilize this display better if they did not show every single schedule but showed the hourly average. Instead of letting readers play with the time scale, they should pre-compute the periods that are the most interesting, which according to the text, are the morning rush, afternoon rush, midday lull and evening lull.

The trouble with showing every line is that the density of lines is affected by the frequency of trains. The rush hours have more trains, causing the lines to be denser. The density gradient competes with the steepness of the lines for our attention, and completely overwhelms it.

***

There really is a lot to savor in this project. You should definitely spend some time reviewing it. Click here.

Also, there is still time to sign up for my NYU chart-making workshop, starting on Saturday. For more information, see here.


Advocacy graphics

Note: If you are here to read about Google Flu Trends, please see this roundup of the coverage. My blog is organized into two sections: the section you are on is about data visualization; the other section concerns Big Data and use of statistical thinking in daily life--click to go there. Or, you can follow me on Twitter which combines both feeds.

***

Because the visual medium is powerful, it is a favorite of advocates. Creating a chart for advocacy is tricky. One must strike the proper balance between education and messaging. The chart needs to present the policy position strongly and also enlighten the unconverted with useful information.

In my interview with MathBabe Cathy O'Neil (link), she points to this graphic by Pew that illustrates where death-penalty executions have been administered in the past two decades in the U.S. (link) Here is a screenshot of the geographic distribution for 2006:

Pew_deathpenalty

The chart is a variant of the CDC map of obesity, which I discussed years ago. At one level, the structure of the data is the same. Each state is evaluated on a particular metric (proportion obese, and number of executions) once a year. Both designers choose to roll through a sequence of small-multiple maps.

The key distinction is that the obesity map encodes the data in color while the executions map encodes data in the density of semi-transparent, overlapping dots, each dot representing a single execution.

Perhaps the idea is to combat one of the weaknesses of color encoding: humans don't have an instinctive sense of the mapping between a numerical scale and a color scale. If the color transitions from yellow to orange, how many more executions would that map to? By contrast, if you see 200 dots instead of 160, you know the difference is 40.

***

The switch to the dots aesthetic introduces a host of problems.

Density, as you recall from geometry class, is the count divided by the area. High density can be due to a lot of executions or a very small area. Look at Delaware (DE) versus Georgia (GA). The density of red appears similar but there have been far fewer executions in Delaware.

This is a serious mistake. By using dot density, the designer encourages readers to think in terms of the area of each state, but why should the number of executions be related to area? As Cathy pointed out, a more relevant reference point is the population of each state. An even cleverer reference point might be the number of criminals/convictions in each state.
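A quick Python sketch shows how the choice of denominator can flip the comparison. The two states and all of their figures are invented for illustration; they are not Delaware, Georgia, or any real data.

```python
# Hypothetical states; the counts, areas, and populations are invented.
states = {
    "Small State": {"executions": 10, "area_sq_mi": 2_000, "population": 5_000_000},
    "Large State": {"executions": 40, "area_sq_mi": 60_000, "population": 10_000_000},
}


def per_area(state):
    """Executions per 1,000 square miles: what dot density effectively shows."""
    return state["executions"] / state["area_sq_mi"] * 1_000


def per_capita(state):
    """Executions per million residents: the population-based reference."""
    return state["executions"] / state["population"] * 1_000_000


for name, state in states.items():
    print(f"{name}: {per_area(state):.2f} per 1,000 sq mi, "
          f"{per_capita(state):.1f} per million people")
```

In this made-up example, the small state looks far denser on the map but has half the per-capita rate of the large state.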

Pew_deathpenalty_note

Another design issue relates to the note at the bottom of the chart (shown on the right). Here, the designer is fighting against the reader's knowledge in his/her head. It is natural for a dot on a map to represent location and yet the spatial distribution of the dots here provides no information. Credit the designer for clarifying this in a footnote; but also let this be a warning that there are other visual representations that do not require such disclaimers.

***

I am confused as to why dots appear but never disappear. It seems that the chart is plotting cumulative counts of executions from 1977, rather than the number of executions in each year, as the chart title suggests. (If you go to the Pew website, you find a version with "cumulative" in the title; when they produced the animated gif, they decided to simplify the title, which is a poor decision.)

It requires a quick visit to Wikipedia to learn that there was a break in executions in the 70s. This is a missed opportunity to educate readers about the context of this data. Similarly, a good chart presenting this data should distinguish between states that have banned the death penalty and states that have zero or low numbers of executions.

***

A great way to visualize this data is via a heatmap. Here, I whipped up a quick sketch (pardon the sideways text on the legend):

Executions_sketch

I forgot to add the footnote listing the states where the death penalty is banned. I could also add axis labels to the side histogram showing counts.

 

 


The graphical version of "to be seen"

In New York, there are many restaurants that serve mediocre food but which people go to in order to be seen. Here is the graphical equivalent, courtesy of Scientific American (link):

Sa-the-truth-about-chinas-patent-boom_2

 

This is an attractive chart, but one from which you should not expect to learn much.

The labels are well placed and unintrusive. The colors are not too sharp.

The size of the font draws our attention to the percentages -- the proportion of patents granted to China that falls into the specified categories. These percentages pertain to the single stacked column chart.

Looking right to left, the reader notices that the stacked column chart is an extension of the rightmost edge of the "ink blot". The ink blot is a variant of the stacked area chart. The massive growth between 1985 and 2010 looks mighty impressive. But the reader must navigate the transition from relative numbers to absolute numbers because the ink blot chart uses the number of patents, not the relative proportion.

In fact, the switch to absolute numbers leaves a void. The reader needs to know the relative proportion from decades past in order to interpret that single column representing just the year 2010. As the chart stands, has there been a change in distribution over time? Your guess is as good as mine.

I have previously explained why the ink blot chart is a silly invention. The central axis is arbitrary and meaningless. It's challenging to judge the growth from one year to another year because the growth is split in half and moving in different directions. The reader is asked to measure the vertical height at two points in time, and mentally shift the two line segments onto an even plane.

The other obstacle to understanding the rate of growth is the choice of scale. The exponential growth in recent years causes the earlier years to look completely flat.

***

Furthermore, the taxonomy of patents is hard to grasp. There are two dimensions: purely Chinese invention versus co-invention; and assignment to {Chinese indigenous firms only, or multinational firms only, or either, or other types of organizations}.

Without reading the article itself, it's hard to understand what the point of this taxonomy is. It's hard to learn anything from looking at this chart.

But it's nice to look at. That's for sure.