Happy New Year
Jan 02, 2019
In China, 2019 is the Year of the Pig. Half of the world's pigs live in China. This graphic is inspired by this BouncyMaps project, which generated the following cartogram.
France is on my mind lately, as I prepare to bring my dataviz seminar to Lyon in a couple of weeks. (You can still register for the free seminar here.)
The following Made in France poster brings out all the stereotypes of the French.
(You can download the original PDF here.)
It's a sankey diagram with so many flows that it screams "it's complicated!" This is an example of a graphic for want of a story. In a Trifecta Checkup, it's failing in the Q(uestion) corner.
It's also failing in the D(ata) corner. Take a look at the top of the chart.
France exported $572 billion worth of goods. The diagram then plots eight categories of exports, ranging from wines to cheeses:
Wine exports totaled $9 billion which is about 1.6% of total exports. That's the largest category of the eight shown on the page. Clearly the vast majority of exports are excluded from the sankey diagram.
Are the 8 the largest categories of exports for France? According to this site, those are (1) machinery (2) aircraft (3) vehicles (4) electrical machinery (5) pharmaceuticals (6) plastics (7) beverages, spirits, vinegar (8) perfumes, cosmetics.
Compare: (1) wines (2) jewellery (3) perfume (4) clothing (5) cheese (6) baked goods (7) chocolate (8) paintings.
It's stereotype central. Name 8 things associated with the French brand and cherry-pick those.
Within each category, the diagram does not show all of the exports either. It discloses that the bars for wines show only $7 of the $9 billion worth of wines exported. This is because the data only capture the "Top 10 Importers." (See below for why the designer did this... France exports wine to more than 180 countries.)
Finally, look at the parade of key importers of French products, as shown at the bottom of the sankey:
The problem with interpreting this list of countries is best felt by attempting to describe which countries ended up on this list! It's the list of countries that belong to the top 10 importers of one or more of the eight chosen products, ordered by the total value of imports in those 8 categories only but only including the value in any category if it rises to the top 10 of the respective category.
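To see why that ranking rule misleads, here is a minimal sketch of the rule as I read it off the poster. The function name `poster_ranking`, the country labels and all the numbers are made up for illustration, and a top-2 cutoff stands in for the top-10 cutoff to keep the toy data small.

```python
from collections import defaultdict

def poster_ranking(exports, top_n=10):
    """Rank countries by summing a category's value only when the
    country is among the top_n importers of that category.

    exports: {category: {country: value}} (hypothetical structure).
    """
    totals = defaultdict(float)
    for by_country in exports.values():
        top = sorted(by_country.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        for country, value in top:
            totals[country] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Made-up values, not from the poster or from Comtrade.
exports = {
    "wine":   {"A": 5.0, "B": 3.0, "C": 2.9},
    "cheese": {"A": 1.0, "C": 2.0, "B": 0.5},
}

# Ranking under the poster's rule, with a top-2 cutoff.
ranking = poster_ranking(exports, top_n=2)

# Ranking input if every flow were counted.
true_totals = defaultdict(float)
for by_country in exports.values():
    for country, value in by_country.items():
        true_totals[country] += value

print(ranking)
print(dict(true_totals))
```

In this toy case, country C imports more than country B in total, yet ranks below B under the cutoff rule because its wine imports fall just outside the cutoff.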
In short, with all those qualifications, the size or rank of the black bars does not convey any useful information.
***
One feature of the chart that surprised me was no flows in the Wine category from France to Italy or Spain. (Based on the above discussion, you should realize that no flows does not mean no exports.) So I went to the Comtrade database that is referenced in the poster, and pulled out all the wine export data.
How does one visualize where French wines are going? After fiddling around with the numbers, I came up with the following diagram:
I like this type of block diagram which brings out the structure of the dataset. The key features are:
The most time-consuming part of this exercise was finding the appropriate structure which can be easily explained in a visual manner.
Note for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:
The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.
***
The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:
The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rates.
Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.
Here is a different take on the same data:
I transformed the data as I don't find it compelling to learn that Russian Big Macs are 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy him/her almost 2 and a half Big Macs in Russia.
The arrows pointing left indicate that in most countries, the values of their currencies declined relative to the dollar from 2017 to 2018 (at least from the Big Mac Index point of view). The only exception is Turkey, where in 2018, the price paid for one U.S. Big Mac buys more Big Macs than it did in 2017.
The decimal differences are immaterial so I have grouped the countries by half Big Macs.
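The transformation can be sketched in a few lines. This is my reading of the arithmetic, not the exact calculation behind the chart: the function names are hypothetical, and rounding to the nearest half is an assumption, since the grouping rule is not spelled out.

```python
def big_macs_per_us_big_mac(fraction_cheaper):
    """Number of local Big Macs that the U.S. price buys.

    fraction_cheaper: how much cheaper the local Big Mac is than the
    U.S. one, as a fraction (0.6 means 60% less).
    """
    return 1.0 / (1.0 - fraction_cheaper)

def nearest_half(ratio):
    """Group a ratio to the nearest half Big Mac (assumed rule).

    Python's round() sends exact ties to the even integer.
    """
    return round(ratio * 2) / 2

# A Big Mac that costs 60% less means the U.S. price buys 2.5 of them.
russia_like = big_macs_per_us_big_mac(0.6)
print(russia_like, nearest_half(russia_like))
```

The ratio form answers the question a traveler would ask: how far does the price of one U.S. Big Mac go over there?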
This example demonstrates yet again that to make a good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.
(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped why "unreliability" is a feature of the offering by most Tech companies.)
Someone sent me this via Twitter, found on the Data is Beautiful reddit:
The chart does not deliver on its promise: It's tough to know which birds like which seeds.
The original chart was also provided in the reddit:
I can see why someone would want to remake this visualization.
Let's just apply some Tufte fixes to it, and see what happens.
Our starting point is this:
First, consider the colors. Think for a second: order the colors of the cells by which ones stand out most. For me, the order is white > yellow > red > green.
That is a problem because for this data, you'd like green > yellow > red > white. (By the way, it's not explained what white means. I'm assuming it means the least preferred, so not preferred that one wouldn't consider that seed type relevant.)
Compare the above with this version that uses a one-dimensional sequential color scale:
The white color still stands out more than necessary. Fix this using a gray color.
What else is grabbing your attention when it shouldn't? It's those gridlines. Push them into the background using white-out.
The gridlines are also too thick. Here's a slimmed-down look:
The visual is much improved.
But one more thing. Let's re-order the columns (seeds). The most popular seeds are shown on the left, and the least on the right in this final revision.
Look for your favorite bird. Then find out which are its most preferred seeds.
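For anyone who wants to try the re-ordering step on their own data, here is a rough sketch. It assumes the colors are first coded as scores (green = 3, yellow = 2, red = 1, white = 0, which is my assumption about the scale) and that "popularity" means the sum of scores across birds. The bird and seed labels are placeholders, not the data in the chart.

```python
def order_seeds_by_popularity(prefs):
    """Return seed names sorted from most to least popular.

    prefs: {bird: {seed: score}}, higher score = more preferred.
    Popularity is taken to be the total score across birds (assumed).
    """
    totals = {}
    for by_seed in prefs.values():
        for seed, score in by_seed.items():
            totals[seed] = totals.get(seed, 0) + score
    return sorted(totals, key=lambda seed: totals[seed], reverse=True)

# Placeholder data for illustration only.
prefs = {
    "bird_1": {"seed_x": 1, "seed_y": 3, "seed_z": 2},
    "bird_2": {"seed_x": 0, "seed_y": 3, "seed_z": 2},
}

order = order_seeds_by_popularity(prefs)
print(order)  # column order, left to right
```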
Here is an animated gif to see the transformation. (Depending on your browser, you may have to click on it to view it.)
PS. [7/23/18] Fixed the 5th and 6th images and also in the animated gif. The row labels were scrambled in the original version.
This Buzzfeed article proves that foodies love their food served with dataviz (tip: Chris P.). Menus are an undertapped resource when it comes to data visualization.
There are several examples worth discussing.
Venn diagrams are not easy to read, people.
Plus they are hard to construct well... note the asymmetric areas.
Here is one without circles:
Then, I pared it down to its essence:
***
This beer map is pretty great:
Some of its virtues:
Potential problems:
***
This next menu contains an error:
When the drink comes in one size, only one price is listed. If it comes in two sizes, two prices should be listed.
Is the cafe owner shading Americans as not good at math?
Another entry in the Google Newslab data visualization project that caught my eye is the "How to Fix It" project, illustrating search queries across the world that ask "how." The project web page is here.
The centerpiece of the project is an interactive graphic showing queries related to how to fix home appliances. Here is what it looks like in France. (It's always instructive to think about how they would count "France" queries. Is it queries from google.fr? queries written in French? queries from an IP address in France? A combination of the above?)
I particularly appreciate the lack of labels. When we see the pictures, we don't need to be told this is a window and that is a door. The search data concern the relative sizes of the appliances. The red dotted lines show the relative popularity of searches for the respective appliances in aggregate.
By comparison, the Russian picture looks very different:
Are the Russians more sensible? Their searches are far and away about the washing machine, which is the most complicated piece of equipment on the graphic.
At the bottom of the page, the project looks at other queries, such as those related to cooking. I find it fascinating to learn what people need help making:
I have to confess that I searched for "how to make soft boiled eggs". That led me to a lot of different webpages, mostly created for people who search for how to make a soft boiled egg. All of them contain lots of advertising, and the answer boils down to cook it for 6 minutes.
***
The Russia versus France comparison brings out a perplexing problem with the "Data" in this visualization. For competitive reasons, Google does not provide data on search volume. The so-called Search Index is what is being depicted. The Search Index uses the top-ranked item as the reference point (100). In the Russian diagram, the washing machine has Search Index of 100 and everything else pales in comparison.
In the France example, the window is the search item with the greatest number of searches, so it has Search Index of 100; the door has Index 96, which means it has 96% of the search volume of the window; the washing machine with Index 49 has about half the searches of the window.
The numbers cannot be interpreted as proportions. The Index of 49 does not mean that washing machines account for 49% of all France queries about fixing home appliances. That is really the meaning of popularity we want to have but we don't have. We can obtain true popularity measures by "normalizing" the Search Index: just sum up the Index Values of all the appliances and divide the Search Index by the sum of the Indices. After normalizing, the numbers can be interpreted as proportions and they add up to 100% for each country. When not normalized, the indices do not add to 100%.
Take the case in which we have five appliances, and let's say all five appliances are equally popular, comprising 20% of searches each. The five Search Indices will all be 100 because the top-ranked item is given the value of 100. Those indices add to 500!
By contrast, in the case of Russia (or a more extreme case), the top-ranked query is almost 100% of all the searches, so the sum of the indices will be only slightly larger than 100.
If you realize this, then you'd understand that it is risky to compare Search Indices across countries. The interpretation is clouded by how much of the total queries is accounted for by the top query.
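The normalization described above is easy to sketch. The function name `normalize` is mine, the first dataset is the five-equal-appliances thought experiment, and the second uses made-up index values to mimic a lopsided case like Russia. Neither is actual Google data.

```python
def normalize(indices):
    """Convert Search Index values into shares that sum to 1.

    indices: {item: index}, where the top item has index 100.
    """
    total = sum(indices.values())
    return {item: value / total for item, value in indices.items()}

# Five equally popular appliances: every index is 100, and they sum to 500.
equal = {"a": 100, "b": 100, "c": 100, "d": 100, "e": 100}

# A lopsided case (made-up values): the top item dominates.
lopsided = {"top": 100, "b": 2, "c": 1, "d": 1, "e": 1}

equal_shares = normalize(equal)
lopsided_shares = normalize(lopsided)

print(sum(equal.values()), equal_shares["a"])
print(sum(lopsided.values()), lopsided_shares["top"])
```

The same index value of 100 corresponds to a 20% share in the first case and a share above 95% in the second, which is the reason the raw indices do not compare across countries.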
In our Trifecta Checkup, this is a chart that does well in the Question and Visual corners, but there is a problem with the Data.
Reader Chris P. sent me this chart.
This was meant to be "light entertainment." See the Twitter discussion below.
***
Let's think a bit about the dot map as a data graphic.
Dot maps are one dimensional. The dot's location is used to indicate the latitude and longitude and therefore the x,y coordinates cannot encode any other data. If we have basically a black/white chart, as in this hog map, the dot can only encode binary data (yes/no).
The legend says "each dot represents 5,000 hogs." Think about how that statement applies to these scenarios:
Based on the legend, the designer would need two dots to represent 10,000 hogs. But those two dots pertain to the same location. Sometimes, "jitter" is added, and the two dots are placed side by side. However, with the scale of the map of the U.S., and the dots representing seemingly small neighborhoods, jitter creates more confusion than anything. Also, what about 3, 4, 5, ... dots in the same location?
Looking at the details above, are the dots jittered or do they represent neighboring locations?
Sometimes, colors are used to encode data on a dot map. But each dot can only contain one color, so it typically shows only the top category in each location.
Dot maps are very limited. Think before you use them.
The Schwab magazine has an interesting discussion of a marketing research study purportedly showing "less is more" when it comes to consumer choice. They summarized the experimental setup and results in the following succinct graphic:
The data consist of nested proportions. For example, among those seeing display 1, 60% stopped to look at the jams, and among those who stopped, 3% purchased.
The nesting is presented as overlap in this design. The blue figures on pink are those shoppers who stopped as well as purchased. The blue figures with no background are those who stopped but did not purchase. The blue figures disregarding background color include everyone who stopped. What about the gray? Those are the shoppers who did not stop at the jam display, which is not a key number. To understand what proportion of shoppers stopped, the reader must take in the entire set of figures, in effect giving the blue and blue/pink figures a change of clothes.
***
In this version, we make it easier to estimate the proportions:
Each branch starts with 100 figures. The nesting structure is clearly depicted.
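The arithmetic behind a branch can be sketched as follows, using the Display 1 rates quoted earlier (60% stopped, and 3% of those who stopped purchased). The function name and the output keys are hypothetical, and a real chart would have to round the fractional counts to whole figures.

```python
def funnel_per_100(stop_rate, purchase_rate_given_stop, base=100):
    """Break a base of shoppers into nested groups.

    purchase_rate_given_stop applies only to those who stopped,
    which is what makes the proportions nested.
    """
    stopped = base * stop_rate
    purchased = stopped * purchase_rate_given_stop
    return {
        "saw_display": base,
        "stopped": stopped,
        "purchased": purchased,
        "stopped_no_purchase": stopped - purchased,
        "did_not_stop": base - stopped,
    }

display_1 = funnel_per_100(0.60, 0.03)
print(display_1)
```

Out of every 100 shoppers who saw Display 1, about 60 stopped and roughly 2 purchased.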
***
It turns out that the original design messed up the numbers. They were trying to be precise. The right side (Display 2) had 29 figures on each row, summing to 260, exactly the number of subjects in that treatment cell. The left side had 28 figures per row (one fewer!), summing to 233. However, according to the research paper being cited, they analyzed 242 subjects who saw Display 1. Nine shoppers went missing.
The extra precision, even if correctly rendered, interferes with our comprehension of proportions. Less is more, indeed!
***
P.S. If you know someone interested in upgrading their skills to join the expanding business analytics workforce, send them to my new venture, Principal Analytics Prep, a next-gen bootcamp that helps people transition careers. Contact me for more information.
Someone at YouGov learned an important lesson in what not to do in #dataviz... (Tip from Rob M. via Twitter)
The tweet storm it unleashed was not worth the cute idea.
Here are their tweets:
Via Twitter, Nicholas S. sent this chart:
It's a layered donut. There isn't much context here except that the chart comes from USDA. Judging from the design, I surmise that the key message is the change in proportion by food groups between 1970 and 2014. I am assuming that these food groups are exhaustive so that it makes sense to put them in a donut chart, with all pieces adding up to 100%.
The following small-multiples line chart conveys most of the information:
The story is the big jump in "Added fats and oils". In the layered donut, the designer highlighted this by a moire effect, something to be avoided.
Note the parenthetical 2010 next to the Added fats and oils label. The data for all other food groups come from 2014 but the number for the most important category is four years older. The chart would be more compelling if they used 2010 data for everything.
One piece of information is ostensibly absent in the line chart version - the growth in the size of the pie. The total of the data increased about 20% from 1970 to 2014. In theory, the layered donut can convey this growth by the perimeters of the circles. But it doesn't appear that the designer saw this as an important insight since the total area of the outer donut is clearly more than 20% larger than the area of the inner donut.