« December 2019 | Main | February 2020 »

The unspoken rules of visualization

My latest is at DataJournalism.com.


It's an essay on the following observation:

The efficiency and multidimensionality of the visual medium arise from a set of conventions and rules, which regularises the communications between producers of data visualisation and its consumers. These conventions and rules are often unspoken: it's the visual equivalent of saying ’it goes without saying’ .

There are lots of little things visualization designers do in their sleep that don't get mentioned. When a visual design deviates from these rules, the readers may get confused.

Here is one example I discussed in the article (hat tip to Xan Gregg).


This pie chart is not easy to read beyond the obvious point that English is the most popular. The following pie chart is much easier on the readers:



The designer follows some common conventions, such as placing the first slice at the top vertical, sorting the slices from largest to smallest (excepting the "other"), and introducing multiple colors only to encode data differences.

These rules are silently applied, and are not announced to the reader. There is a network effect: the more practitioners use these rules, the stronger they stick.

My essay attempts to outline some of the most important unspoken rules of visualizaiton. For more, see here.

Habit-busting designs don't work

The design changes that most frustrate users are those that bust their habits.

Case in point. Apple re-designed the bottom navigator of the iphone mail app. See what it looked like before and what it looks like today:


Notice how the 2nd slot from the bottom right used to be for replying, and after the re-design, it has become the button for deleting. So when I intended to reply to a message, my finger instinctively presses that 2nd button and now, instead of replying, the message gets deleted!

In the last few years, my finger hit that button thousands of times whenever the brain said to reply. Now, it's really hard to change this habit. I kept having to undo the delete. It's frustrating beyond belief.

This also shows the habit is in the muscle memory, and I'm no longer paying attention to the visual icon. A more direct dataviz analogy is when you belatedly discovered that the horizontal axis in a line chart isn't representing time because you didn't read the axis labels.


A similar thing happened inside an elevator (lift) recently.

Most elevator panels place the Door Open and Door Close buttons side by side. Typically, the Door Open is on the left and the Door Close is on the right.

This particular elevator panel has the Door Open button on top, and Door Close at the bottom, laid out vertically. To the right of the Door Open button is the Alarm button! So I sounded the Alarm when I intended the doors to close.

(I didn't take a photo at the time. The figure on the right is a rough sketch of what the panel looked like.)


I bet the alarm is pressed multiple times a day by mistake.

Food coma and self-sufficiency in dataviz

The Hustle wrote a strong analysis of the business of buffets. If you've read my analysis of Groupon's business model in Numbersense (link), you'll find some similarities. A key is to not think of every customer as an average customer; there are segments of customers who behave differently, and creating a proper mix of different types of customers is the management's challenge. I will make further comments on the statistics in a future post on the sister blog.

At Junk Charts, we'll focus on visualizing and communciating data. The article in The Hustle comes with the following dataviz:


This dataviz fails my self-sufficiency test. Recall: self-sufficiency is a basic requirement of visualizing data - that the graphical elements should be sufficient to convey the gist of the data. Otherwise, there is no point in augmenting the data with graphical elements.

The self-sufficiency test is to remove the dataset from the dataviz, and ask whether the graphic can stand on its own. So here:


The entire set of ingredient costs appears on the original graphic. When these numbers are removed, the reader gets the wrong message - that the cost is equally split between these five ingredients.

This chart reminds me of the pizza chart that everyone thought was a pie chart except its designer! I wrote about it here. Food coma is a thing.

The original chart may be regarded as an illustration rather than data visualization. If so, it's just a few steps from becoming a dataviz. Like this:


P.S. A preview of what I'll be talking about at the sister blog. The above diagram illustrates the average case - for the average buffet diner. Underneath these costs is an assumption about the relative amounts of each food that is eaten. But eaten by whom?

Also, if you have Numbersense (link), the chapter on measuring the inflation rate is relevant here. Any inflation metric must assume a basket of goods, but then the goods within the basket have to be weighted by the amount of expenditure. It's much harder to get the ratio of expenditures correct compared to getting price data.



Gazing at petals

Reader Murphy pointed me to the following infographic developed by Altmetric to explain their analytics of citations of journal papers. These metrics are alternative in that they arise from non-academic media sources, such as news outlets, blogs, twitter, and reddit.

The key graphic is the petal diagram with a number in the middle.


I have a hard time thinking of this object as “data visualization”. Data visualization should visualize the data. Here, the connection between the data and the visual design is tenuous.

There are eight petals arranged around the circle. The legend below the diagram maps the color of each petal to a source of data. Red, for example, represents mentions in news outlets, and green represents mentions in videos.

Each petal is the same size, even though the counts given below differ. So, the petals are like a duplicative legend.

The order of the colors around the circle does not align with its order in the table below, for a mysterious reason.

Then comes another puzzle. The bluish-gray petal appears three times in the diagram. This color is mapped to tweets. Does the number of petals represent the much higher counts of tweets compared to other mentions?

To confirm, I pulled up the graphic for a different paper.


Here, each petal has a different color. Eight petals, eight colors. The count of tweets is still much larger than the frequencies of the other sources. So, the rule of construction appears to be one petal for each relevant data source, and if the total number of data sources fall below eight, then let Twitter claim all the unclaimed petals.

A third sample paper confirms this rule:


None of the places we were hoping to find data – size of petals, color of petals, number of petals – actually contain any data. Anything the reader wants to learn can be directly read. The “score” that reflects the aggregate “importance” of the corresponding paper is found at the center of the circle. The legend provides the raw data.


Some years ago, one of my NYU students worked on a project relating to paper citations. He eventually presented the work at a conference. I featured it previously.


Notice how the visual design provides context for interpretation – by placing each paper/researcher among its peers, and by using a relative scale (percentiles).


I’m ignoring the D corner of the Trifecta Checkup in this post. For any visualization to be meaningful, the data must be meaningful. The type of counting used by Altmetric treats every tweet, every mention, etc. as a tally, making everything worth the same. A mention on CNN counts as much as a mention by a pseudonymous redditor. A pan is the same as a rave. Let’s not forget the fake data menace (link), which  affects all performance metrics.

Bubble charts, ratios and proportionality

A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)


The change in usage of three brands of weedkillers is rendered as a small-multiples of choropleth maps. This graphic displays geographical and time changes simultaneously.

The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.


In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.


Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart).


The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design perogative and does not encode data.  

I’m going to stick with the more traditional overlapping bubbles here – I’m getting to a different matter.


The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters. One should expect the more acres, the more complaints.

Let's consider computing directly the number of complaints per million acres.

The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.


Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.

The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.


Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.

Taking small steps to bring out the message

Happy new year! Good luck and best wishes!


We'll start 2020 with something lighter. On a recent flight, I saw a chart in The Economist that shows the proportion of operating income derived from overseas markets by major grocery chains - the headline said that some of these chains are withdrawing from international markets.


The designer used one color for each grocery chain, and two shades within each color. The legend describes the shades as "total" and "of which: overseas". As with all stacked bar charts, it's a bit confusing where to find the data. The "total" is actually the entire bar, not just the darker shaded part. The darker shaded part is better labeled "home market" as shown below:


The designer's instinct to bring out the importance of international markets to each company's income is well placed. A second small edit helps: plot the international income amounts first, so they line up with the vertical zero axis. Like this:


This is essentially the same chart. The order of international and home market is reversed. I also reversed the shading, so that the international share of income is displayed darker. This shading draws the readers' attention to the key message of the chart.

A stacked bar chart of the absolute dollar amounts is not ideal for showing proportions, because each bar is a different length. Sometimes, plotting relative values summing to 100% for each company may work better.

As it stands, the chart above calls attention to a different message: that Walmart dwarfs the other three global chains. Just the international income of Walmart is larger than the total income of Costco.


Please comment below or write me directly if you have ideas for this blog as we enter a new decade. What do you want to see more of? less of?