
A Tufte fix for austerity

Trish, who attended one of my recent data visualization workshops, submitted a link to the Illinois Atlas of Austerity.

Shown on the right is one of the charts included in the presentation.

This is an example of a chart that fails my self-sufficiency test.

There is no chance that readers are getting any information out of the graphical elements (the figurines of 100 people each).

Everyone who tries to learn something from this chart will be reading the data labels directly.

The entire dataset is printed on the chart itself. If you cover up the data, the chart becomes unreadable!


Here is a simple fix that resettles the figurines onto a bar chart:


Tufte would not be amused by this composition. The figurines are purely decorative.

This version is more likely to delight Tufte:


It is the edges of the bars in the bar chart that make all the difference.
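A bar-chart fix of this kind is easy to sketch. The snippet below is a minimal illustration in matplotlib; the program names and client counts are made up, not the Atlas's actual figures:

```python
# Sketch of the bar-chart fix: plain bars carry the data,
# with non-data ink stripped away. All numbers are hypothetical.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

programs = ["Child Care", "Mental Health", "Senior Services", "After School"]
clients = [38000, 21000, 14000, 9000]  # hypothetical counts

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(programs, clients, color="#4a7ebb")
ax.invert_yaxis()  # largest program on top
ax.set_xlabel("Clients affected")
ax.spines["top"].set_visible(False)    # drop non-data ink
ax.spines["right"].set_visible(False)
fig.tight_layout()
fig.savefig("austerity_bars.png")
```

The bars' edges encode the data, so the chart survives the self-sufficiency test: cover the numbers and the comparison is still readable.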


Aside from the visual problems, there is also a data issue. They should have controlled for the size of the different programs.



Raining, data art, if it ain't broke

Via Twitter, reader Joe D. asked a few of us to comment on the SparkRadar graphic by WeatherSpark.

At the time of writing, the picture for Baltimore is very pretty:


The picture for New York is not as pretty but still intriguing. We are having a bout of summer and hence the white space (no precipitation):


Interpreting this innovative chart is a tough task - as is true of any innovative chart. Explaining the chart requires all the text on this page.

The difficulty of interpreting the SparkRadar chart is twofold.

Firstly, the axes are unnatural. Time runs vertically, defying the horizontal convention. Also, "now" - the most recent time depicted - is at the very bottom, which tempts readers to read bottom to top, so that time runs backwards into the past. In most charts, time runs left to right from past to present (at least in the left-right-centric part of the world that I live in).

Location has been reduced to one dimension. The labels "Distance Inside" and "Distance from Storm" confuse me - perhaps those who follow weather more closely can justify the labels. Conventionally, location is shown in two dimensions.

The second difficulty is created by the inclusion of irrelevant data (aka noise). The square grid prescribes a fixed box inside which all data are depicted. In the New York graphic, something is going on in the top right corner - far away in both time and space - how does it help the reader?


Now, contrast this chart to the more standard one, a map showing rain "clouds" moving through space.


(From a Bing search result)

The standard one wins because it matches our intuition better.

Location is shown in two dimensions.

Distance from the city is shown on the map as scaled distance.

Time is shown as motion.

Speed is shown as speed of the motion. (In SparkRadar, speed is shown by the slope of imaginary lines.)

Severity is shown by density and color.
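The slope-as-speed reading can be made concrete with a bit of arithmetic, using made-up numbers (not actual SparkRadar data): a storm cell traces a line on the distance-versus-time grid, and the line's slope is its speed.

```python
# A storm cell that was 50 km away 40 minutes ago and is 30 km
# away now: the slope of its track is its approach speed.
# All numbers here are hypothetical.
distance_then_km, distance_now_km = 50, 30
minutes_elapsed = 40

speed_km_per_h = (distance_then_km - distance_now_km) / minutes_elapsed * 60
print(speed_km_per_h)  # prints 30.0 (km/h toward the city)
```

On a map animation, the same information arrives pre-computed as visible motion; on SparkRadar, the reader must do this division in his or her head.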

Nonetheless, a panel of the new charts makes great data art.



What doesn't help readers (on the chart) and what does help (off the chart)

Via Twitter, Bart S (@BartSchuijt) sent me to this TechCrunch article, which contains several uninspiring charts.

The most disturbing one is this:


There is a classic Tufte case here: only five numbers and yet the chart is so confusing. And yes, they reversed the axis: lower means higher "app abandonment" and higher means lower "app abandonment". The co-existence of data labels, gridlines, and axis labels increases processing time without adding information.

A simple column chart shows there is almost nothing going on:


I suspect that if they were to break the data down by months and weeks, it would be clear that the fluctuations are meaningless.
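A column chart of this kind - zero-based, with the axis running the conventional way - can be sketched as follows. The abandonment percentages here are placeholders, not the TechCrunch figures:

```python
# Sketch of the column-chart redo: zero-based vertical axis,
# not reversed, so a taller bar means more abandonment.
# The five percentages are hypothetical stand-ins.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

years = ["2012", "2013", "2014", "2015", "2016"]
abandonment = [22, 20, 23, 25, 23]  # hypothetical percentages

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(years, abandonment, color="#888888")
ax.set_ylim(0, 100)  # full percentage scale makes the flatness obvious
ax.set_ylabel("% abandoned after one use")
fig.tight_layout()
fig.savefig("abandonment_columns.png")
```

Plotted against the full 0-100% scale, small year-to-year wiggles look like what they are: small.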


The graphical scaffolding, or what Tufte calls the non-data ink, should provide context to help readers understand the data. This is not the case here.

Worse, the context needed to interpret "app abandonment" is sorely missing.

You might argue with me. Isn't it clear from the chart title? And doesn't the subtitle provide the details of how app abandonment is measured? It says "% of users who abandon an app after one use".

That definition is an emperor with no clothes.

The five numbers could not really be percentages of users because every user has many apps. So one may abandon app A after a single use, but one may also have used app B four times, and app C 12 times, etc.

It seems possible that they are counting user-app pairs. This measure is much harder to interpret because every user is represented as many times as he/she has apps. The more apps he/she has, the more times he/she is represented in the data.
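The gap between the two readings can be seen in a small computation over a made-up table of (user, app, uses) records - these are illustrative data, not anything from the article:

```python
# Two readings of "% who abandon an app after one use",
# computed from hypothetical (user, app, uses) records.
records = [
    ("u1", "A", 1), ("u1", "B", 4), ("u1", "C", 12),
    ("u2", "A", 1), ("u2", "D", 1),
    ("u3", "B", 7),
]

# Reading 1: share of user-app pairs with exactly one use.
pair_rate = sum(1 for _, _, n in records if n == 1) / len(records)

# Reading 2: share of users who abandoned at least one app.
users = {u for u, _, _ in records}
abandoners = {u for u, _, n in records if n == 1}
user_rate = len(abandoners) / len(users)

print(pair_rate, user_rate)  # prints 0.5 0.666...
```

The two metrics disagree on the same data, and the pair-based one weights heavy app downloaders more - which is exactly why the definition matters.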

And be careful, we are not counting all apps either. For the definition to make sense, we should be counting only apps that are downloaded in the given year. This means that lurking behind the time series is the proportion of "new" apps and how this evolved over time. It is also murky what "new" means. I am aware that many app developers keep forcing users to download upgraded apps - sometimes, I think these are counted when developers publish app download statistics. Obviously, someone who upgrades an app is likely to be an active user. So whether upgrades or later versions of the same app are counted or not is another question.

Finally, what constitutes a "use"?


From a Trifecta perspective, this is a Type DV chart. There are obvious visual flaws but the real issue is the missing context related to how the metric is defined.




Why is this chart so damn hard to read?

My summer course on analytical methods is already at the midway point. I was doing some research on recommendation systems the other day, and came across the following chart:

Rec sys chart

Ouch. This is from the Park et al. (2012) survey of research papers on this subject. It's the 21st century, people. The column chart copies the older-generation Excel design made infamous by Tufte, and since abandoned. Looking more closely, I suspect that the chart was hand-crafted, not made in Excel.

There are several challenges in reading this chart.


The gaps between columns are narrower than the columns themselves. Only in the last two years do all eight categories have nonzero counts. So a key task is to figure out which column stands for which type of application. Having one's eyes flip back and forth between the columns and the legend below the chart is a big hassle. As readers, we tend to learn a shortcut, which is to memorize the order of the categories (first column is Book, second column is Document, etc.). But zero-valued columns take up no width, so the column positions shift from year to year, thwarting this simple strategy.

The designer creates another obstacle by sorting the categories alphabetically. Shopping and movies are two of the most important applications and that message is buried.

The key to cleaning up this graphic is to bring the visual design closer to the question being addressed. The question of the chart is how interest in various applications has changed over time.

Here is a small-multiples presentation of the same data:


The answer is that applications are getting more diversified (the rise of the Other), and that Documents, Shopping and Movie applications were growing while research on Image, Music, TV Program and Book stagnated during the study period.
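A small-multiples panel of this kind can be sketched as follows, with made-up counts standing in for the survey data (only four of the eight categories are shown, to keep the sketch short):

```python
# Sketch of a small-multiples redo: one panel per application
# type, shared y-axis so trends are comparable across panels.
# All counts are hypothetical stand-ins for the survey data.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

years = list(range(2001, 2011))
data = {
    "Shopping": [0, 1, 1, 2, 3, 4, 5, 6, 8, 9],
    "Movie":    [1, 1, 2, 2, 3, 3, 5, 6, 7, 8],
    "Book":     [1, 1, 1, 2, 1, 1, 2, 1, 2, 1],
    "Other":    [0, 0, 1, 1, 2, 3, 4, 6, 8, 11],
}

fig, axes = plt.subplots(1, len(data), figsize=(10, 2.5), sharey=True)
for ax, (name, counts) in zip(axes, data.items()):
    ax.plot(years, counts, color="#4a7ebb")
    ax.set_title(name, fontsize=9)
    ax.set_xticks([years[0], years[-1]])  # minimal axis labeling
fig.tight_layout()
fig.savefig("small_multiples.png")
```

With one panel per category, the eyes never leave a column to hunt for a legend, and growth versus stagnation is read directly off each line.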