Involuntary head tilt

A data graphic's first mission is communication. Looking cute should not come first. This one, by MIT Technology Review (link), is a big fail.


What makes the designer want to tilt the reader's head?

This chart is unreadable. It also fails the self-sufficiency test: all 13 data values are printed directly onto the chart, so the axis and the gridlines add nothing.

A further design flaw is the use of signposts. Our eyes are drawn to the hexagons containing the brand icons, but the data sit at the other end of the signpost, where it is planted on the surface!

Here is a sketch of something not as cute:


Notice that I expressed time as years and undid the log transform on the users axis. The mobile-related entities are labelled in red. The dots could be replaced by the hexagonal brand icons.


The other two charts on the same page have their own issues. Health warning: your head may tilt.


Digging it out

Another sunset photo compilation?  Not quite.

This chart acts and smells like the sunset chart, being generated by many unknowing collaborators, this time visitors to the content-aggregation site Digg.  For those unfamiliar, web surfers can "digg" any web page they find interesting (by clicking on an icon), which generates a link at Digg's web site.  The number of Diggs can then be used to judge the value or popularity of a web page.

In effect, Digg is a gigantic save folder for the masses.  What happens when we have huge amounts of data?  We have to work really hard to dig out the useful information.  This chart goes quite a long way toward answering one specific question.

Digg users are plotted horizontally and the stories they digged are plotted vertically.  The bright white vertical strip represents suspicious activity: one user digged a large number of stories within the chart's time window, most likely a bot trying to game the mass rating system.

Flickr and Digg are two of the more prominent examples of the so-called "Web 2.0", or mass collaboration on the Web.  Between my last post and this post, I have rather lost enthusiasm for this type of chart, at least from a statistical perspective.  There is no real collaboration: the photographer who contributed sunset No. 103 does not know the one who uploaded No. 31, for example.  By that logic, every survey or census ever conducted would qualify as mass collaboration, just because many participants provide data.

What's worse, a typical survey at least draws on a random sample.  These charts are all built on highly biased samples, and I have yet to see any discussion of this issue.  They cannot be interpreted without understanding who participated.

Reference: "How Digg Combats Cheaters", Technology Review, Jan 24, 2007.

Industry sector innovation indices

Here is a chart from MIT's Technology Review and a junkart version:


These are both great charts.  As always, it's important to marry form with function.  If one wants to read off the sector ranks, the dot chart works better; if one wants to focus on the change in ranks, the line chart works better.  If one wants to track sectors as they change over a longer period of time, the line chart works its magic: we can just stack a bunch of them next to each other.

The headline identifies 6 "improving" sectors.  This is difficult to see in the dot chart because the reader needs to associate orange with 2003 and yellow with 2004, regardless of which color appears on the left.  In the line chart, the improving sectors are the ones with lines going up; I colored them blue for clarity.

Moving from the graphical to the statistical, I have major problems with the creation of this "innovation index".

The Innovation Index is calculated by combining, with equal weights, 2004 R&D spending rank, percent change in R&D spending, absolute change in R&D spending, and R&D spending as a percentage of sales.

It is unclear why those variables (a mix of ranks, percentages and dollars) were combined with equal weights.  And in fact, absolute change in R&D spending probably dominates the index, since it is measured on the largest and widest scale.
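A quick sketch shows why.  The company names and numbers below are invented for illustration, not taken from the Technology Review data; the point is that summing raw components on wildly different scales lets the widest-scaled one decide the order.

```python
# Hypothetical values for three companies (invented for illustration):
#              2004 rank, % change, abs change ($M), R&D/sales (%)
companies = {
    "A": (  5,  18.0, -300.0, 12.0),
    "B": ( 80,   2.0,  100.0,  4.0),
    "C": (150, -15.0,  450.0,  0.5),
}

# Equal-weight "index": just add the four raw components together.
index = {name: sum(vals) for name, vals in companies.items()}

ranking = sorted(index, key=index.get, reverse=True)
by_abs_change = sorted(companies, key=lambda n: companies[n][2], reverse=True)

print(ranking)        # ['C', 'B', 'A']
print(by_abs_change)  # ['C', 'B', 'A']: abs change alone decides the order
```

Company A beats C on three of the four components, yet C tops the index on the strength of absolute change alone.  Standardizing each component before averaging would at least put them on comparable scales.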

And then, the sector ranking is taken as the average of the ranks (1-150) of the companies in each sector.  This average is meaningless because it implicitly assumes that the gap between company #1 and company #2 is the same as the gap between company #149 and company #150: an ordinal ranking is treated as an interval scale without justification.  Besides, some sectors consist of only 3 companies while others contain 28.  If I have time, I will illustrate these points with charts.
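In the meantime, here is a minimal numerical sketch (the scores are invented) of how averaging ranks can mislead: two sectors can tie on average rank while sitting far apart on the underlying scores, because a rank gap of 1 can hide score gaps of any size.

```python
# Invented innovation scores for five companies, ranked 1 (top) to 5.
scores = {1: 100.0, 2: 40.0, 3: 39.0, 4: 38.0, 5: 37.0}   # rank -> score

sector_a = [1, 5]      # a small sector: the top company plus the bottom one
sector_b = [2, 3, 4]   # a middling sector

def avg_rank(sector):
    return sum(sector) / len(sector)

def avg_score(sector):
    return sum(scores[r] for r in sector) / len(sector)

print(avg_rank(sector_a), avg_rank(sector_b))    # 3.0 3.0  -- a dead heat
print(avg_score(sector_a), avg_score(sector_b))  # 68.5 39.0 -- not close
```

The two sectors share an average rank of 3, yet sector A's average score is nearly double sector B's, because the rank scale erases the huge gap between company #1 and company #2.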

Reference: "R&D 2005", Technology Review, August 2005.