Who are you talking to?

Reader Daniel L. points us to this "dashboard" of statistics concerning downloads of a piece of software presumably called "maven". This sort of presentation has unfortunately become standard fare.

Maven-stats

Daniel was shocked by the pie chart. Just for laughs, here is the pie chart:

Maven_pie
Something else is worth noting -- ever wondered who the chart designer is talking to?

Is it an accountant who cares about every single download (thus needing the raw data)?

Is it a product manager who cares about the current run rates, and the mix of components downloaded?

Is it an analyst who is examining trends over time, aggregated across all components?

***

In other words, the first order of business is to identify the user. 


Answering an open call

Dan Goldstein, who writes the Decision Science News blog, relays an internal debate occurring at Yahoo! about the relative merits of some simple charts. From what I can tell, they used three types of methods (known as "Search", "Baseline", and "Combined") on four sample data sets with different subjects ("flu", "movies", "music", "games") and compared the performance of the methods. I imagine the underlying practical question to be: does having search data improve the performance of some kind of predictive model that can be applied to the different data sets? There is an existing baseline model that does not use search data.

(I noticed that Dan has since put up the final version of the chart they decided to use for publication. I will ignore that for the moment, and put up my response. Their final version is similar to my revised version.)

Dsn_charts

I'd like to use this data to reiterate a couple of principles that I have championed here over the years.

First, we must start a bar chart at zero. There was some back and forth on Dan's blog about whether this should be an iron-clad rule, and some comments about it being not a big deal. It is a big deal; just take a look:

Redo_barcharts2
The left chart has the full scale while the right chart chops off the parts below 0.4. As a result, the relative lengths of the bars become distorted. For the music data, the search method appears (on the right) to be half as effective as the baseline, which is far from reality as shown on the left chart.
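To put rough numbers on this distortion, here is a minimal Python sketch. The two scores are made up for illustration (they are not read off Dan's chart), and bar_length_ratio is a hypothetical helper, not part of any charting library.

```python
def bar_length_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar lengths for values a and b
    when the bars start at axis_start instead of zero."""
    if min(a, b) <= axis_start:
        raise ValueError("both values must lie above the axis start")
    return (a - axis_start) / (b - axis_start)


# Made-up scores for illustration only; not the values in the actual chart.
search, baseline = 0.5, 0.6

full_scale = bar_length_ratio(search, baseline)       # bars start at 0
chopped = bar_length_ratio(search, baseline, 0.4)     # bars start at 0.4

print(f"full scale: {full_scale:.2f}")   # about 0.83
print(f"chopped:    {chopped:.2f}")      # about 0.50
```

With these assumed values, the search bar is about five-sixths as long as the baseline bar on the full scale, but only half as long once everything below 0.4 is chopped off.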

I used R to generate these charts, and was pleasantly surprised that the barplot function automatically assumes that bars start at zero. If you try to start the vertical axis above zero, the bars would literally walk off the chart, making it extremely ugly! (I had to pull some tricks to create the version shown above.)

***

Andrew Gelman suggested using a line chart. He also recently wrote that he has become a fan of line charts. Long-time readers know I am a fan of line charts, too... and I have tifosi who come here to complain about my over-use of line charts, especially when we have categorical data (as here!).

In particular, I have written about grouped bar charts before, and most of the time, they can be made into line charts and made clearer and better. (See here, or here.)

Some of the readers of Dan's blog complained that the dot plot makes it difficult to compare the performance of say the search method across different subjects (data sets). They think the bar charts do this better.

If comparing across subjects is the key activity for the reader of this chart, then a line chart is even better. Imagine you are reading the bar chart and comparing across subjects. Follow your eyes. You are essentially tracing lines across the top of the bars. The line chart makes this explicit. That, to me, is the key argument for using line charts in place of grouped bar charts.

I have made this argument before. Here is an illustration of the argument. The broken red lines are the same as the lines in the line chart.

Redo_dsn_charts0

***

On the line chart shown above, it's easy to see that the Combined method has the best average performance, and is never worse than the other two methods for any of the subjects. It also shows that the music subject differentiates the three methods most, primarily because the search data was not adding much to the effort. There is also no need to add colors, which can quickly make the bar charts unwieldy and disorienting.

In the final chart, shown on the right below, I flipped the two axes, changed the plot characters, used colors, shifted emphasis slightly to dots rather than lines, and started the chart at 0.5 (!)

Redo_dsn_charts
 

Line charts are more flexible in that they can make sense even when the axis does not start at zero. In particular, when the point of the chart is to make comparisons, that is, to look at the gaps between dots or lines, rather than the absolute values, then it is fine to start the axis at some place other than zero.

Take again the example of the performance on music by the three methods (red line). The drop in performance between combined and baseline and that between baseline and search are indeed roughly equal. The vertical distances to the bottom of the chart are still distorted as in the bar chart, but in a line chart readers are less likely to get distracted by those distances because the bars are not there.
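The point about gaps can be checked with a few lines of Python. This is only a sketch under stated assumptions: the three music scores are invented, and drawn_gaps is a hypothetical helper. It shows that moving the start of the axis changes every height by the same amount, so the gaps between neighboring points stay the same.

```python
import math


def drawn_gaps(scores, axis_start=0.0):
    """Vertical gaps between consecutive points, as drawn on a chart
    whose axis starts at axis_start."""
    heights = [s - axis_start for s in scores]
    return [heights[i] - heights[i + 1] for i in range(len(heights) - 1)]


# Made-up scores for the music data: combined, baseline, search.
music = [0.7, 0.6, 0.5]

gaps_from_zero = drawn_gaps(music)                  # axis starts at 0
gaps_from_half = drawn_gaps(music, axis_start=0.5)  # axis starts at 0.5

# The gaps agree (up to floating-point noise) wherever the axis starts.
gaps_match = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(gaps_from_zero, gaps_from_half)
)
print(gaps_match)
```

The heights themselves (the distances to the bottom of the chart) do change with the axis start, which is why the same shift is harmful on a bar chart, where readers compare lengths.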


Book review: Interactive Graphics for Data Analysis

I am happy to provide the following review of this interesting book by Martin and Simon, who are readers of Junk Charts. Martin also publishes a blog, and he's the one who has created bumps charts for the Tour de France races (which also appear in the book).

Interactive Graphics for Data Analysis is an advanced book written by two researchers who have deep experience developing graphics software. People who like to go beyond the basics will find it a useful addition to the literature.

To give you an idea of the level of sophistication, just in Chapter 1 (titled Interactivity), the two authors utilize set operations, SQL statements, and parallel coordinate plots. They assume you have some sense of what those are. That said, those sections can be skipped without interrupting the flow of the book.

The following key messages from these authors are worth repeating:

  • There is a distinction between statistical graphics and data graphics. Underlying trends and patterns in the data are often made clear by performing statistical analyses on the data, with the results added to charts (e.g. loess lines). When dealing with very large data sets, statistical charts (such as box plots) are found to be much more scalable, precisely because they do not attempt to put every data point onto the page.
  • The authors stress the need to look at a variety of charts when doing exploratory data analysis. This is because most chart types do certain things well but not others.
  • Igda_img003  Throughout the book, they make much hay of the problem of "over-plotting", that is, overlapping data. This happens when data is abundant, or when values are concentrated in a narrow range. A great illustration of this problem is the parallel coordinates plot, which can look entirely different depending on which lines are plotted on top of which other lines. (The charts on the right are identical except for the order in which the lines are plotted.) Common strategies include "jittering", and varying transparency. Many of these strategies have issues of their own. 
  • They also point out that the look of many multivariate charts (such as mosaic charts) depends on the sorting of the data. This is a key weakness of many such plots. Just think about this the next time you create a stacked column chart.
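The "jittering" strategy mentioned in the list above can be sketched in a few lines of Python. This is my own minimal illustration, not code from the book: the function name, the width parameter, and the sample values are all assumptions.

```python
import random


def jitter(values, width=0.2, seed=0):
    """Return a copy of values with a small uniform random offset added
    to each one, so that identical values no longer plot on top of
    each other. width is the largest offset in either direction."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [v + rng.uniform(-width, width) for v in values]


# Four made-up observations that would overlap completely on a dot plot.
ratings = [3, 3, 3, 3]
spread = jitter(ratings)
print(spread)
```

One of the issues the authors allude to is visible even here: the plotted positions are no longer the true values, so the choice of width trades legibility against accuracy.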

The book is divided into two sections: Principles and Examples. The second half, the Examples section, consists of case studies in which the authors show examples of how to investigate the structure of a given data set.

Igdaimg002 The example of using the fatty-acid contents of Italian olive oils to deduce their regional origin is a good visualization of how the statistical technique of classification trees works. Here is the telling diagram:

 Notice that data with the same color are oils from the same region, the rectangular sections are results of the statistical classification procedure, and we would like to see most (if not all) of the data within each section having the same color.

***

Without a doubt, graphics designers should be aware of the issues raised by these authors. The book appears to be written for students who are creating statistical software (complete with end-of-chapter exercises). I'm left wondering what users of graphics software can do with this information because much of this material relates to the design of graphics software. Knowing these issues makes you want to do things the software may not be designed to do efficiently. For example, most software packages I have used do not have a simple toggle to sort categorical variables by various means (alphabetical, increasing or decreasing frequency, increasing or decreasing value of another variable, etc.).
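To make the sorting toggle concrete, here is a minimal Python sketch of what such a control might compute behind the scenes. Everything in it is hypothetical: order_categories, its parameters, and the sample labels are illustrations, not the interface of any existing package.

```python
from collections import Counter


def order_categories(labels, by="alphabetical", values=None, descending=False):
    """Return the distinct categories in labels, ordered by one of:
    'alphabetical', 'frequency' (how often each label occurs), or
    'value' (a number per category supplied in the values mapping)."""
    counts = Counter(labels)
    if by == "alphabetical":
        key = lambda c: c
    elif by == "frequency":
        key = lambda c: counts[c]
    elif by == "value":
        if values is None:
            raise ValueError("by='value' needs a values mapping")
        key = lambda c: values[c]
    else:
        raise ValueError(f"unknown ordering: {by}")
    return sorted(counts, key=key, reverse=descending)


# Made-up categorical data and a made-up second variable.
labels = ["b", "a", "b", "c", "b", "a"]
other = {"a": 2.0, "b": 0.5, "c": 1.0}

print(order_categories(labels))                                # alphabetical
print(order_categories(labels, "frequency", descending=True))  # most common first
print(order_categories(labels, "value", values=other))         # by another variable
```

The resulting list of categories could then be handed to whatever draws the bars or mosaic tiles, so that the ordering becomes a one-argument choice rather than a manual rearrangement of the data.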


Hoisted from the archives: a revolution

In October 2007, I wrote about the "canvas" metaphor for graphing software. This was what I said:

With the advent of AJAX and other interactive technologies, one can only hope that new graphing software will use the "canvas" metaphor. If we want to reduce the spacing between bars, we should be able to grab the bars and move them together. If we want to change the ordering, we should be able to mouse over some menu and select a pre-defined ordering scheme, or to drag and move bars around as we please, and so on.

To push this metaphor further, this kind of software should facilitate the "exploratory" stage of graph-making. I blogged about this stage of making sketches before. One longs for software that allows one to flip through many different chart types quickly, to settle on the desired type, and then to make the nitty-gritty changes to the axes, colors, dots, etc.

The revolution has arrived in the form of JMP's Graph Builder function. It is not perfect yet, as even the example I use will show, but I'm excited because we are getting closer to that "canvas" metaphor.

***

Spam_donuts

I'm going to re-make this inedible pair of donuts from an otherwise quite nice infographic on the growth and nature of spam in the last 10 years. (New Scientist)

I have often pointed out the biggest shortcoming of donut charts: the most important clue to the size of each sector of the underlying pie chart, that is, the angle at the center of the pie, has been cut off from the chart, and often, as here, obscured by a number.

There are dramatic shifts in proportions of spam types during the last decade but the effect is underwhelming as depicted.

In the Graph Builder, I can push around the data and create different chart types.  First, I made a small-multiples bar chart.

Bars_sm_multiple

By clicking on the word "Year" and dragging it to a box called "Overlay", I made a paired bar chart:

Paired bars

What about a dot plot instead? This change requires a right click but is easy enough:

Dots

Here's where I encountered a little inconvenience. It's probably ignorance on my part since I didn't read the manual. I couldn't figure out how to increase the dot size for all dots at once, only one at a time.

In any case, I'm still searching.  I want to do a small-multiples line chart. For this, I drag the word "Year" into the bottom of the chart labelled "X", and then right-click to add a line to the dot chart.

Lines_sm_multi

This is close to a desired chart type for this data.  The change from year to year is highly apparent, and the increased and decreased spam types are also obvious. I would color the increases differently from the decreases if I have the time.

I had a very difficult time getting the year labels to say 1999 and 2009, which are the logical points for this data (and in the end I failed). JMP seems to have a mind of its own.

Since it takes no time, I experimented some more.  By moving "Category" to "Wrap", I reproduced the above chart but in a matrix form:

Lines_sm_multi_wrapped

Finally, I made the "Category" an "overlay" which resulted in this chart.  This is kind of like the Bumps chart but obviously a bad idea for this data: (I'm not even showing the really ugly legend).

Lines_overlay_category

So, my dream toy -- the "canvas" style graph maker -- is here! It only takes a few minutes to move the data around this canvas, and see these different chart types.

***
I indicated that this goes a long way but isn't perfect. Right now, sketching and exploring is easy but refining and detailing is not as easy.

What I would like to see: once the general form of the chart is chosen, maybe a second canvas is needed, with Photoshop as a metaphor, in which we can chisel out the nitty-gritty details, like the axis labels, dot sizes, line widths and so on.

Also, the number of chart types can, and I presume will, be increased over time. For instance, I don't think the current version allows a profile chart; it seems to adhere to the overly-rigid rule that a categorical data series should not be connected by a line.

(I should say that in the current release, one way to accomplish this is to save the resulting graph-sketch as a "JMP script" and then go into the code and change things around. But since we are doing point and click, and visual interaction, why not go all the way?)

Most existing graphing software falls into one of two extremes: the Excel style, which is super-rigid, or the R style, which allows minute control over every little thing. This, I think, is the third way.

 


Data democracy

I had not been fully convinced of the direction of infographics until now -- I find too narrow the focus on organizing, structuring and visualizing large datasets; oftentimes, we get pretty pictures with extremely high data-ink ratios but more often than not, these very dense graphics fail to speak directly to readers. We see a lot of information; we find hardly any insights.

I think I have seen the future.  My friend Adam has been working on a web service called Empirasign, which I will describe as a form of data democracy - he takes boatloads of financial data, runs all sorts of analyses and models, and presents these results in a variety of formats, including on-line reports and tweets.  He does not attempt to visualize all the data, or all possible relationships.  Each analysis or model focuses on specific matters and he presents the result in tables and charts.

For example, a business problem might be as follows (timely for the year-end): in my portfolio, I am carrying some loser stock which I'd like to sell by year end so I can take a tax deduction on the loss, perhaps to cover some investment gains I have realized last year; however, I also believe that the loser stock may be near bottom, and if I sell now, I'd want to buy it back in short order - alas, this may be considered a "wash sale" and prohibited.  What if one can find a hedge (another stock or a portfolio of stocks) that replicates the performance of the loser stock so now I can get the best of both worlds - I sell the loser stock for the tax deduction, but keep the performance by taking a position in the hedge, then unwind when the regulation allows me to buy back into the loser stock?  (If you are interested in this trade, you should consult the experts: Adam's tutorial or wikipedia on "wash sale" or IRS-ese (pdf file).)

There are lots of stocks out there, and lots of possible hedges. An unsophisticated investor like myself would have to spend a lot of effort to find the right hedge. Also, it's very unlikely that staring at an infographic will uncover such hedges. What Adam has done is collect all the required data and run analyses to find the right hedge for pretty much every (loser) stock out there. And instead of presenting all the underlying data, he presents the results. See below.

Wash-sale-table
Wash-sale-chart

These data displays are not sexy, and they can be improved (for example, the explanation for the columns of the table is found on a separate page), but for the target audience looking for trade ideas, they get to the point. This is the gift of statistical data reduction.

What is also worth noting is that, through the magic of R and Web technologies, Adam makes all this run automatically, so the insights from the data are uncovered in real time. The wash sale avoidance strategy is not the only analysis he provides; there are tons more on the website that implement all sorts of other techniques (of which I am no expert), and it appears that users can pick and choose whatever strategy they like to follow, while Empirasign saves them the analytical work.

As I said at the start of this post, I see this as a promising direction for infographics, moving from visualizing data to visualizing insights.


P.S. As with previous years, I have updated my Amazon wish list (click on button on top right).  If you'd like to show your support for this blog, please help me build out my library.  Thanks to those readers who have contributed in past years - since Amazon does not always provide me your contact information, I have not been able to thank each of you personally.  Happy holidays! 




Seth's Rules

(Via Gelman blog)

Prominent marketer Seth Godin came up with some sensible rules for making "graphs that work".  We pretty much agree with most of what he says here, unlike the last time he talked about charting.

One must recognize that he has a very specific type of chart in mind, the purpose of which is to facilitate business decisions.  And not surprisingly, he advocates simple, predictable story-telling.

His first rule: dispense with Excel and Powerpoint.  Agreed but to our dismay, there are not many alternatives out there that sit on corporate computers.  So we need a corollary: assume that Excel will unerringly pick the wrong option, whether it is the gridlines, axis labels, font sizes, colors, etc.  Spend the time to edit every single aspect of the chart!

His second rule: never show a chart for exploration or one that says nothing.  I used to call these charts that murmur but do not opine.  (See here, for example.)  This pretty much condemns the entire class of infographics as graphs that don't work.   This statement will surely drive some mad.  One of the challenges that face infographics is to bridge the gap between exploration and enlightenment, between research and insight.  As I said repeatedly, I value the immense amount of effort taken to impose structure and clarity on massive volumes of data -- but more is needed for these to jump out of the research lab.

In rules 3 and 4, Seth apparently makes a distinction between rules made to be followed and rules made to be broken.  In his view, time going left to right belongs to the former while not using circles belongs to the latter.  He gave a good example of why pictures of white teeth are preferred to pie charts, bravo.  I hope all those marketers are listening.

As readers know, I cannot agree with "don't connect unrelated events".  He's talking about using line charts only for continuous data.  This rule condemns the whole class of profile plots, including interaction charts in which statisticians routinely connect average values across discrete groupings.  The same rule has created the menace of grouped bar charts used almost exclusively to illustrate market research results (dozens to hundreds of pages of these for each study).  I'd file this under rules made to be broken!

What menace?

Menace1

What menace?

Menace2


What menace?

Menace3


What menace?

Alright, I made my point.  If you don't work in market research, the mother lode of cross-tabs and grouped bars, consider yourself lucky.  If you do, will you start making line charts please?






R power, math stats power

Amusingly, the New York Times finally got wind of the R software.   See the article here.

We in the statistics community owe these folks a lot of gratitude for developing such flexible, powerful software. It is unfortunate that they didn't mention graphing as one of the great strengths of the software.

Equally amusingly, the Wall Street Journal told us what we already know, that we have the best jobs in the world.  Their discovery here.


 


Charts, charts, charts

Jorge Camoes has been a regular reader and sometime commenter for a while.  Little did we know that he has been blogging in Portuguese for the last 10 months.  Recently, he has decided to join the English-speaking world.  His new blog is, simply, Charts.

One post discusses the "population pyramid" chart for comparing advertising spending. 
Chartsspend

He suggested the overlapping bar chart; see his comment here. By folding one side onto the other, this chart is clearly an improvement over the original, and yet it fails to convey the proportional spend, which is the key point being made in the article.

In another post, Jorge created a "screencast" (tutorial) of how to create a population pyramid in Excel. A lot of this mirrors my own experience using Excel for graphing. Those of you who have asked for tips in the past should definitely see it.

What you'll find is that creating a nice-looking chart in Excel requires a lot of tedious finger-work.  It is truly incredible how many steps, how much opening and closing of windows, back and forth navigation, etc. users are made to suffer through to make cosmetic changes.

With the advent of AJAX and other interactive technologies, one can only hope that new graphing software will use the "canvas" metaphor. If we want to reduce the spacing between bars, we should be able to grab the bars and move them together. If we want to change the ordering, we should be able to mouse over some menu and select a pre-defined ordering scheme, or to drag and move bars around as we please, and so on.

(I have heard that Apple's spreadsheet software Numbers has some of these features.  I have yet to use it myself.  If any of you have, let us know what you think.)