I am happy to provide the following review of this interesting book by Martin and Simon, who are readers of Junk Charts. Martin also publishes a blog, and he's the one who has created bumps charts for the Tour de France races (which also appear in the book).
Interactive Graphics for Data Analysis is an advanced book written by two researchers who have deep experience developing graphics software. People who like to go beyond the basics will find it a useful addition to the literature.
To give you an idea of the level of sophistication, just in Chapter 1 (titled Interactivity), the two authors utilize set operations, SQL statements, and parallel coordinate plots. They assume you have some sense of what those are. That said, those sections can be skipped without interrupting the flow of the book.
The following key messages from these authors are worth repeating:
- There is a distinction between statistical graphics and data graphics. Underlying trends and patterns in the data are often made clear by performing statistical analyses on the data and adding the results to charts (e.g. loess lines). For very large data sets, statistical charts (such as box plots) scale much better, precisely because they do not attempt to put every data point onto the page.
- The authors stress the need to look at a variety of charts when doing exploratory data analysis. This is because most chart types do certain things well but not others.
- Throughout the book, they make much of the problem of "over-plotting", that is, overlapping data. This happens when data is abundant, or when values are concentrated in a narrow range. A great illustration of this problem is the parallel coordinates plot, which can look entirely different depending on which lines are plotted on top of which other lines. (The charts on the right are identical except for the order in which the lines are plotted.) Common strategies for dealing with over-plotting include "jittering" and varying transparency, but many of these strategies have issues of their own.
- They also point out that the look of many multivariate charts (such as mosaic charts) depends on the sorting of the data. This is a key weakness of many such plots. Just think about this the next time you create a stacked column chart.
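The scalability point in the first bullet can be made concrete with a small sketch. It is only an illustration, not code from the book: the function name `five_number_summary` and the simulated data are my own assumptions, and quartile conventions differ between software packages. A basic box plot draws the same five numbers whether the data set has a thousand points or a hundred thousand.

```python
import random
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a basic box plot draws.

    Hypothetical helper for illustration. Quartiles use the "inclusive"
    method of statistics.quantiles; other packages may use other rules.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Simulated data (an assumption, not data from the book).
rng = random.Random(0)
small = [rng.gauss(0, 1) for _ in range(1_000)]
large = [rng.gauss(0, 1) for _ in range(100_000)]

# Both summaries contain exactly five numbers, however many points went in.
print(len(small), "points ->", len(five_number_summary(small)), "numbers")
print(len(large), "points ->", len(five_number_summary(large)), "numbers")
```

A scatter plot of `large` would need to place all 100,000 points on the page, which is where over-plotting begins.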
The book is divided into two sections: Principles and Examples. The second half, the Examples section, consists of case studies in which the authors show how to investigate the structure of a given data set.
The example of using the fatty-acid contents of Italian olive oils to deduce their regional origin is a good visualization of how the statistical technique of classification trees works. Here is the telling diagram:
Notice that data with the same color are oils from the same region, the rectangular sections are results of the statistical classification procedure, and we would like to see most (if not all) of the data within each section having the same color.
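The idea of "most of the data within each section having the same color" can be put into numbers. The sketch below is my own illustration, with made-up sections and counts rather than the book's olive oil data. It computes the share of the majority region in a section, plus the Gini impurity, a criterion commonly used when growing classification trees (I do not know which criterion the authors' software uses).

```python
from collections import Counter


def section_purity(labels):
    """Share of the points in one section that belong to its most common region."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def gini_impurity(labels):
    """Gini impurity: 0 when every point in the section has the same label."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical sections of a tree diagram (invented counts, for illustration).
sections = {
    "section A": ["South"] * 9 + ["North"],          # nearly one color
    "section B": ["North"] * 5 + ["Sardinia"] * 5,   # evenly mixed
}

for name, labels in sections.items():
    print(f"{name}: purity={section_purity(labels):.2f}, "
          f"gini={gini_impurity(labels):.2f}")
```

A section that is nearly one color scores a purity close to 1 and an impurity close to 0, which is what a good classification should look like in the diagram.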
Without a doubt, graphics designers should be aware of the issues raised by these authors. The book appears to be written for students who are creating statistical software (it comes complete with end-of-chapter exercises). I'm left wondering what users of graphics software can do with this information, because much of the material relates to the design of graphics software. Knowing these issues makes you want to do things the software may not be designed to do efficiently. For example, most software packages I have used do not have a simple toggle to sort categorical variables by various means (alphabetical, increasing or decreasing frequency, increasing or decreasing value of another variable, etc.).
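To show what such a toggle might look like, here is a minimal sketch. The function `order_categories`, its parameters, and the sample data are all hypothetical, and I have assumed that "value of another variable" means the per-category mean. It returns the distinct categories in the order a chart would then draw them.

```python
from collections import Counter, defaultdict


def order_categories(categories, by="alphabetical", values=None, descending=False):
    """Return the distinct categories in the requested order.

    by: "alphabetical", "frequency", or "value" (mean of `values` per category).
    Ties keep alphabetical order because Python's sort is stable.
    """
    counts = Counter(categories)
    names = sorted(counts)  # alphabetical baseline
    if by == "alphabetical":
        return sorted(names, reverse=descending)
    if by == "frequency":
        return sorted(names, key=lambda c: counts[c], reverse=descending)
    if by == "value":
        if values is None:
            raise ValueError("by='value' needs a `values` sequence")
        totals = defaultdict(float)
        for c, v in zip(categories, values):
            totals[c] += v
        return sorted(names, key=lambda c: totals[c] / counts[c], reverse=descending)
    raise ValueError(f"unknown ordering: {by!r}")


# Made-up data for illustration.
fruit = ["pear", "apple", "pear", "fig", "apple", "pear"]
price = [1.0, 9.0, 2.0, 5.0, 7.0, 3.0]

print(order_categories(fruit))                                   # ['apple', 'fig', 'pear']
print(order_categories(fruit, by="frequency", descending=True))  # ['pear', 'apple', 'fig']
print(order_categories(fruit, by="value", values=price))         # ['pear', 'fig', 'apple']
```

The returned list could then be passed to whatever plotting routine controls the order of bars or mosaic tiles.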