I contributed the following post to the Statistics Forum. They are having a discussion comparing information visualization and statistical graphics. I use the following matrix to classify charts in terms of how much work they make readers do, and how much value readers get out of doing said work.
Like Australia-based reader Ken B., I don't understand why many chart designers insist on using charts to deliver lessons to the public on map geography. Here is a recent example from Down Under, on earthquakes: (click on this link for the interactive version)
Was there a quake that shook the middle of the Pacific? Did a new geological formation give New Zealand a Pinocchio nose? No and no. The ugly presentation of the 2010 and 2011 Christchurch earthquakes -- as two ends of a dumbbell -- makes clear what a straitjacket maps can be when it comes to delivering quantitative information.
Besides, the bubbles represent the relative magnitudes of the quakes, when one would hope that their sizes represented the geographical extent of the damage; at least that would be information with a spatial dimension.
The location of the quake is the only data with a spatial dimension surfaced on this plot. The only purpose of the map background is to tell us where Christchurch, Sichuan, etc. are on a map. In order to deliver this map lesson, the designer has to hide all of the more interesting data, like the relative magnitudes, the timelines, the extent of the damage, the mortality rates, etc. To my mind, that is a very poor tradeoff.
Martha left a comment on my previous post asking my comments on this National Geographic word cloud map of surnames in the U.S. (Click on the link to look at the interactive map.)
Here is a close-up of California:
Anytime someone expands the possibilities of a chart type, like the word cloud, it's a commendable project. So I'm quite enthusiastic about what they tried to do here. Not every new feature is successful, though.
These are the things I like:
Using colors that mean something: they use different colors to indicate the countries of origin of particular surnames. Good idea. I would prefer one color per continent, with different shades within each continent.
For once, the data being depicted is not a speech or a piece of text; it's a set of surnames.
This chart (or map) is multivariate: it tries to address deeper questions, such as the correlation between geography and origin of popular names, and the correlation between geography and popularity of names. This is an important advance over all those word clouds out there that tell us nothing but the frequency of words in a document. In general, statistical clustering methods can be combined with text-mining methods to develop multivariate word clouds.
The designers realize it's a futile -- as well as ill-advised -- task to try to print every name on the map, so they include only the top 25 names in each state. As I explain below, I'm not happy with this inclusion/exclusion criterion, but the key point is that by taking out the minor bits of data ("noise"), the chart is better able to draw our attention to the more interesting parts.
These are things I don't like:
They really ought to have used relative popularity rather than absolute popularity. This is another area of improvement for all word clouds. Today, word clouds plot the number of times a specific word appears in a piece of text. We often want to compare several word clouds against each other; and when we do that, the only sensible measure is the proportion (relative frequency) with which a specific word appears. Say one compares Obama and McCain speeches by comparing two word clouds. If the two speeches differ significantly in length, then comparing the number of times each candidate uses "education" words is silly -- we have to compare the number of times per unit length of speech.
The cutoff of top 25 names in each state suffers a similar problem. The 26th most popular name in California, a populous state, is of more interest than, say, the 15th most popular name in Montana (or insert your favorite small state). A more sensible cutoff would be to include names that account for at least 2 percent (say) of a state's population. By doing this, the more populated states would have more entries than the less populated states.
Given the above bullets, it is not surprising that the word-size scale has serious problems. Because it encodes an absolute count rather than a share of each state's population, the big words can show up only in populous states. In other words, the size of the words mostly tells us about the geographical distribution of the U.S. population. As I have mentioned before (for example, here), this insight is available on pretty much every map ever used to plot data. The one thing all these maps never fail to tell us is that most of the U.S. population is bi-coastal. Unfortunately, the real message of the map -- in this case, the geography of surnames -- is drowned out.
And then, the map invents false data. Notice that there are 1,250 locations marked on the map (25 names times 50 states). Placement is a visually prominent feature, and yet there is no rhyme or reason to where the names sit, apart from respecting state boundaries. The casual reader may think that the appearance of the Chinese name "Lee" in inland central California implies that Lee-named Chinese-Americans congregate in those parts of the state. Far from the truth!
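As a sketch of the relative-popularity cutoff suggested above -- include a surname only when it exceeds some share of a state's population -- consider the following; all the counts and populations here are invented for illustration:

```python
# Relative cutoff sketch: a surname appears on the map only if it accounts
# for at least 2% of its state's population. All numbers are made up.

state_pop = {"California": 39_000_000, "Montana": 1_100_000}

# (state, surname) -> number of residents with that surname (hypothetical)
name_counts = {
    ("California", "Garcia"): 1_000_000,   # ~2.6% of CA
    ("California", "Lee"):      600_000,   # ~1.5% of CA
    ("Montana", "Anderson"):     30_000,   # ~2.7% of MT
}

def names_to_plot(name_counts, state_pop, threshold=0.02):
    """Return the (state, surname) pairs whose relative frequency meets
    the threshold, mapped to their share of the state's population."""
    return {
        (state, name): count / state_pop[state]
        for (state, name), count in name_counts.items()
        if count / state_pop[state] >= threshold
    }

selected = names_to_plot(name_counts, state_pop)
```

Under this rule, a name with 600,000 bearers in California is excluded while a name with 30,000 bearers in Montana gets in, which is exactly the point: popularity is judged relative to each state.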
So, I think they did a reasonable job in rethinking the possibilities of word clouds. It's well intentioned and there is room for improvement.
Lastly, they might get some ideas from the Baby Names navigator.
The work of Hans Rosling and Gapminder (whose Trendalyzer software is now part of Google) highlighted moving images as part of the graphics toolbox. Let me call these "graphlicks": graph-movies. It is clear that lots of people love graphlicks.
There is one open problem in graphlicks that needs creative solutions: how to incorporate memory into the experience?
If a movie is needed to show patterns in the data, it is because the pattern is temporal: the changes over time are what's interesting. As the movie runs from Day/Month/Year 1 to Day/Month/Year X, the old stuff is usually taken off the canvas to make way for the newer stuff. In effect, we rely on the reader's memory of past scenes to compare against the current scene in the movie.
What gets me thinking about this is a graphlick created by my friend Adam, whose startup Empirasign compiles and markets data on mortgage prices and other financial data:
The data relates to 30-year mortgages originated in 2010. The coupon rate, shown on the horizontal axis, ranges from below 4% to 8%; the coupon is the cash flow an investor receives. Each line chart shows how the "market" was valuing the 30-year mortgages of different coupon rates on a particular day. The price is an index, equaling 100 at issue.
The general shape of the line indicates that the market valued the higher-coupon mortgages more than the lower-coupon ones (except at the right tip of the line). Since interest rates had been coming down, the mortgages issued at a 4% coupon were newer than those issued at 7-8%, which means they carried higher "duration risk" for investors, and thus lower value. The dip beyond 7% may be due to a countervailing "prepayment risk": if the borrower prepays, the investor is forced to take 100 for something they may have paid over 100 for.
As you play the graphlick, two features of the data ought to stand out: the general upward shift of the line, which indicates that the market was raising its valuation of these mortgages over the year (regardless of coupon); and the stronger volatility on the left side of the line.
Noticing either feature requires the reader to remember the trajectory of the lines. What are some ways to help the reader?
Fade out but don't remove old lines?
Include a cumulative average line?
Include an "envelope" that captures the maximum and minimum prices over time?
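The last two ideas can be sketched quickly: at any frame of the movie, the reader could also be shown a running min/max "envelope" and a cumulative average of all the lines drawn so far. The daily price curves below are invented:

```python
# Memory aids for an animated line chart: at frame t, summarize every line
# shown so far as a running min/max envelope plus a cumulative average.
# Each row is one (invented) day's price curve across coupon buckets.

frames = [
    [100.0, 101.0, 103.0],   # day 1
    [101.0, 102.5, 103.5],   # day 2
    [ 99.5, 102.0, 104.0],   # day 3
]

def memory_aids(frames, t):
    """Summaries of all frames up to and including frame t (0-indexed)."""
    seen = frames[: t + 1]
    n = len(seen[0])
    envelope_lo = [min(day[i] for day in seen) for i in range(n)]
    envelope_hi = [max(day[i] for day in seen) for i in range(n)]
    cum_avg = [sum(day[i] for day in seen) / len(seen) for i in range(n)]
    return envelope_lo, envelope_hi, cum_avg

lo, hi, avg = memory_aids(frames, 2)
```

Drawing `lo` and `hi` as a shaded band behind the current line, with `avg` as a faint reference line, would let a reader see both the upward drift and the left-side volatility without having to remember earlier frames.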
The Trifecta checkup requires us to align all three aspects to make a great chart. It is sometimes the case that a wise choice has been made regarding the type of chart, but the other elements are missing. Reader Parker S. sent in an example of such a chart.
This chart created by ESPN illustrates the evolution of the "power ranking" of the San Diego Chargers football team within each 18-week-long season and across multiple years.
The bumps chart was invented for precisely this type of ranking-over-time data, and in fact we are looking at a bumps chart.
But it comes with lots of distractions: the multiple colors (instead of year labels), the dots, the legends, the year selector, and no foregrounding of the current season.
***
Parker couldn't figure out the practical question this chart is supposed to answer (the top corner of the Trifecta).
It seems to me that the more interesting question is how different teams fare from week to week within a given season, rather than how one team fared from week to week over consecutive seasons.
In fact, one of the secrets of the Bumps chart -- the reason why it feels far less cluttered than it has the right to be -- is that no two data points will overlap, that is, for any given week, only one team occupies any particular rank. This simple rule is violated when the same team's rank across multiple seasons is plotted, and thus the chart feels very busy.
It proved impossible to find a source of ESPN power rankings that has all teams for a given season. However, I found something similar at CBS Sportsline, a competitor. Here is their version of the ranking chart:
They got the practical question right but severely under-utilized the form. We can see how the Chargers season is going but have no ability to compare them to other teams.
We can start with the question of visualizing how Chargers and their AFC West compatriots are doing relative to the rest of the league:
The AFC West is a mediocre division this season, with all four teams in the middle of the pack and none in the top quarter of the table. The Chargers started high, plunged, and are recovering, while the Oakland Raiders have improved over the course of the season.
The Bumps chart is more powerful when the full set of data is plotted, and when the lines are highlighted with reference to the question being answered. Are AFC teams or NFC teams doing better?
The next one highlights the teams that earned the largest change in ranking from week 1 to week 10. The background (gray lines) consists of those teams whose rankings in Week 10 were within 5 places of their initial rankings.
The practical question might be whether Week 1 rankings are a good predictor of Week 10 rankings. The following chart shows that most teams in the top quartile remain there (except San Diego, which is coming back, and Dallas, which could be coming back too); the bottom-quartile teams also tend to remain there; while, not surprisingly, the middle teams don't stay in the middle. The color scheme should be reversed if one wants to highlight the dispersion of the rankings of these middle teams by Week 10.
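The foreground/background split used in these charts reduces to a simple rule over two weekly snapshots: highlight the teams whose ranking moved more than a cutoff number of places, and gray out the rest. The team names and rankings below are invented for illustration:

```python
# Split teams into foreground (big movers) and background (stayers) based
# on the change in ranking between two weeks. All rankings are invented.

week1  = {"Chargers": 4,  "Raiders": 28, "Cowboys": 8,  "Jets": 14}
week10 = {"Chargers": 18, "Raiders": 15, "Cowboys": 22, "Jets": 12}

def split_fore_back(week1, week10, cutoff=5):
    """Teams that moved more than `cutoff` places go to the foreground;
    everyone else becomes a gray background line."""
    fore, back = [], []
    for team in week1:
        if abs(week10[team] - week1[team]) > cutoff:
            fore.append(team)
        else:
            back.append(team)
    return sorted(fore), sorted(back)

fore, back = split_fore_back(week1, week10)
```

The same function, called with a different cutoff or with different week pairs, would drive any of the highlighting variations discussed above.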
A reader sent in this "pie chart" (better called a "donut chart") which summarizes the results of this survey.
My dislike of donut charts has been well documented. Click here.
What I want to discuss is the use of interactivity, a feature of this chart but something that backfires. The underlying data is a 5-level rating of "corporate sentiment" by industry, by country, and over time. That would be 4 dimensions jostling for space on a surface. Obviously, some decisions have to be made as to which dimension to highlight and which to push to the background.
This chart highlights the 5-level ratings using the donut device. All other dimensions are well hidden by the interactive feature. Pressing on the forward/backward buttons reveals the industry dimension. Pressing on the arrow on the top left corner reveals the time dimension. Pressing on the map reveals the country dimension.
The problem with this level of concealment is that readers are prevented from viewing multiple dimensions at once. For instance, it is very hard to see the differences in sentiment between industries, or between countries, or the change in sentiment over time.
The version on the right shows, for instance, the distribution of ratings by industry for Q3 2010, for all of Asia combined. This is a rough sketch, and one would want to fix quite a few things: making the sector labels horizontal, reducing the distance between the columns, labeling the rating levels (1 as "very positive", etc.), ordering the sectors from most positive to least positive, and so on.
A chart of ratings by country (aggregating all industry sectors) would follow the same format. Similarly, one can compare ratings across countries for a given sector, and this can be replicated 11 times, once for each sector. The same goes for ratings across industries for any given country.
For comparisons across time, I'd suggest using average ratings rather than keeping track of five proportions. This reduces a lot of clutter that does not improve readers' comprehension of the trends. A line chart would be preferred.
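The averaging idea can be sketched as follows. The rating convention (1 = very positive through 5 = very negative) and the quarterly proportions are assumptions made up for illustration:

```python
# Collapse a 5-level rating distribution into one average score, so that
# sentiment over time can be drawn as a single line per country or sector.
# Convention assumed here: 1 = very positive, 5 = very negative.

def average_rating(proportions):
    """Weighted mean of rating levels 1..5 given their proportions."""
    assert abs(sum(proportions) - 1.0) < 1e-9, "proportions must sum to 1"
    return sum(level * p for level, p in zip(range(1, 6), proportions))

q3_2010 = [0.30, 0.40, 0.20, 0.07, 0.03]   # invented: mostly positive
q4_2010 = [0.10, 0.20, 0.30, 0.25, 0.15]   # invented: sentiment worsening

trend = [average_rating(q3_2010), average_rating(q4_2010)]
```

One line chart of such averages per country (or per sector) conveys the trend far more directly than a sequence of donuts, at the cost of hiding the shape of the distribution.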
A better way to organize the chart is to start with the types of questions that the reader is likely to want to answer. Clicking on each question (say, compare ratings across industries within a country) would reveal one of the above collections of charts.
Another improvement is to add annotations. For instance, one wonders whether the airlines colluded to all give a 2 rating. It is always a great idea to direct readers' attention to the most salient parts of a chart, especially if it contains a lot of data.
I look at a fair number of online videos, especially those embedded on blogs. But I haven't seen this feature implemented broadly. It is a wow feature.
Look at the dots above the progress bar: they tell you what topic is being discussed and allow you to jump back and forth between segments. (The particular dot I moused over said "Randy Moss".) The video I saw came from this link.
This simple-looking feature is immensely useful to users. You can efficiently search through the audio file and find the segments you're interested in. It's like bookmarks students might put on pages of a textbook for easy reference, except these are audio bookmarks.
Why isn't this feature more prevalent? I think it's because of the amount of manual effort needed to set it up. Imagine how the data has to be processed. In the digital age, the audio file is a bunch of bits (ones and zeroes), so neither a computer nor a human can identify topics from the data stored in that way. Someone would need to listen to the audio file, mark off the segments manually, and tag the segments. Then the audio bookmarks can be plotted on the progress bar... basically a dot plot with time on the horizontal axis.
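Here is a rough sketch of the data structure such a feature might rest on: a hand-tagged list of (start time, topic) pairs, plus a lookup that tells the player which topic a given playback position falls in. The segment times and topics are invented (only "Randy Moss" comes from the video above):

```python
# Audio bookmarks as data: hand-tagged segments with start times (seconds),
# and a lookup from playback position to topic. All segments are invented.
import bisect

segments = [
    (0,   "Intro"),
    (95,  "Randy Moss"),
    (260, "Injury report"),
    (410, "Listener questions"),
]

def topic_at(segments, t):
    """Return the topic whose segment contains playback time t."""
    starts = [start for start, _ in segments]
    i = bisect.bisect_right(starts, t) - 1   # last segment starting <= t
    return segments[i][1]
```

Plotting the start times on the progress bar gives exactly the dot plot described above, and `topic_at` is all the player needs to show a label on mouseover.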
In theory, you can train a computer to listen to an audio file and approximate this task. The challenge is attaining the required accuracy so that you don't need to hire an army of people to correct mistakes.
A very simple concept but immensely functional. Great job!
Bill Zeller, a PhD student at Princeton, sent me the link to his project "graph your inbox", an attempt to visualize the "data" in your Gmail account.
It seems to me that it acts as a sophisticated "search my mail" engine. The most interesting part is the ability to click on a point or a bar in one of the charts and have the corresponding emails show up in the preview panel. This kind of interactivity is also available in modern commercial graphing packages, and it is extremely useful for data exploration.
Technically, this is a compelling achievement: the amount of data being processed, organized, summarized, and plotted is substantial.
I think he needs to figure out some compelling use cases for something like this. Can you help? How would you use this capability if it were available?
I am happy to provide the following review of this interesting book by Martin and Simon, who are readers of Junk Charts. Martin also publishes a blog, and he's the one who has created bumps charts for the Tour de France races (which also appear in the book).
Interactive Graphics for Data Analysis is an advanced book written by two researchers who have deep experience developing graphics software. People who like to go beyond the basics will find it a useful addition to the literature.
To give you an idea of the level of sophistication, just in Chapter 1 (titled Interactivity), the two authors utilize set operations, SQL statements, and parallel coordinate plots. They assume you have some sense of what those are. That said, those sections can be skipped without interrupting the flow of the book.
The following key messages from these authors are worth repeating:
There is a distinction between statistical graphics and data graphics. Underlying trends and patterns in the data are often made clear by performing statistical analyses and adding the results to charts (e.g. loess lines). When dealing with very large data sets, statistical charts (such as box plots) are much more scalable, precisely because they do not attempt to put every data point onto the page.
The authors stress the need to look at a variety of charts when doing exploratory data analysis. This is because most chart types do certain things well but not others.
Throughout the book, they make much of the problem of "over-plotting", that is, overlapping data. This happens when data is abundant, or when values are concentrated in a narrow range. A great illustration of this problem is the parallel coordinates plot, which can look entirely different depending on which lines are plotted on top of which others. (The charts on the right are identical except for the order in which the lines are plotted.) Common strategies include "jittering" and varying transparency, though many of these strategies have issues of their own.
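Jittering, at least, is easy to sketch: add a small random offset to each value so that identical values no longer land on the same spot, while no point moves by more than half the jitter width. The data and width below are made up:

```python
# Jittering sketch: perturb each value by a small uniform random offset so
# over-plotted points become distinguishable. Data and width are invented.
import random

def jitter(values, width=0.2, seed=42):
    """Offset each value by at most width/2 in either direction.
    A fixed seed keeps the result reproducible."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width / 2, width / 2) for v in values]

data = [3.0, 3.0, 3.0, 5.0, 5.0]   # heavy over-plotting
spread = jitter(data)
```

The drawback the authors hint at applies here too: the jittered positions are noise, so a reader must not interpret the small displacements as data.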
They also point out that the look of many multivariate charts (such as mosaic charts) depends on the sorting of the data. This is a key weakness of many such plots. Just think about this the next time you create a stacked column chart.
The book is divided into two sections: Principles and Examples. The second half, the Examples section, consists of case studies in which the authors show examples of how to investigate the structure of a given data set.
The example of using the fatty-acid content of Italian olive oils to deduce their regional origin is a good illustration of how the statistical technique of classification trees works. Here is the telling diagram:
Notice that data points of the same color are oils from the same region, the rectangular sections are the result of the statistical classification procedure, and we would like to see most (if not all) of the data within each section having the same color.
Without a doubt, graphics designers should be aware of the issues raised by these authors. The book appears to be written for students who are creating statistical software (complete with end-of-chapter exercises). I'm left wondering what users of graphics software can do with this information, because much of this material relates to the design of graphics software. Knowing these issues makes you want to do things the software may not be designed to do efficiently. For example, most software packages I have used do not have a simple toggle to sort categorical variables by various means (alphabetical, increasing or decreasing frequency, increasing or decreasing value of another variable, etc.).