
Stacked pancakes leave us empty

Reader Tyson A. serves dessert for dinner, and stacked pancakes are on the menu!

According to the St. Louis Beacon, which published these charts (and more):

These pie charts take the individual states' percentages, split them up and then stack them. In this way, you can see how the proportion of taxes in each category collected by each state compares with the states around it.

This presentation fails our self-sufficiency test: one would be completely lost if the entire data set were not printed on the chart itself.

The pie pieces apparently lost their shape as they were stacked on top of each other. The top green slice, labeled Tennessee, represents 2.1%; but look at the difference between the green Nebraska (40%) and the green Kansas (40.8%), for example.

Also, the red pieces and the green pieces are ordered on their own so that the Tennessee red is near the bottom of the stack while the Tennessee green is at the top.

This data can be shown clearly in a pair of line charts.


To really learn something about the data, we can create a scatter plot.

From this plot, we see that most of these states (clustered in the middle) have similar taxation policies.

The exceptions are Illinois and Tennessee, and to a lesser extent, Missouri.


A smarter word cloud: likes and not likes

Martha left a comment on my previous post asking for my comments on this National Geographic word cloud map of surnames in the U.S. (Click on the link to look at the interactive map.)


 Here is a close-up of California:



Any attempt to expand the possibilities of a chart type, like the word cloud, is a commendable project. So I'm quite enthusiastic about what they tried to do here. Not every new feature is successful, though.

These are the things I like:

  • Using colors that mean something: they use different colors to indicate the different countries of origin of particular surnames. Good idea. I would prefer one color per continent, with different shades within each continent.
  • For once, the data being depicted is not a speech or a piece of text; it's a set of surnames.
  • This chart (or map) is multivariate: it tries to address deeper questions, such as the correlation between geography and origin of popular names, and the correlation between geography and popularity of names. This is an important advance over all those word clouds out there that tell us nothing but the frequency of words in a document. In general, statistical clustering methods can be combined with text mining methods to develop multivariate word clouds.
  • The designers realize it's futile -- as well as ill-advised -- to try to print every name on the map, so they only include the top 25 names in each state. As I explain below, I'm not happy with this inclusion/exclusion criterion, but the key point is that by taking out the minor bits of data ("noise"), the chart is better able to draw our attention to the more interesting parts.

These are things I don't like:

  • They really ought to have used relative popularity rather than absolute popularity. This is another area of improvement for all word clouds. Today, word clouds plot the number of times a specific word appears in a piece of text. We often try to compare several word clouds against each other; and when we do that, the only sensible measure is the proportion (relative frequency) of times a specific word appears. Say one compares Obama and McCain speeches by comparing two word clouds. If these two speeches differ significantly in length, then comparing the number of times each candidate uses "education" words is silly -- we have to compare the number of times per unit length of speech.
  • The cutoff of the top 25 names in each state suffers from a similar problem. The 26th most popular name in California, a populous state, is of more interest than, say, the 15th most popular name in Montana (or insert your favorite small state). A more sensible cutoff would be to include names that account for at least 2 percent (say) of a state's population. By doing this, the more populated states would have more entries than the less populated states.
  • Given the above bullets, it is not surprising that the word-size scale has serious problems. Because it is an absolute number and not relative to each state's population, the big words can only show up in populous states. In other words, the size of the words tells us about the geographical distribution of the U.S. population. As I mentioned before (such as here), this insight is available on pretty much every map used to plot data that has ever been produced. The one thing that all these maps never fail to tell us is the fact that most of the U.S. population is bi-coastal. Unfortunately, the real message of the map -- in this case, the geography of surnames -- is subsumed.
  • And then, the map invents false data. Notice that there are 1,250 geographic sites on the map (25 names times 50 states). This is a visually prominent feature of the map, and yet there is no rhyme or reason as to where the names are placed, with the exception of respecting state boundaries. The casual reader may think that the appearance of the Chinese name "Lee" in the inner, central part of California implies that Lee-named Chinese-Americans aggregate in those parts of California. Far from the truth!
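Two of the fixes suggested above -- dividing counts by state population, and cutting off at a share of the population rather than at a fixed rank -- can be sketched in a few lines. All names, counts, and populations below are hypothetical:

```python
# Sketch of the two fixes, with hypothetical names, counts, and populations.
state_pop = {"CA": 37_000_000, "MT": 1_000_000}

surname_counts = {
    "CA": {"Garcia": 1_000_000, "Lee": 800_000, "Smith": 760_000},
    "MT": {"Anderson": 25_000, "Johnson": 15_000},
}

def relative_popularity(state):
    """Share of the state's population carrying each surname."""
    pop = state_pop[state]
    return {name: n / pop for name, n in surname_counts[state].items()}

def names_above_threshold(state, threshold=0.02):
    """Keep surnames covering at least `threshold` of the state's population,
    instead of a fixed top-25 list; populous states naturally get more entries."""
    rel = relative_popularity(state)
    return sorted((name for name, share in rel.items() if share >= threshold),
                  key=lambda name: -rel[name])
```

With the made-up numbers here, California clears three names over the 2-percent line while Montana clears only one, which is the intended behavior of a share-based cutoff.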

So, I think they did a reasonable job in rethinking the possibilities of word clouds. It's well intentioned and there is room for improvement.


Lastly, they might get some ideas from the Baby Names navigator.

Quick fix for word clouds: area not length

Here is one of my several suggestions for word-cloud design: encode the data in the area enclosed by each word, not the length of each word.

Every word cloud out there contains a distortion because readers are sizing up the areas, not the lengths of words. The extent of the distortion is illustrated in this example:


The word "promise" is about 3.5 times the size of "McCain" but the ratio of frequency of occurrence is only 1.6 times.
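A minimal sketch of the fix, using hypothetical frequencies: choose each word's font size so that the area it covers -- roughly (font size)^2 times its character count -- is proportional to its frequency, rather than making the size itself proportional:

```python
import math

# Hypothetical frequencies in a 1.6 ratio, echoing the example above.
freqs = {"promise": 160, "mccain": 100}

def font_size(word, freq, scale=40):
    # area ~ (font size)^2 * len(word) = scale * freq  =>  solve for size
    return math.sqrt(scale * freq / len(word))

# Under this encoding, the two words cover areas in the ratio of their
# frequencies (1.6), instead of the much larger area ratio produced by a
# length-proportional encoding.
areas = {w: font_size(w, f) ** 2 * len(w) for w, f in freqs.items()}
```

The `scale` constant is an arbitrary knob for overall cloud size; only the ratios matter.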

This is a quick fix that Wordle and other word cloud software can implement right away. There are other more advanced issues I bring up in my presentation (see here).


Update: Talk in NYC

What snow? The talk is happening. For those who can't come, they have live streaming: http://livestream.com/NYViz.

I'm the second speaker, probably starting around 8 pm EST. The presentation will eventually make its way here.


Here is the presentation (5.9M PDF).

The topic is word clouds (tag clouds, Wordle). I think this new chart type (circa 2000s) has a lot of potential. Software like Wordle has made it extremely popular; Wordle is amazingly easy to use. It is most often used to summarize text documents, such as speeches. This usage has turned it mainstream but should not restrict our imagination of other use cases.

However, the developments since 2005 when I first wrote about word clouds have been disappointing. Wordle has made it easy to turn out chartjunk (crazy colors, "dis" orientation, etc.). In this presentation, I lay out a set of improvements that can help realize the potential of this chart type. They are all based on statistical principles, which is the underlying theme of the talk: that when we design charts, even the apparently artistic decisions can be made by appealing to logical or scientific or statistical concepts.

There were a lot of questions, and I couldn't get to all of them. Feel free to continue the conversation here in the comments.

A serious magazine seeks our attention, badly

Reader Brian R. could not believe the Atlantic magazine would print a pile of chartjunk like this, and neither can we.

Pretty much every chart deserves its own entry, and they all fail our self-sufficiency test: when the actual data is removed from each chart, the failure is exposed, as one realizes that the graphical constructs do not add to the readers’ experience, and frequently subtract from it.

We'll focus on three examples where they tried to innovate, badly. The data has been stripped from each chart.

The chart shown on the right compares the amount of time spent reading by 15-to-19-year-olds in 2007 and in 2009. We definitely see the severe drop in time spent, but how many times higher were the average minutes in 2007?

(Amusingly, these books have 13 lines per page, not 12 lines, not 10 lines, not 15 lines.)

The next chart is similar, but comparing the minutes spent playing games. It’s a pie chart! Did our kids spend 100% of their weekend days in 2009 playing games?  

No, it can't be a pie chart. The caption said "average minutes", not a proportion of a total; it's a clock. Is it a 60-minute clock? But it's a weekend day, so maybe it's a 24-hour clock. That can't be, since the kids wouldn't be spending every hour of each weekend day playing games, would they? They do need to sleep, don't they?

So we cheat and look at the data. Average minutes in 2007: 46.8 minutes; in 2009, 61.2 minutes. Oh, it's a malfunctioning clock. In the 2007 version, it's about a minute too fast, and in the 2009 version, it's a minute too slow. But who can blame the 2009 clock? You can't show 61.2 minutes on a 60-minute clock.

With just two pieces of data, it's often the case that graphics are superfluous. Even if "entertainment" is desired, one ought to keep that from obscuring the data. Perhaps like this:



OK, just one more. Not surprisingly, US book sales are shown using stacks of books, except that the data is not encoded in the height of the stacks, the thickness of the books, the number of books, or the other usual suspects. The data is embedded in the width of a page plus the thickness of a book, assuming every book is identical in design.

Evidently, book sales declined from 2007 to 2009. By how much? It would be impossible to know without reading the actual data (which I have stripped away).

Since the data is given, we can use a little bit of algebra to figure out how many units are represented by the long side (L) and the short side (S) respectively:

The Atlantic has invented some new math. The long line represents 1.1 billion units while the short line represents 3.68 billion units!

What this means is that the difference shown in the picture of one long line is vastly exaggerated; the same difference in units would have been equivalent to one-third of the short line.
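The "little bit of algebra" amounts to a pair of linear equations: each year's stack width is some count of long segments plus some count of short segments, and each year's printed sales total gives one equation, so two years pin down L and S. The segment counts and totals below are made-up stand-ins for what one would read off the chart:

```python
# Hypothetical sketch: a1/b1 are the counts of long and short segments in the
# 2007 stack, a2/b2 the counts in the 2009 stack, and total1/total2 the two
# printed sales figures. All numbers passed in are illustrative.

def solve_for_L_S(a1, b1, total1, a2, b2, total2):
    """Solve a1*L + b1*S = total1 and a2*L + b2*S = total2 (Cramer's rule)."""
    det = a1 * b2 - a2 * b1
    L = (total1 * b2 - total2 * b1) / det
    S = (a1 * total2 - a2 * total1) / det
    return L, S
```

For instance, widths of 2L + 3S = 13 and 1L + 4S = 14 (made-up units) solve to L = 2 and S = 3.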


Other problems noticed by Brian:

  • The use of what looks like a Gaussian distribution instead of a bar.
  • The piggy-bank graphic that distorts the savings rate.
  • The redundancy of pie charts next to simple percentages.
  • The presentation of statistics with no apparent relationship to the theme being presented. For example, what does the increase in 3-D movies being produced have to do with the recession? My guess is that the increase is due more to advances in technology and its implementation than to recession economics.


Organizing the bookshelves

When you go to the library, you expect to find the books in an organized fashion, typically sorted first by subject matter, then by author, then by title, and so on. Imagine the frustration when you walk in and discover that books are spread out everywhere with no discernible order. We are very particular about tidiness: it would still be terrible if the books were arranged by author and title without first splitting by subject matter. We are annoyed because it would take too long to find a book.

I did run into such an exasperating bookstore -- I believe it was in Brooklyn. The (used) books in this store are arranged by the date on which the owner acquired them. Fiction, I recall, is ordered alphabetically by authors' last names, and then, say within the 'A' authors, the books are sorted by date of acquisition. What a headache!

Reader Pat L. had a big headache trying to figure out this chart, found on Wikipedia: (I'm just excerpting a small part of it; the full chart is here).


To quote Pat:

I was overwhelmed by the information -- so many chemicals and so many units of measure.  I quickly gave up and opened the image in a picture editor.  One by one, I erased the blood chemicals I wasn't interested in.  Maybe if I were a doctor, the chart might have been useful.


One way to simplify this is using small multiples. Recognize that few if any users would need to directly compare every one of these chemicals. I'm guessing that groups of chemicals can go on separate charts. This is no different from a bookseller organizing shelves to help readers find books.

Also get rid of the minor gridlines.

For a summary chart of this kind, I doubt that it adds anything to include the information on whether the end of a range is definite and consistent, definite but inconsistent, or unknown.

Who are you talking to?

Reader Daniel L. points us to this "dashboard" of statistics concerning downloads of a piece of software presumably called "maven". This sort of presentation has unfortunately become standard fare.


Daniel was shocked by the pie chart. Just for laughs, here is the pie chart:

Something else is worth noting -- ever wondered who the chart designer is talking to?

Is it an accountant who cares about every single download (thus needing the raw data)?

Is it a product manager who cares about the current run rates, and the mix of components downloaded?

Is it an analyst who is examining trends over time, aggregate of all components?


In other words, the first order of business is to identify the user. 

A graphlick showing mortgage prices

The work of Hans Rosling and Gapminder (now part of Google) highlighted moving images as part of the graphics toolbox. Let me call these "graphlicks", graph-movies. It is clear that lots of people love graphlicks.

There is one open problem in graphlicks that needs creative solutions: how to incorporate memory into the experience?

If a movie is used to show patterns in the data, it is presumably to highlight a temporal pattern: the changes over time are what's interesting. As the movie goes from Day/Month/Year 1 to Day/Month/Year X, the old stuff is usually taken off the canvas to make way for the newer stuff. In effect, we rely on the reader's memory of past scenes as the basis for comparison with the current scene in the movie.

What gets me thinking about this is a graphlick created by my friend Adam, whose startup Empirasign compiles and markets data on mortgage prices and other financial data:



Youtube link here.

The data relate to 30-year mortgages originated in 2010. The coupon rate, shown on the horizontal axis, ranges from below 4% to 8%; the coupons are the cash flows an investor receives. Each line chart shows how the "market" was valuing the 30-year mortgages of different coupon rates on a particular day. The price is an index, equal to 100 at issue.

The general shape of the line indicates that the market valued the higher-coupon mortgages more than the lower-coupon ones (except at the right tip of the line). Since interest rates had been coming down, the mortgages issued at a 4% coupon were newer than those issued at 7-8%, which means they carried higher "duration risk" for investors, and thus lower value. The dip beyond 7% may be due to a countervailing "prepayment risk": if the borrower prepays, the investor is forced to take 100 for something they may have paid over 100 for.

As you play the graphlick, two features of the data ought to stand out: the general shift upwards of the line, which indicates that the market was raising its valuation of these mortgages over the year (regardless of coupon); and the stronger volatility on the left side of the line.

Noticing either feature requires the reader to remember the trajectory of the lines. What are some ways to help the reader?

  • Fade out but don't remove old lines?
  • Include a cumulative average line?
  • Include an "envelope" that captures the maximum and minimum prices over time?
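The second and third suggestions can be sketched as a running summary computed over the frames shown so far; the daily price curves below are made up for illustration:

```python
# curves: one list of prices (across coupon rates) per frame of the movie.
# All price values here are hypothetical.

def running_summary(curves):
    """For each frame, return the cumulative average curve plus the min/max
    envelope over all frames seen so far."""
    frames = []
    for t in range(1, len(curves) + 1):
        seen = curves[:t]
        n_points = len(seen[0])
        avg = [sum(day[i] for day in seen) / t for i in range(n_points)]
        lo = [min(day[i] for day in seen) for i in range(n_points)]
        hi = [max(day[i] for day in seen) for i in range(n_points)]
        frames.append((avg, lo, hi))
    return frames
```

Drawn behind the current day's line -- the average as a faint line, the envelope as a shaded band -- these give the reader the memory that a bare frame-by-frame animation withholds.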