


I have never really understood tag clouds. To me it seems like a pretty ineffectual way of presenting the data. Why do tag clouds sort alphabetically? If the concept is to highlight the distribution of various tags, why is alphabetic order important?

And with the use of font size as an indicator of frequency, the issue of linear size versus area size crops up - not to mention the mess it makes of text alignment.

What is wrong with a simple list of words, sorted by frequency, with a bar chart? The same information, easily parsed, and importantly, all the words are then easily read.
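For what it's worth, the frequency-sorted list with bars is only a few lines of code. A rough sketch in Python, assuming the speech is available as a plain string; the function name `frequency_bars` and its `top` and `width` parameters are made up for illustration, not taken from any tool mentioned here:

```python
from collections import Counter
import re


def frequency_bars(text, top=10, width=40):
    """Return one line per word: the word, its count, and a text bar.

    Lines are sorted by descending frequency, and the bar length is
    proportional to the count (the most frequent word gets `width` marks).
    """
    # Crude tokenizer (an assumption): lowercase words, optional apostrophe part.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(words).most_common(top)
    if not counts:
        return []
    longest = max(len(word) for word, _ in counts)
    peak = counts[0][1]
    lines = []
    for word, n in counts:
        bar = "#" * max(1, round(width * n / peak))
        lines.append(f"{word.ljust(longest)}  {n:>3}  {bar}")
    return lines


if __name__ == "__main__":
    print("\n".join(frequency_bars("tax tax tax jobs jobs war", width=6)))
```

Because the bars share a left edge and a common scale, comparing two words means comparing two lengths, and every word is set in the same readable size.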


josh, alphabetical order does do one thing: it puts words with a similar root next to each other, so TEACHER and _teachers_ are adjacent instead of at opposite ends of the list, as they might be if sorted by frequency.

I sort of agree that alphabetical order isn't great, but then, what opportunity for portraying a second dimension of info are they precluding? None that I can see.

Now, what would really make your suggestion fly, IMHO, would be if next to each word was a sparkline charting instances of that word through the minutes of the speech. That would add a true extra dimension, that of time, as the list showed the words appearing earlier or later in the speech.

Even better, if it was a true interactive debate, would be to present the lists for each candidate alongside each other (now arranged in order of total occurrences for all candidates, rather than for that candidate). Reading horizontally, the sparklines of one word for each candidate would show who first used the word early on, and who picked it up and ran with it.
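A minimal sketch of the per-minute sparkline idea, in Python. It assumes a transcript already split into `(minute, text)` pairs, which is a hypothetical format; the helper names `counts_per_minute` and `sparkline` are likewise invented for the example:

```python
import re

# Eight block characters, from lowest to highest bar.
BLOCKS = "▁▂▃▄▅▆▇█"


def counts_per_minute(segments, word, minutes):
    """Count whole-word, case-insensitive occurrences of `word` per minute.

    `segments` is an iterable of (minute_index, text) pairs (assumed format).
    """
    counts = [0] * minutes
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    for minute, text in segments:
        if 0 <= minute < minutes:
            counts[minute] += len(pattern.findall(text))
    return counts


def sparkline(counts):
    """Map counts to block characters, scaled to the series' own peak."""
    peak = max(counts, default=0)
    if peak == 0:
        return BLOCKS[0] * len(counts)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[(c * top) // peak] for c in counts)


if __name__ == "__main__":
    segments = [(0, "Taxes and jobs. Jobs!"), (2, "jobs jobs jobs jobs")]
    print("jobs", sparkline(counts_per_minute(segments, "jobs", 3)))
```

For the side-by-side version, each candidate's row for a word would probably need to be scaled to a shared peak rather than its own, so that the heights are comparable when reading across.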


Josh: excellent points. I actually discussed these in a much older post. There was even a bar chart included for comparison!

The alphabetical order is typically meaningless and to be avoided. This case, I believe, is an exception. Any kind of random order would spread things out more than the order by frequency. The alphabetical just happens to induce some "randomness" (although I'm sure we can find examples when it doesn't).

Derek: introducing the time dimension would be exciting. Especially if these debates were unmoderated; otherwise, we'd end up looking at the moderator's preference. Allowing free flowing debate would be akin to leaving the mike on; how embarrassing.


Something that could actually be done with the raw numbers gathered by this exercise (although hard to extract from these graphics - a point to josh), would be a scatter graph of word frequency between two debaters. Top right, of concern to both; bottom left, neglected by both; bottom right, B talked about more than A; top left, vice versa.
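To make the quadrants concrete, here is a small Python sketch that labels each word by where it would land on such a scatter. It assumes B's count is on the x-axis and A's on the y-axis (which is what "bottom right, B talked about more than A" implies), and it uses a single `cutoff` to separate high from low; that cutoff, the function name `quadrants`, and the sample counts in the demo are all made up for illustration:

```python
def quadrants(counts_a, counts_b, cutoff):
    """Return {word: (x, y, quadrant)} with x = B's count and y = A's count.

    A count at or above `cutoff` is treated as "high" (an arbitrary choice).
    """
    labels = {}
    for word in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        high_a, high_b = a >= cutoff, b >= cutoff
        if high_a and high_b:
            label = "top right"      # of concern to both
        elif high_b:
            label = "bottom right"   # B talked about it more than A
        elif high_a:
            label = "top left"       # A talked about it more than B
        else:
            label = "bottom left"    # neglected by both
        labels[word] = (b, a, label)
    return labels


if __name__ == "__main__":
    # Invented counts, purely to show the output shape.
    a = {"tax": 5, "war": 1, "jobs": 4}
    b = {"tax": 6, "war": 7, "health": 1}
    for word, (x, y, label) in quadrants(a, b, cutoff=3).items():
        print(f"{word:<8} x={x} y={y} {label}")
```

The `(x, y)` pairs can be handed to any plotting tool as they are; the labels are only there to check the reading of the four corners.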


I was sceptical when starting to read this article. However, the conclusions do seem consistent with my own intuitive reaction to the debate.


Not sure if you've already covered this, but you might be interested in Chirag Mehta's tag clouds of presidential speeches going back to Washington. I think that was the first such application of tag clouds.



Ars: thanks for the link. Good site. I'm not sure I understand the "recency" dimension since the date of the speech is fixed.


That reminds me of the State of the Union parsing tool on Jonathan Corum's style.org.



Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.