Cutting through the noise
Apr 28, 2007
A terrific application of tag clouds can be seen over at pollster.com, following the first debate of Democratic Presidential hopefuls the other night. Here is Senator Biden's "tag cloud", depicting the top 50 words that came out of his mouth that night. The size of each word is proportional to how often he uttered it.
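For readers who want to try this on a transcript of their own, here is a minimal sketch of the counting and sizing step. It assumes the font size scales linearly with the word count, which is my reading of "proportional" and not a description of Pollster's actual code; the function names, the 48-point maximum and the sample sentence are all invented for illustration.

```python
import re
from collections import Counter

def top_words(text, n=50):
    """Return the n most frequent words as (word, count) pairs."""
    # Lowercase, then keep runs of letters, apostrophes and hyphens
    # so that a word like "so-called" survives as one token.
    words = re.findall(r"[a-z'-]+", text.lower())
    return Counter(words).most_common(n)

def font_sizes(counts, max_pt=48.0):
    """Scale font size linearly with count; the top word gets max_pt.

    max_pt is an arbitrary choice for this sketch.
    """
    biggest = max(c for _, c in counts)
    return {w: max_pt * c / biggest for w, c in counts}

# Invented sample text, not a quote from the debate.
sample = "better schools better teachers better pay for teachers"
counts = top_words(sample)
sizes = font_sizes(counts)
```

With this sample, "better" gets the full 48 points and every other word is scaled down in proportion to its count.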
Having not seen the debate, I can use this summary device to get a quick read on what his main points were. It's clear that he talked about the war ("Iraq", "troops"), education ("teachers", "students"), abortion ("roe", "wade" but, interestingly, not the word "abortion"). Of course, if he had a distinct message, that would have been even better. For what the tag cloud exposed (assuming it was done right) was that he was pretty much all over the place, touching on many different things about equally often.
It is disconcerting that a word like "so-called" made it into the top 50. Better is that "better" is his #1 word.
It is typical to process text-based data by removing all the most common words that do not carry real meaning (um, ur, the, so-called, etc.) but in this case, keeping them is helpful so the candidates can catch problems like the excessive use of "so-called".
However, the tag cloud would have been improved if "stemming" were used to collapse "talk" and "talking", "teacher" and "teachers", etc.
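The two preprocessing steps just mentioned can be sketched as follows. This is a toy illustration only: the stopword list is a made-up handful of words, and `crude_stem` is a hypothetical suffix stripper that handles these examples, not a real stemming algorithm such as Porter's.

```python
from collections import Counter

# Illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "a", "and", "of", "to", "um"}

def crude_stem(word):
    """Strip a couple of common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stemmed_counts(words, drop_stopwords=True):
    """Count words after optional stopword removal and crude stemming."""
    kept = (w for w in words if not (drop_stopwords and w in STOPWORDS))
    return Counter(crude_stem(w) for w in kept)

# Invented sample text.
words = "the teacher and the teachers talk of talking to students".split()
counts = stemmed_counts(words)
with_fillers = stemmed_counts(words, drop_stopwords=False)
```

Here "teacher" and "teachers" collapse into one entry, as do "talk" and "talking". Passing `drop_stopwords=False` keeps the filler words in the count, which is the choice argued for above.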
Pollster did tag clouds for every candidate. Comparing them provides even more insights! Here's one for Senator Clinton.
Her message is much more focused: quite a lot of time spent proclaiming her "readiness" for "President", quite a bit on "healthcare" and quite a bit on the "war".
As Pollster correctly pointed out, it is unclear whether the sizes of words can be compared across tag clouds. If they can, the setup would be even more powerful.
The entire set of tag clouds can be seen here. Long-time readers of this blog will remember that we advocated such use back in Jan 2006, when discussing the "concordance" feature at Amazon. This successful application validates our enthusiasm.
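One way the clouds could be made comparable, sketched here as an assumption and not as anything Pollster did, is to convert each speaker's counts into shares of that speaker's total and then size every word against the largest share found in any cloud. The speaker labels and counts below are invented.

```python
def shares(counts):
    """Convert raw counts to each word's share of the speaker's total."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def shared_scale_sizes(clouds, max_pt=48.0):
    """Size every word against the largest share found in any cloud."""
    all_shares = {name: shares(c) for name, c in clouds.items()}
    top = max(s for sh in all_shares.values() for s in sh.values())
    return {
        name: {w: max_pt * s / top for w, s in sh.items()}
        for name, sh in all_shares.items()
    }

# Made-up counts for two hypothetical speakers, not real debate data.
clouds = {"A": {"war": 6, "health": 2}, "B": {"war": 2, "health": 2}}
sizes = shared_scale_sizes(clouds)
```

Under this scheme a word that takes up the same fraction of two speakers' remarks is drawn at the same size in both clouds, regardless of how long each one spoke.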
I have never really understood tag clouds. To me it seems like a pretty ineffectual way of presenting the data. Why do tag clouds sort alphabetically? If the concept is to highlight the distribution of various tags, why is alphabetic order important?
And with the use of font size as an indicator of frequency, the issue of linear size versus area size crops up - not to mention the mess it makes of text alignment.
What is wrong with a simple list of words, sorted by frequency, with a bar chart? The same information, easily parsed, and importantly, all the words are then easily read.
Posted by: josh | Apr 29, 2007 at 11:56 PM
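Both of josh's points lend themselves to a short sketch. The first function renders the frequency-sorted list with a plain-text bar next to each word; the second shows one possible answer to the linear-versus-area question, scaling font size with the square root of the count so that the area of a word grows roughly in line with its frequency. The counts, the bar width and the 48-point maximum are invented for illustration.

```python
import math

def bar_chart(counts, width=30):
    """Return lines of 'word ### n', most frequent first."""
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    biggest = ranked[0][1]
    pad = max(len(w) for w, _ in ranked)
    return [
        f"{w.ljust(pad)} {'#' * round(width * c / biggest)} {c}"
        for w, c in ranked
    ]

def area_scaled_size(count, biggest, max_pt=48.0):
    """Font size proportional to sqrt(count), so area tracks the count."""
    return max_pt * math.sqrt(count / biggest)

# Made-up counts, not taken from the debate transcript.
counts = {"iraq": 6, "better": 9, "troops": 3}
lines = bar_chart(counts)
print("\n".join(lines))
```

With square-root scaling, a word used a quarter as often as the top word is drawn at half the height, not a quarter.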
josh, alphabetical does do one thing: it puts words with a similar root next to each other, so "teacher" and "teachers" are adjacent instead of at opposite ends of the list, as they could be if sorted by frequency.
I sort of agree that alphabetical order isn't great, but then, what opportunity for portraying a second dimension of info are they precluding? None that I can see.
Now, what would really make your suggestion fly, IMHO, would be if next to each word was a sparkline charting instances of that word through the minutes of the speech. That would add a true extra dimension, that of time, as the list showed the words appearing earlier or later in the speech.
Even better, if it was a true interactive debate, would be to present the lists for each candidate alongside each other (now arranged in order of total occurrences for all candidates, rather than for that candidate). Reading horizontally, the sparklines of one word for each candidate would show who first used the word early on, and who picked it up and ran with it.
Posted by: derek | Apr 30, 2007 at 08:13 AM
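The sparkline idea can be sketched roughly as below. Since a transcript usually carries no timestamps, this sketch assumes word position is an acceptable stand-in for minutes: the speech is cut into equal slices and the word is counted in each. The function names, the bin count and the sample words are hypothetical.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest

def binned_counts(words, target, bins=8):
    """Count occurrences of target in each of `bins` equal slices of the speech."""
    counts = [0] * bins
    for i, w in enumerate(words):
        if w == target:
            # Map the word's position to a slice index in 0..bins-1.
            counts[i * bins // len(words)] += 1
    return counts

def sparkline(counts):
    """Render counts as a row of block characters scaled to the largest bin."""
    top = max(counts) or 1  # avoid dividing by zero when the word never occurs
    return "".join(BLOCKS[(len(BLOCKS) - 1) * c // top] for c in counts)

# Invented sample, not from the debate.
words = "war war jobs war jobs jobs health jobs".split()
war_line = sparkline(binned_counts(words, "war", bins=4))
jobs_line = sparkline(binned_counts(words, "jobs", bins=4))
```

In this toy example "war" is front-loaded while "jobs" peaks later, which is the kind of pattern the sparklines would make visible when read side by side.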
Josh: excellent points. I actually discussed these in a much older post. There was even a bar chart included for comparison!
The alphabetical order is typically meaningless and to be avoided. This case, I believe, is an exception. Any kind of random order would spread things out more than the order by frequency. The alphabetical order just happens to induce some "randomness" (although I'm sure we can find examples when it doesn't).
Derek: introducing the time dimension would be exciting. Especially if these debates were unmoderated; otherwise, we'd end up looking at the moderator's preference. Allowing free flowing debate would be akin to leaving the mike on; how embarrassing.
Posted by: Kaiser | Apr 30, 2007 at 11:40 PM
Something that could actually be done with the raw numbers gathered by this exercise (although hard to extract from these graphics - a point to josh), would be a scatter graph of word frequency between two debaters. Top right, of concern to both; bottom left, neglected by both; bottom right, B talked about more than A; top left, vice versa.
Posted by: derek | May 01, 2007 at 07:04 AM
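The data behind such a scatter graph could be prepared along these lines. Following the quadrant description above, the sketch puts B's count on the horizontal axis and A's on the vertical. The counts are made up, and the `cut` threshold that separates "talked about" from "neglected" is a hypothetical parameter that would need choosing with real data.

```python
def scatter_points(counts_a, counts_b):
    """Return {word: (x, y)} with x = B's count and y = A's count."""
    words = set(counts_a) | set(counts_b)
    # A word one speaker never used gets a count of zero for that speaker.
    return {w: (counts_b.get(w, 0), counts_a.get(w, 0)) for w in words}

def quadrant(x, y, cut):
    """Name the quadrant of a point, splitting both axes at `cut`."""
    horiz = "right" if x >= cut else "left"
    vert = "top" if y >= cut else "bottom"
    return f"{vert} {horiz}"

# Made-up counts for two hypothetical debaters.
a = {"war": 8, "health": 1, "taxes": 6}
b = {"war": 7, "health": 5}
points = scatter_points(a, b)
labels = {w: quadrant(x, y, cut=4) for w, (x, y) in points.items()}
```

Here "war" lands top right (of concern to both), "health" bottom right (B more than A) and "taxes" top left (A more than B).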
I was sceptical when starting to read this article. However, the conclusions do seem consistent with my own intuitive reaction to the debate.
Posted by: Closets | May 01, 2007 at 12:36 PM
Not sure if you've already covered this, but you might be interested in Chirag Mehta's tag clouds of presidential speeches going back to Washington. I think this was the first such application of tag clouds.
http://chir.ag/phernalia/preztags/
Posted by: ars | May 01, 2007 at 04:13 PM
Ars: thanks for the link. Good site. I'm not sure I understand the "recency" dimension since the date of the speech is fixed.
Posted by: Kaiser | May 02, 2007 at 01:18 AM
That reminds me of the State of the Union parsing tool on Jonathan Corum's style.org.
Posted by: derek | May 02, 2007 at 06:03 AM