Cutting through the noise

A terrific application of tag clouds can be seen over at pollster.com, following the first debate of Democratic Presidential hopefuls the other night.  Here is Senator Biden's "tag cloud", depicting the top 50 words that came out of his mouth that night.  The size of each word is proportional to how often he uttered it.

Having not seen the debate, I can use this summary device to get a quick read on what his main points were.  It's clear that he talked about the war ("Iraq", "troops"), education ("teachers", "students"), and abortion ("roe", "wade", but interestingly not the word "abortion").  Of course, if he had had a distinct message, that would have been even better.  What the tag cloud exposed (assuming it was done right) was that he was pretty much all over the place, touching on many different things about equally often.

It is disconcerting that a word like "so-called" made it into the top 50.  Better is that "better" is his #1 word.

It is typical to process text-based data by removing all the most common words that do not carry real meaning (um, ur, the, so-called, etc.) but in this case, keeping them is helpful so the candidates can catch problems like the excessive use of "so-called".

However, the tag cloud would have been improved if "stemming" were used to collapse "talk" and "talking", "teacher" and "teachers", etc.
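
To make the stop-word and stemming points concrete, here is a minimal sketch of how the word counts behind a tag cloud might be computed. Everything in it is an assumption for illustration: the tiny stop-word list, the crude suffix-stripping rule (a stand-in for a real stemmer such as Porter's), and the function names are mine, not Pollster's method.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one is far longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "um", "er"}


def naive_stem(word):
    """Crude suffix stripping for illustration only (not a real stemmer)."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 4:
        return word[:-1]
    return word


def tag_counts(text, top_n=50, remove_stop_words=True):
    """Return the top_n (stem, count) pairs, most frequent first."""
    # Keep hyphenated words such as "so-called" as a single token.
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return Counter(naive_stem(w) for w in words).most_common(top_n)


def font_sizes(counts, max_size=48.0):
    """Scale each word's size in proportion to its count."""
    if not counts:
        return {}
    top = counts[0][1]
    return {word: max_size * count / top for word, count in counts}


sample = "The teacher and the teachers keep talking; talk of teachers"
counts = tag_counts(sample)
print(counts)              # [('teacher', 3), ('talk', 2), ('keep', 1)]
print(font_sizes(counts))  # {'teacher': 48.0, 'talk': 32.0, 'keep': 16.0}
```

Passing `remove_stop_words=False` keeps the filler words in the count, which is the choice that let "so-called" surface in the first place.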

Pollster did tag clouds for every candidate.  Comparing them provides even more insights!  Here's one for Senator Clinton.  Her message is much more focused: quite a lot of time spent proclaiming her "readiness" for "President", quite a bit on "healthcare" and quite a bit on the "war".

As Pollster correctly pointed out, it is unclear whether the sizes of words can be compared across tag clouds.  If they can, the setup would be even more powerful.

The entire set of tag clouds can be seen here.   Long-time readers of this blog will remember that we advocated such use back in Jan 2006, when discussing the "concordance" feature at Amazon.  This successful application validates our enthusiasm.

Shower of bullets

Here's one of those infographics that make the reader work hard (via Dustin J).  The graphic in its full glory is here; it's much too large to be reproduced, and I have clipped off the bottom half.

Much to the designer's credit, he extracted data of interest, rather than trying to cram everything onto the page.  In particular, he was most interested in the distribution of deaths among different age groups, the types of deaths (suicides, homicides) and the identities of the deceased (race, gender).

Just like the election fraud graphic, such rich data lend themselves to multiple levels of aggregation.  Here, the designer focuses on the most detailed level, making it easiest to see facts like "among the 18-25 age group, there were 6 black men murdered per day".

However, it takes much more attention to notice higher-level facts like "homicides per day are relatively flat across age groups while suicides heavily skew toward 40+".

In the junkart version, I decided to emphasize the more aggregated data, showing the number of deaths of each type across age groups.  The detailed breakdown by race and gender is shoved into parentheses, as it can be skipped by less serious readers.

The reader who discovers the homicide/suicide pattern described above may surmise that homicide gunfire deaths are more "random" while suicides, being premeditated, may affect older people disproportionately.  More research would be needed to confirm this and other suspicions.

Source: "An Accounting of Daily Gun Deaths", New York Times, April 21 2007.


Embedding logic

Bernard L. (from France) submitted this bubble chart for consideration.  It accompanied an NYT article claiming the absence of evidence of election fraud.  (Of course, as is well-known, absence of evidence is not the same as evidence of absence.  Here, I'm purely interested in data presentation.)

As a seasoned consultant, Bernard asked if a Marimekko chart would be superior.

This is one ambitious chart.  Ignoring the bubbles (which are more nuisance than anything), we are asked to interpret data at three different levels of aggregation in one go.

First, there were 95 cases classified into five indictment types.  Second, these cases resulted in either convictions or acquittals/dismissals.  Third, among the cases ending in convictions (the highlighted area), we are shown the occupations of those convicted.

By flattening three levels into one table, some key information is obscured.  For example, how many cases resulted in conviction?  The reader has to compute either 95-25 or 26+31+10+3.  What percent of civil rights violation convictions were committed by party/campaign workers?  It's not 2/3 = 67% (bottom row) but rather 2/2 = 100%.
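
The arithmetic in the previous paragraph can be checked with a short sketch. It uses only the figures quoted above; the variable names are mine, and I am assuming that the 3 in "2/3" counts all civil rights cases in the bottom row while the 2 in "2/2" counts the convictions among them.

```python
# Level 1 to level 2: how many of the 95 cases ended in conviction?
total_cases = 95
acquitted_or_dismissed = 25
convictions_by_subtraction = total_cases - acquitted_or_dismissed
convictions_by_addition = sum([26, 31, 10, 3])  # the four addends quoted in the text
print(convictions_by_subtraction, convictions_by_addition)  # 70 70

# Level 3: share of civil rights convictions involving party/campaign workers.
civil_rights_cases = 3        # assumption: every case in the row, convicted or not
civil_rights_convictions = 2  # only the cases that ended in conviction
by_party_workers = 2

wrong_share = by_party_workers / civil_rights_cases        # denominator from the wrong level
right_share = by_party_workers / civil_rights_convictions  # denominator from the matching level
print(f"{wrong_share:.0%} versus {right_share:.0%}")  # 67% versus 100%
```

The point of the sketch is that each percentage needs a denominator from the same level of aggregation as its numerator, which the flattened table makes easy to get wrong.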

The following junkart brings out the logic that is embedded in the complicated bubble-table.  While there is a lot on the page, the text labels plus the flow directions allow readers to absorb the data one level at a time.


I have not attempted the Marimekko as I am not a fan of such charts.  You're welcome to try.

Source: "In 5-Year Effort, Scant Evidence of Voter Fraud", New York Times, April 2007.

PS. I will be working through the backlog of reader submissions.  Thanks for your patience.  Keep them coming!


Remark (Apr 25 2007): Thanks to readers for keeping me honest (see comments below).  The conviction rates shown previously were indeed the inverse.  I have now fixed them.

Peripherals 2

In terms of interactive charting, Google Finance did much more than hide the legend.  In their main stock price chart, they used a number of neat features.


This chart effectively conveys a huge amount of information in a small space.  The bottom strip, which shows relative prices for the past two years, provides context for interpreting the five-day movement shown in the main chart area.  I would prefer to see a scale on the bottom strip as well.

The sliding scrollbar can be dragged to show historical data.  In addition, the width of the window shown in the main area can be controlled.  For instance:


Without any effort, we are now looking at a 3-month chart for Q2 2006.  Notice that the summary statistic in the top right corner has also changed.  The axis scale changed too, and it never did start from zero to begin with.  (This shortcoming is alleviated by the profile chart in the bottom strip.)

Further, by placing the cursor in the chart area, we can highlight a particular day: a dot appears on the price curve, the volume on that day is highlighted, and the text in the top right switches.  That text is what we typically place inside the chart area as a "data label".  The effect of moving it to the corner is similar to hiding the legend: it makes the graph more legible and provides space for longer descriptions.  As we move the cursor from left to right, the graph dynamically adapts.  Marvellous!


The amount of data processing needed to implement these sorts of features may not be obvious.  I don't have space to address the data issue but maybe some of our readers can comment on it.

Peripherals 1

Like any technology, charts also come with peripherals: I'm talking about legends, data labels, grid-lines and so on.  These things typically give us the most trouble, especially with complex data sets.  The analogy is apt: one may feel inextricably knotted up like bunches of cords and wires.

Interactive graphics is a particularly elegant solution to this problem, and Google Finance has done a fantastic job leading the way.  One trick is to show the legend only when the user asks for it.

Using bar charts (on the left), Google neatly summarizes the performance of stocks within each industry sector.  The bar chart gives a sense of the dispersion, which adds to the average returns printed next to the bars.  For example, most sectors gained on average, but about 30% of the individual stocks in most sectors actually declined on that day.  So the fact that technology stocks gained 0.48% on average doesn't necessarily mean that the two tech stocks you own gained 0.48%, or gained at all.
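
A small sketch shows how an average gain and a sizeable share of decliners can coexist. The ten daily returns below are made up for illustration (chosen so that they average 0.48%); they are not Google's data, and the function and key names are hypothetical.

```python
from statistics import mean


def summarize_sector(returns_pct):
    """Summarize one sector's daily stock returns (in percent)."""
    n = len(returns_pct)
    return {
        "average_return": mean(returns_pct),
        "share_gainers": sum(r > 0 for r in returns_pct) / n,
        "share_decliners": sum(r < 0 for r in returns_pct) / n,
        # The extremes that the darkened ends of each bar emphasize.
        "share_up_over_2": sum(r > 2 for r in returns_pct) / n,
        "share_down_over_2": sum(r < -2 for r in returns_pct) / n,
    }


# Made-up returns for ten stocks in one sector, in percent.
hypothetical_tech_returns = [3.1, 2.4, 1.2, 0.9, 0.6, 0.4, 0.2, -0.5, -1.3, -2.2]

summary = summarize_sector(hypothetical_tech_returns)
print(f"average return: {summary['average_return']:.2f}%")
print(f"decliners: {summary['share_decliners']:.0%}")
```

In this invented sector the average is positive even though three of the ten stocks fell, which is the dispersion the bar is meant to reveal.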

Typically, we would put a legend on the side or at the bottom of the chart, which, all told, is an ugly duckling next to a well-executed chart.  Here, the legend is hidden behind the "What's this?" link.  The side benefit is that the legend can be as verbose as needed since it doesn't interfere with the chart.

There are a few minor things to consider:

  • "What's this?" is not very informative: Why not call it a "legend" or "key"?
  • The graph designer seems to think that the most important information sought by readers is the extremes, i.e. the percentage of stocks that gained/lost more than 2%.  Darkening the ends of the bar draws attention away from the middle, which is the boundary between the gainers and the losers.  I'd like to see that boundary delineated.
  • Similar to the above point, I'd sketch out a version which aligns the gainer/loser boundary to the middle so it's easy to see the balance between gainers and losers.  This version, however, would require more space.
  • I'd provide sorting by average return, and by percentage of gainers

Tricks of the trade 1

From time to time, I get queries about what software I use to create junkart charts.  This is my first post on the wide-ranging topic, which I shall take up again.

My first rule of thumb is: develop the concept first, then worry about tools. 

I believe the software question is misplaced.  One should never allow tools to get in the way of one's imagination.

Like an artist, I carry a sketchbook in which I draw many versions of charts for each data set I come across.  Once I see each version, I can better judge what works, and what doesn't.  As I sketch, I'll sometimes find insights in the data I hadn't noticed before, which will prompt another round of sketches.  Until I finalize the concept, I don't think about software.  Up to this point, it's as primitive as it gets.

What has all this got to do with the Madonna wall advertisement?  Notice the artists standing on the crane in the lower left corner.  I was walking in New York while thinking about this post, and thought what a perfect example of sketching, or developing the concept.  The artists weren't deciding what and how to paint the ad while the crane scaled the ten-storey building; they already had it sketched out, both on paper and on the wall itself.  Here is the blown-up image of Madonna's unfinished hand.  The sketchmarks are clearly visible.  So next time you make a chart, try making sketches first!