
Tag cloud example

Postings have been, and will be, limited as I'm on the road.

By way of Quantum of Wantum, here is another example of a tag cloud, courtesy of Harvard Law School.  I suppose the font size is proportional to the number of posts related to a particular country.  This graphic does a great job highlighting the most important categories; as the number of posts grows, it becomes even more powerful.

Hls


Tabloid journalism

Britain's newspaper industry is much more vibrant than America's, in terms of the number of major dailies in circulation, but the papers face the same problem of declining, aging readership.  One recently hot remedy is to change to a tabloid format; McKinsey investigated this phenomenon and published a thoughtful article examining the risks and rewards.

Mcktabloid

The article contains a problematic chart, shown on the right. (The junkchart version is below.)

The title of this chart is "Mixed results".  This message is much clearer in our junkchart version, where everything below the line is bad; everything above, good (or indifferent).

Redotabloid

Also note that I plotted raw numbers, rather than percentages. With only 14 cases, "10%" represents one newspaper, so why obfuscate?

Connecting lines between columns are supposed to aid our comparison of segments across different columns.  For the moment, where you see increase/no change/decrease,  think national/regional/local (newspapers).  Note the difference: national/ regional/local are pre-existing segments so that each newspaper stays in the same segment across the four columns;  increase/neutral/decrease are post-hoc segments based on measured responses so that the same newspaper can show up as "decrease" in Ad Revenue but "increase" in circulation.

Finally, let's discuss comparability.  Did changing to the tabloid format help or hurt the newspapers in each of four response metrics?

To answer this question properly requires using our imagination because we should be comparing the actual scenario of what happened after the Independent newspaper switched to tabloid with the imaginary scenario where the Independent did not switch.  The latter is not observable so statisticians would find a proxy for the Independent, say a similar newspaper, which did not make the switch and compare that with the Independent.  Further, we would look at many similar newspapers, not just two, in order to generalize our result.

The proxy, also called the "control", is necessary to establish causality.  The consultants' analysis omitted a proxy: all they did was to compare the same newspaper before and after it switched formats, and then only for those papers that made the switch.
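The before/after-with-control comparison that the consultants skipped can be sketched as a basic difference-in-differences calculation. All figures below are invented for illustration; they are not from the McKinsey study:

```python
# Hypothetical circulation figures (thousands); all numbers invented.
switcher = {"before": 220, "after": 250}   # paper that went tabloid
control  = {"before": 230, "after": 255}   # similar paper that did not switch

# Naive before/after comparison (what the consultants did):
naive_effect = switcher["after"] - switcher["before"]

# Difference-in-differences: subtract the change the control experienced,
# which proxies for what would have happened without the switch.
did_effect = naive_effect - (control["after"] - control["before"])

print(naive_effect, did_effect)
```

In this made-up example, the naive comparison credits the format switch with a gain of 30, but once the control's change is subtracted, only 5 remains: most of the gain would have happened anyway.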

Reference: "Dwindling readership: Are tabloids the answer?", McKinsey Quarterly, Jan 2005.


Practical statistics

On my profile, I list "practical statistics" as an interest.  This chart on U.S. Grade 8 test scores gives me an opportunity to explain what that means:

Grade8scores

The reader is fed (force-fed?) two messages: that there has been a small but detectable improvement in math scores since 1990; and that most of the increases were "statistically significant" (behold the asterisks!).

Redograde8scores

On the right is a "practical" view of the same data.

  • By using the start-at-zero rule (and max-at-500, because 500 is the maximum score), the small changes are immediately seen to be irrelevant; the line is almost flat.  I haven't checked how the "scale score" is created, but surely sub-300 scores out of 500 hardly constitute a record of pride.
  • Because "accommodations" (providing assistance to certain needy groups) clearly had a positive impact on the scores, and because this effect was not accounted for, a side-by-side comparison of the two periods is misleading and useless.  When the dashed line (1990-6) is removed, the trend is further flattened.

Most destructive for the enterprise known as "statistical testing" is that asterisk next to 278: this asterisk asserts that the 1-point increase from 2003 to 2005 is "statistically significant" (at 95%).  This result makes a mockery of statistics.  Clearly, no one cares about a 1-point difference; everyone can agree that it is not practically meaningful*. 

If you work for college admissions and you have two candidates, one scoring 278 and the other 279, would you accept the latter and reject the former based on the 1-point difference?  If, further, you realize that the top score is 500, how would you rate these two candidates?

Practical statistics do not accept statistical results without first asking whether they are practically meaningful.

Reference: NAEP: the site contains a wealth of data and some interesting graphical presentations, worth a look!
 

* For those interested in the theory behind statistical significance: statisticians distinguish between the true (population) average and the sample (observed) average.  In 2003, the average math score was observed to be 278, but the true average is likely to be 278 +/- 0.3, where 0.3 is known as the sampling error.  This sampling error reflects our uncertainty in the true average because of random noise (such as measurement errors).  Practically, this means that while we observe 278, the true average can be as low as 277.7 or as high as 278.3, or anything in between (most of the time).

Now, instead of estimating the 2003 score, estimate the difference between the 2005 and 2003 scores.  We observe a difference of 1.  But practically, the true difference will lie in the interval 1 +/- X, where X again is a sampling error.  If X > 1, the interval contains 0, which means that some of the time the true difference can be zero or negative, so we conclude that the difference is not statistically significant.  If X < 1, then the difference is statistically significant.
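This test can be checked directly. Assuming each year's average carries roughly a ±0.3 sampling error (as the footnote suggests; the 2005 figure is my assumption), the error on the difference combines the two:

```python
import math

se_2003 = 0.3   # sampling error of the 2003 average (from the footnote)
se_2005 = 0.3   # assumed to be the same for 2005
observed_diff = 1.0

# Errors of independent estimates combine in quadrature.
X = math.sqrt(se_2003**2 + se_2005**2)

# The interval 1 +/- X excludes zero, so the 1-point difference is
# "statistically significant" -- yet practically meaningless.
significant = X < observed_diff
print(round(X, 2), significant)
```

Here X comes out around 0.42, comfortably below 1, so the asterisk appears even though nobody should care about a 1-point gap.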

So far, so good... until you realize what factors affect the size of X.  One factor is the sample size.  The larger the sample size, the smaller X is.  This is called the Law of Large Numbers; the estimate of the true mean gets better and better when we get more data.  So just by increasing the sample size, and thus reducing X, even very small differences (like 1) can become "statistically significant".  But as we learn from this example, even if it is statistically significant, the tiny difference is practically meaningless.
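The sample-size effect can be made concrete. With an assumed standard deviation of 35 score points for individual students (my assumption, not a NAEP figure), the sampling error shrinks as the sample grows, and the same fixed 1-point difference eventually crosses the significance threshold:

```python
import math

sd = 35.0    # assumed standard deviation of individual scores
diff = 1.0   # the observed 1-point difference

def is_significant(n):
    # Standard error of the difference between two independent means,
    # each based on n students; 1.96 gives a 95% interval.
    se_diff = math.sqrt(2) * sd / math.sqrt(n)
    return 1.96 * se_diff < diff

print(is_significant(1000), is_significant(100000))
```

With 1,000 students per year the 1-point difference is not significant; with 100,000 (NAEP samples are large) it is. Nothing about the practical importance of the difference has changed, only the sample size.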


Mid-week light entertainment

Next in our light entertainment series, brought to you by Northern Trust:

Ntad2sm

Ironically, it is even more important for advertising copywriters to make sure their key messages come across effortlessly when they use charts.  They literally have three seconds to get my attention.

What are we supposed to read from this chart?  Discuss.

Thanks to Annette for passing this on.


Multiple line comparisons

An informative, and rather sophisticated, graphic appeared in the Times this weekend, so let's appreciate it:

Nytmanuf

Contained in this chart were two key messages:

  • while in general manufacturing employment has shrunk in most Western industrialized nations during the last 10 years, the trajectory of decline differed by country;
  • while manufacturing has lost employment in the U.S., other sectors have gained.

Nytmanufsm

By establishing comparability, this chart significantly betters our understanding of the U.S. manufacturing data.  If only the U.S. data were available (as shown on the right), the only appropriate response would have been to groan and moan.

To tell the full story, the chart used two related line charts: superimposed lines for the international comparison, and small multiples for the U.S. comparison by industry. 

In general, superimposed lines provide better aid to visual comparison, although plotting a large number of lines, or lines with many crossings, can result in an inextricable mess.  Also, only so many shades of gray or degrees of dottedness can be used before readers get dizzy.

However, superposition saves space, and heavy space consumption is a key disadvantage of small multiples.  When space is of minor concern, as in this example, small multiples can do wonders.  The graphs are nicely ordered from greatest growth to biggest decline, painting a vivid picture of the differential fortunes of various economic sectors.
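The ordering of the panels, from greatest growth to biggest decline, is simple to automate. The sector figures below are invented for illustration, not the Times's data:

```python
# Hypothetical employment (millions) by sector: (start, end) of the period.
sectors = {
    "Education/health": (15.0, 17.5),
    "Leisure":          (11.0, 12.4),
    "Manufacturing":    (17.0, 14.3),
    "Information":      (3.6, 3.1),
}

# Percentage growth per sector over the period.
growth = {name: (end - start) / start * 100
          for name, (start, end) in sectors.items()}

# Order the small multiples from greatest growth to biggest decline.
panel_order = sorted(growth, key=growth.get, reverse=True)
print(panel_order)
```

With these made-up numbers, the growing sectors lead and manufacturing lands last, which is exactly the vivid left-to-right story the Times's layout tells.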

Printing small multiples horizontally makes it easy to compare growth rates but harder to compare time periods.  The relative importance of the rate comparison versus the time comparison must be established before making this design decision.  Superimposed lines avoid this artifact.

The temporal perspective is sometimes important.  The top chart shows that Germany's manufacturing sector has been in continuous decline since 1993 (not sure why this arbitrary year is chosen) while for most other nations, a large drop occurred around 2001.  It also indicates that the U.S. has done comparatively better since mid 2003 than France, Germany and Britain so perhaps we should re-evaluate the recent brouhaha about outsourcing to China and India.

Neither the charts nor the article address the issue of what caused the drop in employment.  The blame is often placed on foreign competition but surely we must consider displacement by automation, machines, Internet, software, etc.

Reference: "Proof, Near and Far, That It's Not 1950 Anymore", New York Times, Oct 15, 2005.


Poll results and "Alabama first!"

Poll results are often presented with pie charts or bar charts.  When the responders are divided into segments (male/female, age groups, etc.), the grouped bar chart is the favored graphic.  With a small number of groups (2-3), the stacked bar chart works better, as here (the left chart is the original):

Redowebsvc

The pollster wants to know the major obstacles to adopting the new technology called Web services.  The responders were grouped according to their responses: very, somewhat and not concerned.

Given that question, the key result should be a ranking of the factors, from most important to least.  That is done in the junkchart version on the right.  Specifically, "Security" has the smallest proportion of "not concerned" responses.

One can also re-order according to the greatest proportion of "very concerned" responses; when I tried that, I noticed that "Confusion about the alphabet soup ..." and "Cost of training" had few "very" responses but a lot of "somewhat" responses.  I made the decision to treat those two categories as one, in essence.
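Either ranking is a one-line computation. The response counts below are invented (the article's exact figures are not reproduced here, and the factor names are placeholders); the factors are sorted by their share of "not concerned" responses, smallest first:

```python
# Hypothetical poll counts: (very, somewhat, not) concerned; all invented.
factors = {
    "Security":           (60, 30, 10),
    "Cost of training":   (20, 55, 25),
    "Alphabet soup":      (15, 60, 25),
    "Immature standards": (40, 40, 20),
}

def not_share(counts):
    # Proportion of "not concerned" responses for one factor.
    very, somewhat, not_c = counts
    return not_c / (very + somewhat + not_c)

# Most important factor first = smallest "not concerned" proportion.
ranking = sorted(factors, key=lambda f: not_share(factors[f]))
print(ranking)
```

Swapping `not_share` for a "very concerned" share (and reversing the sort) gives the alternative ordering discussed above.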

How were the factors ordered in the original chart?  Alphabetically.  Howard Wainer (book1 and book2) dubbed this rationale "Alabama first!", since Alabama is placed first in an alphabetical listing of the 50 U.S. states.  "Alabama first" ordering is very common but usually sub-optimal; one exception was the tag cloud.

Reference: "The Adolescent Web Services", IT Architect, Oct 2005.


Tag clouds are histograms

The tag cloud is one of the most pleasing new charts to appear in recent years.   Here is an example from Flickr, the on-line photo site, which is especially pretty:

Flickrtags

When a Flickr user uploads a photo, she has the option of assigning one or more labels ("tags") to it.  Flickr then produces a frequency count for each tag and then plots the top 120 (?) tags; the font size of each tag is proportional to its frequency of use.  Tagging is hailed as a massively distributed and participative method of classifying information, and I think it works brilliantly.

The data itself is nothing more than a frequency table ("wedding" 132,356; "party" 120,222; etc.), but this presentation is visually appealing and aptly functional.  Compare it with this typical histogram presentation:

Tagshisto

  • The Flickr version is ordered alphabetically, with frequency carried by the font size, whereas the histogram is ordered by frequency; the tag cloud therefore serves both people looking for the most popular categories and those looking for a specific term.
  • Flickr uses a clean interface without excessive underlining, highlighting, dots and so on.  No chartjunk!  To see chartjunk, go here and here.
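The mechanics are simple: a tag cloud is a frequency table with counts mapped onto font sizes. A minimal sketch, where the linear scaling rule, the size range and the tag counts are my own assumptions (not Flickr's actual algorithm or figures):

```python
# Hypothetical tag counts; Flickr's real figures are not reproduced here.
tags = {"wedding": 132356, "party": 120222, "birthday": 64003, "cat": 38115}

MIN_PT, MAX_PT = 10, 36   # assumed font-size range in points

lo, hi = min(tags.values()), max(tags.values())

def font_size(count):
    # Linear interpolation between the smallest and largest counts.
    return round(MIN_PT + (count - lo) / (hi - lo) * (MAX_PT - MIN_PT))

# Render alphabetically, as Flickr does, with size carrying frequency.
cloud = [(tag, font_size(n)) for tag, n in sorted(tags.items())]
print(cloud)
```

Each (tag, size) pair would then become, say, an HTML span styled at that point size; the alphabetical order and the size encoding together deliver the dual service described above.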

Here are some ideas for extension:

  • Be flexible in selecting the underlying population of tags: clicking on "wedding" gives a list of all photos labelled "wedding"; it is the most popular tag overall but leads to too many results and too little relevance.  Flickr also has little tag clouds for each user
  • Be flexible with the metric being plotted: aside from frequency of use, the size of the words can vary with other measurements such as recency of use and frequency of clicks
  • Introduce a hierarchy of tags: for example, clicking on "wedding" leads to another tag cloud so users can drill down.  This can be implemented using a hierarchical clustering algorithm, for example

P.S.  The idea to write this post came to me while chatting with Scott Matthews, who has created an interesting browser add-on, found at www.bitty.com


Rising bankruptcies and home prices

Steve has kindly plotted house price movements on a map so we can compare that with the bankruptcy growth map.  Recall the observation that "where home prices rise steeply, bankruptcies fall".  Notice that Steve reversed the color scheme so that blue indicates low bankruptcy growth and high home price growth.  This helps us visually inspect the two maps (nice touch!).

Redobankrupt

  • The assertion that low bankruptcy growth is associated with high home price growth makes sense only in California, the Eastern Seaboard and Florida
  • In middle America, even though home prices did not rise by much (and we don't know how much "much" is without the legend), there exist many pockets of high bankruptcy growth counties
  • Whether those pockets all constitute the "small-sample-size" regions marked out by the proviso on the original map is unclear
 

This article raises the issue of association versus causation.  One might be tempted to conclude that by creating conditions for a rising real estate market, a county government can hope to control the growth in bankruptcies.  Doing so is to confuse causation with association. 

We already noticed that both house price and bankruptcy growth are related to geography, exposing the familiar coastal/middle or East/West/Central distinctions.  Because so many metrics are correlated with such geographical segmentation, it is very difficult to argue that home price growth is the cause and bankruptcy growth the effect.  This is particularly so because we don't have a controlled experiment, only an observational study.  Mahalanobis has written about "latent variables" before; those variables you don't include in the study may well be more important.  Elsewhere, David Freedman has written much on causality from a statistician's perspective.

To answer Steve's question: the reason I would like to see population density added to the plot is that, as depicted, the colored areas are proportional to map areas (which, because of the projection, are not even proportional to true physical areas), whereas the better index would be population density rather than map area.  I was thinking along the lines of a cartogram, but I don't know how to create one; it's always a challenge to fit the pieces together once they are re-scaled and no longer map-sized.



Rising bankruptcies

It was quite exciting to see this nice map in the Sunday Times:
 

Nwr_bankrupt_map

Actual headline for the map: "Where Home Prices Rise Sharply, Bankruptcies Fall"

Alternative headline for the map: "Bankruptcies Jump in the South and Midwest"

Not clear from this map (but mentioned in the article):

  • A new law, effective Oct 17, which will make it harder to clear away credit card debt has touched off "a rush to the court" (This effect would have happened in 2005 so would be hidden in 2000-2005 changes.)
  • There is "strong evidence" that home equity borrowing is providing a further bulwark against disaster although the author also cites an economist saying that unemployment rates, not house prices, tend to be the most important predictors of bankruptcy

The map would be even nicer if population density could be added to it, although I am not sure how this can be done without producing clutter.  Do make a suggestion in the comments if you have an idea.  Such a map would then adjust for the problem of "low populations" indicated in the very useful note on the map.

I'll also repeat a previously mentioned point, which is that the legend should mark out the actual maximum and minimum of the data, rather than using "greater than 35%".

Reference: "Where Home Prices Rise Steeply. Bankruptcies Fall", New York Times, Oct 9 2005.