
If we report it, it's a fact

David Leonhardt wrote in the NYT of a shocking incident of statistical abuse committed by Lou Dobbs and the CNN crew.

On several recent occasions, while commenting on the red-hot immigration issue, Lou and company remarked that "there had been 7,000 cases of leprosy in this country over the previous three years, far more than in the past".  (Leprosy is a flesh-eating disease prevalent among immigrants, particularly of Asian or Latin American origin.)

When asked about fact-checking, Lou reportedly said: "If we reported it, it's a fact."  A quick visit to the government's leprosy program web-site immediately reveals the time-series chart, shown on the left.  With about 150 cases a year over the last 5 years or so, one is hard pressed to find the 7,000 alleged cases!

Furthermore, because this chart shows raw counts with nothing to compare them against, we fail to see that 150 cases out of a population of 300 million represent a minuscule risk.
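To put the figures side by side, here is a rough back-of-the-envelope calculation in Python.  It uses only the rounded numbers quoted above (about 150 cases a year, a population of 300 million, and the claimed 7,000 cases over three years); the variable names are mine.

```python
# Figures quoted in the post; all are rounded approximations.
annual_cases = 150
population = 300_000_000
claimed_cases_3yr = 7_000

# Annual risk expressed per million residents.
cases_per_million = annual_cases / population * 1_000_000

# What three years at roughly 150 cases a year would add up to.
expected_cases_3yr = annual_cases * 3

# How far the claim overshoots the chart's figures.
overstatement = claimed_cases_3yr / expected_cases_3yr

print(f"{cases_per_million:.1f} cases per million per year")
print(f"{expected_cases_3yr} cases expected over three years")
print(f"claim is about {overstatement:.1f} times too high")
```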

A slight downward trend is evident in the last 20 years or so; this record is even more impressive when we realize the population grew during this period.  These points can be made clearer in multivariate plots.

Source: "Truth, Fiction and Lou Dobbs", New York Times, May 30, 2007; U.S. National Hansen's Disease web-site.


Looking for survival

Daniel W of esnips has started a collection of graphics on visualizing web statistics.  The following graph is an attempt to capture the ability of the web-site to attract returning customers.

The time axis serves double duty here: it is an indication of which "cohort" the users belong to, in other words, when they signed up; it is, also, the month of returning visits.

A more typical chart used by statisticians is the survival curve.  As shown here, these are the same curves as above but shifted to a common starting point.  Now, the time axis is interpreted as the number of months after registration.  Of 100 members who registered in January, how many returned one month later, two months later, etc.?
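For readers who want to try the realignment themselves, here is a minimal sketch in Python.  The cohort counts are made up for illustration (I don't have Daniel's data), and `survival_curve` is just my name for the helper.

```python
# Hypothetical counts: cohort month -> members returning in each calendar
# month, starting with the month of registration. These numbers are invented.
cohorts = {
    "Jan": [100, 60, 45, 40],
    "Feb": [120, 78, 62],
    "Mar": [90, 63],
}

def survival_curve(counts):
    """Convert counts to the fraction of the cohort still returning,
    indexed by months since registration (index 0 is always 1.0)."""
    start = counts[0]
    return [n / start for n in counts]

# Every curve now starts at month 0, whatever the calendar month of sign-up.
curves = {month: survival_curve(counts) for month, counts in cohorts.items()}

for month, curve in curves.items():
    print(month, [round(p, 2) for p in curve])
```

Plotting each list against its index (0, 1, 2, ...) gives curves that share a starting point, so older cohorts simply produce longer lines.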

If the purpose is to evaluate the consistency of retaining customers by cohort, then this graphic is less cluttered.  I also used a fading metaphor to color the lines so that the oldest cohort (also, the longest line) is the faintest.  Line labels are best hidden, and revealed interactively when the user mouses over a line of interest.

Not sure if Daniel was plotting real data; in general, we expect a certain amount of criss-crossing.  If the data is real, then his site has seen uninterrupted improvement every month thus far.

Source: The Web Analytics Graph Collection, eSnips.

Visualizing web statistics

Tim inquired about:

how to create an elegant graph for Web visitor traffic statistics that shows both how many views a page gets and then how many people click that page to go further ("conversion rate"). Part of the problem is that conversion rates vary from, say, .3% to 50% (a wide range).

Let's work with this sample data set.  I ordered it from highest to lowest click rate, which is the primary metric of interest.  The number of page views is of interest too, as rarely-visited pages may sometimes have high click rates.

At this point, it's important to know the context.  Specifically, who controls the allocation of pages? Did the data come from a randomized experiment? Or did they get a self-selected sample (e.g. web surfers deciding which section of the site to visit)?

The first construct I tried is the "lift curve" often used in marketing.  It's the same thing as the Lorenz curve used by demographers but interpreted differently.  Here, we see that Guitar pages accounted for 26% of the page views but 37% of the clicks; House pages accounted for an incremental 44% of the page views and 59% of the clicks; etc.  The relative click rates are immediately clear from the steepness of the line segments.  The lift curve is appropriate for the self-selected case, in which we can take the allocation of page views as fixed.
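The points on a lift curve are just cumulative shares, which can be sketched in a few lines of Python.  The page-view and click counts below are invented: I scaled them so that the Guitar and House shares match the percentages quoted above, and Bicycles stands in for whatever remains.  The function name `lift_curve` is my own.

```python
# Hypothetical page statistics: (section, page views, clicks).
# Counts are invented; only the Guitar and House shares echo the post.
pages = [
    ("Bicycles", 3000, 40),
    ("Guitar", 2600, 370),
    ("House", 4400, 590),
]

def lift_curve(rows):
    """Return (section, cumulative share of views, cumulative share of clicks),
    with sections ordered from highest to lowest click rate."""
    ordered = sorted(rows, key=lambda r: r[2] / r[1], reverse=True)
    total_views = sum(r[1] for r in ordered)
    total_clicks = sum(r[2] for r in ordered)
    points, cum_views, cum_clicks = [], 0, 0
    for name, views, clicks in ordered:
        cum_views += views
        cum_clicks += clicks
        points.append((name, cum_views / total_views, cum_clicks / total_clicks))
    return points

for name, view_share, click_share in lift_curve(pages):
    print(f"{name:<9} views {view_share:.0%}  clicks {click_share:.0%}")
```

Joining the points from the origin, in order, draws the curve; the steeper the segment, the higher that section's click rate relative to the rest.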

If the allocation of page views is a decision to be made, then it doesn't make much sense to accumulate page views.  The second construct is the "scatter plot" of % clicks versus % page views.  The steepness of the line through the origin helps us compare the click rates.  Bicycles is clearly inferior in generating clicks.

Both these constructs are highly efficient; adding new data does not expand the chart at all.

Keen readers will observe that the slope of the line is not the click rate but rather a click rate index (relative to the overall click rate).  This means that any data point above the diagonal has above-average click rate.
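To see why the slope is an index rather than a rate, here is a small Python sketch.  The counts are hypothetical (not the sample data set above) and the names are mine; the point is only that share of clicks divided by share of page views equals the section's click rate divided by the overall click rate.

```python
# Hypothetical page statistics: section -> (page views, clicks). Invented numbers.
pages = {
    "Guitar": (2600, 370),
    "House": (4400, 590),
    "Bicycles": (3000, 40),
}

total_views = sum(views for views, _ in pages.values())
total_clicks = sum(clicks for _, clicks in pages.values())
overall_rate = total_clicks / total_views

click_rate_index = {}
slope = {}
for name, (views, clicks) in pages.items():
    # Click rate relative to the overall click rate.
    click_rate_index[name] = (clicks / views) / overall_rate
    # Slope of the line from the origin to the point (% page views, % clicks).
    slope[name] = (clicks / total_clicks) / (views / total_views)

for name in pages:
    side = "above" if click_rate_index[name] > 1 else "below"
    print(f"{name:<9} index {click_rate_index[name]:.2f} ({side} the diagonal)")
```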

People picture

This graphic appeared on the front page of the British paper, the Independent.  I find it to be effective, although definitely not efficient à la Tufte: the data-to-ink ratio is abysmal.  Two data points on the entire page, with both data labels drawn in extra large font!

It can be improved if the 24 guys are given a different color so we can see the amount of improvement between 1971 and "NOW".

Some may complain that the use of percentages obscured population growth during this period.  Perhaps there should be fewer men on the left than on the right.  Unfortunately, that would in turn obscure the comparison of percentages.

A bit of research into the data (at Cancer Research UK) reveals that the average survival rate hides a very wide range of rates (by type of cancer, by gender, by gender and type, etc.).  One might argue that the average is quite meaningless for most users.

An alternative construct is a time series chart showing the increase in survival rate over time.  It would plot more data and depict a trend (or lack thereof).  I'd have to agree with the editor that such a chart would look unattractive on the newsstand.

Source: "Cancer: the good news", The Independent, May 16, 2007

String music

I have to admit I don't understand this graphic but it looks beautiful and I leave it to you to decipher it.  This is apparently a visualization of a piece of Western classical music in multiple dimensions using results from String Theory.

Technology Review describes how to read this particular graphic:

In the image, generated by Tymoczko's program, each ball represents a three-note chord. The farther apart the balls are, the farther the voices have to jump between chords. As a song plays, another ball moves around the cone. If it moves in a circle, a chord pattern is repeating. If it moves from the cone's tip to its base, a piece is progressing toward dissonance.

Dimitri has many more, and has been featured in Science, Time, New Scientist, etc. etc.

Source: "Seeing Music", Technology Review, Sept 8 2006.



Visualizing sensitivity

A reader wrote:

I'm a loyal reader who hopes you'll indulge him in just one or two questions.

In finance (valuation, specifically), we often create two-way sensitivity tables. Unfortunately, a three-way sensitivity table is what's most often called for. Of course, we work around this by producing multiple two-way tables.

Now, obviously, it's pretty hard to build a three-way table or chart in two dimensions, and the use-bigger-bubbles method doesn't really make sense in this kind of application, but can you conceive of a good way to present the data in any other form?

As he indicated, we typically see multiple two-way data tables for such data.  The virtue of this approach is that the data is exceptionally well-organized; it's great for looking up the outcome given the three dimensions (I called them Red, Green and Blue to protect the innocent).

Further, starting from a baseline i.e. a particular cell in the table, it's easy to move our eyes up, down or jump tables to observe the impact of changing dimensions (so-called sensitivity analysis).

These data tables facilitate "local" sensitivity analysis but obscure "global" sensitivity: staring at those numbers, we feel lost in the trees and can't see the forest.  What's the effect of increasing Green on average?  What's the effect of increasing Green while decreasing Blue? etc. etc.
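Before reaching for a chart, the "global" questions can also be answered numerically by averaging over the dimensions we are not asking about.  Here is a minimal Python sketch; the factor levels and the `value` formula are invented purely so there is something to average (the reader's actual valuation model is unknown to me), and `average_by` is my own helper.

```python
from itertools import product
from statistics import mean

# Hypothetical three-way sensitivity grid. Levels and formula are assumptions.
reds = [0.09, 0.10, 0.11]
greens = [2, 4, 8, 16]
blues = [0.28, 0.30]

def value(red, green, blue):
    # Made-up valuation: rises with Green and Blue, falls with Red.
    return 1000 * green * blue / red

scenarios = [
    {"red": r, "green": g, "blue": b, "value": value(r, g, b)}
    for r, g, b in product(reds, greens, blues)
]

def average_by(rows, *dims):
    """Average value for each combination of the chosen dimensions,
    averaging over every other dimension."""
    groups = {}
    for row in rows:
        key = tuple(row[d] for d in dims)
        groups.setdefault(key, []).append(row["value"])
    return {key: mean(vals) for key, vals in groups.items()}

# One-way effect: how the average value moves as Green increases.
by_green = average_by(scenarios, "green")
for key, avg in sorted(by_green.items()):
    print("Green =", key[0], "average value", round(avg))

# Two-way effect: Green and Blue varied together.
by_green_blue = average_by(scenarios, "green", "blue")
```

Averages hide the spread, of course, which is exactly what a graphical treatment can bring back.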

The junkart construct (right) is made to address these questions.  The black stripes establish the baseline, the overall range of values.  Then, if interested in the effect of Red = 0.11, we can compare those red stripes with the black.  Since the spread is wide, we note that Red = 0.11 is not a strong indicator of value, and to the extent it is, it points to lesser values.

What about Red = 0.11 and Green = 2?  Now, we focus on the first red stripes and the first green stripes.  We note that the overlapping region (which is where both conditions apply) is highly concentrated to the low end of value range.  Thus, we conclude that under those conditions, value is low (below 10,000) and further, that it is low primarily because Green = 2.

On and on for any one-way, two-way or three-way effects.

Although it's not the purpose of the chart, local sensitivity can also be observed.  For example, the highest value comes from Red = 0.09, Green = 16 and Blue = 0.30.  What if Blue decreases to 0.28?  We start on the Blue = 0.28 layer; going from right to left, each time we see a blue stripe, we scan vertically to find the corresponding red and green stripes; at the 3rd stripe from the right, we find the scenario of interest.  Such analysis would benefit from adding an interactive vertical guiding line.

Do you prefer 3-D plots?  Contour plots? Feel free to share your ideas!

Less is more

Derek pointed me to the style.org site which also parses political speeches.  Their preferred graphic is not the tag cloud but a labeled bar chart.

From top to bottom, each bar represents a sentence; the length of each bar is the length of each sentence.  Further, the user can specify word pairs for comparison.  Here the red bars are sentences containing the word "freedom"; the blue bars, "security".

It's a good illustration of the "small multiples" principle in constructing comparative graphics.

However, the choice of dimensions is perplexing.  I'd be much more interested in the timing of mentions of those words, rather than which sentence they appeared in.  I also find the length of each sentence to be irrelevant.

Here's one concept that brings out the point better.  It uses less space and voluntarily gives up some of the data (the sentence structure).
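For anyone wanting to extract the timing of mentions from a transcript, here is one rough way to do it in Python.  It assumes that position by word count is a fair proxy for timing, which is my simplification; the function name and the sample text are placeholders, not an actual speech.

```python
import re

def mention_positions(text, word):
    """Positions of `word` in `text`, as fractions of the way through
    the speech measured in words (0.0 = start, 1.0 = end)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return []
    last = max(len(tokens) - 1, 1)  # avoid dividing by zero for one-word texts
    return [i / last for i, tok in enumerate(tokens) if tok == word.lower()]

# Placeholder text, not a real speech.
speech = "Freedom matters. Security matters too. We choose freedom and security."

for word in ("freedom", "security"):
    print(word, [round(p, 2) for p in mention_positions(speech, word)])
```

Each list can then be drawn as tick marks along a single horizontal line per word, which discards the sentence structure but keeps the timing.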