May 31, 2007

If we report it, it's a fact

David Leonhardt wrote in the NYT of a shocking incident of statistical abuse committed by Lou Dobbs and the CNN crew.

On several recent occasions, while commenting on the red-hot immigration issue, Lou and company remarked that "there had been 7,000 cases of leprosy in this country over the previous three years, far more than in the past".  (Leprosy is a flesh-eating disease prevalent among immigrants, particularly of Asian or Latin American origin.)

When asked about fact-checking, Lou reportedly said: "If we reported it, it's a fact."  A quick visit to the government's leprosy program web-site immediately reveals the time-series chart, shown on the left.  With about 150 cases a year in the last 5 years or so, one is hard pressed to find the 7,000 alleged cases!

Furthermore, because this chart lacks comparability, we fail to see that 150 cases out of a population of 300 million represent a minuscule risk.
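To put those figures side by side, here is a quick back-of-the-envelope sketch.  It uses only the round numbers quoted in this post (about 150 cases a year, a population of 300 million, and the claimed 7,000 cases over three years); the variable names are mine.

```python
# Round figures quoted in the post, not official statistics.
annual_cases = 150
population = 300_000_000
claimed_cases = 7_000
claimed_years = 3

# Annual risk expressed as cases per million people.
cases_per_million = annual_cases * 1_000_000 / population

# The claim, converted to an annual figure, versus the charted level.
claimed_per_year = claimed_cases / claimed_years
overstatement = claimed_per_year / annual_cases

print(f"Annual risk: {cases_per_million} cases per million people")
print(f"Claimed annual count: {claimed_per_year:.0f} cases")
print(f"Overstatement factor: {overstatement:.1f}x")
```

On these numbers the risk is half a case per million people per year, and the claim overstates the charted level roughly fifteen-fold.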

A slight downward trend is evident in the last 20 years or so; this record is even more impressive when we realize the population grew during this period.  These points can be made clearer in multivariate plots.

Source: "Truth, Fiction and Lou Dobbs", New York Times, May 30, 2007; U.S. National Hansen's Disease web-site.

 

May 23, 2007

Looking for survival

Daniel W of esnips has started a collection of graphics on visualizing web statistics.  The following graph is an attempt to capture the ability of the web-site to attract returning customers.

The time axis serves double duty here: it indicates which "cohort" the users belong to (in other words, when they signed up), and it is also the month of returning visits.

A more typical chart used by statisticians is the survival curve.  As shown here, these are the same curves as above but with a common starting point.  Now, the time axis is interpreted as number of months after registration: of 100 members who registered in January, how many returned one month later, two months later, etc.?
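The re-alignment itself is simple: subtract the registration month from the visit month, so every cohort starts at month 0.  Here is a minimal sketch, assuming the data comes as (cohort month, visit month, percent returning) rows; that layout, the function name `align_cohorts`, and the numbers are all made up for illustration and are not Daniel's data.

```python
from collections import defaultdict

def align_cohorts(visits):
    """Re-index (cohort_month, visit_month, pct_returning) rows by months since signup.

    Returns {cohort_month: [pct at month 0, pct at month 1, ...]}.
    Assumes each cohort has one row per month with no gaps.
    """
    by_cohort = defaultdict(dict)
    for cohort, visit_month, pct in visits:
        months_since = visit_month - cohort
        if months_since < 0:
            raise ValueError("a visit cannot precede registration")
        by_cohort[cohort][months_since] = pct
    return {
        cohort: [by_age[age] for age in sorted(by_age)]
        for cohort, by_age in by_cohort.items()
    }

# Made-up numbers for illustration only; months are numbered 1 = January.
visits = [
    (1, 1, 100), (1, 2, 40), (1, 3, 30), (1, 4, 25),
    (2, 2, 100), (2, 3, 45), (2, 4, 35),
    (3, 3, 100), (3, 4, 50),
]
curves = align_cohorts(visits)
for cohort, curve in sorted(curves.items()):
    print(f"cohort {cohort}: {curve}")
```

Each list can then be plotted against its index, so the oldest cohort gives the longest line.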

If the purpose is to evaluate the consistency of retaining customers by cohort, then this graphic is less cluttered.  I also used a fading metaphor to color the lines so that the oldest cohort (also, the longest line) is the faintest.  Line labels are best hidden, and revealed interactively when the user mouses over a line of interest.

Not sure if Daniel was plotting real data; in general, we expect a certain amount of criss-crossing.  If the data is real, then his site has seen uninterrupted improvement every month thus far.

Source: The Web Analytics Graph Collection, eSnips.

May 17, 2007

People picture

This graphic appeared on the front page of the British paper, the Independent.  I find it to be effective, although definitely not efficient a la Tufte: the data-ink ratio is abysmal.  Two data points on the entire page, with both data labels drawn in extra large font!

It can be improved if the 24 guys are given a different color so we can see the amount of improvement between 1971 and "NOW".

Some may complain that the use of percentages obscured population growth during this period.  Perhaps there should be fewer men on the left than on the right.  Unfortunately, that would in turn obscure the comparison of percentages.

A bit of research into the data (at Cancer Research UK) reveals that the average survival rate hides a very wide range of rates (by type of cancer, by gender, by gender and type, etc.).  One might argue that the average is quite meaningless for most users.

An alternative construct is a time series chart showing the increase in survival rate over time.  It would plot more data and depict a trend (or lack thereof).  I'd have to agree with the editor that such a chart would look unattractive on the newsstand.

Source: "Cancer: the good news", The Independent, May 16, 2007


May 03, 2007

Less is more

Derek pointed me to the style.org site which also parses political speeches.  Their preferred graphic is not the tag cloud but a labeled bar chart.

From top to bottom, each bar represents a sentence; the length of each bar is the length of that sentence.  Further, the user can specify word pairs for comparison.  Here the red bars are sentences containing the word "freedom"; the blue bars, "security".

It's a good illustration of the "small multiples" principle in constructing comparative graphics.

However, the choice of dimensions is perplexing.  I'd be much more interested in the timing of mentions of those words, rather than which sentence they appeared in.  I also find the length of each sentence to be irrelevant.

Here's one concept that brings out the point better.  It uses less space and voluntarily gives up some of the data (the sentence structure).
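For the curious, the positions behind such a chart could be computed along these lines.  This is a rough sketch, not the method used by style.org or in my redo: it treats word position in the transcript as a stand-in for timing, and the function name `mention_positions` and the sample text are made up for illustration.

```python
import re

def mention_positions(text, word):
    """Return each mention of `word` as a fraction (0 to 1) of the way through the text.

    Position is counted in words, used here as a proxy for timing.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    target = word.lower()
    return [i / len(tokens) for i, token in enumerate(tokens) if token == target]

# Made-up text for illustration only, not an actual speech.
speech = "Freedom matters. Security matters too. We defend freedom."
freedom = mention_positions(speech, "freedom")
security = mention_positions(speech, "security")
print("freedom:", freedom)
print("security:", security)
```

Each list can be drawn as tick marks along a single horizontal line, one line per word.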

Apr 12, 2007

Peripherals 2

In terms of interactive charting, Google Finance did much more than hide the legend.  In their main stock price chart, they used a number of neat features.

This chart effectively conveys a huge amount of information in a small space.  The bottom strip which shows relative prices for the past two years provides context to interpret the five-day movement shown in the main chart area.  I prefer to see a scale on the bottom strip as well. 

The sliding scrollbar can be dragged to show historical data.  Besides, the width of the window shown in the main area can be controlled.  For instance:

Without any effort, we are now looking at a 3-month chart for Q2 2006.  Notice the summary statistic on the top right corner also morphed.  The axis scale changed, and it never did start from zero to begin with.  (This shortcoming is alleviated by the profile chart in the bottom strip.)

Further, by placing the cursor in the chart area, we can highlight a particular day: a dot appears on the price curve, the volume on that day is highlighted, and the text on the top right switches.  That text is what we typically place inside the chart area as a "data label".  The effect of moving it to the corner is similar to hiding the legend: it makes the graph more legible and provides space for longer descriptions.  As we move the cursor from left to right, the graph dynamically adapts.  Marvellous!

The amount of data processing needed to implement these sorts of features may not be obvious.  I don't have space to address the data issue but maybe some of our readers can comment on it.

Feb 16, 2007

Mirror, mirror

Mirror, mirror on the wall...

I don't see what the second line adds to this plot, given there were only two candidates in this election. 

Political graphs do not get much better than those at the Political Arithmetik blog.

For instance, in the chart below, he wisely chose to draw trend-lines rather than connecting the individual dots.  Also, he typically plots dots for all the different polls, which allows us to assess the variability (reliability) of the observed trend.

 

Reference: "Sarko embraces the Anglo-Saxons", Economist, Feb 3 2007.

Feb 01, 2007

Error spotting

My friend Augustine pointed me to this interesting graph showing the time of sunset over the course of a year.  (The original author's write-up is here.)

Of course, one can produce a perfect chart by looking up meteorological records.  The main interest in this graph is how it was constructed.  Each cell in the graph represents an hour of a day, with days running across and time running down.  The cells that are not dark each contain a photograph of the sunset contributed to Flickr, the photo-sharing site.  So this is in effect a graph created through mass collaboration (about 35,000 photos).
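The binning behind such a grid can be sketched in a few lines.  This is my guess at the general approach, not the original author's code: it assumes each photo carries a timestamp, and the function name `sunset_grid` and the four sample timestamps are made up for illustration.

```python
from datetime import datetime

def sunset_grid(timestamps):
    """Count photos in a 24 x 366 grid: rows are hours of the day, columns are days of the year."""
    grid = [[0] * 366 for _ in range(24)]
    for ts in timestamps:
        day_index = ts.timetuple().tm_yday - 1  # tm_yday runs from 1 to 366
        grid[ts.hour][day_index] += 1
    return grid

# Made-up timestamps for illustration only.
photos = [
    datetime(2006, 1, 1, 17, 5),
    datetime(2006, 1, 1, 17, 40),
    datetime(2006, 6, 21, 20, 30),
    datetime(2006, 6, 21, 3, 15),  # a stray cell far from the sunset band
]
grid = sunset_grid(photos)
lit = [(hour, day) for hour in range(24) for day in range(366) if grid[hour][day]]
print(lit)
```

Any cell with a non-zero count is "lit", whether or not the timestamp is trustworthy, which is why stray cells can show up anywhere.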

The "white" band roughly indicates the sunset.  What intrigues me is the variability... what are the reasons for lighted cells appearing all over the graph?

Some ideas include:

  • Different time zones
  • Incorrect time setting by some photographers
  • Erroneous tagging of photos as "sunset"

Jan 17, 2007

Losing count of Doomsday

The Doomsday Clock is making the news today: because of the growing nuclear threat and continued denial of global warming, scientists say we are "five minutes from Doomsday".

This graph traces the movement of the clock's hand over the last few decades.  (I think it appeared on the New York Times website but I cannot find it now.)

The little tickmarks are superfluous, and the thin white borders between red columns serve only to make us dizzy.  As shown below, a line chart is much easier on the eyes.

Now, a question for the scientists: Why the clock analogy?  Does it reflect a kind of fatalism that we can never be more than 60 minutes away from Armageddon?  How many minutes were we from Doomsday two hours ago?

Jan 10, 2007

Complex is not random

There is a tendency to mistake complexity for randomness.  Faced with lots of data, especially when squeezed into a small area, one often has trouble seeing patterns, leading to a presumption of randomness -- when upon careful analysis, distinctive patterns can be recognized.

We encountered this when looking at the "sad tally" of the Golden Gate Bridge suicides (here, here, here, here and here).  Robert Kosara's recent work on scribbling maps of zip codes also highlights the hidden patterns behind seemingly random numbers.

Robert found a related example (via Information Aesthetics, originally here): the artist takes random numbers (lottery numbers), and renders them in a highly irrelevant graphical construct, as if to prove that spider webs can be generated randomly.

According to Infosthetics, each color represents a number between 1 and 49, which means the graph contains 49 colored zigzag lines (not counting gridlines and axes).  Each point on the year axis represents a frequency of occurrence.

Imagine you are tasked with using this chart to ascertain the fairness of the lottery, that is, the randomness of the winning numbers.  The complexity of this spider web makes a tough job impossible!  We must avoid the tendency to jump to the conclusion of randomness based on this non-evidence.

In fact, testing for randomness can be done using any of the methods described in the postings on the "Sad Tally" (links above).  A first step will be to plot the frequency of occurrence data as a simple column chart with 1 to 49 on the horizontal axis.  We'd like to show that the resulting histogram is flat, on average over all years.
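As a rough illustration of that first step, the sketch below tallies how often each number from 1 to 49 comes up and prints a crude text column chart.  The draws are simulated with Python's random module, not real lottery results, and six numbers per draw is my assumption (the source only says the numbers run from 1 to 49).  The Pearson statistic at the end is only a rough guide to flatness, because the numbers within one draw are not independent of each other.

```python
import random
from collections import Counter

def number_frequencies(draws, lowest=1, highest=49):
    """Count how often each number from lowest to highest appears across all draws."""
    counts = Counter(n for draw in draws for n in draw)
    return [counts.get(n, 0) for n in range(lowest, highest + 1)]

def pearson_statistic(freqs):
    """Sum of (observed - expected)^2 / expected against a flat histogram."""
    expected = sum(freqs) / len(freqs)
    return sum((f - expected) ** 2 / expected for f in freqs)

# Simulated draws (assumption: 6 distinct numbers per draw), not real data.
rng = random.Random(0)
draws = [rng.sample(range(1, 50), 6) for _ in range(1000)]

freqs = number_frequencies(draws)
stat = pearson_statistic(freqs)

# Crude text column chart: one row per number, one mark per 10 occurrences.
for number, freq in enumerate(freqs, start=1):
    print(f"{number:2d} {'#' * (freq // 10)} {freq}")
print(f"Pearson statistic: {stat:.1f}")
```

A fair lottery should give bars of roughly equal length; a few bars towering over the rest would be the first hint of trouble.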

Dec 26, 2006

End of year effect?

I agree with JF, who suggested that this chart was mind-boggling.  The chart accompanied a somewhat diffuse NYT article postulating that tax breaks, shifting medical practice, less apprehension about tired nurses, or added labor-inducing stress from visiting relatives may have something to do with more babies being born in December, particularly at month's end.

This chart presumably shows the "spike" in December births, or more precisely, the shift of January births into December.  The trouble with it is its lack of comparability.  We need to compare the 2002-3 trend to some prior year to see the shift.

Even then, we would have seen only one data point.  So it would have been better to plot multiple years.

Finally, after reading the article, I cannot discern the importance of Monday and Friday.  The yellow-pink coloring has not improved my comprehension of the data; it leaves me with more questions than before.

Reference: "To-Do Lists: Wrap Gifts, Have Baby", New York Times, Dec 20 2006.



PS. Please now visit Jon's response.  Kudos for digging out the historical data series and a stellar analysis!
