« December 2006 | Main | February 2007 »

Convenience charting

Statisticians have long railed against "convenience sampling", that is, the practice of selecting samples based on what's easily available rather than at random.  Say, polling your friends.

Dustin J sent in this example of what can only be called "convenience charting".  Dustin said he had no clue what this chart is saying, and I am not surprised. 

The chart plots a statistical object known as the "survival function".  It is likely that "survival analysis" was done, after which the chart creator  picked up the resulting statistical object and dumped it onto this "convenience chart".

If we take the top line on the "child survival" graph, it shows the probability of one child surviving up to a certain age, if the child belonged to a family with 1-3 kids.  The chance is about 92.5% that the child will survive through age 2, and 88% that the child will survive through age 18.  The difference between those percentages is due to the chance that the child may die between ages 2 and 18.

A slight transformation of the data makes this point much clearer.  What is the probability of a child dying by a certain age?  Using the example, a child has a 12% chance of dying by age 18, and a 7.5% chance of dying by age 2.
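The arithmetic behind this transformation is just one minus the survival probability.  A minimal sketch, using the two approximate values read off the chart for the 1-3 kids group:

```python
# Convert survival probabilities S(t) into cumulative death probabilities F(t) = 1 - S(t).
# The values are approximate readings from the chart (families with 1-3 kids).
survival = {2: 0.925, 18: 0.88}   # P(child survives through age t)
death = {age: round(1 - s, 3) for age, s in survival.items()}
print(death)                       # {2: 0.075, 18: 0.12}

# The chance of dying between ages 2 and 18 is the difference of the two:
print(round(death[18] - death[2], 3))   # 0.045
```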

The junkart chart depicts this probability.  (I reverse-engineered the data, which explains why the distances between the line segments look strange.)

What this chart doesn't address is how we are to interpret the probability of "a child dying" in a family with more than one child.  Is it a random child dying?  At least one child dying?  Exactly one child dying (the other X-1 surviving)? 

The original chart also commits a number of standard errors.  The child survival function represents probabilities, not percentages.  The third category should be 8-11 kids, not 7-11.  If we are picky, we would also like to see "confidence intervals", because there must have been many fewer families in the 12+ sample than in the 1-3 sample.  In the second chart (which I don't have space to discuss), some data labels are missing, which presumes that all readers have seen the first chart.

Reference:  "Child, Parents Drive Each Other to Early Graves", Washington Post, Jan 14, 2007. 


Losing count of Doomsday

The Doomsday Clock is making the news today: because of the  growing nuclear threat and continued denial of global warming, scientists say we are "five minutes from Doomsday".

This graph traces the movement of the clock's hand over the last few decades.  (I think it appeared on the New York Times website but I cannot find it now.)

The little tickmarks are superfluous, and the thin white borders between red columns serve only to make us dizzy.  As shown below, a line chart is much easier on the eyes.

Now, a question for the scientists: Why the clock analogy?  Does it reflect a kind of fatalism that we can never be more than 60 minutes away from Armageddon?  How many minutes were we from Doomsday two hours ago?


Subjectivity

When I look at charts like this one, I ponder: Should graph designers adopt "objectivity" as practiced by American journalists?

Is it even possible to make "objective" charts?  Every design choice we make seems to chip away some of the detachment.  In this chart, the choice to order important web-site features by shopper -- rather than merchant -- ratings is a tacit preference for those ratings.  Bringing out key messages in the data is a subjective act, isn't it?

Are "objective" charts useful?  In our example, the design choices are kept to a minimum, and so, it seems, is its usefulness.  In comparing shopper and merchant ratings, one would be most interested in identifying the most effective web-site features, as well as those features offered by merchants that find little resonance with shoppers.  These questions are better addressed by directly plotting the average rank and the ranking gap between merchants and shoppers (see below).

Notice that I said "ranking" rather than "rating".  The footnote discloses that the ratings were obtained from two different surveys conducted by two different companies at two different times.  How should we interpret the difference of 13 percentage points between the 89% of shoppers rating "Free Shipping" "very to extremely helpful" and the 76% of merchants rating "Free Shipping" "somewhat to very valuable"?
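Ranking within each survey sidesteps that incomparability: we sort each group's ratings separately and compare positions, not percentages.  A minimal sketch -- the feature names come from the article, but apart from Free Shipping's 89%/76%, the rating percentages below are invented for illustration:

```python
# Convert each survey's ratings to within-survey ranks, then compare ranks.
# Only Free Shipping's 89% (shopper) and 76% (merchant) appear in the article;
# the other percentages are hypothetical.
shopper  = {"Free Shipping": 89, "Keyword Search": 84, "Store Locator": 70}
merchant = {"Free Shipping": 76, "Keyword Search": 80, "Store Locator": 40}

def ranks(ratings):
    # rank 1 = highest-rated feature within that survey
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return {feat: i + 1 for i, feat in enumerate(ordered)}

r_shop, r_merch = ranks(shopper), ranks(merchant)
for feat in shopper:
    avg = (r_shop[feat] + r_merch[feat]) / 2
    gap = r_shop[feat] - r_merch[feat]   # negative: shoppers rank it higher
    print(f"{feat}: average rank {avg}, ranking gap {gap}")
```

Because each ranking is computed inside its own survey, the 13-point rating difference never enters the comparison.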

In the junkart chart, we can focus on three groups of features:

  • the three top features ("Promo Discounts", "Free Shipping" and "Keyword Search") which attained the best average rank and least ranking gap;
  • the three "orphan" features ("Recommended Products", "Top Sellers", "Gift Selection") created by loving web-site producers, abandoned by independent-minded shoppers;
  • the three "neglected stepchildren" ("Shop the Catalog", "Store Locator", "Product Comparison") whose importance to shoppers was vastly underestimated by the merchants.

Unfortunately, while being "objective",  the data table fails to point out anything of interest to the reader.

Reference: "Consumers want one thing -- merchants are delivering another", Internet Retailer, Jan 2007.


Complex is not random

There is a tendency to mistake complexity for randomness.  Faced with lots of data, especially when squeezed into a small area, one often has trouble seeing patterns, leading to a presumption of randomness -- when upon careful analysis, distinctive patterns can be recognized.

We encountered this when looking at the "sad tally" of the Golden Gate Bridge suicides (here, here, here, here and here).  Robert Kosara's recent work on scribbling maps of zip codes also highlights the hidden patterns behind seemingly random numbers.

Robert found a related example (via Information Aesthetics, originally here): the artist takes random numbers (lottery numbers) and renders them in a highly irrelevant graphical construct, as if to prove that spider webs can be generated randomly.

According to Infosthetics, each color represents a number between 1 and 49, which means the graph contains 49 colored zigzag lines (not counting gridlines and axes).  Each point on the year axis represents a frequency of occurrence.

Imagine you are tasked with using this chart to ascertain the fairness of the lottery, that is, the randomness of the winning numbers.  The complexity of this spider web makes a tough job impossible!  We must avoid the tendency to jump to the conclusion of randomness based on this non-evidence.

In fact, testing for randomness can be done using any of the methods described in the postings on the "Sad Tally" (links above).  A first step will be to plot the frequency of occurrence data as a simple column chart with 1 to 49 on the horizontal axis.  We'd like to show that the resulting histogram is flat, on average over all years.
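As a sketch of such a test -- using simulated draws rather than the actual lottery data, which I don't have -- one can tabulate how often each number wins and compute a chi-square statistic against the flat expectation:

```python
import random
from collections import Counter

# Simulate lottery draws of 6 distinct numbers from 1-49, then tabulate how
# often each number appears -- the flat-histogram check described above.
random.seed(7)
counts = Counter()
n_draws = 5000
for _ in range(n_draws):
    counts.update(random.sample(range(1, 50), 6))

expected = n_draws * 6 / 49   # each number's expected frequency under fairness
chi2 = sum((counts[n] - expected) ** 2 / expected for n in range(1, 50))
# Under fairness this statistic follows (approximately) a chi-square
# distribution with 48 degrees of freedom, whose mean is 48; a value far
# above the mid-60s would start to look suspicious.
print(f"chi-square statistic: {chi2:.1f}")
```

The same computation applied to the real winning-number frequencies, year by year, would answer the fairness question that the spider web obscures.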


Table pitfall

Happy New Year to you all!  I'm now back from holiday.

At work today, I came across this data table (shown at right is a small extract from the very large data table, with labels changed to protect the innocent).  I was scanning through the numbers, looking for differences between type A and type B samples.

If your eyes work like mine, you may pick out the "West" region comparison, mainly because of the jump in the leading digit.  But then I circled back, because the right side of my brain wanted both columns to add up to 100% (less rounding), and something had to compensate for the 15-21 jump.  After a moment's search, upon finding the 35-30 flip in the "South" region, I let out a sigh of relief.

Even though the above differences were about the same (5 or 6 percentage points), my eyes caught the change in leading digits and stuck to it.  This problem is especially acute when scanning quickly through reams of data tables.
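The two comparisons the eye treats so differently are in fact nearly the same size; only one of them crosses a leading-digit boundary.  A tiny illustration using the two pairs from the table:

```python
# The eye keys on leading-digit changes, but the gap size is what matters.
# "West" jumps 15 -> 21 (digit changes); "South" flips 35 -> 30 (it doesn't).
pairs = {"West": (15, 21), "South": (35, 30)}
for region, (a, b) in pairs.items():
    digit_change = str(a)[0] != str(b)[0]
    print(f"{region}: gap {abs(a - b)} points, leading digit changes: {digit_change}")
```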


So data analysts beware!  This includes those who scan financial statements, financial data, computer-generated logs, statistical software output (e.g. SAS), market research data, etc. for a living.  We are easily fooled.

Not convinced?  Let your eyes decide which difference is larger:

[Image: Datatable2_6]