Feb 01, 2007

Error spotting

My friend Augustine pointed me to this interesting graph showing the time of sunset over the course of a year.  (The original author's write-up is here.)

Flickr_sunset

Of course, one can produce a perfect chart by looking up meteorological records.  The main interest in this graph is how it was constructed.  Each cell in the graph represents an hour of a day, with days running across and time running down. The cells that are not dark each contain a photograph of the sunset contributed to Flickr, the photo-sharing site.  So this is in effect a graph created through mass collaboration (about 35,000 photos).
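The construction described above amounts to binning photos into (day, hour) cells. Here is a minimal sketch in Python, assuming each photo carries a capture timestamp; the timestamps below are made up for illustration, and the original author's actual pipeline is not known to me.

```python
from collections import Counter
from datetime import datetime

# Hypothetical capture times of photos tagged "sunset" (invented for illustration).
timestamps = [
    "2006-01-15 16:52",
    "2006-01-15 16:58",
    "2006-06-21 20:31",
    "2006-06-21 03:10",  # a stray cell: wrong camera clock, time zone, or tag
]

def cell(ts):
    """Map a timestamp to its (day of year, hour of day) cell."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
    return (t.timetuple().tm_yday, t.hour)

# Number of photos landing in each cell; cells absent from the counter stay dark.
grid = Counter(cell(ts) for ts in timestamps)

for (day, hour), n in sorted(grid.items()):
    print(f"day {day:3d}, hour {hour:2d}: {n} photo(s)")
```

A photo with a mis-set clock lands in a cell far from the sunset band, which is one way lighted cells can end up scattered over the graph.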

The "white" band roughly indicates the sunset.  What intrigues me is the variability... what are the reasons for lighted cells appearing all over the graph?

Some ideas include:

  • Different time zones
  • Incorrect time setting by some photographers
  • Erroneous tagging of photos as "sunset"

Jan 24, 2007

Convenience charting

Statisticians have long railed against "convenience sampling", that is, the practice of selecting samples based on what's easily available, not at random.  Say, picking your friends.

Wpost_childmortality

Dustin J sent in this example of what can only be called "convenience charting".  Dustin said he had no clue what this chart is saying, and I am not surprised.

The chart plots a statistical object known as the "survival function".  It is likely that "survival analysis" was done, after which the chart creator picked up the resulting statistical object and dumped it onto this "convenience chart".

If we take the top line on the "child survival" graph, it shows the probability of one child surviving up to a certain age, if the child belonged to a family with 1-3 kids.  The chance is about 92.5% that the child will survive through age 2, and 88% that the child will survive through age 18.  The difference between those percentages is due to the chance that the child may die between ages 2 and 18.

A slight transformation of the data will make this point much clearer.  What is the probability of a child dying by a certain age?  Using the example, a child has 12% chance to die by age 18, and 7.5% chance of dying between ages 0-2.
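The transformation is just one minus the survival probability. A small Python sketch using the two values quoted above (92.5% and 88%); the conditional probability at the end is an extra step not read off the chart.

```python
# Survival probabilities quoted above for the top line (families with 1-3 kids).
survival = {2: 0.925, 18: 0.88}

# Probability of dying by a given age = 1 - probability of surviving through it.
death_by = {age: 1 - s for age, s in survival.items()}

# Chance of dying between ages 2 and 18 (difference of the two survival values).
death_2_to_18 = survival[2] - survival[18]

# The same event, conditional on the child having survived through age 2.
death_2_to_18_given_alive_at_2 = death_2_to_18 / survival[2]

for age, p in death_by.items():
    print(f"dies by age {age}: {p:.1%}")
print(f"dies between 2 and 18: {death_2_to_18:.1%}")
print(f"dies between 2 and 18, given alive at 2: {death_2_to_18_given_alive_at_2:.1%}")
```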

Redochildmortality

The junkart chart depicts this probability.  (I reverse-engineered the data, which explains why the distances between the line segments look strange.)

What this chart doesn't address is how we are to interpret the probability of "a child dying" in a family with more than one child.  Is it a random child dying?  At least one child dying?  Exactly one child dying (the other X-1 surviving)? 
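To see how much the answer depends on the interpretation, here is a rough Python sketch. It assumes, purely for illustration, that each child in a family dies independently with the same probability p; that assumption is mine and is almost certainly too simple for siblings sharing a household. The family size of 3 is also just an example.

```python
def at_least_one_dies(p, n):
    """P(at least one of n children dies), assuming independent, identical risks."""
    return 1 - (1 - p) ** n

def exactly_one_dies(p, n):
    """P(exactly one of n children dies, the other n-1 survive), same assumption."""
    return n * p * (1 - p) ** (n - 1)

p = 0.12  # chance of a given child dying by age 18, from the example above
n = 3     # illustrative family size (an assumption)

print(f"a random child dies:    {p:.1%}")
print(f"at least one dies:      {at_least_one_dies(p, n):.1%}")
print(f"exactly one child dies: {exactly_one_dies(p, n):.1%}")
```

Under these assumptions the three readings give quite different numbers, which is why the chart needs to say which one it means.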

The original chart also committed a number of standard errors.  The child survival function represents probabilities, not percentages.  The third category should be 8-11 kids, not 7-11.  If we are picky, then we would also like to see "confidence intervals" because there must have been far fewer families in the 12+ sample than the 1-3 sample.  In the second chart (which I don't have space to discuss), some data labels are missing, which indicates a presumption that all readers have seen the first chart.

Reference:  "Child, Parents Drive Each Other to Early Graves", Washington Post, Jan 14, 2007. 

Dec 15, 2006

Emergent patterns

It's always a pleasure to read blow-by-blow accounts of how charts were constructed.  The piece on time-travel maps was instructive.  Similarly in the previous post, I quoted the following:

It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

At first sight, this appears as a case of removing outliers, which many statisticians recommend.  Except that the data omitted were not outliers.  Indeed, when both x- and y-variables are bounded (between 0% and 100% share of the House seats; between -100% and +100% change in share), there can be no extreme values.

In effect, when the author eliminated those eight points, he followed the "emergent pattern" theory, by which I mean the notion of removing data until a pattern "emerges".  (By the way, emergence is now a science, as expounded here.)  If enough data is removed, one can produce any pattern one pleases.  One can find subsets of data to support a hypothesis of positive linear, flat linear or quadratic, as shown below.

Redoelectiond

Focusing now on the full data set in the upper left corner, one is hard pressed to conclude that a positive correlation exists between the two variables. In particular, most states experienced no changes in the share of House seats, and in these states, the income growth ranged from under 20% to over 40%, which is pretty much the extent of variability across the full data set.
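The "emergent pattern" effect is easy to reproduce on toy data. The Python sketch below does not use the election data; it uses a made-up 5-by-5 grid of points, which has zero correlation by construction, and shows that hand-picked subsets produce a strong positive or a strong negative correlation.

```python
from math import sqrt

def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)

# Made-up data: every (x, y) combination on a 5x5 grid, so x and y are unrelated.
grid = [(x, y) for x in range(5) for y in range(5)]

# Two "convenient" subsets, each keeping 13 of the 25 points.
near_diagonal = [(x, y) for x, y in grid if abs(x - y) <= 1]
near_antidiagonal = [(x, y) for x, y in grid if abs(x + y - 4) <= 1]

r_full = pearson(grid)
r_up = pearson(near_diagonal)
r_down = pearson(near_antidiagonal)

print(f"all 25 points:       r = {r_full:.3f}")
print(f"kept near diagonal:  r = {r_up:.3f}")
print(f"kept near anti-diag: r = {r_down:.3f}")
```

The same data "supports" opposite conclusions depending on which points are thrown away, so a pattern that appears only after removal says little about the full data set.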

Aug 15, 2006

Bumps charts and NYT

I just cannot resist another post on Bumps charts since NYT finally started using them.  Here are two recent examples:



Nyt_propertytax

This first chart illustrates the change in property taxes in different municipalities since 1998, as compared to the national average.

A wealth of information is revealed:

  • All these places charge more than the national average today
  • New York City used to charge less than average but that ended in 2003
  • The tax rates are clustered into three groups, about 6%, about 5% and below 4%.  The variance between different places has decreased during these years
  • A sharp rise was recorded in all these places in 2001-3 although New York City lagged slightly.  The sharp rise was not observed nationwide


Reference: "Gain in Income is Offset by Rise in Property Taxes", New York Times, Aug 8 2006.

Nytmconfidant
The second example is much cleaner as it involves only one period.  Bolding the "no one" line is particularly effective, bringing out the author's point well.

However, I'd have put the "no one" label on the right, just like the other labels, but bolded.

One could also argue that the real story is the simultaneous decline of "friend", "co-worker" and "neighbor" and rise of "no one" and "spouse".

Finally, it'd be interesting to see the multi-period version as the smooth linear trends are rather hard to believe.

Reference: New York Times Magazine, July 16 2006.

Aug 12, 2006

Transparent circles

Housepricetoearningsratiolarge

Jens from Library House sent us this chart featuring house price to earnings ratios.  In his own words:

"the key thing that I just love is that they have included the data points, but not as points, but as little transparent circles. This allows you to understand by how much two data points are spaced apart from each other, visualising growth and making this chart look very dynamic. I have never seen this in this form before: very nice. Beyond this, the axes are clearly labelled, all in all a very simple chart, beautifully executed."

Nov 15, 2005

The sad tally 4: the boxplot

Given several datasets (scatter plots), how does one tell random from non-random?  We plot features of the data structure and hope to see abnormal behavior, which we take to indicate the presence of non-randomness.

Realsuidkeys_1

In the last post, we noticed that the key distinguishing feature is the degree of variability in the data: for purely random data, we expect a certain degree of variability; non-random data betray their character by showing more or less variability at different points in the distribution (see right).  So each scatter plot reveals variability around the mean (0), but the variability apparently has a different shape in Dataset 9 (the real data) than in the other datasets.  We saw this in a "cumulative distribution" plot and less strongly in a plot of the frequency of extreme values.

Given what we know now, a much simpler plot will suffice.  The boxplot, an invention of John Tukey, a giant among statisticians, succinctly summarizes the key features of a distribution, such as the median (bold line), the 25th and 75th percentiles (edges of the box), the spread (height of the box), and outliers (individual dots above or below the box).
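The quantities a boxplot draws can be computed in a few lines. This Python sketch makes two assumptions not stated in the post: quartiles use linear interpolation between sorted values, and outliers are points more than 1.5 box-heights beyond the box (a common convention). Apart from the extreme value of 61, the counts are invented for illustration.

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation between neighbouring sorted values."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def boxplot_summary(values, whisker=1.5):
    """Median, quartiles, box height (IQR) and outliers, as drawn in a boxplot."""
    data = sorted(values)
    q1, median, q3 = (quantile(data, p) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical counts per location; only the 61 echoes a figure from the post.
counts = [5, 8, 10, 12, 13, 15, 17, 18, 20, 61]
summary = boxplot_summary(counts)
print(summary)
```

In this toy example, the value 61 falls well outside the upper fence and would be drawn as an individual dot above the box.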


Suicidesbox

The evidence of non-randomness in Dataset 9 now starkly confronts us.  Its box is much taller; the median line is markedly lower than those of the other 8 datasets; the extreme value of 61 is clearly out of whack.

Now, let's take a step back and look at what this non-randomness means.  We are concluding that the choice of suicide location is not random.  Location 69/70 (the outlier, with 61 deaths) is definitely the most popular.  Partly because of this, many of the other locations have had fewer than 20 deaths, which is fewer than would be expected if people had selected locations at random.

Next time, I will describe another way to compare distributions; this method is more advanced but also more direct.
