
End of year effect?

I agree with JF, who suggested that this chart was mind-boggling.  The chart accompanied a somewhat diffuse NYT article postulating that tax breaks, shifting medical practice, less apprehension about tired nurses, or the added labor-inducing stress of visiting relatives may have something to do with more babies being born in December, particularly at month's end.

This chart presumably shows the "spike" in December births, or more precisely, the shift of January births into December.  The trouble with it is its lack of comparability.  We need to compare the 2002-3 trend to some prior year to see the shift.

Even then, we would have seen only one data point.  So it would have been better to plot multiple years.

Finally, after reading the article, I cannot discern the importance of Monday and Friday.  The yellow-pink coloring has not improved my comprehension of the data; it leaves me with more questions than before.

Reference: "To-Do Lists: Wrap Gifts, Have Baby", New York Times, Dec 20 2006.

PS. Please now visit Jon's response.  Kudos for digging out the historical data series and a stellar analysis!

Scribbling as art

Over at EagerEyes, they created this beautiful visual of zip codes in the U.S., proving that scribbling is art.

They took the zip codes in numeric order, connecting all of them in a line.  The colors represent States.  We begin to see some order in the 5-digit madness.

Of course, such scribbling serves a specific and highly appropriate purpose here, and would not be generally recommended.

Emergent patterns

It's always a pleasure to read blow-by-blow accounts of how charts were constructed.  The piece on time-travel maps was instructive.  Similarly in the previous post, I quoted the following:

It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

At first sight, this appears as a case of removing outliers, which many statisticians recommend.  Except that the data omitted were not outliers.  Indeed, when both x- and y-variables are bounded (between 0% and 100% share of the House seats; between -100% and +100% change in share), there can be no extreme values.

In effect, when the author eliminated those eight points, he followed the "emergent pattern" theory, by which I mean the notion of removing data until a pattern "emerges".  (By the way, emergence is now a science, as expounded here.)  If enough data are removed, one can produce any pattern one pleases.  One can find subsets of the data to support a hypothesis of a positive linear, flat linear or quadratic relationship, as shown below.


Focusing now on the full data set in the upper left corner, one is hard pressed to conclude that a positive correlation exists between the two variables.  In particular, most states experienced no change in the share of House seats, and in these states, the income growth ranged from under 20% to over 40%, which is pretty much the extent of variability across the full data set.
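The point about subsets can be reproduced with a toy example.  Here is a minimal sketch in Python, using synthetic grid data (not the election data) and a hypothetical helper `ols_slope`: the full set has no trend at all, yet well-chosen subsets show a perfect rising or falling line.

```python
from itertools import product

def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Synthetic data: every (x, y) combination on a 5x5 grid, so x and y are unrelated.
full = list(product(range(5), repeat=2))

# "Remove data until a pattern emerges."
rising = [(x, y) for x, y in full if x == y]       # keep only the diagonal
falling = [(x, y) for x, y in full if x + y == 4]  # keep only the anti-diagonal

print(ols_slope(full))     # 0.0
print(ols_slope(rising))   # 1.0
print(ols_slope(falling))  # -1.0
```

The same 25 points support three contradictory stories, depending only on which ones are kept.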

The trouble with percentages

In the aftermath of the Democratic victory in the 2006 mid-term election, the NYT published a column floating the idea that "it was the economy, stupid".  For statistics buffs, this column provides much food for thought. 

Suffice it to say, if you were my student, you would not want to hand this in as an essay.  To the author's credit, he did backload the article with lots of disclaimers.

The key thesis of the piece is:

if your state wasn’t among the best economic performers in the last six years, judged by the growth of personal income, it appears that you were three times as likely to vote to throw the bums out.

(We'll just assume he didn't mean "you" but "your state".)  To help us understand the author's logic, I created a scatter plot, relating the change in state average personal income (2000-2006) to the change in percent of Republican seats.

He first segmented the states into two groups: the red dots had the top 10 income growth rates; the blue dots were the remaining states.  Then for each group, he computed the average drop in % Republican.  For the reds, it was 2%; for the blues, it was 7%.  (These levels are indicated by the horizontal lines.  My data are slightly different from his.)  Case proven -- with disclaimers.

Some of you are already counting the dots.  If you only find 42, you'd have counted correctly.  The following explanation provided by the analyst is classic:

It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

I will leave the emergent pattern thesis to a future post.  For this post, I am interested in the trouble with percentages.  He is right to point out that for those 100% Blue states, the change in %Republican is constrained to be non-negative, from 0% up to 100%.  For most other states, the change can be positive or negative.

Good observation but wrong remedy -- those six states with 0% Republicans in 2000 are not special; removing them from the analysis is wrong-headed.  What about those states with 100% Republicans in 2000?  There, the change in %Republican can only be 0% or negative.  In fact, the possible range for the change in seats for each state is different, and it depends on the Republican proportion in 2000!  For example, if in 2000 the Republicans held 30% of the seats, then in 2006, the change must be between -30% and +70%.
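The arithmetic behind that range is simple enough to sketch.  The function name `change_bounds` is mine, and shares are expressed in whole percentage points for illustration:

```python
def change_bounds(share_2000):
    """Feasible (min, max) change in seat share, in percentage points,
    given the starting share in percent."""
    return (-share_2000, 100 - share_2000)

for share in (0, 30, 100):
    low, high = change_bounds(share)
    print(f"start {share}%: change between {low:+d} and {high:+d}")
```

A state starting at 0% can only move up, a state starting at 100% can only move down, and everything in between has its own lopsided window.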

The situation is worse: the range of possible values also depends on the number of seats in each state.  The fewer total seats there are, the fewer possible values that can be taken.  As the author notes, with only 1 seat, you either lose it, gain it or retain it, so that the change will be either -100%, +100% or 0%.  No other values are possible!
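To see how coarse the possible values become in small states, here is a rough sketch.  It assumes, for simplicity, that a state has the same number of seats in both elections, which ignores reapportionment; `possible_changes` is a hypothetical helper.

```python
from fractions import Fraction

def possible_changes(total_seats, seats_held):
    """All feasible changes in seat share (percentage points), assuming the
    state's total number of seats is the same in both elections."""
    return [Fraction(100 * (won - seats_held), total_seats)
            for won in range(total_seats + 1)]

# One seat, currently held: lose it or retain it.
print([float(c) for c in possible_changes(1, 1)])  # [-100.0, 0.0]
# Two seats, one currently held.
print([float(c) for c in possible_changes(2, 1)])  # [-50.0, 0.0, 50.0]
```

A state with n seats has only n + 1 possible outcomes, so a scatter plot of percentage changes mixes a few very coarse scales with a few fine ones.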

Both the above troubles arise because we use percentages to describe something discrete (number of seats).  This is a difficult problem and I don't know of a general solution.  However, in this example, because the change in seats is small across all states, regardless of the total number involved, I recommend that we avoid percentages and stick with positive, zero and negative changes.

The boxplot shows that there is little correlation between income growth and whether Republicans would win or lose House seats in 2006.  Here, the states are divided into three groups depending on whether the Republicans gained, lost or retained seats in the 2006 mid-term election.  Median income growth is similar in all three groups and the boxes overlap heavily.
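The grouping behind such a boxplot might look something like the following sketch.  The rows are invented purely for illustration and are not the data behind the chart; only the mechanics (classify by the sign of the seat change, then compare medians) reflect the approach described above.

```python
from statistics import median

# Hypothetical (state, change in Republican seats, income growth %) rows,
# invented for illustration; they are NOT the data behind the post's chart.
rows = [
    ("A", -2, 28.0), ("B", -1, 31.5), ("C", 0, 25.0), ("D", 0, 33.0),
    ("E", 0, 29.5), ("F", 1, 30.0), ("G", -1, 27.0), ("H", 1, 32.0),
]

def outcome(seat_change):
    """Reduce a seat change to its sign: gained, lost or retained."""
    if seat_change > 0:
        return "gained"
    if seat_change < 0:
        return "lost"
    return "retained"

# Collect income growth by outcome group.
groups = {}
for _, seat_change, growth in rows:
    groups.setdefault(outcome(seat_change), []).append(growth)

medians = {name: median(values) for name, values in groups.items()}
print(medians)
```

Using only the sign of the change sidesteps both percentage problems: every state, large or small, lands in one of the same three categories.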

Reference: "Maybe You Did Vote Your Pocketbook", New York Times, Nov 12 2006.

PS. If you like this post, consider sending me a holiday gift.


2006 Holidays

As the end of the year approaches, my posting schedule will be less predictable, and may at some point stop till 2007.

As before, I have put up a holiday wish list (it will be seen in the right column during the holidays).  If you've enjoyed this blog, consider a small gift to help stock my library.  My selection this year is a mixed bag of interesting reads.

2006 Holiday Wish List

Time travel


One of my scientific heroes and seminal teachers is Professor Frank Kelly at Cambridge.  What a pleasant surprise to see his involvement in a data visualization project.  To cite his wise words:

The travel-time maps are more than just pretty to look at; they also demonstrate an innovative way to use and present existing data. We are entering a world where we have access to vast quantities of data, and ways of turning that data into information, often involving clever ideas about visualisation, are becoming more and more important in science, government and our daily lives.

The little black dot near the center of the map indicates the Mathematics building at Cambridge.  The contours (vaguely visible at our scale) represent intervals of 10 minutes by public transportation away from the black dot.  Each colored dot on the map indicates the time at which a traveller must leave in order to get to the Math building by 9 am, taking into account traffic conditions, time of day, and decisions.  The hope is that such maps will help commuters (by public transit) plan their travel.

Professor Kelly has a very nice write-up on the intricacy of generating the data for such a map, which includes techniques of sampling, smoothing, extrapolation and so on.  It is rare that we get insights into the chart-making process.  He also carries a larger version of the travel-time map.

A similar article can be found at Plus magazine.



Behind the smokescreen lies the informative conclusion: among households with smokers, about 40% smoke in residence all the time while about half never smoke in residence.

This graphic, unfortunately chosen, contains many distractions from the main message, including:

  • the liberal sprinkling of colors
  • the inclusion of data for 1, 2, 3, 4, 5, 6 days, almost all of which were effectively zero
  • the redundant vertical scale, as all the data already appeared on the chart itself
  • the comparison of smokers to "total sample" (rather than non-smokers)

The last point merits special attention.  The total sample contains households with smokers as well as households without smokers. Any data from the total sample is a weighted average of these two types of households.  It is better to directly compare the two household types than to indirectly compare one type to the overall.

Further, households without smokers should be extremely likely to have no smoking in residence all week.  And if most households have no smokers (76% of this sample), then the statistics of the total sample will mimic those of no-smoker households.  That is to say, the total sample statistics do not add much to the analysis.  Our junkart version below corrects for this as well as other things.
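A quick back-of-envelope sketch shows why.  The 76% share and the "about half" never-smoke figure for smoker households come from the survey as described above; the 98% figure for no-smoker households is my own assumption for illustration, not a number from the report.

```python
def total_sample_share(share_no_smokers, rate_without_smokers, rate_with_smokers):
    """Total-sample rate as a weighted average of the two household types."""
    return (share_no_smokers * rate_without_smokers
            + (1 - share_no_smokers) * rate_with_smokers)

# Share of households with no smoking in residence all week.
never_total = total_sample_share(
    share_no_smokers=0.76,      # from the survey sample
    rate_without_smokers=0.98,  # ASSUMED for illustration
    rate_with_smokers=0.50,     # "about half", per the survey
)
print(round(never_total, 3))  # 0.865
```

Under these assumptions the total-sample figure sits much closer to the no-smoker households (98%) than to the smoker households (50%), so comparing smokers to the total understates the real gap.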

One of the key functions of a graph is data reduction, i.e. to aggregate data in such a way as to expose the information contained within.  Typically, a graph that uses aggregated data is clearer and stronger than one that plots every piece of data.  In this example, by combining 1-6 days into a single category ("smokes in residence part of the week"), we have a graph that is much more readable.
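The aggregation step itself is trivial, as this sketch suggests.  The percentages are hypothetical, chosen only to be roughly in line with the "about 40%" and "about half" figures quoted earlier; `collapse_days` is an illustrative helper.

```python
def collapse_days(percent_by_days):
    """Collapse a 0-7 days-per-week distribution into three categories."""
    return {
        "never": percent_by_days[0],
        "part of the week": sum(percent_by_days[d] for d in range(1, 7)),
        "all week": percent_by_days[7],
    }

# HYPOTHETICAL percentages of smoker households by days smoked in residence.
hypothetical = {0: 50, 1: 1, 2: 2, 3: 1, 4: 2, 5: 2, 6: 2, 7: 40}

print(collapse_days(hypothetical))
# {'never': 50, 'part of the week': 10, 'all week': 40}
```

Eight bars, six of them near zero, become three bars that carry the whole message.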

I want to thank Dr. Mike Rabinoff for inspiring me to look up these second-hand smoking statistics.  Mike recently published a book called "Ending the Tobacco Holocaust", which tells you more than you want to know about the tobacco industry.

Reference: "Second Hand Smoke Survey: Final Report", Madison Department of Public Health, Dec 2003.