
See what I mean

Andrew Sullivan said that the Rasmussen poll has been out of line with other polls in recent months, and he uses graphs from Pollster.com to make the point.  A good example of effective visualization.


Note what makes it work: identical vertical scales on both charts, identical time frames, matched colors (disapproving red, approving black).

Reference: "Rasmussen vs. the rest", Andrew Sullivan blog, Dec 29 2009.

[Update - 1/5/2010: A few constructive comments, including a stern note from "A Professor", sent me scrambling to see if I have been too trusting of the experts.  Thankfully, the original source of these charts, Pollster.com, provides interactive tools that can be used to test the suggestions. 

It is true that the Gallup poll is a counter-balance to the Rasmussen, biased in the opposite direction, although when one looks at the evidence below, one still has to conclude that the variation between the Rasmussen and all others is much more striking than that between the Gallup and all others.  In particular: the disapproval proportion has exceeded the approval since August in the Rasmussen when this pattern is still not completely clear in the aggregate of all other polls by December.  (The cross-over appears to be inevitable unless some new policy sways public opinion back to Obama soon.)


On balance, I'd still consider Sullivan's point to be valid.  Then again, I don't consider myself an expert on polls, so there could well be other anomalies hiding within the dozens of polls.  I am a bit intrigued/disturbed by the fact that Gallup apparently did not measure disapproval until August; or perhaps there was a glitch in the plotting software.

Of course, any poll or market research interview can be easily manipulated, via the selection of samples, leading questions, the structure of questionnaires, etc.  That's why the Pollster-style charts showing us the aggregate trends are crucial to look at.

Cherry-picking is to be frowned upon, but sometimes the cherry-picked item is indeed an outlier, and at other times it is not.  When an entire group is being taken out, and the underlying dataset is large, as is the case here, the risk of falsely throwing out good data is smaller, but it is always good to be vigilant.

On this last point, I'm again grateful to our vigilant readers for pointing out problems with the initial post.]

Data democracy

Until now, I had not been fully convinced of the direction of infographics -- I find the focus on organizing, structuring and visualizing large datasets too narrow; we often get pretty pictures with extremely high data-ink ratios but, more often than not, these very dense graphics fail to speak directly to readers.  We see a lot of information; we find hardly any insights.

I think I have seen the future.  My friend Adam has been working on a web service called Empirasign, which I will describe as a form of data democracy - he takes boatloads of financial data, runs all sorts of analyses and models, and presents these results in a variety of formats, including on-line reports and tweets.  He does not attempt to visualize all the data, or all possible relationships.  Each analysis or model focuses on specific matters and he presents the result in tables and charts.

For example, a business problem might be as follows (timely for the year-end): in my portfolio, I am carrying some loser stock which I'd like to sell by year end so I can take a tax deduction on the loss, perhaps to cover some investment gains I realized last year; however, I also believe that the loser stock may be near bottom, and if I sell now, I'd want to buy it back in short order - alas, this may be considered a "wash sale" and prohibited.  What if one can find a hedge (another stock or a portfolio of stocks) that replicates the performance of the loser stock, so that I can get the best of both worlds - I sell the loser stock for the tax deduction, but keep the performance by taking a position in the hedge, then unwind when the regulation allows me to buy back into the loser stock?  (If you are interested in this trade, you should consult the experts: Adam's tutorial or Wikipedia on "wash sale" or IRS-ese (pdf file).)

There are lots of stocks out there, and lots of possible hedges.  An unsophisticated investor like myself would have to spend a lot of effort to find the right hedge.  Also, it's very unlikely that staring at an infographics chart will uncover such hedges.  What Adam has done is he has collected all the required data and run analyses to find the right hedge for pretty much every (loser) stock out there.  And instead of presenting all the underlying data, he presents the results.  See below.


These data displays are not sexy - and can be improved (the explanation for the columns of the table is found on a separate page, e.g.), but for the target audience looking for trade ideas, they get to the point.  This is the gift of statistical data reduction.
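To make the idea of "data reduction" concrete, here is a minimal sketch of one way a hedge search could work: score each candidate by how closely its daily returns track the loser stock's, and report only the winner.  This is my own illustration, not Adam's method (which I have not seen); the tickers AAA, BBB, CCC and all the return numbers are made up, and a plain correlation is a much cruder yardstick than anything a real service would use.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of returns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_hedge(loser_returns, candidates):
    """Return the candidate whose returns correlate most with the loser's."""
    scores = {name: pearson(loser_returns, r) for name, r in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical daily returns (invented for illustration only).
loser = [-0.020, 0.010, -0.015, 0.030, -0.005, 0.012]
candidates = {
    "AAA": [-0.018, 0.012, -0.010, 0.025, -0.004, 0.010],  # tracks the loser
    "BBB": [0.010, -0.005, 0.020, -0.015, 0.000, -0.008],  # moves against it
    "CCC": [0.002, 0.001, 0.003, 0.002, 0.001, 0.002],     # barely moves
}

best, scores = best_hedge(loser, candidates)
print("best hedge:", best)
for name, r in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: correlation {r:+.2f}")
```

The reader of the output sees one ticker and a short ranked table, not the thousands of return series that went into it.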

What is also worth noting is that, through the magic of R and Web technologies, Adam makes all this run automatically, so the insights from the data are uncovered in real time.  The wash sale avoidance strategy is not the only analysis he provides; there are tons more on the website that implement all sorts of other techniques (of which I am no expert), but it appears that users can pick and choose whatever strategy they like to follow, and Empirasign saves them all of the analytical work.

As I said at the start of this post, I see this as a promising direction for infographics, moving from visualizing data to visualizing insights.

P.S. As with previous years, I have updated my Amazon wish list (click on button on top right).  If you'd like to show your support for this blog, please help me build out my library.  Thanks to those readers who have contributed in past years - since Amazon does not always provide me your contact information, I have not been able to thank each of you personally.  Happy holidays! 

Tiger tiger

Picked up the Metro paper the other day and found them ventilating about the possibility that Tiger Woods used steroids; the news was that a Canadian doctor he (and other professional athletes) hired has been caught with HGH and drug equipment. In the section on why Tiger couldn't be doping, the following chart appeared:


According to this line of argument, since steroids should improve driving distances, and since driving distance determines overall performance, the fact that his average driving distance "remained almost constant throughout the years" proved that he did not dope.

Now, I have no idea if he dopes or not.  But this particular argument is full of holes.  In the modern era, steroids are used not just for enhancing brute strength but also for shortening recovery times, prolonging training, etc.  Also, the argument holds only if overall performance is heavily affected by driving distance.

The bar chart has multiple problems:

  • The choice of starting the vertical scale at 250 is completely arbitrary, and as has been shown before, cutting off the bottoms of bars is a bad idea -- the lengths of the remaining parts are no longer proportional to the stated data.
  • The choice of the three years is also unexplained, especially when 2001 is not midway between 1997 and 2009.
  • The horizontal gridlines are totally redundant since all three numbers sit in the very last section (290-300).  
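The first point is easy to check with arithmetic.  A minimal sketch, using two made-up driving distances inside the 290-300 band (292 and 298; these are placeholders, not the numbers printed in the Metro chart):

```python
def bar_length_ratio(a, b, baseline=0.0):
    """Ratio of the drawn lengths of two bars when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


# Hypothetical values, chosen only so that both fall between 290 and 300.
low, high = 292.0, 298.0

true_ratio = bar_length_ratio(high, low)                   # axis starts at 0
drawn_ratio = bar_length_ratio(high, low, baseline=250.0)  # axis starts at 250

print(f"ratio in the data:      {true_ratio:.3f}")
print(f"ratio of drawn lengths: {drawn_ratio:.3f}")
```

With the axis starting at 250, the taller bar is drawn as 48/42 of the shorter one, while the data ratio is 298/292.  The reader's eye compares the bar lengths, so the chart overstates the difference.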

Why were those three years chosen?  The following line chart that plots all the data may give us a clue:


The choice of 2001 and 2009 means we missed the peak of his driving distance performance.  Looking at the standardized units, we see that at its peak, the driving distance was about 2.6 standard deviations above his career average (the zero line using the scale on the right).

The difference between 1997 and the peak was about 20, which looks large compared to the standard deviation of 6 over this entire period. Establishing a reference point is very important to interpreting any observed difference.

This is one of the few occasions where double axes can be recommended.  The two axes in fact plot the same data, only reflecting a difference in scale.
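For readers who want to see why the two axes are the same data, here is a minimal sketch of the standardization.  The yearly averages below are invented for illustration (I am not reproducing the actual series), and I am assuming the sample standard deviation; the chart may have used a slightly different convention.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert raw values to standard units: (value - mean) / standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def to_raw(z, m, s):
    """Invert the standardization: map a standard unit back to the raw scale."""
    return m + z * s


# Hypothetical average driving distances by season (made-up numbers).
distances = [294, 296, 293, 298, 297, 293, 299, 316, 301, 306, 302, 294, 298]

z = standardize(distances)
m, s = mean(distances), stdev(distances)

for raw, std in zip(distances, z):
    print(f"{raw:5d}  {std:+.2f}")
```

Because standard units are just a shift and a rescaling of the raw values, every point sits at the same height on both axes: the left axis labels it in raw units, the right axis in standard deviations from the career average.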

Reference: "Three reasons to believe he's totally clean", Metro USA, Dec 16 2009.


Here are some things I have been reading while I'm traveling (the posting schedule will be erratic):

Does the vaccine matter?  Shannon Brownlee and Jeanne Lenzer investigate for The Atlantic.  About 100 million Americans get the flu shot each year; what benefit does it confer?  This is an excellent article.

Some provocative quotes:

Flu comes and goes with the seasons, and often it does not kill people directly, but rather contributes to death by making the body more susceptible to secondary infections like pneumonia or bronchitis. For this reason, researchers studying the impact of flu vaccination typically look at deaths from all causes during flu season, and compare the vaccinated and unvaccinated populations.

The estimate of 50 percent mortality reduction is based on “cohort studies,” which compare death rates in large groups, or cohorts, of people who choose to be vaccinated, against death rates in groups who don’t. But people who choose to be vaccinated may differ in many important respects from people who go unvaccinated—and those differences can influence the chance of death during flu season. [Ed: people who can afford the flu shot vs. those who can't; people who are more health-conscious vs. those who aren't, etc.]

“For a vaccine to reduce mortality by 50 percent and up to 90 percent in some studies means it has to prevent deaths not just from influenza, but also from falls, fires, heart disease, strokes, and car accidents. That’s not a vaccine, that’s a miracle.”

In the flu-vaccine world, Jefferson’s call for placebo-controlled studies is considered so radical that even some of his fellow skeptics oppose it. ... “It is considered unethical to do trials in populations that are recommended to have vaccine,” a stance that is shared by everybody from the CDC’s Nancy Cox to Anthony Fauci at the NIH. They feel strongly that vaccine has been shown to be effective and that a sham vaccine would put test subjects at unnecessary risk of getting a serious case of the flu.
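The selection problem in the cohort-study quote can be seen in a toy simulation.  This is a minimal sketch under assumptions I made up: 20 percent of people are "frail", frail people are less likely to get the shot and more likely to die during flu season, and the vaccine itself has no effect at all.  None of these numbers come from the article.

```python
import random


def simulate_cohort(n=200_000, seed=0):
    """Death rates by vaccination status when the vaccine does nothing."""
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # vaccinated -> [deaths, people]
    for _ in range(n):
        frail = rng.random() < 0.2
        # Assumption: frail people are less likely to choose vaccination.
        vaccinated = rng.random() < (0.3 if frail else 0.7)
        # Assumption: death risk depends on frailty only, not on the vaccine.
        died = rng.random() < (0.05 if frail else 0.01)
        counts[vaccinated][0] += died
        counts[vaccinated][1] += 1
    return {status: deaths / people for status, (deaths, people) in counts.items()}


rates = simulate_cohort()
apparent_reduction = 1 - rates[True] / rates[False]

print(f"death rate, vaccinated:   {rates[True]:.4f}")
print(f"death rate, unvaccinated: {rates[False]:.4f}")
print(f"apparent mortality reduction: {apparent_reduction:.0%}")
```

Under these assumptions the expected death rates are about 1.4 percent for the vaccinated and 2.5 percent for the unvaccinated, so a naive cohort comparison credits a useless vaccine with a large "mortality reduction".  A randomized, placebo-controlled trial breaks the link between frailty and vaccination, which is the point of Jefferson's call.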

Another pie chart, Fox News (via FlowingData, thanks to reader Katherine M for the pointer; also via Wonkette and reader omegatron).

Yet another pie chart, Business Insider.  Not on the same scale as the one above, but still, why?

Clean Water Act Violations, New York Times. Can we trust tap water?  As usual, a set of small bars would work better than concentric circles.

How does your state compare to California? (via Pew and Mother Jones) This is a nice illustration that often it is better to plot data derived from the raw data, as opposed to the raw data itself.  Since the designer decided to hide the information, let's figure out what the cut-off points for the color categories were.  If the size of each category is not the same, the designer needs to explain the scale.  Also, the two shades of light blue are hard to tell apart.  But all in all, a good effort here.

The real climategate

Climategate is all the rage at the moment. What interests me about this episode is not the integrity of certain scientists, or science in general, nor the culture of academia, and certainly not the evidence of climate change. For me, the real climategate is the woeful state of statistical education.  Let me explain.

Here is the infamous email: (via Nathan Silver, with my highlights)

From: Phil Jones
To: ray bradley ,[email protected][snipped], [email protected]
Subject: Diagram for WMO Statement
Date: Tue, 16 Nov 1999 13:31:15 +0000
Cc: [email protected][snipped],[email protected][snipped]
Dear Ray, Mike and Malcolm,

Once Tim’s got a diagram here we’ll send that either later
today or first thing tomorrow. I’ve just completed Mike’s Nature
trick of adding in the real temps to each series for the last 20
years (ie from 1981 onwards)
amd [sic] from1961 for Keith’s to
hide the decline. Mike’s series got the annual land and marine
values while the other two got April-Sept for NH land N of 20N.
The latter two are real for 1999, while the estimate for 1999
for NH combined is +0.44C wrt 61-90. The Global estimate for
1999 with data through Oct is +0.35C cf. 0.57 for 1998.

Thanks for the comments, Ray.

Cheers, Phil

What concerns me is Phil Jones' describing what he did as a "trick" to "hide the decline".  He apparently thought that he was doing something shameful. But when is it shameful to extend the plot of a time series so as to display the long-term trend, and not be misled by short-term fluctuations? This is providing statistical context to the data being examined. Lots of people are condemning this as a willful act to mislead the public, but if they had some statistical literacy, they would understand that finding the appropriate time scale to look at the data is one of the most important tasks of analyzing time series data. It's a problem when even prominent scientists do not comprehend why they should be doing this.

I have always wondered why in climatology as well as in economics, we rarely see decomposed time-series plots (at least not in the public eye).

On the right, I found on-line a plot of a decomposition of beer sales that separates out seasonality, trend and other parts of a time series. The original data is shown up top. In practice, newspapers and blogs give us plots like the top one all the time when they should show us the third plot down (the trend with the seasonal factor removed), unless the story is about seasonality.
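For the curious, here is a minimal sketch of a classical additive decomposition (centered moving average for the trend, per-season averages for the seasonal part).  It is not necessarily the method behind the beer plot, and the quarterly series is synthetic: a straight-line trend plus a repeating pattern that I made up.

```python
def moving_average_trend(series, period):
    """Centered moving average; None where the window does not fit."""
    n, half = len(series), period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: half weight on the two end points (a 2 x period average).
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period
    return trend


def decompose(series, period):
    """Split a series into trend, seasonal and remainder (additive model)."""
    trend = moving_average_trend(series, period)
    detrended = [x - t if t is not None else None for x, t in zip(series, trend)]
    # Average the detrended values at each position in the seasonal cycle.
    season_means = []
    for k in range(period):
        vals = [d for i, d in enumerate(detrended) if d is not None and i % period == k]
        season_means.append(sum(vals) / len(vals))
    centre = sum(season_means) / period  # force the seasonal part to sum to zero
    seasonal = [season_means[i % period] - centre for i in range(len(series))]
    remainder = [x - t - s if t is not None else None
                 for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, remainder


# Synthetic quarterly data: linear trend plus a fixed seasonal pattern.
pattern = [10, -5, -8, 3]
series = [100 + 2 * i + pattern[i % 4] for i in range(24)]

trend, seasonal, remainder = decompose(series, period=4)
adjusted = [x - s for x, s in zip(series, seasonal)]  # seasonally adjusted series
```

Plotting `adjusted` (or `trend`) instead of `series` is what I mean by showing the trend with the seasonal factor removed.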

Note to self: should include basic time-series decomposition in the intro stats syllabus; much too important a topic to leave to a second course.