
Mean and median

In the comments on the last post about on-line weather forecasts, Hadley raised the evergreen statistical question of mean vs. median.  In this context, median error is unaffected by the occasional days on which the forecaster makes extreme errors, while mean error takes into account the magnitude of every forecasting error in the sample.

Which one to use depends on the situation.  Brandon, who did the original analysis, was motivated by planning a trip to an unfamiliar location.  In that case, he might be better served by a lower mean error, which would imply few extremely bad forecasts.

On the other hand, if I am interested in my local weather, then I'd likely be less concerned about a few extremely bad forecasts, and more concerned that the forecast is on the money on most days.  Then perhaps the median error would come into play.
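To make the contrast concrete, here is a toy calculation (the error values are made up for illustration): one terrible forecast inflates the mean error but leaves the median untouched.

```python
# Made-up absolute forecast errors (degrees) for eight days,
# including one extremely bad forecast on the last day
errors = [1, 0, 2, 1, 0, 1, 2, 15]

mean_error = sum(errors) / len(errors)   # dragged up by the 15-degree miss

sorted_errors = sorted(errors)
n = len(sorted_errors)
if n % 2 == 0:
    median_error = (sorted_errors[n // 2 - 1] + sorted_errors[n // 2]) / 2
else:
    median_error = sorted_errors[n // 2]

print(mean_error, median_error)   # 2.75 vs 1.0
```

Replace the 15 with a 2 and the two measures nearly agree; the gap between them is itself a signal that a few extreme misses are lurking in the sample.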

It turns out it doesn't much matter for our weather forecast data.  In this new chart, I superimposed the mean error data (in black) on the median error data (in red).  The scatter of points was essentially the same for both.  (MSN had a particularly bad forecast for a low temperature one day, which pulled its location to the left.)

This shows further that the difference between CNN, Intellicast and The Weather Channel is negligible.

Going out on a limb

Earlier in the month, Prof. Gelman linked to Brandon's fascinating analysis of on-line weather forecasting accuracy.  I have done some additional analysis of the data and the result can be visualized as follows.


I'll concentrate my comments on three observations:

  • CNN was the clear winner in forecasting accuracy during this period based on two criteria: its median error in forecasting daily lows, and its median error in forecasting daily highs.  Moreover, both the median errors were zero, which gives us confidence about its accuracy.  The Weather Channel (TWC) and Intellicast (INT) were not far behind.
  • Forecasts of daily highs were more accurate across the board than forecasts of daily lows (except for the BBC).  I am not sure why this should be so.
  • Overall, our weather forecasters were much too risk-averse.  Notice that the errors were heavily concentrated in the lower left quadrant.  A negative error on low temperatures means the predicted low is higher than the actual low; a negative error on high temperatures means the predicted high is lower than the actual high.  Taking these together, we observe that the range of actual temperatures has generally been larger than the range of predicted temperatures!  No one was willing to go out on a limb, so to speak, to forecast extremes.
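The sign conventions in that last bullet can be sketched with made-up numbers (none of these temperatures come from the actual data):

```python
# Hypothetical one-day forecast vs. actual temperatures (illustrative only)
predicted_low, actual_low = 40, 37
predicted_high, actual_high = 55, 58

# Following the sign conventions described above:
# negative low-error  => predicted low is higher than actual low
low_error = actual_low - predicted_low        # 37 - 40 = -3
# negative high-error => predicted high is lower than actual high
high_error = predicted_high - actual_high     # 55 - 58 = -3

predicted_range = predicted_high - predicted_low   # 15
actual_range = actual_high - actual_low            # 21

# Both errors negative: the forecast compressed the temperature range
print(low_error, high_error, predicted_range < actual_range)
```

A point in the lower left quadrant thus corresponds to a day on which the forecast squeezed both ends of the actual temperature range inward.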

Actually, I believe this inability or unwillingness to forecast extreme values is endemic to all forecasting methodologies.

Before closing, I mention that the graph was based on a subset of Brandon's data.  I only considered same-day forecasts, did not consider Unisys (because they didn't provide forecasts for lows), and also noted that there might be bias since there were breaks in the time series.  Also, I retained the sign information and didn't take absolute values as Brandon did.

Bubbles of death 2

Here is an alternative way to present the death risk data.  It's a variation of Tukey's stem-and-leaf plot.  Instead of presenting the exact odds, I believe it is sufficient to generalize the data by grouping them into categories.  Not much is to be gained by knowing that the odds of dying from fire and smoke are 1 in 1,113 as opposed to knowing that the odds are in the range 1 in 1,000 to 1 in 10,000 and comparable to that of drowning, motorcycle accident, etc.
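The grouping can be done mechanically by binning the 1-in-N odds by order of magnitude.  A quick sketch (the 1,008 figure for the comparison cause is invented for illustration; only the 1,113 comes from the data above):

```python
import math

def odds_bucket(n):
    """Place 1-in-n odds into an order-of-magnitude category."""
    low = 10 ** math.floor(math.log10(n))
    return f"1 in {low:,} to 1 in {low * 10:,}"

print(odds_bucket(1113))   # fire and smoke -> "1 in 1,000 to 1 in 10,000"
print(odds_bucket(1008))   # an invented comparison lands in the same bucket
```

Causes that fall into the same bucket are, for the reader's purposes, comparable risks; that is all the stem of a stem-and-leaf display needs to convey.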


PS. Be sure to look at Derek's chart in the comments.

Bubbles of death

Thanks to Dustin J for bringing this stupendous chart to our attention.  I have to admit I have trouble understanding it.  The red curve appears to be part of a gigantic circle, confirming that all lives do end on this earth.  How it is connected to the rest of the chart, I am unable to discern.  In addition, the trajectory of the bubbles, the overlaps between bubbles, and the separation between bubbles all may or may not carry meaning.


Reference: "What are the odds of dying?", National Safety Council.

Mirror, mirror

Mirror, mirror on the wall...

I don't see what the second line adds to this plot, given there were only two candidates in this election. 

Political graphs do not get much better than those at the Political Arithmetik blog.

For instance, in the chart below, he wisely chose to draw trend-lines rather than connecting the individual dots.  Also, typically, he plots dots for all the different polls, which allows us to assess the variability (reliability) of the observed trend.


Reference: "Sarko embraces the Anglo-Saxons", Economist, Feb 3 2007.

The sum and the parts

Over the last few years, Intrade — with headquarters in Dublin, where the gambling laws are loose — has become the biggest success story among a new crop of prediction markets. The world’s largest steel maker, Arcelor Mittal, now runs an internal market allowing its executives to predict the price of steel. Best Buy has started a market for employees to guess which DVDs and video game consoles, among other products, will be popular. Google and Eli Lilly have similar markets. The idea is to let a company’s decision-makers benefit from the collective, if often hidden, knowledge of their employees.

I haven't participated in any "prediction market," but past statistical work tells me that within each such market, roughly half the participants will have individual track records better than the market average.  Thus, you can beat the market average if you can predict the predictors: figure out which ones would drag down your average.

In other words, averaging opinions is a double-edged sword.  While some will provide "hidden" knowledge, others may provide "bad" information, which gets averaged too.
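A toy calculation of that double-edged sword (all numbers invented): averaging in one badly-informed predictor moves the market estimate away from the truth.

```python
true_value = 100

# Hypothetical individual predictions; the last predictor is far off
predictions = [98, 102, 101, 120]

market_average = sum(predictions) / len(predictions)   # 105.25
informed_average = sum(predictions[:3]) / 3            # ~100.33

print(abs(market_average - true_value))    # 5.25: dragged off by the outlier
print(abs(informed_average - true_value))  # well under 1
```

Of course, the hard part in practice is that you do not know in advance which predictor is the badly-informed one; that is precisely the "predicting the predictors" problem.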

In substance, prediction markets are no different from so-called ensemble predictors which have been studied extensively in the statistical data mining area in recent years.  I am of the opinion that such things have proven more useful in increasing the stability of error rates than in improving the average error rates themselves.

Phil's take can be read here.

Reference: "Odds Are, They'll Know '08 Winner", New York Times, Feb 13 2007.

Horrid stuff 2

Jon P took my comment on negative correlation and explored it further.  Given the large ranges of values cited in the original Economist chart, Jon concluded that there wasn't enough evidence to make a judgement.

I agree to a large extent.  Apart from the high variability of individual measurements, we also face the tiny sample of 5 cities. 
In his chart, he made an implicit assumption that the correlation of two factors is related to the product of the ranges (variability) of each factor by plotting the rectangles.

A different way of looking at it is to plot only the mid-range values (i.e. ignoring the within-city variability).  The graph on the left hand side shows very little pattern.

Resorting to the formula, I found that the correlation = -0.03, so the negative correlation is barely detectable.  Let's visualize this. 

On the right graph, I added the mean lines for both variables.  This divides the graph into four quadrants; dots that fall into the lower right and upper left quadrants pull the correlation toward negative values.  There were three of those versus two in the positive quadrants; hence the tiny negative correlation. 
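The quadrant decomposition can be reproduced from first principles.  A sketch with five invented city readings (not the Economist figures, but chosen to give a similarly tiny negative correlation):

```python
# Invented mid-range readings for five cities
no2 =          [25, 30, 35, 50, 60]
particulates = [33, 32, 26, 25, 34]

mean_x = sum(no2) / len(no2)                    # 40
mean_y = sum(particulates) / len(particulates)  # 30

# Pearson correlation: covariance over the product of standard deviations
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(no2, particulates))
var_x = sum((x - mean_x) ** 2 for x in no2)
var_y = sum((y - mean_y) ** 2 for y in particulates)
r = cov / (var_x * var_y) ** 0.5

# Points in the upper-left and lower-right quadrants contribute negative
# terms to the covariance; the other two quadrants contribute positive terms
negatives = sum((x - mean_x) * (y - mean_y) < 0 for x, y in zip(no2, particulates))
positives = sum((x - mean_x) * (y - mean_y) > 0 for x, y in zip(no2, particulates))

print(round(r, 3), negatives, positives)   # small negative r, 3 vs 2
```

With three negative-quadrant points against two positive ones, and no single point dominating, the terms nearly cancel and r lands just below zero.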

Horrid stuff

Small multiples can work wonders when data are replicated, as in this case.  The chart accompanied an Economist article on pollution levels in several European cities, as indicated by the concentration of nitrogen dioxide and particulates.

In the junkart version, I plotted the data series side by side, rather than one over the other.  Further, the cities were ordered by decreasing levels of NO2, which seemed to be the worse of the two pollutants.  All gridlines are removed except the 30 line, which works pretty well to separate out the highly polluted cities.

An odd pattern has now surfaced: there is some degree of negative correlation between the concentrations of the two pollutants.  Environmental scientists may be able to tell us why.

Reference: "The Big Smoke", Economist, Feb 3 2007.

Digging it out

Another sunset photo compilation?  Not quite.

This chart acts and smells like the sunset chart, being generated by many unknowing collaborators, in this case visitors to the content-aggregation site Digg.  For those unfamiliar: web users can "digg" any web page they find interesting (by clicking on an image), which causes a link to be generated at Digg's web-site.  We can use the number of Diggs to judge the value or popularity of a web page.

In effect, Digg is a gigantic save folder for the masses.  What happens when we have huge amounts of data?  We have to work really hard to dig out the useful information.  This chart goes quite a long way to answer one specific question.

Digg users are plotted horizontally and the stories they dugg are plotted vertically.  The bright white vertical strip represents suspicious activity: some user dugg a large number of stories within the time window of the chart, most likely a bot trying to usurp the mass rating system.
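A bot of that sort is easy to flag once the diggs are tabulated per user.  A minimal sketch with invented usernames and stories (the threshold of 5 is an arbitrary cutoff for illustration):

```python
from collections import Counter

# Hypothetical (user, story) digg events within some time window
diggs = [
    ("alice", "story1"), ("alice", "story2"),
    ("bob", "story1"),
    ("bot42", "story1"), ("bot42", "story2"), ("bot42", "story3"),
    ("bot42", "story4"), ("bot42", "story5"),
]

per_user = Counter(user for user, _ in diggs)

# A user digging far more stories than typical shows up as a bright
# vertical strip in the user-by-story matrix
threshold = 5
suspects = [user for user, count in per_user.items() if count >= threshold]
print(suspects)   # ['bot42']
```

The chart makes the same tabulation visible at a glance, which is exactly its value: the anomaly jumps out without any threshold having to be chosen.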

Flickr and Digg are two of the more prominent stories of the so-called "Web 2.0", or mass collaboration on the Web.  Between my last post and this post, I have kind of lost enthusiasm for this type of chart, at least from a statistical perspective.  There is no real collaboration: the photographer who contributed sunset No. 103 does not know the one who uploaded No. 31, for example.  By this logic, every survey or census ever conducted would qualify as mass collaboration, just because many participants provide data. 

What's worse, a typical survey brings together results from a random sample, whereas these charts all have highly biased samples, and I haven't seen any discussion of this issue yet.  They cannot be interpreted without understanding who participated.

Reference: "How Digg Combats Cheaters", Technology Review, Jan 24, 2007.

Error spotting

My friend Augustine pointed me to this interesting graph showing the time of sunset over the course of a year.  (The original author's write-up is here.)


Of course, one can produce a perfect chart by looking up meteorological records.  The main interest in this graph is how it was constructed.  Each cell in the graph represents an hour of a day, with days running across and hours running down.  The cells that are not dark each contain a photograph of the sunset contributed to Flickr, the photo-sharing site.  So this is in effect a graph created through mass collaboration (about 35,000 photos).
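The construction can be sketched as binning photo timestamps into a day-by-hour grid, with cell brightness reflecting how many "sunset" photos fall into each cell (the timestamps below are invented):

```python
from collections import defaultdict

# Hypothetical Flickr photos tagged "sunset": (day_of_year, hour_of_day)
photos = [(10, 17), (10, 17), (10, 18), (11, 17), (200, 20), (200, 3)]

grid = defaultdict(int)
for day, hour in photos:
    grid[(day, hour)] += 1   # brightness of the cell for that day and hour

# The (200, 3) photo is a stray lighted cell far from the sunset band:
# perhaps a wrong camera clock or a mistagged photo
print(grid[(10, 17)], grid[(200, 3)])   # 2 1
```

With 35,000 photos, the cells along the true sunset curve accumulate many photos and glow white, while stray cells like the 3 a.m. one stay isolated; that is the variability discussed below.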

The "white" band roughly indicates the sunset.  What intrigues me is the variability... what are the reasons for lighted cells appearing all over the graph?

Some ideas include:

  • Different time zones
  • Incorrect time setting by some photographers
  • Erroneous tagging of photos as "sunset"