## Mean and median

##### Feb 27, 2007

In the comments of the last post on on-line weather forecasts, Hadley raised the evergreen statistical question of mean vs median.  In this context, median error is unaffected by particular days in which the forecaster makes extreme errors while mean error takes into account the magnitude of every forecasting error in the sample.
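To make the distinction concrete, here is a minimal sketch using made-up absolute forecast errors (in degrees); the numbers are illustrative assumptions, not Brandon's data. Adding one extremely bad day leaves the median where it was but roughly doubles the mean.

```python
from statistics import mean, median

# Hypothetical absolute forecast errors in degrees (invented for illustration).
typical_days = [1, 2, 1, 3, 2, 1, 2]
with_bad_day = typical_days + [15]  # one day with an extreme miss


def summarize(errors):
    """Return (mean error, median error) for a list of absolute errors."""
    return mean(errors), median(errors)


mean_typical, median_typical = summarize(typical_days)
mean_bad, median_bad = summarize(with_bad_day)

print(f"typical days : mean={mean_typical:.2f}, median={median_typical}")
print(f"with bad day : mean={mean_bad:.2f}, median={median_bad}")
```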

Which one to use depends on the situation.  Brandon, who did the original analysis, was motivated by planning a trip to an unfamiliar location.  In this case, he might be better served by lower mean error, which would imply few extremely bad forecasts.

On the other hand, if I am interested in my local weather, then I'd likely be less concerned about a few extremely bad forecasts, and more concerned that the forecast is on the money on most days.  Then perhaps the median error would come into play.

It turns out it doesn't much matter for our weather forecast data.  In this new chart, I superimposed the mean error data (in black).  The scatter of points was exactly as it was for median error (in red).  (MSN had a particularly bad forecast for a low temperature one day, which pulled its location to the left.)

This shows further that the difference between CNN, Intellicast and The Weather Channel is negligible.

I think you misinterpreted my remark - I'll try and explain myself better: One possible predictor for the temperature is simply to use the average temperature. This is obviously a pretty bad predictor, as it doesn't take into account any extra information we have about a particular day. However, the mean (or median) error of this predictor will be very close to 0 (exactly 0 if we know the true mean), meaning that it would appear to be a good predictor on this scale. Or have I missed something?
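A small sketch of this point, assuming the error in question is the signed error (forecast minus actual) and using invented daily highs: a constant "predict the average" forecast has a mean signed error of zero even though it misses by several degrees on a typical day.

```python
from statistics import mean

# Hypothetical daily high temperatures in degrees F (invented for illustration).
actual = [30, 45, 38, 52, 41, 28, 46]

# The naive predictor: forecast the average temperature every day.
climatology = mean(actual)

# Signed errors cancel out; absolute errors do not.
signed_errors = [climatology - t for t in actual]
mean_signed_error = mean(signed_errors)
mean_absolute_error = mean(abs(e) for e in signed_errors)

print(f"mean signed error   : {mean_signed_error}")
print(f"mean absolute error : {mean_absolute_error:.2f}")
```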

I did misinterpret your original comment. I'd add a nuance to your new comment... these forecasters are predicting the highs and lows for each day, which means they are predicting the extreme values of a distribution, rather than the central values.

It's an intriguing problem: the historical time series of high temperatures are all extreme values; what would be the best way to derive a prediction from such a series? I reckon the within-series variation is so high that the strategy of predicting the average of the time series would not work that well.

There are other ways of changing the weight of extremes in the mean than the median, such as geometric, harmonic or logarithmic means... What about them?
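As a rough illustration of that suggestion, the sketch below compares the arithmetic, geometric and harmonic means on one set of invented absolute errors, using Python's standard `statistics` module. It assumes every error is strictly positive, since a zero or negative value is not meaningful for the geometric mean; the logarithmic mean is left out.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical absolute forecast errors in degrees, all strictly positive
# (invented for illustration; a zero error would need special handling).
errors = [1, 2, 1, 3, 2, 15]

arithmetic = mean(errors)
geometric = geometric_mean(errors)
harmonic = harmonic_mean(errors)

# The single extreme error (15) pulls the arithmetic mean up the most
# and the harmonic mean the least.
print(f"arithmetic: {arithmetic:.2f}")
print(f"geometric : {geometric:.2f}")
print(f"harmonic  : {harmonic:.2f}")
```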

The mean or median only determines the bias of the predictor; what is important is the variance of the error. As Hadley pointed out, using the mean minimum as the predictor will have zero bias but is a really poor predictor. It is quite possible that a biased predictor has a lower error variance, but for a minimum this would likely be a positive bias. The bias toward low values is probably due to the greater consequences of unexpectedly low temperatures, which cause safety problems.

That we are estimating the extreme of a distribution is irrelevant; the extreme has its own distribution. After all, it is just a series of numbers.

I would look at two things in addition to the bias: the standard deviation of the errors (as this also includes the bias), and the frequency with which predictions differ from the true value by some number of degrees.
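These diagnostics can be sketched as follows, using invented forecasts and observations and a hypothetical 3-degree threshold. One caveat: a standard deviation taken about the mean error excludes the bias, so the sketch also computes the root-mean-square error, which combines the two (RMSE² = bias² + SD²).

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical forecast and observed low temperatures in degrees F
# (invented for illustration).
forecast = [32, 41, 28, 35, 30, 25]
actual = [30, 42, 31, 35, 29, 33]

errors = [f - a for f, a in zip(forecast, actual)]

bias = mean(errors)                         # average signed error
spread = pstdev(errors)                     # SD of errors about their mean
rmse = sqrt(mean(e * e for e in errors))    # combines bias and spread

# Share of days on which the forecast missed by at least `threshold` degrees.
threshold = 3  # assumed cut-off, chosen only for this example
share_off = sum(abs(e) >= threshold for e in errors) / len(errors)

print(f"bias={bias:.2f}, sd={spread:.2f}, rmse={rmse:.2f}")
print(f"share of days off by >= {threshold} degrees: {share_off:.2f}")
```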
