There is a healthy debate going on about last night's Michigan primary on the Democratic side, in which Bernie Sanders pulled off a major upset. Polls leading up to the primary put Hillary Clinton ahead by about 20 percentage points. That would be a huge margin. FiveThirtyEight, which has been doing a stellar job covering the elections, uses polls heavily in generating forecasts, and they had predicted that Clinton would win with greater than 99% probability.

The trendy thing to do these days is to issue probability forecasts. I have found it frustrating to be on the receiving end of such forecasts.

Some 538 readers took this opportunity to knock Nate Silver (plus almost every other forecaster) for getting Michigan completely wrong. But Nate has his supporters too. Here is a typical reaction:

From a purely technical perspective, this statement is correct. But the same statement would have been correct had the forecast been 98%, 97%, ..., 1%!

In theory, on sufficient repeated observations, one can measure the accuracy of such forecasts. But presidential elections only occur once every four years. So there isn't enough data to do this.

I am not piling on 538. I do think they are the best in the business and I respect Nate's ability. But there is the issue of evaluating accuracy when it comes to probability forecasts. If someone says there is a 60% chance something will happen, and it happens, is that accurate? If it doesn't happen, is that accurate?

The other reason given above for the miss is that one can't blame the model if the input data, in this case, the polling data, were bad. This again is technically correct. It is a version of garbage in, garbage out. But as someone who builds models, I must admit using this reason is a cop-out. (Just make a model that clings to the polling data.)

There are multiple primaries though, so one can check calibration on those. Or multiple congressional races. (Of course, some of these could be dependent.)

I'm sure you read http://andrewgelman.com/2012/12/06/yes-checking-calibration-of-probability-forecasts-is-part-of-bayesian-statistics/ at some point.

Posted by: Dean Eckles | 03/09/2016 at 11:58 AM

Dean: Thanks for the link to Andrew. He and I are on the same page about the need to measure model goodness even in the Bayesian world. It's interesting that Nate Silver's book was the starting point of that discussion because I remarked positively about Nate's interest in measuring accuracy in my review of his book. That is one of the reasons why I said I trust Nate.

Checking calibration is what I had in mind. It's more complicated than it seems (or maybe I am missing something here). If we group predictions across districts and races, as suggested, we would have to group predictions by the posterior probabilities (>99%, 95-99%, etc.). Say we want to evaluate all 60%-70% predictions. For each 60% prediction, we have to score it a yes or no based on the actual binary outcome, which leads to my question above.
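One common way to read the grouping idea is that no single forecast gets scored as right or wrong; instead, the share of "yes" outcomes within each probability bin is compared with the average forecast in that bin. Here is a minimal Python sketch of that bookkeeping. The forecasts and outcomes are simulated (and calibrated by construction), and the function name and bin edges are arbitrary choices for illustration, not anything 538 publishes.

```python
import random

def calibration_table(probs, outcomes, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Group forecasts into probability bins and compare the mean
    forecast with the observed frequency of the event in each bin."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # The last bin is closed on the right so a forecast of exactly 1.0 counts.
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (hi == edges[-1] and p == hi)]
        if not idx:
            continue  # skip empty bins
        mean_forecast = sum(probs[i] for i in idx) / len(idx)
        observed = sum(outcomes[i] for i in idx) / len(idx)
        rows.append((lo, hi, len(idx), mean_forecast, observed))
    return rows

# Simulated data (an assumption for illustration): each event occurs with
# exactly the forecast probability, so the forecasts are calibrated.
rng = random.Random(0)
probs = [rng.random() for _ in range(20000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

table = calibration_table(probs, outcomes)
for lo, hi, n, mean_forecast, observed in table:
    print(f"{lo:.1f}-{hi:.1f}  n={n:5d}  "
          f"mean forecast={mean_forecast:.3f}  observed={observed:.3f}")
```

With only a handful of forecasts per bin, the observed frequencies would be far too noisy to say much, which is the sample-size problem in a different guise.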

Another complication is the dynamic updating of the posterior probabilities. Any given probability forecast is only valid for a given time window.

And then there is a non-technical issue: unless the researcher computes and discloses such calculations, it is hard (impossible) for users to judge the accuracy based on revealed outcomes. Of course, people are doing this all over the place, often based on samples of one, which is what prompted that post.

Posted by: Kaiser | 03/10/2016 at 09:52 AM

Nate and others got Michigan wrong. The British election results in 2015 also defied the polls -- a much bigger problem.

We all know that the high level of nonresponse, and the difficulty of determining who's actually going to vote, create a lot of difficulty in prediction. We're actually dealing with responses that reflect a lot of bias, and it's likely we haven't quite figured out the conditions under which the bias is going to be more or less severe.

Posted by: zbicyclist | 03/10/2016 at 11:36 AM

I think these probability forecasts are just like the NBA championship probability forecasts. At the beginning of the playoffs, many experts gave the Golden State Warriors a 94% probability of winning the championship. But now they are down 1–3 against the Oklahoma City Thunder in the Western Conference Finals, and the experts put the Warriors' chance of winning the championship at only 4%.

This kind of prediction is like predicting yesterday's weather in today's weather broadcast. It seems to make no sense because the probability keeps changing as events unfold. However, I think there may be a more profound influence. The probability may influence betting or other decision models. Everyone knows the Golden State Warriors will most likely be the champions (they were so good in the regular season, setting the best-ever NBA season record of 73–9). However, people do not know exactly what the probability is. If the probability is 70%, they may put 70% of their money on the Golden State Warriors. And if the probability is 94%, they will naturally invest more.

So I think the same may hold for presidential election forecasts. Many companies' market strategies and many business models may need to change with the quantitative probability forecasts. That may be the potential influence of a quantitative probability forecast. :)

Some naive ideas. :)

Xiuyang

Posted by: Xiuyang | 05/26/2016 at 06:24 AM

Would you have any problem with the following procedure to test for calibration of probabilistic forecasts:

- Get the vector of predicted probabilities X and vector of outcomes Y (Y_i is 0 or 1)

- Compute a smoothing-spline GAM model

P(Y_i = 1) = logit^-1 (f(logit(X_i)))

- Look at the graph of the function y = f(x) that got estimated. If it is close to y = x then predictions are calibrated.

?

Of course it's a graphical type check so won't give a sharp yes/no distinction between calibrated and uncalibrated, but otherwise I think it's ok. I'd be very interested to know if there's some point I'm missing that makes it a bad procedure.
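As a rough illustration of the procedure, here is a Python sketch of a deliberately simplified version: f is restricted to a straight line, f(z) = a + b*z, rather than a smoothing spline, so it is a logistic recalibration fit and not the GAM itself. Calibrated forecasts should give a near 0 and b near 1. The data are simulated and the function names are made up for this example.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_linear_recalibration(probs, outcomes, iters=25):
    """Fit P(Y=1) = inv_logit(a + b * logit(X)) by Newton's method.
    A linear f is a simplified stand-in for the smoothing spline."""
    z = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # start at the "perfectly calibrated" line
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, outcomes):
            mu = inv_logit(a + b * zi)
            w = mu * (1.0 - mu)
            r = yi - mu
            g0 += r            # gradient w.r.t. a
            g1 += r * zi       # gradient w.r.t. b
            h00 += w           # entries of the 2x2 information matrix
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

rng = random.Random(1)
# Forecasts kept away from 0 and 1 so logit stays finite (an assumption).
probs = [rng.uniform(0.02, 0.98) for _ in range(20000)]

# Case 1: calibrated forecasts, outcomes drawn with the stated probability.
y_calibrated = [1 if rng.random() < p else 0 for p in probs]
a_cal, b_cal = fit_linear_recalibration(probs, y_calibrated)

# Case 2: overconfident forecasts, true log-odds are half the stated log-odds.
y_overconf = [1 if rng.random() < inv_logit(0.5 * logit(p)) else 0 for p in probs]
a_over, b_over = fit_linear_recalibration(probs, y_overconf)

print(f"calibrated:    a={a_cal:.3f}, b={b_cal:.3f}")
print(f"overconfident: a={a_over:.3f}, b={b_over:.3f}")
```

A slope well below 1 signals forecasts that are too extreme. A spline or LOESS fit would pick up departures from y = x that a straight line cannot.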

Posted by: Kit | 05/27/2016 at 12:44 AM

Ok, so a little investigation reveals this is similar to what Harrell recommends except that importantly he uses resampling to correct for bias. He also uses LOESS instead of smoothing splines. I can't think of any reason to prefer one or the other. See his rms package https://cran.r-project.org/web/packages/rms/index.html

Posted by: Kit | 05/29/2016 at 07:53 PM