There are multiple primaries though, so one can check calibration on those. Or multiple congressional races. (Of course, some of these could be dependent.)

I'm sure you read http://andrewgelman.com/2012/12/06/yes-checking-calibration-of-probability-forecasts-is-part-of-bayesian-statistics/ at some point.

Dean: Thanks for the link to Andrew. He and I are on the same page about the need to measure model goodness even in the Bayesian world. It's interesting that Nate Silver's book was the starting point of that discussion because I remarked positively about Nate's interest in measuring accuracy in my review of his book. That is one of the reasons why I said I trust Nate.

Checking calibration is what I had in mind. It's more complicated than it seems (or maybe I am missing something here). If we group predictions across districts and races, as suggested, we would have to group predictions by the posterior probabilities (>99%, 95-99%, etc.). Say we want to evaluate all 60%-70% predictions. For each 60% prediction, we have to score it a yes or no based on the actual binary outcome, which leads to my question above.
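To make the grouping idea concrete, here is a minimal sketch of a binned calibration check. It rests on assumptions not in the discussion above: the forecasts are simulated (calibrated by construction), the bins are ten equal-width ranges rather than the >99%, 95-99% grouping, and the function name `calibration_table` is made up for illustration. Each individual forecast is scored 0 or 1 against its outcome; calibration is judged only at the bin level, by comparing the bin's mean forecast with the observed frequency.

```python
import random

def calibration_table(probs, outcomes, edges):
    """Bin forecasts by probability and compare each bin's mean forecast
    with the observed frequency of the event in that bin."""
    rows = []
    last = len(edges) - 2
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Bins are [lo, hi); the final bin also includes its right edge.
        in_bin = [i for i, p in enumerate(probs)
                  if lo <= p < hi or (k == last and p == hi)]
        if not in_bin:
            continue
        n = len(in_bin)
        mean_forecast = sum(probs[i] for i in in_bin) / n
        # Each outcome is scored 0 or 1; the bin average is the hit rate.
        observed_freq = sum(outcomes[i] for i in in_bin) / n
        rows.append((lo, hi, n, mean_forecast, observed_freq))
    return rows

# Simulated forecasts that are calibrated by construction (not real election data).
rng = random.Random(0)
probs = [rng.random() for _ in range(20000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

edges = [i / 10 for i in range(11)]  # 0-10%, 10-20%, ..., 90-100%
table = calibration_table(probs, outcomes, edges)
for lo, hi, n, mean_forecast, observed_freq in table:
    print(f"{lo:.0%}-{hi:.0%}: n={n}, "
          f"mean forecast={mean_forecast:.3f}, observed={observed_freq:.3f}")
```

With real forecasts the bins would hold far fewer cases than in this simulation, so the gap between mean forecast and observed frequency would be much noisier, which is part of why samples of one tell us so little.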

Another complication is the dynamic updating of the posterior probabilities. Any given probability forecast is only valid for a given time window.

And then there is a non-technical issue: unless the researcher computes and discloses such calculations, it is hard (impossible) for users to judge the accuracy based on revealed outcomes. Of course, people are doing this all over the place, often based on samples of one, which is what prompted that post.

Nate and others got Michigan wrong. The British election results in 2015 also defied the polls -- a much bigger problem.

We all know that the high level of nonresponse, and the difficulty of determining who's actually going to vote, create a lot of difficulty in prediction. We're actually dealing with responses that reflect a lot of bias, and it's likely we haven't quite figured out the conditions under which the bias is going to be more or less severe.

I think these probability forecasts are just like NBA championship probability forecasts. At the beginning of the playoffs, many experts predicted a 94% probability that the Golden State Warriors would win the championship. But now they are down 1–3 against the Oklahoma City Thunder in the Western Conference Finals, and the experts put the Warriors' chance of winning the championship at only 4%.

This kind of prediction is like predicting yesterday's weather in today's weather broadcast. It seems to make no sense because the probability always changes as events unfold. However, I think there may be a more profound influence. The probability may affect betting or other decision models. Everyone knows the Golden State Warriors are the most likely champions (they were so good in the regular season, setting the best-ever NBA regular-season record of 73–9). However, people do not know exactly how high the probability is. If the probability is 70%, they may put 70% of their money on the Golden State Warriors. And if the probability is 94%, they will naturally invest more.

So I think the same may hold for presidential election forecasts. Many companies' market strategies and many business models may need to change with the quantitative probability forecasts. That may be the potential influence of putting a quantitative probability on the prediction. :)

Some naive ideas. :)

Xiuyang

Would you have any problem with the following procedure to test for calibration of probabilistic forecasts?

- Get the vector of predicted probabilities X and the vector of outcomes Y (Y_i is 0 or 1).

- Fit a smoothing-spline GAM: P(Y_i = 1) = logit^-1(f(logit(X_i))).

- Look at the graph of the estimated function y = f(x). If it is close to y = x, then the predictions are calibrated.

Of course it's a graphical type check so won't give a sharp yes/no distinction between calibrated and uncalibrated, but otherwise I think it's ok. I'd be very interested to know if there's some point I'm missing that makes it a bad procedure.
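As a rough illustration of the procedure above, here is a sketch that swaps in a much simpler model than the smoothing-spline GAM: f is restricted to a straight line, f(z) = a + b*z, so the check reduces to a logistic regression of Y on logit(X). Calibrated forecasts should give a close to 0 and b close to 1. This is a stand-in and not the spline fit itself. The data are simulated, and the function name `fit_linear_recalibration` and the "overconfident forecaster" are made up for illustration.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_linear_recalibration(probs, outcomes, n_iter=25):
    """Fit P(Y=1) = sigmoid(a + b*logit(x)) by Newton's method; return (a, b).
    Assumes every forecast is strictly between 0 and 1."""
    z = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, outcomes):
            mu = sigmoid(a + b * zi)
            w = mu * (1.0 - mu)
            r = yi - mu
            g0 += r              # gradient of the log-likelihood w.r.t. a
            g1 += r * zi         # gradient of the log-likelihood w.r.t. b
            h00 += w             # entries of the negative Hessian
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

rng = random.Random(1)
# Simulated forecasts, kept away from 0 and 1 so logit() stays finite.
probs = [rng.uniform(0.02, 0.98) for _ in range(20000)]

# Case 1: outcomes drawn with exactly the stated probabilities (calibrated).
y_cal = [1 if rng.random() < p else 0 for p in probs]
a_cal, b_cal = fit_linear_recalibration(probs, y_cal)

# Case 2: a hypothetical overconfident forecaster whose true log-odds
# are only half of the stated log-odds.
y_over = [1 if rng.random() < sigmoid(0.5 * logit(p)) else 0 for p in probs]
a_over, b_over = fit_linear_recalibration(probs, y_over)

print(f"calibrated:    a={a_cal:.3f}, b={b_cal:.3f}")
print(f"overconfident: a={a_over:.3f}, b={b_over:.3f}")
```

A straight line can only detect a shift or a uniform stretch on the logit scale. A spline or LOESS curve would also pick up miscalibration confined to one part of the probability range, which is presumably the reason to prefer a flexible f.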

Ok, so a little investigation reveals this is similar to what Harrell recommends except that importantly he uses resampling to correct for bias. He also uses LOESS instead of smoothing splines. I can't think of any reason to prefer one or the other. See his rms package https://cran.r-project.org/web/packages/rms/index.html

Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.