My article on whether we can trust airfare prediction models is published today at FiveThirtyEight, the new data journalism venture launched by Nate Silver after he moved to ESPN.
This topic was originally conceived as a chapter of Numbersense (link) but I dropped it. As I have noted in my review of Nate Silver's book, he has a keen interest in evaluating predictions, and not surprisingly, he encouraged me to get this piece done.
Putting Big Data to the Test
Just like Google Flu Trends (link), Oren Etzioni's Farecast has been held up as a Big Data success story. I have been a Farecast user for years, and I've always wondered how accurate those predictions are. If you're a user, you've probably wondered as well. I have also complained that Big Data practitioners are too lax about offering quantitative evidence for their Big Data projects--it's a bit ironic when we tell others to use data and throw away their gut feelings.
One of the reasons for this oversight is that it is hard work to evaluate predictions properly. In this post, I will cover how I designed the evaluation strategy.
The first rule of evaluation is to check your ego at the door. The goal of evaluating a predictive model is to measure how well it performs. It is tempting for the evaluator to reinvent the wheel, devise a new way of predicting, and prove its superiority--but that is not evaluation. The evaluator is like a quality-control analyst, or a code reviewer.
Assumptions, Assumptions, Assumptions
One of the core messages of Numbersense (link) is that every analysis has assumptions, often called "theory". People who think their analyses contain no assumptions are usually the ones who haven't thought carefully about their models. Making "no assumptions" is itself an assumption. In the same way, evaluating models requires assumptions, and lots of them! Bear this in mind as you keep reading.
What to Compare to
In my article, I explained why the right comparable is the most realistic alternative strategy for purchasing air tickets if one did not use Kayak/Farecast. This is one of my most important assumptions, and it took me a while to figure it out.
At first glance, you might think a "natural" comparable would be the actual price trajectory for a given route for a given travel period. In other words, consider when the algorithm recommended buying and when it recommended waiting and judge it based on whether the algorithm led you to the lowest price during your search.
Using such a metric commits a form of hindsight bias. You have to remember that any algorithm (or human) must make the decision to buy or wait when the future prices are not yet known. In addition, we expect that there will be substantial price volatility in the future. If we were able to re-run history many times, the price paths would be different, and the algorithm's performance would also vary. When we are staring at the realized price path (ex-post), it is easy to forget about the underlying volatility.
A worse problem with judging the algorithm against the lowest possible prices is that there may be no way to get to that lowest price! Remember to check your ego: you are the evaluator, not the modeler. The question is whether you can find an existing alternative strategy that would lead you to those lowest prices without cheating and using future price data.
Don't Compare to Imaginary Toys
Instead of the theoretical maximum (i.e. the model that always finds the best possible price), let's consider the existence of a Best Realizable Model (BRM). Let's suppose its performance will be 70% of the theoretical maximum. Then, in theory, we can compare Kayak/Farecast to this BRM.
The catch is we don't know anything about BRM. In particular, we don't know if it performs at 70% or 30% of the theoretical maximum. If Kayak/Farecast gets to 25% of ideal and BRM 70%, then Kayak/Farecast is pretty poor. But at the same 25%, if BRM performs at 30%, then Kayak/Farecast is impressive.
Neither the theoretical maximum nor the Best Realizable Model should be used in this evaluation, simply because they are not real strategies, just imaginary toys.
Don't Bow to Random
At the other extreme, modelers like to compare their algorithms to the "random" strategy. In the case of airfare prediction, one such strategy might be picking a random number of days before departure and taking the fare on that day. The random strategy is rolling a die, using no skill at all. This is unsatisfactory because it sets too low a bar.
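This no-skill baseline can be sketched in a few lines. Everything here is invented for illustration--the fare path, the function name, and the set of days tracked are hypothetical, not Kayak/Farecast data:

```python
import random

def random_strategy(fare_path, rng=None):
    """Pick a random number of days before departure and buy at that day's fare."""
    rng = rng or random.Random(42)  # seeded for reproducibility
    day = rng.choice(sorted(fare_path))  # the die roll: no skill involved
    return fare_path[day]

# Toy fare path: days-before-departure -> fare in dollars (illustrative only).
fare_path = {60: 320, 45: 290, 30: 310, 21: 340, 14: 380, 7: 450}

paid = random_strategy(fare_path)
```

Any algorithm that can't beat this die roll on average is worthless, but beating it says very little.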
Compare to Next Best Alternative
In my view, a far better approach is to figure out what you'd have done in the absence of the predictive model. In my case, I typically wait till two weeks before departure, and so that is the comparable.
Some readers have commented that they tend to buy three or four weeks before departure, and one pointed to an analysis claiming that 54 days is exactly the right moment to get the cheapest fare. This takes us back to the point raised earlier, that every evaluation strategy makes assumptions. If we do an analysis starting 30 days out, a different reader will object, saying he or she typically purchases 21 days out. (Remember, though, the earlier you start this exercise, the longer you have to track day-by-day wait or buy recommendations.)
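Under my assumption (the two-weeks-out habit), the scoring logic looks roughly like the sketch below. The fare path and the advice sequence are made up for illustration, and the "follow the first buy signal" rule is my simplification of how a traveler might act on the recommendations:

```python
def follow_advice(fare_path, advice):
    """Buy on the first day the algorithm says 'buy', walking toward departure."""
    for day in sorted(fare_path, reverse=True):  # e.g. 60, 45, ... down to 7
        if advice[day] == "buy":
            return fare_path[day]
    return fare_path[min(fare_path)]  # never told to buy: forced purchase at the end

# Toy data: days-before-departure -> fare, and the algorithm's daily calls.
fare_path = {60: 320, 45: 290, 30: 310, 21: 340, 14: 380, 7: 450}
advice = {60: "wait", 45: "buy", 30: "buy", 21: "buy", 14: "buy", 7: "buy"}

algo_price = follow_advice(fare_path, advice)
baseline_price = fare_path[14]            # my habit: buy two weeks out
savings = baseline_price - algo_price     # positive means the algorithm helped
```

In this toy run the algorithm buys at $290 versus the $380 I would have paid two weeks out. Change the baseline day to 21 or 54 and the verdict can flip--which is exactly the point about assumptions.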
Another "obvious" (until you think about it) evaluation criterion is to focus on the probability estimates themselves, rather than the outcomes they produce. The probability estimates are given in the form "there is 79% chance that the fare would go up by $20 or more in the next seven days." These forecasts are, of course, given for every route, for every departure date, and for every date of search.
It may take more than a moment's thought, but such probability estimates are essentially impossible to verify. The forecast statement basically asserts that if the same forecasting situation arose many, many times, it would be right about 79% of the time. But in real life, we can't replicate the same forecasting situation many, many times.
One way around this is to forget about real life, and check the probability estimates by simulating many future worlds. A major problem of this approach is that even if you can show that the probability estimates are good relative to those simulations, the travelers who use these forecasts still may not save money. Again, it lacks the "what would you otherwise have done" dimension.
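To make the simulation idea concrete, here is a minimal sketch. The price model (a Gaussian random walk with assumed drift and volatility) is entirely hypothetical--a real check would need a defensible model of fare movements, which is itself a big assumption:

```python
import random

def simulate_rise(current_fare, n_worlds=20_000, rng=None):
    """Fraction of simulated 7-day price paths where the fare ends $20+ higher."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(n_worlds):
        fare = current_fare
        for _day in range(7):
            fare += rng.gauss(5, 15)  # assumed daily drift ($5) and volatility ($15)
        if fare >= current_fare + 20:
            hits += 1
    return hits / n_worlds

# Compare this empirical frequency to the stated "79% chance of a $20+ rise".
rate = simulate_rise(300)
```

Even if the quoted 79% matches the simulated frequency, that only validates the forecast against the simulator's world, not against what a traveler would otherwise have done.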
As you dig deeper, you'll find more tricky issues. One is the continuous time (24/7) nature of online travel search. If you need to measure whether "the fare went up $20 or more in the next seven days," you'd have to be monitoring fares continuously over those seven days. To make things even more complicated, in each of those seven days, for the same itinerary, Kayak/Farecast is updating its prediction in a rolling 7-day window.
One other consideration I'd like to cover: Nate Silver is famous for predicting all 50 states correctly. Remember, though, that anyone with the slightest knowledge of US politics can predict 40 of the 50 states--where he demonstrated his skill was in the swing states.
Now in the context of airfare prediction, if it were true that prices are much cheaper two or three months prior to departure, then people who would be purchasing in those time frames do not really need an algorithm to help them. It is important to test predictive models under situations in which they are most likely to demonstrate their skills. I would therefore recommend that you evaluate airfare predictions closer to the departure date when you'd need it most.
As you can see, evaluating predictive analytics is filled with challenges. But as I demonstrate here, it can be done. It should be done.