In Chapter 5 of Numbers Rule Your World (link), I told the story of how Professor Rosenthal used statistical evidence to uncover a cheating scandal in a Canadian lottery. Statistics can be very powerful in identifying suspicious data - in this case, the winning probability of store owners who sell lottery tickets was found to be way higher than normal.
In a textbook presentation, you can simply do the math and compute the relevant probabilities and invoke statistical "significance" arguments. Hey, we've found evidence of cheating. End of story.
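To make "do the math" concrete, here is a minimal sketch of the kind of tail-probability calculation a textbook would show. The numbers are hypothetical and chosen only for illustration; they are not the figures from the book or from Professor Rosenthal's analysis. It assumes each prize is an independent draw and that insiders win in proportion to the share of tickets they buy.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical inputs (assumptions, not real lottery data):
n_prizes = 500        # major prizes awarded over the period
insider_wins = 20     # prizes claimed by store owners
insider_share = 0.01  # fraction of all tickets bought by store owners

# Chance that store owners win at least this many prizes by luck alone
p_value = binomial_upper_tail(n_prizes, insider_wins, insider_share)
print(f"P(at least {insider_wins} insider wins by luck) = {p_value:.2e}")
```

With these made-up inputs, insiders would be expected to win about 5 prizes, so 20 wins lands far out in the tail. A tiny probability like this says the group's record is very hard to explain by luck; it does not name any individual.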
In real life, that would just signal the beginning of a long story. The statistical conclusion is merely that "someone cheated" or at best, that "someone cheated almost surely". But you can't catch the cheater(s) based on that evidence! The question left unanswered is: who cheated?
In the example from the book, we know that as a group, store owners cheated. That doesn't mean every store owner cheated. Who cheated? For that, you need old-school shoe leather. You need to find the receipts.
***
I was inspired to write this post after a friend asked me whether the Astros were the only baseball team that stole signs. Here is a summary of the cheating scandal engulfing professional baseball right now.
The statistical argument goes like this: the kind of sign-stealing using video can only happen at home stadiums. Teams that cheated in this way would have a performance advantage in their home games compared to their away games. But of course, every team enjoys a home-field advantage so we're talking about an abnormally high home-field advantage.
This is where statistics demonstrates its power: we have guidelines for what level of home-field advantage should be considered abnormally high.
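One possible form such a guideline could take is a one-sided z-test on the gap between a team's home and away winning percentages, measured against a typical home edge. The sketch below is only an illustration: the function name, the team's record, and the `typical_edge` value of 0.08 are all assumptions, not league figures or the method used in any investigation.

```python
from math import erf, sqrt


def excess_home_edge_z(home_wins, home_games, away_wins, away_games, typical_edge):
    """Return (z, one-sided p-value) for a home-away gap beyond typical_edge."""
    p_home = home_wins / home_games
    p_away = away_wins / away_games
    # Standard error of the difference between two independent proportions
    se = sqrt(p_home * (1 - p_home) / home_games + p_away * (1 - p_away) / away_games)
    z = ((p_home - p_away) - typical_edge) / se
    # Upper-tail probability under the standard normal distribution
    p_value = 0.5 * (1 - erf(z / sqrt(2)))
    return z, p_value


# Hypothetical team (made-up record): 60-21 at home, 40-41 on the road
z, p_value = excess_home_edge_z(60, 81, 40, 81, typical_edge=0.08)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3f}")
```

A large z flags the home record as unusual relative to the assumed baseline. It says nothing about why the record is unusual.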
The question my friend asked is: if there were a second team that also did remarkably better at home than on the road, similar to the Astros, would that be sufficient evidence to convict that team as well?
Sadly, the power of statistics is also limited. We think cheating might be behind the abnormally high home-field advantage - but we can't be sure! We are not unsure about the observed outcome; the uncertainty concerns the cause of such an outcome.
This is similar to the lottery fraud scheme described in Numbers Rule Your World (link). We are very confident that too many winners were store owners, but nothing in the data tells us that the cause is cheating! The case didn't get solved until specific store owners were caught. You need to find receipts.
Unless there is a confession by the cheaters, or a reporter or detective discovering incriminating evidence, we don't know what caused the anomaly in the data. Suspicion is not proof of wrongdoing.
In baseball's case, it was a former Astros pitcher who confessed to the scheme. Without receipts, the cheaters can always offer other explanations for the performance advantage. The most powerful alternative explanation is coincidence.
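To get a feel for how strong the coincidence explanation can be, here is a rough simulation of a league in which nobody cheats. Every parameter is an assumption made for illustration: 30 teams, 81 home and 81 away games each, a 54% chance of winning at home and 46% on the road, and a hypothetical "suspicious" threshold of 15 more home wins than away wins. It also treats each team's results as independent, which is a simplification since one team's win is another's loss.

```python
import random


def share_of_seasons_with_outlier(n_teams=30, games=81, p_home=0.54, p_away=0.46,
                                  gap_threshold=15, n_seasons=500, seed=0):
    """Fraction of simulated clean seasons with at least one 'suspicious' team."""
    rng = random.Random(seed)
    flagged = 0
    for _season in range(n_seasons):
        for _team in range(n_teams):
            home_wins = sum(rng.random() < p_home for _ in range(games))
            away_wins = sum(rng.random() < p_away for _ in range(games))
            if home_wins - away_wins >= gap_threshold:
                flagged += 1
                break  # one flagged team is enough for this season
    return flagged / n_seasons


share = share_of_seasons_with_outlier()
print(f"Share of clean seasons with at least one flagged team: {share:.2f}")
```

Under these assumptions, most simulated seasons contain at least one team that crosses the threshold purely by chance. The lesson is that with many teams in the league, somebody is likely to look unusual in any given year, so an anomaly alone cannot separate a cheater from a lucky team.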
***
Here's an example of an alternative story. Someone could claim that their hitters started making "superman" poses before they got on the field, and that this explained the advantage.
But to spin such stories is to run "story time". We have some data - the performance advantage - that show something. Then, we seamlessly move to a story that has nothing to do with the data just presented. Just because there is a presentation of data doesn't mean the data support the story!
The Astros had a better record on the road, winning 53 rather than 48 at home.
I guess that supports your point that other things than cheating could cause success.
Posted by: Jason May | 02/17/2020 at 02:06 PM
JM: It's complicated. Here you have someone confessing. But the cheating might not show up in the outcomes, just because it might not have helped. (When I wrote the chapter on steroids testing, I learned that some athletes claimed that steroids didn't help.) But cheating itself is against the rules. Another unknown is whether other teams also cheated, which can negate some of the expected benefits.
Posted by: Kaiser | 02/17/2020 at 04:39 PM