Several friends have asked me whether statistics will be useful in proving either that (a) fraud occurred or (b) fraud did not occur during the recently concluded U.S. presidential election. So I address this issue today.
On p. 152 of Numbers Rule Your World, I inserted a seemingly innocent remark in parentheses. It read: "Unlike in Ontario, neither the Atlantic nor the Western Lottery Corporation has been able to catch any individual cheater."
This quiet comment rings very loudly in the aftermath of the U.S. election, as it shines a light on the outer boundary of statistical evidence.
The remark appeared within a case study about lottery fraud, which I used in the book to examine the nature of statistical evidence. In Ontario, Canada, lottery store owners were winning a disproportionate number of major prizes in the lotteries. A calculation showed the chance of Ontario store owners winning that much by luck alone was one in a quindecillion (one followed by 48 zeroes!). A similar calculation for the Western region yields a chance of 1 in 23 million. Seeing those numbers, any competent statistician concludes that fraud is certain (with a vanishingly small margin of error).
***
Statistics is reasoning about a mass of data; it has little to say about the atoms. In fact, the commonly accepted assumption of "random" regards these atoms as interchangeable. In the Atlantic region between 2001 and 2006, about 1,300 major prizes were handed out. If store owners had the same chance of winning as everyone else, they should have won about four of those prizes. In reality, they won almost 10 times as many. This statistical analysis proves that some store owners cheated. That's as much as the statistical analysis can say.
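To give a feel for the kind of calculation involved, here is a minimal sketch, not the calculation from the book. It assumes each of the 1,300 prizes independently goes to a store owner with probability 4/1,300 (so that four wins are expected under fair play), and it takes 40 as the observed count, which is my reading of "almost 10 times as many." The function names are made up for illustration.

```python
import math

N_PRIZES = 1300       # major prizes in the Atlantic region, 2001-2006 (from the post)
EXPECTED_WINS = 4     # store-owner wins expected under fair play (from the post)
OBSERVED_WINS = 40    # assumption: "almost 10 times as many" read as 40

# Assumption: every prize independently goes to a store owner with this probability.
p_fair = EXPECTED_WINS / N_PRIZES


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)


def upper_tail(k_min, n, p):
    """Probability of k_min or more successes in n trials."""
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(k_min, n + 1))


tail = upper_tail(OBSERVED_WINS, N_PRIZES, p_fair)
print(f"Chance of {OBSERVED_WINS} or more store-owner wins by luck alone: {tail:.3g}")
```

The probabilities are computed on the log scale because the binomial coefficients for 1,300 prizes are too large to handle comfortably as ordinary floating-point numbers. Note what the output is: a single probability about the whole group of winners, with nothing attached to any individual store owner.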
If you ask the analysts which owners cheated, they can't say. Name one cheater? They can't say either. Out of the 40-odd store-owner winners, exactly how many cheated? They can't say. Even though the expected number of wins by store owners is four, there is a chance that in any set of 1,300 major prizes, honest store owners would win five or six or three or even zero.
In the case study, the Ontario investigators successfully tracked down a victim, and then discovered the way the specific store owner cheated. The other provinces had no such luck. So they were left with the near certainty that some stores cheated but the improbability of catching an individual cheater.
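The spread of honest outcomes can be made concrete with a small sketch. It assumes, as a simplification, that each of the 1,300 prizes independently goes to a store owner with probability 4/1,300; the helper name `binom_pmf` is hypothetical.

```python
import math

N_PRIZES = 1300          # major prizes (from the post)
P_FAIR = 4 / N_PRIZES    # assumption: fair play means four expected store-owner wins


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Chance of each small number of store-owner wins when nobody cheats.
for wins in range(9):
    print(f"{wins} wins: {binom_pmf(wins, N_PRIZES, P_FAIR):.3f}")
```

Under these assumptions, no single count is a sure thing even when everyone is honest, which is why the number of legitimate wins among the 40-odd cannot be pinned down from the totals alone.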
In the ongoing electoral saga, Republicans face two challenges: they have yet to find statistical evidence of fraud, and if that surfaces, the data alone will not point to any individual malfeasance -- at best, the data allow statisticians to compute a probability of fraud.
When I taught a class using Numbers Rule Your World, I used to set a question about Chapter 5 that asks students to think about whether the probability calculations empower the authorities to arrest cheaters. I suppose if future students discover this blog post, they deserve to get points.
P.S. This is a good place to say something about the Bayesian school of statistics. In the above (classical) analysis, we assume no one was cheating, and ask what is the chance that store owners win the lopsided number of major prizes (the observed data). If that probability is low, we conclude that there is fraud. Bayesians turn this question around. Now, we ask directly what is the chance of store-owner fraud. An equivalent question is what is the share of major prizes store owners should be winning. Bayesians always think in terms of probability. So, the answer comes in the form: there is X percent chance that the share of major prizes won by store owners is in the range Y1 to Y2.
To compute a Bayesian probability, we need a "prior." This is like answering that question without the benefit of new data. Let's assume store owners purchase 2 percent of lottery tickets in Ontario. One possible prior is that the store-owner share of major prizes is between 0 and 10 percent, with equal probability (technically, this is a uniform distribution representing no specific knowledge of which number is more likely). Another possibility is a normal distribution (bell curve) centered at 2 percent, with standard deviation 0.6 percent. Now, we're claiming knowledge that numbers close to 2 percent (i.e. no fraud) are more likely than other numbers.
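To see how differently these two priors treat the "no fraud" region, here is a small sketch. The band of 1 to 3 percent is an arbitrary choice of mine for illustration, and the function names are hypothetical; only the two prior distributions come from the text.

```python
from statistics import NormalDist

# Prior 1: share is equally likely anywhere between 0 and 10 percent.
UNIFORM_LOW, UNIFORM_HIGH = 0.0, 0.10
# Prior 2: bell curve centered at 2 percent with standard deviation 0.6 percent.
normal_prior = NormalDist(mu=0.02, sigma=0.006)


def uniform_mass(lo, hi):
    """Prior probability that the share falls in [lo, hi] under the uniform prior."""
    lo, hi = max(lo, UNIFORM_LOW), min(hi, UNIFORM_HIGH)
    return max(hi - lo, 0.0) / (UNIFORM_HIGH - UNIFORM_LOW)


def normal_mass(lo, hi):
    """Prior probability that the share falls in [lo, hi] under the normal prior."""
    return normal_prior.cdf(hi) - normal_prior.cdf(lo)


band = (0.01, 0.03)  # hypothetical band around the 2 percent "no fraud" share
print(f"Uniform prior: {uniform_mass(*band):.1%} of belief between 1% and 3%")
print(f"Normal prior:  {normal_mass(*band):.1%} of belief between 1% and 3%")
```

Before seeing any data, the uniform prior spreads its belief evenly across the whole range, while the normal prior concentrates most of it near 2 percent.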
If you feel uneasy about the subjectivity of the previous step, you're not alone. That's a major point of contention between the classical and Bayesian viewpoints. Nevertheless, in many problems, the Bayesian solutions are not sensitive to the assumption of the prior, so we don't have to worry.
Now that we have specified the prior probabilities, we do a Bayesian update, which means modifying our prior beliefs using the observed new data. Think of this as a series of simulations. In a specific simulation, we draw a number from the assumed prior distribution; this might mean the store-owner share of major prizes is 5 percent. Then, with this share, we give out 1,300 prizes, and count the number of prizes that go to store owners. In each simulation, a new share is drawn, resulting in a different number of prizes going to store owners. Finally, we combine all these simulations, giving the most credit to the shares whose simulated results resemble what was actually observed. (If this reminds you of election forecasting models, you're getting it. The models by the Economist and FiveThirtyEight are both Bayesian.) In short, we run a lot of simulations, each result is a split of major prizes between store owners and other players, and we weight these results using the prior probabilities and their agreement with the observed data.
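Here is a rough sketch of such an update. Instead of random simulations, it uses a grid approximation: it steps through candidate shares from 0.01 to 9.99 percent and weights each by its prior probability times the chance of producing the observed result. For illustration only, I borrow the 1,300 prizes and roughly 40 store-owner wins mentioned earlier and treat prizes as independent; those pairings, and all function names, are my assumptions.

```python
import math
from statistics import NormalDist

N_PRIZES = 1300       # prizes handed out (borrowed from the post for illustration)
OBSERVED_WINS = 40    # assumption: store-owner wins, read from "40 odd"


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)


def grid_posterior(prior_density, grid):
    """Posterior weight of each candidate share: prior times likelihood, normalized."""
    weights = [
        prior_density(share) * math.exp(log_binom_pmf(OBSERVED_WINS, N_PRIZES, share))
        for share in grid
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Candidate store-owner shares from 0.01 percent to 9.99 percent.
grid = [i / 10000 for i in range(1, 1000)]

priors = {
    "uniform 0-10%": lambda share: 1.0,
    "normal(2%, 0.6%)": NormalDist(mu=0.02, sigma=0.006).pdf,
}

results = {}
for name, prior in priors.items():
    posterior = grid_posterior(prior, grid)
    mean_share = sum(s * w for s, w in zip(grid, posterior))
    mass_at_or_below_2pct = sum(w for s, w in zip(grid, posterior) if s <= 0.02)
    results[name] = (mean_share, mass_at_or_below_2pct)
    print(f"{name}: posterior mean share {mean_share:.2%}, "
          f"chance share <= 2% is {mass_at_or_below_2pct:.3f}")
```

Running the same data through both priors is also a simple way to check how much the choice of prior matters for a given problem. Either way, the output describes the store-owner share as a whole and says nothing about which wins were fraudulent.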
I will have more to say about Bayesian models soon because the Pfizer vaccine trial will be analyzed using a Bayesian model.
One last thing: the Bayesian analysis does not resolve the key issue raised in this blog post. We can say there is X percent chance that the store owners are winning Y percent of the major prizes. If there is not enough probability around Y = 2 percent, we suspect fraud. But once again, we don't know how many of the 40 wins were fraudulent, and we don't know who cheated.