I was recently reminded of how statistics settled a question about the authorship of The Federalist Papers. These were 85 essays published under a pseudonym in support of the U.S. Constitution. It is now known that they were written by Alexander Hamilton, James Madison and John Jay. For a long time, the authorship of twelve of them was uncertain. In the early 1960s, two statisticians, Frederick Mosteller and David Wallace, published a celebrated paper that solved the riddle.
The key components of the solution include:
- Noticing that people’s writing styles differ in terms of word preference. Certain writers habitually use certain words more often than others.
- “Common words” like prepositions are better differentiators than less common words. For one thing, common words appear often, so there is more data from which to estimate how frequently each author uses them. For example, Madison almost never wrote the word “upon” while Hamilton used it quite often; thus, a document that contains no “upon”s is much more likely to have been written by Madison.
- Each differentiator word is a signal. A model combines multiple signals coming from multiple differentiator words. The combination is more than the sum of parts.
- The Bayesian model produces a probability that a specific author wrote a specific article.
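To make the idea of combining word signals concrete, here is a minimal sketch in Python. It is a simplification and not a reconstruction of the published model: it assumes word counts follow a Poisson distribution, treats the words as independent, and starts from equal prior odds. The per-author rates in `RATES`, the word list, and the function names are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical usage rates per 1,000 words, for illustration only.
# These are NOT the estimates from the original study.
RATES = {
    "hamilton": {"upon": 3.0, "whilst": 0.1, "on": 3.5},
    "madison": {"upon": 0.2, "whilst": 0.5, "on": 7.0},
}


def log_poisson(count, rate, thousands_of_words):
    """Log-probability of seeing `count` occurrences under a Poisson model."""
    mean = rate * thousands_of_words
    return count * math.log(mean) - mean - math.lgamma(count + 1)


def posterior(counts, n_words, prior=None):
    """Posterior probability of each author given word counts in one document."""
    authors = list(RATES)
    # Assumption: equal prior odds unless the caller supplies a prior.
    prior = prior or {a: 1.0 / len(authors) for a in authors}
    k = n_words / 1000.0

    # Each word contributes one log-likelihood term; summing the terms
    # is how the separate signals are combined.
    log_scores = {}
    for author in authors:
        score = math.log(prior[author])
        for word, rate in RATES[author].items():
            score += log_poisson(counts.get(word, 0), rate, k)
        log_scores[author] = score

    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    weights = {a: math.exp(s - top) for a, s in log_scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# A 2,000-word document with no "upon" at all.
result = posterior({"upon": 0, "whilst": 1, "on": 14}, n_words=2000)
print(result)
```

With these made-up rates, the absence of “upon” and the heavy use of “on” both point the same way, so the posterior leans strongly toward Madison. No single word decides the matter; the log-likelihood terms accumulate across words.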
The back story is also interesting:
- Success was not assured. Mosteller and others had earlier tried other methods, such as comparing average sentence lengths, and failed.
- Mosteller himself expressed the self-doubt characteristic of most competent statisticians: “the odds can never be greater than the odds against an outrageous event”. In other words, stranger things can happen.
- No matter how high the probability is, we have still only proven a correlation. For example, one respondent argued that Hamilton could have written the first draft, which Madison then edited. The data could not exclude such an event. (Nevertheless, anyone advancing this possibility ought to produce evidence to support it.)