The Democratic primaries are generating - unexpectedly - stories with good data angles.
First, there was the "cancelled" Selzer poll leading up to Monday's contest (link). While we still don't have a proper explanation, what was reported in the press is that there was a single instance in which Pete Buttigieg's name might not have been read, possibly because the long list of candidates pushed some names "off screen". In this case, the alleged culprit is someone increasing the font size on the screen, which moved names out of the viewable area.
I find this explanation inadequate. In any survey involving thousands of calls, there will be a few issues - kind of like dropping bad ballots. What if, for example, a phone connection got momentarily mangled just as Buttigieg's name was read? If the Buttigieg camp found one bad case, they surely went looking for more examples - and as far as we're told, they didn't find any.
Further, any competent survey designer - and Selzer is a good one, according to Nate Silver - knows that you always randomize the order of a response list. In other words, the order in which the candidates are listed for the same question is randomly assigned for every phone call. So if Buttigieg's name got missed because it was at the bottom of the list on some calls, it would be at the top of the list on other calls. If x percent of the calls are flawed because of omitted names, each candidate is affected by roughly the same amount. The randomization equalizes the impact of such unpredictable errors.
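Here is a minimal simulation of that point (the 1-percent error rate and the ten-candidate list are my assumptions for illustration; this is not Selzer's actual procedure):

```python
# Sketch: why randomizing candidate order neutralizes position-dependent errors.
# Assume each call has a small chance that the last-listed name falls "off screen".
import random

candidates = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
p_off_screen = 0.01      # assumed rate at which the bottom name is not read
n_calls = 100_000        # simulated phone calls

missed = {c: 0 for c in candidates}
for _ in range(n_calls):
    order = random.sample(candidates, len(candidates))  # fresh order per call
    if random.random() < p_off_screen:
        missed[order[-1]] += 1  # the bottom-listed name is the one omitted

# With randomized order, each candidate absorbs ~1/10 of the flawed calls
# (about 100 each here), so no single candidate bears the brunt.
for c in sorted(missed):
    print(c, missed[c])
```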
Cancelling such an important poll is a decision that changes the dynamics of the race, and a much more thorough accounting is needed.
Selzer's explanation is an "abundance of caution". If that standard holds, this instance opens a floodgate for delegitimizing polls. If a complaint based on a single anomaly is allowed to topple a poll, then any candidate can conspire to preempt poll results that s/he doesn't expect to like.
The decision to cancel the poll is not fair to all parties. It turns out someone leaked partial results of that poll on the day of the Iowa caucuses (I saw this on FiveThirtyEight, which was not the source of the leak). The leak makes clear that some candidates would have been happier with the poll results than others.
***
In the turmoil around the Iowa caucuses, we must differentiate between what the data show and what might explain the data. There is no doubt that this screwup hurts the winner, and any other candidate who did unexpectedly well, while it helps any candidate who did unexpectedly poorly. This outcome is also not fair to all parties.
The problem is when we impute a cause or motivation to explain the effect. We don't know whether someone conspired to help or hurt any particular candidate - unless specific corroborating evidence is subsequently found.
Yet we don't need a motive to agree on the positive or negative effects on specific candidates.
***
Further, people are underestimating the cost of complexity. Going from two candidates in 2016 to ten candidates now makes things much more complicated. (I'm using two and ten for illustration. The list is longer if we include every fringe candidate.)
This increase in candidates is magnified by having multiple rounds of voting and realignments. With two candidates, the realignment decision is binary: join the other side or go home. With ten candidates, the possibilities multiply, as the back-of-the-envelope count below suggests.
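To put a rough number on "so many possibilities" (the group sizes and viable-candidate counts here are made up for illustration): a group of n supporters of a non-viable candidate can split among k viable candidates plus "go home", and a standard stars-and-bars count gives C(n + k, k) possible splits.

```python
# Counting the ways one group of supporters can realign.
# n supporters, k viable candidates plus "go home" = k + 1 bins: C(n + k, k) splits.
from math import comb

n_supporters = 20

# 2016-style: the one other candidate, or go home (2 bins)
print(comb(n_supporters + 1, 1))   # 21 possible splits

# 2020-style: say 5 still-viable candidates, plus go home (6 bins)
print(comb(n_supporters + 5, 5))   # 53,130 possible splits
```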
Then, apparently, they also promised to provide vast amounts of data. This policy is itself commendable. With two candidates, validating the numbers is easy: there is really only one number per precinct to check. With ten candidates, there are nine numbers per precinct to check, and there are over 1,000 precincts. The resulting tabulation also has lots of missing values (holes), because by definition most candidates become non-viable by the end of the night in each precinct, yet the set of candidates that remains viable varies by precinct.
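A toy version of that tabulation (the counts and names are invented) shows how the holes shift from precinct to precinct, which is what makes mechanical validation hard:

```python
# Toy final-round tabulation: each precinct reports counts only for the
# candidates still viable there, so the pattern of holes varies by precinct.
import pandas as pd

final_round = pd.DataFrame(
    {
        "Candidate1": [45, None, 38],
        "Candidate2": [None, 52, 41],
        "Candidate3": [30, 27, None],
    },
    index=["Precinct A", "Precinct B", "Precinct C"],
)
print(final_round)
print("Cells to validate:", final_round.notna().sum().sum())
```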
A simple way to think about the cost of complexity is this: each number has some unknown error rate, say 1 percent. If we need an accurate result, we need the probability of no errors. That is, 1 minus the probability of one error, minus the probability of two errors, ..., minus the probability of all errors - which, if errors are independent, works out to 0.99 raised to the power of the number of reported figures. The 1-percent error gets magnified very fast as the number of entities grows!
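Here is that arithmetic with the figures above (1 percent per number, nine numbers per precinct, 1,000 precincts, errors assumed independent):

```python
# Probability that an entire tabulation is error-free: (1 - p) ** n
p_error = 0.01
n_numbers = 9 * 1_000          # nine numbers per precinct, ~1,000 precincts

p_clean = (1 - p_error) ** n_numbers
print(f"{p_clean:.3e}")        # ~5e-40: a fully error-free tabulation is hopeless

# Even a mere 100 numbers gives only about a 37% chance of zero errors
print(f"{(1 - p_error) ** 100:.2f}")
```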