Nate Silver, now writing his own Substack, warned about reading too much into early voting tabulations (link; preview). He's right.
It is tempting to think that millions of early votes make for a sample large enough to overcome any problems. They don't.
In statistical sampling, there are two types of errors: bias and variance, or what could be called systematic and random errors. Larger sample sizes reduce variance (random error) but do not overcome bias, which is a symptom of the unrepresentativeness of the observed sample.
When we compute optimal sample sizes to run experiments or A/B tests, the computation assumes zero bias, and the sample size only cures sampling variance due to random errors.
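To make the zero-bias assumption concrete, here is a minimal sketch of a textbook sample size calculation for comparing two proportions. The function name, the default significance level and power, and the conversion rates in the usage line are all illustrative assumptions, not taken from any particular tool. Notice that no term for bias appears anywhere in the formula; only random sampling variance is accounted for.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided test of two proportions.

    Standard normal-approximation formula. It assumes both arms are
    unbiased random samples, so it only guards against random error.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical conversion rates: detecting 10% vs 12% needs far more
# observations than detecting 10% vs 15%.
print(sample_size_per_arm(0.10, 0.12), sample_size_per_arm(0.10, 0.15))
```

A smaller effect demands a larger sample, but no sample size produced by this formula says anything about whether the sample is representative.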
The key to removing bias is not sample size but getting a representative sample of the population of interest.
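A small simulation can illustrate the point. The sketch below uses made-up numbers: a population in which 52% support candidate A, and a hypothetical `relative_inclusion` parameter that says how much more likely A's supporters are to end up in the sample than everyone else. It compares a small representative sample with a very large unrepresentative one.

```python
import random


def estimate_share(true_share=0.52, n=1_000, relative_inclusion=1.0, seed=0):
    """Estimate candidate A's share from a sample of size n.

    relative_inclusion = 1.0 means a representative sample; a value of 1.3
    means A's supporters are 30% more likely to be sampled than others.
    All parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Share of A supporters among the people who end up in the sample.
    weighted_a = true_share * relative_inclusion
    p_sampled = weighted_a / (weighted_a + (1 - true_share))
    hits = sum(rng.random() < p_sampled for _ in range(n))
    return hits / n


small_random = estimate_share(n=1_000, relative_inclusion=1.0)
large_biased = estimate_share(n=1_000_000, relative_inclusion=1.3)
print(f"small representative sample: {small_random:.3f}")
print(f"large biased sample:         {large_biased:.3f}")
```

Under these assumptions, the million-person sample settles very precisely on the wrong answer, while the thousand-person sample is noisier but centered on the truth.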
***
The biggest issue with early voting data is that citizens choose whether to vote early or not. And the decision to vote early is correlated with whom the person is voting for.
Remember that in certain states in the last election, Trump was ahead by a huge margin after the in-person votes were counted; then, as precincts started counting the early votes, Biden came roaring back and ultimately won. That's because in those states, Democrats were much more likely than Republicans to vote early.
A less clear version of this plays out in states with big cities, and did so even before the age of early voting. Counties with big cities take longer to count all their votes, while rural counties in the same states finish counting quickly; thus, early returns on election night are biased towards smaller counties. Rural counties tend to favor Republicans while big cities usually vote Democratic.
Aside from timing effects, the biases in early voting are also driven by partisan, demographic, or behavioral factors. For example, an interesting segment consists of those voters who would not have voted at all if early voting were unavailable.
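The arithmetic behind this kind of swing is easy to sketch. In the example below, every number is hypothetical: a state where the Democrat wins 50.5% of the two-party vote, 60% of Democratic voters vote early, and 30% of Republican voters do. The function name and parameters are my own, chosen for illustration.

```python
def dem_share_by_method(dem_share, dem_early_rate, rep_early_rate):
    """Split a two-party electorate into early and in-person votes.

    Returns the Democratic share within each voting method, given the
    overall Democratic share and each party's propensity to vote early.
    """
    dem_early = dem_share * dem_early_rate
    rep_early = (1 - dem_share) * rep_early_rate
    dem_in_person = dem_share - dem_early
    rep_in_person = (1 - dem_share) - rep_early
    return {
        "early": dem_early / (dem_early + rep_early),
        "in_person": dem_in_person / (dem_in_person + rep_in_person),
    }


shares = dem_share_by_method(dem_share=0.505, dem_early_rate=0.6, rep_early_rate=0.3)
print({method: round(share, 3) for method, share in shares.items()})
```

With these made-up rates, the Democrat takes about 67% of the early vote but only about 37% of the in-person vote, even though the overall race is nearly tied. Neither subset, however large, resembles the full electorate.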
Plus, those using early voting data to make predictions are making the assumption that voters vote according to party affiliation (since voting is confidential). In this election, that assumption is highly suspect. One of the major talking points of Democrats is that certain prominent Republicans are defecting to vote blue. According to reports (link, link), they have been running ads encouraging wives to lie to husbands, and husbands to lie to buddies, if they were going to go blue.
Republicans, for their part, are hoping to convince some traditionally Democratic voters, e.g. minorities, to switch sides.
What Nate Silver says is that including early voting data (of which there isn't much) in his models doesn't help because, by themselves, such data have low predictive value.