[I know, some of you are hungry for coronavirus content. Don't worry, keep reading. I'll get there in this post :) ]
As U.S. election workers continue to count votes, the scenario predicted by election watchers has emerged: Election Day votes favor the President while mail-in ballots lean toward Biden. Most mail-in votes arrived before Election Day, but the Election Day votes were counted first.
Despite predicting this scenario, most pundits have not prepared the dataviz that is necessary to follow the situation. We need the following line chart, which I found at the New York Times. (Maps do not have a time dimension!)
(P.S. It's better to cut out the left side so we can see the right side more clearly.)
The order of vote counting should not matter in the final result. The winner of an election is determined by the total vote shares, regardless of the mode of voting or its timing. So once we reach the 100% point in the above chart, the path getting there has no relevance.
However, the order of vote counting matters a great deal in partial results. If mail-in votes lean Democratic while in-person votes lean Republican, and if in-person votes are counted first (such as in Michigan, shown above), then partial results will show a big Republican lead followed by a tightening. Depending on who the ultimate winner is, there may be a flip to the other party.
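To see how much the counting order alone can move the partial results, here is a minimal sketch in Python. The batch sizes and vote splits are made-up numbers for illustration, not real election data, and `running_share` is just a helper name I chose.

```python
def running_share(batches):
    """Return candidate A's cumulative vote share after each counted batch.

    batches: list of (votes_for_a, votes_for_b) tuples, in counting order.
    """
    shares = []
    total_a = total_b = 0
    for a, b in batches:
        total_a += a
        total_b += b
        shares.append(total_a / (total_a + total_b))
    return shares


# Hypothetical batches: A wins in-person votes 60/40, loses mail-in votes 35/65.
in_person = [(600, 400)] * 5
mail_in = [(350, 650)] * 5

# Same ballots, two different counting orders.
in_person_first = running_share(in_person + mail_in)
mail_in_first = running_share(mail_in + in_person)

for step, (x, y) in enumerate(zip(in_person_first, mail_in_first), start=1):
    print(f"batch {step:2d}: in-person first {x:.3f} | mail-in first {y:.3f}")
```

Both sequences end at the same final share, but at the halfway point one shows A comfortably ahead and the other shows A far behind. Only the last row is free of the ordering.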
Data scientists in particular should know this and never draw conclusions based on partial results.
Nevertheless, in real life, there is always pressure to do a sneak peek. People want us to do interim analyses, to see which way the wind is blowing. They assure us they will remain patient. When we deliver the interim analyses, people want us to project the full result. Such projection is dangerous because it usually requires an assumption of serial independence.
We are implicitly assuming that the first m units tabulated are a random subsample of the entire sample of size N. Applied to vote counting, we are assuming (wrongly as the media keep reminding us) that the sequence of votes is random.
This election has just proved why we should not assume random sequencing within a sample. This is not how it works in real life when a sample accumulates over time.
***
Anyone running online AB tests has encountered the problem of serial dependence. The sample is assembled over time as visitors show up at the website or the app. If, during part of the test, the company drives traffic through advertising, then in that window of time, there is an influx of visitors, and if the advertising campaign is targeted (as is the norm), the incremental visitors are not like the average visitor. Any interim results for that period can be generalized to the types of visitors in the subsample but not to the entire universe of visitors.
The same problem is plaguing the coronavirus vaccine trials. Recall that the vaccine developers announced midway through the trials that they stepped up recruitment of minorities and high-risk people. So the interim analyses will be based on subsamples that skew toward white and lower-risk people. This is potentially dangerous because once the first vaccine is approved, there will be pressure to convert placebo to treatment. We might never get a read on those minorities and high-risk people! The partial result is not useless - it just does not generalize to the entire population.
[I asked about this at a recent webinar by an NIH statistician working on these trials, and at first, he said the non-random sequence of enrollment should not matter but then he realized it does matter in any interim analysis.]
***
Thankfully, serial dependence is easy to diagnose once the full sample is available. We compare the characteristics of the subsample with the full sample. In the vote count scenario, we will find that the proportion of mail-in votes shifts over time, which also means there is a difference in party affiliation and various demographics.
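The comparison described above can be sketched in a few lines of Python. This is an illustration under assumptions: the records are a made-up sequence in arrival order, and `composition_drift` is a hypothetical helper, not part of any library.

```python
def composition_drift(records, is_member, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Share of records satisfying is_member within the first fraction of the sequence.

    records: items in the order they were tabulated.
    is_member: function returning True for the characteristic being tracked.
    """
    n = len(records)
    drift = {}
    for f in fractions:
        m = max(1, round(f * n))  # size of the partial sample
        head = records[:m]
        drift[f] = sum(1 for r in head if is_member(r)) / m
    return drift


# Hypothetical sequence: all in-person ballots are tabulated before any mail-in ballot.
records = ["in_person"] * 60 + ["mail_in"] * 40

drift = composition_drift(records, lambda r: r == "mail_in")
for fraction, share in drift.items():
    print(f"first {fraction:.0%} of sample: mail-in share = {share:.2f}")
```

If the sequence were random, the share at every fraction would hover around the full-sample value. A share that climbs steadily as more of the sample comes in is the signature of serial dependence.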
What's the lesson for data scientists? Always look out for implicit assumptions in our analysis methods. In thinking about analysis of partial samples, we tend to focus energy on how statistical significance is affected by the sample size while assuming implicitly that the subsample is representative of the full sample in its contents. That is frequently correct in textbook examples but usually wrong in real-life situations.
***
One last thing: dependence comes in many stripes. This post addresses serial dependence. The post from last week concerns dependence between variables.