It's seriously dangerous to send a data scientist the data. (Andrew Gelman has wondered out loud often about why scientific researchers are so reluctant to release their datasets. Most recently, see this incident.)
Now that I have got hold of the Super Tuesday data, there is no end in sight to testing all the talking points that the mainstream media has been pushing all day and all night.
We were also supposed to believe that the Democrats have achieved a huge surge in turnout compared to 2016, and in particular, such turnout is attributed to Biden, and specifically denied to Sanders's supporters. This conclusion doesn't pass the smell test, because the same pundits told us that Biden achieved his victories on Super Tuesday without campaigning in many of those states (!), plus there is plenty of video evidence of huge crowds at rallies for Sanders.
So, I pulled out the data.
***
tl;dr
Exit polls show that Sanders was successful at turning out first-time voters. The pundits just ignored the most obvious data.
The right way to use the exit polls is to see that first-time voters are 10% to 125% more likely than repeat voters to support Sanders. First-time voters are 17% to 54% less likely than repeat voters to support Biden. Other candidates also attracted first-time voters to a lesser degree.
The wrong way to use the exit polls is to look at the share of votes of subgroups of voters like youth, Latino, etc. Exit polls have biases along all dimensions like age, gender, race, political leaning, etc. that are adjusted to match historical norms and prior polls. Therefore, exit polls do not provide any information at all about shifts in any of those dimensions.
***
In terms of state-level turnout, there was no surge in voting in Oklahoma and Arkansas, both states won by Biden. The other states fell into two groups: moderate increase of about 15% (North Carolina, Massachusetts, Alabama and Vermont) and large increase of 40-70% (California, Virginia, Texas, Tennessee). As mentioned before, the four states moving from caucuses to primaries definitely experienced a large jump in participation, as expected.
There is no clear pattern linking the winner to turnout changes at the state level.
***
Now the problem comes when the pundits start breaking down turnout by demographics or political leaning. Where does this data come from? Given that who we vote for is confidential, the official bean-counters are not the source of such data.
It turns out all such breakdowns of turnout are based on exit polls. Exit polls involve pollsters stopping voters as they exit the polling stations and asking them to fill out questionnaires. Exit polls have many statistical issues that are hard to resolve; I will come back to these later in the post. For now, I take the exit poll results at face value.
***
Let's consider how mass media pundits discuss turnout rate of subgroups of voters. Take The Hill's breathless account of the "surge" in turnout (link) as an example.
those young voters did not turn out on Tuesday at the rate that Sanders had hoped. Exit polls show that the Vermont senator won voters between the ages of 18 and 29 by wide margins across the 14 Super Tuesday states. But no state saw an increase in those voters’ share of the electorate.
Youth turnout rate is measured here by the proportion of voters between ages 18 and 29. A proportion is not the same as a rate. A proper definition of youth turnout rate is the proportion of registered voters aged 18 to 29 who actually voted. This is not the same thing as the proportion of actual voters who are aged 18 to 29. It is possible for the youth turnout rate to have increased while young voters still account for the same share of votes, for example when turnout rises by a similar amount in every age group.
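To make the distinction concrete, here is a minimal sketch in Python. Every number in it is hypothetical, chosen only to show the arithmetic, and the function names are mine, not anything from the exit polls.

```python
def turnout_rate(group_voters: int, group_registered: int) -> float:
    """Share of a group's registered voters who actually voted."""
    return group_voters / group_registered


def share_of_electorate(group_voters: int, total_voters: int) -> float:
    """Share of all ballots cast that came from the group."""
    return group_voters / total_voters


# Hypothetical registration and vote counts (not real data).
registered = {"youth": 1_000_000, "older": 4_000_000}
votes_2016 = {"youth": 150_000, "older": 850_000}
# In this made-up 2020, both groups cast 20% more votes than in 2016.
votes_2020 = {"youth": 180_000, "older": 1_020_000}

for year, votes in (("2016", votes_2016), ("2020", votes_2020)):
    total = sum(votes.values())
    rate = turnout_rate(votes["youth"], registered["youth"])
    share = share_of_electorate(votes["youth"], total)
    print(f"{year}: youth turnout rate {rate:.1%}, share of electorate {share:.1%}")
```

In this made-up case the youth turnout rate rises from 15% to 18% while the youth share of the electorate stays at 15%, because the other group grew by the same proportion.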
As with the electability issue that I discussed here, these pundits are ignoring the direct data coming from the exit polls about new voter turnout - and instead relying on flawed logic based on indirect data.
Looking through the exit poll data from CNN, I found the most direct piece of evidence for turning out new voters. This question:
The above example comes from North Carolina, where Biden won comfortably. The second column shows the 17 percent who said they were "first-time voters".
This is very important because every first-time voter is a turnout success story. This is directly useful data. (By coincidence, the official tally said that North Carolina's turnout went up by 16 percent so first-time voters can explain all of the increase in turnout.)
Here is something that every pundit in mainstream media missed: Bernie Sanders was extremely successful at turning out first-time voters in North Carolina.
This is how you read the above table. You take the ratio of 43% (Sanders's share among first-time voters) to 19% (his share among repeat voters) to yield 2.3. First-time voters were about 2.3 times as likely to vote for Sanders as repeat voters. We use repeat voters as the baseline, and we measure first-time voters against the baseline.
The corresponding number for Biden was dismal. First-time voters in NC were 30% less likely to vote for Biden than repeat voters. Finally, I combined all the other candidates into one... first-time voters in NC were also 30% less likely to vote for any of the others than repeat voters.
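The ratio calculation can be sketched in a few lines of Python. The 43% and 19% are the Sanders figures quoted above; the function names and the 0.7 ratio used to illustrate "30% less likely" are mine, added only to show how a ratio converts to a percentage difference.

```python
def relative_likelihood(first_time_pct: float, repeat_pct: float) -> float:
    """Ratio of a candidate's support among first-time voters to repeat voters."""
    return first_time_pct / repeat_pct


def percent_change(ratio: float) -> float:
    """Convert a ratio against the baseline into a percent more (+) or less (-)."""
    return (ratio - 1.0) * 100.0


# Sanders in North Carolina: 43% of first-time voters vs 19% of repeat voters.
sanders_ratio = relative_likelihood(43, 19)
print(f"Sanders ratio: {sanders_ratio:.1f}")

# Illustration only: a ratio of 0.7 against the baseline reads as "30% less likely".
print(f"Ratio 0.7 -> {percent_change(0.7):.0f}% relative to repeat voters")
```

A ratio above 1 means the candidate did better among first-time voters than among repeat voters; a ratio below 1 means the opposite.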
So, if you were a first-time voter in North Carolina who voted for Bernie Sanders, I see you.
(Image of Sanders's rally in Raleigh, NC: @elemeno on Twitter)
***
The following table shows the calculations for the seven Super Tuesday states for which CNN has exit poll data on this first-time voter question.
Conclusion: In every state, the first-time voters favored Sanders. In every state, Biden did comparatively worse among first-time voters than repeat voters. The evidence is clear: Sanders brought new voters, Biden relied on the same old.
Note: The first-time voter question was not asked in any of the four caucus-to-primary states. The sample sizes in Oklahoma and Vermont were too small and the results were suppressed by the pollster. There does not appear to have been an exit poll in Arkansas.
***
As with all polls, exit polls have multiple sources of error. One is non-response: the pollster (Edison) claims that about 40 to 50 percent of the people they approach fill out the questionnaires. As for the refusals, the pollster claims that its workers will "guess" their age, gender, race, etc., and these guesses are used to adjust the data to account for non-response.
A second source of error is sampling bias. The pollsters are only at a subset of polling stations; how many and how they are selected are not disclosed. It is not clear that the respondents to exit polls are representative of the universe of voters. To tackle this bias, the pollster makes statistical adjustments by assigning different weights to different people. For example, if they think women are underrepresented in the exit poll sample relative to the voter population, they would assign a higher weight to each woman in the exit poll. They do this for age, gender, and racial groups. I'm not sure whether they adjust for political leaning or other factors. Again, the methodology is closely guarded. The pollster does not even have a publicly available page titled methodology.
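Since the pollster's actual procedure is not public, the following is only a generic sketch of this kind of weighting, assuming a single adjustment variable. The respondents, the candidates "A" and "B", and the target shares are all invented for illustration.

```python
from collections import Counter

# Hypothetical respondents as (group, candidate) pairs. Not real exit-poll data.
sample = (
    [("woman", "A")] * 30 + [("woman", "B")] * 10
    + [("man", "A")] * 20 + [("man", "B")] * 40
)

# Assumed share of each group among all voters (the target the pollster believes in).
target_share = {"woman": 0.55, "man": 0.45}

n = len(sample)
group_counts = Counter(group for group, _ in sample)

# Each respondent's weight is target share divided by sample share for their group.
# Women are 40% of this sample but assumed to be 55% of voters, so they are weighted up.
weights = {g: target_share[g] / (group_counts[g] / n) for g in group_counts}


def weighted_support(candidate: str) -> float:
    """Weighted share of respondents supporting the given candidate."""
    total = sum(weights[g] for g, _ in sample)
    support = sum(weights[g] for g, c in sample if c == candidate)
    return support / total


raw_support = sum(1 for _, c in sample if c == "A") / n
print(f"weights: {weights}")
print(f"candidate A: raw {raw_support:.1%}, weighted {weighted_support('A'):.1%}")
```

In this toy sample, candidate A's support moves from 50% unweighted to about 56% weighted, purely because of the assumed target shares. The published number depends on what the pollster believes the electorate looks like.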
There is also an adjustment for the overall vote shares to match the official results.
Exit poll data do not shed light on how specific subgroups of voters behaved. The data collection is biased, and the raw data are re-weighted to match some historical norm or prior polls. These adjustments are valid only if there is no major shift in subgroup characteristics. Therefore, one cannot observe subgroup trends in the adjusted data!
Suppose you expected the youth vote to be 18% and the raw data showed 15%. You have two choices. Either you claim the raw data are unbiased, and conclude that the youth vote was 3 percentage points below expectation; or you accept that the collected data are biased, adjust the numbers to match 18%, and then have nothing to say about a trend in youth voting.
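Here is a small sketch of why the adjusted numbers carry no trend information, again under the assumption of a simple one-variable reweighting (the pollster's real method is undisclosed). The 18% target comes from the example above; the raw counts and function name are hypothetical.

```python
def adjusted_shares(raw_counts: dict, target_share: dict) -> dict:
    """Reweight raw group counts so that the weighted shares match the target."""
    n = sum(raw_counts.values())
    # Weight per respondent in each group: target share over raw sample share.
    weights = {g: target_share[g] / (raw_counts[g] / n) for g in raw_counts}
    weighted = {g: raw_counts[g] * weights[g] for g in raw_counts}
    total = sum(weighted.values())
    return {g: weighted[g] / total for g in weighted}


# Expected youth share of 18%, as in the example above.
target = {"youth": 0.18, "other": 0.82}

# Three hypothetical raw samples of 1,000 respondents each.
for raw_youth in (150, 180, 210):
    raw = {"youth": raw_youth, "other": 1000 - raw_youth}
    adjusted = adjusted_shares(raw, target)
    print(f"raw youth share {raw_youth / 1000:.0%} -> adjusted {adjusted['youth']:.0%}")
```

Whether the raw youth share is 15%, 18% or 21%, the adjusted share comes out at 18%. The adjusted figure reflects the target, not what happened on election day.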
Since the pollster acknowledges the biases in data collection, and applies statistical adjustments, we know that the exit polls have nothing useful to say about subgroups of voters. In the meantime, the media has spread a thousand misinformed stories using this dataset.