One of the enduring themes of the last three Democratic primaries is the accusation that the Twitter supporters of Bernie Sanders are particularly mean. As a data scientist, I like to see the evidence.
***
What data would convince a data scientist that the accusation is real?
First, we need a definition of "mean". (This is tough.)
[Is that a mean tweet? I am ignoring the elephant in the room. Our President is sending mean tweets complete with demeaning nicknames almost every day.]
Second, we need a statistical analysis - not some screenshots. Anecdotes are the antithesis of statistics. We need to count tweets. We can use a statistical sampling approach and pull down random tweets, or in this age of Big Data, some people prefer to pull down every tweet.
Third, we need to define "disproportionate". This part is easy to compute but the mainstream media keeps failing here. What matters is not the number of mean tweets but the proportion of mean tweets. Sanders's supporters will definitely have the highest number of mean tweets - because they also send the most tweets.
Sanders's fans send the most tweets because (a) Sanders is the front-runner with the largest base of supporters; (b) Sanders's base is younger and younger people are more likely to be on social media (e.g. compared to 60+ year olds who are least likely to support Sanders); and (c) Sanders's supporters are more likely to believe that mainstream media ignore them, and thus seek alternative platforms to express themselves.
In addition, social-media platforms deliberately exploit the network effect, so that more begets more. Popular tweets are surfaced to the top where they get more attention; more attention leads to more likes and retweets, which is a positive reinforcement loop. [Mainstream media does this too. A single mean tweet by the President is repeated multiple times by each journalist who writes an article about it.]
As the late and great Hans Rosling once said, Big Data is a bunch of numerators without denominators. What we are missing here is the denominator: the total number of tweets sent by each candidate's supporters. Then, you need the numerator: the number of "mean" tweets by each candidate's supporters. Finally, you can compute the rate (proportion) of tweets that are "mean".
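To make the numerator-versus-denominator point concrete, here is a minimal Python sketch. It assumes each tweet has already been attributed to a candidate's supporters and labeled mean or not, which are the hard parts. The function name `mean_tweet_rates`, the candidates "A" and "B", and all the counts are made up for illustration; they are not real data.

```python
from collections import Counter


def mean_tweet_rates(tweets):
    """Return the proportion of mean tweets per candidate.

    `tweets` is an iterable of (candidate, is_mean) pairs. Both the
    attribution and the "mean" label are assumed to exist already.
    """
    totals = Counter()  # denominators: all tweets per candidate
    means = Counter()   # numerators: mean tweets per candidate
    for candidate, is_mean in tweets:
        totals[candidate] += 1
        if is_mean:
            means[candidate] += 1
    return {c: means[c] / totals[c] for c in totals}, dict(means)


# Hypothetical counts: supporters of A tweet ten times as much as those of B.
sample = (
    [("A", True)] * 50 + [("A", False)] * 950
    + [("B", True)] * 10 + [("B", False)] * 90
)

rates, counts = mean_tweet_rates(sample)
print(counts)  # raw numerators: A has five times as many mean tweets as B
print(rates)   # proportions: A's rate is half of B's
```

In this invented example, A's supporters send 50 mean tweets against B's 10, yet A's rate is 5% against B's 10%. Reporting only the counts would point to the opposite conclusion from the rates.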
***
Fourth, the computation so far assumes every tweet has equal value. Should our analysis account for the popularity of tweets? When the media cites a mean tweet, is this coming from a pseudonymous account with few followers or from a celebrity with millions? Should tweets with likes count more? Should retweets count more or less?
Lastly, a really good analysis has to deal with the fake data menace. It's hard to ascertain which of the Twitter accounts are humans, which are bots, and which are humans paid by campaigns to tweet.
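The last two questions (should popular tweets count more, and what to do about bots) can be explored with a small variation on the rate calculation. The Python sketch below is only one possible choice, not a recommendation: the weight `1 + retweets` is an arbitrary assumption, the `is_bot` flag presumes some bot classifier already exists, and the three tweets are invented.

```python
def weighted_mean_rate(tweets, weight=lambda t: 1 + t["retweets"], drop_bots=True):
    """Weighted proportion of mean tweets.

    Each tweet is a dict with hypothetical keys "is_mean", "retweets"
    and "is_bot". `weight` maps a tweet to a non-negative number.
    Returns None when there is nothing left to count.
    """
    kept = [t for t in tweets if not (drop_bots and t["is_bot"])]
    total = sum(weight(t) for t in kept)
    if total == 0:
        return None
    mean_total = sum(weight(t) for t in kept if t["is_mean"])
    return mean_total / total


# Invented tweets from one candidate's supporters.
tweets = [
    {"is_mean": True, "retweets": 0, "is_bot": False},    # obscure mean tweet
    {"is_mean": False, "retweets": 9, "is_bot": False},   # popular civil tweet
    {"is_mean": True, "retweets": 100, "is_bot": True},   # amplified bot tweet
]

equal_weight = weighted_mean_rate(tweets, weight=lambda t: 1)  # 1 of 2 kept tweets
by_retweets = weighted_mean_rate(tweets)                       # 1 / (1 + 10)
with_bots = weighted_mean_rate(tweets, drop_bots=False)        # (1 + 101) / 112
print(equal_weight, by_retweets, with_bots)
```

The same three tweets give a "meanness" rate of 50%, about 9%, or about 91% depending on the weighting and on whether the suspected bot is kept, which is why these choices need to be stated up front.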
***
The biases mentioned above cannot be corrected unless the analyst goes beyond social media sources. To really answer the question of which campaign has the meanest supporters, you'd also have to assess the mainstream media, like CNN, MSNBC, and Fox News. The talking heads also say mean things. While the voices of Sanders's supporters are disproportionately heard online, the voices of fans of other candidates dominate traditional media.
A great exercise for a budding data scientist is to come up with a principled way to answer the question of whether supporters of some candidates are "meaner" than others. You'd have to use a range of skills from defining a metric to scraping the data, from labeling to cleaning the data, from correcting bias to weighting the samples.
P.S. [3/5/2020] Just saw this last night: Elizabeth Warren being interviewed by Rachel Maddow. She was complaining about mean tweets again. I usually respect her, but her statement that "it's simply a fact", made when Maddow prompted her by saying mean tweets are particularly bad coming from Bernie Sanders's supporters, is very disappointing. What makes a fact a fact? Where is her evidence? No one, as far as I know, has done a proper statistical analysis as outlined here.
I wanted you to answer the question!!!
Posted by: Fernando | 03/03/2020 at 06:33 PM