The Sunday Review section of the New York Times on August 11 contains two pieces I'd like to discuss.
The first piece, by Ian Urbina, an investigative reporter, is called "I Flirt and Tweet. Follow Me at #Socialbot". He tells a fascinating story about how programmers create "bots" that impersonate people surfing the Web. For instance, researchers at a Brazilian university created a bot named "Carina Santos" who was rated by certain social media ranking companies as being more influential than Oprah Winfrey.
Urbina reports that:
Last year, the number of Twitter accounts topped 500 million. Some researchers estimate that only 35 percent of the average Twitter user's followers are real people. In fact, more than half of Internet traffic already comes from nonhuman sources like bots or other types of algorithms.
Somehow Urbina missed the implication of these bots. A huge part of Big Data is Internet traffic, including tweets. Recall that the Library of Congress made a big deal about saving every tweet ever produced. What does this say about the use of tweets to analyze "human" behavior and psychology? Would you ever believe a study that does not adjust for bot traffic?
Ironically, readers have the answer in the same section of the paper. There is another article, titled "Dr. Google Will See You Now", penned by Seth Stephens-Davidowitz, a recent Harvard PhD in economics. This article takes Google search data and draws conclusions about "depression" in America.
I could barely get past these words:
Not every health-related search using "depression" is a sign that someone is depressed, and not everyone who is depressed queries Google. But thanks to the incredibly large sample size, meaningful patterns emerge.
This is the same symptom as "causation creep," in which a researcher admits to having no evidence of causation, then immediately ignores that inconvenient fact and proceeds as if causation had been proved. Here, Stephens-Davidowitz goes on to treat every health-related search using "depression" as a sign that someone is depressed, and to assume that everyone who is depressed queries Google.
It's a sad state of academia if a PhD at Harvard thinks that an "incredibly large sample size" is a cure for badly conceived analysis.
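To see why sample size alone does not help, here is a minimal simulation sketch. Every number in it is invented for illustration and none comes from the article: I assume a true depression rate of 10 percent, that depressed people search for "depression" 20 percent of the time, and that everyone else does so 5 percent of the time (out of curiosity, for a relative, for homework, or because the "searcher" is a bot). The function name `search_rate` and its parameters are hypothetical.

```python
import random

# All rates below are made-up assumptions for illustration only.
TRUE_RATE = 0.10              # assumed share of people who are depressed
P_SEARCH_IF_DEPRESSED = 0.20  # assumed chance a depressed person searches
P_SEARCH_IF_NOT = 0.05        # assumed chance anyone else searches

# What the search rate converges to as the sample grows.
EXPECTED_SEARCH_RATE = (TRUE_RATE * P_SEARCH_IF_DEPRESSED
                        + (1 - TRUE_RATE) * P_SEARCH_IF_NOT)


def search_rate(n, seed=0):
    """Return the fraction of n simulated people who search for 'depression'."""
    rng = random.Random(seed)
    searches = 0
    for _ in range(n):
        depressed = rng.random() < TRUE_RATE
        p = P_SEARCH_IF_DEPRESSED if depressed else P_SEARCH_IF_NOT
        if rng.random() < p:
            searches += 1
    return searches / n


for n in (1_000, 50_000, 500_000):
    estimate = search_rate(n)
    print(f"n={n:>7}: search rate={estimate:.4f}, "
          f"true depression rate={TRUE_RATE:.4f}, "
          f"gap={abs(estimate - TRUE_RATE):.4f}")
```

Under these assumptions, a bigger sample only pins down the search rate more precisely. The search rate settles near `EXPECTED_SEARCH_RATE`, not near `TRUE_RATE`, and the gap between the two does not shrink as `n` grows. The noise goes away; the bias stays.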