The Sunday Review section of the New York Times on August 11 contains two pieces I'd like to discuss.
The first piece, by Ian Urbina, an investigative reporter, is called "I Flirt and Tweet. Follow Me at #Socialbot". He tells a fascinating story about how programmers create "bots" that impersonate people surfing the Web. For instance, researchers at a Brazilian university created a bot named "Carina Santos" who was rated by certain social media ranking companies as being more influential than Oprah Winfrey.
Urbina reports that:
Last year, the number of Twitter accounts topped 500 million. Some researchers estimate that only 35 percent of the average Twitter user's followers are real people. In fact, more than half of Internet traffic already comes from nonhuman sources like bots or other types of algorithms.
Somehow Urbina missed the implication of these bots. A huge part of Big Data is Internet traffic, including tweets. Recall that the Library of Congress made a big deal about saving every tweet ever produced. What does this say about the use of tweets to analyze "human" behavior and psychology? Would you ever believe a study that does not adjust for bot traffic?
***
Ironically, readers have the answer in the same section of the paper. There is another article, titled "Dr. Google Will See You Now", penned by Seth Stephens-Davidowitz, a recent Harvard PhD in economics. This article takes Google search data and draws conclusions about "depression" in America.
I could barely get past these words:
Not every health-related search using "depression" is a sign that someone is depressed, and not everyone who is depressed queries Google. But thanks to the incredibly large sample size, meaningful patterns emerge.
This has the same symptom as "causation creep," in which the researcher admits he/she doesn't have evidence of causation but immediately ignores that inconvenient fact and acts as if he/she has proved causation. Here, Stephens-Davidowitz goes on to assume that every health-related search using "depression" is a sign that someone is depressed, and that everyone who is depressed queries Google.
It's a sad state of academia if a PhD at Harvard thinks that an "incredibly large sample size" is a cure for badly conceived analysis.
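To see why sample size cannot rescue a biased proxy, here is a minimal simulation sketch. All the numbers in it are made up for illustration and do not come from the article: I assume a true depression rate of 10 percent, that 30 percent of depressed people search for "depression", and that 5 percent of non-depressed people (caregivers, students, the curious) do too. The function name `search_rate` and its parameters are hypothetical.

```python
import random


def search_rate(n, true_rate=0.10, p_search_if_depressed=0.30,
                p_search_if_not=0.05, seed=0):
    """Share of n simulated people who search for "depression".

    All rates are illustrative assumptions, not estimates from any study.
    """
    rng = random.Random(seed)
    searches = 0
    for _ in range(n):
        # Is this simulated person depressed?
        depressed = rng.random() < true_rate
        # Searching depends on depression status, but imperfectly.
        p = p_search_if_depressed if depressed else p_search_if_not
        if rng.random() < p:
            searches += 1
    return searches / n


# Under these assumptions the search rate converges to
# 0.10 * 0.30 + 0.90 * 0.05 = 0.075, not to the true rate of 0.10.
for n in (1_000, 100_000, 1_000_000):
    print(n, round(search_rate(n), 4))
```

As n grows, the estimate gets less noisy, but it settles on the wrong number. A bigger sample shrinks the random error and leaves the bias untouched.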
I had a discussion with someone thinking of using "Dr. Google" type data to prove something. I had not thought about it as much as you, but my immediate response was skepticism. To put those immediate objections in line with the depression article, I would posit the following three cases. First, a depressed person without access to the internet. Second, a depressed person who googles their depression but does not interact with people. Third, a depressed person who has caregivers, friends, family, and community that care about them, and 50 people touched by that person google depression in order to find out if there is anything they can do to help. Dr. Google records the three cases as 0, 1, and 50 searches. It would be nearly impossible to aggregate that up to any sort of meaningful number. This is also just one dimension that would not add up, and there are lots more (familiarity with the disease would impact googling but not incidence, etc.).
I would argue that time for googling is a factor that Mr. Stephens-Davidowitz does not obviously control for (people may be busy vacationing, getting ready for new school year, etc. on August 11th).
Posted by: Floormaster Squeeze | 08/26/2013 at 11:07 AM
not to mention the people searching for his article about depression!
as privacy concerns escalate, there will also be more bots created for the purpose of confusing data mining programs
Posted by: Kaiser | 08/27/2013 at 10:53 PM
If this is an example of economics research then it is no wonder economics is called the dismal science. The "science" behind this research is truly dismal! From the article: "Google searches, the biggest data source we currently have, are unambiguous: when it comes to our happiness, climate matters a great deal."
Gigantic pile of garbage in still equals garbage out.
Posted by: Carl Lemp | 08/31/2013 at 07:25 AM