Last year, Gizmodo capitalized on the fallout from the Ashley Madison hacking scandal and published a sensational article claiming that the website, which, if you haven't heard, promotes adultery, has "almost no" real women on it. The subtext is that millions of gullible, disloyal males were paying monthly fees to the website to do nothing or, cue the laugh track, to converse with "badly-designed robots." These men, according to Gizmodo, were buying a "fantasy," and "almost no" hookups were ever consummated.
That conclusion was ridiculous on its face. It assumes that men have no common sense. In fact, not one man but over 30 million men with zero common sense. (Ashley Madison has already been in business for over a decade.)
It didn't, however, stop the journalist from all kinds of emoting, such as:
the more I examined those 5.5 million female profiles, the more obvious it became that none of them had ever talked to men on the site, or even used the site at all after creating a profile [italics from the original]
In case that isn't extreme enough, she elaborated:
Actually, scratch that. As I’ll explain below, there’s a good chance that about 12,000 of the profiles out of millions belonged to actual, real women who were active users of Ashley Madison.
In casual conversations, I keep hearing this story, even though it was debunked within a week of its publication. As per the state of the media today, the debunking got a fraction of the press lavished on the original, dreadful piece of data journalism. Most of the outlets that helped spread the initial nonsense never bothered to print the retraction.
What the journalist faced was reality. As soon as the piece got published, a number of readers, both male and female, commented on their personal experiences with the website. There were couples who found love and eventually got married. There were female users who refuted the conclusion that Ashley Madison was "a science fictional future where every woman on Earth is dead." Besides, people with inside knowledge pointed out how the data were completely misinterpreted.
For those interested in "numbersense" in data analysis, it is very instructive to read both the original article and the retraction that is thinly disguised as a further juicy finding. How can a data analyst avoid falling into the traps that lead them to utterly invalid results?
A lot of "numbersense" has to do with how you process the information you have, and the information you don't. Of the data that you have, what do you believe and what don't you? Of the data you don't have, what assumptions do you make?
The journalist believed the hackers when they boasted that the data dump, 20 compressed gigabytes and all, contained all of Ashley Madison's customer data. This turned out to be wrong. Belatedly, it was shown that at least 550 tables exist in the data infrastructure, and the journalist analyzed just four tables! Tellingly, this additional knowledge did not stop her from issuing even more "insights" in the second article of the series.
Further, people with inside knowledge gave the reporter the hint that the information she needed was hiding in plain sight. There was a column in the data table called "ishost" (i.e. "is this user a host?"), and a "host" is internal jargon for a "chat bot" (incidentally, these are making the news by way of Facebook and Microsoft). According to the ishost column, there were only 70,000 or so bot accounts, far from the millions of accounts claimed by the journalist!
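To make the point concrete, here is a minimal sketch of what counting bots from an explicit flag looks like, as opposed to inferring them from indirect signals. The rows are toy data invented for illustration, not anything from the leaked tables, and the helper name count_by_flag is my own; only the column name "ishost" comes from the story above.

```python
from collections import Counter

# Toy rows invented for illustration; they are NOT taken from the leaked data.
accounts = [
    {"id": 1, "ishost": 1},
    {"id": 2, "ishost": 0},
    {"id": 3, "ishost": 0},
    {"id": 4, "ishost": 1},
    {"id": 5, "ishost": 0},
]


def count_by_flag(rows, flag="ishost"):
    """Tally rows by the truth value of an explicit flag column."""
    return Counter(bool(row[flag]) for row in rows)


counts = count_by_flag(accounts)
bot_share = counts[True] / len(accounts)
print(f"bots: {counts[True]}, non-bots: {counts[False]}, bot share: {bot_share:.0%}")
```

The count is only as good as the flag, of course, so the meaning of "ishost" still has to be confirmed by someone who knows the system.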
In the original article, the journalist cheerfully related her process of discovering that no real women used the website. The high point was: "three data fields changed everything." These fields supposedly measured the frequency of specific actions on the site, such as sending emails to other users. It turned out that the columns did not measure any human activity at all. They recorded bot activity, thus invalidating her entire analysis.
For example, "mail_last_time" did not mean "a timestamp indicating the last time a member checked the messages in their Ashley Madison inbox," as asserted in the original article. In fact, insiders told the journalist it indicated the last time a bot sent an email to an Ashley Madison member.
This is amateur hour: to infer the content of a column of data from the name of the column. One can never guess the intention of the developer who names the column, let alone know whether the column has deviated from its initial definition over time.
And this little situation illustrates perfectly why analysts of Big Data owe it to consumers to be extra careful. Much of Big Data is observational, which means the origin of such data is obscure, or obscured by the organizational layering, or washed away by time. The current practice in database development scoffs at data dictionaries or data flow diagrams or any kind of documentation. The spirit of "agile" development devalues stability in the data environment. So it has become more arduous than ever to understand one's data.
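One modest habit that guards against this trap is to keep even a tiny data dictionary and refuse to analyze any column whose meaning nobody has confirmed. The sketch below is only an illustration under my own assumptions: the dictionary layout, the "confirmed_by" field, the function name and the two placeholder column names are all hypothetical. The two real column names and their definitions are the ones the insiders supplied, as described above.

```python
# Hypothetical data dictionary: column name -> definition and who confirmed it.
DATA_DICTIONARY = {
    "ishost": {
        "definition": "account is a host, i.e. a chat bot",
        "confirmed_by": "insider",
    },
    "mail_last_time": {
        "definition": "last time a bot sent an email to this member",
        "confirmed_by": "insider",
    },
    # Placeholder for a meaning guessed from the column name alone.
    "example_guess_column": {
        "definition": "guessed from the name",
        "confirmed_by": None,
    },
}


def unverified_columns(columns, dictionary):
    """Return the columns whose meaning has not been confirmed by anyone."""
    return [
        col for col in columns
        if not dictionary.get(col, {}).get("confirmed_by")
    ]


# "undocumented_column" is another placeholder: it has no dictionary entry at all.
wanted = ["ishost", "mail_last_time", "example_guess_column", "undocumented_column"]
blocked = unverified_columns(wanted, DATA_DICTIONARY)
if blocked:
    print("Ask before using:", ", ".join(blocked))
```

The code is trivial on purpose. The hard part is the asking, not the bookkeeping.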
The essence of numbersense is captured here. Are you the analyst who looks at "mail_last_time" and convinces yourself that it measures human activity and thus proves that no female humans exist on the website? Or are you the analyst who ceaselessly asks questions to get to the bottom of what that column measures?