Last year, Gizmodo capitalized on the fallout from the Ashley Madison hacking scandal and published a sensational article claiming the website that, if you haven't heard, promotes adultery, has "almost no" real women on it. The subtext is that millions of gullible, disloyal males were paying monthly fees to the website to do nothing or, cue the laugh track, to converse with "badly-designed robots." These men, according to Gizmodo, were buying a "fantasy," and "almost no" hookups were ever consummated.
That conclusion was ridiculous on its face. It assumes that men have no common sense: not one man, but over 30 million men with zero common sense. (Ashley Madison has been in business for over a decade.)
It didn't, however, stop the journalist from all kinds of emoting, such as:
the more I examined those 5.5 million female profiles, the more obvious it became that none of them had ever talked to men on the site, or even used the site at all after creating a profile [italics from the original]
In case that isn't extreme enough, she elaborated:
Actually, scratch that. As I’ll explain below, there’s a good chance that about 12,000 of the profiles out of millions belonged to actual, real women who were active users of Ashley Madison.
In casual conversations, I keep hearing this story. Yet the story was debunked within a week of its publication. As per the state of the media today, the debunking got a fraction of the press lavished on the original, dreadful piece of data journalism. Most of the outlets that helped spread the initial nonsense never bothered to print the retraction.
What the journalist faced was reality. As soon as the piece got published, a number of readers, both male and female, commented on their personal experiences with the website. There were couples who found love and eventually got married. There were female users who refuted the conclusion that Ashley Madison was "a science fictional future where every woman on Earth is dead." Besides, people with inside knowledge pointed out how the data were completely misinterpreted.
For those interested in "numbersense" in data analysis, it is very instructive to read both the original article and the retraction that is thinly disguised as a further juicy finding. How can a data analyst avoid falling into the traps that lead them to utterly invalid results?
***
A lot of "numbersense" has to do with how you process the information you have, and the information you don't. Of the data that you have, what do you believe and what don't you? Of the data you don't have, what assumptions do you make?
The journalist believed the hackers when they boasted that the data dump, 20 compressed gigabytes and all, contained all of Ashley Madison's customer data. This turned out to be wrong. Belatedly, it was shown that at least 550 tables exist in the data infrastructure, and the journalist analyzed just four tables! Tellingly, this additional knowledge did not stop her from issuing even more "insights" in the second article of the series.
Further, people with inside knowledge gave the reporter the hint that the information she needed was hiding in plain sight. There was a column in the data table called "ishost" (i.e. "is this user a host?"), and a "host" is internal jargon for a "chat bot" (incidentally, these are making the news by way of Facebook and Microsoft). According to the ishost column, there were only 70,000 or so bot accounts, far from the millions of accounts claimed by the journalist!
In the original article, the journalist cheerfully related her process of discovering that no real women used the website. The high point was: "three data fields changed everything." These fields supposedly measured the frequency of specific actions on the site, such as sending emails to other users. It turned out that the columns did not measure any human activity at all. They recorded bot activity, thus invalidating her entire analysis.
For example, "mail_last_time" did not mean "a timestamp indicating the last time a member checked the messages in their Ashley Madison inbox," as asserted in the original article. In fact, insiders told the journalist it indicated the last time a bot sent an email to an Ashley Madison member.
This is amateur hour: inferring the content of a column of data from the name of the column. One can never guess the intention of the developer who named the column, let alone know whether the column has deviated from its initial definition over time.
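One way to put a column name to the test before trusting it is to check how the column behaves across groups of accounts. What follows is only a minimal sketch in Python: the rows are invented toy data, not the leaked data, and the helper fill_rate_by_group is a hypothetical name of my own. The column names "ishost" and "mail_last_time" are the ones discussed above; the "gender" field is an assumption for illustration.

```python
from collections import Counter

# Hypothetical toy rows for illustration only; NOT real Ashley Madison data.
# "ishost" and "mail_last_time" are named in the post; "gender" and all
# values are invented assumptions.
ROWS = [
    {"gender": "M", "ishost": 0, "mail_last_time": "2015-06-01"},
    {"gender": "M", "ishost": 0, "mail_last_time": "2015-06-03"},
    {"gender": "M", "ishost": 0, "mail_last_time": "2015-05-20"},
    {"gender": "M", "ishost": 0, "mail_last_time": None},
    {"gender": "F", "ishost": 0, "mail_last_time": None},
    {"gender": "F", "ishost": 0, "mail_last_time": None},
    {"gender": "F", "ishost": 0, "mail_last_time": None},
    {"gender": "F", "ishost": 1, "mail_last_time": None},
    {"gender": "F", "ishost": 1, "mail_last_time": None},
]


def fill_rate_by_group(rows, column, group_key):
    """Return the share of rows with a non-null `column` per `group_key` value."""
    totals, filled = Counter(), Counter()
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[column] is not None:
            filled[group] += 1
    return {group: filled[group] / totals[group] for group in totals}


# How many accounts are flagged as bots ("hosts") in the toy data?
host_count = sum(row["ishost"] for row in ROWS)

# Is the column populated evenly, or only for some kinds of accounts?
rates_by_gender = fill_rate_by_group(ROWS, "mail_last_time", "gender")
rates_by_host = fill_rate_by_group(ROWS, "mail_last_time", "ishost")

print("host accounts:", host_count)
print("fill rate by gender:", rates_by_gender)
print("fill rate by ishost:", rates_by_host)
```

A lopsided fill rate like the one in this toy example does not tell you which reading of the column is right. It fits "women never check their inbox" just as well as "bots only write to men." What it does tell you is that the name alone cannot settle the matter, and that the next step is to ask the people who built the system.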
And this little situation illustrates perfectly why analysts of Big Data owe it to consumers to be extra careful. Much of Big Data is observational, which means the origin of such data is obscure, or obscured by the organizational layering, or washed away by time. The current practice in database development scoffs at data dictionaries or data flow diagrams or any kind of documentation. The spirit of "agile" development devalues stability in the data environment. So it has become more arduous than ever before to understand one's data.
***
The essence of numbersense is captured here. Are you the analyst who looks at "mail_last_time" and convinces yourself that it measures human activity and thus proves that no female humans exist on the website? Or are you the analyst who ceaselessly asks questions to get to the bottom of what that column measures?
A quibble: you mention 30 million but the site's revenue doesn't come close to supporting that number. That is, there may be huge numbers of men signed up but revenue suggests a relatively small number buy credits necessary to engage in conversation. It is the credit buying system which I think motivates the use of the bots: if you can engage a man, he might buy some credits. I note as an anecdote that one woman journalist who set up a profile was contacted by 50 men in a month. I gather they sell packages of credits with a guarantee/refund only if you buy the top level AND jump through a lot of hoops of site usage. If the info I found is correct, they get a lot of $50 buyers who then might contact 10 women's profiles at 5 credits a pop, which suggests a lot of churn and thus the never-ending barrage of ads for the site. What interests me is: a) what number of actual women are necessary to maintain a site like this, given a series of different churn assumptions, meaning what is the leverage on a female profile and b) the effect of the top tier, which is not only expensive but requires significant site engagement. Two things interest me in that: a) how many of those buyers actually complete the necessary engagement for a potential refund and b) is that how the site works best, meaning is that actually where real life hook-ups are generated on the site, that if you try the hardest then like at many things in life you get results? So in sum, I thought the AM stories were junky but I think there's a ton of interesting stuff in there.
Posted by: Jonathan | 05/09/2016 at 09:55 AM
Jonathan: My post is written only from the perspective of a data analyst approaching a pile of data and how to prevent one from drawing nonsensical conclusions, such as that there were no active women users on the site. Given that Ashley Madison is a real business, the topics you mentioned are certainly interesting, especially to anyone who is invested in the success of that business. But the same issues arise with any subscription business: Netflix, for instance, makes a ton of money from people who don't watch frequently.
Posted by: Kaiser | 05/09/2016 at 10:08 AM