Seth Stephens-Davidowitz has a new book out early this year, "Don't Trust Your Gut", which he kindly sent me for review. The book is Malcolm Gladwell meets Tim Ferriss - part counter-intuition, part self-help. Seth tackles big questions: how to find love? how to raise kids? how to get rich? how to be happier? He invariably believes that big data reveal universal truths on such matters.
In 2013, I speculated about the challenges faced by big data analysts in this piece for Significance. Big Data has become a real thing in the decade since its publication. Seth's book interests me as a progress report on the state of "big data analytics".
If there is a common core of this style of analytics, as demonstrated by Seth's examples, I'd include the following elements:
a) Big Data is defined less by the amount of data than by several characteristics I previously labeled OCCAM (see this post).
The data are typically collected by passive observation (e.g. tax records, dating app usage, artist exhibit schedules). Meaningful controls are absent (e.g. no non-app users, no failed artists). The dataset is believed to be complete. The data aren't specifically collected for the analysis (an important exception is the happiness data collected from apps for that specific purpose). Several datasets are merged to investigate correlations.
b) Much - though not all - of the analyses use the most rudimentary statistics, such as statistical averages.
This can be appropriate, if one insists one has all the data, or "essentially" all. An unstated axiom is that the sheer quantity of data crowds out any bias. This is not a new belief: as long as Google has existed, marketing analysts have always claimed that Google search data are fully representative of all searches since Google dominates the market. (Seth's previous book covers his analyses of Google search results.)
c) If the analyst incorporates model adjustments, these adjusted models are treated as full cures of all statistical concerns.
The last few chapters on activities that cause happiness or unhappiness report numerous results from adjusted models of underlying data collected from 60,000 users of specially designed mobile apps. The researchers broke down 3 million logged events by 40 activity types, hour of day, day of week, season of year, location, among other factors. For argument's sake, let's say the users came from 100 places, ignore demographic segmentation, and apply zero exclusions. Then, the 3 million points fell into 40*24*7*4*100 = 2.7 million cells. The events are surely spread unevenly, but even if they were spread evenly, each cell would hold an average of just 1.1 events. That means many cells contain zero events. By the magic of model adjustments, all cells have non-zero estimated activities and associated happiness quotients. The estimates in many cells reflect an underlying model that hasn't been confirmed with data - and the credibility of these estimates rests with the reader's trust in the model structure.
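For readers who want to check the cell arithmetic, here is a minimal Python sketch. The factor levels and the 3 million events are the for-argument's-sake figures from the paragraph above. The share of empty cells is my own illustration, not a number from the book: it assumes events land in cells independently and uniformly at random, so that each cell's count is roughly Poisson.

```python
import math

# Factor levels from the back-of-envelope count in the text
# (100 places is the for-argument's-sake figure, not a number from the book).
levels = {"activity": 40, "hour": 24, "weekday": 7, "season": 4, "place": 100}
events = 3_000_000

cells = math.prod(levels.values())
mean_per_cell = events / cells

# Assumption: events fall into cells independently and uniformly at random,
# so each cell's count is approximately Poisson with this mean.
share_empty = math.exp(-mean_per_cell)

print(f"cells: {cells:,}")
print(f"mean events per cell: {mean_per_cell:.2f}")
print(f"share of empty cells under a uniform spread: {share_empty:.0%}")
```

Under that uniform assumption, about a third of the cells hold no events at all. Real usage is lumpy (few people log activities at 4 a.m.), so the true share of empty cells is likely higher.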
I observed a similar phenomenon when reading the well-known observational studies of Covid-19 vaccine effectiveness. Many of these studies adjust for age, an obvious confounder. Having included the age term, which quite a few studies proclaimed to be non-significant, the researchers spoke as if their models were free of any age bias. Taking this line of logic further, such a belief implies that no one should bother with running expensive randomized clinical trials because a self-selected trial coupled with post-hoc regression adjustments is a perfect substitute!
d) A blurred line barely delineates using data as explanation and as prescription.
Take, for example, the revelation that people who own real estate businesses have the highest chance of being a top 0.1% earner in the U.S., relative to other industries. This descriptive statistic is turned into a life hack, that people who want to get rich should start real-estate businesses. Nevertheless, being able to explain past data is different from being able to predict the future.
A kind of feedback loop is also at play. The average odds of getting rich surely depend on how many people are entering the industry. So if people read Seth's book, and rush into real estate, the data may no longer rank this industry as best.
e) Most of the featured big-data research aims to discover universal truths that apply to everyone.
For example, an eye-opening chart in the book shows that women who were rated bottom of the barrel in looks have half the chance of getting a response in a dating app when they messaged men in the most attractive bucket... but the absolute response was still about 30%. This produces the advice to send more messages to presumably "unattainable" prospects.
Such a conclusion assumes that the least attractive women are identical to the average women on factors other than attractiveness. It's possible that such women who approach the most attractive-looking men have other desirable assets that the average woman does not possess.
It's an irony because with "big data", it should be possible to slice and dice the data into many more segments, moving away from the world of "universal truths," which are just statistical averages by another name. In the last chapter of the book, Seth describes one study that tackles this problem.
***
I wish Seth had devoted more space to explaining model adjustments and statistical issues, and discussing whether the remedies are sufficient. His target audience is clearly people with minimal mathematical training, and I understand his decision not to open such cans of worms. The references at the end should satisfy those interested in exploring further.
As with Gladwell, I recommend reading this genre with a critical eye. Think of these books as offering fodder to exercise your critical thinking. Don't Trust Your Gut is a light read, with some intriguing results of which I was not previously aware. I enjoyed the book, and have kept pages of notes about the materials. The above comments should give you a guide should you want to go deeper into the analytical issues.
I think there is a lot more that can be done with big data; we are just seeing the tip of the iceberg. So I agree with Seth that the potential is there. Seth is more optimistic about the current state than I am.