Last week, my Columbia students discussed this nice article in the New York Times called "The Most Detailed Map of Gay Marriages in America". (link)
The center of the article is this map:
I asked the students to identify the problem that this dataviz is supposed to address. Someone responded that it tells us where gay married couples are found geographically. I asked her what the answer to that question was. She said they are mostly concentrated on the coasts, with relatively few in middle America.
Then, I asked what assumption was needed for that comment to be valid, something that is not found on the map itself. If you look at the legend of the map, you'll see that the data being plotted are proportions of married couples that are same-sex, not counts of same-sex couples. Thus, when one draws that conclusion about coastal skew, one is using the outside knowledge that the population of the U.S. is concentrated on the coasts. (If the population density were sufficiently higher in the middle than on the coasts, there could be more same-sex couples in the middle despite the lighter shades of orange.)
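Here is a toy calculation (all numbers invented) that makes the point: a lighter shade of orange can coexist with a larger count.

```python
# Toy numbers (invented): a lighter-shaded region can still contain
# MORE same-sex couples than a darker-shaded one.
regions = {
    "coastal area": {"pct_same_sex": 1.0, "married_couples": 500_000},   # darker shade
    "middle area":  {"pct_same_sex": 0.3, "married_couples": 2_000_000}, # lighter shade
}

for name, r in regions.items():
    count = r["pct_same_sex"] / 100 * r["married_couples"]
    print(f"{name}: {count:,.0f} same-sex married couples")
# coastal area: 5,000 vs middle area: 6,000 -- shade alone misleads on counts.
```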
Next, I wanted an answer to a simple question: which (three-digit) area code (the unit of analysis on this map) has the highest density of same-sex married couples? Students scratched their heads. This "most detailed" map is not well equipped to answer that question. The issue raised is whether the amount of detail is an obvious virtue of a dataviz.
We then touched upon two other topics that are also very important:
1) How was this data obtained and computed? The author does a great job explaining two ways of arriving at these counts, a sample survey versus administrative records (tax filings).
2) How should the data be interpreted? The author walks through a number of unexpected comparisons, all of which point to the importance of statistical controls. For example, why do same-sex couples make more money than opposite-sex couples? Why do lesbian couples make less than gay-male couples?
A very important article from the Times starts with the following sentence:
Want to invisibly spy on 10 iPhone owners without their knowledge? Gather their every keystroke, sound, message and location? That will cost you $650,000, plus a $500,000 setup fee with an Israeli outfit called the NSO Group.
In the U.S., there is a disconnect: distrust of government is at an all-time high, yet the same people's trust in that government's handling of their private data has apparently been unshaken.
NSO, the company being profiled, claims to have an "ethics" committee to determine to whom they will sell (out) and to whom they won't. It's not clear whether this "ethics" committee vets the methods used by NSO to collect data, such as "baiting targets to click unwittingly on texts containing malicious links".
This bit is also interesting:
Pegasus [the surveillance system] can use the camera to take snapshots or screen grabs. It can deny the phone access to certain websites and applications, and it can grab search histories or anything viewed with the phone’s web browser. And all of the data can be sent back to the agency’s server in real time.
What might make it easier for others to remotely control your gadget life? More connected devices (e.g. the internet of things), putting stuff in the cloud, simple user interfaces that hide inner functions, always-on connections, auto-updating, real-time anything, never logging out of accounts, etc. Basically, anything that allows software to establish connections to remote servers without identifying itself.
My friend John R. sent me this excellent Buzzfeed feature on music playlists.
Here are some choice quotes to whet your appetite:
In 2014, when Tim Cook explained Apple’s stunning $3 billion purchase of Beats by repeatedly invoking its “very rare and hard to find” team of music experts, he was talking about these guys. And their efforts since, which have pointed toward curated playlists (specifically, an industrial-scale trove of 14,000 and counting) as the format of the future, have helped turn what was once a humble labor of love for music fans into an increasingly high-stakes contest between some of the richest companies in the world.
The algorithm that can judge the merits of new Gucci Mane, or intuit that you want to sing “A Thousand Miles” by Vanessa Carlton in the shower, has yet to be written... the job has fallen to an elite class of veteran music nerds — fewer than 100 working full-time at either Apple, Google, or Spotify — who are responsible for assembling, naming, and updating nearly every commute, dinner party, or TGIF playlist on your phone.
Spotify says 50% of its more than 100 million users globally are listening to its human-curated playlists (not counting those in the popular, algorithmically personalized “Discover Weekly”), which cumulatively generate more than a billion plays per week.
Machines have always been great at repetitive tasks that follow set rules. But many problems do not fall into that category.
We’ve come to expect that virtually all of our problems can be solved with code, so much so that we summon it unthinkingly before doing almost anything...But what if music is somehow different? What if there’s something immeasurable but essential in the space between what is now called “discovery” and, you know, that old stupidly human ritual of finding and falling in love with a song?
It's the revenge of the humans. Recommendation engines are not good enough. This doesn't mean "science" is not important. The article later explains:
Hypotheses, of course, are meant to be tested, and Spotify curators regularly make adjustments to playlists based on data that shows how people are actually interacting with them.
One frequently used application is a performance tracker called “PUMA,” or Playlist Usage Monitoring and Analysis, which breaks down each song on a playlist by things like number of plays, number of skips, and number of saves.
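PUMA's actual schema is not public, but the kind of per-song breakdown described is easy to picture. Here is a minimal sketch with an invented event format:

```python
# A minimal sketch of a per-song breakdown like the one described;
# the (song, action) event format here is invented for illustration.
from collections import Counter

events = [  # hypothetical playlist interaction log
    ("Song A", "play"), ("Song A", "skip"), ("Song A", "play"),
    ("Song B", "play"), ("Song B", "save"),
    ("Song C", "play"), ("Song C", "skip"),
]

counts = Counter(events)
for song in sorted({s for s, _ in events}):
    row = {action: counts[(song, action)] for action in ("play", "skip", "save")}
    print(song, row)
```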
This is really the way forward for "machines". Machines and humans are both needed: the whole ought to be greater than the sum of its parts. Forget the idea that one replaces the other.
In a post titled "GIGO" (for those unfamiliar with the acronym: Garbage In, Garbage Out), Andrew Gelman wrote this gem:
as long as the “garbage out” gets media attention, there will always be somebody willing to supply the “garbage in.”
The general drift of that post, and the previous one that led me to it, is a critique of the management consulting industry. Having worked in that industry earlier in my career, plus having studied at the school at which Michael Porter is a famous dude, I am familiar with how this industry works.
We even had a term called "blank sliding." This is exactly what you think it means... consultants are trained to write a "draft" of a client presentation consisting of all blank slides that only have headers written on them. Headers taken together tell the story of the presentation. Eventually, the slides are filled in with content. In a nutshell, the story is created before the meat is ready.
While this sounds horrible, it is not really that bad, if used properly. It creates a culture of getting the story right and focusing a lot of attention on the story. If the analyst uses this method directionally, allowing rewriting of headers, or maybe even the whole story, as the contents are filled in, it could lead to a good product. Frequently, however, it creates the situation described by the Gelman quote.
PS. Andrew is probably right about Michael Porter being paid big bucks. He is one of the most expensive public speakers (link).
A GMO labeling law has arrived in the US, albeit one that has no teeth (link). For those who don't want to click on the link: the law was passed in haste to pre-empt a more stringent Vermont law, it defines GMO narrowly, businesses do not need to put word labels on packages (they can, for example, provide an 800-number), and violators will not be punished.
One of the arguments against GMO labeling is that it is unscientific because (some) scientists are 100% certain that GMO foods are safe. (e.g. this Boston Globe editorial)
Any good scientist knows that scientific "truths" are true only until they are proven otherwise. Science is a continuous process of making hypotheses and finding data to confirm or reject them. The Bayesian way of thinking is very useful here: being true is a matter of probability, and more confirmatory data increases the probability that a given hypothesis is true.
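For concreteness, here is the Bayesian update in miniature, with invented numbers:

```python
# Bayes' rule in one line (numbers invented): confirmatory evidence nudges
# the probability of a hypothesis upward, but never to a flat 100%.
prior = 0.90           # current belief that "this GMO food is safe" is true
likelihood_ratio = 4   # the new evidence is 4x more likely if the hypothesis is true

posterior = prior * likelihood_ratio / (prior * likelihood_ratio + (1 - prior))
print(f"{prior:.0%} -> {posterior:.1%}")  # 90% -> 97.3%
```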
So why is GMO labeling good science?
In fact, I'd go so far as to say that there is no science without GMO labeling.
How is nutritional science done today? What is the research that tells us coffee is good, butter is good, salt is bad, etc.? Granted, this is a shaky field that has issued lots of false results. But the usual form of analysis goes like this: conduct a large survey of consumers and ask them about their diet (e.g. how much red meat do you eat each week?); obtain information about their health status, either through the same survey, a different survey, or direct measurements if they are part of a research study; then correlate the dietary data and the health data.
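A minimal sketch of that recipe, using made-up survey data, is below. Note the catch for what follows: without GMO labels, the analogous dietary column for GMO intake can never be collected.

```python
# A minimal sketch of the survey-then-correlate recipe described above.
# The data and column names are invented for illustration.
import pandas as pd

survey = pd.DataFrame({
    "red_meat_servings_per_week": [2, 7, 0, 5, 10, 3],             # dietary question
    "cholesterol_mg_dl":          [180, 220, 160, 205, 240, 185],  # health measure
})

# The analysis boils down to correlating diet with health status.
r = survey["red_meat_servings_per_week"].corr(survey["cholesterol_mg_dl"])
print(f"correlation between red meat intake and cholesterol: {r:.2f}")
```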
Now, imagine you want to study whether eating GMO foods affects your health, either positively or negatively. Your survey question will be something along the lines of "How much GMO food did you eat last week?"
Without GMO labeling, there is no way for respondents to answer that question, and thus no way to conduct such research. This is why GMO labeling is good science. Not labeling GMOs is bad science - in effect, it mandates that no science be done.
So I recently moved and needed to find the optimal subway ride up to Columbia. I have been going back and forth between my two choices, collecting some data to help make up my mind. Both routes require two transfers, but only the first leg differs. In other words:
Route 1: A -> B -> C
Route 2: X -> B -> C
Here, the "nodes" (A, X, B, C) are train lines and the first arrow is the Times Square subway station. There are two ways to get from my apartment to Times Square, after which the two routes are identical.
This means the problem reduces to comparing:
Route 1s: A
Route 2s: X
How long does it take to get to Times Square using line A versus line X? Based on my experience so far, A takes about 5 minutes longer than X. Under normal circumstances, X is the choice, as I get to Times Square 5 minutes earlier. Then again, the entire trip takes 35-40 minutes, so 5 minutes hardly seems like a world of difference.
So far, I have ignored perhaps the most important piece of data: how variable are the travel times on line A versus line X? Each of those legs consists of part walking, part waiting at the platform, and part riding on the train. The waiting is the key source of variability.
The 5-minute difference was based on smooth transitions on both lines. If you have used NY subways, you'll know that wait times of 5-15 minutes are very common. So if line X tends to require longer waits at the platform than line A, the 5-minute advantage can easily be wiped out!
So my next data-collection task is to figure out the distribution of wait times at the respective platforms.
I cover the average versus variability concept in Chapter 1 of Numbers Rule Your World (link). This concept is related to the signal and noise concept that Nate Silver made famous. The normal difference of 5 minutes is the "signal". The "noise", that is to say, the variability of wait times on the platform, may be so strong that you cannot "see" the signal. This is what I mean by "wiping out the difference."
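To make this concrete, here is a quick simulation. The travel-time distributions are invented, chosen only to illustrate how variability can drown a 5-minute signal:

```python
# A quick simulation of signal versus noise (all distributions invented).
# Line X is 5 minutes faster on the walk+ride "base", but suppose its
# platform waits are more variable than line A's.
import random
random.seed(0)

def trip(base_minutes, max_wait_minutes):
    # Walking and riding are steady; the platform wait is the random part.
    return base_minutes + random.uniform(0, max_wait_minutes)

trials = 10_000
x_wins = sum(trip(15, 15) < trip(20, 5) for _ in range(trials))  # X vs A
print(f"Line X beats line A in only {x_wins / trials:.0%} of trips")
```

With these made-up numbers, the 5-minute signal is completely swallowed by the noise: X wins only about half the time.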
Well, it didn't take long for private investigators to find the next big thing: Big Data.
Bloomberg reported on a company called IDI, which sells our data to private investigators. (link)
Unfortunately, this article is short on details and long on sensationalized catchphrases ("Every move you make. Every click you take...")
The CEO of IDI boasted that "We have data on that 21-year-old who’s living at home with mom and dad."
We really ought to be clear about what we mean by "data".
The definition of "data" is very loose. Much of this "data" consists not of direct records but of "model outputs."
For example, you can't measure things like how much electricity the 21-year-old in that household is using, separate from the power used by the parents. The raw data is not available, not even at the utility company. However, it is possible to create a statistical model to allocate the electricity consumption among the household members.
For a long time now, data providers have been selling businesses things like the income of individuals. Usually, the raw data come from Census records but the Census Bureau does not ever release individual data - in fact, a lot of care is taken to make sure individual incomes are not made public. Any released data are aggregated. Then, analysts build models to project the incomes at the individual (or household) level. These models are only moderately accurate.
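To see what a "model output" looks like, here is a deliberately crude sketch. The coefficients and covariates are invented; real vendors' models are proprietary.

```python
# A crude sketch of how a "modeled" income gets attached to an individual:
# aggregate data in, individual-level estimate out. Entirely hypothetical.
BLOCK_GROUP_MEDIAN_INCOME = 62_000  # the only figure the Census releases

def modeled_income(median, years_of_education, owns_home):
    # Adjust the public aggregate using whatever covariates the vendor holds.
    adj = 1 + 0.03 * (years_of_education - 13)
    return median * adj * (1.15 if owns_home else 0.90)

print(modeled_income(BLOCK_GROUP_MEDIAN_INCOME, 16, True))   # "his income"
print(modeled_income(BLOCK_GROUP_MEDIAN_INCOME, 12, False))  # "her income"
# Neither number was ever observed; both are model outputs sold as "data".
```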
If you are a driver, you might take note that companies are out there snapping pictures of your license plate everywhere you go, and selling your locations. This is mentioned in the Bloomberg article. I have discussed this on this blog as well: the technology burst onto the scene with Google Maps, then police forces everywhere started using it, and now we have data companies doing the same.
One good thing about this... next time your friend tells you "I am two blocks away and will be there in 5 minutes," you can say "according to IDI, you are 20 blocks away and won't be here for another 30 minutes!"
As word of plummeting Apple Watch sales spread last week, an entrepreneur went on Medium to sing a different tune: his survey apparently uncovered a "paradox" - Apple Watch users are "overwhelmingly satisfied, yet not recommending" the product.
The "overwhelming" bit comes from this chart:
These data portray a hugely successful product in which almost nobody expressed any negative feelings.
This next chart is even more impressive. Apparently, "more than 94% of the people" are still wearing their Apple Watches daily! The researcher stated: "In this study, only 1% of our respondents have stopped wearing their Apple Watches." Wow!
Of course, in the world of data, there is a pretty reliable rule: "if it's too good, it's probably not true".
The researcher did also mention this telling detail: "Of course we expect the vast majority of our panelists who have stopped wearing Apple Watch have also stopped responding to our survey."
This was the perfect example for the class I was teaching this week, in which we discuss the idea that huge amounts of tracking data will supplant surveys in the future. Tracking data is the perfect example of what Hans Rosling calls "bags of numerators without denominators". Tracking data only tracks events that happen. This survey only gets responses from people who are wearing the Watch. You are not going to learn anything about the people who have stopped wearing the Watch, nor are you going to learn anything about the events that did not happen.
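A small simulation (all parameters invented) shows how badly the missing denominator can distort a satisfaction estimate:

```python
# Survivorship bias in miniature (all parameters invented): dissatisfied
# owners stop wearing the Watch AND stop answering the survey.
import random
random.seed(1)

owners = [{"satisfied": random.random() < 0.60} for _ in range(10_000)]
for o in owners:
    keep_rate = 0.95 if o["satisfied"] else 0.30  # the dissatisfied mostly quit
    o["still_wearing"] = random.random() < keep_rate

respondents = [o for o in owners if o["still_wearing"]]  # only they answer

true_rate = sum(o["satisfied"] for o in owners) / len(owners)
survey_rate = sum(o["satisfied"] for o in respondents) / len(respondents)
print(f"True satisfaction among all owners: {true_rate:.0%}")    # ~60%
print(f"Satisfaction among survey takers:   {survey_rate:.0%}")  # ~83%
```

The survey numerator looks great; the abandoners simply vanished from the denominator.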
This past week, the New York City subway lines were suffering from egregious delays. The worst was waiting almost half an hour for a 6 train at Canal Street. The ventilation inside the station was so poor that one felt steamed alive.
Previously, I have written about, and praised, the count-down clocks in the subways (link). NYC was late to the game but better late than never!
When I entered the Canal Street station, the display showed impending arrivals only for the downtown trains. I was going uptown. The count-down clocks on the uptown platform displayed no information. After waiting about 20 minutes, all of a sudden, the clocks woke up, showing the next train arriving in 5 minutes.
You could sense the excitement on the platform. Everyone woke up from the heat stupor. An announcement drifted through the station, adding a measure of assurance.
Three minutes passed. The clock now went from "2 min" to "Delay".
Eventually, a train approached. And it went right past us. This local train was supposed to stop at Canal Street, but it blew through without warning.
The clock acted as if nothing happened. A new entry showed up, indicating that the next train would arrive in 6 minutes.
And this time, the train did stop, to the relief of dozens of sweating passengers.
Imagine you are the analyst who pulled down the data to analyze subway wait times at Canal Street. The dataset contained all notices that showed up on the count-down clocks. This is an event log, so when there are no notices, there will be no entries.
For the half hour that I waited, the following entries would have been added to the log:
Train arriving, 5 min
Train arriving, 4 min
Train arriving, 3 min
Train arriving, 2 min
Train arriving, Delay
Train arriving, 6 min
Train arriving, 5 min
Train arriving, 4 min
Train arriving, 3 min
Train arriving, 2 min
Train arriving, 1 min
Train arriving, 0 min
The analyst is in danger of making erroneous "insights" (see the sketch after this list) such as:
that the longest wait during this period was 6 minutes
that there were two pickups during this period
that the passengers who finally got on board the train waited a maximum of 6 minutes
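Here is a sketch of the trap, assuming each notice carried a timestamp. The timestamps below reconstruct my half-hour wait and are otherwise hypothetical:

```python
# The event log is silent when nothing is announced, so naive summaries
# of the entries understate the true wait. Timestamps are hypothetical.
entries = [  # (minutes after I reached the platform, displayed countdown)
    (20, "5 min"), (21, "4 min"), (22, "3 min"), (23, "2 min"),
    (26, "Delay"),                        # that train blew through the station
    (26, "6 min"), (27, "5 min"), (28, "4 min"), (29, "3 min"),
    (30, "2 min"), (31, "1 min"), (32, "0 min"),
]

# Naive "insight": the longest advertised wait was 6 minutes.
advertised = [int(c.split()[0]) for _, c in entries if c != "Delay"]
print(f"Longest advertised wait: {max(advertised)} min")

# What the event log hides: 20 minutes of silence before the first notice.
print(f"Silence before first notice: {entries[0][0]} min")

# A rider arriving at minute 0 actually waited until the train stopped.
arrival = max(t for t, c in entries if c == "0 min")
print(f"Actual wait for that rider: {arrival} min")
```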
Further, I am worried that the operators are gaming the system. Why was there no display when I first showed up? It should have said the next train was arriving in 20 minutes. If I had seen that, I would have walked to the other platform and taken the other line. The count-down clocks are most useful in dealing with outlier situations like this, so the outlier data should never be suppressed.