The current issue of Chance magazine (link, paywalled) has a nice interview with Howard Hogan, who was Chief Demographer at the U.S. Census Bureau. It initially caught my attention because he is a fellow Princeton graduate, one lucky enough to be there during John Tukey's time. At the end of the interview, he was asked to give his "Sunset Salvo", Tukey's term for words of wisdom offered at retirement.
He made two observations, both of which resonate strongly with me.
"If nobody's criticizing you, that means you're not working on anything important."
It is natural to be defensive about one's work, but in a field like data science or statistics, which is about shades of gray (and only interesting because of it), one simply cannot do good work without hearing about different ways of approaching the data and the different results they yield. Providing constructive critique requires the critic to invest significant time deciphering what the analyst has done, and to think through alternatives, which supply the context against which an analysis can be judged.
"If you want to make an impact, you don't just get numbers, you get a story."
This is a counter-intuitive point that frequently makes practitioners uncomfortable. Statistical thinking is mostly about generalizing from the observed data, drawing conclusions about a population wider than the sample. A "data story", by contrast, is most often a specific example. In the interview, Hogan explained "imputation" this way: if in the Supreme Court Library one finds books labeled Title 11, 12, 14, and 15, one can guess with good confidence that Title 13 exists and is absent only because someone checked it out. A data story thus implies a kind of U-turn: the analyst looks at a specific sample of data, generalizes it to a broader statement, and then selects a single specific example to explain that statement to an audience. In practice, it is a highly effective technique.
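Hogan's library example can be reduced to a toy computation. Here is a minimal sketch (my illustration, not anything from the interview; the function name and data are made up) of inferring a missing item from the gap in an otherwise consecutive run of labels:

```python
# Toy version of the library example: infer a missing item from the
# gap in an otherwise consecutive sequence of labels.
# (Illustrative only; survey imputation in practice is far richer.)

def infer_missing_titles(observed):
    """Return labels absent between the smallest and largest observed."""
    full_range = set(range(min(observed), max(observed) + 1))
    return sorted(full_range - set(observed))

shelf = [11, 12, 14, 15]            # books seen on the shelf
print(infer_missing_titles(shelf))  # -> [13]; Title 13 likely exists, just checked out
```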
***
Later in the interview, Hogan described his experience updating ethical guidelines for statisticians. He reported that the committee decided there was nothing more to say about ethics in the age of Big Data.
I disagree with that conclusion. Big Data raises or exacerbates issues that do not arise with "small" datasets.
First, the sheer number of variables collected about each person makes it much easier to de-anonymize Big Data than small datasets; a toy simulation after these four points illustrates why.
Second, the harm from unintended de-anonymization scales with the size of the dataset. With small data, a small number of people are adversely affected; with Big Data, a large number of people are hurt.
Third, more data allow more linking. It is easy to mismatch people's records; one of many reasons is that many people share the same name. This is a pre-existing problem, but its reach has vastly expanded because Big Data makes exponentially more linkages possible (see the naive-join sketch below).
Fourth, more linkages mean more "guilt by association". The impact may be mild for Census data, which carry no individual-level consequences, but when businesses perform such linkages with their Big Data, they make decisions that affect individuals.
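On the first point, a quick simulation (entirely hypothetical data; the choice of variables is my assumption, not anything from the interview) shows how few variables it takes to make records unique, which is the raw material of re-identification:

```python
# Hypothetical simulation: even three coarse quasi-identifiers make
# most records in a modest dataset unique, which is what eases
# re-identification as collected variables pile up.
import random
from collections import Counter

random.seed(0)
n = 10_000
records = [
    (random.randint(1920, 2005),    # birth year
     random.choice("MF"),           # sex
     random.randint(10000, 10200))  # ZIP-like area code
    for _ in range(n)
]

counts = Counter(records)
unique = sum(1 for c in counts.values() if c == 1)
print(f"{unique / n:.0%} of records are unique on just three variables")
```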
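On the third point, here is a deliberately naive sketch (made-up names and records) of how linking on name alone merges two different people:

```python
# Hypothetical sketch of naive record linkage: joining two datasets
# on name alone merges records that belong to two different people.
purchases = [("John Smith", "Springfield", "golf clubs")]
medical   = [("John Smith", "Shelbyville", "cardiology visit")]

linked = [(p, m) for p in purchases for m in medical
          if p[0] == m[0]]          # match on name only
for p, m in linked:
    print("Linked:", p, "<->", m)   # false match: different cities, different people
```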