This week on Junkcharts, I featured the New York Times's analysis and visualization of how New Yorkers living in the richest sections of town are staying away from the Big Apple during the pandemic. For commentary on the graphics, see this post.
The Times's article is important for another reason. It represents a type of analysis that "big data" surveillance technology enables. The key data sources are cell phones, which track users everywhere they go (typically through installed apps, although cell phone providers and vendors also have the data). In the past, to know whether people were home, one would have had to conduct small surveys across all neighborhoods.
An essential difference between surveys and surveillance is consent. Anyone who answers a survey knowingly hands over the data to the data collector - traditionally, the data collector does not inquire about the survey-taker's identity (this red line may be eroding). For cell phones and mobile apps, the typical user is not aware that their locations are being captured and sold. The user, however, most likely "consented" in one of two ways: (a) concealed language buried in terms and conditions, or (b) blackmail: refusing service unless consent is granted.
The report compared data from three sources, which reveals two things: there is an entire cottage industry of data vendors trading our location data, and these datasets are not identical. The differences may arise from the method of surveillance, such as which apps are tracking us, and from the method of inference. Whether someone is home is not merely a question of one's current longitude and latitude. Based on location data over time, the app vendor assigns each person a home location. There is no exact way to do this. For example, we can set a time window and compute the most frequently occurring location. Or, we can limit the analysis to the wee hours, when you're most likely to be home. There are all kinds of possibilities for errors, such as extended vacations or night shifts. Despite the dragnet intention of the surveillance industry, much of the data aren't as precise as advertised. Errors are tolerable when aggregate statistics are the end product, but as you probably know, vendors tout the granularity of their data.
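To see how crude such inference can be, here is a minimal sketch of the "most frequent location during the wee hours" heuristic described above. The ping data, the time window, and the rounding precision are all hypothetical choices, not anything a particular vendor disclosed:

```python
from collections import Counter
from datetime import datetime

# Hypothetical location pings: (timestamp, latitude, longitude)
pings = [
    (datetime(2020, 4, 1, 2, 15), 40.7128, -74.0060),
    (datetime(2020, 4, 1, 3, 40), 40.7128, -74.0060),
    (datetime(2020, 4, 1, 14, 5), 40.7580, -73.9855),   # daytime, elsewhere
    (datetime(2020, 4, 2, 1, 50), 40.7128, -74.0060),
    (datetime(2020, 4, 2, 23, 10), 40.7306, -73.9866),  # evening out
]

def infer_home(pings, night_start=1, night_end=5, precision=3):
    """Guess 'home' as the most frequent (rounded) location in the wee hours."""
    night = [
        (round(lat, precision), round(lon, precision))
        for ts, lat, lon in pings
        if night_start <= ts.hour < night_end
    ]
    if not night:
        return None  # e.g., a night-shift worker with no overnight pings
    return Counter(night).most_common(1)[0][0]

print(infer_home(pings))  # (40.713, -74.006)
```

Every parameter here - the hours counted as "night," the rounding that defines "same place," the tie-breaking in `most_common` - is a judgment call, which is exactly why different vendors produce different answers from similar raw data.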
The report clearly stated that these vendors anonymized the data. Anonymized data is not the same as anonymous data. Please don't let the prevailing narrative mislead you. Anonymized implies the data collector stores your personal details, and then selects who gets to see them. Anonymous means no personal details are collected at all, so no one, not even the data collector, has them. For the type of "big data" analysis performed by the New York Times, the analyst only needs to know that someone lives in a certain neighborhood - not the name and personal details of each person. For most applications, there is no need to know people's personal information.
Not only is personal data not needed, but the analytical results shown by the Times would remain unchanged if the dataset consisted of a random sample of, say, 10,000 people rather than millions. That's because understandable insights are aggregated statistics. The best chart of the Times article plots the average percentage of time at home for the top 1%, 5%, 10%, 20% and bottom 80% segments. These lines are generally 5 to 10 percentage points apart. The visual story does not change with a much smaller dataset.
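The sampling claim is easy to check with a simulation. The sketch below builds a fake population whose segment means loosely echo the pattern in the Times chart (the numbers are invented for illustration, not taken from the article), then compares segment averages computed on the full population against a random sample of 10,000:

```python
import random

random.seed(7)

# Hypothetical percent-of-time-at-home means by income segment.
segments = {"top 1%": 55, "top 20%": 62, "bottom 80%": 70}

# Simulated population: 300,000 people per segment, with noise.
population = [
    (seg, min(100, max(0, random.gauss(mean, 8))))
    for seg, mean in segments.items()
    for _ in range(300_000)
]

def segment_means(data):
    """Average percent of time at home, per segment."""
    sums, counts = {}, {}
    for seg, pct in data:
        sums[seg] = sums.get(seg, 0) + pct
        counts[seg] = counts.get(seg, 0) + 1
    return {seg: round(sums[seg] / counts[seg], 1) for seg in sums}

full = segment_means(population)
sample = segment_means(random.sample(population, 10_000))
print(full)
print(sample)
```

With segments separated by 5 to 10 points and roughly 3,000 sampled people per segment, the sampling error on each average is a fraction of a point, so the two sets of means are nearly indistinguishable and the chart would look the same.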
Personal location data can easily be used to harm us. If you're a protester, the harm is self-evident. If burglars get hold of such data, they can break into your home while you're out. Years ago, someone put up a website called, if I remember correctly, iamnothome.com, by scraping messages left by Facebook users that implied they were out and about. That site was quickly shut down by authorities. Location data vendors are providing the same information, now with more and better sources, and yet have not faced scrutiny. Add in forced consent, and harm turns into self-harm.
Harm frequently arises from "story time" based on the data. I've mostly used the term "story time" to describe academics who publish reports that first present a lot of data, and then a story supposedly revealed by the data - except that, if you think twice, you realize the data were the lullaby, playing while you dozed off and the story was fed to you. Recent reports say some colleges have purchased location data on their students. Administrators can then compile detailed records of which students spend how much time in bars, dance clubs, libraries, and so on. The harm comes when the administrators (sometimes abetted by the app providers) label certain students as "likely to fail" and send them early intervention notices. This is story time! The percentage of time spent in bars and clubs does not define a failing student; it may be a reasonable interpretation, but the causal linkage is not supported by the data.
***
Recently, Andrew featured another instance of story time, in which an influential psychologist found that "time trends in religious attendance correlate with time trends in homicide rates in low-IQ countries but not in high-IQ countries." Some iffy data analysis was offered to make that claim. But the conclusion was "The prescriptive values of highly educated groups (such as secularism, but also libertarianism, criminal justice reform, and unrestricted sociosexuality, among others) may work for groups that are highly cognitively sophisticated and self-controlled, but they may be injurious to groups with lower self-control and cognitive ability." If you're interested in politics and psychology, Andrew has a lengthy discussion of the article here.