Anonymity is in my thoughts at the moment.
First, I learned that the U.S. Census Bureau has moved from paper surveys to online surveys. What does this mean for its central promise of confidentiality? Here is NPR's description of this promise:
Under current law, the federal government is not allowed to release personally identifiable information from the census until 72 years after it's gathered for the constitutionally mandated tally. The bureau has relied on that promise of confidentiality to get many of the country's residents to volunteer their information once a decade, especially among people of color, immigrants and other historically undercounted groups who may be unsure about how their responses could be used against them.
Responding to the Census is mandatory, and yet the Census Bureau does not offer anonymity; instead, it promises to protect our sensitive data. This is exactly what the tech industry says, and we all know by now that our data aren't really protected: sooner or later, the data get leaked.
Anything online is much more easily leaked than anything on paper. If the information is on paper, the thief has to be physically present and would need a truck to haul the records away, after which copying or scanning the data would take a long time. The efficiency and compression of digital storage, plus worldwide network access, make large-scale theft far quicker to pull off.
***
Second, at a recent conference, I heard about a scheme designed to protect "sensitive" variables from online/mobile advertisers. They notably missed the simplest solution, which is to not collect, store, or use such sensitive data.
By "they", think any of the big tech players, like Google, Microsoft, Meta, etc. In their world, they want to use sensitive data to target advertising and extract revenue from advertisers, while "protecting" their users from the prying eyes of those advertisers. So the advertisers are cast as the bad actors while the platforms themselves play the good guys.
The speaker did not define what he meant by sensitive data. I imagine things like someone being a political activist (especially if they live under a repressive regime), sexual orientation, divorce status, disability status, etc.
The main idea of the talk appears to be to use clustering to hide sensitive information about individuals. Each user has an array of sensitive and non-sensitive variables. A cluster analysis forms groups of users that are similar across the entire array of variables, and the advertisers are provided cluster identifiers for users, rather than the variables themselves.
There seems to be a flaw in this logic. For example, if the sensitive variable we are trying to protect is pro-choice activism, and the clustering analysis is sharp, then we may find a cluster in which all or almost all individuals are pro-choice activists, i.e. the cluster is dominated by a single sensitive variable. In this case, not only is the sensitive information exposed but the clustering strategy has backfired - as an adversary is now able to learn the sensitive variable for a whole group via the cluster identifier.
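To make the flaw concrete, here is a minimal sketch with synthetic data. The one-dimensional threshold "clustering", the variable names, and the numbers are all my own invented stand-ins, not anything from the talk; the point is only that a cluster's purity on the sensitive variable measures how much leaks through the cluster ID.

```python
# Toy demonstration: hiding a sensitive variable behind cluster IDs can
# backfire when a cluster is dominated by that variable.
# All data and the threshold "clustering" below are invented for illustration.

# Each user: (engagement_score, is_activist), where is_activist is the
# sensitive variable we are trying to hide from advertisers.
users = [
    (0.91, True), (0.88, True), (0.95, True), (0.85, True),   # heavy engagers
    (0.12, False), (0.20, False), (0.05, True), (0.15, False),  # light engagers
]

# A stand-in for a real cluster analysis: split users at engagement 0.5.
def assign_cluster(user):
    return "A" if user[0] >= 0.5 else "B"

clusters = {}
for u in users:
    clusters.setdefault(assign_cluster(u), []).append(u)

# The advertiser only sees cluster IDs, but the purity of each cluster
# on the sensitive variable shows how much the ID alone reveals.
for cid, members in sorted(clusters.items()):
    purity = sum(1 for _, flag in members if flag) / len(members)
    print(f"cluster {cid}: {len(members)} users, {purity:.0%} activists")

# Cluster A comes out 100% activists: anyone who learns that cluster ID
# effectively learns the sensitive variable for every one of its members.
```

In this toy run, cluster A is pure, so the cluster ID is just a renamed copy of the sensitive variable, which is exactly the backfiring described above.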
The definition of sensitive information is itself a complex problem. For example, if I am short, and our society discriminates against short people, I might want to hide my height, which makes it a sensitive variable for me. Tall people, however, would see height as an advantage and not think of it as sensitive at all.
But is it still credible today to keep information on paper?
Hasn't all information on paper been transferred into digital form?
Posted by: Antonio | 04/15/2023 at 11:21 AM
Antonio: I don't know that there is much going back, but I often feel that the harmful effects of technological advances are not considered enough, or are simply ignored. For example, given the recent banking crisis/bank runs, I'm surprised there isn't much discussion around the implications of doing all your banking on mobile apps.
Posted by: Kaiser | 04/17/2023 at 02:07 AM