The tech industry has turned us into an omni-surveillance society.
Any shop that uses modern, digital, connected technologies is probably collecting, storing, and selling your data to someone. The people receiving and analyzing the data form a much larger set than those collecting it. These data analysts typically ingest the data as they are, and write software that controls this or that aspect of our lives. However, such data are riddled with inaccuracies and bias, which is a form of inaccuracy.
While in Vancouver last week, I encountered the following two scenarios that illustrate the fragility of data collection.
I purchased a drink at the Dunkin Donuts store at JFK Airport, and I noticed that the receipt said "Dine In" even though the shop is a simple counter with no tables or seats, which means 100% of orders are take-out.
There can be a number of reasons for this miscoding:
- the store manager doesn't care because it's obvious to anyone who works at the shop that every customer is take-out. Dine-in/take-out isn't a variable of any interest to management at this location since it is invariant.
- the software defaults to "dine in" and the staff is too lazy to toggle it for each transaction
- the employees may have been told to toggle the setting but they realize that no one cares and so do not follow the instructions
- the employees are not trained to toggle the setting
- there is a known bug that prevents the setting from being toggled
Now, imagine a data analyst who got a hold of this "data exhaust." This person likely has no expertise in fast food operations. Unlike the store manager or employees, s/he can't tell that when it says "Dine in" in that particular location, it actually means "Take out".
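The "Dine In" field would pass any standard validity check, yet at this location it carries no information. One thing an analyst can do is profile the data for near-constant fields and flag them for domain review. Here is a minimal sketch, using made-up transactions (the shop, field names, and threshold are all my assumptions, not anything from an actual point-of-sale system):

```python
from collections import Counter

def flag_invariant_fields(records, threshold=0.99):
    """Flag fields where a single value dominates the records.

    Such a field may be genuinely invariant (every order at this shop
    really is take-out) or an artifact (the register defaults to
    "Dine In" and no one toggles it). The data alone cannot tell you
    which; only someone who knows the operation can.
    """
    flagged = {}
    if not records:
        return flagged
    n = len(records)
    for field in records[0]:
        counts = Counter(r[field] for r in records)
        value, top = counts.most_common(1)[0]
        if top / n >= threshold:
            flagged[field] = value
    return flagged

# Hypothetical transactions from the airport shop: "service" always
# reads "Dine In", while "item" varies from order to order.
orders = [{"item": item, "service": "Dine In"}
          for item in ["coffee", "donut", "latte", "bagel"]]
print(flag_invariant_fields(orders))
# {'service': 'Dine In'} -- flagged for a human to investigate
```

The check only raises a question; answering it still requires the store manager's knowledge that "Dine In" here means "Take out".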
I went to a supermarket to buy a beverage for the road. The lady in front of me had only three items, so it should have been a short wait. Or so I thought!
The first item was scanned without issues. The second item was some Anjou pears, one of the most popular varieties in North America. See picture. Surprisingly, the worker didn't know what kind of pears those were, and had to ask around. That took a while.
The last item was a couple of zucchini. The worker scanned the bag, then cancelled the transaction. She apparently noticed that the system recorded those as "gray squash $1.18" when it should have said zucchini. So she unfolded a lookup table, found the line for zucchini and re-entered the number. It came out as "gray squash $1.18" again. She cancelled the item again.
The third time was the charm. Maybe she had read the wrong line before, but when it showed "zucchini," she was satisfied. The price? Exactly $1.18.
Hat #1: I'm the impatient customer waiting in line for my turn. It seems ridiculous that she spent 5 minutes figuring out the code for zucchini and ended up charging $1.18, the same amount as the "error". She should have just let the mistake live.
Hat #2: I'm the data analyst possibly getting this data as "data exhaust", or perhaps the data analyst who works for this supermarket doing product sales projections. Kudos to this worker! She corrected the error at source, so that when the data flowed through the system, there was no miscoding!
Which side are you on? I have to say, in that moment, I just wanted the line to move faster, and would rather she left the data error in the system.
In my OCCAM framework for thinking about contemporary datasets, these are problems related to "adapting" data collected by other people for other purposes. You are never sure whether the data measure what you think they are measuring. Miscoding can be caused by management decisions, laziness, inattention, expediency, etc.
Most importantly, in both these examples, the data would come in looking fine. It's a miscoding, not a typo, and not a missing value. Further, collecting more data does not solve the problem. It may even reinforce existing errors by adding more samples of the wrong things.
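To see why these errors sail through, consider the zucchini rung up as "gray squash". A hedged sketch, with a made-up catalog and prices (only the $1.18 comes from the story): every standard check a pipeline might run passes, because both codes are legitimate and the prices match.

```python
# Hypothetical product catalog for the supermarket example.
CATALOG = {"gray squash": 1.18, "zucchini": 1.18, "anjou pear": 2.49}

def validate(line_item):
    """Standard checks: product code is known, price matches catalog."""
    name, price = line_item
    return name in CATALOG and CATALOG[name] == price

miscoded = ("gray squash", 1.18)   # actually a zucchini
corrected = ("zucchini", 1.18)

print(validate(miscoded), validate(corrected))
# True True -- both pass; the miscoding is invisible to the checks
```

Downstream, the only trace of the error is a sales report that overstates gray squash and understates zucchini, which no amount of additional data will reveal.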
One of the key differentiators of a good data analyst is whether s/he will diagnose tricky problems like this.