Several recent news stories cover the topic of “labeling” data. For example, this Bloomberg article says Amazon sends voice recordings from its Echo speakers to be heard and transcribed by human listeners. This Reuters article discusses Facebook’s contractors in India, Romania, the Philippines, and elsewhere, who are hired to “label” status updates, shared links, event posts, Stories uploads, videos, and photos.
The authors of these articles express genuine shock and awe. They apparently believed that “machine learning” means no humans involved. The tech industry allows this misconception to fester by being opaque about how machine learning works.
(The reporters are also dismayed by the privacy invasion. The Echo speakers are constantly recording in users’ households, and Facebook did not have explicit permission from users to send their data out for labeling.)
***
Humans have always been a part of the machine learning workflow, and will continue to be. Let’s use one of the examples in the Facebook report to illustrate this point.
A machine learning model predicts profane language in a Facebook video. For the time being, I assume that a predictive model already exists. This model is fed videos and automatically makes predictions, without human involvement: Video 1, it predicts profanity; Video 2, no; Video 3, no; Video 4, yes; and so on.
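To make the prediction step concrete, here is a toy sketch in Python. The “model” is nothing more than a keyword lookup that I invented for illustration, and the transcripts are made up; a real classifier would look at far richer signals than these.

    # Toy sketch of the automated prediction step. The "model" is a
    # stand-in keyword check, not Facebook's actual classifier, and the
    # video transcripts are invented for illustration.

    PROFANE_WORDS = {"darn", "heck"}  # placeholder vocabulary

    def predict_profanity(transcript: str) -> bool:
        """Pretend model: flag a video if its transcript contains a listed word."""
        words = set(transcript.lower().split())
        return bool(words & PROFANE_WORDS)

    videos = {
        "Video 1": "well darn that was unexpected",
        "Video 2": "a quiet walk in the park",
        "Video 3": "cooking dinner with friends",
        "Video 4": "what the heck happened here",
    }

    for name, transcript in videos.items():
        print(name, "->", "profanity" if predict_profanity(transcript) else "clean")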
Computers are fast and can churn out predictions at scale. The question is whether these predictions are accurate. It’s one thing to create these models in the laboratory; it’s another thing when they are unleashed on the world and affect Facebook users, e.g. by deleting videos that are predicted to contain profanity.
Why should Facebook and its data scientists care about predictive accuracy?
User complaints. When users find their videos deleted for profanity, they complain if the clips do not in fact contain profane language. Other users are upset when they unsuspectingly encounter profanity in videos that the machine failed to identify and delete.
***
It’s not easy to measure whether the machine-learning model is correctly predicting profanity. The machine can’t be both decider and judge at the same time. The judge is typically a human who views the video to determine whether it contains profanity. These human judges are the “annotators” described in the news articles. They are hired to look through videos and apply a profanity “label” when they find profanity.
As disclosed in the articles, companies typically hire two or three judges for each item because judging profanity is somewhat subjective. They might also order more detailed labeling, e.g. labeling specific types of profanity rather than applying one overall profanity label.
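One plausible way to reconcile several judges is a simple majority vote, and then to check how often the model agrees with that consensus. The snippet below is my own illustration of that bookkeeping; the articles do not say how Facebook actually combines its judges’ labels.

    # Illustrative only: combine two or three human judgments per video by
    # majority vote, then count how often the model's prediction agrees.
    from collections import Counter

    def majority_label(judge_labels):
        """Return the label chosen by most judges (ties broken by first seen)."""
        return Counter(judge_labels).most_common(1)[0][0]

    # Each item: the model's prediction plus two or three human judgments.
    reviewed = [
        {"model": True,  "judges": [True, True, False]},
        {"model": False, "judges": [False, False]},
        {"model": True,  "judges": [False, False, True]},
    ]

    agreements = sum(item["model"] == majority_label(item["judges"]) for item in reviewed)
    print(f"Model agrees with the judges on {agreements} of {len(reviewed)} videos")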
***
Now let’s remove the assumption that we already have a machine learning model. Where does this model come from? Such a machine has to learn which features of a video are correlated with the presence of profanity. To discover this correlation, the machine needs to be told which videos contain profane language, and which do not.
This is a chicken-and-egg problem, and it is solved by having humans label a big batch of videos at the start, building the “training data”. In the Facebook example, the company hired over 200 people to create data labels, a workforce later reduced to 30. The first, larger team built the training dataset; once the predictive model has been produced, the labeling done by the reduced workforce is used to monitor the accuracy of its predictions.
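Here is a compact sketch of those two phases, using scikit-learn and invented transcripts and labels: a first batch of human labels trains the model, and later labels from the smaller team are used to track its accuracy. This is the shape of the workflow, not Facebook’s actual pipeline.

    # Sketch of the two phases: a large labeled batch trains a model,
    # then a smaller stream of freshly labeled items monitors its accuracy.
    # Data and labels are invented; a real system would use far more
    # examples and richer features than raw transcripts.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Phase 1: humans label a big batch of transcripts (the training data).
    train_texts = ["darn this traffic", "lovely weather today",
                   "what the heck", "family picnic photos"]
    train_labels = [1, 0, 1, 0]  # 1 = profanity, as judged by human annotators

    vectorizer = CountVectorizer()
    model = LogisticRegression()
    model.fit(vectorizer.fit_transform(train_texts), train_labels)

    # Phase 2: a smaller team keeps labeling new items so accuracy can be tracked.
    new_texts = ["heck of a game", "quiet evening at home"]
    new_labels = [1, 0]  # fresh human judgments
    accuracy = model.score(vectorizer.transform(new_texts), new_labels)
    print(f"Accuracy on newly labeled items: {accuracy:.0%}")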
Any company that claims to use our data to predict our behavior must create training data, i.e. labeled data. In most cases, humans must create the labels – by reading our emails, listening to our conversations, viewing our videos, reviewing our calendars, scanning our receipts, and so on.
How far companies should go, and what methods they should use, in collecting such data are ethical questions that should be discussed.