As your smartphone scans your face and then unlocks the device, have you ever asked how well biometrics authentication works? Does it work just as well as fingerprints? How does one go about measuring its accuracy?
***
Biometrics started with fingerprints, expanded to face recognition, and now encompasses other types of measurements such as voiceprints.
The basic steps of biometrics authentication are: capturing the signal (image, voice, video, etc.), turning the signal into data, converting data into scores, which measure the likelihood that the biometrics data came from the device owner, and determining whether to grant access based on exceeding a certain threshold.
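To make the last two steps concrete, here is a minimal sketch of scoring and thresholding. It assumes the signal has already been turned into a numeric feature vector. The cosine similarity score, the 0.9 threshold, and all the vectors are illustrative assumptions of mine, not the method used by any particular phone or by the study discussed below.

```python
import math

def cosine_similarity(a, b):
    """Score in [-1, 1]; a stand-in for whatever scoring model a real system uses."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def authenticate(enrolled, probe, threshold=0.9):
    """Return 'Allow' if the score reaches the threshold, otherwise 'Block'."""
    score = cosine_similarity(enrolled, probe)
    return "Allow" if score >= threshold else "Block"

# Hypothetical feature vectors (made up for illustration).
enrolled_template = [0.2, 0.8, 0.5]    # stored when the owner set up the phone
owner_attempt = [0.25, 0.75, 0.5]      # a new capture of the owner
stranger_attempt = [0.9, 0.1, 0.2]     # a capture of someone else

print(authenticate(enrolled_template, owner_attempt))
print(authenticate(enrolled_template, stranger_attempt))
```

Raising the threshold makes the system stricter: fewer strangers get through, but the owner is turned away more often.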
As with any AI/predictive model, the software must be pre-trained using labeled datasets, e.g. images known to be those of the device owners.
The authentication software makes a binary decision (Allow/Block). Block might involve iterative Retries but I'll ignore this complication. Such a prediction system makes two types of errors: blocking someone who is the device owner (a false rejection, measured by the false rejection rate, FRR) or accepting someone who isn't the owner (a false acceptance, measured by the false acceptance rate, FAR). While we've all heard anecdotes of people who've been erroneously shut out of their phones, I haven't actually seen a numeric error rate ... until now.
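The two error rates can be computed directly from a set of labeled test attempts. The sketch below assumes each attempt has a score (higher means more owner-like) and a known label; the scores and the 0.8 threshold are made-up numbers for illustration only.

```python
def error_rates(scores, is_owner, threshold):
    """Return (FRR, FAR) at a given threshold.

    scores:   match scores, higher means more owner-like
    is_owner: parallel list of booleans (True = genuine owner attempt)
    """
    genuine = [s for s, owner in zip(scores, is_owner) if owner]
    imposter = [s for s, owner in zip(scores, is_owner) if not owner]
    # FRR: share of genuine attempts that fall below the threshold (blocked).
    frr = sum(s < threshold for s in genuine) / len(genuine)
    # FAR: share of imposter attempts that reach the threshold (allowed).
    far = sum(s >= threshold for s in imposter) / len(imposter)
    return frr, far

# Hypothetical test attempts: four by the owner, four by other people.
scores = [0.95, 0.90, 0.60, 0.85, 0.70, 0.30, 0.20, 0.88]
is_owner = [True, True, True, True, False, False, False, False]

frr, far = error_rates(scores, is_owner, threshold=0.8)
print(f"FRR = {frr:.2f}, FAR = {far:.2f}")
```

Note that the two rates have different denominators: FRR is a share of owner attempts, while FAR is a share of imposter attempts.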
Reading Significance magazine recently, I came across one study that quantifies the error rates of biometrics authentication. A more detailed article by the same team is found in Communications of ACM.
We tend to think fingerprints are unique. It turns out that authenticating users with other biometrics data is much less accurate - multiple orders of magnitude less so! According to these authors, the field summarizes the two error rates with one number known as "Equal Error Rate" (EER), which is the setting under which FRR and FAR have the same values. If you've read Chapter 4 of Numbers Rule Your World (link), you'll hopefully question why FRR should equal FAR. The costs of the two types of errors are clearly different, and reflect an individual tradeoff between convenience and security. (Given the lack of scrutiny of these systems, one infers that most people sacrifice security at the altar of convenience.) Note also that falsely rejecting the owner is an annoyance each and every time it happens, directly felt by the true owner, while falsely accepting an imposter may not be discovered until the owner realizes s/he has been harmed.
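Here is a rough sketch of how the EER setting differs from a setting chosen with unequal costs in mind. It sweeps a grid of thresholds over made-up scores; the scores, the grid, and the 10-to-1 cost ratios are all assumptions for illustration, and the grid search only approximates the EER point.

```python
def rates(genuine, imposter, t):
    """Return (FRR, FAR) at threshold t."""
    frr = sum(s < t for s in genuine) / len(genuine)
    far = sum(s >= t for s in imposter) / len(imposter)
    return frr, far

def equal_error_threshold(genuine, imposter, thresholds):
    """Threshold on the grid where |FRR - FAR| is smallest."""
    def gap(t):
        frr, far = rates(genuine, imposter, t)
        return abs(frr - far)
    return min(thresholds, key=gap)

def min_cost_threshold(genuine, imposter, thresholds, cost_fr, cost_fa):
    """Threshold on the grid that minimizes cost_fr * FRR + cost_fa * FAR."""
    def cost(t):
        frr, far = rates(genuine, imposter, t)
        return cost_fr * frr + cost_fa * far
    return min(thresholds, key=cost)

# Hypothetical scores (higher = more owner-like).
genuine = [0.92, 0.81, 0.73, 0.62, 0.41]
imposter = [0.12, 0.23, 0.34, 0.52, 0.67]
thresholds = [i / 10 for i in range(1, 10)]

eer_t = equal_error_threshold(genuine, imposter, thresholds)
# A security-minded user treats a false acceptance as 10x worse ...
secure_t = min_cost_threshold(genuine, imposter, thresholds, cost_fr=1, cost_fa=10)
# ... while a convenience-minded user treats a false rejection as 10x worse.
convenient_t = min_cost_threshold(genuine, imposter, thresholds, cost_fr=10, cost_fa=1)

print(eer_t, rates(genuine, imposter, eer_t))
print(secure_t, convenient_t)
```

With these numbers, the security-minded threshold lands above the EER threshold and the convenience-minded one lands below it, which is the sense in which EER quietly assumes the two errors cost the same.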
The researchers said that their face recognition software has an EER of 4 percent, which means that 4 out of 100 times the owner's face is read, the software erroneously denies entry, and 4 out of 100 times someone else requests access, their intrusion goes undetected.
Voice recognition software (voice-printing) is shown to be more than 8 times worse than face recognition: the error rates are 35% each. About a third of the time the owner requests access, the software would decide to block. (These authors are pitching a system that combines multiple sources of biometrics data.)
***
When reading reports about predictive models, we should examine how the error rates are measured.
Take the false rejection rate. One would have to present images of true owners to the phone, and then compute what proportion of these images the phone erroneously decides to be imposters. Whether the FRR is credible depends on how the investigators select the set of test images. For example, if the test images are the same images used to train the model, the chance of error is smaller.
Each face recognition system has its strengths and weaknesses. Some, for example, are fooled by glasses. People find that they have to take off their glasses to unlock their phones. If the set of test images does not include images of the true owners wearing glasses, then the FRR is under-estimated. If detecting glasses is a strength, not a weakness, of the system being evaluated, the FRR is now over-estimated.
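A little arithmetic shows how the mix of test images moves the measured FRR. The sketch below treats the overall FRR as a weighted average of the FRR on images with glasses and the FRR on images without. The two rates and the 50% share are hypothetical numbers of mine, not figures from the study.

```python
def blended_frr(frr_glasses, frr_no_glasses, share_glasses):
    """Overall FRR when a fraction `share_glasses` of owner images show glasses."""
    return share_glasses * frr_glasses + (1 - share_glasses) * frr_no_glasses

# Hypothetical system that struggles with glasses.
frr_glasses, frr_no_glasses = 0.20, 0.02

# Suppose half of real-world unlock attempts are made wearing glasses ...
real_world = blended_frr(frr_glasses, frr_no_glasses, share_glasses=0.5)
# ... but the test set contains no images with glasses at all.
test_set = blended_frr(frr_glasses, frr_no_glasses, share_glasses=0.0)

print(f"real-world FRR = {real_world:.2f}, reported FRR = {test_set:.2f}")
```

Swapping the two rates (a system that does better with glasses) flips the direction of the bias, so the same test set would then over-estimate the FRR.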
The false acceptance rate is even harder to measure accurately because the investigators must compile a set of images of imposters. There are infinitely many ways to evade the software. The researchers of this study adopted a popular method: "We performed the testing through a randomly selected face-and-voice sample from a subject we selected randomly from among the 54 subjects in the database, leaving out the training samples." This method relies on "randomization".
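The quoted procedure can be sketched roughly as follows. Only the 54 subjects and the exclusion of training samples come from the quote; the database layout, the five samples per subject, the choice of which samples were used in training, and the function name are my assumptions.

```python
import random

def pick_test_sample(database, training_ids, rng):
    """Pick a random subject, then a random sample of theirs not used in training."""
    subject = rng.choice(sorted(database))
    held_out = [s for s in database[subject] if s not in training_ids]
    return subject, rng.choice(held_out)

# Hypothetical database: 54 subjects, 5 samples each (sample counts are made up).
database = {
    f"subject_{i:02d}": [f"s{i:02d}_{k}" for k in range(5)]
    for i in range(54)
}
# Assume the first 3 samples of every subject were used to train the model.
training_ids = {sid for samples in database.values() for sid in samples[:3]}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
subject, sample = pick_test_sample(database, training_ids, rng)
print(subject, sample)
```

Every subject is equally likely to be drawn as the "imposter", and nothing in this procedure favors a subject who looks or sounds like the owner.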
Almost always, such a method of selecting test samples over-estimates the accuracy rates. That's because in real life, the bad guys are not randomly selected from the entire population - but from the subset of bad guys. The bad guys who are attempting to unlock your phones are likely to exploit weaknesses of the technology. The authors included an interesting example of one such attack strategy. Certain software assesses the quality of the images submitted for verification, and assigns higher importance to higher-quality images. This makes sense but the system is open to a form of attack: the bad guys deliberately submit poor-quality images, knowing that the software can't keep locking out true owners who provide low-quality images.
It's challenging to compile test images for measuring false acceptance rate, as it requires specifying how imposters are likely to attack the system.
The key takeaway is that error rates coming from research studies are heavily affected by the investigators' choice of test images. As modelers, we can fool ourselves by presenting easy but unrealistic images for validation. Ideally, we should employ a neutral third party to evaluate these systems.
***
This last section dives into some technical details, which you can skip if not interested.
Below are a table and a figure included in the Significance article, which provide data on the error rates:
The researchers compared three authentication schemes: face recognition, voice recognition and one that fuses both face images and voice prints (described as "feature-level multimodal fusion"). Table 1 says that the fusion scheme has the lowest EER, followed by face recognition; voice recognition has a terrible EER of 35%.
Figure 2 presents the components of the EER in the form of an ROC curve. However, the results shown in this Figure do not match what is shown in Table 1.
The ROC curve plots the "true positive rate" against the "false positive rate". According to the authors, the false positive rate is the FAR, the chance of letting an imposter through. (This definition implies that a positive result is positively identifying the true owner.) The complement of a "true positive" is a false negative, that is to say, to mistakenly block the true owner of the device. Thus the FRR (false rejection rate) is the false negative rate, which is 1 - the true positive rate.
With ROC curves, the top left corner represents the perfect system that makes zero false positive mistakes and 100% true positives (i.e. zero false negative mistakes). The ranking of the three authentication schemes should therefore be Blue > Red > Green, i.e. Feature/Fusion > Voice > Face. But this chart contradicts the data shown in Table 1 where the ranking is Feature/Fusion > Face >>> Voice.
The EER is "the value at which FAR and FRR are equal". The table shows a summary statistic computed from the FAR and FRR while the chart plots the two metrics separately. If the underlying values of FAR and FRR are the same, the ranking should not differ.
Let's locate the points at which FAR equals FRR on the ROC curve. Recall that the vertical axis is 1-FRR while the horizontal axis is FAR. Thus, when FAR equals FRR, y = 1-x.
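If you have the points of an ROC curve in hand, the crossing with the line y = 1-x can be estimated numerically. The sketch below uses linear interpolation between neighboring ROC points. The four points are hypothetical, not read off Figure 2, and the function name is my own.

```python
def eer_from_roc(fpr, tpr):
    """Estimate the EER from ROC points sorted by increasing false positive rate.

    The EER sits where FAR = FRR, i.e. where fpr = 1 - tpr. Define
    g = fpr + tpr - 1, which is FAR minus FRR, and find where g crosses zero.
    """
    for i in range(1, len(fpr)):
        g0 = fpr[i - 1] + tpr[i - 1] - 1
        g1 = fpr[i] + tpr[i] - 1
        if g0 <= 0 <= g1:
            if g1 == g0:  # both points sit exactly on the line y = 1 - x
                return fpr[i - 1]
            w = -g0 / (g1 - g0)  # fraction of the way from point i-1 to point i
            return fpr[i - 1] + w * (fpr[i] - fpr[i - 1])
    raise ValueError("ROC points never cross the line y = 1 - x")

# Hypothetical ROC points (x = FAR, y = 1 - FRR).
fpr = [0.0, 0.1, 0.3, 1.0]
tpr = [0.0, 0.7, 0.9, 1.0]

eer = eer_from_roc(fpr, tpr)
print(f"EER is roughly {eer:.2f}")
```

The returned value is the x-coordinate of the crossing, which equals both FAR and FRR at that point.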
As shown above, all three schemes have EERs around 20%, with the best one closer to 10%. None of these have EER under 10% and voice recognition isn't 5 times worse than face recognition. So I'm very confused by these figures. I can't find any more details about this research beyond those two articles.
A 20% error rate is a far cry from what we have come to expect from fingerprinting.