Facial Recognition’s Big 3: Accuracy (without Bias), Speed and Size
Having witnessed advancements in facial recognition from the start, I’ve marveled at the accumulation of use cases as companies familiarize themselves with the technology. Facial recognition facilitates enhancements to everyday life — helping us organize our online photos, replacing badges for secure access on corporate campuses, rewarding repeat customers in retail spaces and more. On the other end of the spectrum, facial recognition has been used to make life-changing outcomes possible, such as locating missing children or correctly identifying suspects in a terror attack. With high-stakes speed and accuracy, it’s remarkable what the technology can achieve.
However, anyone who’s deployed a facial recognition system will tell you it’s not enough to achieve lightning-fast speed or best-in-class accuracy — in fact, many companies accomplish one at the expense of the other. Yes, it’s critical that accuracy and speed be balanced, but there are more than two pieces to the puzzle. Size matters, too: The more compact the model, the more easily it can be embedded in devices. Accuracy, speed, size. We call this the Big Three of facial recognition for live video.
Let’s begin with accuracy. Rapid recognition results without accuracy mean false positives, false negatives and bias. The National Institute of Standards and Technology (NIST) first began testing facial recognition algorithms in 2010, and the industry has seen massive breakthroughs in accuracy since, particularly after 2013. NIST’s evaluation of 127 software algorithms shows 2018 algorithms to be 20 times more accurate than their 2013 equivalents at searching for a face in a database of photographs. This massive reduction in error rates is due to the wholesale replacement of old algorithms with new ones based on deep neural networks. Consider that 95 percent of the matches that failed in 2013 now yield the correct result – that’s how significantly machine learning has revolutionized the industry.
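Those two figures — 20 times more accurate, and 95 percent of old failures fixed — are two views of the same arithmetic. A quick sanity check, using round error rates assumed purely for illustration (a roughly 4 percent search failure rate for 2013-era algorithms versus roughly 0.2 percent in 2018):

```python
# Assumed, illustrative error rates consistent with the 20x claim:
# a 2013-era algorithm fails about 4% of database searches,
# a 2018 algorithm about 0.2%.
error_2013 = 0.04
error_2018 = 0.002

improvement = error_2013 / error_2018          # ~20: "20 times more accurate"
failures_fixed = 1 - error_2018 / error_2013   # ~0.95: "95 percent of the
                                               # matches that failed now succeed"
print(improvement, failures_fixed)
```

Cutting the error rate to one-twentieth of its former value is exactly the same statement as resolving 95 percent of the searches that used to fail.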
Thinking back to those early days — less than a decade ago — facial recognition algorithms struggled to match even forward-facing faces in still images. Today, in an increasingly crowded field, the challenge is to accurately match faces appearing on live video: camera-unaware faces moving across video feeds. The industry term is “in the wild” images: Faces appear on camera feeds with variations in rotation and tilt, not to mention occlusions like facial hair and accessories. NIST measures each algorithm’s false non-match rate (FNMR), the rate at which a system miscategorizes two pieces of biometric data from the same person as data from two different people, and ranks algorithms’ accuracy by this metric. The University of Massachusetts provides its own testing benchmark, known as Labeled Faces in the Wild.
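To make the metric concrete, here is a minimal sketch of computing FNMR at a fixed match threshold. The similarity scores and the 0.5 threshold below are hypothetical, chosen only for illustration; this is not NIST's evaluation protocol.

```python
def false_non_match_rate(genuine_scores, threshold):
    """FNMR: the fraction of same-person comparisons whose similarity
    score falls below the match threshold, so the pair is wrongly
    treated as two different people."""
    misses = sum(1 for s in genuine_scores if s < threshold)
    return misses / len(genuine_scores)

# Hypothetical similarity scores for pairs known to be the same person
scores = [0.91, 0.87, 0.42, 0.95, 0.78]
print(false_non_match_rate(scores, threshold=0.5))  # 1 of 5 rejected -> 0.2
```

In practice the threshold is tuned per deployment: raising it reduces false matches but pushes FNMR up, which is why accuracy must be reported at a stated operating point.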
A major contributor to accuracy is consistency across a range of skin tones. If algorithms are trained on a highly diverse data set that includes global representation of both male and female faces, they’re learning from a more accurate depiction of the real world. The variation of accuracy across skin tone, gender and age is known as bias, and facial recognition developers are highly cognizant of the need to reduce it. Far too often, software is built with data sets that reflect only the demographics of the developers themselves. As an industry, we must reduce bias through diverse and balanced training data with faces across a range of geographic regions, ages, skin tones and genders.
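One straightforward way to quantify bias is to compute the false non-match rate separately per demographic group and look at the spread between the best- and worst-served groups. A minimal sketch, with hypothetical group labels and scores:

```python
def per_group_fnmr(results, threshold):
    """results: {group_name: [similarity scores for genuine pairs]}.
    Returns each group's false non-match rate at the given threshold."""
    return {
        group: sum(s < threshold for s in scores) / len(scores)
        for group, scores in results.items()
    }

# Hypothetical evaluation scores bucketed by demographic group
results = {
    "group_a": [0.90, 0.80, 0.85, 0.40],
    "group_b": [0.90, 0.88, 0.86, 0.84],
}
rates = per_group_fnmr(results, threshold=0.5)
bias_spread = max(rates.values()) - min(rates.values())
print(rates, bias_spread)  # group_a misses 1 of 4 pairs; spread is 0.25
```

A spread near zero means the system serves all groups equally well; a large spread is exactly the bias that balanced training data is meant to eliminate.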
So, how is it that algorithms can be tuned to the real-world conditions of faces in motion under poor lighting and still deliver optimal results in real time? That’s the crux of it: Whether a person is wearing makeup, glasses, a hat, or is clean-shaven versus mustachioed, facial recognition systems must be adept at identifying people in live video with high accuracy. Systems trained on well-aligned, well-lit, unoccluded passport photos will not perform well when matching against in-the-wild faces captured from live video.
With so much emphasis on accuracy, we must remember that speed matters too. Delivering perfectly accurate results in anything other than near real time renders facial recognition unusable for live video. This is particularly true in the security industry, where reaction time is everything: A response must be triggered as the recognition event happens, or the opportunity to act will have passed.
Additionally, matching faces with extraordinary speed makes a system well suited to large-scale deployments — live entertainment venues, sports stadiums, public transit centers, etc. — where it may be necessary to process hundreds of thousands of faces in real time. For real-world applications, a facial recognition system needs to be economically scalable. Shipping all the video to the cloud for processing is not viable. A cost-effective partitioning of the problem between edge and cloud computing results in efficient use of resources, faster response times and lower operation costs. While we want to enable computer vision for any IP camera, a new smarter generation of cameras allows us to embed the edge processing directly into the device. And this is where size comes in.
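A back-of-the-envelope comparison shows why this partitioning pays off. Every figure below is an assumption chosen for illustration: a 4 Mbps camera stream, five detected faces per second per camera, and a 512-dimension float32 face embedding.

```python
# Assumed figures for a large deployment (illustrative only)
cameras = 1000
video_mbps_per_camera = 4.0        # assumed 1080p compressed stream

faces_per_sec_per_camera = 5       # assumed detection rate at the edge
embedding_bytes = 512 * 4          # assumed 512-dim float32 embedding
embedding_mbps_per_camera = faces_per_sec_per_camera * embedding_bytes * 8 / 1e6

# Cloud-only: ship every stream. Edge-assisted: ship only embeddings.
cloud_only_mbps = cameras * video_mbps_per_camera
edge_mbps = cameras * embedding_mbps_per_camera
print(f"cloud-only: {cloud_only_mbps:.0f} Mbps, edge: {edge_mbps:.2f} Mbps")
```

Under these assumptions, pushing detection and embedding to the camera cuts the upstream bandwidth from roughly 4 Gbps to under 100 Mbps, which is where the latency and TCO savings come from.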
Live video demands compactness — also known as embeddability or computational efficiency — from detection through recognition. Operating at the edge yields superior performance because the distributed architecture consumes bandwidth efficiently, reducing the round-trip latency of recognition and lowering total cost of ownership (TCO).
Compared with cloud-only systems, compact facial recognition models can run at the edge while using less power to process live video. Smaller size also means scalable deployment — up to thousands of cameras — further reducing TCO and making facial recognition platforms practical for a wide variety of applications. Running on off-the-shelf hardware and leveraging inexpensive GPUs translates to additional cost savings.
For any industrial-grade facial recognition solution to be trusted and effective, it must combine accuracy, speed and size. Together, these are what make algorithms capable of recognizing faces in live video. The future of world-class facial recognition technology comes down to the Big Three.
Reza Rassool is CTO of RealNetworks, a digital media software and services company.