
614 papers with code • 23 benchmarks • 64 datasets

Facial Recognition is the task of making a positive identification of a face in a photo or video image against a pre-existing database of faces. It begins with detection - distinguishing human faces from other objects in the image - and then works on identification of those detected faces.

The state-of-the-art tables for this task mainly cover its two constituent subtasks: face verification and face identification.

(Image credit: Face Verification)


Benchmarks

The leaderboards list the best model per benchmark dataset; leading entries include GhostFaceNetV2-1, Fine-tuned ArcFace, ArcFace+CSFM, QMagFace, MagFace, Prodpoly, FaceNet+Adaptive Threshold, Model with Up Convolution + DoG Filter, FaceTransformer+OctupletLoss, Partial FC, and MCN.


Most implemented papers

FaceNet: A Unified Embedding for Face Recognition and Clustering

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%.

ArcFace: Additive Angular Margin Loss for Deep Face Recognition


Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability.

VGGFace2: A dataset for recognising faces across pose and age

The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimize the label noise.

SphereFace: Deep Hypersphere Embedding for Face Recognition

This paper addresses deep face recognition (FR) problem under open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space.

A Light CNN for Deep Face Representation with Noisy Labels

This paper presents a Light CNN framework to learn a compact embedding on the large-scale face data with massive noisy labels.

Learning Face Representation from Scratch

The current situation in the field of face recognition is that data is more important than algorithm.

Circle Loss: A Unified Perspective of Pair Similarity Optimization

This paper provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity $s_p$ and minimize the between-class similarity $s_n$.

MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition

In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and link them to corresponding entity keys in a knowledge base.

DeepID3: Face Recognition with Very Deep Neural Networks

Very deep neural networks recently achieved great success on general object recognition because of their superb learning capacity.

Can we still avoid automatic face detection?

Recognito-Vision/Linux-FaceRecognition-FaceLivenessDetection • 14 Feb 2016

In this setting, is it still possible for privacy-conscientious users to avoid automatic face detection and recognition?


Face Recognition by Humans and Machines: Three Fundamental Advances from Deep Learning

Alice J. O’Toole

1 School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, Texas 75080, USA;

Carlos D. Castillo

2 Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA;

Deep learning models currently achieve human levels of performance on real-world face recognition tasks. We review scientific progress in understanding human face processing using computational approaches based on deep learning. This review is organized around three fundamental advances. First, deep networks trained for face identification generate a representation that retains structured information about the face (e.g., identity, demographics, appearance, social traits, expression) and the input image (e.g., viewpoint, illumination). This forces us to rethink the universe of possible solutions to the problem of inverse optics in vision. Second, deep learning models indicate that high-level visual representations of faces cannot be understood in terms of interpretable features. This has implications for understanding neural tuning and population coding in the high-level visual cortex. Third, learning in deep networks is a multistep process that forces theoretical consideration of diverse categories of learning that can overlap, accumulate over time, and interact. Diverse learning types are needed to model the development of human face processing skills, cross-race effects, and familiarity with individual faces.

1. INTRODUCTION

The fields of vision science, computer vision, and neuroscience are at an unlikely point of convergence. Deep convolutional neural networks (DCNNs) now define the state of the art in computer-based face recognition and have achieved human levels of performance on real-world face recognition tasks ( Jacquet & Champod 2020 , Phillips et al. 2018 , Taigman et al. 2014 ). This behavioral parity allows for meaningful comparisons of representations in two successful systems. DCNNs also emulate computational aspects of the ventral visual system ( Fukushima 1988 , Krizhevsky et al. 2012 , LeCun et al. 2015 ) and support surprisingly direct, layer-to-layer comparisons with primate visual areas ( Yamins et al. 2014 ). Nonlinear, local convolutions, executed in cascaded layers of neuron-like units, form the computational engine of both biological and artificial neural networks for human and machine-based face recognition. Enormous numbers of parameters, diverse learning mechanisms, and high-capacity storage in deep networks enable a wide variety of experiments at multiple levels of analysis, from reductionist to abstract. This makes it possible to investigate how systems and subsystems of computations support face processing tasks.

Our goal is to review scientific progress in understanding human face processing with computational approaches based on deep learning. As we proceed, we bear in mind wise words written decades ago in a paper on science and statistics: “All models are wrong, but some are useful” ( Box 1979 , p. 202) (see the sidebar titled Perspective: Theories and Models of Face Processing and the sidebar titled Caveat: Iteration Between Theory and Practice ). Since all models are wrong, in this review, we focus on what is useful. For present purposes, computational models are useful when they give us insight into the human visual and perceptual system. This review is organized around three fundamental advances in understanding human face perception, using knowledge generated from deep learning models. The main elements of these advances are as follows.

PERSPECTIVE: THEORIES AND MODELS OF FACE PROCESSING

Box (1976) reminds us that scientific progress comes from motivated iteration between theory and practice. In understanding human face processing, theories should be used to generate the questions, and machines (as models) should be used to answer the questions. Three elemental concepts are required for scientific progress. The first is flexibility. Effective iteration between theory and practice requires feedback between what the theory predicts and what the model reveals. The second is parsimony. Because all models are wrong, excessive elaboration will not find the correct model. Instead, economical descriptions of a phenomenon should be preferred over complex descriptions that capture less fundamental elements of human perception. Third, Box (1976 , p. 792) cautions us to avoid “worrying selectivity” in model evaluation. As he puts it, “since all models are wrong, the scientist must be alert to what is importantly wrong.”

These principles represent a scientific ideal, rather than a reality in the field of face perception by humans and machines. Applying scientific principles to computational modeling of human face perception is challenging for diverse reasons (see the sidebar titled Caveat: Iteration Between Theory and Practice below). We argue, as Cichy & Kaiser (2019) have, that although the utility of scientific models is usually seen in terms of prediction and explanation, their function for exploration should not be underrated. As scientific models, DCNNs carry out high-level visual tasks in neurally inspired ways. They are at a level of development that is ripe for exploring computational and representational principles that actually work but are not understood. This is a classic problem in reverse engineering—yet the use of deep learning as a model introduces a dilemma. The goal of reverse engineering is to understand how a functional but highly complex system (e.g., the brain and human visual system) solves a problem (e.g., recognizes a face). To accomplish this, a well-understood model is used to test hypotheses about the underlying mechanisms of the complex system. A prerequisite of reverse engineering is that we understand how the model works. Failing that, we risk using one poorly understood system to test hypotheses about another poorly understood system. Although deep networks are not black boxes (every parameter is knowable) ( Hasson et al. 2020 ), we do not fully understand how they recognize faces ( Poggio et al. 2020 ). Therefore, the primary goal should be to understand deep networks for face recognition at a conceptual and representational level.

CAVEAT: ITERATION BETWEEN THEORY AND PRACTICE

Box (1976) noted that scientific progress depends on motivated iteration between theory and practice. Unfortunately, a motivation to iterate between theory and practice is not a reasonable expectation for the field of computer-based face recognition. Automated face recognition is big business, and the best models were not developed to study human face processing. DCNNs provide a neurally inspired, but not copied, solution to face processing tasks. Computer scientists formulated DCNNs at an abstract level, based on neural networks from the 1980s ( Fukushima 1988 ). Current DCNN-based models of human face processing are computationally refined, scaled-up versions of these older networks. Algorithm developers make design and training decisions for performance and computational efficiency. In using DCNNs to model human face perception, researchers must choose between smaller, controlled models and larger-scale, uncontrolled networks (see also Richards et al. 2019 ). Controlled models are easier to analyze but can be limited in computational power and training data diversity. Uncontrolled models better emulate real neural systems but may be intractable. The easy availability of cutting-edge pretrained face recognition models, with a variety of architectures, has been the deciding factor for many research labs with limited resources and expertise to develop networks. Given the widespread use of these models in vision science, brain-similarity metrics for artificial neural networks have been developed ( Schrimpf et al. 2018 ). These produce a Brain-Score made up of a composite of neural and behavioral benchmarks. Some large-scale (uncontrolled) network architectures used in modeling human face processing (See Section 2.1 ) score well on these metrics.

A promising long-term strategy is to increase the neural accuracy of deep networks ( Grill-Spector et al. 2018 ). The ventral visual stream and DCNNs both enable hierarchical and feedforward processing. This offers two computational benefits consistent with DCNNs as models of human face processing. First, the universal approximation theorem ( Hornik et al. 1989 ) ensures that both types of networks can approximate any complex continuous function relating the input (visual image) to the output (face identity). Second, linear and nonlinear feedforward connections enable fast computation consistent with the speed of human facial recognition ( Grill-Spector et al. 2018 , Thorpe et al. 1996 ). Although current DCNNs lack other properties of the ventral visual system, these can be implemented as the field progresses.

  • Deep networks force us to rethink the universe of possible solutions to the problem of inverse optics in vision. The face representations that emerge from deep networks trained for identification operate invariantly across changes in image and appearance, but they are not themselves invariant.
  • Computational theory and simulation studies of deep learning indicate a reconsideration of a long-standing axiom in vision science that face or object representations can be understood in terms of interpretable features. Instead, in deep learning models, the concept of a nameable deep feature, localized in an output unit of the network or in the latent variables of the space, should be reevaluated.
  • Natural environments provide highly variable training data that can structure the development of face processing systems using a variety of learning mechanisms that overlap, accumulate over time, and interact. It is no longer possible to invoke learning as a generic theoretical account of a behavioral or neural phenomenon.

We focus on deep learning findings that are relevant for understanding human face processing—broadly construed. The human face provides us with diverse information, including identity, gender, race or ethnicity, age, and emotional state. We use the face to make inferences about a person’s social traits ( Oosterhof & Todorov 2008 ). As we discuss below, deep networks trained for identification retain much of this diverse facial information (e.g., Colón et al. 2021 , Dhar et al. 2020 , Hill et al. 2019 , Parde et al. 2017 , Terhörst et al. 2020 ). The use of face recognition algorithms in applied settings (e.g., law enforcement) has spurred detailed performance comparisons between DCNNs and humans (e.g., Phillips et al. 2018 ). For analogous reasons, the problem of human-like race bias in DCNNs has also been studied (e.g., Cavazos et al. 2020 ; El Khiyari & Wechsler 2016 ; Grother et al. 2019 ; Krishnapriya et al. 2019 , 2020 ). Developmental data on infants’ exposure to faces in the first year(s) of life offer insight into how to structure the training of deep networks ( Smith & Slone 2017 ). These topics are within the scope of this review. Although we consider general points of comparison between DCNNs and neural responses in face-selective areas of the primate inferotemporal (IT) cortex, a detailed discussion of this topic is beyond the scope of this review. (For a review of primate face-selective areas that considers computational perspectives, see Hesse & Tsao 2020 ). In this review, we focus on the computational and representational principles of neural coding from a deep learning perspective.

The review is organized as follows. We begin with a brief review of where machine performance on face identification stands relative to humans in quantitative terms. Qualitative performance comparisons on identification and other face processing tasks (e.g., expression classification, social perception, development) are integrated into Sections 2 – 4 . These sections consider advances in understanding human face processing from deep learning approaches. We close with a discussion of where the next steps might lead.

1.1. Where We Are Now: Human Versus Machine Face Recognition

Deep learning models of face identification map widely variable images of a face onto a representation that supports identification accuracy comparable to that of humans. The steady progress of machines over the past 15 years can be summarized in terms of the increasingly challenging face images that they can recognize ( Figure 1 ). By 2007, the best algorithms surpassed humans on a task of identity matching for unfamiliar faces in frontal images taken indoors ( O’Toole et al. 2007 ). By 2012, well-established algorithms exceeded human performance on frontal images with moderate changes in illumination and appearance ( Kumar et al. 2009 , Phillips & O’Toole 2014 ). Machine ability to match identity for in-the-wild images appeared with the advent of DCNNs in 2013–2014. Human face recognition was marginally more accurate than DeepFace ( Taigman et al. 2014 ), an early DCNN, on the Labeled Faces in the Wild (LFW) data set ( Huang et al. 2008 ). LFW contains in-the-wild images taken mostly from the front. DCNNs now fare well on in-the-wild images with significant pose variation (e.g., Maze et al. 2018 , data set). Sengupta et al. (2016) found parity between humans and machines on frontal-to-frontal identity matching but human superiority on frontal-to-profile matching.

Figure 1. The progress of computer-based face recognition systems can be tracked by their ability to recognize faces with increasing levels of image and appearance variability. In 2006, highly controlled, cropped face images with moderate variability, such as the images of the same person shown, were challenging (images adapted with permission from Sim et al. 2002). In 2012, algorithms could tackle moderate image and appearance variability (the top four images are extreme examples adapted with permission from Huang et al. 2012; the bottom two images adapted with permission from Phillips et al. 2011). By 2018, deep convolutional neural networks (DCNNs) began to tackle wide variation in image and appearance (images adapted with permission from the database in Maze et al. 2018). In the 2012 and 2018 images, all side-by-side images show the same person except the bottom pair of 2018 panels.

Identity matching:

process of determining if two or more images show the same identity or different identities; this is the most common task performed by machines

Human face recognition:

the ability to determine whether a face is known

1.2. Expert Humans and State-of-the-Art Machines Work Together

DCNNs can sometimes even surpass normal human performance. Phillips et al. (2018) compared humans and machines matching the identity of faces in high-quality frontal images. Although this is generally considered an easy task, the images tested were chosen to be highly challenging based on previous human and machine studies. Four DCNNs developed between 2015 and 2017 were compared to human participants from five groups: professional forensic face examiners, professional forensic face reviewers, superrecognizers ( Noyes et al. 2017 , Russell et al. 2009 ), professional fingerprint examiners, and students. Face examiners, reviewers, and superrecognizers performed more accurately than fingerprint examiners, and fingerprint examiners performed more accurately than students. Machine performance, from 2015 to 2017, tracked human skill levels. The 2015 algorithm ( Parkhi et al. 2015 ) performed at the level of the students; the 2016 algorithm ( Chen et al. 2016 ) performed at the level of the fingerprint examiners ( Ranjan et al. 2017c ); and the two 2017 algorithms ( Ranjan et al. 2017 , c ) performed at the level of professional face reviewers and examiners, respectively. Notably, combining the judgments of individual professional face examiners with those of the best algorithm ( Ranjan et al. 2017 ) yielded perfect performance. This suggests a degree of strategic diversity for the face examiners and the DCNN and demonstrates the potential for effective human–machine collaboration ( Phillips et al. 2018 ).
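The combination reported by Phillips et al. (2018) amounts to fusing human and machine judgments at the score level. A minimal sketch of that kind of fusion, assuming examiner ratings and DCNN similarity scores are already available as arrays (the numbers, scales, and the simple z-score averaging are purely illustrative, not the authors' exact procedure):

```python
import numpy as np

def zscore(x):
    """Standardize scores so human ratings and machine similarities are comparable."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def fuse_scores(human_ratings, machine_scores):
    """Average z-scored human and machine judgments for each face pair."""
    return (zscore(human_ratings) + zscore(machine_scores)) / 2.0

# Hypothetical example: higher scores mean "same identity".
examiner = [3, -2, 1, 2, -3]                 # examiner ratings on a -3..+3 scale
algorithm = [0.71, 0.32, 0.55, 0.64, 0.28]   # DCNN similarity scores for the same pairs
print(fuse_scores(examiner, algorithm))
```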

Combined, the data indicate that machine performance has improved from a level comparable to that of a person recognizing unfamiliar faces to one comparable to that of a person recognizing more familiar faces ( Burton et al. 1999 , Hancock et al. 2000 , Jenkins et al. 2011 ) (see Section 4.1 ).

2. RETHINKING INVERSE OPTICS AND FACE REPRESENTATIONS

Deep networks force us to rethink the universe of possible solutions to the problem of inverse optics in vision. These networks operate with a degree of invariance to image and appearance that was unimaginable by researchers less than a decade ago. Invariance refers to the model’s ability to consistently identify a face when image conditions (e.g., viewpoint, illumination) and appearance (e.g., glasses, facial hair) vary. The nature of the representation that accomplishes this is not well understood. The inscrutability of DCNN codes is due to the enormous number of computations involved in generating a face representation from an image and the uncontrolled training data. To create a face representation, millions of nonlinear, local convolutions are executed over tens (to hundreds) of layers of units. Researchers exert little or no control over the training data, but instead source face images from the web with the goal of finding as much labeled training data as possible. The number of images per identity and the types of images (e.g., viewpoint, expression, illumination, appearance, quality) are left (mostly) to what is found through web scraping. Nevertheless, DCNNs produce a surprisingly structured and rich face representation that we are beginning to understand.

2.1. Mining the Face Identity Code in Deep Networks

The face representation generated by DCNNs for the purpose of identifying a face also retains detailed information about the characteristics of the input image (e.g., viewpoint, illumination) and the person pictured (e.g., gender, age). As shown below, this unified representation can solve multiple face processing tasks in addition to identification.

2.1.1. Image characteristics.

Face representations generated by deep networks both are and are not invariant to image variation. These codes can identify faces invariantly over image change, but they are not themselves invariant. Instead, face representations of a single identity vary systematically as a function of the characteristics of the input image. The representations generated by DCNNs are, in fact, representations of face images.

Work to dissect face identity codes draws on the metaphor of a face space ( Valentine 1991 ) adapted to representations generated by a DCNN. Visualization and simulation analyses demonstrate that identity codes for face images retain ordered information about the input image ( Dhar et al. 2020 , Hill et al. 2019 , Parde et al. 2017 ). Viewpoint (yaw and pitch) can be predicted accurately from the identity code, as can media source (still image or video frame) ( Parde et al. 2017 ). Image quality (blur, usability, occlusion) is also available as the identity code norm (vector length). Poor-quality images produce face representations centered in the face space, creating a DCNN garbage dump. This organizational structure was replicated in two DCNNs with different architectures, one developed by Chen et al. (2016) with seven convolutional layers and three fully connected layers and another developed by Sankaranarayanan et al. (2016) with 11 convolutional layers and one fully connected layer. Image quality estimates can also be optimized directly in a DCNN using human ratings ( Best-Rowden & Jain 2018 ).
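These readouts are simple to reproduce in outline. The sketch below uses random stand-ins for the 512-D identity codes and per-image yaw labels (so the probe learns nothing here; with real DCNN codes the yaw readout is accurate, per Parde et al. 2017) and shows the two analyses described above: a linear probe for viewpoint and the representation norm as an image-quality signal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-ins for DCNN identity codes: one 512-D output vector per face image.
embeddings = rng.normal(size=(1000, 512))
yaw = rng.uniform(-90, 90, size=1000)        # head yaw (degrees) for each image

# Viewpoint readout: a simple linear probe trained on the identity codes.
probe = LinearRegression().fit(embeddings, yaw)
print("R^2 of yaw readout:", probe.score(embeddings, yaw))

# Image-quality signal: the norm (vector length) of each representation;
# low-norm codes sit near the origin of the face space.
norms = np.linalg.norm(embeddings, axis=1)
low_quality_images = np.argsort(norms)[:50]  # candidates for the "garbage dump" region
```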

Face space:

representation of the similarity of faces in a multidimensional space

For a closer look at the structure of DCNN face representations, Hill et al. (2019) examined the representations of highly controlled face images in a face space generated by a deep network trained with in-the-wild images. The network processed images of three-dimensional laser scans of human heads rendered from five viewpoints under two illumination conditions (ambient, harsh spotlight). Visualization of these representations in the resulting face space showed a highly ordered pattern (see Figure 2 ). Consistent with the network’s high accuracy at face identification, images clustered by identity. Identity clusters separated into regions of male and female faces (see Section 2.1.2 ). Within each identity cluster, the images separated by illumination condition—visible in the face space as chains of images. Within each illumination chain, the image representations were arranged in the space by viewpoint, which varied systematically along the image chain. To further probe the coding of identity, Hill et al. (2019) processed images of caricatures of the 3D heads (see also Blanz & Vetter 1999 ). Caricature representations were centered in each identity cluster, indicating that the network perceived a caricature as a good likeness of the identity.

Figure 2. Visualization of the top-level deep convolutional neural network (DCNN) similarity space for all images from Hill et al. (2019). (a–f) Points are colored according to different variables. Grey polygonal borders are for illustration purposes only and show the convex hull of all images of each identity; these convex hulls are expanded by a margin for visibility. The network separates identities accurately. In panels a and d, the space is divided into male and female sections. In panels b and e, illumination conditions subdivide within identity groupings. In panels c and f, the viewpoint varies sequentially within illumination clusters. Dotted-line boxes in panels a–c show areas enlarged in panels d–g. Figure adapted with permission from Hill et al. (2019).

DCNN face representation:

output vector produced for a face image processed through a deep network trained for faces

All results from Hill et al. (2019) were replicated using two networks with starkly different architectures. The first, developed by Ranjan et al. (2019) , was based on a ResNet-101 with 101 layers and skip connections; the second, developed by Chen et al. (2016) , had 15 convolution and pooling layers, a dropout layer, and one fully connected top layer. As measured using the brain-similarity metrics developed in Brain-Score ( Schrimpf et al. 2018 ), one of these architectures (ResNet-101) was the third most brain-like of the 25 networks tested. The ResNet-101 network scored well on both neural (V4 and IT cortex) and behavioral predictability for object recognition. Hill et al.’s (2019) replication of this face space using a shallower network ( Chen et al. 2016 ), however, suggests that network architecture may be less important than computational capacity in understanding high-level visual codes for faces (see Section 3.2 ).

Brain-Score:

neural and behavioral benchmarks that score an artificial neural network on its similarity to brain mechanisms for object recognition

Returning to the issue of human-like view invariance in a DCNN, Abudarham & Yovel (2020) compared the similarity of face representations computed within and across identities and viewpoints. Consistent with view-invariant performance, same-identity, different-view face pairs were more similar than different-identity, same-view face pairs. Consistent with a noninvariant face representation, correlations between similarity scores across head view decreased monotonically with increasing view disparity. These results support the characterization of DCNN codes as being functionally view invariant but with a view-specific code. Notably, earlier layers in the network showed view specificity, whereas higher layers showed view invariance.
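The logic of this analysis can be sketched directly. The example below builds synthetic codes in which identity carries more weight than viewpoint (an assumption made purely so the example runs end to end; real representations would come from a trained network) and compares same-identity, different-view pairs with different-identity, same-view pairs.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Synthetic stand-ins: 20 identities x 5 yaw angles, with each 128-D code built as
# an identity vector plus a smaller view-dependent component (illustrative assumption).
identities = rng.normal(size=(20, 128))
views = rng.normal(size=(5, 128))
yaws = [-60, -30, 0, 30, 60]
codes, labels = [], []
for i, ident in enumerate(identities):
    for v, yaw in enumerate(yaws):
        codes.append(ident + 0.5 * views[v])
        labels.append((i, yaw))
codes = np.array(codes)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_id_diff_view, diff_id_same_view = [], []
for m, n in combinations(range(len(codes)), 2):
    (id_m, yaw_m), (id_n, yaw_n) = labels[m], labels[n]
    s = cosine(codes[m], codes[n])
    if id_m == id_n and yaw_m != yaw_n:
        same_id_diff_view.append(s)      # pairs that test invariance to view
    elif id_m != id_n and yaw_m == yaw_n:
        diff_id_same_view.append(s)      # pairs that test sensitivity to identity

print("same identity, different view:", np.mean(same_id_diff_view))
print("different identity, same view:", np.mean(diff_id_same_view))
```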

It is worth digressing briefly to consider invariance in the context of neural approaches to face processing. An underlying assumption of neural approaches is that “a major purpose of the face patches is thus to construct a representation of individual identity invariant to view direction” ( Hesse & Tsao 2020 , pp. 703). Ideas about how this is accomplished have evolved. Freiwald & Tsao (2010) posited the progressive computation of invariance via the pooling of neurons across face patches, as follows. In early patches, a neuron responds to a specific identity from specific views; in middle face patches, greater invariance is achieved by pooling the responses of mirror-symmetric views of an identity; in later face patches, each neuron pools inputs representing all views of the same individual to create a fully view-invariant representation. More recently, Chang & Tsao (2017) proposed that the brain computes a view-invariant face code using shape and appearance parameters analogous to those used in a computer graphics model of face synthesis ( Cootes et al. 1995 ) (see the sidebar titled Neurons, Neural Tuning, Population Codes, Features, and Perceptual Constancy ). This code retains information about the face, but not about the particular image viewed.

NEURONS, NEURAL TUNING, POPULATION CODES, FEATURES, AND PERCEPTUAL CONSTANCY

Barlow (1972 , p. 371) wrote, “Results obtained by recording from single neurons in sensory pathways…obviously tell us something important about how we sense the world around us; but what exactly have we been told?” In answer, Barlow (1972 , p. 371) proposed that “our perceptions are caused by the activity of a rather small number of neurons selected from a very large population of predominantly silent cells. The activity of each single cell is thus an important perceptual event and it is thought to be related quite simply to our subjective experience.” Although this proposal is sometimes caricatured as the grandmother cell doctrine (see also Gross 2002 ), Barlow simply asserts that single-unit activity can be interpreted in perceptual terms, and that the responses of small numbers of units, in combination, underlie subjective perceptual experience. This proposal reflects ideas gleaned from studies of early visual areas that have been translated, at least in part, to studies of high-level vision.

Over the past decade, single neurons in face patches have been characterized as selective for facial features (e.g., aspect ratio, hair length, eyebrow height) ( Freiwald et al. 2009 ), face viewpoint and identity ( Freiwald & Tsao 2010 ), eyes ( Issa & DiCarlo 2012 ), and shape or appearance parameters from an active appearance model of facial synthesis ( Chang & Tsao 2017 ). Neurophysiological studies of face and object processing also employ techniques aimed at understanding neural population codes. Using the pattern of neural responses in a population of neurons (e.g., IT), linear classifiers are used often to predict subjective percepts (commonly defined as the image viewed). For example, Chang & Tsao (2017) showed that face images viewed by a macaque could be reconstructed using a linear combination of the activity of just 205 face cells in face patches ML–MF and AM. This classifier provides a real neural network model of the face-selective cortex that can be interpreted in simple terms.

Population code models generated from real neural data (a few hundred units), however, differ substantially in scale from the face- and object-selective cortical regions that they model (1 mm³ of the cerebral cortex contains approximately 50,000 neurons and 300 million adjustable parameters; Azevedo et al. 2009 , Kandel et al. 2000 , Hasson et al. 2020 ). This difference in scale is at the core of a tension between model interpretability and real-world task generalizability ( Hasson et al. 2020 ). It also creates tension between the neural coding hypotheses suggested by deep learning and the limitations of current neuroscience techniques for testing these hypotheses. To model neural function, an electrode gives access to single neurons and (with multi-unit recordings) to relatively small numbers of neurons (a few hundred). Neurocomputational theory based on direct fit models posits that overparameterization (i.e., the extremely high number of parameters available for neural computation) is critical to the brain’s solution to real-world problems (see Section 3.2 ). Bridging the gap between the computational and neural scale of these perspectives remains an ongoing challenge for the field.

Deep networks suggest an alternative that is largely consistent with neurophysiological data but interprets the data in a different light. Neurocomputational theory posits that the ventral visual system untangles face identity information from image parameters ( DiCarlo & Cox 2007 ). The idea is that visual processing starts in the image domain, where identity and viewpoint information are entangled. With successive levels of neural processing, manifolds corresponding to individual identities are untangled from image variation. This creates a representational space where identities can be separated with hyperplanes. Image information is not lost, but rather, is rearranged (for object recognition results, see Hong et al. 2016 ). The retention of image and identity information in DCNN face representations is consistent with this theory. It is also consistent with basic neuroscience findings indicating the emergence of a representation dominated by identity that retains sensitivity to image features (See Section 2.2 ).

2.1.2. Appearance and demographics.

Faces can be described using what computer vision researchers have called attributes or soft biometrics (hairstyle, hair color, facial hair, and accessories such as makeup and glasses). The definition of attributes in the computational literature is vague and can include demographics (e.g., gender, age, race) and even facial expression. Identity codes from deep networks retain a wide variety of face attributes. For example, Terhörst et al. (2020) built a massive attribute classifier (MAC) to test whether 113 attributes could be predicted from the face representations produced by deep networks [ArcFace ( Deng et al. 2019 ) or FaceNet ( Schroff et al. 2015 )] for images from in-the-wild data sets ( Huang et al. 2008 , Liu et al. 2015 ). The MAC learned to map from DCNN-generated face representations to attribute labels. Cross-validated results showed that 39 of the attributes were easily predictable, and 74 of the 113 were predictable at reliable levels. Hairstyle, hair color, beard, and accessories were predicted easily. Attributes such as face geometry (e.g., round), periocular characteristics (e.g., arched eyebrows), and nose were moderately predictable. Skin and mouth attributes were not well predicted.
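In outline, an attribute classifier of this kind is a stack of simple readouts trained on frozen face representations. The sketch below uses random stand-ins for the embeddings and a hypothetical "wears glasses" label (so cross-validated accuracy will hover near chance; real ArcFace or FaceNet codes with annotated attributes are what Terhörst et al. 2020 used) and shows one such readout.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Stand-ins for identity-trained face representations and one binary attribute label
# per image; in practice these come from a pretrained network and an annotated data set.
embeddings = rng.normal(size=(2000, 512))
has_glasses = rng.integers(0, 2, size=2000)

# One linear readout per attribute, evaluated with cross-validation.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, embeddings, has_glasses, cv=5)
print("cross-validated accuracy:", scores.mean())
```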

The continuous shuffling of identity, attribute, and image information across layers of the network was demonstrated by Dhar et al. (2020) . They tracked the expressivity of attributes (identity, sex, age, pose) across layers of a deep network. Expressivity was defined as the degree to which a feature vector, from any given layer of a network, specified an attribute. Dhar et al. (2020) computed expressivity using a second neural network that estimated the mutual information between attributes and DCNN features. Expressivity order in the final fully connected layer of both networks (Resnet-101 and Inception Resnet v2; Ranjan et al. 2019 ) indicated that identity was most expressed, followed by age, sex, and yaw. Identity expressivity increased dramatically from the final pooling layer to the last fully connected layer. This echoes the progressive increase in the detectability of view-invariant face identity representations seen across face patches in the macaque ( Freiwald & Tsao 2010 ). It also raises the computational possibility of undetected viewpoint sensitivity in these neurons (see Section 3.1 ).

Mutual information:

a statistical term from information theory that quantifies the codependence of information between two random variables
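Dhar et al. (2020) estimated expressivity with a second neural network; a rougher but self-contained approximation is to sum per-feature mutual-information estimates between an attribute and the features at each layer. The sketch below uses random stand-ins for two layers' features and hypothetical identity labels, so the absolute values are meaningless; the point is the shape of the computation (an expressivity score compared across layers).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(3)

# Hypothetical per-layer feature vectors for the same set of images.
layer_features = {
    "final_pool": rng.normal(size=(500, 256)),
    "fc_last": rng.normal(size=(500, 128)),
}
identity_labels = rng.integers(0, 50, size=500)

for name, feats in layer_features.items():
    # Nonparametric per-feature mutual information with the attribute, summed over features,
    # as a crude stand-in for the learned estimator used by Dhar et al. (2020).
    mi = mutual_info_classif(feats, identity_labels, discrete_features=False)
    print(name, "approximate expressivity of identity:", mi.sum())
```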

2.1.3. Social traits.

People make consistent (albeit invalid) inferences about a person’s social traits based on their face ( Todorov 2017 ). These judgments have profound consequences. For example, competence judgments about faces predict election success at levels far above chance ( Todorov et al. 2005 ). The physical structure of the face supports these trait inferences ( Oosterhof & Todorov 2008 , Walker & Vetter 2009 ), and thus it is not surprising that deep networks retain this information. Using face representations produced by a network trained for face identification ( Sankaranarayanan et al. 2016 ), 11 traits (e.g., shy, warm, impulsive, artistic, lazy), rated by human participants, were predicted at levels well above chance ( Parde et al. 2019 ). Song et al. (2017) found that more than half of 40 attributes were predicted accurately by a network trained for object recognition (VGG-16; Simonyan & Zisserman 2014 ). Human and machine trait ratings were highly correlated.

Other studies show that deep networks can be optimized to predict traits from images. Lewenberg et al. (2016) crowd-sourced large numbers of objective (e.g., hair color) and subjective (e.g., attractiveness) attribute ratings from faces. DCNNs were trained to classify images for the presence or absence of each attribute. They found highly accurate classification for the objective attributes and somewhat less accurate classification for the subjective attributes. McCurrie et al. (2017) trained a DCNN to classify faces according to trustworthiness, dominance, and IQ. They found significant accord with human ratings, with higher agreement for trustworthiness and dominance than for IQ.

2.1.4. Facial expressions.

Facial expressions are also detectable in face representations produced by identity-trained deep networks. Colón et al. (2021) found that expression classification was well above chance for face representations of images from the Karolinska data set ( Lundqvist et al. 1998 ), which includes seven facial expressions (happy, sad, angry, surprised, fearful, disgusted, neutral) seen from five viewpoints (frontal and 90- and 45-degree left and right profiles). Consistent with human data, happiness was classified most accurately, followed by surprise, disgust, anger, neutral, sadness, and fear. Notably, accuracy did not vary across viewpoint. Visualization of the identities in the emergent face space showed a structured ordering of similarity in which viewpoint dominated over expression.

2.2. Functional Invariance, Useful Variability

The emergent code from identity-trained DCNNs can be used to recognize faces robustly, but it also retains extraneous information that is of limited, or no, value for identification. Although demographic and trait information offers weak hints to identity, image characteristics and facial expression are not useful for identification. Attributes such as glasses, hairstyle, and facial hair are, at best, weak identity cues and, at worst, misleading cues that will not remain constant over extended time periods. In purely computational terms, the variability of face representations for different images of an identity can lead to errors. Although this is problematic in security applications, coincidental features and attributes can be diagnostic enough to support acceptably accurate identification performance in day-to-day face recognition ( Yovel & O’Toole 2016 ). (For related arguments based on adversarial images for object recognition, see Ilyas et al. 2019 , Xie et al. 2020 , Yuan et al. 2020 .) A less-than-perfect identification system in computational terms, however, can be a surprisingly efficient, multipurpose face processing system that supports identification and the detection of visually derived semantic information [called attributes by Bruce & Young (1986) ].

What do we learn from these studies that can be useful in understanding human visual processing of faces? First, we learn that it is computationally feasible to accommodate diverse information about faces (identity, demographics, visually derived semantic information), images (viewpoint, illumination, quality), and emotions (expression) in a unified representation. Furthermore, this diverse information can be accessed selectively from the representation. Thus, identity, image parameters, and attributes are all untangled when learning prioritizes the difficult within-category discrimination problem of face identification.

Second, we learn that to understand high-level visual representations for faces, we need to think in terms of categorical codes unbound from a spatial frame of reference. Although remnants of retinotopy and image characteristics remain in high-level visual areas (e.g., Grill-Spector et al. 1999 , Kay et al. 2015 , Kietzmann et al. 2012 , Natu et al. 2010 , Yue et al. 2010 ), the expressivity of spatial layout weakens dramatically from early visual areas to categorically structured areas in the IT cortex. Categorical face representations should capture what cognitive and perceptual psychologists call facial features (e.g., face shape, eye color). Indeed, altering these types of features in a face affects identity perception similarly for humans and deep networks ( Abudarham et al. 2019 ). However, neurocomputational theory suggests that finding these features in the neural code will likely require rethinking the interpretation of neural tuning and population coding (see Section 3.2 ).

Third, if the ventral stream untangles information across layers of computations, then we should expect traces of identity, image data, and attributes at many, if not all, neural network layers. These may variously dominate the strength of the neural signal at different layers (see Section 3.1 ). Thus, various layers in the network will likely succeed in predicting several types of information about the face and/or image, though with differing accuracy. For now, we should not ascribe too much importance to findings about which specific layer(s) of a particular network predict specific attributes. Instead, we should pay attention to the pattern of prediction accuracy across layers. We would expect the following pattern. Clearly, for the optimized attribute (identity), the output offers the clearest access. For subject-related attributes (e.g., demographics), this may also be the case. For image-related attributes, we would expect every layer in the network to retain some degree of prediction ability. Exactly how, where, and whether the neural system makes use of these attributes for specific tasks remain open questions.

3. RETHINKING VISUAL FEATURES: IMPLICATIONS FOR NEURAL CODES

Deep learning models force us to rethink the definition and interpretation of facial features in high-level representations. Theoretical ideas about the brain’s solution to complex real-world tasks such as face recognition must be reconciled at the level of neural units and representational spaces. Deep learning models can be used to test hypotheses about how faces are stored in the high-dimensional representational space defined by the pattern of responses of large numbers of neurons.

3.1. Units Confound Information that Separates in the Representation Space

Insight into interpreting facial features comes from deep network simulations aimed at understanding the relationship between unit responses and the information retained in the face representation. Parde et al. (2021) compared identification, gender classification, and viewpoint estimation in subspaces of a DCNN face space. Using an identity-trained network capable of all three tasks, they tested performance on the tasks using randomly sampled subsets of output units. Beginning at full dimensionality (512 units) and progressively decreasing sample size, they found no notable decline in identification accuracy for more than 3,000 in-the-wild faces until the sample size reached 16 randomly chosen units (3% of full dimensionality). Correlations between unit responses across representations were near zero, indicating that individual units captured nonredundant identity cues. Statistical power for identification (i.e., separating identities) was uniformly high for all output units, demonstrating that units used their entire response range to separate identities. A unit firing at its maximum provided no more, and no less, information than any other response value. This distinction may seem trivial, but it is not. The data suggest that every output unit acts to separate identities to the maximum degree possible. As such, all units participate in coding all identities. In information theory terms, this is an ideal use of neural resources.
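The subsampling logic is easy to emulate. The sketch below builds synthetic gallery and probe codes (an assumption so the example is self-contained; real codes would come from a trained network) and measures rank-1 identification accuracy with progressively fewer randomly chosen output units.

```python
import numpy as np

rng = np.random.default_rng(4)

def rank1_accuracy(gallery, probes, labels_g, labels_p):
    """1-nearest-neighbor identification with cosine similarity."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    p = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    nearest = (p @ g.T).argmax(axis=1)
    return float(np.mean(labels_g[nearest] == labels_p))

# Stand-ins for 512-D identity codes; gallery and probe simulate different images
# of the same 200 people by adding independent image noise to each identity vector.
n_id = 200
ident = rng.normal(size=(n_id, 512))
gallery = ident + 0.3 * rng.normal(size=(n_id, 512))
probes = ident + 0.3 * rng.normal(size=(n_id, 512))
labels = np.arange(n_id)

for k in [512, 128, 32, 16, 4]:
    units = rng.choice(512, size=k, replace=False)   # random subset of output units
    acc = rank1_accuracy(gallery[:, units], probes[:, units], labels, labels)
    print(f"{k:3d} units: rank-1 accuracy = {acc:.2f}")
```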

For gender classification and viewpoint estimation, performance declined at a much faster rate than for identification as units were deleted ( Parde et al. 2021 ). Statistical power for predicting gender and viewpoint was strong in the distributed code but weak at the level of the unit. Prediction power for these attributes was again roughly equivalent for all units. Thus, individual units contributed to coding all three attributes, but identity modulated individual unit responses far more strongly than did gender or viewpoint. Notably, a principal component (PC) analysis of representations in the full-dimensional space revealed subspaces aligned with identity, gender, and viewpoint ( Figure 3 ). Consistent with the strength of the categorical identity code in the representation, identity information dominated PCs explaining large amounts of variance, gender dominated the middle range of PCs, and viewpoint dominated PCs explaining small amounts of variation.

Figure 3. Illustration of the separation of the task-relevant information into subspaces for an identity-trained deep convolutional neural network (DCNN). Each plot shows the similarity (cosine) between principal components (PCs) of the face space and directional vectors in the space that are diagnostic of identity (top), gender (middle), and viewpoint (bottom). Figure adapted with permission from Parde et al. (2021).
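A compressed version of the subspace analysis in Figure 3 can be written directly against the embedding matrix. In the sketch below, the gender-diagnostic direction is taken as a difference of class means (an assumption for illustration; Parde et al. 2021 derived their diagnostic vectors differently), and the question is which principal components that direction projects onto.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Stand-in face representations with hypothetical binary gender labels.
codes = rng.normal(size=(2000, 512))
gender = rng.integers(0, 2, size=2000)

# A simple diagnostic direction for gender: difference of class means, unit length.
direction = codes[gender == 1].mean(axis=0) - codes[gender == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

# Cosine similarity between each principal component and the diagnostic direction
# shows where in the variance spectrum the attribute lives.
pcs = PCA(n_components=50).fit(codes).components_   # rows are unit-length PCs
cosines = pcs @ direction
print("PC most aligned with the gender direction:", int(np.abs(cosines).argmax()))
```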

The emergence and effectiveness of these codes in DCNNs suggest that caution is needed in ascribing significance only to stimuli that drive a neuron to high rates of response. Small-scale modulations of neural responses can also be meaningful. Let us consider a concrete example. A neurophysiologist probing the network used by Parde et al. (2021) would find some neurons that respond strongly to a few identities. Interpreting this as identity tuning, however, would be an incorrect characterization of a code in which all units participate in coding all identities. Concomitantly, few units in the network would appear responsive to viewpoint or gender variations because unit firing rates would modulate only slightly with changes in viewpoint or gender. Thus, the distributed coding of view and gender across units would likely be missed. The finding that neurons in macaque face patch AM respond selectively (i.e., with high response rates) to identity over variable views ( Freiwald & Tsao 2010 ) is consistent with DCNN face representations. It is possible, however, that these units also encode other face and image attributes, but with differential degrees of expressivity. This would be computationally consistent with the untangling theory and with DCNN codes.

Macaque face patches:

regions of the macaque cortex that respond selectively to faces, including the posterior lateral (PL), middle lateral (ML), middle fundus (MF), anterior lateral (AL), anterior fundus (AF), and anterior medial (AM)

Another example comes from the use of generative adversarial networks and related techniques to characterize the response properties of single (or multiple) neuron(s) in the primate visual cortex ( Bashivan et al. 2019 , Ponce et al. 2019 , Yuan et al. 2020 ). These techniques have examined neurons in areas V4 ( Bashivan et al. 2019 ) and IT ( Ponce et al. 2019 , Yuan et al. 2020 ). The goal is to progressively evolve images that drive neurons to their maximum response or that selectively (in)activate subsets of neurons. Evolved images show complex mosaics of textures, shapes, and colors. They sometimes show animals or people and sometimes reveal spatial patterns that are not semantically interpretable. However, these techniques rely on two strong assumptions. First, they assume that a neuron’s response can be characterized completely in terms of the stimuli that activate it maximally, thereby discounting other response rates as noninformative. The computational utility of a unit’s full response range in DCNNs suggests that reconsideration of this assumption is necessary. Second, these techniques assume that a neuron’s response properties can be visualized accurately as a two-dimensional image. Given the categorical, nonretinotopic nature of representations in high-level visual areas, this seems problematic. If the representation under consideration is not in the image or pixel domain, then image-based visualization may offer limited, and possibly misleading, insight into the underlying nature of the code.

3.2. Direct-Fit Models and Deep Learning

In rethinking visual features at a theoretical level, direct-fit models of neural coding appear to best explain deep learning findings in multiple domains (e.g., face recognition, language) ( Hasson et al. 2020 ). These models posit that neural computation fits densely sampled data from the environment. Implementation is accomplished using “overparameterized optimization algorithms that increase predictive (generalization) power, without explicitly modeling the underlying generative structure of the world” ( Hasson et al. 2020 , p. 418). Hasson et al. (2020) begin with an ideal model in a small-parameter space ( Figure 4 ). When the underlying structure of the world is simple, a small-parameter model will find the underlying generative function, thereby supporting generalization via interpolation and extrapolation. Despite decades of effort, small-parameter functions have not solved real-world face recognition with performance anywhere near that of humans.

Figure 4. (a) A model with too few parameters fails to fit the data. (b) The ideal-fit model fits with a small number of parameters and has generative power that supports interpolation and extrapolation. (c) An overfit function can model noise in the training data. (d) An overparameterized model generalizes well to new stimuli within the scope of the training samples. Figure adapted with permission from Hasson et al. (2020).

When the underlying structure of the world is complex and multivariate, direct-fit models offer an alternative to models based on small-parameter functions. With densely sampled real-world training data, each new observation can be placed in the context of past experience. More formally, direct-fit models solve the problem of generalization to new exemplars by experience-scaffolded interpolation ( Hasson et al. 2020 ). This produces face recognition performance in the range of that of humans. A fundamental element of the success of deep networks is that they model the environment with big data, which can be structured in overparameterized spaces. The scale of the parameterization and the requirement to operate on real-world data are pivotal. Once the network is sufficiently parameterized to fit the data, the exact details of its architecture are not important. This may explain why starkly different network architectures arrive at similarly structured representations ( Hill et al. 2019 , Parde et al. 2017 , Storrs et al. 2020 ).
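The interpolation-versus-extrapolation point can be made with a toy direct-fit example. Below, a heavily parameterized polynomial (a crude stand-in for an overparameterized network) is fit to densely sampled data; it predicts well inside the sampled range and badly outside it. The target function, noise level, and polynomial degrees are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def world(x):
    """The underlying structure of the toy 'world' being sampled."""
    return np.sin(2 * x) + 0.3 * x**2

# Densely sampled observations with noise, all drawn from the range [-3, 3].
x_train = np.sort(rng.uniform(-3, 3, size=400))
y_train = world(x_train) + 0.1 * rng.normal(size=400)

# Small-parameter model versus heavily parameterized model.
small_fit = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)
big_fit = np.polynomial.Polynomial.fit(x_train, y_train, deg=60)

def err(model, x):
    return float(np.mean((model(x) - world(x)) ** 2))

x_interp = rng.uniform(-3, 3, size=200)   # inside the training range
x_extrap = rng.uniform(4, 5, size=200)    # outside the training range
for name, model in [("degree-3 fit", small_fit), ("degree-60 fit", big_fit)]:
    print(name, "interpolation error:", err(model, x_interp),
          "extrapolation error:", err(model, x_extrap))
```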

Returning to the issue of features, in neurocomputational terms, the strength of connectivity between neurons at synapses is the primary locus of information, just as weights between units in a deep network comprise information. We expect features, whatever they are, to be housed in the combination of connection strengths among units, not in the units themselves. In a high-dimensional multivariate encoding space, they are hyperplane directions through the space. Thus, features are represented across many computing elements, and each computing element participates in encoding many features ( Hasson et al. 2020 , Parde et al. 2021 ). If features are directions in a high-dimensional coding space ( Goodfellow et al. 2014 ), then units act as an arbitrary projection surface from which this information can be accessed—albeit in a nontransparent form.

A downside of direct-fit models is that they cannot generalize via extrapolation. The other-race effect is an example of how face recognition may fail due to limited experience ( Malpass & Kravitz 1969 ) (see Section 4.3.2 ). The extrapolation limit may be countered, however, by the capacity of direct-fit models to acquire expertise within the confines of experience. For example, in human perception, category experience selectively structures representations as new exemplars are learned. Collins & Behrmann (2020) show that this occurs in a way that reflects the greater experience that humans have with faces than with computer-generated objects from novel made-up categories, which the authors call YUFOs. They tracked the perceived similarity of pairs of other-race faces and YUFOs as people learned novel exemplars of each. Experience changed perceived similarities more selectively for faces than for YUFOs, enabling more nuanced discrimination of exemplars from the experienced category of faces.

In summary, direct-fit models offer a framework for thinking about high-level visual codes for faces in a way that unifies disparate data on single units and high-dimensional coding spaces. These models are fueled by the rich experience that we (models) gain from learning (training on) real-world data. They solve complex visual tasks with interpolated solutions that elude transparent semantic interpretation.

4. RETHINKING LEARNING IN HUMANS AND DEEP NETWORKS

Deep network models of human face processing force us to consider learning as a complex and diverse set of mechanisms that can overlap, accumulate over time, and interact. Learning in both humans and artificial neural networks can refer to qualitatively different phenomena. In both cases, learning involves multiple steps. For DCNNs, these steps are fundamental to a network’s ability to recognize faces across image and appearance variation. Human visual learning is likewise diverse and unfolds across the developmental lifespan in a process governed by genetics and environmental input ( Goodman & Shatz 1993 ). The stepwise implementation of learning is one way that DCNNs differ from previous face recognition networks. Considered as manipulable modeling tools, the learning steps in DCNNs force us to think in concrete and nuanced ways about how humans learn faces.

In this section, we outline the learning layers in human face processing ( Section 4.1 ), introduce the layers of learning used in training machines ( Section 4.2 ), and consider the relationship between the two in the context of human behavior ( Section 4.3.1 ). The human learning layers support a complex, biologically realized face processing system. The machine learning layers can be thought of as building blocks that can be combined in a variety of ways to model human behavioral phenomena. At the outset, we note that machine learning is designed to maximize performance—not to model the development of the human face processing system ( Smith & Slone 2017 ). Concomitantly, the sequential presentation of training data in DCNNs differs from the pattern of exposure that infants and young children have with faces and objects ( Jayaraman et al. 2015 ). The machine learning steps, however, can be modified to model human learning more closely. In practical terms, fully trained DCNNs, available on the web, are used (almost exclusively) to model human neural systems (see the sidebar titled Caveat: Iteration Between Theory and Practice ). It is important, therefore, to understand how (and why) these models are configured as they are and to understand the types of learning tools available for modeling human face processing. These steps may provide computational grounding for basic learning mechanisms hypothesized in humans.

4.1. Human Learning for Face Processing

To model human face processing, researchers need to consider the following types of learning. The most specific form of learning is familiar face recognition. People learn the faces of specific familiar individuals (e.g., friends, family, celebrities). Familiar faces are recognized robustly over challenging changes in appearance and image characteristics. The second-most specific is local population tuning. People recognize own-race faces more accurately than other-race faces, a phenomenon referred to as the other-race effect (e.g., Malpass & Kravitz 1969 ). This likely results from tuning to the statistical properties of the faces that we see most frequently—typically faces of our own race. The third-most specific is unfamiliar face recognition. People can differentiate unfamiliar faces perceptually. Unfamiliar refers to faces that a person has not encountered previously or has encountered infrequently. Unfamiliar face recognition is less robust to image and appearance change than is familiar face recognition. The least specific form of learning is object recognition. At a fundamental level of analysis, faces are objects, and the two share early visual processing wetware.

4.2. How Deep Convolutional Neural Networks Learn Face Identification

Training DCNNs for face recognition involves a sequence of learning stages, each with a concrete objective. Unlike human learning, machine learning stages are executed in strict sequence. The goal across all stages of training is to build an effective method for converting images of faces into points in a high-dimensional space. The resulting high-dimensional space allows for easy comparison among faces, search, and clustering. In this section, we sketch out the engineering approach to learning, working forward from the most general to the most specific form of learning. This follows the implementation order used by engineers.

4.2.1. Object classification (between-category learning): Stage 1.

Deep networks for face identification are commonly built on top of DCNNs that have been pretrained for object classification. Pretraining is carried out using large data sets of objects, such as those available in ImageNet ( Russakovsky et al. 2015 ), which contains more than 14 million images of over 1,000 classes of objects (e.g., volcanoes, cups, chihuahuas). The object categorization training procedure involves adjusting the weights on all layers of the network. For training to converge, a large training set is required. The loss function optimized in this procedure typically uses the well-understood cross-entropy loss + Softmax combination. Most practitioners do not execute this step because it has been performed already in a pretrained model downloaded from a public repository in a format compatible with DCNN software libraries [e.g., PyTorch ( Paszke et al. 2019 ), TensorFlow ( Abadi et al. 2016 )]. Networks trained for object recognition have proven better for face identification than networks that start with a random configuration ( Liu et al. 2015 , Yi et al. 2014 ).
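
To make this stage concrete, the sketch below shows how a backbone pretrained for object classification is typically obtained rather than trained from scratch. The choice of ResNet-50 and of the torchvision model zoo is an illustrative assumption, not a prescription drawn from any particular study.

```python
# Minimal sketch of Stage 1 (assumes PyTorch + torchvision; ResNet-50 is illustrative).
import torchvision.models as models

# Download a backbone already trained on ImageNet object categories with the
# standard cross-entropy + Softmax objective; most practitioners start here
# instead of running object-classification training themselves.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# The final fully connected layer still maps to the 1,000 ImageNet object classes.
print(backbone.fc)  # Linear(in_features=2048, out_features=1000, bias=True)
```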

4.2.2. Face recognition (within-category learning): Stage 2.

Face recognition training is implemented in a second stage of training. In this stage, the last fully connected layer that connects to object-category nodes (e.g., volcanoes, cups) is removed from the results of the Stage 1 training. Next, a fully connected layer that maps to the number of face identities available for face training is connected. Depending on the size of the face training set, the weights of either all layers or all but a few layers at the beginning of the network are updated. The former is common when very large numbers of face identities are available for training. In academic laboratories, data sets include 5–10 million face images of 40,000–100,000 identities. In industry, far larger data sets are often used ( Schroff et al. 2015 ). A technical difficulty encountered in retraining an object classification network to a face recognition network is the large increase in the number of categories involved (approximately 1,000 objects versus 50,000+ faces). Special loss functions can address this issue [e.g., L2-Softmax/crystal loss ( Ranjan et al. 2017 ), NormFace ( Wang et al. 2017 ), angular Softmax ( Li et al. 2018 ), additive Softmax ( Wang et al. 2018 ), additive angular margins ( Deng et al. 2019 )].
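
A minimal sketch of this head replacement follows, under the simplifying assumptions of a ResNet-50 backbone, a hypothetical training set of 50,000 identities, and a plain cross-entropy criterion (in practice, one of the margin-based losses listed above would take its place).

```python
import torch.nn as nn
import torchvision.models as models

NUM_IDENTITIES = 50_000  # hypothetical number of face identities in the training set

# Start from the Stage 1 object-classification network.
net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Remove the layer that maps to object categories and attach a new fully
# connected layer that maps to the available face identities.
embedding_dim = net.fc.in_features          # 2048 units for ResNet-50
net.fc = nn.Linear(embedding_dim, NUM_IDENTITIES)

# Plain cross-entropy shown for brevity; a margin-based loss (e.g., L2-Softmax,
# additive margin, or additive angular margin) would replace it in practice.
criterion = nn.CrossEntropyLoss()
```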

When the Stage 2 face training is complete, the last fully connected layer that connects to the 50,000+ face identity nodes is removed, leaving below it a relatively low-dimensional (128- to 5,000-unit) layer of output units. This can be thought of as the face representation. This output represents a face image, not a face identity. At this point in training, any arbitrary face image from any identity (known or unknown to the network) can be processed by the DCNN to produce a compact face image descriptor across the units of this layer. If the network functions perfectly, then it will produce identical codes for all images of the same person. This would amount to perfect image and appearance generalization. This is not usually achieved, even when the network is highly accurate (see Section 2 ).
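
The following sketch illustrates how the truncated network is used as a face image descriptor and how two descriptors are compared. An ImageNet-pretrained backbone and random tensors stand in for a Stage 2 face-trained network and preprocessed face crops.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# In practice this backbone would have completed Stage 2 face training;
# the ImageNet weights are a stand-in for the sketch.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()   # drop the identity-node layer; the output is the face descriptor
backbone.eval()

with torch.no_grad():
    img_a = torch.rand(1, 3, 224, 224)  # stand-ins for two aligned face crops
    img_b = torch.rand(1, 3, 224, 224)
    desc_a = F.normalize(backbone(img_a), dim=1)
    desc_b = F.normalize(backbone(img_b), dim=1)

# Cosine similarity between the two image descriptors; identical codes for all
# images of a person would mean perfect generalization, which is rarely achieved.
similarity = (desc_a * desc_b).sum().item()
```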

In this state, the network is commonly employed to recognize faces not seen in training (unfamiliar faces). Stage 2 training supports a surprising degree of generalization (e.g., pose, expression, illumination, and appearance) for images of unfamiliar faces. This general face learning gives the system special knowledge of faces and enables it to perform within-category face discrimination for unfamiliar faces ( O’Toole et al. 2018 ). With or without Stage 3 training, the network is now capable of converting images of faces into points in a high-dimensional space, which, as noted above, is the primary goal of training. In practice, however, Stages 3 and 4 can provide a critical bridge to modeling behavioral characteristics of the human face processing system.

4.2.3. Adapting to local statistics of people and visual environments: Stage 3.

The objective of Stage 3 training is to finalize the modification of the DCNN weights to better adapt to the application domain. The term application domain can refer to faces from a particular race or ethnicity or, as it is commonly used in industry, to the type of images to be processed (e.g., in-the-wild faces, passport photographs). This training is a crucial step in many applications because there will be no further transformation of the weights. Special care is needed in this training to avoid collapsing the representation into a form that is too specific. Training at this stage can improve performance for some faces and decrease it for others.

Whereas Stages 1 and 2 are used in the vast majority of published computational work, in Stage 3, researchers diverge. Although there is no standard implementation for this training, fine-tuning and learning a triplet loss embedding ( van der Maaten & Weinberger 2012 ) are common methods. These methods are conceptually similar but differ in implementation. In both methods, ( a ) new layers are added to the network, ( b ) specific subsets of layers are frozen or unfrozen, and ( c ) optimization continues with an appropriate loss function using a new data set with the desired domain characteristics. Fine-tuning starts from an already-viable network state and updates a nonempty subset of weights, or possibly all weights. It is typically implemented with smaller learning rates and can use smaller training sets than those needed for full training. Triplet loss is implemented by freezing all layers and adding a new, fully connected layer. Minimization is done with the triplet loss, again on a new (smaller) data set with the desired domain characteristics.
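
As an illustration, the sketch below adds a new embedding layer on top of frozen Stage 2 descriptors and optimizes it with a triplet margin loss. The descriptor and embedding dimensionalities, the margin, and the random stand-in data are assumptions of the sketch rather than settings from any particular system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DESC_DIM = 512   # assumed size of the frozen Stage 2 face descriptor
EMBED_DIM = 128  # size of the new embedding learned for the target domain

# The Stage 2 network is frozen; only this new fully connected layer is trained.
projection = nn.Linear(DESC_DIM, EMBED_DIM)
triplet_loss = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-4)

def embed(descriptors):
    # Map frozen descriptors into the new domain-adapted embedding space.
    return F.normalize(projection(descriptors), dim=1)

# anchor/positive: two descriptors of the same identity from the target domain;
# negative: a descriptor of a different identity (random tensors stand in here).
anchor, positive, negative = (torch.rand(32, DESC_DIM) for _ in range(3))
loss = triplet_loss(embed(anchor), embed(positive), embed(negative))
loss.backward()
optimizer.step()
```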

A natural question is why Stage 2 (general face training) is not considered fine-tuning. The answer, in practice, comes down to viability and volume. When the training for Stage 2 starts, the network is not in a viable state to perform face recognition. Therefore, it requires a voluminous, diverse data set to function. Stage 3 begins with a functional network and can be tuned effectively with a small targeted data set.

This face knowledge history provides a tool for adapting to local face statistics (e.g., race) ( O’Toole et al. 2018 ).

4.2.4. Learning individual people: Stage 4.

In psychological terms, learning individual familiar faces involves seeing multiple, diverse images of the individuals to whom the faces belong. As we see more images of a person, we become more familiar with their face and can recognize it from increasingly variable images ( Dowsett et al. 2016 , Murphy et al. 2015 , Ritchie & Burton 2017 ). In computational terms, this translates into the question of how a network can learn to recognize a random set of special (familiar) faces with greater accuracy and robustness than other nonspecial (unfamiliar) faces—assuming, of course, the availability of multiple, variable images of the special faces. This stage of learning is defined, in nearly all cases, outside of the DCNN, with no change to weights within the DCNN.

The problem is as follows. The network starts with multiple images of each familiar identity and can produce a representation for each of the images, but what then? There is no standard familiarization protocol, but several approaches exist. We categorize these approaches below and link them to theoretical accounts of face familiarity in Section 4.3.3 .

The first approach is averaging identity codes, or 1-class learning. It is common in machine learning to use an average (or weighted average) of the DCNN-generated face image representations as an identity code (see also Crosswhite et al. 2018 , Su et al. 2015 ). Averaging creates a person-identity prototype ( Noyes et al. 2021 ) for each familiar face.
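
A minimal sketch of prototype formation by averaging follows, with random tensors standing in for the DCNN-generated descriptors of a familiar person's images.

```python
import torch
import torch.nn.functional as F

# Descriptors for several images of one familiar person (num_images x descriptor_dim);
# random values stand in for real DCNN outputs.
descriptors = F.normalize(torch.rand(20, 512), dim=1)

# The person-identity prototype is the (renormalized) average of the image codes.
prototype = F.normalize(descriptors.mean(dim=0), dim=0)

# A new image is matched to the familiar identity by its similarity to the prototype.
probe = F.normalize(torch.rand(512), dim=0)
score = torch.dot(probe, prototype).item()
```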

The second is individual face contrast, or 2-class learning. This technique employs direct learning of individual identities by contrasting them with all other identities. There are two classes because the model learns what makes each identity (positive class) different from all other identities (negative class). The distinctiveness of each familiar face is enhanced relative to all other known faces (e.g., Noyes et al. 2021 ).

The third is multiple face contrast, or K-class learning. This refers to the use of identification training for a random set of (familiar) faces with a simple network (often a one-layer network). The network learns to map DCNN-generated face representations of the available images onto identity nodes.
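
A sketch of this K-class approach appears below; the descriptor size, the number of familiar identities, and the stand-in training data are assumptions made for illustration. Note that the DCNN itself is untouched: only the small mapping network learns.

```python
import torch
import torch.nn as nn

DESC_DIM = 512      # assumed DCNN descriptor size
NUM_FAMILIAR = 20   # number of familiar identities being learned

# A one-layer network maps frozen DCNN descriptors onto identity nodes.
classifier = nn.Linear(DESC_DIM, NUM_FAMILIAR)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)

# Random tensors stand in for descriptors of familiar-face images and their labels.
descriptors = torch.rand(64, DESC_DIM)
labels = torch.randint(0, NUM_FAMILIAR, (64,))

loss = criterion(classifier(descriptors), labels)
loss.backward()
optimizer.step()
```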

The fourth approach is fine-tuning individual face representations. Fine-tuning has also been used for learning familiar identities ( Blauch et al. 2020a ). It is an unusual method because it alters weights within the DCNN itself. This can improve performance for the familiarized faces but can limit the network’s ability to represent other faces.
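
A sketch of familiarization by fine-tuning is given below. Unlike the previous methods, weights inside the DCNN change; here only the final convolutional block and a new identity head are unfrozen. The ResNet-50 backbone, the ImageNet weights, and the random stand-in data are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_FAMILIAR = 20  # hypothetical number of familiar identities

# A face-trained backbone is assumed; ImageNet weights stand in for the sketch.
net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for p in net.parameters():
    p.requires_grad = False
for p in net.layer4.parameters():   # unfreeze the last convolutional block
    p.requires_grad = True
net.fc = nn.Linear(net.fc.in_features, NUM_FAMILIAR)  # new, trainable identity head

optimizer = torch.optim.SGD((p for p in net.parameters() if p.requires_grad), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative update on a batch of familiar-face images (random stand-ins).
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, NUM_FAMILIAR, (8,))
loss = criterion(net(images), labels)
loss.backward()
optimizer.step()
```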

These methods create a personal face learning history that supports more accurate and robust face processing for familiar people ( O’Toole et al. 2018 ).

4.3. Mapping Learning Between Humans and Machines

Deep networks rely on multiple types of learning that can be useful in formulating and testing complex, nuanced hypotheses about human face learning. Manipulable variables include order of learning, training data, and network plasticity at different learning stages. We consider a sample of topics in human face processing that can be investigated by manipulating learning in deep networks. Because these investigations are just beginning, we provide an overview of the work in progress and discuss possible next steps in modeling.

4.3.1. Development of face processing.

Infants' early experience with faces is critical for the development of face processing skills ( Maurer et al. 2002 ). The timing of this experience has become increasingly clear with the availability of data sets gathered using head-mounted cameras on infants (1–15 months of age) (e.g., Jayaraman et al. 2015 , Yoshida & Smith 2008 ). In seeing the world from the perspective of the infant, it becomes clear that the development of sensorimotor abilities drives visual experience. Infants' experience transitions from seeing only what is made available to them (often faces in the near range), to seeing the world from the perspective of a crawler (objects and environments), to seeing hands and the objects that they manipulate ( Fausey et al. 2016 , Jayaraman et al. 2015 , Smith & Slone 2017 , Sugden & Moulson 2017 ). Between 1 and 3 months of age, faces are frequent, temporally persistent, and viewed frontally at close range. This early experience with faces is limited to a few individuals. Faces become less frequent as the child's first year progresses and attention shifts to the environment, to objects, and later to hands ( Jayaraman & Smith 2019 ).

The prevalence of a few important faces in the infants’ visual world suggests that early face learning may have an out-sized influence on structuring visual recognition systems. Infants’ visual experience of objects, faces, and environments can provide a curriculum for teaching machines ( Smith et al. 2018 ). DCNNs can be used to test hypotheses about the emergence of competence on different face processing tasks. Some basic computational challenges, however, need to be addressed. Training with very large numbers of objects (or faces) is required for deep network learning to converge (see Section 4.2.1 ). Starting small and building competence on multiple domains (faces, objects, environments) might require basic changes to deep network training. Alternatively, the small number of special faces in an infant’s life might be considered familiar faces. Perception and memory of these faces may be better modeled using tools that operate outside the deep network on representations that develop within the network (Stage 4 learning; Section 4.2.4 ). In this case, the quality of the representation produced at different points in a network’s development of more general visual knowledge varies (Stages 1 and 2 of training; Sections 4.2.1 and 4.2.2 ). The learning of these special faces early in development might interact with the learning of objects and scenes at the categorical level ( Rosch et al. 1976 , Yovel et al. 2012 ). A promising approach would involve pausing training in Stages 1 and 2 to test face representation quality at various points along the way to convergence.

4.3.2. Race bias in the performance of humans and deep networks.

People recognize own-race faces more accurately than other-race faces. For humans, this other-race effect begins in infancy ( Kelly et al. 2005 , 2007 ) and is manifest in children ( Pezdek et al. 2003 ). Although it is possible to reverse these effects in childhood ( Sangrigoli et al. 2005 ), training adults to recognize other-race faces yields only modest gains (e.g., Cavazos et al. 2019 , Hayward et al. 2017 , Laurence et al. 2016 , Matthews & Mondloch 2018 , Tanaka & Pierce 2009 ). Concomitantly, evidence for the experience-based contact hypothesis is weak when it is evaluated in adulthood ( Levin 2000 ). Clearly, the timing of experience is critical in the other-race effect. Developmental learning, which results in perceptual narrowing during a critical childhood period, may provide a partial account of the other-race effect ( Kelly et al. 2007 , Sangrigoli et al. 2005 , Scott & Monesson 2010 ).

Perceptual narrowing: sculpting of neural and perceptual processing via experience during a critical period in child development.

Face recognition algorithms from the 1990s and present-day DCNNs differ in accuracy for faces of different races (for a review, see Cavazos et al. 2020 ; for a comprehensive test of race bias in DCNNs, see Grother et al. 2019 ). Although training with faces of different races is often cited as a cause of race effects, it is unclear which training stage(s) contribute to the bias. It is likely that biased learning affects all learning stages. From the human perspective, for many people, experience favors own-race faces across the lifespan, potentially impacting performance through multiple learning mechanisms (developmental, unfamiliar, and familiar face learning). DCNN training may also use race-biased data at all stages. For humans, understanding the role of different types of learning in the other-race effect is challenging because experience with faces cannot be controlled. DCNNs can serve as a tool for studying critical periods and perceptual narrowing. It is possible to compare the face representations that emerge from training regimes that vary in the time course of exposure to faces of different races. The ability to manipulate training stage order, network plasticity, and training set diversity in deep networks offers an opportunity to test hypotheses about how bias emerges. The major challenge for DCNNs is the limited availability of face databases that represent the diversity of humans.

4.3.3. Familiar versus unfamiliar face recognition.

Face familiarity in a deep network can be modeled in more ways than we can count. The approaches presented in Section 4.2.4 are just a beginning. Researchers should focus first on the big questions. How do familiar and unfamiliar face representations differ—beyond simple accuracy and robustness? This has been much debated recently, and many questions remain ( Blauch et al. 2020a , b ; Young & Burton 2020 ; Yovel & Abudarham 2020 ). One approach is to ask where in the learning process representations for familiar and unfamiliar faces diverge. The methods outlined in Section 4.2.4 make some predictions.

In the individual and multiple face contrast methods, familiar and unfamiliar face representations are not differentiated within the deep network. Instead, familiar face representations generated by the DCNN are enhanced in another, simpler network populated with known faces. A familiar face’s representation is affected, therefore, by the other faces that we know well. Contrast techniques have preliminary empirical support. In the work of Noyes et al. (2021) , familiarization using individual-face contrast improved identification for both evasion and impersonation disguise. It also produced a pattern of accuracy similar to that seen for people familiar with the disguised individuals ( Noyes & Jenkins 2019 ). For humans who were unfamiliar with the disguised faces, the pattern of accuracy resembled that seen after general face training inside of the DCNN. There is also support for multiple-face contrast familiarization. Perceptual expertise findings that emphasize the selective effects of the exemplars experienced during highly skilled learning are consistent with this approach ( Collins & Behrmann 2020 ) (see Section 3.2 ).

Familiarization by averaging and fine-tuning both improve performance, but at a cost. For example, averaging the DCNN representations increased performance for evasion disguise by increasing tolerance for appearance variation ( Noyes et al. 2021 ). It decreased performance, however, for imposter disguise by allowing too much tolerance for appearance variation. Averaging methods highlight the need to balance the perception of identity across variable images with an ability to tell similar faces apart.

Familiarization via fine-tuning was explored by Blauch et al. (2020a) , who varied the number of layers tuned (all layers, fully connected layers, only the fully connected layer mapping the perceptual layer to identity nodes). Fine-tuning applied at lower layers alters the weights within the deep network to produce a perceptual representation potentially affected by familiar faces. Fine-tuning in the mapping layer is equivalent to multiclass face contrast learning ( Blauch et al. 2020b ). Blauch et al. (2020b) show that fine-tuning the perceptual representation, which they consider analogous to perceptual learning, is not necessary for producing a familiarity effect ( Blauch et al. 2020a ).

These approaches are not (necessarily) mutually exclusive and therefore can be combined to exploit useful features of each.

4.3.4. Objects, faces, both.

The organization of face-, body-, and object-selective areas in the ventral temporal cortex has been studied intensively (cf. Grill-Spector & Weiner 2014 ). Neuroimaging studies in childhood reveal the developmental time course of face selectivity and other high-level visual tasks (e.g., Natu et al. 2016 ; Nordt et al. 2019 , 2020 ). How these systems interact during development in the context of constantly changing input from the environment is an open question. DCNNs can be used to test functional hypotheses about the development of object and face learning (see also Grill-Spector et al. 2018 ).

In the case of machine learning, face recognition networks are more accurate when pretrained to categorize objects ( Liu et al. 2015 , Yi et al. 2014 ), and networks trained with only faces are more accurate for face recognition than networks trained with only objects ( Abudarham & Yovel 2020 , Blauch et al. 2020a ). Human-like viewpoint invariance was found in a DCNN trained for face recognition but not in one trained for object recognition ( Abudarham & Yovel 2020 ). In machine learning, networks are trained first with objects, and then with faces. Moreover, networks can simultaneously learn object and face recognition ( Dobs et al. 2020 ), which incurs minimal duplication of neural resources.

4.4. New Tools, New Questions, New Data, and a New Look at Old Data

Psychologists have long posited diverse and complex learning mechanisms for faces. Deep networks provide new tools that can be used to model human face learning with greater precision than was possible previously. This is useful because it encourages theoreticians to articulate hypotheses in ways specific enough to model. It may no longer be sufficient to explain a phenomenon in terms of generic learning or contact. Concepts such as perceptual narrowing should include ideas about where and how in the learning process this narrowing occurs. A major challenge ahead is the sheer number of knobs to be set in deep networks. Plasticity, for example, can be dialed up or down and applied to selected network layers, and specific face diets can be administered across multiple learning stages (in sequence or simultaneously). The list goes on. In all of the topics discussed, and others not discussed, theoretical ideas should specify the manipulations thought to be most critical. We should follow the counsel of Box (1976) to avoid worrying selectively and instead focus on what is most important. New tools succeed when they facilitate the discovery of things that we did not know or had not hypothesized. Testing these hypotheses will require new data and may suggest a reevaluation of existing data.

5. THE PATH FORWARD

In this review, we highlight fundamental advances in thinking brought about by deep learning approaches. These networks solve the inverse optics problem for face identification by untangling image, appearance, and identity over layers of neural-like processing. This demonstrates that robust face identification can be achieved with a representation that includes specific information about the face image(s) actually experienced. These representations retain information about appearance, perceived traits, expressions, and identity.

Direct-fit models posit that deep networks operate by placing new observations into the context of past experience. These models depend on overparameterized networks that create a high-dimensional space from real-world training data. Face representations housed within this space project onto units, thereby confounding stimulus features that (may) separate in the high-dimensional space. This raises questions about the transparency and interpretability of information gained by examining the response properties of network units. Deep networks can be studied at both the micro- and macroscale simultaneously and can be used to formulate hypotheses about the underlying neural code for faces. A key to understanding face representations is to reconcile the responses of neurons to the structure of the code in the high-dimensional space. This is a challenging problem best approached by combining psychological, neural, and computational methods.

The process of training a deep network is complex and layered. It draws on learning mechanisms aimed at objects and faces, visual categories of faces (e.g., race), and special familiar faces. Psychological and neural theory considers the many ways in which people and brains learn faces from real-world visual experience. DCNNs offer the potential to implement and test sophisticated hypotheses about how humans learn faces across the lifespan.

We should not lose sight of the fact that a compelling reason to study deep networks is that they actually work; that is, they perform nearly as well as humans on face recognition tasks that have stymied computational modelers for decades. This might qualify as a property of deep networks that is importantly right ( Box 1976 ). There is a difference, of course, between working and working like humans. Determining whether a deep network can work like humans, or could be made to do so by manipulating other properties of the network (e.g., architectures, training data, learning rules), is work that is just beginning.

SUMMARY POINTS

  • Face representations generated by DCNNs trained for identification retain information about the face (e.g., identity, demographics, attributes, traits, expression) and the image (e.g., viewpoint).
  • Deep learning face networks generate a surprisingly structured face representation from unstructured training with in-the-wild face images.
  • Individual output units from deep networks are unlikely to signal the presence of interpretable features.
  • Fundamental structural aspects of high-level visual codes for faces in deep networks replicate over a wide variety of network architectures.
  • Diverse learning mechanisms in DCNNs, applied simultaneously or in sequence, can be used to model human face perception across the lifespan.

FUTURE ISSUES

  • Large-scale systematic manipulations of training data (race, ethnicity, image variability) are needed to give insight into the role of experience in structuring face representations.
  • Fundamental challenges remain in understanding how to combine deep networks for face, object, and scene recognition in ways analogous to the human visual system.
  • Deep networks model the ventral visual stream at a generic level, arguably up to the level of the IT cortex. Future work should examine how downstream systems, such as face patches, could be connected into this system.
  • In rethinking the goals of face processing, we argue in this review that some longstanding assumptions about visual representations should be reconsidered. Future work should consider novel experimental questions and employ methods that do not rely on these assumptions.

ACKNOWLEDGMENTS

The authors are supported by funding provided by National Eye Institute grant R01EY029692-03 to A.J.O. and C.D.C.

DISCLOSURE STATEMENT

C.D.C. is an equity holder in Mukh Technologies, which may potentially benefit from research results.

1 This is the case in networks trained with the Softmax objective function.

LITERATURE CITED

  • Abadi M, Barham P, Chen J, Chen Z, Davis A, et al. 2016. Tensorflow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–83. Berkeley, CA: USENIX
  • Abudarham N, Shkiller L, Yovel G. 2019. Critical features for face recognition. Cognition 182:73–83
  • Abudarham N, Yovel G. 2020. Face recognition depends on specialized mechanisms tuned to view-invariant facial features: insights from deep neural networks optimized for face or object recognition. bioRxiv 2020.01.01.890277. https://doi.org/10.1101/2020.01.01.890277
  • Azevedo FA, Carvalho LR, Grinberg LT, Farfel JM, Ferretti RE, et al. 2009. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain. J. Comp. Neurol. 513(5):532–41
  • Barlow HB. 1972. Single units and sensation: a neuron doctrine for perceptual psychology? Perception 1(4):371–94
  • Bashivan P, Kar K, DiCarlo JJ. 2019. Neural population control via deep image synthesis. Science 364(6439):eaav9436
  • Best-Rowden L, Jain AK. 2018. Learning face image quality from human assessments. IEEE Trans. Inform. Forensics Secur. 13(12):3064–77
  • Blanz V, Vetter T. 1999. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–94. New York: ACM
  • Blauch NM, Behrmann M, Plaut DC. 2020a. Computational insights into human perceptual expertise for familiar and unfamiliar face recognition. Cognition 208:104341
  • Blauch NM, Behrmann M, Plaut DC. 2020b. Deep learning of shared perceptual representations for familiar and unfamiliar faces: reply to commentaries. Cognition 208:104484
  • Box GE. 1976. Science and statistics. J. Am. Stat. Assoc. 71(356):791–99
  • Box GEP. 1979. Robustness in the strategy of scientific model building. In Robustness in Statistics, ed. Launer RL, Wilkinson GN, pp. 201–36. Cambridge, MA: Academic Press
  • Bruce V, Young A. 1986. Understanding face recognition. Br. J. Psychol. 77(3):305–27
  • Burton AM, Bruce V, Hancock PJ. 1999. From pixels to people: a model of familiar face recognition. Cogn. Sci. 23(1):1–31
  • Cavazos JG, Noyes E, O’Toole AJ. 2019. Learning context and the other-race effect: strategies for improving face recognition. Vis. Res. 157:169–83
  • Cavazos JG, Phillips PJ, Castillo CD, O’Toole AJ. 2020. Accuracy comparison across face recognition algorithms: Where are we on measuring race bias? IEEE Trans. Biom. Behav. Identity Sci. 3(1):101–11
  • Chang L, Tsao DY. 2017. The code for facial identity in the primate brain. Cell 169(6):1013–28
  • Chen JC, Patel VM, Chellappa R. 2016. Unconstrained face verification using deep CNN features. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Piscataway, NJ: IEEE
  • Cichy RM, Kaiser D. 2019. Deep neural networks as scientific models. Trends Cogn. Sci. 23(4):305–17
  • Collins E, Behrmann M. 2020. Exemplar learning reveals the representational origins of expert category perception. PNAS 117(20):11167–77
  • Colón YI, Castillo CD, O’Toole AJ. 2021. Facial expression is retained in deep networks trained for face identification. J. Vis. 21(4):4
  • Cootes TF, Taylor CJ, Cooper DH, Graham J. 1995. Active shape models-their training and application. Comput. Vis. Image Underst. 61(1):38–59
  • Crosswhite N, Byrne J, Stauffer C, Parkhi O, Cao Q, Zisserman A. 2018. Template adaptation for face verification and identification. Image Vis. Comput. 79:35–48
  • Deng J, Guo J, Xue N, Zafeiriou S. 2019. Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–99. Piscataway, NJ: IEEE
  • Dhar P, Bansal A, Castillo CD, Gleason J, Phillips P, Chellappa R. 2020. How are attributes expressed in face DCNNs? In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 61–68. Piscataway, NJ: IEEE
  • DiCarlo JJ, Cox DD. 2007. Untangling invariant object recognition. Trends Cogn. Sci. 11(8):333–41
  • Dobs K, Kell AJ, Martinez J, Cohen M, Kanwisher N. 2020. Using task-optimized neural networks to understand why brains have specialized processing for faces. J. Vis. 20(11):660
  • Dowsett A, Sandford A, Burton AM. 2016. Face learning with multiple images leads to fast acquisition of familiarity for specific individuals. Q. J. Exp. Psychol. 69(1):1–10
  • El Khiyari H, Wechsler H. 2016. Face verification subject to varying (age, ethnicity, and gender) demographics using deep learning. J. Biom. Biostat. 7:323
  • Fausey CM, Jayaraman S, Smith LB. 2016. From faces to hands: changing visual input in the first two years. Cognition 152:101–7
  • Freiwald WA, Tsao DY. 2010. Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science 330(6005):845–51
  • Freiwald WA, Tsao DY, Livingstone MS. 2009. A face feature space in the macaque temporal lobe. Nat. Neurosci. 12(9):1187–96
  • Fukushima K. 1988. Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Netw. 1(2):119–30
  • Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, et al. 2014. Generative adversarial nets. In NIPS’14: Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 2672–80. New York: ACM
  • Goodman CS, Shatz CJ. 1993. Developmental mechanisms that generate precise patterns of neuronal connectivity. Cell 72:77–98
  • Grill-Spector K, Kushnir T, Edelman S, Avidan G, Itzchak Y, Malach R. 1999. Differential processing of objects under various viewing conditions in the human lateral occipital complex. Neuron 24(1):187–203
  • Grill-Spector K, Weiner KS. 2014. The functional architecture of the ventral temporal cortex and its role in categorization. Nat. Rev. Neurosci. 15(8):536–48
  • Grill-Spector K, Weiner KS, Gomez J, Stigliani A, Natu VS. 2018. The functional neuroanatomy of face perception: from brain measurements to deep neural networks. Interface Focus 8(4):20180013
  • Gross CG. 2002. Genealogy of the “grandmother cell”. Neuroscientist 8(5):512–18
  • Grother P, Ngan M, Hanaoka K. 2019. Face recognition vendor test (FRVT) part 3: demographic effects. Rep., Natl. Inst. Stand. Technol., US Dept. Commerce, Gaithersburg, MD
  • Hancock PJ, Bruce V, Burton AM. 2000. Recognition of unfamiliar faces. Trends Cogn. Sci. 4(9):330–37
  • Hasson U, Nastase SA, Goldstein A. 2020. Direct fit to nature: an evolutionary perspective on biological and artificial neural networks. Neuron 105(3):416–34
  • Hayward WG, Favelle SK, Oxner M, Chu MH, Lam SM. 2017. The other-race effect in face learning: using naturalistic images to investigate face ethnicity effects in a learning paradigm. Q. J. Exp. Psychol. 70(5):890–96
  • Hesse JK, Tsao DY. 2020. The macaque face patch system: a turtle’s underbelly for the brain. Nat. Rev. Neurosci. 21(12):695–716
  • Hill MQ, Parde CJ, Castillo CD, Colon YI, Ranjan R, et al. 2019. Deep convolutional neural networks in the face of caricature. Nat. Mach. Intel. 1(11):522–29
  • Hong H, Yamins DL, Majaj NJ, DiCarlo JJ. 2016. Explicit information for category-orthogonal object properties increases along the ventral stream. Nat. Neurosci. 19(4):613–22
  • Hornik K, Stinchcombe M, White H. 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2(5):359–66
  • Huang GB, Lee H, Learned-Miller E. 2012. Learning hierarchical representations for face verification with convolutional deep belief networks. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2518–25. Piscataway, NJ: IEEE
  • Huang GB, Mattar M, Berg T, Learned-Miller E. 2008. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Paper presented at the Workshop on Faces in “Real-Life” Images: Detection, Alignment, and Recognition, Marseille, France
  • Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A. 2019. Adversarial examples are not bugs, they are features. arXiv:1905.02175 [stat.ML]
  • Issa EB, DiCarlo JJ. 2012. Precedence of the eye region in neural processing of faces. J. Neurosci. 32(47):16666–82
  • Jacquet M, Champod C. 2020. Automated face recognition in forensic science: review and perspectives. Forensic Sci. Int. 307:110124
  • Jayaraman S, Fausey CM, Smith LB. 2015. The faces in infant-perspective scenes change over the first year of life. PLOS ONE 10(5):e0123780
  • Jayaraman S, Smith LB. 2019. Faces in early visual environments are persistent not just frequent. Vis. Res. 157:213–21
  • Jenkins R, White D, Van Montfort X, Burton AM. 2011. Variability in photos of the same face. Cognition 121(3):313–23
  • Kandel ER, Schwartz JH, Jessell TM, Siegelbaum S, Hudspeth AJ, Mack S, eds. 2000. Principles of Neural Science, Vol. 4. New York: McGraw-Hill
  • Kay KN, Weiner KS, Grill-Spector K. 2015. Attention reduces spatial uncertainty in human ventral temporal cortex. Curr. Biol. 25(5):595–600
  • Kelly DJ, Quinn PC, Slater AM, Lee K, Ge L, Pascalis O. 2007. The other-race effect develops during infancy: evidence of perceptual narrowing. Psychol. Sci. 18(12):1084–89
  • Kelly DJ, Quinn PC, Slater AM, Lee K, Gibson A, et al. 2005. Three-month-olds, but not newborns, prefer own-race faces. Dev. Sci. 8(6):F31–36
  • Kietzmann TC, Swisher JD, König P, Tong F. 2012. Prevalence of selectivity for mirror-symmetric views of faces in the ventral and dorsal visual pathways. J. Neurosci. 32(34):11763–72
  • Krishnapriya KS, Albiero V, Vangara K, King MC, Bowyer KW. 2020. Issues related to face recognition accuracy varying based on race and skin tone. IEEE Trans. Technol. Soc. 1(1):8–20
  • Krishnapriya K, Vangara K, King MC, Albiero V, Bowyer K. 2019. Characterizing the variability in face recognition accuracy relative to race. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vol. 1, pp. 2278–85. Piscataway, NJ: IEEE
  • Krizhevsky A, Sutskever I, Hinton GE. 2012. Imagenet classification with deep convolutional neural networks. In NIPS’12: Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 1097–105. New York: ACM
  • Kumar N, Berg AC, Belhumeur PN, Nayar SK. 2009. Attribute and simile classifiers for face verification. In Proceedings of the 2009 IEEE International Conference on Computer Vision, pp. 365–72. Piscataway, NJ: IEEE
  • Laurence S, Zhou X, Mondloch CJ. 2016. The flip side of the other-race coin: They all look different to me. Br. J. Psychol. 107(2):374–88
  • LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521(7553):436–44
  • Levin DT. 2000. Race as a visual feature: using visual search and perceptual discrimination tasks to understand face categories and the cross-race recognition deficit. J. Exp. Psychol. Gen. 129(4):559–74
  • Lewenberg Y, Bachrach Y, Shankar S, Criminisi A. 2016. Predicting personal traits from facial images using convolutional neural networks augmented with facial landmark information. arXiv:1605.09062 [cs.CV]
  • Li Y, Gao F, Ou Z, Sun J. 2018. Angular softmax loss for end-to-end speaker verification. In Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 190–94. Baixas, France: ISCA
  • Liu Z, Luo P, Wang X, Tang X. 2015. Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 3730–38. Piscataway, NJ: IEEE
  • Lundqvist D, Flykt A, Ohman A. 1998. Karolinska directed emotional faces. Database of standardized facial images, Psychol. Sect., Dept. Clin. Neurosci., Karolinska Hosp., Solna, Swed. https://www.kdef.se/
  • Malpass RS, Kravitz J. 1969. Recognition for faces of own and other race. J. Personal. Soc. Psychol. 13(4):330–34
  • Matthews CM, Mondloch CJ. 2018. Improving identity matching of newly encountered faces: effects of multi-image training. J. Appl. Res. Mem. Cogn. 7(2):280–90
  • Maurer D, Le Grand R, Mondloch CJ. 2002. The many faces of configural processing. Trends Cogn. Sci. 6(6):255–60
  • Maze B, Adams J, Duncan JA, Kalka N, Miller T, et al. 2018. IARPA Janus Benchmark—C: face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), pp. 158–65. Piscataway, NJ: IEEE
  • McCurrie M, Beletti F, Parzianello L, Westendorp A, Anthony S, Scheirer WJ. 2017. Predicting first impressions with deep learning. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 518–25. Piscataway, NJ: IEEE
  • Murphy J, Ipser A, Gaigg SB, Cook R. 2015. Exemplar variance supports robust learning of facial identity. J. Exp. Psychol. Hum. Percept. Perform. 41(3):577–81
  • Natu VS, Barnett MA, Hartley J, Gomez J, Stigliani A, Grill-Spector K. 2016. Development of neural sensitivity to face identity correlates with perceptual discriminability. J. Neurosci. 36(42):10893–907
  • Natu VS, Jiang F, Narvekar A, Keshvari S, Blanz V, O’Toole AJ. 2010. Dissociable neural patterns of facial identity across changes in viewpoint. J. Cogn. Neurosci. 22(7):1570–82
  • Nordt M, Gomez J, Natu V, Jeska B, Barnett M, Grill-Spector K. 2019. Learning to read increases the informativeness of distributed ventral temporal responses. Cereb. Cortex 29(7):3124–39
  • Nordt M, Gomez J, Natu VS, Rezai AA, Finzi D, Grill-Spector K. 2020. Selectivity to limbs in ventral temporal cortex decreases during childhood as selectivity to faces and words increases. J. Vis. 20(11):152
  • Noyes E, Jenkins R. 2019. Deliberate disguise in face identification. J. Exp. Psychol. Appl. 25(2):280–90
  • Noyes E, Parde C, Colon Y, Hill M, Castillo C, et al. 2021. Seeing through disguise: getting to know you with a deep convolutional neural network. Cognition. In press
  • Noyes E, Phillips P, O’Toole A. 2017. What is a super-recogniser? In Face Processing: Systems, Disorders and Cultural Differences, ed. Bindemann M, pp. 173–201. Hauppage, NY: Nova Sci. Publ.
  • Oosterhof NN, Todorov A. 2008. The functional basis of face evaluation. PNAS 105(32):11087–92
  • O’Toole AJ, Castillo CD, Parde CJ, Hill MQ, Chellappa R. 2018. Face space representations in deep convolutional neural networks. Trends Cogn. Sci. 22(9):794–809
  • O’Toole AJ, Phillips PJ, Jiang F, Ayyad J, Pénard N, Abdi H. 2007. Face recognition algorithms surpass humans matching faces over changes in illumination. IEEE Trans. Pattern Anal. Mach. Intel. 29(9):1642–46
  • Parde CJ, Castillo C, Hill MQ, Colon YI, Sankaranarayanan S, et al. 2017. Face and image representation in deep CNN features. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 673–80. Piscataway, NJ: IEEE
  • Parde CJ, Colón YI, Hill MQ, Castillo CD, Dhar P, O’Toole AJ. 2021. Face recognition by humans and machines: closing the gap between single-unit and neural population codes—insights from deep learning in face recognition. J. Vis. In press
  • Parde CJ, Hu Y, Castillo C, Sankaranarayanan S, O’Toole AJ. 2019. Social trait information in deep convolutional neural networks trained for face identification. Cogn. Sci. 43(6):e12729
  • Parkhi OM, Vedaldi A, Zisserman A. 2015. Deep face recognition. Rep., Vis. Geom. Group, Dept. Eng. Sci., Univ. Oxford, UK
  • Paszke A, Gross S, Massa F, Lerer A, Bradbury J, et al. 2019. Pytorch: an imperative style, high-performance deep learning library. In NeurIPS 2019: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8024–35. New York: ACM
  • Pezdek K, Blandon-Gitlin I, Moore C. 2003. Children’s face recognition memory: more evidence for the cross-race effect. J. Appl. Psychol. 88(4):760–63
  • Phillips PJ, Beveridge JR, Draper BA, Givens G, O’Toole AJ, et al. 2011. An introduction to the good, the bad, & the ugly face recognition challenge problem. In Proceedings of the 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 346–53. Piscataway, NJ: IEEE
  • Phillips PJ, O’Toole AJ. 2014. Comparison of human and computer performance across face recognition experiments. Image Vis. Comput. 32(1):74–85
  • Phillips PJ, Yates AN, Hu Y, Hahn CA, Noyes E, et al. 2018. Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms. PNAS 115(24):6171–76
  • Poggio T, Banburski A, Liao Q. 2020. Theoretical issues in deep networks. PNAS 117(48):30039–45
  • Ponce CR, Xiao W, Schade PF, Hartmann TS, Kreiman G, Livingstone MS. 2019. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences. Cell 177(4):999–1009
  • Ranjan R, Bansal A, Zheng J, Xu H, Gleason J, et al. 2019. A fast and accurate system for face detection, identification, and verification. IEEE Trans. Biom. Behav. Identity Sci. 1(2):82–96
  • Ranjan R, Castillo CD, Chellappa R. 2017. L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507 [cs.CV]
  • Ranjan R, Sankaranarayanan S, Castillo CD, Chellappa R. 2017c. An all-in-one convolutional neural network for face analysis. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 17–24. Piscataway, NJ: IEEE
  • Richards BA, Lillicrap TP, Beaudoin P, Bengio Y, Bogacz R, et al. 2019. A deep learning framework for neuroscience. Nat. Neurosci. 22(11):1761–70
  • Ritchie KL, Burton AM. 2017. Learning faces from variability. Q. J. Exp. Psychol. 70(5):897–905
  • Rosch E, Mervis CB, Gray WD, Johnson DM, Boyes-Braem P. 1976. Basic objects in natural categories. Cogn. Psychol. 8(3):382–439
  • Russakovsky O, Deng J, Su H, Krause J, Satheesh S, et al. 2015. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115(3):211–52
  • Russell R, Duchaine B, Nakayama K. 2009. Super-recognizers: people with extraordinary face recognition ability. Psychon. Bull. Rev. 16(2):252–57
  • Sangrigoli S, Pallier C, Argenti AM, Ventureyra V, de Schonen S. 2005. Reversibility of the other-race effect in face recognition during childhood. Psychol. Sci. 16(6):440–44
  • Sankaranarayanan S, Alavi A, Castillo C, Chellappa R. 2016. Triplet probabilistic embedding for face verification and clustering. arXiv:1604.05417 [cs.CV]
  • Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, et al. 2018. Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv 407007. https://doi.org/10.1101/407007
  • Schroff F, Kalenichenko D, Philbin J. 2015. Facenet: a unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–23. Piscataway, NJ: IEEE
  • Scott LS, Monesson A. 2010. Experience-dependent neural specialization during infancy. Neuropsychologia 48(6):1857–61
  • Sengupta S, Chen JC, Castillo C, Patel VM, Chellappa R, Jacobs DW. 2016. Frontal to profile face verification in the wild. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Piscataway, NJ: IEEE
  • Sim T, Baker S, Bsat M. 2002. The CMU pose, illumination, and expression (PIE) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face Gesture Recognition, pp. 53–58. Piscataway, NJ: IEEE
  • Simonyan K, Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs.CV]
  • Smith LB, Jayaraman S, Clerkin E, Yu C. 2018. The developing infant creates a curriculum for statistical learning. Trends Cogn. Sci. 22(4):325–36
  • Smith LB, Slone LK. 2017. A developmental approach to machine learning? Front. Psychol. 8:2124
  • Song A, Linjie L, Atalla C, Cottrell G. 2017. Learning to see people like people: predicting social impressions of faces. Cogn. Sci. 2017:1096–101
  • Storrs KR, Kietzmann TC, Walther A, Mehrer J, Kriegeskorte N. 2020. Diverse deep neural networks all predict human IT well, after training and fitting. bioRxiv 2020.05.07.082743. https://doi.org/10.1101/2020.05.07.082743
  • Su H, Maji S, Kalogerakis E, Learned-Miller E. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 945–53. Piscataway, NJ: IEEE
  • Sugden NA, Moulson MC. 2017. Hey baby, what’s “up”? One- and 3-month-olds experience faces primarily upright but non-upright faces offer the best views. Q. J. Exp. Psychol. 70(5):959–69
  • Taigman Y, Yang M, Ranzato M, Wolf L. 2014. Deepface: closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–8. Piscataway, NJ: IEEE
  • Tanaka JW, Pierce LJ. 2009. The neural plasticity of other-race face recognition. Cogn. Affect. Behav. Neurosci. 9(1):122–31
  • Terhörst P, Fährmann D, Damer N, Kirchbuchner F, Kuijper A. 2020. Beyond identity: What information is stored in biometric face templates? arXiv:2009.09918 [cs.CV]
  • Thorpe S, Fize D, Marlot C. 1996. Speed of processing in the human visual system. Nature 381(6582):520–22
  • Todorov A. 2017. Face Value: The Irresistible Influence of First Impressions. Princeton, NJ: Princeton Univ. Press
  • Todorov A, Mandisodza AN, Goren A, Hall CC. 2005. Inferences of competence from faces predict election outcomes. Science 308(5728):1623–26
  • Valentine T. 1991. A unified account of the effects of distinctiveness, inversion, and race in face recognition. Q. J. Exp. Psychol. A 43(2):161–204
  • van der Maaten L, Weinberger K. 2012. Stochastic triplet embedding. In Proceedings of the 2012 IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6. Piscataway, NJ: IEEE
  • Walker M, Vetter T. 2009. Portraits made to measure: manipulating social judgments about individuals with a statistical face model. J. Vis. 9(11):12
  • Wang F, Liu W, Liu H, Cheng J. 2018. Additive margin softmax for face verification. IEEE Signal Process. Lett. 25:926–30
  • Wang F, Xiang X, Cheng J, Yuille AL. 2017. Normface: L2 hypersphere embedding for face verification. In MM ’17: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1041–49. New York: ACM
  • Xie C, Tan M, Gong B, Wang J, Yuille AL, Le QV. 2020. Adversarial examples improve image recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–28. Piscataway, NJ: IEEE
  • Yamins DL, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. 2014. Performance-optimized hierarchical models predict neural responses in higher visual cortex. PNAS 111(23):8619–24
  • Yi D, Lei Z, Liao S, Li SZ. 2014. Learning face representation from scratch. arXiv:1411.7923 [cs.CV]
  • Yoshida H, Smith LB. 2008. What’s in view for toddlers? Using a head camera to study visual experience. Infancy 13(3):229–48
  • Young AW, Burton AM. 2020. Insights from computational models of face recognition: a reply to Blauch, Behrmann and Plaut. Cognition 208:104422
  • Yovel G, Abudarham N. 2020. From concepts to percepts in human and machine face recognition: a reply to Blauch, Behrmann & Plaut. Cognition 208:104424
  • Yovel G, Halsband K, Pelleg M, Farkash N, Gal B, Goshen-Gottstein Y. 2012. Can massive but passive exposure to faces contribute to face recognition abilities? J. Exp. Psychol. Hum. Percept. Perform. 38(2):285–89
  • Yovel G, O’Toole AJ. 2016. Recognizing people in motion. Trends Cogn. Sci. 20(5):383–95
  • Yuan L, Xiao W, Kreiman G, Tay FE, Feng J, Livingstone MS. 2020. Adversarial images for the primate brain. arXiv:2011.05623 [q-bio.NC]
  • Yue X, Cassidy BS, Devaney KJ, Holt DJ, Tootell RB. 2010. Lower-level stimulus features strongly influence responses in the fusiform face area. Cereb. Cortex 21(1):35–47

Face Recognition Systems: A Survey


1. Introduction

  • We first introduced face recognition as a biometric technique.
  • We presented the state of the art of the existing face recognition techniques classified into three approaches: local, holistic, and hybrid.
  • The surveyed approaches were summarized and compared under different conditions.
  • We presented the most popular face databases used to test these approaches.
  • We highlighted some new promising research directions.

2. Face Recognition Systems Survey

2.1. Essential Steps of Face Recognition Systems

  • Face Detection : The face recognition system begins with the localization of the human faces in a particular image. The purpose of this step is to determine whether or not the input image contains human faces. Variations of illumination and facial expression can prevent proper face detection. In order to make the subsequent recognition stages easier to design and more robust, pre-processing steps are performed. Many techniques are used to detect and locate the human face image, for example, the Viola–Jones detector [ 24 , 25 ], histogram of oriented gradients (HOG) [ 13 , 26 ], and principal component analysis (PCA) [ 27 , 28 ]. The face detection step can also be used for video and image classification, object detection [ 29 ], region-of-interest detection [ 30 ], and so on.
  • Feature Extraction : The main function of this step is to extract the features of the face images detected in the detection step. This step represents a face with a set of features vector called a “signature” that describes the prominent features of the face image such as mouth, nose, and eyes with their geometry distribution [ 31 , 32 ]. Each face is characterized by its structure, size, and shape, which allow it to be identified. Several techniques involve extracting the shape of the mouth, eyes, or nose to identify the face using the size and distance [ 3 ]. HOG [ 33 ], Eigenface [ 34 ], independent component analysis (ICA), linear discriminant analysis (LDA) [ 27 , 35 ], scale-invariant feature transform (SIFT) [ 23 ], gabor filter, local phase quantization (LPQ) [ 36 ], Haar wavelets, Fourier transforms [ 31 ], and local binary pattern (LBP) [ 3 , 10 ] techniques are widely used to extract the face features.
  • Face Recognition : This step takes the features extracted in the feature extraction step and compares them with the known faces stored in a specific database. There are two general applications of face recognition: one is called identification and the other verification. During identification, a test face is compared with a set of faces with the aim of finding the most likely match. During verification, a test face is compared with a known face in the database in order to make an acceptance or rejection decision [ 7 , 19 ]. Correlation filters (CFs) [ 18 , 37 , 38 ], convolutional neural networks (CNNs) [ 39 ], and also k-nearest neighbors (k-NN) [ 40 ] are known to effectively address this task. A minimal end-to-end sketch of these three steps is given after this list.
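As a rough illustration of how these three steps fit together, the following Python sketch uses OpenCV's Haar-cascade (Viola–Jones) detector for the detection step, a plain grayscale histogram as a stand-in "signature" for the feature extraction step, and nearest-neighbour matching against a small gallery for the recognition step. The image paths, the gallery, and the helper names are illustrative assumptions, not part of any surveyed system.

```python
# Minimal sketch of the detection -> feature extraction -> recognition pipeline.
# Assumes the opencv-python package (which ships the Haar cascade files) and NumPy.
import cv2
import numpy as np

def detect_faces(image_gray):
    """Face detection step: locate faces with the Viola-Jones (Haar cascade) detector."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    return cascade.detectMultiScale(image_gray, scaleFactor=1.1, minNeighbors=5)

def signature(face_gray):
    """Feature extraction step: a crude grayscale histogram standing in for a real descriptor."""
    face = cv2.resize(face_gray, (100, 100))
    hist = cv2.calcHist([face], [0], None, [64], [0, 256]).flatten()
    return hist / (hist.sum() + 1e-8)

def identify(probe_sig, gallery):
    """Recognition step: nearest neighbour in the gallery (identification)."""
    return min(gallery, key=lambda name: np.linalg.norm(probe_sig - gallery[name]))

if __name__ == "__main__":
    img = cv2.imread("probe.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical probe image
    # Hypothetical, pre-cropped gallery faces.
    gallery = {"alice": signature(cv2.imread("alice.jpg", cv2.IMREAD_GRAYSCALE)),
               "bob": signature(cv2.imread("bob.jpg", cv2.IMREAD_GRAYSCALE))}
    for (x, y, w, h) in detect_faces(img):
        print("best match:", identify(signature(img[y:y + h, x:x + w]), gallery))
```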

2.2. Classification of Face Recognition Systems

3. Local Approaches

3.1. Local Appearance-Based Techniques

  • Local binary pattern (LBP) and its variants: LBP is a general texture technique used to extract features from any object [ 16 ]. It has been widely used in many applications such as face recognition [ 3 ], facial expression recognition, texture segmentation, and texture classification. The LBP technique first divides the facial image into spatial arrays. Next, within each array square, a 3 × 3 pixel matrix (p1, …, p8) is mapped across the square. The pixels of this matrix are thresholded with the value of the center pixel (p0) (i.e., the intensity value of the center pixel i(p0) is used as a reference for thresholding) to produce a binary code. If a neighbor pixel’s value is lower than the center pixel value, it is given a zero; otherwise, it is given a one. The binary code contains information about the local texture. Finally, for each array square, a histogram of these codes is built, and the histograms are concatenated to form the feature vector. The LBP is defined over a 3 × 3 matrix as \( \mathrm{LBP} = \sum_{p=1}^{8} 2^{p}\, s(i_p - i_0), \quad s(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases} \) (1), where \( i_0 \) and \( i_p \) are the intensity values of the center pixel and the neighborhood pixels, respectively. Figure 3 illustrates the procedure of the LBP technique; a code sketch of this descriptor is given after this list. Khoi et al. [ 20 ] propose a fast face recognition system based on LBP, the pyramid of local binary patterns (PLBP), and the rotation-invariant local binary pattern (RI-LBP). Xi et al. [ 15 ] have introduced a new unsupervised deep-learning-based technique, called local binary pattern network (LBPNet), to extract hierarchical representations of data. The LBPNet maintains the same topology as the convolutional neural network (CNN). Experimental results obtained on public benchmarks (i.e., LFW and FERET) have shown that LBPNet is comparable to other unsupervised techniques. Laure et al. [ 40 ] have implemented a method that helps to solve face recognition issues with large variations of parameters such as expression, illumination, and pose. This method is based on two techniques: LBP and k-NN. Owing to its invariance to the rotation of the target image, LBP has become one of the important techniques used for face recognition. Bonnen et al. [ 42 ] proposed a variant of the LBP technique named “multiscale local binary pattern (MLBP)” for feature extraction. Another LBP extension is the local ternary pattern (LTP) technique [ 43 ], which is less sensitive to noise than the original LBP technique; it uses three values rather than two to encode the differences between the neighboring pixels and the central pixel. Hussain et al. [ 36 ] developed a local quantized pattern (LQP) technique for face representation. LQP is a generalization of local pattern features and is intrinsically robust to illumination conditions. The LQP features use a disk layout to sample pixels from the local neighborhood and obtain a pair of binary codes using ternary split coding; these codes are quantized, each one using a separately learned codebook.
  • Histogram of oriented gradients (HOG) [ 44 ]: HOG is one of the best-known descriptors for shape and edge description. The HOG technique describes the face shape using the distribution of edge directions or light-intensity gradients. The technique proceeds by dividing the whole face image into cells (small regions or areas); a histogram of pixel edge directions or gradient directions is generated for each cell; and, finally, the histograms of all the cells are combined to extract the features of the face image. The feature vector computation by the HOG descriptor proceeds as follows [ 10 , 13 , 26 , 45 ]: firstly, divide the local image into regions called cells, and then calculate the amplitude of the first-order gradients of each cell in both the horizontal and vertical directions. The most common method is to apply a 1D mask [−1 0 1]: \( G_x(x, y) = I(x+1, y) - I(x-1, y) \) (2), \( G_y(x, y) = I(x, y+1) - I(x, y-1) \) (3), where \( I(x, y) \) is the pixel value at point \( (x, y) \), and \( G_x(x, y) \) and \( G_y(x, y) \) denote the horizontal and vertical gradient amplitudes, respectively. The magnitude of the gradient and the orientation of each pixel \( (x, y) \) are computed as \( G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2} \) (4) and \( \theta(x, y) = \tan^{-1}\!\big( G_y(x, y) / G_x(x, y) \big) \) (5). The magnitude of the gradient and the orientation of each pixel in the cell are voted into nine bins with tri-linear interpolation. The histograms of each cell are generated from the pixel gradient directions and, finally, the histograms of all the cells are combined to extract the features of the face image. Karaaba et al. [ 44 ] proposed a combination of different histograms of oriented gradients (HOG) to build a robust face recognition system; this technique is named “multi-HOG”. The authors create a vector of distances between the target and the reference face images for identification. Arigbabu et al. [ 46 ] proposed a novel face recognition system based on the Laplacian filter and the pyramid histogram of gradients (PHOG) descriptor. In addition, to investigate the face recognition problem, a support vector machine (SVM) is used with different kernel functions.
  • Correlation filters: Face recognition systems based on correlation filters (CFs) have given good results in terms of robustness, location accuracy, efficiency, and discrimination. In the field of facial recognition, correlation techniques have attracted great interest since the first use of an optical correlator [ 47 ]. These techniques provide the following advantages: high discrimination ability, robustness to noise, shift-invariance, and inherent parallelism. On the basis of these advantages, many optoelectronic hybrid solutions based on correlation filters have been introduced, such as the joint transform correlator (JTC) [ 48 ] and the VanderLugt correlator (VLC) [ 47 ]. The purpose of these techniques is to calculate the degree of similarity between target and reference images; the decision is taken by detecting a correlation peak. Both techniques (VLC and JTC) are based on the “4f” optical configuration [ 37 ], which is created by two convergent lenses ( Figure 4 ). The face image F is processed by the fast Fourier transform (FFT) using the first lens in the Fourier plane S F . In this Fourier plane, a specific filter P is applied (for example, the phase-only filter (POF) [ 2 ]) using optoelectronic interfaces. Finally, to obtain the filtered face image F ′ (the correlation plane), the inverse FFT (IFFT) is performed by the second lens in the output plane. For example, the VLC technique is implemented by two cascaded Fourier transform structures realized by two lenses [ 4 ], as presented in Figure 5 . The VLC technique proceeds as follows: firstly, a 2D FFT is applied to the target image to get a target spectrum S. The target spectrum is then multiplied by the filter obtained from the 2D FFT of a reference image, and this product is placed in the Fourier plane. Next, the inverse FFT of this product gives the correlation result, recorded in the correlation plane. The correlation result, described by the peak intensity, is used to determine the degree of similarity between the target and reference images: \( C = \mathrm{FFT}^{-1}\{ S^{*} \circ \mathrm{POF} \} \) (6), where \( \mathrm{FFT}^{-1} \) stands for the inverse fast Fourier transform, \( {}^{*} \) represents the conjugate operation, and \( \circ \) denotes element-wise array multiplication. To enhance the matching process, Horner and Gianino [ 49 ] proposed the phase-only filter (POF), which produces sharp correlation peaks with enhanced discrimination capability. The POF is an optimized filter defined as \( H_{\mathrm{POF}}(u, v) = \frac{S^{*}(u, v)}{|S(u, v)|} \) (7), where \( S^{*}(u, v) \) is the complex conjugate of the 2D FFT of the reference image. To evaluate the decision, the peak-to-correlation energy (PCE) is defined as the energy of the correlation peak intensity normalized by the overall energy of the correlation plane: \( \mathrm{PCE} = \frac{\sum_{i,j}^{N} E_{\mathrm{peak}}(i, j)}{\sum_{i,j}^{M} E_{\mathrm{correlation\text{-}plane}}(i, j)} \) (8), where \( i, j \) are the coefficient coordinates; \( M \) and \( N \) are the size of the correlation plane and the size of the peak correlation spot, respectively; \( E_{\mathrm{peak}} \) is the energy in the correlation peak; and \( E_{\mathrm{correlation\text{-}plane}} \) is the overall energy of the correlation plane. Correlation techniques are widely applied in recognition and identification applications [ 4 , 37 , 50 , 51 , 52 , 53 ]; a numerical sketch of the POF correlation and the PCE is given after this list.
For example, in the work of [ 4 ], the authors presented the efficiency of the VLC technique based on the “4f” configuration for identification using an Nvidia GeForce 8400 GS GPU; the POF filter is used for the decision. Another important work in this area of research is presented by Leonard et al. [ 50 ], which showed the good performance and simplicity of correlation filters for face recognition. In addition, many specific filters such as POF, BPOF, Ad, and IF are compared in order to select the best filter based on its sensitivity to rotation, scale, and noise. Napoléon et al. [ 3 ] introduced a novel system for identification and verification based on an optimized 3D modeling under different illumination conditions, which allows reconstructing faces in different poses. In particular, to deform the synthetic model, an active shape model detecting a set of key points on the face is proposed ( Figure 6 ). The VanderLugt correlator is used to perform the identification, and the LBP descriptor is used to optimize the performance of the correlation technique under different illumination conditions. The experiments are performed on the Pointing Head Pose Image Database (PHPID) with elevation angles ranging from −30° to +30°.
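The basic LBP descriptor of Equation (1) can be sketched as follows. This is a didactic NumPy re-implementation (in practice scikit-image's `local_binary_pattern` would usually be preferred), and the 8 × 8 cell grid is an arbitrary assumption.

```python
# Didactic sketch of the basic LBP descriptor of Equation (1): threshold the 8 neighbours
# of each pixel against the centre pixel, then build per-cell histograms of the codes.
import numpy as np

def lbp_image(gray):
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                                    # centre pixels i0
    # 8 neighbours, clockwise from the top-left
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for p, (dy, dx) in enumerate(shifts):
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += (neighbour >= c).astype(np.int32) << p  # bit p set when s(i_p - i_0) = 1
    return codes

def lbp_descriptor(gray, grid=(8, 8)):
    """Concatenate per-cell histograms of LBP codes into one feature vector."""
    codes = lbp_image(gray)
    feats = []
    for rows in np.array_split(codes, grid[0], axis=0):
        for cell in np.array_split(rows, grid[1], axis=1):
            hist, _ = np.histogram(cell, bins=256, range=(0, 256))
            feats.append(hist / (hist.sum() + 1e-8))
    return np.concatenate(feats)
```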
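The VLC/POF correlation of Equations (6)–(8) can likewise be simulated numerically. The sketch below is a plain NumPy stand-in for the optical "4f" set-up, and the 11 × 11 window used to gather the peak energy for the PCE is an assumed value.

```python
# Numerical sketch of VLC correlation with a phase-only filter (POF), Equations (6)-(8).
import numpy as np

def pof_correlation(target, reference):
    """Correlate a target image with the POF built from a reference image."""
    S = np.fft.fft2(target)                      # target spectrum
    R = np.fft.fft2(reference)                   # reference spectrum
    pof = np.conj(R) / (np.abs(R) + 1e-12)       # POF filter of Equation (7)
    corr = np.fft.ifft2(S * pof)                 # C = FFT^-1{ spectrum o POF }, Equation (6)
    return np.abs(np.fft.fftshift(corr)) ** 2    # correlation-plane energy

def pce(corr_energy, peak_window=11):
    """Peak-to-correlation energy (Equation (8)): peak-window energy over total energy."""
    iy, ix = np.unravel_index(np.argmax(corr_energy), corr_energy.shape)
    half = peak_window // 2
    peak = corr_energy[max(iy - half, 0):iy + half + 1,
                       max(ix - half, 0):ix + half + 1]
    return peak.sum() / corr_energy.sum()
```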

3.2. Key-Points-Based Techniques

  • Scale-invariant feature transform (SIFT) [ 56 , 57 ]: SIFT is an algorithm used to detect and describe the local features of an image. It is widely used to link two images through their local descriptors, which contain the information needed to match them. The main idea of the SIFT descriptor is to convert the image into a representation composed of points of interest that carry the characteristic information of the face image. SIFT is invariant to scale and rotation. It is commonly used and fast enough for real-time applications, although the matching of key-points remains one of its more time-consuming stages. The algorithm proceeds in four steps: (1) detection of extrema in scale-space, (2) localization of characteristic points, (3) orientation assignment, and (4) computation of the key-point descriptor. A framework to detect key-points based on the SIFT descriptor was proposed by Lenc et al. [ 56 ], who use the SIFT technique in combination with a Kepenekci approach for face recognition. A matching sketch based on such key-point descriptors is given after this list.
  • Speeded-up robust features (SURF) [ 29 , 57 ]: the SURF technique is inspired by SIFT, but uses wavelets and an approximation of the Hessian determinant to achieve better performance [ 29 ]. SURF is a detector and descriptor that claims to achieve the same, or even better, results in terms of repeatability, distinction, and robustness compared with the SIFT descriptor. The main advantage of SURF is the execution time, which is less than that used by the SIFT descriptor. Besides, the SIFT descriptor is more adapted to describe faces affected by illumination conditions, scaling, translation, and rotation [ 57 ]. To detect feature points, SURF seeks to find the maximum of an approximation of the Hessian matrix using integral images to dramatically reduce the processing computational time. Figure 7 shows an example of SURF descriptor for face recognition using AR face datasets [ 58 ].
  • Binary robust independent elementary features (BRIEF) [ 30 , 57 ]: BRIEF is a binary descriptor that is simple and fast to compute. It is based on differences between pixel intensities and belongs to the same family of binary descriptors as binary robust invariant scalable keypoints (BRISK) and fast retina keypoint (FREAK). To reduce noise, the BRIEF descriptor first smoothes the image patches; the differences between pixel intensities are then used to build the descriptor. BRIEF offers competitive performance and accuracy in pattern recognition tasks.
  • Fast retina keypoint (FREAK) [ 57 , 59 ]: the FREAK descriptor proposed by Alahi et al. [ 59 ] uses a circular, retina-like sampling grid. This descriptor uses 43 sampling fields based on retinal receptive fields, as shown in Figure 8 . The size of the receptive fields decreases with their distance from the patch center, and roughly a thousand potential pairs of fields are considered. Each field is smoothed with a Gaussian function. Finally, the binary descriptor is obtained by thresholding, keeping the sign of the difference between the fields of each selected pair.
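As an illustration of key-point-based matching, the sketch below uses OpenCV's ORB detector/descriptor (a fast binary descriptor in the BRIEF family) with Hamming-distance matching and a Lowe-style ratio test; `cv2.SIFT_create()` could be substituted for a SIFT-based variant. The image paths and the 0.75 ratio are assumptions.

```python
# Sketch of key-point-based matching between two face images using a binary descriptor.
import cv2

def match_faces(path_a, path_b, ratio=0.75):
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=500)
    kp_a, des_a = orb.detectAndCompute(img_a, None)   # detect key-points and describe them
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)          # Hamming distance for binary codes
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = []
    for pair in matches:
        # Ratio test: keep a match only if it is clearly better than the runner-up.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good)                                    # higher = more similar faces

# score = match_faces("face1.jpg", "face2.jpg")   # hypothetical image paths
```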

3.3. Summary of Local Approaches

4. Holistic Approach

4.1. Linear Techniques

  • Eigenfaces [ 34 ] and principal component analysis (PCA) [ 27 , 62 ]: Eigenfaces is one of the popular holistic approaches used to extract feature points from the face image. This approach is based on the principal component analysis (PCA) technique; the principal components created by PCA are used as Eigenfaces or face templates. PCA transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called “principal components”. The purpose of PCA is to reduce the large dimensionality of the data space (observed variables) to the smaller intrinsic dimensionality of the feature space (independent variables) needed to describe the data economically. Figure 9 shows how the face can be represented by a small number of features. PCA computes the Eigenvectors of the covariance matrix and projects the original data onto a lower-dimensional feature space defined by the Eigenvectors with large Eigenvalues. PCA has been used in face representation and recognition, where the computed Eigenvectors are referred to as Eigenfaces (as shown in Figure 10 ). An image may also be considered as a vector of dimension M × N, so that a typical image of size 4 × 4 becomes a vector of dimension 16. Let the training set of images be \( \{ X_1, X_2, X_3, \ldots, X_N \} \). The average face of the set is defined by \( \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \) (9). The covariance matrix, which represents the scatter of all feature vectors around the average vector, is defined by \( Q = \frac{1}{N} \sum_{i=1}^{N} (\bar{X} - X_i)(\bar{X} - X_i)^{T} \) (10). The Eigenvectors and corresponding Eigenvalues are computed from \( Q V = \lambda V, \ (V \in \mathbb{R}^{n}, V \neq 0) \) (11), where \( V \) is an Eigenvector of the matrix \( Q \) associated with the Eigenvalue \( \lambda \). All the training images are then projected onto the corresponding Eigen-subspace: \( y_i = V^{T} X_i, \ (i = 1, 2, 3, \ldots, N) \) (12), where the \( y_i \) are the projections of \( X_i \) and are called the principal components, also known as Eigenfaces. The face images are represented as linear combinations of these “principal component” vectors. In order to extract facial features, PCA and LDA are two different feature extraction algorithms that can be used; wavelet fusion and neural networks are then applied to classify the facial features, with the ORL database used for evaluation. Figure 10 shows the first five Eigenfaces constructed from the ORL database [ 63 ]. A code sketch of the Eigenface/Fisherface pipeline is given at the end of this subsection.
  • Fisherfaces and linear discriminant analysis (LDA) [ 64 , 65 ]: The Fisherface method is based on the same principle of similarity as the Eigenfaces method. Its objective is to reduce the high-dimensional image space using the linear discriminant analysis (LDA) technique instead of PCA. LDA is commonly used for dimensionality reduction and face recognition [ 66 ]. PCA is an unsupervised technique, while LDA is a supervised learning technique that uses the label information of the data. For all samples of all classes, the between-class scatter matrix \( S_B \) and the within-class scatter matrix \( S_W \) are defined as \( S_B = \sum_{i=1}^{c} M_i (\mu_i - \mu)(\mu_i - \mu)^{T} \) (13) and \( S_W = \sum_{i=1}^{c} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^{T} \) (14), where \( \mu \) is the overall mean vector of all samples, \( \mu_i \) is the mean vector of the samples belonging to class \( i \), \( X_i \) represents the set of samples belonging to class \( i \) with \( x_k \) being one image of that class, \( c \) is the number of distinct classes, and \( M_i \) is the number of training samples in class \( i \). \( S_B \) describes the scatter of the class means around the overall mean of all face classes, and \( S_W \) describes the scatter of features around the mean of each face class. The goal is to maximize the ratio \( \det|S_B| / \det|S_W| \), in other words, to minimize \( S_W \) while maximizing \( S_B \). Figure 11 shows the first five Eigenfaces and Fisherfaces obtained from the ORL database [ 63 ].
  • Independent component analysis (ICA) [ 35 ]: The ICA technique is used to compute basis vectors of a given space. The goal of this technique is to perform a linear transformation that minimizes the statistical dependence between the different basis vectors, which allows the analysis of independent components; unlike PCA, these vectors are not required to be orthogonal to each other. In addition, ICA models the observed images as mixtures of different sources and seeks statistically independent (rather than merely uncorrelated) variables, which can yield a more efficient representation.
  • Improvements of the PCA, LDA, and ICA techniques: Many works have been developed to improve these linear subspace techniques. Z. Cui et al. [ 67 ] proposed a new spatial face region descriptor (SFRD) method to extract face regions and to deal with noise variation. This method is described as follows: divide each face image into many spatial regions, and extract token-frequency (TF) features from each region by sum-pooling the reconstruction coefficients over the patches within each region. Finally, extract the SFRD for face images by applying a variant of PCA called “whitened principal component analysis (WPCA)” to reduce the feature dimension and remove the noise in the leading eigenvectors. Besides, the authors in [ 68 ] proposed a variant of LDA called probabilistic linear discriminant analysis (PLDA) that seeks the directions in space with maximum discriminability, which are hence most suitable for both face recognition and frontal face recognition under varying pose.
  • Gabor filters: Gabor filters are spatial sinusoids localized by a Gaussian window that allow features to be extracted from images at selected frequencies, orientations, and scales. To enhance performance under unconstrained environments for face recognition, in the work of [ 69 ] Gabor filters are transformed according to the shape and pose to extract the feature vectors of the face image, in combination with PCA. The PCA is applied to the Gabor features to remove redundancies and obtain the best description of the face images. Finally, the cosine metric is used to evaluate the similarity.
  • Frequency domain analysis [ 70 , 71 ]: Finally, the analysis techniques in the frequency domain offer a representation of the human face as a function of low-frequency components that present high energy. The discrete Fourier transform (DFT), discrete cosine transform (DCT), or discrete wavelet transform (DWT) techniques are independent of the data, and thus do not require training.
  • Discrete wavelet transform (DWT): DWT is another linear technique used for face recognition. In the work of [ 70 ], the authors used a two-dimensional discrete wavelet transform (2D-DWT) for face recognition with a new patch strategy. A non-uniform patch strategy for the top-level low-frequency sub-band is proposed by using an integral projection technique on the two top-level high-frequency sub-bands of the 2D-DWT, based on the average image of all training samples. This patch strategy better preserves the integrity of local information and is more suitable for reflecting the structural features of the face image. Once the patch strategy is constructed from the testing and training samples, the decision is made using a nearest-neighbor classifier. Many databases are used to evaluate this method, including Labeled Faces in the Wild (LFW), Extended Yale B, Face Recognition Technology (FERET), and AR.
  • Discrete cosine transform (DCT) [ 71 ]: DCT can be used for both global and local face recognition systems. DCT is a transformation that represents a finite sequence of data as a sum of cosine functions oscillating at different frequencies. This technique is widely used in face recognition systems [ 71 ], as well as in audio and image compression and in spectral methods for the numerical resolution of differential equations. The main step of the DCT technique is presented below.
DCT Algorithm: the 2D DCT of an N × M image f(x, y) is \( C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} f(x, y) \cos\!\left[\frac{(2x+1)u\pi}{2N}\right] \cos\!\left[\frac{(2y+1)v\pi}{2M}\right] \), where \( \alpha(u) = \sqrt{1/N} \) for \( u = 0 \) and \( \alpha(u) = \sqrt{2/N} \) otherwise (and similarly for \( \alpha(v) \) with M). The feature vector is built from the low-frequency DCT coefficients, which carry most of the energy.
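A compact sketch of the Eigenface (PCA) and Fisherface (LDA) pipeline described above is given below, using scikit-learn on synthetic vectors standing in for flattened ORL-sized face images; the numbers of components, samples, and identities are arbitrary assumptions.

```python
# Sketch of the Eigenface (PCA) and Fisherface (LDA) pipeline on flattened face vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 92 * 112))          # 200 synthetic flattened "face images" (ORL-sized)
y = np.repeat(np.arange(40), 5)          # 40 synthetic identities, 5 samples each

pca = PCA(n_components=50, whiten=True)  # Eigenfaces: project onto the leading eigenvectors
X_pca = pca.fit_transform(X)

lda = LinearDiscriminantAnalysis(n_components=30)  # Fisherfaces: between/within scatter ratio
X_lda = lda.fit_transform(X_pca, y)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_lda, y)
print(clf.predict(lda.transform(pca.transform(X[:5]))))  # identify five probe faces
```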

4.2. Nonlinear Techniques

Kernel PCA Algorithm: map the training samples into a feature space implicitly through a kernel function \( k(x_i, x_j) \), center (normalize) the resulting kernel matrix, compute its leading eigenvectors, and project each sample onto them using the kernel function. A code sketch using an RBF kernel is given below.
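A minimal kernel PCA sketch with scikit-learn, assuming an RBF kernel and synthetic flattened patches standing in for face images, is:

```python
# Sketch of kernel PCA: nonlinear feature extraction with an RBF kernel.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.random((100, 1024))                        # synthetic flattened 32 x 32 face patches
kpca = KernelPCA(n_components=40, kernel="rbf", gamma=1e-3)
X_kpca = kpca.fit_transform(X)                     # nonlinear projections of the faces
print(X_kpca.shape)                                # (100, 40)
```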
  • Kernel discriminant analysis (KDA) [ 73 ]: KDA is a kernel extension of the linear LDA technique, in the same way that kernel PCA extends PCA. Arashloo et al. [ 73 ] proposed a nonlinear binary class-specific kernel discriminant analysis classifier (CS-KDA) based on spectral regression kernel discriminant analysis. Other nonlinear techniques have also been used in the context of facial recognition:
  • Gabor-KLDA [ 74 ].
  • Evolutionary weighted principal component analysis (EWPCA) [ 75 ].
  • Kernelized maximum average margin criterion (KMAMC), SVM, and kernel Fisher discriminant analysis (KFD) [ 76 ].
  • Wavelet transform (WT), radon transform (RT), and cellular neural networks (CNN) [ 77 ].
  • Joint transform correlator-based two-layer neural network [ 78 ].
  • Kernel Fisher discriminant analysis (KFD) and KPCA [ 79 ].
  • Locally linear embedding (LLE) and LDA [ 80 ].
  • Nonlinear locality preserving with deep networks [ 81 ].
  • Nonlinear DCT and kernel discriminative common vector (KDCV) [ 82 ].

4.3. Summary of Holistic Approaches

5. Hybrid Approach

5.1. Technique Presentation

  • Gabor wavelet and linear discriminant analysis (GW-LDA) [ 91 ]: Fathima et al. [ 91 ] proposed a hybrid approach combining Gabor wavelet and linear discriminant analysis (HGWLDA) for face recognition. The grayscale face image is approximated and reduced in dimension. The authors have convolved the grayscale face image with a bank of Gabor filters with varying orientations and scales. After that, a subspace technique 2D-LDA is used to maximize the inter-class space and reduce the intra-class space. To classify and recognize the test face image, the k-nearest neighbour (k-NN) classifier is used. The recognition task is done by comparing the test face image feature with each of the training set features. The experimental results show the robustness of this approach in different lighting conditions.
  • Over-complete LBP (OCLBP), LDA, and within-class covariance normalization (WCCN): Barkan et al. [ 92 ] proposed a new representation of the face image based on over-complete LBP (OCLBP), a multi-scale modified version of the LBP technique. The LDA technique is performed to reduce the high-dimensional representations. Finally, within-class covariance normalization (WCCN) is the metric learning technique used for face recognition.
  • Advanced correlation filters and Walsh LBP (WLBP): Juefei et al. [ 93 ] implemented a single-sample periocular-based alignment-robust face recognition technique based on high-dimensional Walsh LBP (WLBP). This technique utilizes only one sample per subject class and generates new face images under a wide range of 3D rotations using the 3D generic elastic model, which is both accurate and computationally inexpensive. The LFW database is used for evaluation, and the proposed method outperformed the state-of-the-art algorithms under four evaluation protocols with a high accuracy of 89.69%.
  • Multi-sub-region-based correlation filter bank (MS-CFB): Yan et al. [ 94 ] propose an effective feature extraction technique for robust face recognition, named multi-sub-region-based correlation filter bank (MS-CFB). MS-CFB extracts the local features independently for each face sub-region. After that, the different face sub-regions are concatenated to give optimal overall correlation outputs. This technique reduces the complexity, achieves higher recognition rates, and provides a better feature representation for recognition compared with several state-of-the-art techniques on various public face databases.
  • SIFT features, Fisher vectors, and PCA: Simonyan et al. [ 64 ] have developed a novel method for face recognition based on the SIFT descriptor and Fisher vectors. The authors propose a discriminative dimensionality reduction owing to the high dimensionality of the Fisher vectors. After that, these vectors are projected into a low dimensional subspace with a linear projection. The objective of this methodology is to describe the image based on dense SIFT features and Fisher vectors encoding to achieve high performance on the challenging LFW dataset in both restricted and unrestricted settings.
  • CNNs and stacked auto-encoder (SAE) techniques: Ding et al. [ 95 ] proposed multimodal deep face representation (MM-DFR) framework based on convolutional neural networks (CNNs) technique from the original holistic face image, rendered frontal face by 3D face model (stand for holistic facial features and local facial features, respectively), and uniformly sampled image patches. The proposed MM-DFR framework has two steps: a CNNs technique is used to extract the features and a three-layer stacked auto-encoder (SAE) technique is employed to compress the high-dimensional deep feature into a compact face signature. The LFW database is used to evaluate the identification performance of MM-DFR. The flowchart of the proposed MM-DFR framework is shown in Figure 12 .
  • PCA and ANFIS: Sharma et al. [ 96 ] propose an efficient pose-invariant face recognition system based on PCA technique and ANFIS classifier. The PCA technique is employed to extract the features of an image, and the ANFIS classifier is developed for identification under a variety of pose conditions. The performance of the proposed system based on PCA–ANFIS is better than ICA–ANFIS and LDA–ANFIS for the face recognition task. The ORL database is used for evaluation.
  • DCT and PCA: Ojala et al. [ 97 ] developed a fast face recognition system based on the DCT and PCA techniques. A genetic algorithm (GA) is used to extract facial features, which removes irrelevant features and reduces their number. In addition, the DCT–PCA technique is used to extract the features and reduce the dimensionality. The minimum Euclidean distance (ED) is used as the measurement for the decision. Various face databases are used to demonstrate the effectiveness of this system.
  • PCA, SIFT, and iterative closest point (ICP): Mian et al. [ 98 ] present a multimodal (2D and 3D) face recognition system based on hybrid matching to achieve efficiency and robustness to facial expressions. The Hotelling transform is performed to automatically correct the pose of a 3D face using its texture. After that, in order to form a rejection classifier, a novel 3D spherical face representation (SFR) is used in conjunction with the SIFT descriptor, which provides efficient recognition in the case of large galleries by eliminating a large number of candidate faces. A modified iterative closest point (ICP) algorithm is used for the decision. This system is less sensitive and more robust to facial expressions, achieving a 98.6% verification rate and a 96.1% identification rate on the complete FRGC v2 database.
  • PCA, local Gabor binary pattern histogram sequence (LGBPHS), and Gabor wavelets: Cho et al. [ 99 ] proposed a computationally efficient hybrid face recognition system that employs both holistic and local features. The PCA technique is used to reduce the dimensionality, and the local Gabor binary pattern histogram sequence (LGBPHS) technique is then employed in the recognition stage, proposed to reduce the complexity caused by the Gabor filters. The experimental results show a better recognition rate compared with the PCA and Gabor wavelet techniques under illumination variations. The Extended Yale Face Database B is used to demonstrate the effectiveness of this system.
  • PCA and Fisher linear discriminant (FLD) [ 100 , 101 ]: Sing et al. [ 101 ] propose a novel hybrid technique for face representation and recognition that exploits both local and subspace features. In order to extract the local features, the whole image is divided into sub-regions, while the global features are extracted directly from the whole image. PCA and Fisher linear discriminant (FLD) techniques are then applied to the fused feature vector to reduce the dimensionality. The CMU-PIE, FERET, and AR face databases are used for the evaluation. A sketch of such a local + global fusion pipeline is given after this list.
  • SPCA–KNN [ 102 ]: Kamencay et al. [ 102 ] develop a new face recognition method based on SIFT features combined with PCA and KNN techniques. The Hessian–Laplace detector along with the SPCA descriptor is used to extract the local features, SPCA is introduced to identify the human face, and a KNN classifier is used to identify the closest human faces from the trained features. The experiments report a recognition rate of 92% for the unsegmented ESSEX database and 96% for the segmented database (700 training images).
  • Convolution operations, LSTM recurrent units, and ELM classifier [ 103 ]: Sun et al. [ 103 ] propose a hybrid deep structure called CNN–LSTM–ELM in order to achieve sequential human activity recognition (HAR). Their proposed CNN–LSTM–ELM structure is evaluated using the OPPORTUNITY dataset, which contains 46,495 training samples and 9894 testing samples, and each sample is a sequence. The model training and testing runs on a GPU with 1536 cores, 1050 MHz clock speed, and 8 GB RAM. The flowchart of the proposed CNN–LSTM–ELM structure is shown in Figure 13 [ 103 ].
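As an illustration of the local + global fusion idea running through these hybrid schemes, the sketch below concatenates per-cell uniform-LBP histograms (local features) with a PCA projection of the whole image (global features), reduces the fused vector with LDA, and classifies with k-NN. The data are synthetic, and the grid size, component counts, and classifier are arbitrary assumptions rather than any specific surveyed method.

```python
# Sketch of a hybrid representation: local LBP histograms fused with a global PCA projection.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def local_features(img, grid=4):
    """Per-cell uniform-LBP histograms (local part of the signature)."""
    codes = local_binary_pattern(img, P=8, R=1, method="uniform")   # values in 0..9
    feats = []
    for rows in np.array_split(codes, grid, axis=0):
        for cell in np.array_split(rows, grid, axis=1):
            hist, _ = np.histogram(cell, bins=10, range=(0, 10))
            feats.append(hist / (hist.sum() + 1e-8))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
imgs = rng.integers(0, 256, size=(200, 64, 64), dtype=np.uint8)     # synthetic "face images"
labels = np.repeat(np.arange(40), 5)                                # 40 synthetic identities

local = np.array([local_features(im) for im in imgs])               # local features
global_ = PCA(n_components=40).fit_transform(imgs.reshape(len(imgs), -1))  # global features
fused = np.hstack([local, global_])                                 # hybrid feature vector

reduced = LinearDiscriminantAnalysis(n_components=30).fit_transform(fused, labels)
clf = KNeighborsClassifier(n_neighbors=1).fit(reduced, labels)
```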

5.2. Summary of Hybrid Approaches

6. Assessment of Face Recognition Approaches

6.1. Measures of Similarity or Distances

  • Peak-to-correlation energy (PCE) or peak-to-sidelobe ratio (PSR) [ 18 ]: The PCE was introduced in (8).
  • Euclidean distance [ 54 ]: The Euclidean distance is one of the most basic measures used to compute the direct distance between two points in a plane. For two points \( P_1 \) and \( P_2 \) with coordinates \( (x_1, y_1) \) and \( (x_2, y_2) \), respectively, the Euclidean distance between them is \( d_E(P_1, P_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \) (15). In general, the Euclidean distance between two points \( P = (p_1, p_2, \ldots, p_n) \) and \( Q = (q_1, q_2, \ldots, q_n) \) in n-dimensional space is defined by \( d_E(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \) (16). A code sketch of these measures is given after this list.
  • Bhattacharyya distance [ 104 , 105 ]: The Bhattacharyya distance is a statistical measure that quantifies the similarity between two discrete or continuous probability distributions. This distance is particularly known for its low processing time and its low sensitivity to noise. For probability distributions p and q defined on the same domain, the Bhattacharyya distance is defined as \( D_B(p, q) = -\ln\big( BC(p, q) \big) \) (17), with \( BC(p, q) = \sum_{x \in X} \sqrt{p(x)\, q(x)} \) (18a) for discrete probability distributions and \( BC(p, q) = \int \sqrt{p(x)\, q(x)}\, dx \) (18b) for continuous probability distributions, where \( BC \) is the Bhattacharyya coefficient. In both cases, \( 0 \leq BC \leq 1 \) and \( 0 \leq D_B \leq \infty \). In its simplest formulation, the Bhattacharyya distance between two classes that follow normal distributions can be calculated from their means \( \mu \) and variances \( \sigma^2 \): \( D_B(p, q) = \frac{1}{4} \ln\!\left( \frac{1}{4} \left( \frac{\sigma_p^2}{\sigma_q^2} + \frac{\sigma_q^2}{\sigma_p^2} + 2 \right) \right) + \frac{1}{4} \frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} \) (19).
  • Chi-squared distance [ 106 ]: The Chi-squared ( \( \chi^2 \) ) distance weights each difference by the magnitude of the samples, so that differences in bins with few occurrences are given the same relevance as those in bins with many occurrences. To compare two histograms \( S_1 = (u_1, \ldots, u_m) \) and \( S_2 = (w_1, \ldots, w_m) \), the Chi-squared distance is defined as \( \chi^2 = D(S_1, S_2) = \frac{1}{2} \sum_{i=1}^{m} \frac{(u_i - w_i)^2}{u_i + w_i} \) (20).
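The three measures above translate directly into NumPy; the sketch below assumes one-dimensional normal distributions for the Bhattacharyya case, following Equation (19).

```python
# Sketch of the similarity measures defined above: Euclidean distance, Bhattacharyya
# distance between two normal distributions, and Chi-squared distance between histograms.
import numpy as np

def euclidean(p, q):
    """Equation (16): Euclidean distance between two n-dimensional points."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((p - q) ** 2))

def bhattacharyya_normal(mu_p, var_p, mu_q, var_q):
    """Equation (19): Bhattacharyya distance between two 1-D normal distributions."""
    return (0.25 * np.log(0.25 * (var_p / var_q + var_q / var_p + 2))
            + 0.25 * (mu_p - mu_q) ** 2 / (var_p + var_q))

def chi_squared(h1, h2):
    """Equation (20): Chi-squared distance between two histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    denom = h1 + h2
    mask = denom > 0                   # skip empty bins to avoid division by zero
    return 0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask])
```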

6.2. Classifiers

  • Support vector machines (SVMs) [ 13 , 26 ]: The feature vectors extracted by any descriptor can be classified by a linear or nonlinear SVM. The SVM classifier separates the classes with an optimal hyperplane. To determine this hyperplane, only the closest points of the whole learning set are used; these points are called support vectors ( Figure 14 ). There is an infinite number of hyperplanes capable of perfectly separating two classes, so the SVM selects the hyperplane that maximizes the minimal distance between the learning examples and the hyperplane (i.e., the distance between the support vectors and the hyperplane); this distance is called the “margin”. The SVM classifier thus computes the optimal hyperplane that assigns a set of labeled training data to the correct classes. The training set is \( D = \{ (x_i, y_i) \mid x_i \in \mathbb{R}^n, \ y_i \in \{-1, 1\}, \ i = 1, \ldots, l \} \) (21), where the \( x_i \) are the training feature vectors and the \( y_i \) are the corresponding labels (1 or −1). An SVM tries to find a hyperplane that separates the samples with the smallest error; the classification function is obtained from the signed distance between the input vector and the hyperplane, \( f(x_i) = w \cdot x_i - b \) (22), where \( w \) and \( b \) are the parameters of the model. Shen et al. [ 108 ] used Gabor filters to extract the face features and applied an SVM for the classification. For comparison, the FaceNet method [ 39 ] achieves record accuracies of 99.63% and 95.12% on the LFW and YouTube Faces DB datasets, respectively. Sketches of an SVM classifier and of a small CNN are given after this list.
  • k-nearest neighbors (k-NN) [ 17 , 91 ]: k-NN is a lazy algorithm: during training it simply stores the samples and builds no explicit model (unlike, for example, decision trees); classification is deferred until a query sample is compared with its nearest stored neighbors.
  • K-means [ 9 , 109 ]: It is called K-means because it represents each of the groups by the average (or weighted average) of its points, called the centroid. In the K-means algorithm, it is necessary to specify a priori the number of clusters k that one wishes to form in order to start the process.
  • Deep learning (DL): An automatic learning technique that uses neural network architectures. The term “deep” refers to the number of hidden layers in the neural network: while conventional neural networks contain one or two hidden layers, deep neural networks (DNNs) contain many, as presented in Figure 15 . A minimal example of the convolutional architecture described below is given after this list.
  • Convolutional layer : sometimes called the feature-extractor layer because the features of the image are extracted within this layer. Convolution preserves the spatial relationship between pixels by learning image features over small squares of the input image. The input image is convolved with a set of learnable filters, producing a feature map (activation map) that is fed as input to the next convolutional layer. The convolutional layer is usually followed by a rectified linear unit (ReLU) activation that converts all negative values to zero; this makes the network computationally efficient, as only a subset of neurons is activated at any time.
  • Pooling layer: used to reduce dimensions, with the aim of reducing processing time by retaining only the most important information after convolution. This layer reduces the number of parameters and the amount of computation in the network, controlling overfitting by progressively reducing the spatial size of the representation. There are two common operations in this layer: average pooling and max pooling. Average pooling takes all the elements of the sub-matrix, calculates their average, and stores the value in the output matrix; max pooling searches for the highest value in the sub-matrix and saves it in the output matrix.
  • Fully-connected layer : in this layer, the neurons have full connections to all the activations of the previous layer; it connects every neuron in one layer to every neuron in the next layer and is used, after training, to classify the input image into one of the predefined categories.
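A minimal SVM-based identification sketch with scikit-learn, on synthetic feature vectors standing in for HOG/LBP/Gabor descriptors, is:

```python
# Sketch of SVM-based face identification on extracted feature vectors (synthetic here):
# the classifier searches for maximum-margin hyperplanes separating the identities.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.random((200, 128))            # e.g. HOG/LBP/Gabor feature vectors
identities = np.repeat(np.arange(10), 20)    # 10 synthetic subjects, 20 samples each

svm = SVC(kernel="linear", C=1.0)            # linear kernel; "rbf" is a common nonlinear choice
svm.fit(features, identities)
print(svm.predict(features[:3]))             # predicted identities of three probes
```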
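The convolution → ReLU → pooling → fully-connected structure described above can be written, for example, as the following PyTorch sketch; the 64 × 64 grayscale input size and the 40 identities are arbitrary assumptions.

```python
# Minimal PyTorch sketch of a convolution -> ReLU -> pooling -> fully-connected network.
import torch
import torch.nn as nn

class TinyFaceCNN(nn.Module):
    def __init__(self, num_identities=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional (feature extractor) layer
            nn.ReLU(),                                   # negative activations set to zero
            nn.MaxPool2d(2),                             # max pooling: keep the strongest response
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_identities)  # fully-connected layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyFaceCNN()(torch.randn(4, 1, 64, 64))   # four random "face crops"
print(logits.shape)                                  # torch.Size([4, 40])
```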

6.3. Databases Used

  • LFW (Labeled Faces in the Wild) database was released in October 2007. It contains 13,233 images of 5749 subjects, 1680 of whom have at least two images, while the rest have a single image. These face images were collected from the Internet, pre-processed, and localized by the Viola–Jones detector, with a resolution of 250 × 250 pixels. Most of them are in color, although some are in grayscale; they are stored in JPG format and organized in folders. A sketch showing how to load this database is given after this list.
  • FERET (Face Recognition Technology) database was created in 15 sessions in a semi-controlled environment between August 1993 and July 1996. It contains 1564 sets of images, for a total of 14,126 images. The duplicate sets belong to subjects already present in the individual image sets and were generally captured on a different day. Some images of the same subject were taken several years apart and can be used to study the facial changes that appear over time. The images have a depth of 24 bits (RGB color) and a resolution of 512 × 768 pixels.
  • AR face database was created by Aleix Martínez and Robert Benavente at the Computer Vision Center (CVC) of the Autonomous University of Barcelona in June 1998. It contains more than 4000 images of 126 subjects (70 men and 56 women), taken at the CVC under controlled conditions. The images were taken frontally, with different facial expressions, three different lighting conditions, and several accessories: scarves, glasses, or sunglasses. Two imaging sessions were performed with the same subjects, 14 days apart. The images have a resolution of 576 × 768 pixels and a depth of 24 bits, in RAW RGB format.
  • ORL Database of Faces was collected between April 1992 and April 1994 at the AT&T laboratory in Cambridge. It consists of 10 images per subject for 40 subjects, giving 400 images in total. For some subjects, the images were taken at different times, with varying illumination and facial expressions: eyes open/closed, smiling/not smiling, and with or without glasses. The images were taken against a dark homogeneous background, with the subjects in an upright, frontal position allowing some small rotation. They are grayscale images with a resolution of 92 × 112 pixels.
  • Extended Yale Face Database B contains 16,128 grayscale images of 640 × 480 pixels, covering 28 individuals under 9 poses and 64 different lighting conditions. It also includes a set of cropped images containing only the face of each individual.
  • Pointing Head Pose Image Database (PHPID) is one of the most widely used databases for face recognition under pose variation. It contains 2790 monocular face images of 15 persons, with pan and tilt angles ranging from −90° to +90°. Every person has two series of 93 different poses (93 images each). The subjects vary in skin color and may or may not wear glasses.
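The LFW database mentioned above can be fetched directly with scikit-learn; the sketch below is one way to load it (the filtering and resize values are arbitrary choices, and the first call downloads the data, roughly 200 MB).

```python
# Sketch of loading the LFW database with scikit-learn's built-in fetcher.
from sklearn.datasets import fetch_lfw_people

lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
print(lfw.images.shape)        # (n_samples, height, width) grayscale face images
print(len(lfw.target_names))   # number of identities kept after filtering
```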

6.4. Comparison between Holistic, Local, and Hybrid Techniques

7. Discussion about Future Directions and Conclusions

7.1. Discussion

  • Local approaches: use features that describe the face only partially. For example, a system may extract local features such as the eyes, mouth, and nose; the feature values are then calculated from the lines or points that can be represented on the face image for the recognition step.
  • Holistic approaches: use features that globally describe the complete face as a model, including the background (although it is desirable to occupy the smallest possible surface).
  • Hybrid approaches: combine local and holistic approaches.
  • Three-dimensional face recognition: In 2D image-based techniques, some information is lost owing to the 3D structure of the face. Lighting and pose variations are two major unresolved problems of 2D face recognition. Recently, 3D face recognition has been widely studied by the scientific community to overcome these unresolved problems and to achieve significantly higher accuracy by measuring the geometry of rigid features on the face. For this reason, several recent systems based on 3D data have been developed [ 3 , 93 , 95 , 128 , 129 ].
  • Multimodal facial recognition: sensors developed in recent years have a proven ability to acquire not only two-dimensional texture information, but also facial shape, that is, three-dimensional information. For this reason, some recent studies have merged the 2D and 3D types of information to take advantage of each of them and obtain a hybrid system that improves recognition over either single modality [ 98 ].
  • Deep learning (DL): a very broad concept with no single exact definition, but studies [ 14 , 110 , 111 , 112 , 113 , 121 , 130 , 131 ] agree that DL covers a set of algorithms that attempt to model high-level abstractions by composing multiple processing layers. This field of research began in the 1980s and is a branch of machine learning in which algorithms are used to train deep neural networks (DNNs) to achieve greater accuracy than other classical techniques. Recent progress has reached a point where DL performs better than humans in some tasks, for example, recognizing objects in images.

7.2. Conclusions

Author Contributions

Conflicts of Interest

  • Liao, S.; Jain, A.K.; Li, S.Z. Partial face recognition: Alignment-free approach. IEEE Trans. Pattern Anal. Mach. Intell. 2012 , 35 , 1193–1205. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Jridi, M.; Napoléon, T.; Alfalou, A. One lens optical correlation: Application to face recognition. Appl. Opt. 2018 , 57 , 2087–2095. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Napoléon, T.; Alfalou, A. Pose invariant face recognition: 3D model from single photo. Opt. Lasers Eng. 2017 , 89 , 150–161. [ Google Scholar ] [ CrossRef ]
  • Ouerhani, Y.; Jridi, M.; Alfalou, A. Fast face recognition approach using a graphical processing unit “GPU”. In Proceedings of the 2010 IEEE International Conference on Imaging Systems and Techniques, Thessaloniki, Greece, 1–2 July 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 80–84. [ Google Scholar ]
  • Yang, W.; Wang, S.; Hu, J.; Zheng, G.; Valli, C. A fingerprint and finger-vein based cancelable multi-biometric system. Pattern Recognit. 2018 , 78 , 242–251. [ Google Scholar ] [ CrossRef ]
  • Patel, N.P.; Kale, A. Optimize Approach to Voice Recognition Using IoT. In Proceedings of the 2018 International Conference on Advances in Communication and Computing Technology (ICACCT), Sangamner, India, 8–9 February 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 251–256. [ Google Scholar ]
  • Wang, Q.; Alfalou, A.; Brosseau, C. New perspectives in face correlation research: A tutorial. Adv. Opt. Photonics 2017 , 9 , 1–78. [ Google Scholar ] [ CrossRef ]
  • Alfalou, A.; Brosseau, C.; Kaddah, W. Optimization of decision making for face recognition based on nonlinear correlation plane. Opt. Commun. 2015 , 343 , 22–27. [ Google Scholar ] [ CrossRef ]
  • Zhao, C.; Li, X.; Cang, Y. Bisecting k-means clustering based face recognition using block-based bag of words model. Opt. Int. J. Light Electron Opt. 2015 , 126 , 1761–1766. [ Google Scholar ] [ CrossRef ]
  • HajiRassouliha, A.; Gamage, T.P.B.; Parker, M.D.; Nash, M.P.; Taberner, A.J.; Nielsen, P.M. FPGA implementation of 2D cross-correlation for real-time 3D tracking of deformable surfaces. In Proceedings of the 2013 28th International Conference on Image and Vision Computing New Zealand (IVCNZ 2013), Wellington, New Zealand, 27–29 November 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 352–357. [ Google Scholar ]
  • Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. A comparative study of CFs, LBP, HOG, SIFT, SURF, and BRIEF techniques for face recognition. In Pattern Recognition and Tracking XXIX ; International Society for Optics and Photonics; SPIE: Bellingham, WA, USA, 2018; Volume 10649, p. 106490M. [ Google Scholar ]
  • Dehai, Z.; Da, D.; Jin, L.; Qing, L. A pca-based face recognition method by applying fast fourier transform in pre-processing. In 3rd International Conference on Multimedia Technology (ICMT-13) ; Atlantis Press: Paris, France, 2013. [ Google Scholar ]
  • Ouerhani, Y.; Alfalou, A.; Brosseau, C. Road mark recognition using HOG-SVM and correlation. In Optics and Photonics for Information Processing XI ; International Society for Optics and Photonics; SPIE: Bellingham, WA, USA, 2017; Volume 10395, p. 103950Q. [ Google Scholar ]
  • Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017 , 234 , 11–26. [ Google Scholar ] [ CrossRef ]
  • Xi, M.; Chen, L.; Polajnar, D.; Tong, W. Local binary pattern network: A deep learning approach for face recognition. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3224–3228. [ Google Scholar ]
  • Ojala, T.; Pietikäinen, M.; Harwood, D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 1996 , 29 , 51–59. [ Google Scholar ] [ CrossRef ]
  • Gowda, H.D.S.; Kumar, G.H.; Imran, M. Multimodal Biometric Recognition System Based on Nonparametric Classifiers. Data Anal. Learn. 2018 , 43 , 269–278. [ Google Scholar ]
  • Ouerhani, Y.; Jridi, M.; Alfalou, A.; Brosseau, C. Optimized pre-processing input plane GPU implementation of an optical face recognition technique using a segmented phase only composite filter. Opt. Commun. 2013 , 289 , 33–44. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Mousa Pasandi, M.E. Face, Age and Gender Recognition Using Local Descriptors. Ph.D. Thesis, Université d’Ottawa/University of Ottawa, Ottawa, ON, Canada, 2014. [ Google Scholar ]
  • Khoi, P.; Thien, L.H.; Viet, V.H. Face Retrieval Based on Local Binary Pattern and Its Variants: A Comprehensive Study. Int. J. Adv. Comput. Sci. Appl. 2016 , 7 , 249–258. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Zeppelzauer, M. Automated detection of elephants in wildlife video. EURASIP J. Image Video Process. 2013 , 46 , 2013. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Parmar, D.N.; Mehta, B.B. Face recognition methods & applications. arXiv 2014 , arXiv:1403.0485. [ Google Scholar ]
  • Vinay, A.; Hebbar, D.; Shekhar, V.S.; Murthy, K.B.; Natarajan, S. Two novel detector-descriptor based approaches for face recognition using sift and surf. Procedia Comput. Sci. 2015 , 70 , 185–197. [ Google Scholar ]
  • Yang, H.; Wang, X.A. Cascade classifier for face detection. J. Algorithms Comput. Technol. 2016 , 10 , 187–197. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001. [ Google Scholar ]
  • Rettkowski, J.; Boutros, A.; Göhringer, D. HW/SW Co-Design of the HOG algorithm on a Xilinx Zynq SoC. J. Parallel Distrib. Comput. 2017 , 109 , 50–62. [ Google Scholar ] [ CrossRef ]
  • Seo, H.J.; Milanfar, P. Face verification using the lark representation. IEEE Trans. Inf. Forensics Secur. 2011 , 6 , 1275–1286. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Shah, J.H.; Sharif, M.; Raza, M.; Azeem, A. A Survey: Linear and Nonlinear PCA Based Face Recognition Techniques. Int. Arab J. Inf. Technol. 2013 , 10 , 536–545. [ Google Scholar ]
  • Du, G.; Su, F.; Cai, A. Face recognition using SURF features. In MIPPR 2009: Pattern Recognition and Computer Vision ; International Society for Optics and Photonics; SPIE: Bellingham, WA, USA, 2009; Volume 7496, p. 749628. [ Google Scholar ]
  • Calonder, M.; Lepetit, V.; Ozuysal, M.; Trzcinski, T.; Strecha, C.; Fua, P. BRIEF: Computing a local binary descriptor very fast. IEEE Trans. Pattern Anal. Mach. Intell. 2011 , 34 , 1281–1298. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Smach, F.; Miteran, J.; Atri, M.; Dubois, J.; Abid, M.; Gauthier, J.P. An FPGA-based accelerator for Fourier Descriptors computing for color object recognition using SVM. J. Real-Time Image Process. 2007 , 2 , 249–258. [ Google Scholar ] [ CrossRef ]
  • Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. A novel face detection approach using local binary pattern histogram and support vector machine. In Proceedings of the 2018 International Conference on Advanced Systems and Electric Technologies (IC_ASET), Hammamet, Tunisia, 22–25 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 28–33. [ Google Scholar ]
  • Wang, Q.; Xiong, D.; Alfalou, A.; Brosseau, C. Optical image authentication scheme using dual polarization decoding configuration. Opt. Lasers Eng. 2019 , 112 , 151–161. [ Google Scholar ] [ CrossRef ]
  • Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991 , 3 , 71–86. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Annalakshmi, M.; Roomi, S.M.M.; Naveedh, A.S. A hybrid technique for gender classification with SLBP and HOG features. Clust. Comput. 2019 , 22 , 11–20. [ Google Scholar ] [ CrossRef ]
  • Hussain, S.U.; Napoléon, T.; Jurie, F. Face Recognition Using Local Quantized Patterns ; HAL: Bengaluru, India, 2012. [ Google Scholar ]
  • Alfalou, A.; Brosseau, C. Understanding Correlation Techniques for Face Recognition: From Basics to Applications. In Face Recognition ; Oravec, M., Ed.; IntechOpen: Rijeka, Croatia, 2010. [ Google Scholar ]
  • Napoléon, T.; Alfalou, A. Local binary patterns preprocessing for face identification/verification using the VanderLugt correlator. In Optical Pattern Recognition XXV ; International Society for Optics and Photonics; SPIE: Bellingham, WA, USA, 2014; Volume 9094, p. 909408. [ Google Scholar ]
  • Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [ Google Scholar ]
  • Kambi Beli, I.; Guo, C. Enhancing face identification using local binary patterns and k-nearest neighbors. J. Imaging 2017 , 3 , 37. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Benarab, D.; Napoléon, T.; Alfalou, A.; Verney, A.; Hellard, P. Optimized swimmer tracking system by a dynamic fusion of correlation and color histogram techniques. Opt. Commun. 2015 , 356 , 256–268. [ Google Scholar ] [ CrossRef ]
  • Bonnen, K.; Klare, B.F.; Jain, A.K. Component-based representation in automated face recognition. IEEE Trans. Inf. Forensics Secur. 2012 , 8 , 239–253. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Ren, J.; Jiang, X.; Yuan, J. Relaxed local ternary pattern for face recognition. In Proceedings of the 2013 IEEE International Conference on Image Processing, Melbourne, Australia, 15–18 September 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 3680–3684. [ Google Scholar ]
  • Karaaba, M.; Surinta, O.; Schomaker, L.; Wiering, M.A. Robust face recognition by computing distances from multiple histograms of oriented gradients. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, 7–10 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 203–209. [ Google Scholar ]
  • Huang, C.; Huang, J. A fast HOG descriptor using lookup table and integral image. arXiv 2017 , arXiv:1703.06256. [ Google Scholar ]
  • Arigbabu, O.A.; Ahmad, S.M.S.; Adnan, W.A.W.; Yussof, S.; Mahmood, S. Soft biometrics: Gender recognition from unconstrained face images using local feature descriptor. arXiv 2017 , arXiv:1702.02537. [ Google Scholar ]
  • Vander Lugt, A. Signal detection by complex spatial filtering. IEEE Trans. Inf. Theory 1964 , 10 , 139. [ Google Scholar ]
  • Weaver, C.S.; Goodman, J.W. A technique for optically convolving two functions. Appl. Opt. 1966 , 5 , 1248–1249. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Horner, J.L.; Gianino, P.D. Phase-only matched filtering. Appl. Opt. 1984 , 23 , 812–816. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Leonard, I.; Alfalou, A.; Brosseau, C. Face recognition based on composite correlation filters: Analysis of their performances. In Face Recognition: Methods, Applications and Technology ; Nova Science Pub Inc.: London, UK, 2012. [ Google Scholar ]
  • Katz, P.; Aron, M.; Alfalou, A. A Face-Tracking System to Detect Falls in the Elderly ; SPIE Newsroom; SPIE: Bellingham, WA, USA, 2013. [ Google Scholar ]
  • Alfalou, A.; Brosseau, C.; Katz, P.; Alam, M.S. Decision optimization for face recognition based on an alternate correlation plane quantification metric. Opt. Lett. 2012 , 37 , 1562–1564. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Elbouz, M.; Bouzidi, F.; Alfalou, A.; Brosseau, C.; Leonard, I.; Benkelfat, B.E. Adapted all-numerical correlator for face recognition applications. In Optical Pattern Recognition XXIV ; International Society for Optics and Photonics; SPIE: Bellingham, WA, USA, 2013; Volume 8748, p. 874807. [ Google Scholar ]
  • Heflin, B.; Scheirer, W.; Boult, T.E. For your eyes only. In Proceedings of the 2012 IEEE Workshop on the Applications of Computer Vision (WACV), Breckenridge, CO, USA, 9–11 January 2012; pp. 193–200. [ Google Scholar ]
  • Zhu, X.; Liao, S.; Lei, Z.; Liu, R.; Li, S.Z. Feature correlation filter for face recognition. In Advances in Biometrics, Proceedings of the International Conference on Biometrics, Seoul, Korea, 27–29 August 2007 ; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4642, pp. 77–86. [ Google Scholar ]
  • Lenc, L.; Král, P. Automatic face recognition system based on the SIFT features. Comput. Electr. Eng. 2015 , 46 , 256–272. [ Google Scholar ] [ CrossRef ]
  • Işık, Ş. A comparative evaluation of well-known feature detectors and descriptors. Int. J. Appl. Math. Electron. Comput. 2014 , 3 , 1–6. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Mahier, J.; Hemery, B.; El-Abed, M.; El-Allam, M.; Bouhaddaoui, M.; Rosenberger, C. Computation evabio: A tool for performance evaluation in biometrics. Int. J. Autom. Identif. Technol. 2011 , 24 , hal-00984026. [ Google Scholar ]
  • Alahi, A.; Ortiz, R.; Vandergheynst, P. Freak: Fast retina keypoint. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 510–517. [ Google Scholar ]
  • Arashloo, S.R.; Kittler, J. Efficient processing of MRFs for unconstrained-pose face recognition. In Proceedings of the 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Arlington, VA, USA, 29 September–2 October 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1–8. [ Google Scholar ]
  • Ghorbel, A.; Tajouri, I.; Aydi, W.; Masmoudi, N. A comparative study of GOM, uLBP, VLC and fractional Eigenfaces for face recognition. In Proceedings of the 2016 International Image Processing, Applications and Systems (IPAS), Hammamet, Tunisia, 5–7 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5. [ Google Scholar ]
  • Lima, A.; Zen, H.; Nankaku, Y.; Miyajima, C.; Tokuda, K.; Kitamura, T. On the use of kernel PCA for feature extraction in speech recognition. IEICE Trans. Inf. Syst. 2004 , 87 , 2802–2811. [ Google Scholar ]
  • Devi, B.J.; Veeranjaneyulu, N.; Kishore, K.V.K. A novel face recognition system based on combining eigenfaces with fisher faces using wavelets. Procedia Comput. Sci. 2010 , 2 , 44–51. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Simonyan, K.; Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Fisher vector faces in the wild. In Proceedings of the BMVC 2013—British Machine Vision Conference, Bristol, UK, 9–13 September 2013. [ Google Scholar ]
  • Li, B.; Ma, K.K. Fisherface vs. eigenface in the dual-tree complex wavelet domain. In Proceedings of the 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kyoto, Japan, 12–14 September 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 30–33. [ Google Scholar ]
  • Agarwal, R.; Jain, R.; Regunathan, R.; Kumar, C.P. Automatic Attendance System Using Face Recognition Technique. In Proceedings of the 2nd International Conference on Data Engineering and Communication Technology ; Springer: Singapore, 2019; pp. 525–533. [ Google Scholar ]
  • Cui, Z.; Li, W.; Xu, D.; Shan, S.; Chen, X. Fusing robust face region descriptors via multiple metric learning for face recognition in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, Portland, OR, USA, 23–28 June 2013; pp. 3554–3561. [ Google Scholar ]
  • Prince, S.; Li, P.; Fu, Y.; Mohammed, U.; Elder, J. Probabilistic models for inference about identity. IEEE Trans. Pattern Anal. Mach. Intell. 2011 , 34 , 144–157. [ Google Scholar ]
  • Perlibakas, V. Face recognition using principal component analysis and log-gabor filters. arXiv 2006 , arXiv:cs/0605025. [ Google Scholar ]
  • Huang, Z.H.; Li, W.J.; Shang, J.; Wang, J.; Zhang, T. Non-uniform patch based face recognition via 2D-DWT. Image Vision Comput. 2015 , 37 , 12–19. [ Google Scholar ] [ CrossRef ]
  • Sufyanu, Z.; Mohamad, F.S.; Yusuf, A.A.; Mamat, M.B. Enhanced Face Recognition Using Discrete Cosine Transform. Eng. Lett. 2016 , 24 , 52–61. [ Google Scholar ]
  • Hoffmann, H. Kernel PCA for novelty detection. Pattern Recognit. 2007 , 40 , 863–874. [ Google Scholar ] [ CrossRef ]
  • Arashloo, S.R.; Kittler, J. Class-specific kernel fusion of multiple descriptors for face verification using multiscale binarised statistical image features. IEEE Trans. Inf. Forensics Secur. 2014 , 9 , 2100–2109. [ Google Scholar ] [ CrossRef ]
  • Vinay, A.; Shekhar, V.S.; Murthy, K.B.; Natarajan, S. Performance study of LDA and KFA for gabor based face recognition system. Procedia Comput. Sci. 2015 , 57 , 960–969. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Sivasathya, M.; Joans, S.M. Image Feature Extraction using Non Linear Principle Component Analysis. Procedia Eng. 2012 , 38 , 911–917. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Zhang, B.; Chen, X.; Shan, S.; Gao, W. Nonlinear face recognition based on maximum average margin criterion. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 554–559. [ Google Scholar ]
  • Vankayalapati, H.D.; Kyamakya, K. Nonlinear feature extraction approaches with application to face recognition over large databases. In Proceedings of the 2009 2nd International Workshop on Nonlinear Dynamics and Synchronization, Klagenfurt, Austria, 20–21 July 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 44–48. [ Google Scholar ]
  • Javidi, B.; Li, J.; Tang, Q. Optical implementation of neural networks for face recognition by the use of nonlinear joint transform correlators. Appl. Opt. 1995 , 34 , 3950–3962. [ Google Scholar ] [ CrossRef ]
  • Yang, J.; Frangi, A.F.; Yang, J.Y. A new kernel Fisher discriminant algorithm with application to face recognition. Neurocomputing 2004 , 56 , 415–421. [ Google Scholar ] [ CrossRef ]
  • Pang, Y.; Liu, Z.; Yu, N. A new nonlinear feature extraction method for face recognition. Neurocomputing 2006 , 69 , 949–953. [ Google Scholar ] [ CrossRef ]
  • Wang, Y.; Fei, P.; Fan, X.; Li, H. Face recognition using nonlinear locality preserving with deep networks. In Proceedings of the 7th International Conference on Internet Multimedia Computing and Service, Hunan, China, 19–21 August 2015; ACM: New York, NY, USA, 2015; p. 66. [ Google Scholar ]
  • Li, S.; Yao, Y.F.; Jing, X.Y.; Chang, H.; Gao, S.Q.; Zhang, D.; Yang, J.Y. Face recognition based on nonlinear DCT discriminant feature extraction using improved kernel DCV. IEICE Trans. Inf. Syst. 2009 , 92 , 2527–2530. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Khan, S.A.; Ishtiaq, M.; Nazir, M.; Shaheen, M. Face recognition under varying expressions and illumination using particle swarm optimization. J. Comput. Sci. 2018 , 28 , 94–100. [ Google Scholar ] [ CrossRef ]
  • Hafez, S.F.; Selim, M.M.; Zayed, H.H. 2d face recognition system based on selected gabor filters and linear discriminant analysis lda. arXiv 2015 , arXiv:1503.03741. [ Google Scholar ]
  • Shanbhag, S.S.; Bargi, S.; Manikantan, K.; Ramachandran, S. Face recognition using wavelet transforms-based feature extraction and spatial differentiation-based pre-processing. In Proceedings of the 2014 International Conference on Science Engineering and Management Research (ICSEMR), Chennai, India, 27–29 November 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1–8. [ Google Scholar ]
  • Fan, J.; Chow, T.W. Exactly Robust Kernel Principal Component Analysis. IEEE Trans. Neural Netw. Learn. Syst. 2019 . [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Vinay, A.; Cholin, A.S.; Bhat, A.D.; Murthy, K.B.; Natarajan, S. An Efficient ORB based Face Recognition framework for Human-Robot Interaction. Procedia Comput. Sci. 2018 , 133 , 913–923. [ Google Scholar ]
  • Lu, J.; Plataniotis, K.N.; Venetsanopoulos, A.N. Face recognition using kernel direct discriminant analysis algorithms. IEEE Trans. Neural Netw. 2003 , 14 , 117–126. [ Google Scholar ] [ PubMed ] [ Green Version ]
  • Yang, W.J.; Chen, Y.C.; Chung, P.C.; Yang, J.F. Multi-feature shape regression for face alignment. EURASIP J. Adv. Signal Process. 2018 , 2018 , 51. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Ouanan, H.; Ouanan, M.; Aksasse, B. Non-linear dictionary representation of deep features for face recognition from a single sample per person. Procedia Comput. Sci. 2018 , 127 , 114–122. [ Google Scholar ] [ CrossRef ]
  • Fathima, A.A.; Ajitha, S.; Vaidehi, V.; Hemalatha, M.; Karthigaiveni, R.; Kumar, R. Hybrid approach for face recognition combining Gabor Wavelet and Linear Discriminant Analysis. In Proceedings of the 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS), Bhubaneswar, India, 2–3 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 220–225. [ Google Scholar ]
  • Barkan, O.; Weill, J.; Wolf, L.; Aronowitz, H. Fast high dimensional vector multiplication face recognition. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1960–1967. [ Google Scholar ]
  • Juefei-Xu, F.; Luu, K.; Savvides, M. Spartans: Single-sample periocular-based alignment-robust recognition technique applied to non-frontal scenarios. IEEE Trans. Image Process. 2015 , 24 , 4780–4795. [ Google Scholar ] [ CrossRef ]
  • Yan, Y.; Wang, H.; Suter, D. Multi-subregion based correlation filter bank for robust face recognition. Pattern Recognit. 2014 , 47 , 3487–3501. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Ding, C.; Tao, D. Robust face recognition via multimodal deep face representation. IEEE Trans. Multimed. 2015 , 17 , 2049–2058. [ Google Scholar ] [ CrossRef ]
  • Sharma, R.; Patterh, M.S. A new pose invariant face recognition system using PCA and ANFIS. Optik 2015 , 126 , 3483–3487. [ Google Scholar ] [ CrossRef ]
  • Moussa, M.; Hmila, M.; Douik, A. A Novel Face Recognition Approach Based on Genetic Algorithm Optimization. Stud. Inform. Control 2018 , 27 , 127–134. [ Google Scholar ] [ CrossRef ]
  • Mian, A.; Bennamoun, M.; Owens, R. An efficient multimodal 2D-3D hybrid approach to automatic face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2007 , 29 , 1927–1943. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Cho, H.; Roberts, R.; Jung, B.; Choi, O.; Moon, S. An efficient hybrid face recognition algorithm using PCA and GABOR wavelets. Int. J. Adv. Robot. Syst. 2014 , 11 , 59. [ Google Scholar ] [ CrossRef ]
  • Guru, D.S.; Suraj, M.G.; Manjunath, S. Fusion of covariance matrices of PCA and FLD. Pattern Recognit. Lett. 2011 , 32 , 432–440. [ Google Scholar ] [ CrossRef ]
  • Sing, J.K.; Chowdhury, S.; Basu, D.K.; Nasipuri, M. An improved hybrid approach to face recognition by fusing local and global discriminant features. Int. J. Biom. 2012 , 4 , 144–164. [ Google Scholar ] [ CrossRef ]
  • Kamencay, P.; Zachariasova, M.; Hudec, R.; Jarina, R.; Benco, M.; Hlubik, J. A novel approach to face recognition using image segmentation based on spca-knn method. Radioengineering 2013 , 22 , 92–99. [ Google Scholar ]
  • Sun, J.; Fu, Y.; Li, S.; He, J.; Xu, C.; Tan, L. Sequential Human Activity Recognition Based on Deep Convolutional Network and Extreme Learning Machine Using Wearable Sensors. J. Sens. 2018 , 2018 , 10. [ Google Scholar ] [ CrossRef ]
  • Soltanpour, S.; Boufama, B.; Wu, Q.J. A survey of local feature methods for 3D face recognition. Pattern Recognit. 2017 , 72 , 391–406. [ Google Scholar ] [ CrossRef ]
  • Sharma, G.; ul Hussain, S.; Jurie, F. Local higher-order statistics (LHS) for texture categorization and facial analysis. In European Conference on Computer Vision ; Springer: Berlin/Heidelberg, Germany, 2012; pp. 1–12. [ Google Scholar ]
  • Zhang, J.; Marszałek, M.; Lazebnik, S.; Schmid, C. Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vis. 2007 , 73 , 213–238. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Leonard, I.; Alfalou, A.; Brosseau, C. Spectral optimized asymmetric segmented phase-only correlation filter. Appl. Opt. 2012 , 51 , 2638–2650. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Shen, L.; Bai, L.; Ji, Z. A svm face recognition method based on optimized gabor features. In International Conference on Advances in Visual Information Systems ; Springer: Berlin/Heidelberg, Germany, 2007; pp. 165–174. [ Google Scholar ]
  • Pratima, D.; Nimmakanti, N. Pattern Recognition Algorithms for Cluster Identification Problem. Int. J. Comput. Sci. Inform. 2012 , 1 , 2231–5292. [ Google Scholar ]
  • Zhang, C.; Prasanna, V. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; ACM: New York, NY, USA, 2017; pp. 35–44. [ Google Scholar ]
  • Nguyen, D.T.; Pham, T.D.; Lee, M.B.; Park, K.R. Visible-Light Camera Sensor-Based Presentation Attack Detection for Face Recognition by Combining Spatial and Temporal Information. Sensors 2019 , 19 , 410. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep face recognition. In Proceedings of the BMVC 2015—British Machine Vision Conference, Swansea, UK, 7–10 September 2015. [ Google Scholar ]
  • Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision ; Springer: Berlin/Heidelberg, Germany, 2016; pp. 499–515. [ Google Scholar ]
  • Passalis, N.; Tefas, A. Spatial bag of features learning for large scale face image retrieval. In INNS Conference on Big Data ; Springer: Berlin/Heidelberg, Germany, 2016; pp. 8–17. [ Google Scholar ]
  • Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 212–220. [ Google Scholar ]
  • Amato, G.; Falchi, F.; Gennaro, C.; Massoli, F.V.; Passalis, N.; Tefas, A.; Vairo, C. Face Verification and Recognition for Digital Forensics and Information Security. In Proceedings of the 2019 7th International Symposium on Digital Forensics and Security (ISDFS), Barcelos, Portugal, 10–12 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [ Google Scholar ]
  • Taigman, Y.; Yang, M.; Ranzato, M.A.; Wolf, L. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, Washington, DC, USA, 23–28 June 2014; pp. 1701–1708. [ Google Scholar ]
  • Ma, Z.; Ding, Y.; Li, B.; Yuan, X. Deep CNNs with Robust LBP Guiding Pooling for Face Recognition. Sensors 2018 , 18 , 3876. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Koo, J.; Cho, S.; Baek, N.; Kim, M.; Park, K. CNN-Based Multimodal Human Recognition in Surveillance Environments. Sensors 2018 , 18 , 3040. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Cho, S.; Baek, N.; Kim, M.; Koo, J.; Kim, J.; Park, K. Face Detection in Nighttime Images Using Visible-Light Camera Sensors with Two-Step Faster Region-Based Convolutional Neural Network. Sensors 2018 , 18 , 2995. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Koshy, R.; Mahmood, A. Optimizing Deep CNN Architectures for Face Liveness Detection. Entropy 2019 , 21 , 423. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Elmahmudi, A.; Ugail, H. Deep face recognition using imperfect facial data. Future Gener. Comput. Syst. 2019 , 99 , 213–225. [ Google Scholar ] [ CrossRef ]
  • Seibold, C.; Samek, W.; Hilsmann, A.; Eisert, P. Accurate and robust neural networks for security related applications exampled by face morphing attacks. arXiv 2018 , arXiv:1806.04265. [ Google Scholar ]
  • Yim, J.; Jung, H.; Yoo, B.; Choi, C.; Park, D.; Kim, J. Rotating your face using multi-task deep neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 676–684. [ Google Scholar ]
  • Bajrami, X.; Gashi, B.; Murturi, I. Face recognition performance using linear discriminant analysis and deep neural networks. Int. J. Appl. Pattern Recognit. 2018 , 5 , 240–250. [ Google Scholar ] [ CrossRef ]
  • Gourier, N.; Hall, D.; Crowley, J.L. Estimating Face Orientation from Robust Detection of Salient Facial Structures. Available online: venus.inrialpes.fr/jlc/papers/Pointing04-Gourier.pdf (accessed on 15 December 2019).
  • Gonzalez-Sosa, E.; Fierrez, J.; Vera-Rodriguez, R.; Alonso-Fernandez, F. Facial soft biometrics for recognition in the wild: Recent works, annotation, and COTS evaluation. IEEE Trans. Inf. Forensics Secur. 2018 , 13 , 2001–2014. [ Google Scholar ] [ CrossRef ]
  • Boukamcha, H.; Hallek, M.; Smach, F.; Atri, M. Automatic landmark detection and 3D Face data extraction. J. Comput. Sci. 2017 , 21 , 340–348. [ Google Scholar ] [ CrossRef ]
  • Ouerhani, Y.; Jridi, M.; Alfalou, A.; Brosseau, C. Graphics processor unit implementation of correlation technique using a segmented phase only composite filter. Opt. Commun. 2013 , 289 , 33–44. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Su, C.; Yan, Y.; Chen, S.; Wang, H. An efficient deep neural networks training framework for robust face recognition. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3800–3804. [ Google Scholar ]
  • Coşkun, M.; Uçar, A.; Yildirim, Ö.; Demir, Y. Face recognition based on convolutional neural network. In Proceedings of the 2017 International Conference on Modern Electrical and Energy Systems (MEES), Kremenchuk, Ukraine, 15–17 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 376–379. [ Google Scholar ]


Author / Technique Used | Database | Matching | Limitation | Advantage | Result
Local Appearance-Based Techniques
Khoi et al. [ ] LBP | TDF / CF1999 / LFW | MAP | Skewness in face image | Robust feature in frontal face | 5% / 13.03% / 90.95%
Xi et al. [ ] LBPNet | FERET / LFW | Cosine similarity | Complexities of CNN | High recognition accuracy | 97.80% / 94.04%
Khoi et al. [ ] PLBP | TDF / CF / LFW | MAP | Skewness in face image | Robust feature in frontal face | 5.50% / 9.70% / 91.97%
Laure et al. [ ] LBP and KNN | LFW / CMU-PIE | KNN | Illumination conditions | Robust | 85.71% / 99.26%
Bonnen et al. [ ] MRF and MLBP | AR (Scream) / FERET (Wearing sunglasses) | Cosine similarity | Landmark extraction fails or is not ideal | Robust to changes in facial expression | 86.10% / 95%
Ren et al. [ ] Relaxed LTP | CMU-PIE / Yale B | Chi-square distance | Noise level | Superior performance compared with LBP, LTP | 95.75% / 98.71%
Hussain et al. [ ] LPQ | FERET / LFW | Cosine similarity | Lot of discriminative information | Robust to illumination variations | 99.20% / 75.30%
Karaaba et al. [ ] HOG and MMD | FERET / LFW | MMD/MLPD | Low recognition accuracy | Aligning difficulties | 68.59% / 23.49%
Arigbabu et al. [ ] PHOG and SVM | LFW | SVM | Complexity and time of computation | Head pose variation | 88.50%
Leonard et al. [ ] VLC correlator | PHPID | ASPOF | The low number of reference images used | Robustness to noise | 92%
Napoléon et al. [ ] LBP and VLC | YaleB / YaleB Extended | POF | Illumination | Rotation + translation | 98.40% / 95.80%
Heflin et al. [ ] Correlation filter | LFW/PHPID | PSR | Some pre-processing steps | More effort on the eye localization stage | 39.48%
Zhu et al. [ ] PCA–FCF | CMU-PIE / FRGC2.0 | Correlation filter | Uses only linear methods | Occlusion-insensitive | 96.60% / 91.92%
Seo et al. [ ] LARK + PCA | LFW | Cosine similarity | Face detection | Reducing computational complexity | 78.90%
Ghorbel et al. [ ] VLC + DoG | FERET | PCE | Low recognition rate | Robustness | 81.51%
Ghorbel et al. [ ] uLBP + DoG | FERET | Chi-square distance | Robustness | Processing time | 93.39%
Ouerhani et al. [ ] VLC | PHPID | PCE | Power | Processing time | 77%
Lenc et al. [ ] SIFT | FERET / AR / LFW | A posteriori probability | Still far from perfect | Sufficiently robust on lower-quality real data | 97.30% / 95.80% / 98.04%
Du et al. [ ] SURF | LFW | FLANN distance | Processing time | Robustness and distinctiveness | 95.60%
Vinay et al. [ ] SURF + SIFT | LFW / Face94 | FLANN distance | Processing time | Robust in unconstrained scenarios | 78.86% / 96.67%
Calonder et al. [ ] BRIEF | – | KNN | Low recognition rate | Low processing time | 48%

Author / Techniques Used | Databases | Matching | Limitation | Advantage | Result
Linear Techniques
Seo et al. [ ] LARK and PCA | LFW | L2 distance | Detection accuracy | Reducing computational complexity | 85.10%
Annalakshmi et al. [ ] ICA and LDA | LFW | Bayesian classifier | Sensitivity | Good accuracy | 88%
Annalakshmi et al. [ ] PCA and LDA | LFW | Bayesian classifier | Sensitivity | Specificity | 59%
Hussain et al. [ ] LQP and Gabor | FERET / LFW | Cosine similarity | Lot of discriminative information | Robust to illumination variations | 99.2% / 75.3%
Gowda et al. [ ] LPQ and LDA | MEPCO | SVM | Computation time | Good accuracy | 99.13%
Z. Cui et al. [ ] BoW | AR / ORL / FERET | ASM | Occlusions | Robust | 99.43% / 99.50% / 82.30%
Khan et al. [ ] PSO and DWT | CK / MMI / JAFFE | Euclidean distance | Noise | Robust to illumination | 98.60% / 95.50% / 98.80%
Huang et al. [ ] 2D-DWT | FERET / LFW | KNN | Pose | Frontal or near-frontal facial images | 90.63% / 97.10%
Perlibakas and Vytautas [ ] PCA and Gabor filter | FERET | Cosine metric | Precision | Pose | 87.77%
Hafez et al. [ ] Gabor filter and LDA | ORL / C. YaleB | 2DNCC | Pose | Good recognition performance | 98.33% / 99.33%
Sufyanu et al. [ ] DCT | ORL / Yale | NCC | High memory | Controlled and uncontrolled databases | 93.40%
Shanbhag et al. [ ] DWT and BPSO | – | – | Rotation | Significant reduction in the number of features | 88.44%
Ghorbel et al. [ ] Eigenfaces and DoG filter | FERET | Chi-square distance | Processing time | Reduce the representation | 84.26%
Zhang et al. [ ] PCA and FFT | YALE | SVM | Complexity | Discrimination | 93.42%
Zhang et al. [ ] PCA | YALE | SVM | Recognition rate | Reduce the dimensionality | 84.21%
Fan et al. [ ] RKPCA | MNIST / ORL | RBF kernel | Complexity | Robust to sparse noises | –
Vinay et al. [ ] ORB and KPCA | ORL | FLANN matching | Processing time | Robust | 87.30%
Vinay et al. [ ] SURF and KPCA | ORL | FLANN matching | Processing time | Reduce the dimensionality | 80.34%
Vinay et al. [ ] SIFT and KPCA | ORL | FLANN matching | Low recognition rate | Complexity | 69.20%
Lu et al. [ ] KPCA and GDA | UMIST face | SVM | High error rate | Excellent performance | 48%
Yang et al. [ ] PCA and MSR | HELEN face | ESR | Complexity | Utilizes color, gradient, and regional information | 98.00%
Yang et al. [ ] LDA and MSR | FRGC | ESR | Low performances | Utilizes color, gradient, and regional information | 90.75%
Ouanan et al. [ ] FDDL | AR | CNN | Occlusion | Orientations, expressions | 98.00%
Vankayalapati and Kyamakya [ ] CNN | ORL | – | Poses | High recognition rate | 95%
Devi et al. [ ] 2FNN | ORL | – | Complexity | Low error rate | 98.5

Author / Technique Used | Database | Matching | Limitation | Advantage | Result
Fathima et al. [ ] GW-LDA | AT&T / FACES94 / MITINDIA | k-NN | High processing time | Illumination invariant and reduce the dimensionality | 88% / 94.02% / 88.12%
Barkan et al. [ ] OCLBP, LDA, and WCCN | LFW | WCCN | – | Reduce the dimensionality | 87.85%
Juefei et al. [ ] ACF and WLBP | LFW | – | Complexity | Pose conditions | 89.69%
Simonyan et al. [ ] Fisher + SIFT | LFW | Mahalanobis matrix | Single feature type | Robust | 87.47%
Sharma et al. [ ] PCA–ANFIS | ORL | ANFIS | Sensitivity-specificity | – | 96.66%
Sharma et al. [ ] ICA–ANFIS | ORL | ANFIS | – | Pose conditions | 71.30%
Sharma et al. [ ] LDA–ANFIS | ORL | ANFIS | – | – | 68%
Ojala et al. [ ] DCT–PCA | ORL / UMIST / YALE | Euclidean distance | Complexity | Reduce the dimensionality | 92.62% / 99.40% / 95.50%
Mian et al. [ ] Hotelling transform, SIFT, and ICP | FRGC | ICP | Processing time | Facial expressions | 99.74%
Cho et al. [ ] PCA–LGBPHS and PCA–Gabor wavelets | Extended Yale Face | Bhattacharyya distance | Illumination condition | Complexity | 95%
Sing et al. [ ] PCA–FLD | CMU / FERET / AR | SVM | Robustness | Pose, illumination, and expression | 71.98% / 94.73% / 68.65%
Kamencay et al. [ ] SPCA-KNN | ESSEX | KNN | Processing time | Expression variation | 96.80%
Sun et al. [ ] CNN–LSTM–ELM | OPPORTUNITY | LSTM/ELM | High processing time | Automatically learn feature representations | 90.60%
Ding et al. [ ] CNNs and SAE | LFW | – | Complexity | High recognition rate | 99%

Approaches | Databases Used | Challenges Handled
Local appearance-based techniques | TDF, CF1999, LFW, FERET, CMU-PIE, AR, Yale B, PHPID, YaleB Extended, FRGC2.0, Face94 | Various lighting conditions, facial expressions, and low resolution
Linear techniques | LFW, FERET, MEPCO, AR, ORL, CK, MMI, JAFFE, C. Yale B, Yale, MNIST, UMIST face, HELEN face, FRGC | Poses, conditions, scaling, facial expressions
Hybrid techniques | AT&T, FACES94, MITINDIA, LFW, ORL, UMIST, YALE, FRGC, Extended Yale, CMU, FERET, AR, ESSEX | –

Share and Cite

Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. Face Recognition Systems: A Survey. Sensors 2020, 20, 342. https://doi.org/10.3390/s20020342

Face Recognition Using Deep Learning on Raspberry Pi

Abdulatif Ahmed Ali Aboluhom, Ismet Kandilli, Face recognition using deep learning on Raspberry Pi, The Computer Journal, 2024, bxae066, https://doi.org/10.1093/comjnl/bxae066

Abstract
Facial recognition on resource-limited devices such as the Raspberry Pi poses a challenge due to inherent processing limitations. For real-time applications, finding efficient and reliable solutions is critical. This study investigated the feasibility of using transfer learning for facial recognition tasks on the Raspberry Pi and evaluated transfer learning that leverages knowledge from previously trained models. We compared two well-known deep learning (DL) architectures, InceptionV3 and MobileNetV2, adapted to face recognition datasets. MobileNetV2 outperformed InceptionV3, achieving a training accuracy of 98.20% and an F1 score of 98%, compared to InceptionV3’s training accuracy of 86.80% and an F1 score of 91%. As a result, MobileNetV2 emerges as a more powerful architecture for facial recognition tasks on the Raspberry Pi when integrated with transfer learning. These results point to a promising direction for deploying efficient DL applications on edge devices, reducing latency, and enabling real-time processing.

1. Introduction

Face recognition and individual identification have gained significant interest, with various methods available. Human vision identifies visual patterns through the eyes, which are then processed by the brain; similarly, computers analyze photos or videos as matrices of pixels, identifying which pixel groups represent specific objects. In face recognition, the computer's task is to accurately identify the person to whom the face data belongs.

Face recognition is a system that involves face detection, positioning, and identity verification. It uses algorithms to find face coordinates in images or videos, and uses deep learning (DL) to identify facial features. The process involves detecting faces and comparing them to a database. The face recognition process can be based on appearance or feature-based approaches, focusing on geometric features such as eyes, nose, eyebrows, and cheeks by Gurovich et al. [ 1 ]. Despite significant advancements in face recognition technology, some areas require refinement for real-world applications. Specialized cameras can improve image quality and address filtering and restructuring challenges by Jiang et al. [ 2 , 3 ].

Face recognition technology has been widely utilized for access control purposes by Vardhini et al. [ 4 ], security by Lander et al. [ 5 ], and finance. However, it is now being introduced in areas such as logistics, retail, smartphones, transportation, education, real estate, and network information security by Adjabi et al. [ 6 ]. Beyond fingerprinting, facial recognition technology stands as a paramount subject in computer vision and biometric systems by Jain et al. [ 7 ]. In recent times, facial recognition technology has emerged as a captivating field of study, especially for surveillance purposes in smart homes, cities by Lin et al. [ 8 ] and Sajjad et al. [ 9 ], and robotic systems by Bhavyalakshmi and Harish [ 10 ] and Quah and Ghazaly [ 11 ]. Advancements in artificial intelligence have led to a need for more accurate, flexible, and rapid recognition technology. New DL techniques have enabled high accuracy in recognizing individuals through digital imaging. Face recognition systems enable efficient processing of large volumes of images. An experimental application of deep neural networks (DNNs) in digital photography was implemented using a Raspberry Pi-based solution by Bajrami and Gashi [ 12 ].

Recently, DNNs, notably convolutional neural networks (CNNs), have demonstrated impressive classification performance in face recognition tasks. A face recognition system can be realized using a CNN model. Wang and Guo [ 13 ] developed a camera-based system that accurately recognizes human faces and names using a CNN model, even on Raspberry Pi computers, but their efficiency depends on accuracy and sensitivity. Pan et al. [ 14 ] demonstrated the success of DL in face recognition, but challenges remain due to a lack of labelled data. Advancements in face recognition have been made through abundant labelled data and transfer learning, which applies knowledge from one task to another, varying in homogeneity or heterogeneity depending on feature space similarity. Wang et al. [ 15 ] highlighted the challenge of face recognition using DL techniques in real-world conditions, including changes in lighting, facial expressions, and poses. They noted that systems may struggle to recognize partially obscured faces or those captured from non-frontal angles. This study sought to develop a fully automated system for face recognition. DL methods have been customized for this purpose. Using the proposed DL techniques, this study aimed to increase the accuracy and precision of face recognition systems.

2. Related works

Saabia et al. [ 16 ] introduced an advanced face recognition framework that combines preprocessing, feature extraction, and selection processes, augmented by the grey wolf optimization algorithm, and used the k-NN classifier. Ali et al. [ 17 ] presented a face detection method that integrates Haar cascades with exact skin, eye, and nose detection mechanisms, overcoming challenges like environmental conditions and device disparities, and incorporating local binary patterns for improved facial recognition in video streams. Ali et al. [ 18 ] proposed a model combining Haar cascade files with improved feature detection. Muhammad Sajid et al. [ 19 ] explored the use of deep convolutional neural networks (dCNN) for age-invariant face recognition using various architectures such as AlexNet, VGGNet, GoogLeNet, and ResNet. Ab Wahab et al. [ 20 ] developed a hybrid CNN-KNN facial expression recognition model on Raspberry Pi 4, demonstrating the feasibility of implementing complex models on limited hardware. Additionally, Bajrami and Gashi [ 12 ] proposed a cost-effective and scalable solution for facial recognition using a DNN on Raspberry Pi, addressing lighting conditions and image quality challenges. Gwyn et al. [ 21 ] showed that architectures such as VGG-16 and InceptionV3 stand out for their high accuracy and F1 scores, demonstrating the effectiveness of DL in improving facial recognition. Ariefwan et al. [ 22 ] compared CNN architectures, especially ResNet, MobileNetV2, and InceptionV3, for facial recognition systems. Dang, T. V. [ 23 ] introduced a new approach leveraging the ArcFace-based facial recognition model built into MobileNetV2 and gesture recognition. Moreover, Dang, T. V. [ 24 ] introduced an optimized model combining enhanced FaceNet with MobileNetV2 and SSD. By leveraging the strengths of different CNN architectures and integrating additional features, the field is moving towards more versatile, efficient, and accurate systems. The current research aims to contribute to this dynamic and, building on CNN models, break new ground in facial recognition technology using Raspberry Pi devices.

3.1. DL and Artificial Neural Network fundamentals

DL is a subset of machine learning in artificial intelligence (AI) that focuses on developing algorithms capable of autonomous learning and decision-making. Unlike traditional machine learning, DL requires large datasets for training. It relies primarily on DNNs, self-learning systems that process and filter data within interconnected layers. Artificial Neural Networks (ANN), as described by Bengio et al. [ 25 ], form the foundational structure of DL algorithms. In its most basic configuration, an ANN consists of three layers: the input layer, the initial layer responsible for data entry; the hidden layer, the second layer used for data processing; and the output layer, the third layer where decisions are made based on the prior data analysis and processing. Figure 1 provides an example of an ANN with two hidden layers.

An ANN with two hidden layers by Ognjanovski [ 26 ].

The neural network’s fundamental architecture, comprising input, hidden, and output layers, is inspired by biological neurons in the human brain. Each artificial neuron connects to others in preceding and succeeding layers, without direct links within its layer. The number of hidden layers indicates the network’s complexity and potential capabilities, as illustrated in Fig. 1 . DNNs feature multiple hidden layers, enhancing complexity with interconnected units and neurons. Similar to brain neurons, neural networks evolve through learning, adjusting their data processing and analysis to improve efficiency and accuracy. Neurons receive multiple inputs and produce a single output, with connections represented by weight vectors indicating their strength. Inputs are multiplied by weights and adjusted by a bias parameter before passing through an activation function, yielding the neuron’s output for the next layer. These connections undergo iterative updates to optimize final outputs, as detailed by Bengio et al. [ 25 ].
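As a toy illustration of this weighted-sum-plus-bias-plus-activation computation (not code from the paper), a single artificial neuron with a ReLU activation can be sketched in Python; the input, weight, and bias values below are arbitrary.

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, passed through ReLU."""
    pre_activation = np.dot(inputs, weights) + bias
    return np.maximum(0.0, pre_activation)   # ReLU keeps only non-negative responses

x = np.array([0.5, -1.2, 3.0])   # outputs of the previous layer
w = np.array([0.8, 0.1, -0.4])   # connection weights, adjusted during learning
b = 0.2                          # bias term

print(neuron_output(x, w, b))    # 0.0 here, because the weighted sum is negative
```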

3.2. Principles of convolution neural networks

A CNN is a specialized kind of DNN, particularly adept at processing visual images. DNNs generally have intricate architectures, necessitate vast datasets for training, and can comprise millions of parameters, which makes them computationally more demanding than conventional methods. Despite their prowess in visual recognition tasks, CNNs might not always outperform traditional methods in terms of speed and resource efficiency. Figure 2 illustrates the fundamental architecture of a CNN, which encompasses key components: input, convolutional, pooling, dense, and output layers.

CNN diagram by Phung and Rhee [ 27 ].

Table 1 provides a structured view of the different layers in a CNN, along with their respective functions and a description of what they do.

Layers in a CNN

Layer | Function | Description
Input Layer | Data entry | The first layer is where input image data is fed into the network.
Convolutional Layer | Feature extraction | Applies filters to the input to create a feature map, capturing local dependencies in the input.
Activation Layer | Non-linearity | Introduces non-linear properties to the system, allowing for complex patterns to be learned. Commonly uses ReLU (Rectified Linear Unit).
Pooling Layer | Dimensional reduction | Reduces the spatial size of the feature maps, thus reducing the number of parameters and computations in the network.
Fully Connected Layer | Classification | Neurons in this layer have full connections to all activations in the previous layer, as seen in regular Neural Networks, and are typically used for final classification.
Output Layer | Decision making | The final layer gives the output of the network, such as the class scores in classification tasks.

CNNs employ convolutional kernels, which are particularly adept at isolating local edge features. The primary motivation behind leveraging these convolutional mechanisms is to drastically minimize the number of parameters in play (in contrast to fully connected layers) while maintaining the proficiency to discern edge features on a local level. As a result, when pitted against DNNs, CNNs showcase enhanced efficiency during both the training and testing phases. A rudimentary convolutional kernel is illustrated in Fig. 3. For a CNN with a kernel size of 2 × 2 and an input size of 4 × 3, with a stride of 1, the output dimensions are calculated as 3 × 2. If a fully connected layer were used for generating six output values (as in the given example), it would entail 72 computations (i.e. 4 × 3 × 3 × 2). In contrast, the CNN would require only 24 computations (i.e. 3 × 2 × 2 × 2) to achieve the same, by Bengio et al. [ 25 ].

Convolutional kernel by Bengio et al. [ 25 ].

Here is the list of variables used in the operation in Fig. 3:

$a$–$l$: elements of the input matrix. These could represent pixel intensities in an image.

$w$–$z$: elements of the kernel matrix. These are the weights that will be learned during the training of a CNN.

$aw + bx + ey + fz$: the computation for the top-left value of the output matrix; the corresponding elements of the input matrix are multiplied by the kernel weights and summed.

To elaborate further, each element of the output matrix is calculated as follows:

For the top-left value in the output, you overlay the kernel on the top-left corner of the input matrix and compute:

$(aw)+(bx)+(ey)+(fz)$

For the next value to the right in the output (first row, second column), you shift the kernel one step to the right and compute:

$(bw)+(cx)+(fy)+(gz)$

This process continues across the entire input matrix to generate the output.

The output matrix is smaller than the input matrix because the kernel doesn’t slide beyond the edges of the input matrix. If we want the output matrix to be the same size as the input, we would use padding around the input matrix.
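The sliding-window computation described above is easy to reproduce in a few lines of NumPy. The sketch below (illustrative only) performs a "valid" convolution with stride 1 on a 3-row by 4-column input with a 2 × 2 kernel, matching the worked example; the numeric values simply stand in for the elements a–l and weights w–z.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid (no padding) 2D convolution with stride 1, as in the worked example."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))      # output shrinks by kernel size - 1
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = image[r:r + kh, c:c + kw]       # overlay the kernel on the input
            out[r, c] = np.sum(window * kernel)      # e.g. aw + bx + ey + fz at (0, 0)
    return out

image = np.arange(12, dtype=float).reshape(3, 4)     # stands in for elements a..l
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                     # stands in for weights w, x, y, z
result = conv2d_valid(image, kernel)
print(result.shape)   # (2, 3): the 4-wide, 3-tall input yields a 3 x 2 feature map
```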

CNNs undergo training by evaluating the accuracy of their predictions and making incremental adjustments to the model’s parameters to refine these predictions. The model’s performance is gauged using a loss function, which computes a ’loss’ value based on the deviation between the predicted and actual outcomes. To enhance the model, this loss is then backpropagated through the network, guiding the adjustments to its parameters. This iterative process of forward and backward propagation continues until the model’s performance plateaus or no longer shows significant improvement. To ensure the model generalizes well and doesn’t overfit the training data, it’s essential to assess its performance using a separate validation dataset, which represents unseen data, as noted by Schölkopf and Smola [ 28 ].

Automatic face recognition has always been a subject of great intrigue. However, the multifaceted nature of face recognition makes it a challenging endeavour. In our investigation, we designed a CNN model, keeping certain parameters constant, such as filter size, pooling layers, and convolutional layers, but varying the architectural depth. This was to discern the effects of both depth and filter size on face recognition outcomes.

The size of the output can be derived from several factors. Equation (1), which determines the width of the output $O$, is

$O = \frac{W - K + 2P}{S} + 1$   (1)

where:

$W$ is the width of the input.

$K$ is the kernel size.

$S$ is the stride.

$P$ is the amount of zero-padding.

$O$ is the width of the output.

The pooling layer operates on each depth slice (channel) of the input, reducing its spatial dimensions. In this design, a max-pooling layer with a 2 × 2 filter size $F$ was employed, using a stride $S$ of 2. This procedure sub-samples every depth slice of the input by a factor of 2 in both its width and height dimensions, effectively discarding 75% of the activations. Equation (2) gives the relationship between the input and output widths for such a pooling operation:

$O = \frac{W - F}{S} + 1$   (2)
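A minimal sketch of Equations (1) and (2), checked against the 4 × 3 input with a 2 × 2 kernel and the 2 × 2, stride-2 max pooling described above; the function names are illustrative, not from the paper.

```python
def conv_output_size(w, k, s=1, p=0):
    """Equation (1): spatial output size of a convolution layer."""
    return (w - k + 2 * p) // s + 1

def pool_output_size(w, f, s):
    """Equation (2): spatial output size of a pooling layer (no padding)."""
    return (w - f) // s + 1

# 4 x 3 input, 2 x 2 kernel, stride 1, no padding -> 3 x 2 feature map
print(conv_output_size(4, 2), conv_output_size(3, 2))   # 3 2
# 2 x 2 max pooling with stride 2 halves each spatial dimension, e.g. 64 -> 32
print(pool_output_size(64, 2, 2))                        # 32
```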

The Softmax function is used for multi-class classification problems and produces outputs between (0,1) that indicate the probability of each given input belonging to a class. Equation (3) below represents the Softmax function [ 25 ]:

$\sigma(z)_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}}$   (3)

$\sigma$: represents Softmax.

$z$: represents the input vector.

$e^{z_{i}}$: represents the standard exponential function of the input vector.

$k$: represents the number of classes in the multi-class classifier.

$e^{z_{j}}$: represents the standard exponential function of the output vector.
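A short NumPy sketch of Equation (3); subtracting the maximum score before exponentiation is a standard numerical-stability detail added here, not something stated in the paper.

```python
import numpy as np

def softmax(z):
    """Equation (3): convert raw class scores into probabilities that sum to 1."""
    shifted = z - np.max(z)        # improves numerical stability without changing the result
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

scores = np.array([2.0, 1.0, 0.1])   # raw outputs for a 3-class example
probs = softmax(scores)
print(probs, probs.sum())            # values in (0, 1) summing to 1.0
```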

3.3. Face recognition system procedure

The details of the approach are presented in the following steps:

Step 1: initially, a subset of image samples from a labelled dataset is chosen and added to the training set, while the remaining samples are used to form a test set.

Step 2: the training set is then fed to the CNN model for training.

Step 3: the test set is subsequently recognized using the trained network from the above step.

Step 4: the model and algorithm of the system are implemented on a Raspberry Pi.

Step 5: the face recognition procedure is carried out by the Raspberry Pi in conjunction with its attached camera. Figure 4 presents the block diagram outlining the face recognition system.

Block diagram of the face recognition system.

In Fig. 5 , the flowchart describes the facial recognition process using the Raspberry Pi 4. The process starts by loading the model, and then opening the webcam to take a photo. Images are pre-processed and analyzed for face detection. If no face is detected, it loops back to image capture. If a face is detected, it will extract and match features for facial recognition. After identification, it will display the results, then close the camera and end the process.
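A sketch of such a capture-detect-classify loop on the Raspberry Pi is given below, using OpenCV and a saved Keras model. The model file name, the class labels, OpenCV's Haar-cascade face detector, and the 0–1 pixel scaling are all assumptions made for illustration; the paper does not specify these implementation details.

```python
import cv2
import numpy as np
import tensorflow as tf

# Assumed artefacts: a trained Keras model saved to disk and the 10 class labels.
model = tf.keras.models.load_model("face_recognition_mobilenetv2.h5")
labels = [f"celebrity_{i}" for i in range(10)]

# OpenCV's bundled Haar cascade is used here as a stand-in face detector.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)                        # USB webcam attached to the Raspberry Pi
while True:
    ok, frame = cap.read()
    if not ok:
        continue                                 # loop back to image capture
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        face = cv2.resize(frame[y:y + h, x:x + w], (128, 128))   # match training size
        face = np.expand_dims(face.astype("float32") / 255.0, axis=0)
        probs = model.predict(face, verbose=0)[0]
        name = labels[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, name, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("Face recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):        # press q to close the camera and end
        break
cap.release()
cv2.destroyAllWindows()
```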

The flowchart of the face recognition system.

3.4. Transfer learning

Training a CNN from scratch on a massive dataset can be quite challenging. Instead of spending time, energy, and resources training all the layers of the model from random initial weights, it is more efficient to use a model pre-trained on a large dataset as a foundation for a new task. Moreover, training the system on a limited dataset can diminish the CNN model’s ability to generalize. By transferring learned knowledge from one task in one domain to another, faster and more effective results are possible, a method termed ’transfer learning’ by Pan and Yang [ 14 ]. The face recognition system utilizes the transfer learning method, leveraging a model with pre-existing learned features, to expedite the training process. Figure 6 illustrates the principle of transfer learning.

Principle of transfer learning by Nagrath et al. [ 29 ].

3.5. The proposed models

In this work, we utilized the principles of transfer learning to handle a face recognition problem by leveraging two pre-trained CNNs, MobileNetV2 and InceptionV3. In Table 2 , the number of parameters used in the model architecture is presented.

CNN models’ parameters used in face recognition.

Model | Parameters
MobileNetV2 | 2,257,984
InceptionV3 | 23.8 M

The proposed models are built upon the MobileNetV2 and InceptionV3 architectures as the base models. Additional layers are then added to augment their capabilities. They include a global average pooling layer, which condenses spatial information from the base models’ feature maps. Afterwards, a dense layer with 128 units utilizes the ReLU activation function. Finally, an output layer with softmax activation is added, particularly suited for tasks with multiple classes. In this case, the models are dealing with 10 classes for face recognition. Tables 3 and 4 provide an overview of an adapted MobileNetV2 and InceptionV3 architecture, emphasizing the transition from the input layer through layers, and ending with the FC layer and softmax output for classification.
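A sketch of how this head could be assembled on top of either base model with Keras is shown below. The ImageNet weights and the frozen base are assumptions for illustration; the paper states only that pre-trained models were used.

```python
import tensorflow as tf

def build_model(base_name="mobilenetv2", num_classes=10, input_shape=(128, 128, 3)):
    """Pre-trained base + global average pooling + Dense(128, ReLU) + softmax output."""
    if base_name == "mobilenetv2":
        base = tf.keras.applications.MobileNetV2(
            include_top=False, weights="imagenet", input_shape=input_shape)
    else:
        base = tf.keras.applications.InceptionV3(
            include_top=False, weights="imagenet", input_shape=input_shape)
    base.trainable = False                           # assumed: reuse pre-trained features as-is

    return tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),    # condense spatial feature maps
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model("mobilenetv2")
model.summary()
```

With the base frozen, only the pooling head and the two dense layers are learned on the celebrity images, which keeps the number of trainable parameters small; whether the authors froze the base or fine-tuned it end to end is not stated.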

MobileNetV2 architecture

Layer | Output size | Filters | Kernel size | Stride | Expansion factor (t)
Input | 128 × 128 × 3 | – | – | – | –
Convolution | 64 × 64 × 32 | 32 | 3 × 3 | 2 | –
Bottleneck | 64 × 64 × 16 | 16 | 3 × 3 | 1 | 1
Bottleneck | 32 × 32 × 24 | 24 | 3 × 3 | 2 | 6
Bottleneck | 32 × 32 × 24 | 24 | 3 × 3 | 1 | 6
Bottleneck | 16 × 16 × 32 | 32 | 3 × 3 | 2 | 6
Bottleneck | 8 × 8 × 64 | 64 | 3 × 3 | 1 | 6
Bottleneck | 8 × 8 × 96 | 96 | 3 × 3 | 1 | 6
Bottleneck | 4 × 4 × 160 | 160 | 3 × 3 | 2 | 6
Bottleneck | 4 × 4 × 320 | 320 | 3 × 3 | 1 | 6
Convolution (1 × 1) | 4 × 4 × 1280 | 1280 | 1 × 1 | 1 | –
Average Pooling | 1 × 1 × 1280 | – | – | – | –
Fully Connected (Dense) | 128 | – | – | – | –
Output (Softmax) | 10 | – | – | – | –

InceptionV3 architecture

Layer | Output size | Filters | Kernel size | Stride
Input | 128 × 128 × 3 | – | – | –
Convolution | 64 × 64 × 32 | 32 | 3 × 3 | 2
Convolution | 62 × 62 × 32 | 32 | 3 × 3 | 1
Convolution | 62 × 62 × 64 | 64 | 3 × 3 | 1
Max Pooling | 31 × 31 × 64 | – | 3 × 3 | 2
Convolution | 31 × 31 × 80 | 80 | 1 × 1 | 1
Convolution | 29 × 29 × 192 | 192 | 3 × 3 | 1
Max Pooling | 14 × 14 × 192 | – | 3 × 3 | 2
Inception modules | Varied | Varied | Varied | Varied
Average Pooling | 3 × 3 × 2048 | – | 3 × 3 | 1
Dropout | 2048 | – | – | –
Fully Connected | 128 | – | – | –
Output (Softmax) | 10 | – | – | –

In Table 5 , the training models’ setup is given. To reduce the time required for model convergence, the model batch size is increased to the extent that Raspberry Pi memory allows.

Model training setup

Model | Added dense layer | Batch size | Epochs | Input size
MobileNetV2 | Dense (128, ReLU) | 32 | 100 | 128 × 128
InceptionV3 | Dense (128, ReLU) | 32 | 100 | 128 × 128
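Compiling and fitting a model under this setup (batch size 32, 100 epochs, 128 × 128 inputs, and the Adam optimizer with a learning rate of 0.001 mentioned in the experimental section) might look like the following sketch. The MobileNetV2 head is repeated from the earlier sketch for completeness, and the data arrays are random placeholders with the shapes of the real 700/300 split, not the actual dataset.

```python
import numpy as np
import tensorflow as tf

# Same head as described in Section 3.5 (MobileNetV2 base shown here).
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(128, 128, 3))
base.trainable = False
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Random placeholders shaped like the real data: 700 training and 300 test images.
x_train = np.random.rand(700, 128, 128, 3).astype("float32")
y_train = np.random.randint(0, 10, size=700)
x_test = np.random.rand(300, 128, 128, 3).astype("float32")
y_test = np.random.randint(0, 10, size=300)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",   # integer labels for the 10 celebrity classes
    metrics=["accuracy"],
)
history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=100,       # as in Table 5
    batch_size=32,    # raised as far as Raspberry Pi memory allows
)
```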

Figure 7 represents a face recognition pipeline. An input image goes through a base model comprising a pre-trained model that we use as feature extraction layers. The extracted features are then passed through a fully connected (FC) layer with 128 nodes. The resulting features from the FC layer are fed into a softmax classifier for the final classification step, determining the identity of the person in the input image.

Classification model training.

3.6. Dataset

We worked with a dataset of 1000 facial images of celebrities. In this well-curated and evenly distributed dataset, 700 images were allocated for training and 300 for testing. The images are divided into 10 distinct categories, each representing a different celebrity and containing exactly 100 images. We used the Celebrity Face Image Dataset from Kaggle, available from [ 30 ] (the dataset was derived from the following source in the public domain: Vishesh Thakur (7 December 2022). Celebrity Face Image Dataset, Version 1. Retrieved 18 June 2024, from https://www.kaggle.com/datasets/vishesh1412/celebrity-face-image-dataset/data ). Table 6 describes the dataset features used in the face recognition system.

Dataset features used in face recognition system

Total Size | 1000 images
Image Resolution | 128 × 128 pixels
Number of Categories | 10 categories (one per celebrity)
Images per Category | 100 images per category
Train Data Split | Training set: 700 images (approximately 70%)
Test Data Split | Testing set: 300 images (approximately 30%)
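One way to realize the 70/30 split in Table 6 with Keras utilities is sketched below. The directory layout (one folder per celebrity) and the validation_split mechanism are assumptions about how the downloaded Kaggle dataset might be organized and loaded, not a description of the authors' exact pipeline.

```python
import tensorflow as tf

# Assumed layout: celebrity_faces/<celebrity name>/<image>.jpg, one folder per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "celebrity_faces",
    validation_split=0.3,        # 300 of the 1000 images held out for testing
    subset="training",
    seed=42,
    image_size=(128, 128),       # resolution used throughout the paper
    batch_size=32,
)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "celebrity_faces",
    validation_split=0.3,
    subset="validation",
    seed=42,
    image_size=(128, 128),
    batch_size=32,
)
print(train_ds.class_names)      # the 10 celebrity classes
```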

3.7. Raspberry Pi computer

The Raspberry Pi, a compact single-board computer, can operate a comprehensive operating system. Renowned for its durability, it can function continuously, akin to a server-grade machine, while maintaining minimal power consumption. The heat emitted by its CPU is virtually negligible, making it an efficient choice for various applications. Its utility spans a broad spectrum, from home security systems and robot controllers to serving as a desktop PC, media centre, and web server, among others. The Raspberry Pi’s versatility is particularly evident in tasks that do not demand extensive processing power. Table 7 describes the features of the Raspberry Pi 4.

Raspberry Pi 4 features

Launch Date | Q2’19
Cores | 4
Threads | 4
Processor frequency | 1.50 GHz
Power consumption | 2.7 W – 6.4 W
Memory | 4 GB
Processor | RPi4 B
Dimension | 85 × 49 × 1.8 (in cm)

3.8. USB webcam

A webcam is a digital camera that can be connected to a computer and allows the user to transmit images directly over the Internet to various parts of the world. These cameras can be connected to computers using different methods, such as USB ports or Wi-Fi. In this paper, we have used the Everest SC-HD03 model webcam, which has a resolution of 1080p and offers excellent image quality.

4. Experimental results

In this study, we utilized transfer learning techniques to train a model specifically for celebrity face recognition. We adopted the pre-trained MobileNetV2 and InceptionV3 models as base models and adapted them to the specialized dataset of celebrity images. The dataset comprised 1000 images, equally distributed across 10 different classes, where each class represented a different celebrity. We partitioned the dataset into training and test subsets, using 700 images for training and 300 images for testing. We set up the ADAM optimizer with a learning rate of 0.001. As shown in Fig. 8, we trained the MobileNetV2 model for 100 epochs. After training, the model achieved a notable accuracy rate of 97.67%, meaning that it correctly identified the celebrity in approximately 97.67% of the images in the test set. This high accuracy is a testament to the model’s capability to extract meaningful features from face images and correctly associate them with the corresponding celebrity. The MobileNetV2 model reported a loss value of 5.5% on the test set.

Accuracy and loss of the MobileNetV2 model.

Figure 9 displays the accuracy of the InceptionV3 model. After 100 epochs, the model achieved an accuracy of 91.00%, meaning that it correctly identified celebrities in the test set approximately 91.00% of the time. This was lower than the accuracy we achieved with the MobileNetV2 model. Nevertheless, the result indicates that the model effectively extracts significant features from face images and successfully matches these features with the respective celebrities. The InceptionV3 model recorded a loss of 33.59% on the test set.

Accuracy and loss of the InceptionV3 model.

Table 8 shows the accuracy and loss results of the models. After training on the dataset, InceptionV3 achieved a training accuracy of 86.80% and a training loss of 45.48%; its validation accuracy is 91.00% and its validation loss is 33.59%. The model was trained for 100 epochs and took 186 minutes and 16.12 seconds on the Raspberry Pi 4. In contrast, MobileNetV2 outperformed InceptionV3 with a training accuracy of 98.20% and a training loss of 5.73%; its validation accuracy is 97.67% and its validation loss is 5.53%. It was also trained for 100 epochs but was faster, taking 102 minutes and 43.46 seconds on the Raspberry Pi 4. The higher accuracy and lower loss indicate that MobileNetV2 performed better on the dataset and might be a better choice for the face recognition task. This difference underscores the suitability of MobileNetV2 for edge computing devices, where resources are limited and efficiency is paramount.

Accuracy, loss, and time spent: results of the Models

Model | Training accuracy (%) | Training loss (%) | Validation accuracy (%) | Validation loss (%) | Training time (Raspberry Pi 4)
InceptionV3 | 86.80 | 45.48 | 91.00 | 33.59 | 186 minutes and 16.12 seconds
MobileNetV2 | 98.20 | 5.73 | 97.67 | 5.53 | 102 minutes and 43.46 seconds

Confusion matrix of models.

Figure 10 shows the confusion matrix of the models. The InceptionV3 model’s confusion matrix shows how often the model correctly predicts the class versus when it makes an error. For example, Angelina Jolie was correctly classified 29 times but was also incorrectly predicted as Megan Fox one time. Similarly, Hugh Jackman was misclassified as Johnny Depp once and as Leonardo DiCaprio once. The model appears to have difficulty distinguishing between some individuals. The MobileNetV2 model’s confusion matrix shows good predictions. For example, Brad Pitt was correctly classified 29 times and misclassified only once as Robert Downey Jr. In most cases, each individual was correctly classified at least 29 times out of 30, indicating a high level of performance. In general, the MobileNetV2 model displays better performance with higher correct classification rates and fewer confusions between the classes compared to the InceptionV3 model.

Figure 11. Classification report of models.

Figure 12. ROC curves of models.

Figure 11 shows the classification report of the models, including precision, recall, F1 score, and support for each class, as well as the overall averages for the InceptionV3 and MobileNetV2 models. The InceptionV3 model's report shows an accuracy of 0.91, indicating that the model made correct predictions 91% of the time across all classes. The MobileNetV2 model shows generally higher precision, recall, and F1 scores than the InceptionV3 model, indicating that it may be the more accurate model; it achieves an accuracy of 0.98.

Figure 12 shows the ROC curves of the models. The InceptionV3 model showed high AUC values for most classes, indicating high test accuracy and strong discriminative ability between positive and negative classes. The MobileNetV2 model showed an essentially ideal ROC curve, indicating near-perfect class separability. Both models were trained using the same dataset and experimental setup, ensuring a fair comparison; the comparative results are presented in Table 9.
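As a minimal sketch, the per-class metrics summarized in Figs. 10–12 (confusion matrix, classification report, and one-vs-rest ROC/AUC) could be reproduced with scikit-learn as shown below; the `model` and `test_ds` objects are assumed to come from the training sketch above.

```python
# Illustrative evaluation sketch using scikit-learn; `model` and `test_ds`
# are assumed to come from the earlier training sketch.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize

y_true, y_prob = [], []
for images, labels in test_ds:
    y_true.append(labels.numpy())
    y_prob.append(model.predict(images, verbose=0))
y_true = np.concatenate(y_true)
y_prob = np.concatenate(y_prob)
y_pred = y_prob.argmax(axis=1)

print(confusion_matrix(y_true, y_pred))        # counts per (actual, predicted) class
print(classification_report(y_true, y_pred))   # precision, recall, F1, support

# One-vs-rest ROC curve and AUC for each celebrity class
y_true_bin = label_binarize(y_true, classes=list(range(10)))
for c in range(10):
    fpr, tpr, _ = roc_curve(y_true_bin[:, c], y_prob[:, c])
    print(f"class {c}: AUC = {auc(fpr, tpr):.3f}")
```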

Table 9. Comparison of face recognition accuracies and F1 scores of the proposed and existing models.

Model | Training accuracy (%) | Validation accuracy (%) | F1 score (%)
VGG19 [31] | 77.80 | 83.33 | 83
Xception [21] | 94.40 | 96.67 | 97
VGG16 [21] | 82.80 | 82.67 | 83
InceptionV3 (proposed) | 86.80 | 91.00 | 91
MobileNetV2 (proposed) | 98.20 | 97.67 | 98

Table 9 provides comparative results of facial recognition models and describes the effectiveness of the InceptionV3 and MobileNetV2 models relative to existing models. The training accuracy of our proposed InceptionV3 model is a commendable 86.80%, and its validation accuracy and F1 score are slightly higher at 91%. Meanwhile, the MobileNetV2 model features a high training accuracy of 98.20%, a validation accuracy of 97.67%, and an F1 score of 98%, demonstrating a robust balance between precision and recall and effective generalization. The VGG19 model referenced in [31] achieves a training accuracy of 77.80%, which is relatively modest compared with the other models. Its validation accuracy stands at 83.33%, indicating a slight improvement during validation, and its F1 score is 83%. The Xception model [21] demonstrates a significant improvement over VGG19, with a training accuracy of 94.40% complemented by an even higher validation accuracy of 96.67%, suggesting strong generalization capabilities; its F1 score is 97%. Finally, the VGG16 model [21], like VGG19, shows a training accuracy of 82.80%. However, its validation accuracy drops slightly to 82.67%, indicating potential overfitting or challenges in generalization; its F1 score is 83%. VGG19 and VGG16 offer consistent performance, but they are outpaced by more advanced architectures such as Xception, InceptionV3, and MobileNetV2. The Xception model provides balanced, high-performance output, but MobileNetV2 stands out as the top performer across all metrics.

This real-time testing provided evidence of the model’s commendable performance and its potential for various face recognition tasks on the Raspberry Pi. Figure 13 displays the test results, which show excellent precision in successfully classifying the faces of celebrities.

The study evaluated the performance of two models, InceptionV3 and MobileNetV2, on the Raspberry Pi 4. InceptionV3 demonstrated high precision and recall, resulting in an F1 score of 91.9% and an accuracy of 91.9%. In contrast, MobileNetV2 exhibited even higher precision and recall, achieving an F1 score of 98.9% and an improved accuracy of 97.67%. MobileNetV2 showed superior performance overall, with a training accuracy of 98.20% compared with 86.80% for InceptionV3. Additionally, MobileNetV2 had lower loss values, indicating better generalization to new data. One of the strengths of MobileNetV2 is its high efficiency and excellent performance on lightweight devices such as the Raspberry Pi. However, limitations related to the Raspberry Pi 4's computing power were noted, with InceptionV3 requiring a longer training time (186 minutes) than MobileNetV2 (102 minutes). The training time can be further tuned by adjusting the batch size. The study highlights MobileNetV2 as the preferred choice for face recognition tasks because of its high efficiency and accuracy. These findings align with optimization efforts for lightweight devices, ensuring efficiency without sacrificing accuracy.

Figure 13. Test results of the face recognition system.

In conclusion, MobileNetV2 outperformed InceptionV3 across all major metrics on the Raspberry Pi 4, making it the optimal choice for applications on lightweight devices. Future work will focus on improving the models' effectiveness by expanding the dataset, including incorporating new classes and adding more images per class, allowing the models to learn new features across a wider range of conditions and scenarios. Integrating Raspberry Pi and IoT technologies will facilitate efficient data collection and processing, allowing real-time updates and enhanced scalability. Additionally, fraud and impersonation issues will be addressed through face anti-spoofing in the real-time system.

The images of celebrities in this study have been sourced from publicly available materials. We have provided proper attribution as referenced in the attached citation. No personal or private images were used, and all images were employed respectfully and ethically strictly for this study.

The authors declare that there are no known financial or personal conflicts that could have influenced the results of this study.

Data derived from a source in the public domain. The data underlying this article are available in Kaggle at https://www.kaggle.com/datasets/vishesh1412/celebrity-face-image-dataset . The datasets were derived from the following source in the public domain: Vishesh Thakur (7 December 2022). Celebrity Face Image Dataset, Version 1. Retrieved 18 June 2024, from https://www.kaggle.com/datasets/vishesh1412/celebrity-face-image-dataset .

Gurovich   Y . et al.    DeepGestalt - identifying rare genetic syndromes using deep learning . CoRR . abs/1801.07637 .

Jiang   K , Wang   Z , Yi   P . et al.    Edge-enhanced GAN for remote sensing image superresolution . IEEE Trans. Geosci. Remote Sens.   2019 ; 57 : 5799 – 5812 . https://doi.org/10.1109/TGRS.2019.2902431 .


Jiang   K , Wang   Z , Yi   P . et al.    ATMFN: Adaptive-threshold-based multi-model fusion network for compressed face hallucination . IEEE Trans. Multimed.   2020 ; 22 : 2734 – 2747 . https://doi.org/10.1109/TMM.2019.2960586 .

Vardhini   PH , Reddy   SPRD , Parapatla   VP . Facial recognition using OpenCV and Python on Raspberry Pi . In: Proceedings of the 2022 International Mobile and Embedded Technology Conference (MECON) , 480 – 485, IEEE, Noida, India .

Lander   K , Bruce   V , Bindemann   M . Use-inspired basic research on individual differences in face identification: implications for criminal investigation and security . Cogn. Res.   2018 ; 3 : 1 – 13 . https://doi.org/10.1186/s41235-018-0115-6 .

Adjabi   I , Ouahabi   A , Benzaoui   A . et al.    Past, present, and future of face recognition: a review . Electronics   2020 ; 9 : 1188 . https://doi.org/10.3390/electronics9081188 .

Jain   AK , Deb   D , Engelsma   JJ . Biometrics: trust, but verify . IEEE Trans. Biom. Behav. Identity Sci.   2022 ; 4 : 303 – 323 . https://doi.org/10.1109/TBIOM.2021.3115465 .

Lin   J-J , Huang   S-C . The implementation of the visitor access control system for the senior citizen based on the LBP face recognition . In: Proceedings of the 2017 International Conference on Fuzzy Theory and Its Applications (IFUZZY) , 1 – 6 , IEEE, Pingtung, Taiwan.

Sajjad   M , Nasir   M , Muhammad   K . et al.    Raspberry Pi assisted face recognition framework for enhanced law-enforcement services in smart cities . Future Gener. Comput. Syst. 2020; 108 : 995 – 1007 . https://doi.org/10.1016/j.future.2017.11.013 .

Bhavyalakshmi   R , Harish   B . Surveillance robot with face recognition using Raspberry Pi . Int. J. Eng. Res. Technol.   2020 ; V8 : 648–652. https://doi.org/10.17577/IJERTV8IS120298 .

Shen   Q , Md Ghazaly   M . Development and analysis of face recognition system on a mobile robot environment . J. Mech. Eng.   2020 ; 15 : 169 – 189 .

Bajrami   X , Gashi   B . Face recognition with Raspberry Pi using deep neural networks . Int. J. Comput. Vis. Robot.   2022 ; 12 : 177 – 193 . https://doi.org/10.1504/IJCVR.2022.121156 .

Wang   H , Guo   L . Research on face recognition based on deep learning . In: Proceedings of the 2021 3rd International Conference on Artificial Intelligence and Advanced Manufacture (AIAM) , 540 – 546 , IEEE, Manchester, United Kingdom.

Pan   SJ , Yang   Q . A survey on transfer learning . IEEE Trans. Knowl. Data Eng.   2010 ; 22 : 1345 – 1359 . https://doi.org/10.1109/TKDE.2009.191 .

Wang   M , Deng   W . Deep face recognition: a survey . Neurocomputing   2021 ; 429 : 215 – 244 . https://doi.org/10.1016/j.neucom.2020.10.081 .

Saabia   AA-B , El-Hafeez   T , Zaki   AM . Face recognition based on Grey wolf optimization for feature selection . Proc. Int. Conf. Adv. Intell. Syst. Inform. (AISI)   2018 ; 845 : 273 – 283 . https://doi.org/10.1007/978-3-319-99010-1_25 .

Ali   AA , El-Hafeez   T , Mohany   Y . A robust and efficient system to detect human faces based on facial features . Asian J. Res. Comput. Sci.   2019 ; 2 : 1 – 12 . https://doi.org/10.9734/ajrcos/2018/v2i430080 .

Ali   AA , El-Hafeez   TA , Mohany   YK . An accurate system for face detection and recognition . J. Adv. Math. Comput. Sci.   2019 ; 33 : 1 – 19 . https://doi.org/10.9734/jamcs/2019/v33i330178 .

Sajid   M , Ali   N , Ratyal   NI . et al.    Deep learning in age-invariant face recognition: a comparative study . Comput. J.   2022 ; 65 : 940 – 972 . https://doi.org/10.1093/comjnl/bxaa134 .

Ab Wahab   MN , Nazir   A , Ren   ATZ . et al.    Efficientnet-lite and hybrid CNN-KNN implementation for facial expression recognition on Raspberry Pi . IEEE Access   2021 ; 9 : 134065 – 134080 . https://doi.org/10.1109/ACCESS.2021.3113337 .

Gwyn   T , Roy   K , Atay   M . Face recognition using popular deep net architectures: a brief comparative study . Future Internet   2021 ; 13 : 164 . https://doi.org/10.3390/fi13070164 .

Ariefwan   MRM , Diyasa   IGSM , Hindrayani   KM . InceptionV3, ResNet50, ResNet18 and MobileNetV2 performance comparison on face recognition classification . Literasi Nusantara   2021 ; 4 : 1 – 10 .

Dang   T-V . Smart home management system with face recognition based on ArcFace model in deep convolutional neural network . J. Robot. Control   2022 ; 3 : 754 – 761 . https://doi.org/10.18196/jrc.v3i6.15978 .

Dang   T-V . Smart attendance system based on improved facial recognition . J. Robot. Control   2023 ; 4 : 46 – 53 . https://doi.org/10.18196/jrc.v4i1.16808 .

Goodfellow   I , Bengio   Y , Courville   A . Deep Learning . MIT Press , Cambridge, United States. http://www.deeplearningbook.org .


Ognjanovski   G . Everything you need to know about neural networks and backpropagation—machine learning easy and fun   Towards Data Science . https://towardsdatascience.com/everything-you-need-to-know-about-neural-networks-and-backpropagation-machine-learning-made-easy-e5285bc2be3a , Accessed: 2024-06-18 .

Phung   VH , Rhee   EJ . A deep learning approach for classification of cloud image patches on small datasets . J. Inf. Commun. Converg. Eng.   2023 ; 16 : 173 – 178 .

Schölkopf   B , Smola   AJ . Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond . MIT press, Cambridge, United States .

Nagrath   P , Jain   R , Madan   A . et al.    SSDMNV2: a real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2 . Sustain. Cities Soc.   2021 ; 66 : 102692 . https://doi.org/10.1016/j.scs.2020.102692 .

Thakur   V . Celebrity face image dataset, version 1 . Retrieved June 18, 2024 from   https://www.kaggle.com/datasets/vishesh1412/celebrity-face-image-dataset/data .

Ahmed   T , Das   P , Ali   MF . et al.    A comparative study on convolutional neural network based face recognition . In: Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT) , 1 – 5 , IEEE, Kharagpur, India.


Student Attendance Monitoring System Using Face Recognition

7 Pages Posted: 24 May 2021

E CHARAN SAI

Jain University, Faculty of Engineering & Technology, School of Engineering and Technology

SHAIK ALTHAF HUSSAIN

Amara Shyam

Date Written: May 22, 2021

There is no reason that a critical educational practice like attendance should be handled in the old, tedious manner in this age of rapidly evolving technologies. In the conventional method, it is difficult to manage large groups of students in a classroom; entering data into a system manually takes time and carries a high risk of error. Real-time face recognition is a practical method for handling the daily attendance of a large number of students. Many algorithms and techniques have been developed to improve face recognition performance, but our proposed model employs the Haar cascade classifier to determine the positive and negative characteristics of the face, together with the LBPH (Local Binary Pattern Histogram) algorithm for face recognition, all implemented in Python with the OpenCV library. The user interface is built with the tkinter GUI toolkit.

Keywords: Local Binary Pattern Histogram (LBPH), Face Detection, Face Recognition, Haar Cascade Classifier, Python, Student Attendance.
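A minimal OpenCV sketch of the Haar-cascade detection plus LBPH recognition pipeline described in the abstract is shown below. It is not the authors' code: it requires the opencv-contrib-python package for the `cv2.face` module, and the model file name, image source, and label mapping are illustrative assumptions.

```python
# Illustrative Haar-cascade + LBPH sketch (not the authors' code).
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
recognizer = cv2.face.LBPHFaceRecognizer_create()
recognizer.read("lbph_attendance_model.yml")   # assumed pre-trained LBPH model file

frame = cv2.imread("classroom.jpg")            # or a frame from cv2.VideoCapture(0)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Haar cascade returns face bounding boxes; LBPH predicts a student ID per face crop.
for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    student_id, distance = recognizer.predict(gray[y:y + h, x:x + w])
    print(f"student {student_id} recognized (LBPH distance {distance:.1f})")
```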


  • Open access
  • Published: 13 September 2024

Dense pedestrian face detection in complex environments

  • Qiang Gao 1 ,
  • Bingru Ding 2 ,
  • Yinghong Xie 2 &
  • Xiaowei Han 1  

Scientific Reports, volume 14, Article number: 21460 (2024)


  • Computer science
  • Mathematics and computing

To address the problem of dense crowd face detection in complex environments, this paper proposes a face detection model named Deep and Compact Face Detection (DCFD), which adopts an improved lightweight EfficientNetV2 network to replace the backbone network of RetinaFace. A large kernel attention mechanism is introduced to address the face detection task more accurately, and an improved efficient channel attention (ECA) mechanism is added after the backbone network to further improve performance. The feature fusion module is an improved neural architecture search feature pyramid network (NAS-FPN) that significantly improves face detection accuracy in different scenes. To balance the training of positive and negative samples, we replace the traditional cross-entropy loss function with the focal loss function. In different environments, the DCFD algorithm shows efficient face detection performance. The algorithm provides not only a feasible and effective solution to the problem of face detection in dense groups but also an important basis for improving the accuracy of face detection models in practical applications.


Introduction

Recently, with the widespread application of artificial intelligence across various fields and the emergence of deep learning, face recognition technology has become seamlessly integrated into products and daily life. In this domain, the face detection module, as a pivotal component, has experienced notable progress and achievements. Face detection serves as a crucial element in the realm of face recognition. Only by simultaneously detecting the face and extracting relevant content can it be applied to meet practical needs. However, face detection continues to encounter significant challenges in complex environments, characterized by factors such as intricate backgrounds, low resolution, invisibility, and insufficient signals—elements that are beyond our control.

In the early stages of face recognition, algorithms relied on template matching techniques, employing a predefined face template image that was compared against various locations in the detection image to determine the presence or absence of a face. For example, Rowley et al. 1 proposed a neural network-based face detection algorithm trained on 20 × 20 pixel windows 2 . Although this approach achieves high accuracy, it suffers from relatively slow processing speeds. In 1997, Margineantu et al. 3 proposed a face recognition algorithm within the framework of AdaBoost, a machine learning method based on probably approximately correct (PAC) learning theory. In 2004, Viola and Jones 4 designed a groundbreaking face detection algorithm using simple Haar-like features and a cascading AdaBoost classifier. This method, known as the Viola‒Jones (VJ) framework, represents a major breakthrough in the field of face detection, although its processing speed was also relatively slow.

With the development of artificial intelligence, deep learning technology has been applied to face detection. Face detection algorithms in deep learning can be broadly divided into two categories: one-stage network detection algorithms and two-stage network detection algorithms. The one-stage network detection algorithm directly employs neural networks for object detection; representative examples include the YOLO 5 , 6 , 7 series (V1–V7), SSD 8 , MTCNN 9 , and RetinaFace 10 , which achieve a relatively good balance between speed and accuracy. The two-stage network detection algorithm first generates candidate regions and subsequently performs object prediction with a second network; representative models include the R-CNN 11 , fast region-based convolutional network (Fast R-CNN) 12 , Faster R-CNN 13 , and spatial pyramid pooling network (SPP-Net) 14 . While these algorithms are characterized by high detection accuracy, their detection speed is relatively slow because region proposal and object prediction are performed in two distinct phases.

In recent years, with the further development of deep learning technology, the efficiency and detection accuracy of various algorithms have been fully balanced, and notable progress has been made through the introduction of new methods. In 2017, Yang Shuo and Luo Ping 15 proposed a face detection algorithm based on deep convolutional neural networks, named Faceness-Net. This method initially detects local features of the face, uses multiple classifiers based on deep convolutional networks to score each facial part (nose, eyes, etc.), and subsequently combines these scores to determine the most likely face region. Following this, the CNN is trained to further enhance the effectiveness of detection. In 2020, Niu Zuodong et al. 16 improved the RetinaFace algorithm by incorporating an attention mechanism, achieving outstanding performance in the task of mask detection.

In 2021, Li Yanling et al. 17 enhanced the O-Net and R-Net modules of the MTCNN, optimizing the image candidate boxes and classification confidence through the Better-NMS algorithm. This addresses the issue of missed detection in candidate boxes when the intersection over union (IoU) value exceeds the preset threshold. In 2022, Yuan Chao et al. 18 addressed the accuracy of face detection algorithms for indoor security. They replaced the backbone network of YOLO-v4 with a deep separable residual network and introduced an attention mechanism to adaptively adjust channel features and spatial feature weights. Additionally, in the same year, Bochkovskiy et al. proposed the YOLOv7 model to further increase the accuracy of face detection. Alibaba DAMO Academy has also open-sourced its lightweight face detection model, DamoFD, and the new face recognition framework, TransFace. The DamoFD model, with its lightweight and high-performance characteristics, provides a solution to face detection on mobile devices. Moreover, the TransFace framework 19 has further improved the accuracy and robustness of face recognition by introducing deep learning techniques. Although both the accuracy and speed of face detection have improved significantly in recent years, there is still room for improvement. The Better-NMS algorithm notably enhances the accuracy of face detection through optimizing the candidate box selection. However, this advancement may increase the computational complexity, thereby affecting the overall speed of face detection. Moreover, face detection in real-world scenarios may encounter challenges such as occlusion or extreme poses, which can lead to false positives or missed detections by the model. Advanced models such as YOLOv7 may adopt more complex network structures and utilize more parameters. While these strategies can enhance model performance, they also incur additional costs. Therefore, while pursuing high-precision face detection, we must also balance the computational complexity and storage requirements of the model.

To address the above problems, a lightweight and compact method named DCFD is proposed. The DCFD algorithm balances computational complexity and storage requirements while ensuring fast detection and maintaining a high level of detection accuracy. The improved RetinaFace model is used for pedestrian face detection. The EfficientNetV2 network serves as the backbone, and a large kernel attention mechanism is introduced to better fuse context information and improve the utilization of the fused feature information. By reconstructing the feature fusion module and using a focal loss function, small target faces in complex environments are detected with high accuracy. The advantages of the proposed algorithm are as follows:

This paper uses the lightweight EfficientNetV2 network and introduces large kernel attention (LKA) to better fuse local context information and consider long-range dependencies so that the model focuses more on facial features.

This paper optimizes the loss function, using the focal loss function instead of the cross-entropy loss function, to balance the training of positive and negative samples and reduce the loss of easy-to-train positive samples.

An improved efficient channel attention mechanism (ECA) is introduced between the backbone network feature extraction network and the feature fusion module to improve the utilization of feature information in the feature fusion module.

This paper reconstructs the feature fusion module via the improved NAS-FPN image pyramid instead of the original feature pyramid network (FPN). This change improves the detection accuracy for small faces and enhances the robustness of detection in dense crowd environments.

Model network architecture

The RetinaFace network conducts face detection on pixels of varying sizes in different orientations through self-supervised and jointly supervised multitask learning. The network model comprises four components: the backbone extraction network, FPN, single-stage headless (SSH) feature extraction network, and detection layer (Head). The structural diagram of the network is depicted in Fig.  1 .

Figure 1. RetinaFace network structure diagram.

First, the training dataset is fed into the MobileNet0.25 backbone network, and the selected output feature maps are labeled C2–C5. After the feature maps are extracted, they are fused: bilinear upsampling is used to bring adjacent levels to the same size 20 , and the two feature map matrices are added to obtain the feature pyramid structure. Second, the feature maps of each selected layer are used as input to the context module. Finally, after the classification and regression branches are processed, the algorithm obtains the final prediction result. Figure 2 shows the main flow of the RetinaFace algorithm.

Figure 2. Main process of the RetinaFace algorithm.
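A minimal PyTorch sketch of the top-down fusion step just described (1 × 1 lateral convolutions, bilinear upsampling, and element-wise addition) is given below; the channel counts and feature-map sizes are illustrative, not the exact RetinaFace configuration.

```python
# Minimal top-down FPN fusion sketch (illustrative channel counts, not RetinaFace's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), out_channels=64):
        super().__init__()
        # 1x1 lateral convolutions bring every level to the same channel count
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.laterals[2](c5)
        # Bilinear upsampling makes adjacent levels the same spatial size,
        # then element-wise addition fuses the feature maps.
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2,
                                                  mode="bilinear", align_corners=False)
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2,
                                                  mode="bilinear", align_corners=False)
        return p3, p4, p5

fpn = TinyFPN()
feats = fpn(torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20))
print([f.shape for f in feats])
```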

In this paper, we propose the DCFD model on the basis of the RetinaFace model. Recognizing the limited generalizability and low robustness of RetinaFace, this model incorporates design improvements in six key aspects:

Input: the image to be detected is input.

Backbone Network: The improved EfficientNetV2 21 is utilized as the backbone network, and the large kernel attention is incorporated.

Feature Pyramid: NAS-FPN 22 is applied to fuse features from different levels, and feature extraction capabilities are enhanced.

Prediction: Three types of predictions—classification, face regression boxes, and facial key point locations—are conducted.

Decoding: The predictions from the previous step are adjusted through the decoding process, and the position of the facial frame and key points are refined.

NMS: EIOU is implemented as the NMS method to effectively remove highly overlapping prediction boxes.

The DCFD algorithm enhances the backbone network, improving the expression and learning capabilities of the network while reducing the storage space and memory occupation of the model. This optimization is designed to enhance the generalization ability and information-gathering capability of the model for effective context modeling. During the feature pyramid reconstruction process, the network optimizes the search process for vertical field issues, thus enhancing the overall performance of the network. Through the prediction and decoding of images, our algorithm exhibits heightened sensitivity to the accuracy of object bounding boxes, enabling it to better handle inputs of varying sizes and generate more precise evaluation models.

The effective collaboration of the main components in this model achieves accurate face recognition in dense crowds. The network architecture of the DCFD is illustrated in Fig.  3 .

Figure 3. DCFD network architecture.

The SSH module comprises three main components: a single 3 × 3 convolution on the left, a 5 × 5 convolution replaced by two 3 × 3 convolutions in the middle, and a 7 × 7 convolution replaced by three 3 × 3 convolutions on the right. The output of the Head layer includes feature maps of sizes 80 × 80, 40 × 40, and 20 × 20. The first feature map is utilized for SoftMax-based binary classification, the second feature map is employed for face box regression, the third feature map is dedicated to facial regression keypoints, and the prior box is adjusted to capture facial keypoint information.

By stacking multiple small convolution kernels, the same receptive field can be equivalently achieved as a large convolution kernel, which can reduce not only the number of network parameters but also the computational complexity to effectively accelerate the training and inference process of the network. In addition, small kernels are more convenient to implement and optimize in hardware. Adopting multiple small convolution kernels means that more nonlinear activation functions are introduced into the network, which helps increase the expressive power of the network. The SSH method further improves the detection performance of small faces by skillfully integrating context information into the feature map. This improved strategy not only makes the network structure more efficient, but also improves the accuracy and reliability of face detection.
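A compact PyTorch sketch of an SSH-style context module built from stacked 3 × 3 convolutions, as described above, is shown below; the channel split between the branches is an assumption rather than the exact configuration used in the paper.

```python
# SSH-style context module sketch: one plain 3x3 branch, two stacked 3x3 (≈5x5),
# and three stacked 3x3 (≈7x7); outputs are concatenated. Channel split is assumed.
import torch
import torch.nn as nn

def conv3x3(cin, cout, relu=True):
    layers = [nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout)]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class SSHContext(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        self.branch3x3 = conv3x3(channels, half, relu=False)      # plain 3x3 branch
        self.reduce = conv3x3(channels, quarter)                  # shared first 3x3
        self.branch5x5 = conv3x3(quarter, quarter, relu=False)    # two 3x3 ≈ one 5x5
        self.branch7x7_a = conv3x3(quarter, quarter)
        self.branch7x7_b = conv3x3(quarter, quarter, relu=False)  # three 3x3 ≈ one 7x7
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        r = self.reduce(x)
        out = torch.cat([self.branch3x3(x),
                         self.branch5x5(r),
                         self.branch7x7_b(self.branch7x7_a(r))], dim=1)
        return self.relu(out)

print(SSHContext()(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```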

Design of the improved EfficientNetV2 module

To address the issue that the original backbone network MobileNet0.25 in RetinaFace cannot simultaneously account for both detection accuracy and speed, this paper employs the enhanced EfficientNetV2 network as the backbone network for RetinaFace. Additionally, the LKA is designed to further increase the detection accuracy and speed of RetinaFace. This addresses the problems of overfitting and false detection that were associated with the original network.

The primary module of EfficientNetV2 is MBConv, as depicted in Fig.  4 . In this paper, the convolutional features, after pooling by the MBConv module, are concatenated and then input to the fully connected layer as a whole. The enhanced EfficientNetV2 network can effectively capture the underlying texture information and high-level semantic details of an image, thereby improving its detection capability.

Figure 4. Improved MBConv structure.

Existing networks encounter challenges such as low face extraction accuracy, poor generalization ability, and high time complexity in face detection tasks. To address these challenges, this paper introduces the LKA module into the EfficientNetV2 framework to handle face detection tasks effectively in complex environments. This module effectively combines the strengths of convolutions and transformers, overcoming the limitations of convolutions in handling long-range dependencies and addressing the challenges that transformers face in adapting to local information and channel dimensions.

The LKA mechanism, while maintaining the local feature extraction strength of the CNN, effectively captures information from a larger context within images or video data through its unique large kernel design. This design not only enables LKA to extract local features as efficiently as CNNs do but also overcomes the potential limitations of CNNs in handling long-range dependencies.

Furthermore, the combination of LKA with the flexibility of transformers provides a more efficient solution to sequence modeling. Through their self-attention mechanism, transformers are able to directly identify and focus on the most significant parts of the sequence for the current prediction. On this basis, LKA assigns different attention weights to the information within the large kernel range, achieving more fine-grained feature selection and attention allocation.

When enhancing the original EfficientNetV2, this paper uses the LKA convolution block to replace the traditional convolution block, as depicted in Fig.  5 . The objective of this innovative design is to endow the long range of the network with the ability to support and implement a global receptive field, thereby facilitating the comprehensive extraction of coarse-grained global features from images. The introduction of the LKA module further reduces the number of output channels and repeated layers in EfficientNetV2, allowing for the extraction of multidimensional and multiscale fine-grained features from global image features.

Figure 5. LKA convolution block.
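The following PyTorch sketch illustrates a large kernel attention block of this kind, using the common decomposition into a depthwise convolution, a dilated depthwise convolution, and a pointwise convolution; the specific kernel sizes and dilation are assumptions, not the paper's settings.

```python
# Large kernel attention (LKA) sketch: depthwise conv + dilated depthwise conv +
# pointwise conv produce an attention map that reweights the input features.
# Kernel sizes and dilation are assumptions.
import torch
import torch.nn as nn

class LKA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, 7, padding=9,
                                    dilation=3, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))  # large effective receptive field
        return x * attn                              # reweight the input features

print(LKA(32)(torch.randn(1, 32, 56, 56)).shape)     # torch.Size([1, 32, 56, 56])
```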

Tables 1 and 2 provide details on the original and improved network structures of EfficientNetV2, respectively. In Table 1, Conv 3 × 3 signifies a standard 3 × 3 convolution operation paired with the SiLU activation function and batch normalization. SE represents the squeeze-and-excitation attention module, and 0.25 is the coefficient of the first fully connected layer in the SE module, equivalent to a quarter of the number of channels in the feature matrix of this module. K represents the convolution kernel size, and the coefficient after MBConv is the expansion factor.

Considering the limited and slow expansion of the receptive field in the traditional convolution module, which causes inefficiency in utilizing distant pixels of the image, we introduce LKA at the start of the EfficientNetV2 network. Leveraging the long-range dependence of LKA, the network can acquire global receptive field characteristics, facilitating the extraction of coarse-grained global features from the image and effectively enhancing the accuracy of the network.

To enhance the EfficientNetV2 network, the output channels and the number of repeat layers of each module following the LKA module are condensed. Throughout this process, the network search algorithm is fully utilized to explore the depth of the network. Once the depth is determined, the number of repetitions between the network layers is systematically investigated. The parameter configuration of the network is subsequently further determined by exploring the width of the network.

Improved ECA module

To address the challenges of accurate face detection and slow detection speed in complex environments, this paper explores the introduction of an attention mechanism after the backbone network. The ECA mechanism 21 is a technique used to increase the performance of convolutional neural networks; it is known for its plug-and-play advantages and is widely used in deep learning for face detection. However, owing to its limitations in terms of global performance and tendency to overlook spatial information, this paper restructures the ECA mechanism and designs a fusion spatial attention module to enhance feature extraction, particularly for small faces in dense crowds. This approach improves the accuracy and robustness of the model.

The ECA mechanism does not require dimensionality reduction and uses one-dimensional convolution to achieve local cross-channel interaction, effectively extracting the dependencies between channels. The ECA mechanism structure is illustrated in Fig. 6, where the front and back 'C' represent the input and output feature maps, respectively, 'GAP' represents global average pooling, 'α' represents the activation function, and 'k' represents the scope of local cross-channel coverage.

Figure 6. Structure of the ECA mechanism.

The modification to the backbone network of RetinaFace affects the detection accuracy, especially for dense small object detection at low resolution. While retaining the advantages of the one-dimensional convolution of the ECA mechanism, a spatial attention module (SA module) is added to better capture features in different spatial locations of face images. This new addition enhances global information attention and network performance. The improved ECA mechanism is shown in Fig.  7 . Through 1D convolution, the feature extraction ability is enhanced without dimensionality reduction, thus improving model performance with minimal parameters and computation. This enhancement achieves adaptability for face detection in complex backgrounds.

Figure 7. Structure of the improved ECA mechanism.
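A minimal PyTorch sketch of ECA-style channel attention combined with a simple spatial-attention branch of the kind described above is shown below; the 1-D kernel size and the 7 × 7 spatial kernel are assumptions.

```python
# Sketch of ECA-style channel attention plus a spatial-attention branch.
# The 1-D kernel size k and the 7x7 spatial kernel are assumptions.
import torch
import torch.nn as nn

class ECASpatial(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: GAP -> 1-D conv across channels (no dimension reduction)
        w = x.mean(dim=(2, 3)).view(b, 1, c)
        w = self.sigmoid(self.conv1d(w)).view(b, c, 1, 1)
        x = x * w
        # Spatial attention: channel-wise mean and max maps -> single conv -> mask
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.sigmoid(self.spatial(pooled))

print(ECASpatial()(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```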

Refactoring the NAS-FPN module

In RetinaFace, a FPN is used to address object changes at different scales. However, the traditional FPN structure extracts feature maps of different scales from different network levels and performs a 1 × 1 convolution on each extracted feature map to reduce the number of convolution kernels. With increasing depth, the resolution of the FPN decreases, which affects the keypoint detection of small target faces. To solve this problem, this paper designs an improved NAS-FPN, which better utilizes the semantic information contained in the fusion feature map and then improves the network accuracy.

NAS 23 repeats the training of the FPN 24 in the given search space and performs cross-range feature fusion through top-down and bottom-up connections of the feature maps obtained from the original FPN structure. The best accuracy and speed can be measured according to the number of repetitions.

NAS-FPN uses an automatic architecture search network to select a new restructuring scheme for feature maps across five scales, providing greater flexibility in achieving an improved structure. The pyramid network architecture of NAS-FPN is illustrated in Fig. 8, where pink P3–P7 represent the input feature layers, yellow P3–P7 represent the output feature layers, and GP represents the global pooling layer. The semantic information contained in the high-level feature map is computed as a global feature map. 'R–C–B' represents ReLU–Conv–BatchNorm, indicating the sequence of ReLU activation, convolution, and batch normalization.

Figure 8. NAS-FPN structure.

The FPN enhances the accuracy of object detection by fusing features across different scales. However, in the case of small face detection, the traditional implementation of the FPN may not effectively extract and fuse relevant information. NAS-FPN uses NAS techniques to explore various cross-scale connections and feature fusion strategies, ultimately discovering the optimal feature fusion pattern tailored specifically for small face detection, significantly improving detection accuracy.

The design of NAS-FPN adopts a modular approach, in which each component of the FPN is considered a searchable module. This approach increases not only the flexibility of the search process but also the search efficiency. Moreover, NAS-FPN supports early exit and anytime-prediction functionalities. When the model has sufficient confidence in detecting a small face in a specific region, it can terminate further processing for that region, thereby effectively conserving computational resources and enhancing detection speed while maintaining detection accuracy.

The traditional FPN structure diagram is shown in Fig. 2. Since the FPN structure cannot fully utilize the feature maps, the improved NAS-FPN structure is used to improve the network accuracy. The EfficientNetV2 network is used as the backbone network in the DCFD algorithm. To improve the network accuracy without increasing the number of network parameters and calculations, the feature maps are recombined and fused via the NAS-FPN structure. To perceive the feature maps more effectively, the feature network is modified; the improved network structure diagram is shown in Fig. 9. This modification optimizes the feature extraction and fusion mechanism of the network, thus improving the performance of the DCFD algorithm in the object detection task.

Figure 9. Improved network.

Focal loss function

In the dense pedestrian face detection task, owing to the unbalanced distribution of face and nonface samples in the dataset, the traditional cross-entropy loss function performs poorly in the face of this problem. To address this challenge, we introduce the focal loss function, which is unique in its ability to adaptively focus on difficult samples. Higher weights are assigned to those that are easily misclassified, and lower weights are assigned for those that are relatively easy to classify. This mechanism helps the model focus more on key regions, thereby improving the performance of face detection. The cross-entropy loss function used is shown in Eq. ( 1 ).

where y_i is the label value of the i-th sample, which can be 0 or 1, and p_i is the probability that the model predicts the i-th sample to be positive, with a value in [0, 1]. This loss function measures the difference between the current model's prediction and the actual label.

If the sample is positive (y_i = 1), Eq. (1) reduces to the following form:

In Eq. (2), the closer p_i is to 1, the better the prediction for the positive sample.

If the sample is negative (y_i = 0), the cross-entropy loss in terms of p_i is given by Eq. (3); the closer p_i is to 0, the better the prediction for the negative sample:

To address sample imbalance and the difficulty of classifying hard samples, the focal loss function 23 is introduced, which adjusts the loss through a focusing parameter; the probability of the true class, p_t, is defined in Eq. (4):

To address the sample imbalance problem, this paper introduces the weighting factor α, which ranges over [0, 1]. When y_i = 1, training sample i is regarded as a positive sample and its weight is set to α; when y_i = 0, training sample i is regarded as a negative sample and its weight is set to (1 − α). On the basis of this weighting factor α, a weighted cross-entropy formula is defined to adjust the cross-entropy loss. According to Eq. (2), the weighted focal loss for positive samples can be obtained as follows:

The weighting factor α_t adjusts the ratio of positive to negative samples in the cross-entropy loss and thereby addresses unbalanced sample data. To address the difficulty of distinguishing "hard" samples, a modulating factor γ in the range [0, +∞) is introduced into the cross-entropy loss function. According to Eq. (3), the weighted focal loss for negative samples can be obtained as follows:

The modulating factor γ works as follows:

When a sample is misclassified and p_t is small (a hard-to-classify sample), the modulating term (1 − p_t)^γ is close to 1 and has minimal effect on the loss. The model therefore retains a large loss for such samples and pays more attention to samples that are difficult to classify.

When p_t is 1 (a completely correctly classified sample), the modulating term (1 − p_t)^γ is 0, which means that the loss of such samples is 0 and the model does not focus on them.

When p_t is 0.9 (an easy, correctly classified sample) and γ = 2, the modulating term (1 − p_t)^γ is 0.01, so the sample's loss is reduced to 1/100 of its original value and the model pays less attention to such easy-to-classify samples.

When γ = 0, the focal loss function reduces to the cross-entropy loss function, which does not distinguish hard samples and assigns the same weight to all samples.

In summary, according to Eqs. (4)–(6), the focal loss function can be obtained as shown in Eq. (7). Based on the classification difficulty of each sample and the modulating factor γ, the loss is dynamically adjusted to better handle hard samples.
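Since the equation images are not reproduced here, the standard forms consistent with the surrounding definitions are restated below as a reconstruction (Eq. (1) cross-entropy, Eq. (4) the true-class probability, Eq. (7) the focal loss):

```latex
% Reconstruction of the standard forms consistent with the definitions above;
% the original equation images are not reproduced here.
\begin{align}
L_{CE} &= -\sum_{i}\left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right] \tag{1}\\[4pt]
p_t &= \begin{cases} p_i, & y_i = 1\\ 1-p_i, & y_i = 0 \end{cases} \tag{4}\\[4pt]
L_{FL} &= -\,\alpha_t\,(1-p_t)^{\gamma}\,\log(p_t) \tag{7}
\end{align}
```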

The above results indicate that the ability of the model to detect small faces has improved. However, in the postprocessing stage, the large number of prior boxes makes detection oversensitive and produces many candidate faces. In some "hard" images, thousands of candidate faces may be present, significantly slowing the NMS algorithm. To address this problem, a training strategy is adopted in this paper: the cross-entropy loss function is used for the first 145 epochs. This strategy reduces the number of prior boxes, thereby alleviating the burden on the NMS stage.
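For completeness, a compact PyTorch sketch of a binary focal loss matching this formulation is given below; the α and γ values are common defaults, not necessarily the paper's settings.

```python
# Compact binary focal-loss sketch (PyTorch); alpha and gamma are illustrative defaults.
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss for face/non-face classification logits."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)               # probability of the true class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))    # class-balancing weight
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()

logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(focal_loss(logits, targets))
```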

Experiments and analysis of the results

This paper conducted experiments with the widely used WiderFace and LFW datasets. The WiderFace dataset comprises 32,203 images and includes 393,703 accurately labeled face instances. The dataset exhibits considerable diversity, featuring images from 61 different scenes where faces vary in blur degree, expression, illumination, occlusion, and pose, as shown in (a)–(f) in Fig. 10.

Figure 10. WiderFace dataset.

In this work, we systematically labeled, classified, and filtered the WiderFace dataset to create a new subset of WiderFace data, which comprises approximately 29,000 images. For experimentation, the collated dataset was then divided into two subsets. All scenes were randomly sampled, with 70% of the images allocated to the training set for model training and parameter tuning and the remaining 30% of the images designated as the test set for evaluating the performance and generalization ability of the model.

The LFW dataset comprises 13,233 face images, representing 5749 unique identities and reflecting the diversity of faces in terms of age, illumination, and pose. The substantial number of images and individuals in this dataset ensures ample sample diversity. In the example plots (a) to (f) in Fig.  11 , typical images from the LFW dataset are displayed.

Figure 11. LFW dataset.

Experimental environment

The algorithm implemented in this paper is based on the Ubuntu 18.04 64-bit operating system. PyTorch is used as the deep learning framework, with CUDA version 11.0.2, and the programming language is Python 3.8. The hardware and software environments for model training are detailed in Table 3 , and the stability and reliability of these environments provide a solid foundation for the implementation and performance evaluation of the algorithm in this paper.

Parameter settings and evaluation metrics

In the experiments of this paper, precision and recall serve as the primary evaluation metrics for a comprehensive assessment of model performance. The model was tested by using a dataset comprising 1000 images.

Precision refers to the proportion of samples predicted as faces by the model that are actually faces. It can be expressed as Eq. (8), where TP represents the number of positive samples correctly predicted as faces, whereas FP represents the number of negative samples incorrectly predicted as faces.

Recall refers to the proportion of samples that the model correctly predicts as faces from all the samples that are actual faces, and it can be expressed by Eq. ( 9 ). Here, FN represents the number of positive samples incorrectly predicted as nonfaces.
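Restated in standard form (a reconstruction, since the equation images are not reproduced here), Eqs. (8) and (9) are:

```latex
% Standard definitions matching Eqs. (8) and (9) as described in the text.
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \tag{8}\\[4pt]
\text{Recall} &= \frac{TP}{TP + FN} \tag{9}
\end{align}
```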

In the training stage, anchor boxes with EIOU values greater than 0.5 are defined as positive samples, whereas those with EIOU values less than 0.3 are defined as negative samples, and a positive-to-negative sample ratio of 1:3 is maintained after sample screening. The model training data for the DCFD are outlined in Table 4 .

The feature pyramid, prior box, and SSH parameters used in the model are shown in Table 5:

Ablation experiment

To verify the optimization effect of the DCFD algorithm on the RetinaFace face detection model, this paper designs two sets of ablation experiments on the WiderFace and LFW datasets. These experiments cover the comparisons of four key improvement points: the replacement of the backbone network, the introduction of a focal loss function, an improved ECA mechanism, and the NAS-FPN feature fusion module. By gradually integrating these improved measures and comparing and analyzing the detection results, the superiority of the improved algorithm is confirmed. In Table 6 , "√" indicates the application of this improved method in the RetinaFace face detection network, whereas easy, medium, and hard represent the detection accuracies in three different difficulty modes of the dataset.

The data in Table 6 clearly show that on the WiderFace dataset, after replacing the backbone network, the accuracy of the model in the three difficulty modes of easy, medium and hard is improved by 1.61%, 1.45% and 1.24%, respectively. The detection accuracy of the model is subsequently further improved by approximately 1% by introducing the ECA mechanism and the focal loss function. Notably, the addition of the NAS-FPN feature fusion module improves not only the detection accuracy but also the overall network performance. This comprehensive improvement strategy not only ensures the accuracy of face detection but also balances the detection speed, thus comprehensively improving the performance of the RetinaFace face detection model.

The data in Table 7 show that the DCFD algorithm demonstrates comparable effectiveness on the LFW dataset, which is similar to its performance on the WiderFace dataset. Each experimental component has a positive impact on the model and underscores the versatility of the DCFD algorithm across different datasets. The results of this series of ablation experiments further affirm the robustness and superiority of the DCFD algorithm in addressing various datasets and scenarios.

In the ablation experiment, to better evaluate the effectiveness of the proposed DCFD algorithm, the contribution of each part is gradually verified, as shown in Fig.  12 , to show the effect of each experimental part.

Figure 12. Plot of the results of the ablation experiment.

Comparative experiments

In the reconstructed WiderFace dataset, the model loss starts to converge after 150 training rounds. To further evaluate the performance of the proposed model, 10 different network models were trained in the same experimental environment. A detailed analysis and comparison were conducted, and the experimental results are presented in Table 8 . The results clearly indicate that the DCFD model in this paper has significant advantages in terms of accuracy and detection speed. Compared with other network models, the algorithm in this paper performs well in dense crowd face detection tasks, further validating the effectiveness of the improved algorithm.

As shown in Table 8 , in the three validation subsets of the reconstructed WiderFace dataset, the average accuracy of the DCFD model reaches 96.64%, 96.3%, and 86.73%; these values are 3.19%, 3.44%, and 3.50% higher, respectively, than those of RetinaFace. In contrast, traditional detection methods that do not use convolutional neural networks, such as V‒J and DPM, perform poorly in terms of accuracy. The average accuracy of Faceness-Net is also lower because of its lack of multiscale feature extraction methods, making it challenging to adapt to different face sizes. Moreover, the accuracies of the MTCNN and SCRFD-34GF deep network models, which lack an attention mechanism, are relatively low.

To ensure the complete consistency of the experimental environment, this paper conducts 150 rounds of training on the LFW dataset. As shown in Table 9 , in the LFW dataset, the average accuracy of the DCFD model reached 97.04%, 96.43% and 87.01%; the size of the model was 15.3 MB, achieving a balance between detection accuracy and detection speed; and the overall model was more lightweight. These values are 3.59%, 3.57% and 3.78% higher than those of RetinaFace. Newer networks such as TinaFace and YOLOv7-tiny have significant advantages over YOLOv7-face at both the medium and hard difficulty levels. Owing to the change in the backbone network and the addition of an attention mechanism, the DCFD algorithm has better performance. Through comparative experiments, we further confirm the robustness and excellent performance of the DCFD algorithm on different datasets. According to the data analysis in Table 9 , the experimental results align with the conclusions obtained for the WiderFace dataset, confirming the strong performance of the DCFD model on the LFW dataset.

The face detection curves for different difficulty levels (easy, medium, and hard) are presented in Fig.  13 . Compared with other network models, the DCFD algorithm has higher prediction accuracy and an excellent recognition rate. The proposed algorithm achieves significant advantages in face detection tasks with varying difficulty levels.

Figure 13. Model precision‒recall curves.

To design a face detection algorithm that is suitable for various environments, this paper strives to maintain high detection accuracy while reducing the number of parameters and the computational complexity. To verify the effectiveness of the improved backbone network in raising accuracy, RetinaFace-ResNet50 and RetinaFace-MobileNetV1 are selected as the comparison algorithms. All algorithms are tested strictly on the WiderFace dataset to ensure the fairness and reliability of the results. The experimental results show that the algorithm proposed in this article has significant advantages over the other algorithms. The detailed test and comparison results are shown in Table 10.

According to Table 10 , the algorithm proposed in this work significantly outperforms MobileNetV1 in terms of the easy, medium, and hard difficulty levels, with increases of 5.3%, 7.35%, and 13.42%, respectively. Moreover, compared with ResNet50, the proposed algorithm also demonstrates good competitiveness. The optimization of the backbone network in this work not only improves the overall accuracy of the network but also shows a distinct advantage over other networks.

As shown in Table 11 , after replacing the cross-entropy loss function originally used by the RetinaFace network model with the focal loss function, the model achieves the best performance at the hard difficulty level. This improvement not only enables the model to better address facial conditions in different scenarios but also significantly improves the accuracy of face detection. The optimization of the loss function can significantly improve the performance of the network model in addressing complex face detection tasks.

The experimental results presented in Table 12 indicate that the accuracy generally improves after the ECA mechanism is introduced. Furthermore, when the ECA mechanism is improved and integrated into the SA module, the accuracy rate achieves a significant increase of more than 1.0% in the difficulty levels of easy, medium and hard. Compared with the original ECA mechanism, this improvement not only yields greater performance improvement but also fully verifies the positive promotion effect of the attention mechanism improvement scheme proposed in this paper on the overall performance of the network model.

The detection effect of the original RetinaFace algorithm is illustrated in a(1)–a(3) of Fig.  14 , whereas the detection effect of the DCFD algorithm is shown in b(1)–b(3) of Fig.  14 . These visualizations vividly demonstrate the improvement achieved by the DCFD model over the original RetinaFace. In a(1) and b(1), the recognition ability of RetinaFace is poor on face images with a large occlusion area in the last row on the right, whereas DCFD performs better on such images. In a(2) and b(2), RetinaFace has difficulty recognizing faces at a long distance, whereas DCFD can recognize almost all faces at this distance. In a(3) and b(3), RetinaFace performs poorly in face recognition at long distances, whereas DCFD can successfully recognize 80% of faces at a long distance.

Figure 14. Comparison of the detection results before and after improvement for the RetinaFace network.

Experimental analysis

Through a series of experiments on the LFW and WiderFace datasets, the backbone network, attention mechanism, loss function and feature fusion module are carefully optimized. To verify the effectiveness of these improvements, not only are the performances of the classical network and the newer network compared but also detailed module ablation experiments are carried out. The experimental results show that the improvement of each module has a positive effect on the accuracy and running speed of the network model.

For face detection tasks, the EfficientNetV2 model is improved, and LKA is introduced to make the model more focused when facial feature expression is captured. Through this improvement, the model can more effectively capture the underlying texture information and high-level semantic information of the image to achieve significant compression of the output channels and the number of repeated layers while ensuring the accuracy of the network. The experimental results on the LFW and WiderFace datasets show that the improved EfficientNetV2 model with LKA achieves significant performance improvement on face detection tasks. By effectively controlling the complexity of the model, the running speed of the model is improved, and it is more competitive in practical applications.

To further improve performance, an improved ECA mechanism is introduced after the backbone network. It avoids the dimensionality reduction step, allowing the model to capture the relationships between channels more accurately and to improve performance while adding only a minimal number of parameters and computations. It also makes better use of the backbone features and improves the handling of small objects. This strengthens the model's ability both to capture subtle facial features in an image and to recognize faces against complex backgrounds.
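
The sketch below illustrates, under stated assumptions, how a dimensionality-reduction-free channel attention of the ECA type can be paired with a simple spatial attention branch. The spatial branch shown here is a generic CBAM-style 7x7 convolution over channel-pooled maps; it stands in for the SA module of this paper, whose exact design is not reproduced here.

import math
import torch
import torch.nn as nn

class ECAWithSpatialAttention(nn.Module):
    """ECA-style channel attention (no dimensionality reduction) followed by a
    simple CBAM-style spatial attention branch (an assumption, not the paper's SA)."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Kernel size adapts to the channel count, as in ECA-Net.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.channel_conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: global average pool -> 1D conv across channels.
        y = x.mean(dim=(2, 3))                                   # (B, C)
        y = self.channel_conv(y.unsqueeze(1))                    # (B, 1, C)
        x = x * self.sigmoid(y.transpose(1, 2).unsqueeze(-1))    # broadcast over H, W
        # Spatial attention: pool across channels -> 7x7 conv -> gate.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.sigmoid(self.spatial_conv(pooled))

print(ECAWithSpatialAttention(64)(torch.randn(1, 64, 40, 40)).shape)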

Given the limitations of the traditional cross-entropy loss in face detection, this paper replaces it with the focal loss. Cross-entropy performs poorly on hard samples, a weakness that is especially pronounced in face detection. The focal loss detects previously difficult facial regions more effectively, in particular small target faces that are easy to miss. By balancing the face and non-face training samples, it also down-weights the loss of easy, well-classified samples, which further improves the generalization ability and robustness of the model. With the focal loss, the model's face detection performance improves significantly across different datasets.
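
A minimal sketch of the binary focal loss for the face/background classification branch is given below; the alpha and gamma values are the commonly used defaults and are assumptions here, not necessarily the settings adopted in this work.

import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: down-weights easy examples so training focuses on
    hard, easily missed faces (e.g., small or occluded ones)."""
    probs = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = probs * targets + (1 - probs) * (1 - targets)      # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # face/non-face balancing
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: mostly-background anchors with a few positive (face) anchors
logits = torch.randn(1000)
targets = (torch.rand(1000) < 0.05).float()
print(focal_loss(logits, targets))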

In this paper, NAS-FPN is selected as the feature fusion module. As the network deepens, the resolution of a traditional FPN decreases, which degrades keypoint detection for small target faces. To address this, the NAS-FPN module is introduced. NAS-FPN dynamically adjusts how feature layers are combined to suit object detection at different scales. This cross-scale combination strategy lets the model fuse feature information from different levels more effectively, thereby improving face detection accuracy.

In the experiments, the NAS-FPN module adapts well to diverse scenarios. For both large and small target faces, it improves detection by dynamically adjusting the feature layers, and it helps the network cope with varied environmental conditions, improving the robustness and generalizability of the model.
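
Since the NAS-FPN topology itself is discovered by architecture search, the sketch below only illustrates one of its basic building blocks: a merging cell that fuses two feature levels either by summation or by global-pooling-based gating and then refines the result. The resolutions and channel width are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def resize_to(x: torch.Tensor, size) -> torch.Tensor:
    """Bring a feature map to the target spatial size before fusion."""
    return F.interpolate(x, size=size, mode="nearest")

class MergingCell(nn.Module):
    """One NAS-FPN-style merging cell: fuse two feature levels, then refine
    the result with a 3x3 convolution."""
    def __init__(self, channels: int, fusion: str = "sum"):
        super().__init__()
        self.fusion = fusion
        self.refine = nn.Sequential(nn.ReLU(),
                                    nn.Conv2d(channels, channels, 3, padding=1),
                                    nn.BatchNorm2d(channels))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        b = resize_to(b, a.shape[-2:])
        if self.fusion == "sum":
            fused = a + b
        else:  # global-pooling fusion: a global descriptor of `a` gates `b`
            gate = torch.sigmoid(a.mean(dim=(2, 3), keepdim=True))
            fused = a + gate * b
        return self.refine(fused)

# Example: fuse a stride-8 level with a stride-16 level at 64 channels
p3, p4 = torch.randn(1, 64, 80, 80), torch.randn(1, 64, 40, 40)
print(MergingCell(64, fusion="gp")(p3, p4).shape)  # torch.Size([1, 64, 80, 80])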

Conclusions

To address the challenges of RetinaFace in face detection in complex environments, especially in dense crowds, we made a series of key improvements that led to significant performance gains.

The enhanced lightweight EfficientNetV2 network serves as the backbone for RetinaFace and uses a large kernel attention mechanism to extract facial features more effectively. To concentrate on spatial information and capture facial spatial relationships in dense crowds, a spatial attention module is added to the ECA mechanism. For finer adjustment, the model predicts directly at the pixel level, improving accuracy for both the face regions and their surroundings. The redesigned spatial fusion module provides adaptive gain for image targets, reduces the number of model parameters, improves computational efficiency, and yields a more robust solution for practical face detection scenarios.

In summary, through experiments on different datasets, the DCFD algorithm has significantly improved detection performance and accuracy, particularly in dense crowd face detection tasks. The DCFD algorithm presents an effective solution to face detection challenges in practical application scenarios.

Data availability

The WiderFace dataset used in this paper can be accessed at http://shuoyang1213.me/WIDERFACE/, and the LFW dataset at https://vis-www.cs.umass.edu/lfw/.

Hu, L. & Qiu, R. A face recognition algorithm for adaptive weighted HOG features. Comput. Eng. Appl. 53(03), 164–168 (2017).

Shen, X., Lin, Z., Brandt, J. et al . Detecting and aligning faces by image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3460–3467 (2013).

Xie, M. et al. The research of traffic cones detection based on Haar-like features and adaboost classification. IET Conf. Proc. 2022 (18), 437–442 (2022).

Viola, P. & Jones, M. J. Robust real-time face detection. Int. J. Comput. Vis. 57 (2), 137–154 (2004).

Li, C., Wang, R., Li, J. et al . Face detection based on YOLOv3. In Recent Trends in Intelligent Computing, Communication and Devices 277–284 (Springer, 2020).

Yu, J. & Zhang, W. Face mask wearing detection algorithm based on improved YOLO-v4. Sensors 21 (9), 3263 (2021).

Thuan, D. Evolution of Yolo Algorithm and Yolov5: The State-of-the-Art Object Detection Algorithm 1301–1361 (Oulu University of Applied Sciences, 2021).

Liu, W., Anguelov, D., Erhan, D. et al . Ssd: Single shot multibox detector. In European Conference on Computer Vision 21–37 (Springer, 2016).

Zhang, N., Luo, J., Gao, W. Research on face detection technology based on MTCNN. In 2020 International Conference on Computer Network, Electronic and Automation (ICCNEA) 154–158 (IEEE, 2020).

Deng, J., Guo, J., Ververas, E. et al . Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 5203–5212 (2020).

He, K., Gkioxari, G., Dollár, P. et al . Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision 2961–2969 (2017).

Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision 1440–1448 (2015).

Ren, S., He, K., Girshick, R. et al . Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28 (2015).

Yang, S. et al. Faceness-net: Face detection through deep facial part responses. IEEE Trans. Pattern Anal. Mach. Intell. 40(8), 1845–1859 (2017).

Niu, Z., Qin, T., Li, H. & Chen, J. Improved mask wearing detection algorithm for natural scenes of RetinaFace. Comput. Eng. Appl. 56 (12), 1–7 (2020).

Li, Y., Wang, S., Yang, Z. An improved face detection algorithm for multi-task cascading convolutional neural network. J. Xinyang Norm. Univ. (Nat. Sci. Ed.) 1–5.

Chao, Y., Liu, W., Tang, H., Ma, C. & Wang, Y. Indoor face rapid detection method based on improved YOLO-v4. Comput. Eng. Appl. 58 (14), 105–113 (2022).

Zhejiang University & Alibaba DAMO Academy Research Team. TransFace: A new framework for face recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE, 2023).

Kirillov, A., Girshick, R., He, K. et al . Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 6399–6408 (2019).

Wang, Q., Wu, B., Zhu, P. et al. Supplementary material for 'ECA-Net: Efficient channel attention for deep convolutional neural networks'. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition 13–19 (IEEE, 2020).

Ghiasi, G., Lin, T. Y., Le, Q. V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7036–7045 (2019).

Lin, T. Y., Goyal, P., Girshick, R. et al . Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision 2980–2988 (2017).

Lin, T. Y., Dollár, P., Girshick, R. et al . Feature pyramid networks for object detection. In Proceedings of IEEE/CVF International Conference on Computer Vision and Pattern Recognition 2117−2125 (IEEE, 2017).

Acknowledgements

This research was supported by the Foundation of Liaoning Educational Committee under grant No. LJKMZ20221827. This work was supported by the Liaoning Provincial Science and Technology Plan Project 2023JH2/101300205, the Applied Basic Research Project of Liaoning Province 2022JH2/101300279 and the Shenyang Science and Technology Plan Project 23-407-3-33.

Author information

Authors and affiliations.

Institute of Innovation Science and Technology, Shenyang University, Shenyang, 110044, China

Qiang Gao & Xiaowei Han

School of Information Engineering, Shenyang University, Shenyang, 110044, China

Bingru Ding & Yinghong Xie

School of Electronics and Information Engineering, Liaoning University of Technology, Jinzhou, 121001, China

Contributions

B.R.D. conceived the experiment and wrote the manuscript with the help of Q.G., X.J., Y.H.X., and X.W.H. B.R.D. conducted the experiment. All the authors helped oversee the project, explore improvements, discuss the results, and write the article.

Corresponding author

Correspondence to Bingru Ding.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Gao, Q., Ding, B., Jia, X. et al. Dense pedestrian face detection in complex environments. Sci Rep 14 , 21460 (2024). https://doi.org/10.1038/s41598-024-72523-8

Received : 29 December 2023

Accepted : 09 September 2024

Published : 13 September 2024

DOI : https://doi.org/10.1038/s41598-024-72523-8

Keywords

  • Dense pedestrian
  • Face detection
  • EfficientNet


Revolutionizing Attendance Tracking: A Smart System Utilizing Face Recognition Technology

  • Conference paper
  • First Online: 10 September 2024
  • Cite this conference paper

  • Shashank Dwivedi,
  • Amit Kumar Tiwari,
  • Uddeshy Jaiswa,
  • Shivangi Tripathi,
  • Utkarsh Saxena &
  • Utkarsh Trivedi

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 1005))

Included in the following conference series:

  • International Conference on Innovations in Data Analytics

The Revolutionizing Attendance Tracking: A Smart System Utilizing Face Recognition Technology is an advanced and efficient alternative to the traditional attendance-taking process. The system uses facial recognition to automatically identify and verify students or employees as they enter a classroom or workplace, eliminating manual attendance taking and providing real-time updates to attendance records. It works by capturing a photo of the individual's face and comparing it with a database of previously recorded images; algorithms extract the distinctive features of the face and match them to a specific person. The system is highly accurate and can quickly process a large number of individuals, offering benefits that include time savings, improved accuracy, and enhanced security.
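
A minimal sketch of the capture-and-match step described above, using the open-source face_recognition library, is shown below; the file names, enrolled identities, and distance tolerance are illustrative assumptions rather than details taken from the paper.

import face_recognition

# Encodings of previously enrolled students, built once from stored photos.
# (Assumes each enrollment photo contains exactly one detectable face.)
enrolled = {
    "alice": face_recognition.face_encodings(
        face_recognition.load_image_file("enrolled/alice.jpg"))[0],
    "bob": face_recognition.face_encodings(
        face_recognition.load_image_file("enrolled/bob.jpg"))[0],
}

def mark_attendance(snapshot_path: str, tolerance: float = 0.6) -> list[str]:
    """Detect every face in the snapshot and return the names of enrolled
    people whose stored encoding is within `tolerance` of a detected face."""
    image = face_recognition.load_image_file(snapshot_path)
    present = []
    for encoding in face_recognition.face_encodings(image):
        distances = face_recognition.face_distance(list(enrolled.values()), encoding)
        best = distances.argmin()
        if distances[best] <= tolerance:
            present.append(list(enrolled.keys())[best])
    return present

print(mark_attendance("classroom.jpg"))

A tolerance of around 0.6 is the library's conventional default; a stricter value reduces false matches at the cost of more missed identifications.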

Abadi M, Erlingsson U, Goodfellow I, McMahan HB, Mironov I, Papernot N, Talwar K, Zhang L (2017) On the protection of private information in machine learning systems: two recent approaches, pp 1–6

Baig S, Geetadhari K, Noor MA, Sonkar A (2022) Face recognition based attendance management system by using machine learning. Int J Multi Res Growth Evaluation 3(3):1–4

Bhalla V, Singla T, Gahlot A, Gupta V (2013) Bluetooth based attendance management system. Int J Innov Eng Technol (IJIET) 3(1):227–233

Bruner C, Discher A, Chang H (2011) Chronic elementary absenteeism: a problem hidden in plain sight: a research brief from attendance works and child & family policy center. Attendance Works

Bussa S, Mani A, Bharuka S, Kaushik S (2020) Smart attendance system using OPENCV based on facial recognition. Int J Eng Res Technol 9(3):54–59

Epstein JL, Sheldon SB (2002) Present and accounted for: improving student attendance through family and community involvement. J Edu Res 95(5):308–318

Jain SK, Joshi U, Sharma BK (2011) Attendance management system. Masters Project Report, Rajasthan Technical University, Kota

Joardar S, Chatterjee A, Rakshit A (2014) A real-time palm dorsa subcutaneous vein pattern recognition system using collaborative representation-based classification. IEEE Trans Instrum Measur 64(4):959–966

Mahat S, Mundhe S (2015) Proposed framework: college attendance management system with mobile phone detector. Int J Res IT Manag 5(11):72–82

McCluskey CP, Bynum TS, Patchin JW (2004) Reducing chronic absenteeism: an assessment of an early truancy initiative. Crime Delinquency 50(2):214–234

Nasrollahi K, Moeslund TB (2009) Complete face logs for video sequences using face quality measures. IET Sig Process 3(4):289–300

Pal KK, Sudeep K (2016) Preprocessing for image classification by convolutional neural networks, pp 1778–1781

Ready DD (2010) Socioeconomic disadvantage, school attendance, and early cognitive development: the differential effects of school exposure. Soc Edu 83(4):271–286

Shetty AB, Rebeiro J et al (2021) Facial recognition using Haar cascade and LBP classifiers. Glob Transitions Proc 2(2):330–335

Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vision 57:137–154

Author information

Authors and affiliations.

Department of Computer Science and Engineering, United Institute of Technology, Prayagraj, Uttar Pradesh, 211008, India

Shashank Dwivedi, Amit Kumar Tiwari, Uddeshy Jaiswa, Shivangi Tripathi, Utkarsh Saxena & Utkarsh Trivedi

Corresponding author

Correspondence to Shashank Dwivedi.

Editor information

Editors and affiliations.

Sister Nivedita University, Kolkata, West Bengal, India

Abhishek Bhattacharya

Soumi Dutta

Department of Computer and System Sciences, Visva-Bharati University, Kolkata, West Bengal, India

Paramartha Dutta

Department of Computing Information Technology, Rochester Institute of Technology, Prishtina, Kosovo

Debabrata Samanta

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Dwivedi, S., Tiwari, A.K., Jaiswa, U., Tripathi, S., Saxena, U., Trivedi, U. (2024). Revolutionizing Attendance Tracking: A Smart System Utilizing Face Recognition Technology. In: Bhattacharya, A., Dutta, S., Dutta, P., Samanta, D. (eds) Innovations in Data Analytics. ICIDA 2023. Lecture Notes in Networks and Systems, vol 1005. Springer, Singapore. https://doi.org/10.1007/978-981-97-4928-7_22

Download citation

DOI : https://doi.org/10.1007/978-981-97-4928-7_22

Published : 10 September 2024

Publisher Name : Springer, Singapore

Print ISBN : 978-981-97-4927-0

Online ISBN : 978-981-97-4928-7

eBook Packages : Intelligent Technologies and Robotics Intelligent Technologies and Robotics (R0)

COMMENTS

  1. A Review of Face Recognition Technology

Abstract: Face recognition technology is a biometric technology, which is based on the identification of facial features of a person. People collect the face images, and the recognition equipment automatically processes the images. The paper introduces the related research of face recognition from different perspectives.

  2. (PDF) Face Recognition: A Literature Review

    The task of face recognition has been actively researched in recent years. This paper provides an up-to-date review of major human face recognition research. We first present an overview of face ...

  3. Face recognition: Past, present and future (a review)☆

    The history of face recognition goes back to the 1950s and 1960s, but research on automatic face recognition is considered to be initiated in the 1970s [409]. In the early works, features based on distances between important regions of the face were used [164]. Research studies on face recognition flourished since the beginning of the 1990s ...

  4. Face Recognition

    613 papers with code • 23 benchmarks • 64 datasets. Facial Recognition is the task of making a positive identification of a face in a photo or video image against a pre-existing database of faces. It begins with detection - distinguishing human faces from other objects in the image - and then works on identification of those detected faces.

  5. Face Recognition: Recent Advancements and Research Challenges

A Review of Face Recognition Technology: In the previous few decades, face recognition has become a popular field in computer-based application development. This is due to the fact that it is employed in so many different sectors. Face identification via database photographs, real data, captured images, and sensor images is also a difficult task due to the huge variety of faces. The fields of ...

  6. Design and Evaluation of a Real-Time Face Recognition System using

    In this paper, design of a real-time face recognition using CNN is proposed, followed by the evaluation of the system on varying the CNN parameters to enhance the recognition accuracy of the system. An overview of proposed real-time face recognition system using CNN is shown in Fig. 1. The organization of the paper is as follows.

  7. Face Recognition by Humans and Machines: Three Fundamental Advances

    1. INTRODUCTION. The fields of vision science, computer vision, and neuroscience are at an unlikely point of convergence. Deep convolutional neural networks (DCNNs) now define the state of the art in computer-based face recognition and have achieved human levels of performance on real-world face recognition tasks (Jacquet & Champod 2020, Phillips et al. 2018, Taigman et al. 2014).

  8. Past, Present, and Future of Face Recognition: A Review

    Face recognition is one of the most active research fields of computer vision and pattern recognition, with many practical and commercial applications including identification, access control, forensics, and human-computer interactions. However, identifying a face in a crowd raises serious questions about individual freedoms and poses ethical issues. Significant methods, algorithms, approaches ...

  9. A review on face recognition systems: recent approaches and ...

    Face recognition is an efficient technique and one of the most preferred biometric modalities for the identification and verification of individuals as compared to voice, fingerprint, iris, retina eye scan, gait, ear and hand geometry. This has over the years necessitated researchers in both the academia and industry to come up with several face recognition techniques making it one of the most ...

  10. [2212.13038] A Survey of Face Recognition

    A Survey of Face Recognition. Xinyi Wang, Jianteng Peng, Sufang Zhang, Bihui Chen, Yi Wang, Yandong Guo. View a PDF of the paper titled A Survey of Face Recognition, by Xinyi Wang and 5 other authors. Recent years witnessed the breakthrough of face recognition with deep convolutional neural networks. Dozens of papers in the field of FR are ...

  11. Human face recognition based on convolutional neural network and

    To deal with the issue of human face recognition on small original dataset, a new approach combining convolutional neural network (CNN) with augmented dataset is developed in this paper. The original small dataset is augmented to be a large dataset via several transformations of the face images. Based on the augmented face image dataset, the ...

  12. [2103.14983] Going Deeper Into Face Detection: A Survey

    View a PDF of the paper titled Going Deeper Into Face Detection: A Survey, by Shervin Minaee and 3 other authors. Face detection is a crucial first step in many facial recognition and face analysis systems. Early approaches for face detection were mainly based on classifiers built on top of hand-crafted features extracted from local image ...

  13. Face Detection and Recognition Using OpenCV

    Intel's OpenCV is a free and open-access image and video processing library. It is linked to computer vision, like feature and object recognition and machine learning. This paper presents the main ...

  14. Face recognition based attendance system using machine learning with

algorithm. Once the system is trained, it can recognize the faces of authorized students in real time. When a student's face is detected by the camera, the system matches the detected face with ...

  15. Face Detection and Recognition Using OpenCV

    Face detection and picture or video recognition is a popular subject of research on biometrics. Face recognition in a real-time setting has an exciting area and a rapidly growing challenge. Framework for the use of face recognition application authentication. This proposes the PCA (Principal Component Analysis) facial recognition system. The key component analysis (PCA) is a statistical method ...

  16. Machine-learning Approach Face Detection in Extreme Conditions: A

Face Detection in Extreme Conditions: A Machine-learning Approach. Sameer Aqib Hashmi, Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka, Bangladesh. Abstract: Face detection in unrestricted conditions has been a trouble for years due to various expressions, brightness ...

  17. Sensors

This paper highlights the recent research on the 2D or 3D face recognition system, focusing mainly on approaches based on local, holistic (subspace), and hybrid features. ... The context of the paper is the PhD project of Yassin Kortli.

  18. PDF Evaluating Facial Recognition Technology:

... workshop to address the question of facial recognition technology (FRT) performance in new domains. The workshop included leading computer scientists, legal scholars, and representatives from industry, government, and civil society (listed in the Appendix). Given the limited time, the goal of the workshop was ...

  19. Face recognition using deep learning on Raspberry Pi

Facial recognition on resource-limited devices such as the Raspberry Pi poses a challenge due to inherent processing limitations. ... In this paper, we have used the Everest SC-HD03 model webcam, which has a resolution of 1080p and offers excellent image quality.

  20. Student Attendance Monitoring System Using Face Recognition

Keywords: Local Binary Pattern Histogram (LBPH), Face Detection, Face Recognition, Haarcascade Classifier, Python, Student Attendance. Suggested Citation: SAI, E CHARAN and HUSSAIN, SHAIK ALTHAF and KHAJA, SYED and SHYAM, AMARA, Student Attendance Monitoring System Using Face Recognition (May 22, 2021).

  21. Face detection and Recognition: A review

    Face detection and recognition has gained more research attentions in last few years. ... ballot papers for voting. This project aims to replace the older system with the AI voting system ...

  22. PDF Facial Recognition Attendance System Using Python and OpenCv

    This is a project about Facial Recognition-Based Attendance System for Educational Institutions. In this chapter, the problem and motivation, research objectives, project scope, project contributions and the background information of the project will be discussed in detail. 1.1 Problem Statement and Motivation

  23. Dense pedestrian face detection in complex environments

    This work was supported by the Liaoning Provincial Science and Technology Plan Project 2023JH2/101300205, the Applied Basic Research Project of Liaoning Province 2022JH2/101300279 and the Shenyang ...

  24. Student Attendance System using Face Recognition

    Abstract: Face recognition is among the most productive image processing applications and has a pivotal role in the technical field. Recognition of the human face is an active issue for authentication purposes specifically in the context of attendance of students. Attendance system using face recognition is a procedure of recognizing students by using face biostatistics based on the high ...

  25. PDF Smart Attendance System Using Face Recognition Technology

    The smart attendance system utilizing face recognition stands as a testament to the convergence of computer vision, biometrics, and artificial intelligence. In this report, we will delve into the underlying mechanisms of the smart attendance system, exploring its technological foundations, benefits, challenges, and potential applications.

  26. Revolutionizing Attendance Tracking: A Smart System Utilizing Face

A comparable attendance system utilizing face recognition was suggested in another academic article published in the International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE). The article highlighted the importance of pre-processing steps such as image normalization and resizing to improve the ...