ML & AI for Software Developers - Part 26

Facial Recognition with CNNs

Not long ago, I boarded a flight to Europe and was surprised that I didn’t have to show my passport. I passed in front of a camera and was promptly welcomed aboard the flight. It was part of an early pilot for Delta Air Lines’ effort to push forward with facial recognition and offer a touchless curb-to-gate travel experience.

Facial recognition is everywhere. It’s one of the most common – and sometimes controversial – applications for AI. Meta, formerly known as Facebook, uses it to tag friends in photos that you upload – at least it did until it killed the feature due to privacy concerns. Apple uses it to allow users to unlock their iPhones, while Microsoft uses it to unlock Windows PCs. Used properly, facial recognition has vast potential to make the world a better, safer, and more secure place.

Several algorithms for recognizing faces in photos have been developed over the years. Some rely on biometrics such as the distance between the eyes or the texture of the skin, while others take a more holistic approach by treating facial identification as a pattern-recognition problem. State-of-the-art models today typically rely on deep convolutional neural networks, or CNNs. One of the primary benchmarks for facial-recognition models is the Labeled Faces in the Wild (LFW) dataset, which contains more than 13,000 facial images of more than 5,000 people collected from the Web. Deep-learning models such as MobiFace and FaceNet routinely achieve greater than 99% accuracy on the dataset. This equals or exceeds a human’s ability to identify faces in LFW photos.


Labeled Faces in the Wild


In an earlier blog post, I presented a support-vector machine (SVM) that achieved 85% accuracy using a subset of 500 images – 100 each of five famous people – from the dataset. A subsequent post tackled the same problem with a neural network, with similar results. These models merely scratch the surface of what can be accomplished with modern facial recognition. Let’s apply CNNs and transfer learning to the same LFW subset and see if they can do better at recognizing faces in photos. Along the way, you’ll learn a valuable lesson about pretrained CNNs and the specificity of the weights that are generated when those CNNs are trained.

Use Transfer Learning to Train a Facial-Recognition Model

The first step in exploring CNN-based facial recognition is to load the LFW dataset. This time, we’ll load full-size color images and crop them to 128×128 pixels. Here’s the code to make that happen:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_lfw_people

faces = fetch_lfw_people(min_faces_per_person=100, resize=1.0, slice_=(slice(60, 188), slice(60, 188)), color=True)
class_count = len(faces.target_names)

print(faces.target_names)
print(faces.images.shape)

Because we set min_faces_per_person to 100, 1,140 facial images corresponding to five people were loaded. Use the following statements to show the first several images and the labels that go with them:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

fig, ax = plt.subplots(3, 6, figsize=(18, 10))

for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i] / 255) # Scale pixel values so Matplotlib doesn't clip everything above 1.0
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])
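
Before going further, it's worth checking how many images were loaded for each person. Here's one quick way to do it (pandas was already imported above):

# Show the number of images loaded for each person
print(pd.Series(faces.target_names[faces.target]).value_counts())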

The dataset is imbalanced, containing almost as many photos of George W. Bush as of everyone else combined. Use the following code to reduce the dataset to 100 images of each person for a total of 500 facial images:

mask = np.zeros(faces.target.shape, dtype=bool)

for target in np.unique(faces.target):
    mask[np.where(faces.target == target)[0][:100]] = 1

x_faces = faces.data[mask]
y_faces = faces.target[mask]
x_faces = np.reshape(x_faces, (x_faces.shape[0], faces.images.shape[1], faces.images.shape[2], faces.images.shape[3]))
x_faces.shape

Now preprocess the pixel values for input to a pretrained CNN, one-hot-encode the labels, and use Scikit-learn’s train_test_split function to split the dataset for training and testing, yielding 400 training samples and 100 test samples:

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.applications.resnet50 import preprocess_input
from sklearn.model_selection import train_test_split

face_images = preprocess_input(np.array(x_faces))
face_labels = to_categorical(y_faces)

x_train, x_test, y_train, y_test = train_test_split(face_images, face_labels, train_size=0.8, stratify=face_labels, random_state=0)
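
If you want to confirm the split, a quick check of the array shapes should show 400 training samples and 100 test samples:

# Expect (400, 128, 128, 3) for training and (100, 128, 128, 3) for testing
print(x_train.shape, x_test.shape)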

If you wanted, you could divide the preprocessed pixel values by 255 and train a CNN from scratch right now with this data. Here’s how you’d go about it (in case you care to give it a try):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(face_images.shape[1:])))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(class_count, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

hist = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=20, batch_size=10)

I did it and then plotted the training and validation accuracy:
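
If you'd like to reproduce the plot yourself, the same plotting code used for the transfer-learning models later in this post works here:

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()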

Training and validation accuracy

The validation accuracy is better than that of an SVM or a conventional neural network, but it’s nowhere near what modern CNNs achieve on the LFW dataset. So clearly there is a better way.

That better way, of course, is transfer learning. ResNet50 was trained with more than 1 million images from the ImageNet dataset, so it should be pretty adept at extracting features from photos – more so than our hand-crafted CNN trained with 400 images. Let’s see if that’s the case. Use the following statements to load ResNet50’s feature-extraction layers, initialize them with the ImageNet weights, and freeze them so the weights aren’t adjusted during training. Note that you can freeze all the layers with one line of code by setting trainable=False on the base model:

from tensorflow.keras.applications import ResNet50

base_model = ResNet50(weights='imagenet', include_top=False)
base_model.trainable = False
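
As an optional sanity check, you can confirm that the freeze took effect by inspecting the base model's weight collections:

# With the base model frozen, all of its weights should be non-trainable
print('Trainable weight tensors:', len(base_model.trainable_weights))
print('Non-trainable weight tensors:', len(base_model.non_trainable_weights))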

Now add classification layers to the base model and include a Resizing layer to resize images input to the network to the size that ResNet50 expects:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Resizing

model = Sequential()
model.add(Resizing(224, 224))
model.add(base_model)
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(class_count, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
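
If you'd like to inspect the resulting architecture before training, one option is to build the model with the input shape and print a summary:

# Optional: build the model so summary() can report layer shapes and parameter counts
model.build(input_shape=(None,) + face_images.shape[1:])
model.summary()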

Train the model and plot the training and validation accuracy:

hist = model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=10, epochs=10)

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

Results will vary, but my run produced a validation accuracy around 94%:

Training and validation accuracy

94% is an improvement over a CNN trained from scratch, and it’s an indication that ResNet50 does a better job of extracting features from facial images. But it’s still not state-of-the-art. Is it possible to do even better?

Boost Transfer Learning with Task-Specific Weights

Initialized with ImageNet weights, ResNet50 does a credible job of feature extraction. Those weights were arrived at when ResNet50 was trained on more than 1 million photos of objects ranging from basketballs to butterflies. It was not, however, trained with facial images. Would it be better at extracting features from facial images if it were trained with facial images?

In 2017, a group of researchers at the University of Oxford’s Visual Geometry Group in the U.K. published a paper entitled “VGGFace2: A dataset for recognising faces across pose and age.” After assembling a dataset comprising several million facial images, they trained two variations of ResNet50 with it and published the results. They also published the weights, which are wrapped in a handy Python library named keras-vggface. That library includes a class named VGGFace that encapsulates ResNet50 with TensorFlow-compatible weights. Out of the box, VGGFace is capable of recognizing the faces of thousands of celebrities ranging from Brie Larson to Jennifer Aniston. But its real value lies in using transfer learning to repurpose it to recognize faces it wasn’t trained to recognize before.

In order to experiment with VGGFace, you must first install keras-vggface as well as a package named keras-applications.
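
In a Jupyter notebook, the install typically looks something like this (the exact command is an assumption and may vary with your environment):

!pip install keras_vggface keras_applications

Then you can use the following code to create an instance of VGGFace built around ResNet50 with custom classification layers: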

from keras_vggface.vggface import VGGFace

base_model = VGGFace(model='resnet50', include_top=False)
base_model.trainable = False

model = Sequential()
model.add(Resizing(224, 224))
model.add(base_model)
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(class_count, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Next, train the model and plot the training and validation accuracy:

hist = model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=10, epochs=10)

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

The results are nothing short of amazing:

Training and validation accuracy

To be sure, run the test data through the network and use a confusion matrix to assess the results:

from sklearn.metrics import confusion_matrix

y_predicted = model.predict(x_test)
mat = confusion_matrix(y_test.argmax(axis=1), y_predicted.argmax(axis=1))

sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap='Blues',
            xticklabels=faces.target_names,
            yticklabels=faces.target_names)

plt.xlabel('Predicted label')
plt.ylabel('Actual label')

Confusion matrix
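
For a per-person breakdown of precision and recall, you could also print a classification report alongside the confusion matrix:

from sklearn.metrics import classification_report

# Summarize precision, recall, and F1 score for each person in the test set
print(classification_report(y_test.argmax(axis=1), y_predicted.argmax(axis=1),
                            target_names=faces.target_names))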

Because VGGFace was tuned to extract features from facial images, the network achieves a perfect score on the 100 test images. That’s not to say it will never fail to recognize a face. It does indicate that on the dataset you trained it with, it is remarkably adept at extracting features from facial images and classifying those features.

And therein lies an important lesson. CNNs that are trained in task-specific ways frequently provide a better base for transfer learning than CNNs trained in a more generic fashion. If the goal is to perform facial recognition, you’ll almost always do better with a CNN trained with facial images than a CNN trained with photos of thousands of dissimilar objects. For a neural network, it’s all about the weights.

But Is It Real?

Anytime a model scores this well in testing, you should be suspicious. Given that VGGFace was trained with images of some of the same famous people found in the LFW dataset, is it possible that it’s biased towards those people? That transfer learning with VGGFace wouldn’t do as well if trained with images of ordinary people? And how would it perform with just a handful of training images?

As a test, I generated a small dataset containing eight pictures each of me, my wife, and my youngest daughter in slightly different poses, at ages up to 20 years apart, with and without glasses. Facial-recognition software often confuses my wife and daughter; many times I’ve uploaded a picture of one of them to Facebook and had Facebook offer to tag her as the other. Then I split the data 50/50 and trained a VGGFace-based model to recognize the faces. With just 12 training images (four each of the three of us) and 12 test images, the model achieved a perfect score in multiple consecutive runs, with the images split differently in every run.


Jeff, Lori, and Abby
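
In case you want to try something similar with your own photos, here is a minimal sketch of how you might load a small, folder-per-person dataset and split it 50/50. The faces directory name, the .jpg extension, and the folder layout are assumptions; adapt them to your own data:

from pathlib import Path
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Hypothetical layout: faces/<person>/<photo>.jpg -- one folder per person.
# np, preprocess_input, to_categorical, and train_test_split were imported earlier.
root = Path('faces')
names = sorted(p.name for p in root.iterdir() if p.is_dir())
images, labels = [], []

for index, name in enumerate(names):
    for path in sorted((root / name).glob('*.jpg')):
        img = load_img(path, target_size=(224, 224))   # load and resize each photo
        images.append(img_to_array(img))
        labels.append(index)

x = preprocess_input(np.array(images))   # same ResNet50 preprocessing as before
y = to_categorical(labels)

# Split 50/50, stratified so each person appears equally in both halves
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.5, stratify=y, random_state=0)

From there, the model definition and training code are the same as in the VGGFace example above, with class_count set to len(names).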


With a highly optimized set of weights, a neural network can do almost anything. Generic weights will suffice when nothing better is available, but given a task-specific set of weights to start with, transfer learning can be pure magic.

Get the Code

You can download a Jupyter notebook containing the facial-recognition example presented in this post from the deep-learning repo that I maintain on GitHub. Feel free to check out the other notebooks in the repo while you’re at it. Also be sure to check back from time to time because I am constantly uploading new samples and updating existing ones.
