SEEEN Face Recognition System

In recent years, advances in Deep Learning have enabled widespread use of face recognition technology. This article explains the SEEEN custom Deep Learning-based face recognition framework.

Formally, Face Recognition is defined as the problem of identifying faces in images or videos. Typically, the flow to recognise faces comprises three steps, shown in the following figure:

  1. Face detection - Detecting all faces in an image.
  2. Face embedding - Extracting the most important features from each detected face.
  3. Face classification - Classifying each face based on its extracted features.

There are various ways to implement each step of a face recognition pipeline. In our system, we perform face detection with MTCNN, face embedding with FaceNet, and classification through similarity search with Milvus and Redis. When processing video, every detected face is also tracked.

1. Face Detection using MTCNN

MTCNN (Multi-Task Cascaded Convolutional Neural Networks) is a neural network that detects faces and facial landmarks (five key points of a face) in images. It is one of the most popular face detection tools today.
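To make the detector output concrete, here is a minimal sketch of post-processing it. It assumes detections shaped like the dictionaries returned by the open-source `mtcnn` Python package (a `box` as [x, y, width, height], a `confidence` score and five named `keypoints`). The helper names, the sample values and the 0.9 confidence cut-off are illustrative assumptions, not part of the SEEEN code.

```python
def keep_confident_faces(detections, min_confidence=0.9):
    """Drop detections below a confidence cut-off (0.9 is an assumed value)."""
    return [d for d in detections if d["confidence"] >= min_confidence]


def box_to_corners(box, image_width, image_height):
    """Convert [x, y, width, height] to (left, top, right, bottom), clipped to the image."""
    x, y, w, h = box
    left, top = max(0, x), max(0, y)
    right, bottom = min(image_width, x + w), min(image_height, y + h)
    return left, top, right, bottom


# Hypothetical detections for a 320x240 image, in the assumed layout.
detections = [
    {
        "box": [-4, 20, 60, 80],  # boxes can extend slightly outside the image
        "confidence": 0.99,
        "keypoints": {
            "left_eye": (15, 50), "right_eye": (40, 50), "nose": (27, 65),
            "mouth_left": (17, 82), "mouth_right": (38, 82),
        },
    },
    {
        "box": [200, 40, 30, 30],
        "confidence": 0.42,  # likely a false positive
        "keypoints": {
            "left_eye": (208, 50), "right_eye": (222, 50), "nose": (215, 56),
            "mouth_left": (209, 63), "mouth_right": (221, 63),
        },
    },
]

faces = keep_confident_faces(detections)
corners = [box_to_corners(d["box"], 320, 240) for d in faces]
print(corners)
```

The clipped corners can then be used to crop each face before it is passed to the embedding step.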

2. Face Embedding using FaceNet

FaceNet is a deep neural network used for extracting features from an image of a face.

FaceNet takes an image of a face as input and outputs a vector of 512 numbers, which represents the most important features of the face. In machine learning, this vector is called an embedding.

Ideally, embeddings of similar faces are also similar. Mapping high-dimensional data (like images) into low-dimensional representations (embeddings) has become a fairly common practice in machine learning.

Embeddings are vectors, and vectors can be seen as points in a Cartesian coordinate system. We can therefore plot an image of a face as a point in that coordinate system using its embedding, and two similar faces should map to two points that are close to each other. Thus, one possible way of recognising a person in a new, previously unseen image is to calculate the embedding of this person’s face and then calculate its distances to the embeddings of the faces of known people. When the distance to an embedding of person X is small enough, we say that the image contains the face of person X.
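The distance-based idea described above can be sketched in a few lines. This is only an illustration: it uses 3-dimensional toy vectors instead of 512-dimensional FaceNet embeddings, and the Euclidean metric, the names and the `max_distance` value are assumptions chosen for the example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(query, known_embeddings, max_distance):
    """Return the name of the closest known face, or None if nothing is close enough."""
    best_name, best_distance = None, float("inf")
    for name, embedding in known_embeddings.items():
        distance = euclidean_distance(query, embedding)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= max_distance else None


# Toy 3-dimensional "embeddings" of known people (hypothetical values).
known = {
    "Alice": [0.1, 0.9, 0.2],
    "Bob": [0.8, 0.1, 0.5],
}

print(identify([0.15, 0.85, 0.25], known, max_distance=0.5))  # close to Alice
print(identify([-0.9, -0.9, -0.9], known, max_distance=0.5))  # far from everyone
```

A linear scan like this is fine for a handful of people. For large collections of embeddings, a dedicated vector search engine is the more practical choice.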

Having a dataset of face images, with multiple images for each person, FaceNet can be trained as follows:

  1. Randomly select an anchor image.
  2. Randomly select an image of the same person (positive example).
  3. Randomly select an image of a different person (negative example).
  4. Adjust the FaceNet network parameters so that the positive example is closer to the anchor than the negative example.

We repeat these steps until no further adjustments are needed, which means that all the faces of the same person are close to each other and far from the faces of other people. The loss function used in this method of learning from anchor, positive and negative examples is called the triplet loss.
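As a rough illustration of what step 4 optimises, the triplet loss for a single (anchor, positive, negative) triplet can be computed as below. The squared Euclidean distance and the margin follow the usual formulation of the loss. The margin value of 0.2 and the 2-dimensional vectors are illustrative assumptions, and the actual parameter update (backpropagation) is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the positive is closer to the anchor than the negative by at least `margin`."""
    return max(
        0.0,
        squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin,
    )


anchor = [0.0, 0.0]
positive = [0.1, 0.0]        # same person, already close to the anchor
easy_negative = [1.0, 0.0]   # different person, already far away
hard_negative = [0.2, 0.0]   # different person, still too close

print(triplet_loss(anchor, positive, easy_negative))  # no loss, nothing to adjust
print(triplet_loss(anchor, positive, hard_negative))  # positive loss, training pushes it away
```

Triplets with zero loss do not change the network. Only triplets that violate the margin contribute to the adjustment.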

3. Face Classification using Milvus and Redis

In our approach, the classification step is done by calculating the embedding similarity between a new face and the known faces. We chose the Milvus framework to perform this similarity search.

Milvus is an open source distributed vector search engine that provides state-of-the-art similarity search and analysis of feature vectors and unstructured data. Some of its key features are:

  • GPU-accelerated search engine
  • Intelligent index
  • Strong scalability
  • High compatibility
  • Billion-scale similarity search in real time

Basically, Milvus indexes the embedding vectors extracted from images and makes them available for fast search. The embeddings that we extract from our face images are indexed by Milvus and then matched against new faces in the search phase.

Milvus assigns a unique id (referred to as milvus_id) to each vector. To store people’s names, we use a Redis datastore that maps each milvus_id to a person’s name. This allows us to retrieve the name of the person behind a matched vector.

Milvus also provides the ability to configure groups of vectors called collections. We use these collections to customise the service at user level (i.e. users in the system may have their own group of faces that they wish to identify).
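The division of labour between the vector index, the name store and the per-user collections can be sketched with plain Python objects. The `ToyVectorStore` class and the `names` dictionary below are hypothetical in-memory stand-ins; they are not the Milvus or Redis client APIs. Cosine similarity is an assumed metric, and the vectors are 2-dimensional toy values.

```python
import itertools
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector search engine with named collections."""

    def __init__(self):
        self._collections = {}          # collection name -> {vector id -> vector}
        self._ids = itertools.count(1)  # ids are unique across all collections

    def insert(self, collection, vector):
        vector_id = next(self._ids)
        self._collections.setdefault(collection, {})[vector_id] = list(vector)
        return vector_id

    def search(self, collection, query, top_k=1):
        """Return up to top_k (vector id, similarity) pairs, most similar first."""
        vectors = self._collections.get(collection, {})
        scored = sorted(
            ((cosine_similarity(query, v), vid) for vid, v in vectors.items()),
            reverse=True,
        )
        return [(vid, score) for score, vid in scored[:top_k]]


store = ToyVectorStore()
names = {}  # stand-in for the id -> name mapping kept in the key-value store

# Each user gets a separate collection of known faces (hypothetical names).
alice_id = store.insert("user_a_faces", [1.0, 0.0])
names[alice_id] = "Alice"
bob_id = store.insert("user_b_faces", [0.0, 1.0])
names[bob_id] = "Bob"

best_id, best_score = store.search("user_a_faces", [0.9, 0.1])[0]
print(names[best_id], round(best_score, 3))
```

Because a search only looks inside the requested collection, a face indexed by one user is never matched for another user.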

The Framework

Our face recognition framework works in two modes, shown in the next figure.

Indexing mode

In this mode, we create or update the embeddings dataset (e.g. one or more Milvus collections).

To add known people whom we want the system to recognise, we index them as follows:

  1. We first provide some images that contain these people.
  2. The system detects all faces in each image using MTCNN.
  3. We then manually review the detected faces. We delete all faces that we do not need to recognise, and label the remaining faces with their names.
  4. The system extracts the embedding of each labelled face with FaceNet.
  5. The embeddings are added to a Milvus collection.
  6. Milvus generates an id for each embedding, and the system stores this id with the name of the person in a Redis datastore.
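The indexing steps above can be sketched as a single loop. Every callable passed in (`detect_faces`, `review_label`, `embed_face`, `add_vector`) is a hypothetical placeholder for the corresponding component (MTCNN, the manual review, FaceNet and the Milvus insert), and `name_store` stands in for the Redis mapping. The demo stubs operate on strings purely so that the flow can be run without any model.

```python
def index_images(images, detect_faces, review_label, embed_face, add_vector, name_store):
    """Index the labelled faces found in `images`; return how many were indexed."""
    indexed = 0
    for image in images:
        for face in detect_faces(image):       # step 2: detect all faces
            name = review_label(face)          # step 3: manual review, None = discard
            if name is None:
                continue
            embedding = embed_face(face)       # step 4: extract the embedding
            vector_id = add_vector(embedding)  # step 5: add to the collection, get an id
            name_store[vector_id] = name       # step 6: remember id -> name
            indexed += 1
    return indexed


# --- Demo with stubs (all values are made up for illustration) ---
images = [["alice_face", "stranger_face"], ["bob_face"]]  # each "image" lists its faces
manual_labels = {"alice_face": "Alice", "bob_face": "Bob"}  # outcome of the manual review

vectors = []


def add_vector(embedding):
    """Append to a list and return a 1-based id, mimicking an id generated on insert."""
    vectors.append(embedding)
    return len(vectors)


names = {}
count = index_images(
    images,
    detect_faces=lambda image: image,
    review_label=manual_labels.get,
    embed_face=lambda face: [float(len(face))],  # fake one-number embedding
    add_vector=add_vector,
    name_store=names,
)
print(count, names)
```

In this sketch the review happens face by face inside the loop; in practice the manual review is more likely a separate batch step between detection and embedding.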

Search mode

To check if there is a known person in a new unseen image:

  1. The system first detects all the faces in that image.
  2. It extracts the FaceNet embedding of each face.
  3. Milvus calculates a similarity score between each extracted embedding and its indexed embeddings.
  4. If a similar face (embedding similarity score above 0.8) is retrieved, its milvus_id is sent to the face storage in Redis to retrieve the name.
  5. Finally, the search returns the detected faces that were recognised.
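The decision made in steps 4 and 5 can be sketched as follows. The 0.8 threshold comes from the description above; the function name, the shape of `hits` (one best `(milvus_id, score)` pair per detected face, or `None` when nothing was retrieved) and the sample values are assumptions made for the example.

```python
SIMILARITY_THRESHOLD = 0.8  # a match must score above this value


def resolve_names(hits, name_store, threshold=SIMILARITY_THRESHOLD):
    """Map the best hit of each detected face to a name, or None if unrecognised."""
    results = []
    for hit in hits:
        if hit is None:                 # the search returned nothing for this face
            results.append(None)
            continue
        vector_id, score = hit
        if score > threshold:
            results.append(name_store.get(vector_id))  # id -> name lookup
        else:
            results.append(None)        # most similar face is not similar enough
    return results


names = {1: "Alice", 2: "Bob"}                # hypothetical id -> name mapping
hits = [(1, 0.93), (2, 0.61), None]           # three detected faces in one image
print(resolve_names(hits, names))
```

Raising the threshold reduces false identifications at the cost of missing more genuine matches, so the value is worth tuning on representative data.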

The code can run on one or more CPUs or GPUs. We tested it on an Nvidia RTX 2080 Ti GPU, where it runs close to real time: one minute of video takes about one minute to process.

Watch this video for a demonstration of how the SEEEN Face Recognition System works.