Seeen Visual Search

The Seeen platform provides a powerful tool for efficient and effective exploration of large visual datasets through free-text querying. This search builds on the most recent advances in deep learning for joint textual and visual analysis.

In this article we explain the main components of our visual search platform.

Image retrieval and deep features-based indexing

Nowadays a person can store tens of gigabytes of visual content on a smartphone, and organisations that generate content (such as media companies) have to manage collections that are orders of magnitude larger. When this content is well exploited and explored, it can play a crucial role in a company's success at various levels (reach, core business, exposure, analytics, decision making, new business opportunities, etc.).

Image indexing and retrieval is the core component that enables organising the visual content of any organisation. Traditional image search relies on a textual description to find the image that best matches a given query. Direct content-based image search is more challenging, since images are more complex and convey more information (colour, texture, objects, etc.). The key issue is how to compose a compact, high-level semantic representation of the image that can later be matched against a plain-text query.

Recent advances in deep learning can provide more accurate and scalable solutions for visual search. Beyond the many applications and domains where deep neural networks have proven to excel, such as image classification, object detection and tracking, they can also be used to produce high-level compact representations of visual and textual data separately. More recently, it has become possible to build and train neural networks that consume both visual and textual data at the same time and provide a unified compact representation for both types of data based on their semantic meaning.

Unified textual-visual representation space

The basic idea of using a neural network for image search is to transfer the plain visual and textual information into a common unified space, where pieces of information that refer to the same semantic concept are mapped to relatively similar representations. For example, the colour red in a picture and the word “red” would ideally have similar representations after being processed by the neural network.
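As a minimal sketch of this idea, relatedness in the unified space can be measured with cosine similarity. The 4-dimensional vectors below are made up for illustration (a real model produces hundreds of dimensions, and these are not Seeen's actual embeddings), but the geometry is the same:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings in the unified textual-visual space.
red_pixels_emb = np.array([0.9, 0.1, 0.0, 0.2])   # image patch of a red car
word_red_emb   = np.array([0.8, 0.2, 0.1, 0.1])   # the token "red"
word_crowd_emb = np.array([0.0, 0.1, 0.9, 0.7])   # the token "crowd"

# Semantically related concepts end up close in the space (~0.98)...
print(cosine_similarity(red_pixels_emb, word_red_emb))
# ...while unrelated ones end up far apart (~0.14).
print(cosine_similarity(red_pixels_emb, word_crowd_emb))
```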

For textual data, transformer-based models like BERT and GPT can be used as encoders to produce high-level representations. These models have proven to provide the best results for a variety of tasks in the NLP domain. The same kind of structure can be adapted to construct a visual encoder, and both types of encoders can be used together to solve various visual-linguistic tasks.

This can be achieved by training the textual and visual encoders simultaneously on tasks that require solving textual and visual problems jointly. For example, a neural network is built on top of these encoders and takes as input a pair consisting of an image and a sentence that describes its content. Before feeding the input to the network, parts of the sentence are masked out, and the goal of the network is to recover these parts. The same idea can be applied the opposite way: parts of the image are cropped out, and the neural network should predict the objects in the image with the help of the description. These tasks force the textual and visual encoders to work together until they converge and start to produce similar representations for semantically similar concepts.
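The masked-sentence side of this training setup can be sketched as a simple input-preparation step (a toy illustration, not the actual training code; the `[MASK]` token convention follows BERT):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(sentence, mask_ratio=0.15, seed=0):
    """Replace a fraction of the tokens with [MASK]. Returns the masked
    sentence plus the (position -> original token) targets that the
    network must recover from the remaining words and the paired image."""
    rng = random.Random(seed)
    tokens = sentence.split()
    n_masked = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_masked)
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK_TOKEN
    return " ".join(tokens), targets

masked, targets = mask_tokens("a red car drifting on a wet track")
print(masked)   # the sentence with one token replaced by [MASK]
print(targets)  # the hidden word the model is trained to predict
```

During training, the loss rewards the network for predicting the hidden tokens correctly, which it can only do reliably by also consulting the image.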

Once this is done, an image can be considered relevant to a textual query based on the similarity between the representation of the image on one side and that of the query text on the other.

Seeen Visual Search

An overview of the Seeen image search system is shown in the next figure. It comprises two modules: offline image indexing, and online search.

Offline Image Indexing

Because extracting visual features is computationally expensive, indexing is done offline using a deep learning model (an R-CNN in our case). This process runs each time new images are added to the image collection, and the extracted features are then saved in a data store.
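A minimal sketch of this offline stage follows. The `extract_features` stub and the in-memory dict are stand-ins for the R-CNN model and the data store, both assumed here purely for illustration:

```python
import numpy as np

def extract_features(image_path, dim=128):
    """Stand-in for the R-CNN feature extractor: in production this
    runs the deep model on the decoded image. Here we derive a
    deterministic pseudo-feature vector from the path for demo purposes."""
    rng = np.random.default_rng(sum(image_path.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)   # unit-norm feature vector

def index_images(image_paths, store):
    """Offline indexing: extract features for each new image and
    persist them in the data store, keyed by image path."""
    for path in image_paths:
        if path not in store:      # only newly added images are processed
            store[path] = extract_features(path)
    return store

feature_store = {}
index_images(["img/car_001.jpg", "img/crowd_014.jpg"], feature_store)
print(len(feature_store))  # 2
```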

Online Search

The data processing in this module happens in real time: when the user enters a textual query, the system triggers the following steps:

  1. The query is encoded using the Text-Encoder model (e.g. BERT).
  2. The system then iteratively fetches features from the data store and encodes them using the Visual-Encoder.
  3. The final step is to pass the encoded visual and textual information to a visual-textual model that produces a ranked list of images based on the degree of relevance between the encoded textual and visual features.
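The three steps above can be sketched end to end. The query encoder and the scoring function here are simplified stand-ins (a real system uses the trained text and visual encoders and their learned relevance model, not random projections):

```python
import numpy as np

def encode_query(query, dim=128):
    """Stand-in for the text encoder (step 1): derives a deterministic
    pseudo-embedding from the query string for demonstration."""
    rng = np.random.default_rng(sum(query.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def search(query, feature_store, top_k=3):
    """Steps 2-3: fetch stored visual features and rank images by
    cosine similarity with the encoded query (vectors are unit-norm,
    so a dot product equals the cosine similarity)."""
    q = encode_query(query)
    scores = {img: float(np.dot(q, feat)) for img, feat in feature_store.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy feature store: unit-norm vectors standing in for R-CNN features.
rng = np.random.default_rng(42)
store = {f"img_{i:03d}.jpg": (v := rng.standard_normal(128)) / np.linalg.norm(v)
         for i in range(10)}
results = search("drifting red car", store)
print(results)  # [(image_id, score), ...] with the best match first
```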

The Seeen system allows easy exploration of a dataset by posting textual queries and finding relevant images.

The next three screenshots show examples of three queries and their results:

  • Drifting red car
  • Crowd
  • Man standing in front of a white car

Watch this video for a demonstration of how the search works.

Search query “Drifting red car”
Search query “Crowd”
Search query “Man standing in front of a white car”