CLIP

An overview of CLIP, a model that connects the image and text modalities by aligning text with images
Kaushik Tiwari
Founder @SNR.Audio
April 22, 2024

Modern computer vision systems are like helpful robots, but they are usually programmed to recognize only a small, fixed set of categories. This is limiting, and it means they need extra labelled data and training to recognize anything else. You may also read a paper or see a model that claims to be the best of the best, yet it doesn't live up to the hype in the real world; this is usually because its reported results are heavily optimized for the benchmark they were measured on. The authors of CLIP have a solution to this very problem: "Why not learn directly from raw text about images?"

How does CLIP work?

CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the dataset. This behavior then turns CLIP into a zero-shot classifier: all classes in a dataset are converted into captions such as "a photo of a dog", and CLIP predicts the class whose caption it estimates best pairs with a given image.
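
As a small illustration of that conversion step, a hypothetical label set can be turned into captions like this; the class names are placeholders, and only the "a photo of a …" template comes from the example above.

```python
# Hypothetical class labels; any dataset's label set could be used here.
class_names = ["dog", "cat", "car"]

# Each class name becomes a caption that CLIP's text encoder can embed.
captions = [f"a photo of a {name}" for name in class_names]
print(captions)  # ['a photo of a dog', 'a photo of a cat', 'a photo of a car']
```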

CLIP Pretraining

CLIP was built to address the following challenges, in no particular order:

  1. Costly, manually labelled datasets
  2. Narrow scope of benchmark datasets (going beyond a fixed set of 10, 100, 1,000, or N classes)
  3. Poor real-world performance of models that look strong on benchmarks

Pretraining the CLIP Model

Figure: CLIP contrastive pre-training

The Contrastive Language-Image Pre-training (CLIP) model involves the following process:

  1. We start with paired text and image inputs.
  2. At each step, a batch is randomly sampled from a large dataset of these pairs.
  3. The text is converted into feature vectors (T1, T2, …, TN) using a text encoder, while the images are similarly transformed using an image encoder, resulting in image feature vectors (I1, I2, …, IN).
  4. We then calculate the cosine similarities between the text and image vectors. In the figure above, the matching text and image pairs are highlighted in blue (I1.T1, I2.T2, …, IN.TN).
  5. For contrastive pre-training, we want these highlighted similarity values to be high, which indicates that matching texts and images are mapped to nearby regions of the feature space, while the remaining (mismatched) similarities should be low.
  6. To accomplish this, we treat the similarity values as logits scaled by a (learned) temperature and feed them into a softmax classifier. This effectively turns the problem into a classification task, and we aim to minimise the cross-entropy loss: each image should assign high probability to its paired text, and vice versa.
  7. We repeat this over the whole batch, in both directions (image-to-text and text-to-image), and minimise the average cross-entropy loss, which corresponds to the InfoNCE loss, as sketched below.
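
To make step 7 concrete, here is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE-style) loss. This is not the authors' exact implementation: the encoders are assumed to have already produced batches of feature vectors, and the temperature is a fixed constant here rather than a learned parameter as in CLIP.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of N paired image/text features.

    image_features, text_features: (N, d) outputs of the image and text encoders.
    temperature: fixed here for simplicity; CLIP learns it during training.
    """
    # L2-normalise so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of scaled similarities; the diagonal holds the matching pairs.
    logits = image_features @ text_features.t() / temperature

    # The correct "class" for row i (and for column i) is index i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy over rows (image -> text) and over columns (text -> image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

The same similarity matrix serves both directions; only the axis over which the softmax is taken changes.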

CLIP Inference and Additional Details about the Dataset

Figure: CLIP zero-shot inference
  1. Each potential class label is converted into a caption and passed through the pre-trained text encoder to generate the text feature vectors (T1, T2, T3, …, TN).
  2. The image to be classified is input into the pre-trained image encoder to generate the image feature vector (I1).
  3. The cosine similarity between each of the text feature vectors (T1, T2, T3, …, TN) and the image feature vector (I1) is calculated.
  4. The text feature vector that yields the highest cosine similarity is deemed the label of the image. This essentially means we're picking the text feature vector that is closest in angular distance to the image feature vector. As shown in the figure above, the highest value is dog.
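
The four steps above can be sketched as follows, again as a rough PyTorch illustration: image_encoder, text_encoder, and tokenizer are placeholders for pre-trained CLIP components rather than a specific library's API, and the caption template reuses the "a photo of a …" construction from earlier.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenizer):
    """Return the class whose caption embedding is most similar to the image embedding.

    image: a single image tensor of shape (C, H, W).
    image_encoder / text_encoder / tokenizer: stand-ins for pre-trained CLIP components.
    """
    # 1. Turn each class name into a caption and encode it: T1 .. TN.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_features = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)    # (N, d)

    # 2. Encode the image: I1.
    image_features = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, d)

    # 3. Cosine similarity between I1 and every text feature vector.
    similarities = (image_features @ text_features.t()).squeeze(0)           # (N,)

    # 4. The caption with the highest similarity gives the predicted label.
    return class_names[similarities.argmax().item()]
```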

Dataset and Training Details

  1. The CLIP model is trained on 400 million image-text pairs collected from the internet.
  2. The training batch size is 32,768.
  3. Training runs for 32 epochs over the dataset.
  4. Cosine learning rate decay is applied.
  5. The image encoder is either ResNet-based or ViT-based.
  6. The text encoder is Transformer-based.
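
As a rough sketch of how the cosine learning rate decay fits with the numbers above, one might set up the schedule like this in PyTorch. The model, optimizer, learning rate, and weight decay are placeholders rather than the paper's exact values; only the dataset size, batch size, and epoch count come from the list above.

```python
import torch

# Placeholder model and optimizer; the learning rate and weight decay are illustrative.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.2)

# Figures stated above: 400M image-text pairs, batch size 32,768, 32 epochs.
dataset_size = 400_000_000
batch_size = 32_768
epochs = 32
total_steps = (dataset_size // batch_size) * epochs

# Cosine learning rate decay over the whole run; scheduler.step() is called once
# after each optimizer step during training.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```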
