The field of artificial intelligence has seen tremendous progress in recent years, particularly in the areas of vision and language. The convergence of these two fields has given rise to Vision-Language models, which are designed to understand and generate natural language based on visual information.
However, developing and training these models is a complex and costly process. End-to-end training cost is one of the biggest challenges researchers and practitioners face in this area, because training these models requires significant computational resources and large amounts of high-quality data. But what if you could take pre-trained models and build a Vision-Language system around them? Say hello to BLIP-2.
A Commercial Perspective
The integration of Vision-Language models into businesses can significantly improve productivity and the customer experience. One way this is achieved is by automating image captioning, labeling, and segmentation tasks, which can save time and resources for businesses that handle large volumes of visual data.
For example, e-commerce platforms can use these models to automatically label and categorize their product images, making it easier for customers to search for products and for businesses to manage their inventory. Moreover, by incorporating visual descriptions into product descriptions, businesses can provide more accurate and detailed information to customers, enhancing the overall shopping experience. This is particularly useful for products where visual information is critical, such as fashion and home decor items.
Technical Deep Dive
What is the problem space?
Vision-Language is an interdisciplinary research area within Artificial Intelligence (AI) that integrates visual and linguistic information. The goal is to enable machines to understand and interpret the relationships between visual and textual information and perform tasks requiring joint comprehension of these modalities. In Vision-Language tasks, the input consists of an image or video and a corresponding textual description, such as a caption or a question. The task is then to infer a relationship between the two modalities, for example, by generating a description of the image or by answering a question about the image.
Where does BLIP-2 come in?
AI is expensive. By one rough estimate, training costs around a dollar per 1,000 parameters, and LLMs typically have billions of parameters. These huge computational costs create a need for a training strategy that saves both time and money. Enter BLIP-2.
The same applies to MLOps too. Building and deploying ML pipelines from scratch every time you need an ML solution is demanding in terms of time, infrastructure, and workforce. Our AI Engine and AI blueprints were designed to address this issue and will help you get your models into production faster.
BLIP-2 is a training framework that leverages pre-trained Vision and Language models. One immediate concern with this approach is compatibility: the Vision and Language models were trained separately, so the visual features produced by one are not something the other was trained to read. To bridge this modality gap, BLIP-2 uses a Query Transformer (Q-Former). The Q-Former is lightweight. It comprises 188M parameters, which is still a lot but significantly fewer than most state-of-the-art Vision and Language models, which tend to have on the order of billions.
How does BLIP-2 work? The Architecture.
BLIP-2 needs an Image Encoder and an LLM, both frozen. When a model is "frozen," its weights are kept fixed: they receive no updates during training, and the model is used only to compute outputs, as it would be for inference. The frozen Image Encoder is connected to the Q-Former, which is in turn connected to the frozen LLM. The Q-Former is the only trained component, and its purpose is to bridge the modality gap between the Image Encoder and the LLM.
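To make "frozen" concrete, here is a minimal sketch of a gradient step that skips frozen weights. The `Param` class, the parameter names, and the single-scalar "models" are all hypothetical stand-ins for illustration; real frameworks track this per tensor, and none of this is BLIP-2's actual code.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A single scalar weight plus a flag saying whether training may change it."""
    value: float
    trainable: bool = True


def sgd_step(params, grads, lr=0.1):
    """Apply one gradient-descent step, leaving frozen parameters untouched."""
    for name, p in params.items():
        if p.trainable:
            p.value -= lr * grads[name]


# Hypothetical toy setup: one weight per component.
params = {
    "image_encoder.w": Param(0.5, trainable=False),  # frozen
    "q_former.w": Param(0.2, trainable=True),        # the only part being trained
    "llm.w": Param(-1.0, trainable=False),           # frozen
}
grads = {"image_encoder.w": 1.0, "q_former.w": 1.0, "llm.w": 1.0}

sgd_step(params, grads)
```

After the step, only `q_former.w` has moved. That is where the cost saving comes from: the two large models are never updated.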
The Q-Former comprises two transformer submodules: an image transformer, which interacts with the frozen Image Encoder and extracts visual features, and a text transformer, which can act as both an encoder and a decoder. A set of 32 learnable query embeddings is given as input to the image transformer; it is these queries that extract visual information from the image. The Q-Former is trained using Image-Text pairs. The idea is to train the Q-Former to extract the visual representations that are most informative of the text. During training, the Q-Former is optimized for three objectives: 1. Image-Grounded Text Generation, 2. Image-Text Contrastive Learning, and 3. Image-Text Matching.
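The way a fixed set of queries pulls information out of an image can be sketched with plain cross-attention. This is a simplified illustration, not the Q-Former itself: it assumes a single attention head and no learned projection matrices, and the feature width (`DIM`) and patch count (`NUM_PATCHES`) are toy values we picked. Only the number of queries, 32, comes from the description above.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, image_feats):
    """Each query attends over all image features and returns a weighted average."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in image_feats]
        weights = softmax(scores)
        outputs.append(
            [sum(w * f[i] for w, f in zip(weights, image_feats)) for i in range(dim)]
        )
    return outputs


random.seed(0)
NUM_QUERIES, DIM, NUM_PATCHES = 32, 8, 50  # DIM and NUM_PATCHES are toy values
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
image_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

Z = cross_attention(queries, image_feats)  # one output embedding per query
```

Whatever the number of image patches, the output always has one embedding per query, so the image is summarized into a fixed-size set of 32 vectors.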
Image-Grounded Text Generation.
During this task, the Q-Former is expected to generate text given an image. The idea behind this task is to train the queries to extract the visual features from the given image that are most informative of the text.
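One detail not covered above: as we understand the BLIP-2 paper, this objective is enforced with an attention mask in which the queries see each other but not the text, while each text token sees all queries and the text before it. The text can then only learn about the image through the queries. The sketch below builds such a mask; the function name and layout (queries first, then text tokens) are our own assumptions for illustration.

```python
def multimodal_causal_mask(num_queries, num_text):
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Positions 0..num_queries-1 are queries; the remaining positions are text tokens.
    """
    n = num_queries + num_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_queries:
                # Queries attend to all queries, never to text.
                mask[i][j] = j < num_queries
            else:
                # Text attends to all queries and to text up to and including itself.
                mask[i][j] = j < num_queries or j <= i
    return mask


# Toy sizes: 2 queries and 3 text tokens instead of 32 queries and a full caption.
mask = multimodal_causal_mask(num_queries=2, num_text=3)
```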
Image-Text Contrastive Learning.
During this task, the Q-Former learns to align the output of the image transformer, Z, with the output of the text transformer, t. The idea is to maximize the mutual information in positive (matching) image-text pairs. Z contains multiple output embeddings, one for each query. As a result, the alignment is computed pair-wise: the text embedding t is compared against each output embedding in Z, and the highest of these similarities is used as the image-text similarity.
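The pair-wise comparison can be sketched in a few lines. We assume cosine similarity as the measure and take the maximum over queries as the image-text score, which is how we read the BLIP-2 paper; the vectors are made-up two-dimensional toys.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_text_similarity(Z, t):
    """Compare t against every query output in Z and keep the best match."""
    return max(cosine(z, t) for z in Z)


# Toy example: 3 query outputs instead of 32, 2 dimensions each.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
t_match = [2.0, 2.0]   # points the same way as the third query output
t_other = [-1.0, 0.0]  # points away from most of Z

sim_match = image_text_similarity(Z, t_match)
sim_other = image_text_similarity(Z, t_other)
```

A contrastive loss would then push `sim_match` up for true image-text pairs and push the similarity down for mismatched pairs in the same batch.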
Image-Text Matching.
This is a simple binary classification task. The model is trained to predict whether a given image-text pair is a match or not. The idea is to train the output query embeddings, Z, to capture multimodal information. Each output query embedding in Z is given to a classifier, and the corresponding matching score is computed.
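The matching score can be sketched as a small two-class classifier applied to every query output. Averaging the per-query logits follows our reading of the BLIP-2 paper; the classifier weights and the embeddings below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math


def linear_two_class(z, weights, bias):
    """Return [no-match logit, match logit] for one query output embedding."""
    return [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weights, bias)]


def matching_probability(Z, weights, bias):
    """Score every query output, average the logits, and softmax into P(match)."""
    per_query = [linear_two_class(z, weights, bias) for z in Z]
    avg = [sum(logits[c] for logits in per_query) / len(per_query) for c in range(2)]
    m = max(avg)
    exps = [math.exp(a - m) for a in avg]
    return exps[1] / sum(exps)


# Toy example: 2 query outputs instead of 32, 2 dimensions each.
Z = [[1.0, 0.0], [0.0, 1.0]]
weights = [[-1.0, -1.0], [1.0, 1.0]]  # hypothetical classifier weights
bias = [0.0, 0.0]

p_match = matching_probability(Z, weights, bias)
```

During training, a binary cross-entropy loss on this probability would tell the Q-Former whether it judged the pair correctly.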
The Role of LLM.
The motivation behind the LLM is to leverage its generative language capabilities. A fully connected layer transforms the output of the Q-Former (Z) from the Q-Former's dimension into the input dimension of the LLM. The projected output is then prepended to the input text embeddings of the LLM. Because Z is a small, fixed set of embeddings, it acts as an information bottleneck that passes only the most useful information from the image to the LLM.
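The projection-and-prepend step can be sketched as follows. The dimensions, weights, and embeddings are toy values we made up; in the real system the fully connected layer is learned and the widths are far larger.

```python
def project(Z, W, b):
    """Fully connected layer: map each query output from Q-Former width to LLM width."""
    return [
        [sum(z_i * W[i][j] for i, z_i in enumerate(z)) + b[j] for j in range(len(b))]
        for z in Z
    ]


def build_llm_input(Z, text_embeds, W, b):
    """Prepend the projected query outputs to the text token embeddings."""
    return project(Z, W, b) + text_embeds


# Toy sizes: Q-Former width 2, LLM width 3, 2 queries instead of 32.
Z = [[1.0, 2.0], [0.0, 1.0]]
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]              # shape: Q-Former width x LLM width
b = [0.0, 0.0, 0.5]
text_embeds = [[9.0, 9.0, 9.0]]    # one hypothetical text token embedding

llm_input = build_llm_input(Z, text_embeds, W, b)
```

From the LLM's point of view, the projected queries are just extra tokens at the start of the prompt, which is why the LLM itself needs no changes.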
Since BLIP-2 uses LLMs to generate text, it inherits the risks of LLMs, including producing offensive language. BLIP-2 is also prone to generating factually incorrect text, following incorrect reasoning paths, and producing information that is not up to date.
During our time with BLIP-2, we found it impressive in its ability to provide detailed and relevant descriptions of images, even the trickier ones (samples below). On the other hand, its answers to text prompts were average at best and often fell short of what we expected. Nevertheless, we were able to extract useful information by applying the ‘chain of thought’ prompting style, where the user tries to get the model to ‘think’ through the problem. We believe swapping out the LLM used in the demo (Flan-T5) could address this issue because, at the end of the day, BLIP-2 is a training framework. Additionally, the flexibility to choose the image encoder and LLM allows for greater customization and personalization, much like our AI Engine, where you can choose the frameworks you want to use in your MLOps tech stack. Overall, we feel that this model has great potential for use in a wide range of Vision-Language applications.
Samples: Image Captioning
Samples: Image Captioning + Visual Question Answering
User: How fast can this car go?
Response: the ferrari f1 car has a top speed of 200 mph.
User: Name the people in the picture
Response: ferrari f1 driver fernando alonso and ferrari f1 driver kimi raikkonen. [Factually Incorrect (Outdated information)]
User: caption this picture
Response: is from the british grand prix. [Factually Correct, but not an ideal caption]
User: Can you describe the picture?
Response: a group of racing cars on a track with smoke coming out of them.
User: Whats the cause of the smoke?
Response: a car crashed into another car.
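The last exchange shows the prompting pattern that worked for us: ask for a description first, then ask the follow-up with the earlier answer in context. Here is a minimal sketch of how such a prompt could be assembled. The `build_prompt` helper and the "Question: ... Answer:" template are our own assumptions for illustration, not part of the demo's interface.

```python
def build_prompt(history, question):
    """Join earlier (question, answer) turns with the new question into one prompt."""
    parts = [f"Question: {q} Answer: {a}" for q, a in history]
    parts.append(f"Question: {question} Answer:")
    return " ".join(parts)


# The first turn is taken from the sample exchange above.
history = [
    (
        "Can you describe the picture?",
        "a group of racing cars on a track with smoke coming out of them.",
    )
]
prompt = build_prompt(history, "Whats the cause of the smoke?")
```

The resulting string would be passed to the model alongside the image, so the second answer can build on the first.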