OpenAI’s DALL-E 2 is one of the most famous AI art generators out there. Thanks to its predecessor DALL-E, a name that needs no introduction, DALL-E 2 found itself in the limelight right after its release. Apart from generating realistic, high-resolution images from text prompts, DALL-E 2 can perform inpainting and produce variations of a generated image while maintaining caption consistency and photorealism.
A Commercial Perspective
With the right prompt, DALL-E 2 can generate high-quality art for your business needs. It can help visualize different ideas and concepts in an instant, as opposed to employing an artist for simple, straightforward tasks. Furthermore, DALL-E 2 can improve your current design team’s efficiency by helping them explore different visual dimensions of an idea, or even get past an artist’s block.
In a nutshell
Given a text description of the desired image, DALL-E 2 first converts that text into a CLIP text embedding (blue), which a ‘prior’ model then translates into a CLIP image embedding (orange). This CLIP image embedding, optionally along with the caption, is passed to the decoder, which decodes the embedding into an image. Further, DALL-E 2 can take an existing image and edit specified portions of it such that the resulting image stays consistent with the input text prompt.
The Deep Dive
DALL-E 2 is presented as a two-stage model. The first stage finds the best representation of the user’s input text, and the second stage translates that representation into an actual image. DALL-E 2 leverages state-of-the-art algorithms for both of these stages: CLIP and GLIDE are the major contributors to its architecture.
CLIP (Contrastive Language-Image Pretraining):
If you’ve been active in the AI research space in the past few years, then CLIP is no stranger to you. It was developed by OpenAI, the same company behind GPT-3, DALL-E, and DALL-E 2. CLIP was designed to match images to captions in a zero-shot setting. It achieves this by training a text encoder and an image encoder to pair captions with their corresponding images on a dataset of (image, caption) pairs collected from various sources across the Internet. Both the text encoder and the image encoder create embeddings in the same embedding space. The embeddings created by CLIP are understood to be robust and to perform well in a zero-shot setting thanks to its contrastive style of learning: instead of predicting which caption belongs to which image, the model tries to understand how captions are related to their corresponding images, thereby enabling generalization across datasets. The training objective of CLIP is to create embeddings such that the image embedding (I1) of a picture (say, a dog) and the text embedding (T1) of the caption corresponding to that picture (‘Pepper the Aussie pup’) have the maximum dot product relative to the dot products with the text embeddings of all other captions. In other words, the diagonal elements of the batch similarity matrix need to be the maximum of their respective rows.
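The contrastive objective described above can be sketched in a few lines of numpy. This is a minimal, illustrative version (the temperature value and function names are our assumptions, not CLIP’s exact implementation): the (i, j) entry of the similarity matrix scores image i against caption j, and a symmetric cross-entropy pulls the diagonal entries to be the maximum of their rows and columns.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, caption) pairs.

    image_emb, text_emb: (N, D) arrays; row i of each comes from the
    same (image, caption) pair, so the targets are the diagonal.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: entry (i, j) = sim(image_i, caption_j)
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))   # targets = diagonal entries

    # Average the image->caption and caption->image directions
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits.T))
```

With perfectly matched pairs (each image embedding identical to its caption embedding and orthogonal to all others) the loss approaches zero, which is exactly the “diagonal is the maximum of its row” condition.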
GLIDE (Guided Language to Image Diffusion for Generation and Editing):
The motivation behind GLIDE is to generate images using diffusion models while guiding them with natural language. Diffusion models have become the go-to for most image-generation tasks thanks to their ability to generate high-quality synthetic images. Under the hood, diffusion models are trained to recover images after corrupting them with Gaussian noise, and the recovery process is stochastic. During the recovery phase in GLIDE, a transformer is used to ‘guide’ the process by influencing the backward diffusion process with the help of text tokens that represent the user prompt. This mechanism helps GLIDE leverage the power of diffusion and natural language to produce photorealistic images. GLIDE is also known for its inpainting capabilities: it can edit existing images given the user prompt and the area in the image that needs to accommodate the intended change.
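The “corrupting with Gaussian noise” step has a convenient closed form: you can jump straight to any corruption level t without simulating every step. A minimal sketch (the schedule values here are a toy assumption; real models typically use ~1000 carefully tuned steps):

```python
import numpy as np

def add_noise(x0, t, betas, rng):
    """Corrupt a clean image x0 to timestep t of the forward diffusion.

    Closed form: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise,
    where alpha_bar_t is the cumulative product of (1 - beta) up to step t.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# A toy linear noise schedule over 1000 steps
betas = np.linspace(1e-4, 0.02, 1000)
```

At small t the output is nearly the original image; by the final step alpha_bar is close to zero, so the “image” is essentially pure Gaussian noise. The denoising network is trained to run this process in reverse, and in GLIDE the text tokens steer each reverse step.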
A pre-trained CLIP text encoder is used to generate CLIP text embeddings from the user’s input text. These text embeddings are given to a prior model that generates CLIP image embeddings. The CLIP text encoder remains frozen during the training of both the prior model and the decoder.
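The overall data flow can be summarized in a short sketch. The stubs below are placeholders standing in for the real networks (the function names, embedding size, and output resolution are illustrative assumptions, not OpenAI’s actual API); only the wiring between the stages reflects the architecture described above:

```python
import numpy as np

def clip_text_encoder(caption):
    """Stand-in for the frozen CLIP text encoder."""
    rng = np.random.default_rng(abs(hash(caption)) % 2**32)
    return rng.normal(size=512)  # a fake 512-d CLIP text embedding

def prior(text_emb):
    """Stand-in for the prior: text embedding -> image embedding."""
    return np.tanh(text_emb)     # placeholder transformation

def decoder(image_emb, caption=None):
    """Stand-in for the diffusion decoder (GLIDE-based)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(64, 64, 3))  # placeholder 64x64 RGB output

def unclip(caption):
    text_emb = clip_text_encoder(caption)  # frozen during training
    image_emb = prior(text_emb)            # stage 1: the prior
    return decoder(image_emb, caption)     # stage 2: the decoder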
One obvious question is: why would you need an explicit model to generate CLIP image embeddings? Why not pass the text embeddings straight to the decoder, since both are embedded in the same space and have high dot products? To answer this, we need a better understanding of the mechanism of CLIP. CLIP was designed to match captions to images, and its training objective is to find those features in an image that are sufficient to match it to one single caption as opposed to multiple captions. Now, think about the number of ways you could imagine a picture of a dog. There are infinitely many possible pictures with a dog in them, and all of them are consistent with the caption ‘Picture of a dog’. Which picture of the dog should be picked? The prior model helps mitigate this problem: its job is to provide the best-suited CLIP image embedding given the CLIP text embedding. Since the CLIP text and CLIP image embeddings of an (image, caption) pair have high cosine similarity, one could still exclude the prior model and pass the text embeddings on to the decoder. As you might have guessed, the resulting images, while not inconsistent with the caption, are definitely better when the prior model is in the loop. Experiments have been conducted on the effectiveness of the prior; the figure below compares the results, and the source offers a much more detailed analysis (Section 5.1).
The role of the decoder is to generate an image from the CLIP image embeddings produced by the prior model. At the heart of the decoder is another OpenAI model, GLIDE. CLIP image embeddings contain just the high-level semantics: they represent a ‘gist’ of what the output image should look like while leaving out information that CLIP deems unnecessary. By adding GLIDE as the decoder, the authors intend to build on top of CLIP’s learnings. GLIDE has been slightly modified before being plugged into the architecture: the image embeddings from the previous step are added to GLIDE’s existing timestep embeddings, and the CLIP embeddings are additionally projected into four tokens of context that are concatenated to the output sequence of GLIDE’s text encoder. Effectively, the decoder is responsible for adding detail and realism while retaining the high-level semantics of the output image. The authors call the model unCLIP, since it produces images given text, the exact opposite of CLIP. The addition of GLIDE also lets DALL-E 2 inherit GLIDE’s inpainting capabilities.
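The two conditioning tweaks described above can be sketched in isolation. Everything here is an illustrative assumption (the dimensions, random projection matrices, and function name are ours); the point is just the shape of the two injection paths: one projection added to the timestep embedding, and one reshaped into four extra tokens appended to the text-encoder output:

```python
import numpy as np

d_model = 768  # transformer width (illustrative)
rng = np.random.default_rng(0)

# Learned projections in the real model; random stand-ins here
W_time = rng.normal(size=(512, d_model)) * 0.02      # into timestep embedding
W_ctx = rng.normal(size=(512, 4 * d_model)) * 0.02   # into 4 context tokens

def condition_decoder(clip_image_emb, time_emb, text_tokens):
    """Inject a CLIP image embedding into the decoder's two conditioning paths."""
    # 1) add a projection of the embedding to the timestep embedding
    time_emb = time_emb + clip_image_emb @ W_time
    # 2) project into 4 extra tokens, appended to the text-encoder sequence
    ctx = (clip_image_emb @ W_ctx).reshape(4, d_model)
    tokens = np.concatenate([text_tokens, ctx], axis=0)
    return time_emb, tokens
```

The appended context tokens let the transformer attend to the image embedding at every layer, while the timestep-embedding path conditions the per-step denoising directly.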
Since the base image is generated at a much lower resolution than the one the output image is delivered at, the model has trouble generating complex scenes: their details often come out a tad messy compared to relatively simpler scenes.
During the time we spent with DALL-E 2 in OpenAI’s Lab, we found the experience hassle-free and very intuitive, much like our pre-assembled AI blueprints for your AI needs. Interaction with the model was very straightforward. We observed that the wording of the prompts played a key role in the quality of the images generated: it is important to use the right words to get the best out of the algorithm. With that said, DALL-E 2 has amazed us quite a few times with its ability to go beyond imagination and create images that are incredibly abstract to think of.