DALL-E 2

How does OpenAI's groundbreaking DALL-E 2 model actually work? Check out this detailed guide to learn the ins and outs of DALL-E 2.
October 9, 2024
  ·  
5 min read

Introduction

OpenAI’s DALL-E 2 is one of the most famous AI art generators out there. Thanks to its predecessor DALL-E, a name that needs no introduction, DALL-E 2 found itself in the limelight right after its release. Apart from generating realistic, high-resolution images from text prompts, DALL-E 2 can perform in-painting on existing images and produce variations of a generated image while maintaining caption similarity and photorealism.

Prompt: High-resolution picture of a dog in the Himalayas
Prompt: Dog dressed as batman
Prompt: Stained window glass depicting a Robot
Prompt: An oil painting of an Indian King
Prompt: Golden Retriever playing piano

A Commercial Perspective

With the right prompt, DALL-E 2 can generate high-quality art for your business needs. It can help visualize different ideas and concepts in an instant, as opposed to employing an artist for simple, straightforward tasks. Furthermore, DALL-E 2 could also improve your design team’s efficiency by helping them explore different visual dimensions of an idea, or even help them get past artist’s block.

In a nutshell

Given a text description of the desired image, DALL-E 2 first converts that text into a CLIP text embedding (blue), which is then translated into a CLIP image embedding (orange) by a ‘prior’ model. This CLIP image embedding, optionally along with the caption, is passed to the decoder, which decodes the embedding into an image. Further, DALL-E 2 can take an existing image and make changes to specified portions of it such that the resulting image stays consistent with the input text prompt.

Source: Hierarchical Text-Conditional Image Generation with CLIP Latents
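
To make the flow concrete, here is a minimal sketch of the two-stage pipeline. The three callables are hypothetical stand-ins for the trained CLIP text encoder, the prior, and the decoder; none of this is DALL-E 2’s actual code.

```python
from typing import Callable

import numpy as np

# Hypothetical sketch of the two-stage DALL-E 2 pipeline. The three callables
# stand in for the trained CLIP text encoder, the prior, and the decoder.
def generate_image(
    caption: str,
    clip_text_encoder: Callable[[str], np.ndarray],
    prior: Callable[[np.ndarray], np.ndarray],
    decoder: Callable[[np.ndarray, str], np.ndarray],
) -> np.ndarray:
    text_emb = clip_text_encoder(caption)   # caption -> CLIP text embedding
    image_emb = prior(text_emb)             # text embedding -> CLIP image embedding
    image = decoder(image_emb, caption)     # image embedding (+ caption) -> pixels
    return image
```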

The Deep Dive

DALL-E 2 is presented as a two-stage model. The first stage finds the best representation of the user’s input text, and the second stage translates that representation into an actual image. DALL-E 2 leverages state-of-the-art algorithms for both stages; CLIP and GLIDE are the major contributors to its architecture.

Some Background

CLIP (Contrastive Language-Image Pretraining):

If you’ve been active in the AI research space over the past few years, CLIP is no stranger to you. It was developed by OpenAI, the same company behind GPT-3, DALL-E, and DALL-E 2. CLIP was designed to match images to captions in a zero-shot setting. It achieves this by training a text encoder and an image encoder to pair captions with their corresponding images on a dataset of (image, caption) pairs collected from various sources across the Internet. Both the text encoder and the image encoder embed into the same embedding space. The embeddings created by CLIP are robust and perform well in a zero-shot setting thanks to the contrastive style of learning: instead of trying to predict the exact text of a caption, the model only learns which caption as a whole pairs with which image, which enables generalization across datasets. The training objective of CLIP is to create embeddings in such a way that the image embedding (I1) of a picture (a dog) and the text embedding (T1) of its corresponding caption (‘Pepper the aussie pup’) have the maximum dot product relative to the dot products with the text embeddings of all other captions (the diagonal elements of the similarity matrix need to be the maximum of their row).

Source: Learning Transferable Visual Models From Natural Language Supervision
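
For intuition, below is a minimal sketch of that contrastive objective, a symmetric cross-entropy over the N×N image-text similarity matrix, assuming batches of already-computed embeddings. It is not OpenAI’s implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of CLIP's contrastive objective. `image_emb` and `text_emb`
# are batches of N embeddings from the image and text encoders, shape [N, d].
def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N matrix of pairwise dot products; the diagonal holds the matched
    # (image_i, caption_i) pairs that should score highest in their row/column.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_images = F.cross_entropy(logits, targets)      # match each image to its caption
    loss_texts = F.cross_entropy(logits.t(), targets)   # match each caption to its image
    return (loss_images + loss_texts) / 2
```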

GLIDE:

The motivation behind GLIDE (Guided Language to Image Diffusion for Generation and Editing) is to generate images using diffusion models while guiding them with natural language. Diffusion models have become the go-to choice for most image-generation tasks thanks to their ability to generate high-quality synthetic images. Under the hood, diffusion models are trained to recover images after corrupting them with Gaussian noise, and the recovery process is stochastic. During the recovery phase in GLIDE, a transformer is used to ‘guide’ the process, influencing the reverse diffusion steps with text tokens that represent the user prompt. This mechanism lets GLIDE combine the power of diffusion and natural language to produce photorealistic images. GLIDE is also known for its in-painting capabilities: it can edit existing images based on the user prompt and a mask over the area of the image that needs to accommodate the intended change.

Reverse Diffusion in action [Backward Diffusion in Diffusion Models].
Example of text-guided in-painting by GLIDE. Source: GLIDE
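
The GLIDE paper compares CLIP guidance with classifier-free guidance and finds the latter produces more photorealistic, caption-faithful samples. As a rough illustration of text guidance, here is a sketch of a single reverse-diffusion step using classifier-free guidance; `model`, `text_tokens`, and `null_tokens` are hypothetical stand-ins for a noise-prediction network and its text conditioning.

```python
import torch

# Sketch of one reverse-diffusion step with classifier-free guidance, in the
# spirit of GLIDE. `model` predicts the noise in x_t given the timestep and
# (optional) text conditioning; `null_tokens` encodes the empty caption.
@torch.no_grad()
def guided_noise_prediction(model, x_t, t, text_tokens, null_tokens, guidance_scale=3.0):
    eps_cond = model(x_t, t, text_tokens)    # prediction with the caption
    eps_uncond = model(x_t, t, null_tokens)  # prediction without it
    # Push the prediction toward the text-conditioned direction; the result is
    # plugged into the usual DDPM/DDIM update to obtain x_{t-1}.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```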

The Architecture

Text Encoder

A pre-trained CLIP text encoder is used to generate CLIP text embeddings from the user’s input text. These text embeddings are given to a prior model that generates CLIP image embeddings. During the training of the prior model and the decoder, the CLIP text encoder remains frozen.
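
For intuition, here is a minimal sketch of producing frozen CLIP text embeddings using the Hugging Face port of CLIP. DALL-E 2 uses OpenAI’s own CLIP models, so the library and checkpoint below are illustrative rather than the ones used in the paper.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Illustrative only: a publicly available CLIP text encoder, kept frozen.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)   # frozen during prior/decoder training

inputs = tokenizer(["a golden retriever playing piano"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(**inputs).text_embeds   # shape [1, 512] for this checkpoint
```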

Prior Model

One obvious question is: ‘Why do you need an explicit model to generate CLIP image embeddings? Why not pass the text embeddings straight to the decoder, since both are embedded into the same space and have high dot products?’ To answer this, we need a better understanding of the mechanism of CLIP. CLIP was designed to match captions to images. Its training objective is to find those features in an image that are sufficient to match it to one single caption as opposed to multiple captions. Now, think about the number of ways you could imagine the picture of a dog. There are infinitely many possible pictures with a dog in them, and all of them are consistent with the caption ‘Picture of a dog’. Which picture of the dog should be picked? The prior model helps mitigate this problem: its job is to provide the best-suited CLIP image embedding given the CLIP text embedding. Since the CLIP text and image embeddings of an (image, caption) pair have high cosine similarity, one could still exclude the prior model and pass the text embeddings to the decoder. As you might have guessed, the resulting images remain consistent with the caption, but they are definitely better when the prior model is in the loop. The paper includes experiments on the effectiveness of the prior; the figure below compares the results, and Section 5.1 of the source offers a much more detailed analysis.

Source: Hierarchical Text-Conditional Image Generation with CLIP Latents
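
As a toy illustration of the prior’s interface (“CLIP text embedding in, CLIP image embedding out”), the sketch below uses a small MLP trained with an MSE loss. The actual DALL-E 2 prior is a transformer (the paper explores both an autoregressive and a diffusion prior), so this is a deliberate simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy prior: regress CLIP image embeddings from CLIP text embeddings.
# The real DALL-E 2 prior is an autoregressive or diffusion transformer.
class ToyPrior(nn.Module):
    def __init__(self, dim=512, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)

prior = ToyPrior()
text_emb = torch.randn(4, 512)          # stand-in CLIP text embeddings
target_image_emb = torch.randn(4, 512)  # stand-in CLIP image embeddings
loss = F.mse_loss(prior(text_emb), target_image_emb)
```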

Decoder [unCLIP]

The role of the decoder is to generate an image from the CLIP image embeddings produced by the prior model. At the heart of the decoder is another OpenAI model, GLIDE. CLIP image embeddings contain just the high-level semantics: they represent a ‘gist’ of what the output image should look like while leaving out information that CLIP considers unnecessary. By adding GLIDE as the decoder, the authors build on top of what CLIP has learned. GLIDE is slightly modified before being plugged into the architecture: the image embeddings from the previous step are added to GLIDE’s existing timestep embeddings, and the CLIP embeddings are additionally projected into four tokens of context that are concatenated to the output sequence of GLIDE’s text encoder. Effectively, the decoder is responsible for adding detail and realism while retaining the high-level semantics of the output image. The authors call the stack unCLIP since it inverts the mapping CLIP learns, producing images from embeddings rather than embeddings from images. The addition of GLIDE also lets DALL-E 2 inherit GLIDE’s in-painting capabilities. A sketch of these modifications follows the variations below.

Base Picture
Variation 1
Variation 2
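
To make those two modifications concrete, here is a hypothetical sketch (not the actual DALL-E 2 code) of a conditioning module that folds the CLIP image embedding into the timestep embedding and projects it into four extra context tokens appended to the text-encoder sequence. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of injecting the CLIP image embedding into a GLIDE-style decoder:
# (1) add a projection of it to the diffusion timestep embedding, and
# (2) project it to four extra tokens concatenated to the text context.
class ClipImageConditioning(nn.Module):
    def __init__(self, clip_dim=512, time_dim=768, ctx_dim=768, n_ctx_tokens=4):
        super().__init__()
        self.to_time = nn.Linear(clip_dim, time_dim)
        self.to_ctx = nn.Linear(clip_dim, n_ctx_tokens * ctx_dim)
        self.n_ctx_tokens, self.ctx_dim = n_ctx_tokens, ctx_dim

    def forward(self, image_emb, time_emb, text_ctx):
        time_emb = time_emb + self.to_time(image_emb)                       # (1)
        extra = self.to_ctx(image_emb).view(-1, self.n_ctx_tokens, self.ctx_dim)
        text_ctx = torch.cat([text_ctx, extra], dim=1)                      # (2)
        return time_emb, text_ctx

cond = ClipImageConditioning()
time_emb, text_ctx = cond(
    torch.randn(2, 512),       # CLIP image embeddings from the prior
    torch.randn(2, 768),       # diffusion timestep embeddings
    torch.randn(2, 128, 768),  # text-encoder output sequence (illustrative length)
)
```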

Limitations

Since the base image is generated at a much lower resolution (64×64) than the resolution at which the output is delivered (diffusion upsamplers bring it to 256×256 and then 1024×1024), the model has trouble generating complex scenes. The details in complex scenes often come out a tad messy compared to relatively simpler scenes.

Missing details at the limbs and face of the rider. (Prompt: Gorilla riding a horse)
Missing details on the front wing (Prompt: F1 car in a desert)

Our Thoughts

During the time we spent with DALL-E 2 in OpenAI’s Labs, we found the experience hassle-free and very intuitive, much like our pre-assembled AI blueprints for your AI needs. Interaction with the model was very straightforward. We observed that the wording of the prompts played a key role in the quality of the generated images; it is important to use the right words to get the best out of the model. With that said, DALL-E 2 has amazed us quite a few times with its ability to go beyond our imagination and create images of concepts that are incredibly abstract to even think of.

Prompt: A battle from the 1900s
Prompt: A battle from the 1500s
Prompt: A Historic Battle from the 1500s
Prompt: Rainbow in Space
Prompt: A realistic image of a cat dressed as an Egyptian god