
The superpower of DALL-E 2

It is an established fact: artificial intelligence (AI) has entered our lives, far more deeply than we can (or want to) comprehend. From financial transactions to autonomous driving and even medical examinations, there are now very few fields that have not found an application for this revolutionary technology, which enables superhuman recognition and cataloging of data.

But the new frontier of AI does not stop at data recognition: it reaches into content creation and art, producing images, music, text and even video. Many of you have surely seen, while scrolling through social media, images accompanied by captions, or rather "prompts": these are images created by an artificial intelligence that converts text into images. Several tools and programs already exist, but in this article we want to dive deep into DALL-E and see how it works, what it is used for, and what its limitations (or risks) are.

How DALL-E works

One of the branches of AI development is content creation, and one of its most recent (and surprising) applications is the ability to create digital images from natural-language descriptions. If text generated by language models such as GPT-3, often indistinguishable from text written by a human, can fill us with admiration as well as disquiet, nothing strikes our senses like an image. Providing an artificial intelligence with a short sentence, only to see it converted into something incredibly photorealistic that more often than not comes amazingly close to what we had in mind, is another matter entirely. Try it and see for yourself.

DALL-E, in turn, is built on GPT-3, which gives it its ability to understand text, but it is much smaller. Consider that while GPT-3 uses 175 billion parameters (which is why we speak of "large" models), the original DALL-E, announced by OpenAI in January 2021, used 12 billion. Its successor, DALL-E 2, announced in April 2022, uses only 3.5 billion, yet generates images at 4 times the resolution that are both more faithful to the text and more photorealistic.

So from here on we will talk about DALL-E 2, since it is the latest version of the model.


How does it work, you may ask? For those interested in the technicalities, DALL-E 2 builds on a CLIP (Contrastive Language-Image Pre-training) model, an artificial intelligence model released by OpenAI in 2021 that is particularly effective at learning visual concepts from natural language.

Artificial intelligence models work, in very simplistic terms, in two stages.

The first is training, during which the model is given data from which to "learn," for example handwritten characters. The second, inference, is the phase when the model is released into the real world and fed real data, to which it must apply what it has learned: for example, recognizing your handwriting and translating it into characters. There is also a third phase, evaluation, but that is beyond our scope.
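As a minimal illustration of these two stages, here is a sketch using scikit-learn's bundled handwritten-digit dataset (the library, model, and dataset are our choice for the example, not something used by DALL-E):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load 8x8 images of handwritten digits together with their labels
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Stage 1: training -- the model "learns" from labeled examples
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# Stage 2: inference -- the model is shown new, unseen data
predictions = model.predict(X_test)
print(f"Accuracy on unseen digits: {model.score(X_test, y_test):.2f}")
```

Any classifier would do here; the point is only the separation between the phase where parameters are learned and the phase where the frozen model is applied to new data.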

As for DALL-E, things are a bit more complicated. The developers took a huge dataset of image-text pairs which, thanks to the CLIP model, were "embedded": each text and each image was mapped to a numerical representation in a shared space, so that related texts and images end up close together. The model learned to associate the two, through a training phase involving more than 650 million images.
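To give an idea of what "close together in a shared space" means, here is a toy sketch with invented embedding vectors (the numbers are made up for illustration; real CLIP embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two embedding vectors: near 1 = same direction, near 0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 4-dimensional embeddings, for illustration only
text_embedding  = np.array([0.9, 0.1, 0.3, 0.0])   # "a train leaving a tunnel"
image_embedding = np.array([0.8, 0.2, 0.4, 0.1])   # a photo of a train and a tunnel
other_embedding = np.array([0.0, 0.9, 0.0, 0.8])   # a photo of a cat

print(cosine_similarity(text_embedding, image_embedding))  # high: matching pair
print(cosine_similarity(text_embedding, other_embedding))  # low: unrelated pair
```

Training on hundreds of millions of pairs pushes matching text and image embeddings toward high similarity and mismatched ones toward low similarity; that geometry is what the rest of the pipeline exploits.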

At that point, it was possible to build the actual generative model of DALL-E 2, that is, the one that enables it to create the images.

This is divided into two parts: the prior model, which creates a CLIP image embedding conditioned on the text, and the decoder model (a diffusion decoder, unCLIP), which produces images conditioned on the CLIP image embedding produced by the prior and on the text entered by the user (the one we want to turn into an image).

The decoder is called unCLIP because it performs the reverse process of the original CLIP model: instead of creating a "mental" representation (an embedding) from an image, it creates an original image from a generic mental representation.

The mental representation encodes the main semantically significant features: people, animals, objects, style, colors, background, etc., so that DALL-E 2 can generate a new image that preserves these features by varying the non-essential features.
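The two-stage pipeline described above can be sketched roughly as follows. The function names are ours, not OpenAI's, and the bodies are deterministic placeholders standing in for large neural networks:

```python
import hashlib

def clip_text_encoder(prompt: str) -> list[float]:
    """Placeholder for CLIP's text encoder: maps a prompt to an embedding.
    Here we just derive a fake deterministic vector from the text."""
    digest = hashlib.sha256(prompt.encode()).digest()
    return [b / 255 for b in digest[:8]]

def prior(text_embedding: list[float]) -> list[float]:
    """Placeholder for the prior: text embedding -> CLIP image embedding."""
    return [0.5 * x + 0.1 for x in text_embedding]

def decoder(image_embedding: list[float], prompt: str) -> str:
    """Placeholder for the unCLIP diffusion decoder: embedding -> image.
    A real decoder would return pixels; we return a description string."""
    return f"<image from {len(image_embedding)}-dim embedding for: {prompt!r}>"

def dalle2(prompt: str) -> str:
    text_emb = clip_text_encoder(prompt)   # understand the sentence
    image_emb = prior(text_emb)            # form the "mental image"
    return decoder(image_emb, prompt)      # draw it

print(dalle2("a train coming out of a tunnel"))
```

In the real model the decoder is stochastic, which is precisely why the same embedding can yield many distinct images sharing the same essential features.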

Confused? Let's try to explain it with an example, which to some extent brings to mind Raymond Carver's famous short story, "Cathedral."

Take a piece of paper and a pencil.

First, think about drawing a train coming out of a tunnel, with trees around it and the sun high in the sky. Visualize what the drawing might look like. The mental image that appears in your mind is the human analogue of an image embedding: you do not know exactly what the drawing will look like, but you know roughly its main features. The prior model does just that: it goes from a sentence to a mental image.

Now start drawing. Translating the image you have in your mind into an actual drawing is what the decoder model, unCLIP, does. You could redo the drawing from the same text with similar features but a different appearance: this is exactly how DALL-E 2 creates distinct original images from a single image embedding.

Once the drawing is finished, look at it. This image is the result of the text "a train coming out of a tunnel, with trees around it and the sun high in the sky."

Now, think about which features best represent the sentence (e.g., the train, the tunnel, the trees) and which best represent the image (e.g., the objects, the style, the colors...). This process of encoding the features of a sentence and an image is what the CLIP model does.

DALL-E 2 is a very versatile model that goes beyond generating images from sentences, and it is constantly evolving. This allows it to make variations, "judging" which features are essential and which are replaceable. In essence, DALL-E 2 tends to preserve "semantic information as well as stylistic elements." An example can be seen below, where Dalí's "The Persistence of Memory" is subjected to several variations: the model retains some forms (the trees and clocks) and replaces others (the sky).

Ok but... What can DALL-E 2 do?

So far we have talked about DALL-E 2's ability to create images from text, but the model can do much more:

  • It can create original and realistic images and artwork from a text description. It can combine concepts, attributes, and styles.

  • It can edit existing images from a text prompt, adding and removing elements while taking into account the shadows, reflections, and textures already present in the original canvas, creating new expanded compositions.

  • It can take an image and create different variations of it inspired by the original.

One thing to keep in mind is that DALL-E 2 works best with long, detailed sentences; short sentences are too generic and tend to confuse the model.

DALL-E 2 has learned to represent elements separately by seeing them repeatedly in the huge dataset of 650 million image-text pairs and has developed the ability to merge unrelated concepts with semantic consistency.

The model also has another fantastic ability: interpolation. Using a technique called text diffs, DALL-E 2 can transform one image into another by combining image and text embeddings: the difference between two text embeddings can be added to an image embedding to shift the image toward the new description.
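The underlying idea can be sketched with invented low-dimensional vectors. DALL-E 2 interpolates between normalized embeddings on the unit sphere (spherical interpolation); a real decoder would then turn each intermediate embedding back into an image, producing the frames of a smooth transition:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def slerp(a, b, t):
    """Spherical interpolation between two embeddings, t in [0, 1]."""
    a, b = normalize(a), normalize(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Invented embeddings for two images (say, a starting image and a target image)
emb_start = np.array([1.0, 0.2, 0.0, 0.4])
emb_end   = np.array([0.1, 0.9, 0.5, 0.0])

# Walking t from 0 to 1 yields intermediate embeddings; decoding each one
# would produce the in-between images of the transformation.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, np.round(slerp(emb_start, emb_end, t), 3))
```

Because every intermediate point is still a valid embedding on the sphere, each decoded frame is a plausible image rather than a pixel-level blend of the two endpoints.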

Possibilities, but also limitations and risks (possibilities win for us)

In terms of limitations, DALL-E 2 is not very good at "writing": since the model decodes an embedding of the text rather than copying the text itself, legible words rarely appear correctly in its images. With ChatGPT, however, this limitation may eventually disappear, but we will address that topic later...

AI is a great opportunity, but it also hides risks inherent in how it is trained. If an AI learns by scouring the web or social media, the results are not comforting, since both are full of racist, violent, or otherwise inappropriate phrases and images.

And this is why Google, Meta and OpenAI have not fully publicly released their models, or have done so cautiously.

In the case of DALL-E 2, OpenAI removed violent content from the training data, and filters are in place to prevent DALL-E 2 from generating images when users submit requests that might violate company policies against nudity, violence, conspiracies, or political content.

Nevertheless, DALL-E 2 and all large artificial intelligence models struggle with some issues because they have biases that may harm minorities in particular or exacerbate problems in our society.

We have noticed that portraits of human subjects often come out confused and blurred; rather than a limitation, we believe this is a provisional way of avoiding getting trapped in social issues... but that is just a theory.


That said, we are extremely excited about this AI revolution, especially in the field of art. DALL-E and other AI tools are making it possible to create images that were unthinkable, or at least very difficult to achieve, just a few months ago...

What should we expect from the near future? 👀

