How many possible images are there?
Our universe has about 10^80 atoms. Now imagine if each atom contained its own universe, with 10^80 atoms inside. Even that barely scratches the surface. You'd need about 5,000 layers of atom-universes, nested one inside the next, before you reached the number of possible images the size of the one above - about 10^400,000. That's a 1 with 400,000 zeroes after it.
As you can imagine, the vast majority of these are nothing but random noise:
Depending on your computer, you're seeing up to 60 random images per second. If you see anything that looks like a real image before the heat death of the universe, let me (or my descendants) know.
Amazingly, diffusion models can navigate this vast space of possibilities to produce coherent results. Unlike a human painter, who starts with a blank canvas and adds paint, a diffusion model starts with random noise and gradually removes it until an image emerges.
From noise to image
If we think of all possible images as occupying a vast, multi-dimensional space, then a diffusion model starts at a random point in that space, and gradually forges a path towards a point that's consistent with your prompt.
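That "path through image space" can be sketched as a toy loop. To be clear, this is a deliberately simplified stand-in, not the real sampler: the target point and the step rule are invented for illustration.

```python
import numpy as np

# Toy sketch of the diffusion "journey" (NOT a real model): start at a
# random point in image space and repeatedly nudge it toward a target
# point that stands in for "an image consistent with the prompt".
rng = np.random.default_rng(seed=0)

target = np.full(16, 0.5)    # stand-in for a coherent image
x = rng.normal(size=16)      # pure noise: the random starting point

for step in range(50):
    direction = target - x   # a real model would *predict* this direction
    x = x + 0.1 * direction  # take a small step along it

# After enough small steps, we end up very near the target.
print(np.abs(x - target).max())
```

In a real diffusion model, the "direction" comes from a neural network predicting the noise to remove at each step, but the shape of the loop is the same.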
A smaller world
Actually it's not as bad as I made out, because the model operates in a compressed, lower-dimensional space called latent space. The model is trained together with an encoder/decoder that translates between this latent space and real images. In these examples, the latent space has 12x fewer dimensions than the full image space - still vast, but more manageable.
Latent space
We can't show the full latent space because it has too many dimensions, but I've compressed it down to give you a feel for it.
From here on, we ignore the latent space and show the fully decoded image at each step - because it's easier for humans to understand. But in reality, the decoding only happens right at the end.
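As a rough back-of-envelope, here's what "12x fewer dimensions" means in raw numbers. The 256x256 RGB image size is my assumption for illustration; the real latent shape depends on the model.

```python
# Back-of-envelope dimension count (256x256 RGB is an assumption
# for illustration; real image and latent shapes vary by model).
image_dims = 256 * 256 * 3      # height x width x RGB channels
latent_dims = image_dims // 12  # "12x fewer dimensions" from the text

print(image_dims, latent_dims)  # 196608 16384
```

Sixteen thousand dimensions is still an unimaginably large space - but far more manageable than two hundred thousand.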
Where words live
Just like the multi-dimensional space of possible images, text prompts can be mapped to a high-dimensional "embedding space". Each prompt lives at a particular spot in this space, and similar prompts cluster together.
Again, we can't show the full embedding space because it's too high-dimensional, but we can show this 2D compression. Notice how similar concepts group together.
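To make "similar prompts cluster together" concrete, here's a toy sketch with made-up three-dimensional embeddings (real text embeddings have hundreds of dimensions), using cosine similarity as the distance measure:

```python
import numpy as np

# Toy embeddings (the numbers are made up): similar prompts sit closer
# together in embedding space, measurable with cosine similarity.
emb = {
    "a butterfly on a flower": np.array([0.9, 0.1, 0.2]),
    "a moth on a blossom":     np.array([0.8, 0.2, 0.3]),
    "a rusty cargo ship":      np.array([0.1, 0.9, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine(emb["a butterfly on a flower"], emb["a moth on a blossom"])
sim_far   = cosine(emb["a butterfly on a flower"], emb["a rusty cargo ship"])
print(sim_close > sim_far)  # similar concepts score higher
```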
The embedding acts like a "compass" for the diffusion process. At each step on its journey, the model looks at the embedding to figure out the best direction to move.
The starting point
If we start at different points in the possible-image space, we end up at slightly different destinations. This is determined by the random seed - a number that fixes the initial noise pattern.
Different random seeds, same prompt
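In code, the seed's role is easy to see: the same seed always reproduces the same starting noise. A minimal sketch with numpy's random generator (the real model draws its noise differently, but the principle is identical):

```python
import numpy as np

# Same seed -> identical starting noise -> identical journey (and image).
# Different seed -> different starting point -> a different destination.
noise_a = np.random.default_rng(seed=42).normal(size=4)
noise_b = np.random.default_rng(seed=42).normal(size=4)
noise_c = np.random.default_rng(seed=7).normal(size=4)

print(np.array_equal(noise_a, noise_b))  # True: same seed, same noise
print(np.array_equal(noise_a, noise_c))  # False: new seed, new start
```

This is also why sharing a seed alongside a prompt lets someone else reproduce your exact image.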
Dividing the path
We can choose how many steps we take before stopping. A small number means we have to take big steps, and can end up off track - if we get there at all. But after a certain point, taking more steps doesn't help much, and just wastes time.
Different numbers of inference steps
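The trade-off can be sketched with the same toy "walk toward a target" picture: the direction estimate is a little noisy, and fewer steps means bigger, riskier moves. All numbers here are invented for illustration.

```python
import numpy as np

# Toy sketch (not a real sampler): the "compass" direction has a bit of
# noise, so very few steps leaves us off track, while beyond a certain
# point extra steps stop helping.
rng = np.random.default_rng(seed=0)
target = np.full(16, 0.5)

def run(num_steps, start):
    x = start.copy()
    for _ in range(num_steps):
        # The predicted direction, with some estimation noise mixed in.
        direction = (target - x) + rng.normal(scale=0.05, size=16)
        x = x + direction / 2  # damped step toward the target
    return np.abs(x - target).mean()

start = rng.normal(size=16)
errors = [run(n, start) for n in (2, 8, 32, 128)]
print(errors)  # error drops quickly, then plateaus
```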
The weight of words
A vague prompt leads to a more wobbly compass. More detailed prompts constrain the direction more tightly, leading to better results.
Variation of prompt detail
Explore the effect of your prompt on the generated image:
Prompt length explorer
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
The space between words
Because prompts also exist in their own "embedding space", we can follow a path between any two prompt embeddings, and generate images along the way. These "in-between" points don't correspond to any human words, but they're still valid locations in the embedding space.
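The simplest way to walk between two embeddings is linear interpolation (some samplers prefer spherical interpolation instead; the three-dimensional vectors here are made up for illustration):

```python
import numpy as np

# Made-up stand-in embeddings for two prompts; real ones are much longer.
emb_butterfly = np.array([0.9, 0.1, 0.2])
emb_snail     = np.array([0.2, 0.7, 0.4])

# Every point on the line between them is a valid model input, even
# though no English sentence maps to it.
for t in np.linspace(0.0, 1.0, 5):
    between = (1 - t) * emb_butterfly + t * emb_snail
    print(t, between)  # t=0.5 is "halfway between butterfly and snail"
```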
The space between two prompts
You can explore this "in-between" space yourself:
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
A snail resting on a moss-covered rock, extreme close-up, shell spiral detail, rain droplets, rich green, soft overcast lighting
The pull of the prompt
The model decides how strongly to follow your prompt using a number called the guidance scale. A higher value gives the prompt a stronger "steering" effect, but if we set it too high we can end up with unnatural, oversaturated images:
Changing the guidance scale
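The mechanism behind the guidance scale in many diffusion samplers is classifier-free guidance - I'm assuming the standard formulation here; PRX's exact sampler may differ. The model predicts the noise twice, once with the prompt and once without, and the scale exaggerates the difference between the two:

```python
import numpy as np

# Classifier-free guidance as commonly implemented (assumed here, not
# confirmed for PRX): amplify the gap between the conditional and
# unconditional predictions by the guidance scale.
def guided_prediction(pred_uncond, pred_cond, guidance_scale):
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

pred_uncond = np.array([0.1, 0.2])  # made-up "no prompt" prediction
pred_cond   = np.array([0.3, 0.1])  # made-up "with prompt" prediction

print(guided_prediction(pred_uncond, pred_cond, 1.0))   # scale 1: just the conditional prediction
print(guided_prediction(pred_uncond, pred_cond, 15.0))  # high scale: overshoots far past it
```

That overshoot is where the oversaturated, unnatural look of very high guidance scales comes from.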
You can explore how the guidance scale affects the image generation process:
Guidance scale from 1.0 to 15.0
The full journey
Imagine you've been plonked in the middle of unfamiliar terrain with an uncertain destination and nothing but a compass to guide you. You come up with the following plan:
- Check the compass
- Walk a bit in that direction
- Check the compass again
- Walk a bit in the new direction
- Repeat a fixed number of times
Diffusion models follow a very similar pattern, guided by four things:
- Random seed - where you start. Different starting points lead to slightly different destinations.
- Prompt - the compass. A rickety old compass that only points "northish" will get you somewhere, but not necessarily where you want to be.
- Step count - how often you check the compass. Check it too rarely and you'll drift off course, but checking all the time will slow you down.
- Guidance scale - how strongly you follow the compass. Blindly follow it and you may end up stuck in a river, but ignore it completely and you'll wander aimlessly.
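Putting the four ingredients together, the whole journey can be sketched as one toy function. Again, this is a stand-in, not a real diffusion model: the "compass" is just a noisy vector toward a fixed target.

```python
import numpy as np

# Toy end-to-end sketch combining all four ingredients (NOT a real
# diffusion model: the target and compass rule are invented).
def generate(seed, num_steps, guidance_scale, compass_noise=0.1):
    rng = np.random.default_rng(seed)
    target = np.full(8, 0.5)    # stand-in for "matches the prompt"
    x = rng.normal(size=8)      # random seed -> the starting point
    for _ in range(num_steps):  # step count -> how often we check
        compass = (target - x) + rng.normal(scale=compass_noise, size=8)
        x = x + guidance_scale * compass / num_steps  # guidance -> how hard we follow
    return x

img = generate(seed=42, num_steps=30, guidance_scale=1.0)
print(np.abs(img - 0.5).mean())  # much nearer the target than the random start
```

Change the seed and you land somewhere slightly different; crank the guidance scale and you overshoot - the same behaviours the real model shows.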
By combining these, the model navigates from pure chaos to a coherent image that matches your prompt. Here's that journey again, from noise to structure:
From noise to image
An AI model generating an image may look like magic, but now you know it's just navigation through an unimaginably vast space of possibilities.
Which, to be fair, is still super cool.
Thanks
Thanks to Photoroom, who open sourced their text-to-image model PRX, allowing me to generate all of these examples. All examples were generated using the Photoroom/prx-256-t2i-sft model. Extra thanks to Jon Almazán for the support and ideas.