How many possible images are there?
Our universe has about 10^80 atoms. Now imagine if each atom contained its own universe, with 10^80 atoms inside. Even that barely scratches the surface. You'd need about 5,000 layers of atom-universes, nested one inside the next, before you reached the number of possible images the size of the one above - about 10^400,000. That's a 1 with 400,000 zeroes after it.
As you can imagine, the vast majority of these are nothing but random noise:
Depending on your computer, you're seeing up to 60 random images per second. If you see anything that looks like a real image before the heat death of the universe, let me (or my descendants) know.
Amazingly, diffusion models can navigate this vast space of possibilities to produce coherent results. Unlike a human painter, who starts with a blank canvas and adds paint, a diffusion model starts with random noise and gradually removes it until an image emerges.
From noise to image
If we think of all possible images as occupying a vast, multi-dimensional space, then a diffusion model starts at a random point in that space, and gradually forges a path towards a point that's consistent with your prompt.
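That "path through image space" can be sketched as a toy loop. To be clear, this is a deliberately simplified stand-in, not the real sampler: the target point and the step rule are invented for illustration.

```python
import numpy as np

# Toy sketch of the diffusion "journey" (NOT a real model): start at a
# random point in image space and repeatedly nudge it toward a target
# point that stands in for "an image consistent with the prompt".
rng = np.random.default_rng(seed=0)

target = np.full(16, 0.5)    # stand-in for a coherent image
x = rng.normal(size=16)      # pure noise: the random starting point

for step in range(50):
    direction = target - x   # a real model would *predict* this direction
    x = x + 0.1 * direction  # take a small step along it

# After enough small steps, we end up very near the target.
print(np.abs(x - target).max())
```

In a real diffusion model, the "direction" comes from a neural network predicting the noise to remove at each step, but the shape of the loop is the same.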
A smaller world
Actually it's not as bad as I made out, because the model operates in a compressed, lower-dimensional space called latent space. The model is trained together with an encoder/decoder that translates between this latent space and real images. In these examples, the latent space has 12x fewer dimensions than the full image space - still vast, but more manageable.
Latent space
We can't show the full latent space because it has too many dimensions, but I've compressed it down to give you a feel for it.
From here on, we ignore the latent space and show the fully decoded image at each step - because it's easier for humans to understand. But in reality, the decoding only happens right at the end.
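As a rough back-of-envelope, here's what "12x fewer dimensions" means in raw numbers. The 256x256 RGB image size is my assumption for illustration; the real latent shape depends on the model.

```python
# Back-of-envelope dimension count (256x256 RGB is an assumption
# for illustration; real image and latent shapes vary by model).
image_dims = 256 * 256 * 3      # height x width x RGB channels
latent_dims = image_dims // 12  # "12x fewer dimensions" from the text

print(image_dims, latent_dims)  # 196608 16384
```

Sixteen thousand dimensions is still an unimaginably large space - but far more manageable than two hundred thousand.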
Where words live
Just like the multi-dimensional space of possible images, text prompts can be mapped to a high-dimensional "embedding space". Each prompt lives at a particular spot in this space, and similar prompts cluster together.
Again, we can't show the full embedding space because it's too high-dimensional, but we can show this 2D compression. Notice how similar concepts group together.
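To make "similar prompts cluster together" concrete, here's a toy sketch with made-up three-dimensional embeddings (real text embeddings have hundreds of dimensions), using cosine similarity as the distance measure:

```python
import numpy as np

# Toy embeddings (the numbers are made up): similar prompts sit closer
# together in embedding space, measurable with cosine similarity.
emb = {
    "a butterfly on a flower": np.array([0.9, 0.1, 0.2]),
    "a moth on a blossom":     np.array([0.8, 0.2, 0.3]),
    "a rusty cargo ship":      np.array([0.1, 0.9, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine(emb["a butterfly on a flower"], emb["a moth on a blossom"])
sim_far   = cosine(emb["a butterfly on a flower"], emb["a rusty cargo ship"])
print(sim_close > sim_far)  # similar concepts score higher
```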
The embedding acts like a "compass" for the diffusion process. At each step on its journey, the model looks at the embedding to figure out the best direction to move.
The starting point
If we start at different points in the possible-image space, we end up at slightly different destinations. This is determined by the random seed - a number that fixes the initial noise pattern.
Different random seeds, same prompt
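In code, the seed's role is easy to see: the same seed always reproduces the same starting noise. A minimal sketch with numpy's random generator (the real model draws its noise differently, but the principle is identical):

```python
import numpy as np

# Same seed -> identical starting noise -> identical journey (and image).
# Different seed -> different starting point -> a different destination.
noise_a = np.random.default_rng(seed=42).normal(size=4)
noise_b = np.random.default_rng(seed=42).normal(size=4)
noise_c = np.random.default_rng(seed=7).normal(size=4)

print(np.array_equal(noise_a, noise_b))  # True: same seed, same noise
print(np.array_equal(noise_a, noise_c))  # False: new seed, new start
```

This is also why sharing a seed alongside a prompt lets someone else reproduce your exact image.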
Dividing the path
We can choose how many steps we take before stopping. A small number means we have to take big steps, and can end up off track - if we get there at all. But after a certain point, taking more steps doesn't help much, and just wastes time.
Different numbers of inference steps
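The trade-off can be sketched with the same toy "walk toward a target" picture: the direction estimate is a little noisy, and fewer steps means bigger, riskier moves. All numbers here are invented for illustration.

```python
import numpy as np

# Toy sketch (not a real sampler): the "compass" direction has a bit of
# noise, so very few steps leaves us off track, while beyond a certain
# point extra steps stop helping.
rng = np.random.default_rng(seed=0)
target = np.full(16, 0.5)

def run(num_steps, start):
    x = start.copy()
    for _ in range(num_steps):
        # The predicted direction, with some estimation noise mixed in.
        direction = (target - x) + rng.normal(scale=0.05, size=16)
        x = x + direction / 2  # damped step toward the target
    return np.abs(x - target).mean()

start = rng.normal(size=16)
errors = [run(n, start) for n in (2, 8, 32, 128)]
print(errors)  # error drops quickly, then plateaus
```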
The weight of words
A vague prompt leads to a more wobbly compass. More detailed prompts constrain the direction more tightly, leading to better results.
Variation of prompt detail
Explore the effect of your prompt on the generated image:
Prompt length explorer
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
The space between words
Because prompts also exist in their own "embedding space", we can follow a path between any two prompt embeddings, and generate images along the way. These "in-between" points don't correspond to any human words, but they're still valid locations in the embedding space.
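The simplest way to walk between two embeddings is linear interpolation (some samplers prefer spherical interpolation instead; the three-dimensional vectors here are made up for illustration):

```python
import numpy as np

# Made-up stand-in embeddings for two prompts; real ones are much longer.
emb_butterfly = np.array([0.9, 0.1, 0.2])
emb_snail     = np.array([0.2, 0.7, 0.4])

# Every point on the line between them is a valid model input, even
# though no English sentence maps to it.
for t in np.linspace(0.0, 1.0, 5):
    between = (1 - t) * emb_butterfly + t * emb_snail
    print(t, between)  # t=0.5 is "halfway between butterfly and snail"
```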
The space between two prompts
You can explore this "in-between" space yourself:
A monarch butterfly on a purple coneflower, macro close-up, delicate orange and black wing detail, morning dew, soft bokeh
A snail resting on a moss-covered rock, extreme close-up, shell spiral detail, rain droplets, rich green, soft overcast lighting
The pull of the prompt
The model decides how strongly to follow your prompt using a number called the guidance scale. A higher value gives the prompt a stronger "steering" effect, but if we set it too high we can end up with unnatural, oversaturated images:
Changing the guidance scale
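The mechanism behind the guidance scale in many diffusion samplers is classifier-free guidance - I'm assuming the standard formulation here; PRX's exact sampler may differ. The model predicts the noise twice, once with the prompt and once without, and the scale exaggerates the difference between the two:

```python
import numpy as np

# Classifier-free guidance as commonly implemented (assumed here, not
# confirmed for PRX): amplify the gap between the conditional and
# unconditional predictions by the guidance scale.
def guided_prediction(pred_uncond, pred_cond, guidance_scale):
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

pred_uncond = np.array([0.1, 0.2])  # made-up "no prompt" prediction
pred_cond   = np.array([0.3, 0.1])  # made-up "with prompt" prediction

print(guided_prediction(pred_uncond, pred_cond, 1.0))   # scale 1: just the conditional prediction
print(guided_prediction(pred_uncond, pred_cond, 15.0))  # high scale: overshoots far past it
```

That overshoot is where the oversaturated, unnatural look of very high guidance scales comes from.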
You can explore how the guidance scale affects the image generation process:
Guidance scale from 1.0 to 15.0
The full journey
Imagine you've been plonked in the middle of unfamiliar terrain with an uncertain destination and nothing but a compass to guide you. You come up with the following plan:
- Check the compass
- Walk a bit in that direction
- Check the compass again
- Walk a bit in the new direction
- Repeat a fixed number of times
Diffusion models follow a very similar pattern, guided by four things:
- Random seed - where you start. Different starting points lead to slightly different destinations.
- Prompt - the compass. A rickety old compass that only points "northish" will get you somewhere, but not necessarily where you want to be.
- Step count - how often you check the compass. Check it too rarely and you'll drift off course, but checking all the time will slow you down.
- Guidance scale - how strongly you follow the compass. Blindly follow it and you may end up stuck in a river, but ignore it completely and you'll wander aimlessly.
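Putting the four ingredients together, the whole journey can be sketched as one toy function. Again, this is a stand-in, not a real diffusion model: the "compass" is just a noisy vector toward a fixed target.

```python
import numpy as np

# Toy end-to-end sketch combining all four ingredients (NOT a real
# diffusion model: the target and compass rule are invented).
def generate(seed, num_steps, guidance_scale, compass_noise=0.1):
    rng = np.random.default_rng(seed)
    target = np.full(8, 0.5)    # stand-in for "matches the prompt"
    x = rng.normal(size=8)      # random seed -> the starting point
    for _ in range(num_steps):  # step count -> how often we check
        compass = (target - x) + rng.normal(scale=compass_noise, size=8)
        x = x + guidance_scale * compass / num_steps  # guidance -> how hard we follow
    return x

img = generate(seed=42, num_steps=30, guidance_scale=1.0)
print(np.abs(img - 0.5).mean())  # much nearer the target than the random start
```

Change the seed and you land somewhere slightly different; crank the guidance scale and you overshoot - the same behaviours the real model shows.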
By combining these, the model navigates from pure chaos to a coherent image that matches your prompt. Here's that journey again, from noise to structure:
From noise to image
An AI model generating an image may look like magic, but now you know it's just navigation through an unimaginably vast space of possibilities.
Which, to be fair, is still super cool.
Thanks
Thanks to Photoroom, who open sourced their text-to-image model PRX, allowing me to generate all of these examples. All examples were generated using the Photoroom/prx-256-t2i-sft model. Extra thanks to Jon Almazán for the support and ideas.