Flow Where You Want – Guidance for Flow Models

Original link: https://drscotthawley.github.io/blog/posts/FlowWhereYouWant.html


This tutorial demonstrates how to add inference-time controls to pretrained flow-based generative models to make them perform tasks they weren’t trained to do. We take an unconditional flow model trained on MNIST digits and apply two types of guidance: classifier guidance to generate specific digits, and inpainting to fill in missing pixels. Both approaches work by adding velocity corrections during the sampling process to steer the model toward desired outcomes. Since modern generative models operate in compressed latent spaces, we examine guidance methods that work directly in latent space as well as those that decode to pixel space. We also explore PnP-Flow, which satisfies constraints by iteratively projecting samples backward and forward in time rather than correcting flow velocities. The approaches demonstrated here work with other flow models and control tasks, so you can guide flows where you want them to go.

“When we put bits into the mouths of horses to make them obey us, we can turn the whole animal. Or take ships as an example. Although they are so large and are driven by strong winds, they are steered by a very small rudder wherever the pilot wants to go.”
– James 3:3-4 (NIV)

In this tutorial, we’ll explore inference-time “plugin” methods for flow matching and rectified flow generative models like FLUX or Stable Audio Open Small. Unlike classifier-free guidance (CFG) [1], which requires training the model with your desired conditioning signal, these plugin guidance methods let you add controls at inference time—even for conditions the model never saw during training.

This tutorial assumes familiarity with flow-based generative models, by which we mean “flow matching” [2] and/or “rectified flows” [3]. See the blog post “Flow With What You Know” [4] for an overview, and/or my IJCNN 2025 tutorial [5] for further detail. The key insight is that flow models generate samples through iterative integration, and at each step we can add small velocity corrections to steer toward specific goals. This works for various objectives: generating specific classes, filling in missing regions, or satisfying other desired constraints.

Our discussion will bring us up to date on guidance methods for latent-space rectified flow models. While there’s an extensive literature on guidance for diffusion models [7] – see Sander Dieleman’s excellent blog post [8] for an overview — flow matching allows us to cast these in a more accessible and intuitive way. There’s some recent work unifying guidance for diffusion and flows [9], but in this tutorial we’ll focus on a simplified treatment for flows only.

The paradigm of latent generative models is covered in another superb Dieleman post [10], and combining latent-space models with flow-based guidance gives us powerful, flexible tools for adding controls to efficient generation.

Let’s review the picture for flow-based generative modeling in latent space…

The Latent Flow-Matching Setup

The following diagrams illustrate the three key concepts:

a) A VAE compresses pixel-space images into compact latent representations. “E” is for encoder and “D” is for decoder:

b) The flow model operates in this latent space, transforming noise (“Source”, t=0) into structured data (“Target”, t=1) through iterative integration. The decoder then converts the final latents back to pixels.

c) While general flows can follow curved trajectories, some of our methods will focus on flows with nearly straight trajectories, which allows for estimating endpoints without many integration steps:

These (nearly) straight trajectories can be obtained by “ReFlow” distillation of another model (covered in [4]) or by insisting during training that the models yield paths agreeing with Optimal Transport such as the “minibatch OT” method of Tong et al [11]. Even if the model’s trajectories aren’t super-straight, we’ll see that the guidance methods we use can be applied fairly generally anyway.

Intuitively, guidance amounts to “steering” during the integration of the flow model in order to end up at a desired end point. The following video provides a useful metaphor:

OK, the analogy’s not quite right: you can’t just steer; you’re going to have to paddle a little bit. In other words, you’re going to have to provide a bit of extra velocity to correct where the “current” flow is taking you.

In flow matching, we go from a source distribution at time \(t=0\) to the target data distribution at \(t=1\). Since this tutorial works in latent space, we’ll use the letter \(z\) for position, with \(z_t\) denoting the position at time \(t\).

When you’re “looking ahead” to estimate where you’ll end up, you project linearly along the current velocity \(\vec{v_t}\) for the duration of the remaining time. Let’s call this estimate \(\widehat{z_1}\), your projected endpoint:

\[ \widehat{z_1} = z_t + (1-t)\vec{v_t} \tag{1} \]

…but perhaps that’s not where you want to go. Where you want to go is a distance \(\Delta \widehat{z_1}\) from \(\widehat{z_1}\), and to get there you’ll have to make a “course correction” \(\Delta \hat{v}\), as shown in the following diagram:

By similar triangles, \(\Delta \widehat{z_1} = (1-t)\Delta \vec{v}\), which means the course correction you want is

\[ \Delta \vec{v} = { \Delta \widehat{z_1} \over 1-t } \tag{2} \]

Since you’re going to see more math once you try to read the scholarly literature on these topics, let’s go a bit further into the math…

So \(\Delta \widehat{z_1}\) is a measure of the deviation from the desired endpoint. Now, in practical application we won’t actually use the “distance” \(\Delta \widehat{z_1}\), but we’ll use something that functions like a distance, such as a K-L divergence or Mean Squared Error (MSE).

When doing inference, this deviation serves as a “loss” – something we minimize via gradient descent, except we’ll vary the flow positions \(z\) instead of the model weights. More specifically, we’ll consider the “likelihood” \(p( \widehat{z_1} | y )\) of getting a \(z_1\) that matches a given control \(y\), and we’ll seek to maximize that likelihood, or equivalently to minimize the negative log-likelihood.

The expression \(-\nabla_{\widehat{z_1}} \log p( \widehat{z_1} | y )\) essentially answers the question, “In which direction should I adjust \(\widehat{z_1}\) so as to make \(p( \widehat{z_1} | y )\) more likely?” This gives us a direction and a magnitude, which we then multiply by a learning rate “guidance strength” \(\eta\) to turn it into a step size.

Applying this gradient-based approach, our expression for \(\Delta v\) will involve replacing \(\Delta \widehat{z_1}\) in (2) with \(- \eta \nabla_{\widehat{z_1}} \log p( \widehat{z_1} | y )\):

\[ \Delta \vec{v} = - \eta {1 \over 1-t } \nabla_{z_t} \log p( \widehat{z_1} | y ) \tag{3} \]

where we used the fact that \(\nabla_{\widehat{z_1}} = \nabla_{z_t}\) (since \(\partial \widehat{z_1} / \partial z_t = 1\) when we treat \(\vec{v_t}\) as fixed). The factor of \(1/(1-t)\) means small corrections suffice early on, but later times require larger adjustments—though other time scalings are possible, as we’ll see.
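To make the bookkeeping concrete, here is a minimal sketch of a guided Euler step in PyTorch, written in terms of a generic “constraint” (a negative log-likelihood that we descend), so the minus sign in (3) shows up as a gradient-descent step on that constraint. The signatures `model(z, t)` and `constraint(z1_hat)` are assumptions for illustration, not part of any particular library:

```python
import torch

def guided_euler_step(model, z_t, t, dt, constraint=None, eta=1.0):
    """One Euler step of the flow ODE, z_{t+dt} = z_t + dt * (v_t + dv).

    model(z, t) returns the velocity; constraint(z1_hat) returns a scalar
    "loss" measuring how far the projected endpoint is from what we want.
    """
    with torch.no_grad():
        v = model(z_t, t)
    if constraint is not None and t < 1.0:
        # eq. (1): straight-line look-ahead to the endpoint
        with torch.enable_grad():
            z1_hat = (z_t + (1.0 - t) * v).detach().requires_grad_(True)
            grad = torch.autograd.grad(constraint(z1_hat), z1_hat)[0]
        # eqs. (2)-(3): descend the constraint, divided by the remaining time,
        # using the approximation grad wrt z_t ≈ grad wrt z1_hat
        v = v - eta * grad / (1.0 - t)
    return z_t + dt * v
```

With `constraint=None` this reduces to ordinary, unguided Euler integration of the flow.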

Now let’s apply this to a concrete example.

If we want our model to generate a member of a particular class, we can use an external classifier to examine the generated samples. The constraint to minimize will be the difference between the desired class and the argmax of the classifier output (or some similar relationship that enforces the class compliance).

For our flow model, let’s use Marco Cassar’s winning submission from the 2025 DLAIE Leaderboard Contest on unconditional latent flow matching of MNIST digits. For the classifier, we’ll use the official evaluation classifier from the same contest.

Setup the Flow Model and Classifier

Let’s generate and draw some sample images.
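As a rough sketch of what that might look like using the `guided_euler_step` helper above, assume a pretrained velocity model `flow_model(z, t)`, a VAE exposing `vae.decode(z)`, and a digit classifier `clf(x)` returning logits (all hypothetical names and shapes; the guidance strength \(\eta\) is a knob you’ll need to tune). Passing `constraint=None` gives plain unconditional samples; passing a classifier-based constraint steers toward a chosen digit:

```python
import torch
import torch.nn.functional as F

def sample(flow_model, vae, clf=None, target_class=3, n=16, steps=100,
           eta=1.0, latent_shape=(4, 7, 7), device="cpu"):
    """Integrate the flow from noise (t=0) to data (t=1), then decode to pixels."""
    z = torch.randn(n, *latent_shape, device=device)   # source samples at t=0
    dt = 1.0 / steps
    labels = torch.full((n,), target_class, device=device)

    def class_constraint(z1_hat):
        # Negative log-probability of the target class, measured on decoded pixels,
        # so gradients flow back through the VAE decoder (the slow variant).
        return F.cross_entropy(clf(vae.decode(z1_hat)), labels)

    constraint = class_constraint if clf is not None else None
    for i in range(steps):
        z = guided_euler_step(flow_model, z, i * dt, dt, constraint=constraint, eta=eta)
    return vae.decode(z)
```

Swapping `clf` for a classifier trained directly on latents removes the `vae.decode` call from the constraint, which is the faster latent-space variant.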

“Guidance: a cheat code for diffusion models” [8] is a classic and should be read by all. Yet because of the stochastic/random nature of the diffusion path, there are several “complicating” aspects of diffusion guidance that we’re going to gloss over in this tutorial, because with deterministic, smooth flow-model trajectories things become a lot more intuitive.

We’ll follow a method outlined in the paper “Training-free Linear Image Inverses via Flows”[12] by Pokle et al, a method that applies to general linear inverse problems of which inpainting is a particular case, and we’ll simplify their method to adapt it for just inpainting.

The method will be to generate an entire new image \(x_1\) that, everywhere outside the masked-out region, matches the pixels of the user-supplied (masked) image \(y\). So the constraint will be, given a 2D mask \(M\) (where \(M=1\) means there’s an original pixel there, and \(M=0\) is the masked-out region), to require that our estimated image \(\widehat{x_1}\) (i.e. the decoded/pixel version of the estimated latents \(\widehat{z_1}\)) satisfies \(M*\widehat{x_1} = M*y\), or, in “residual form”, we’ll just compute the Mean Squared Error (MSE) of \(M*(\widehat{x_1}-y)\):

\[ {\rm Constraint} = M^2 * (\widehat{x_1}-y)^2 \] (and if we want, we can use the fact that \(M\) being a binary mask means \(M^2 = M\)).

If we want to do latent-only inpainting (which will be the fastest), then the same constraint applies, just with the simplification \(\widehat{x_1} = \widehat{z_1}\).
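In code, that constraint is essentially a one-liner; here’s a sketch, where `x1_hat`, `y`, and `M` are tensors of the same shape (squaring the masked residual already gives the \(M^2\) factor, since \(M\) is binary):

```python
def masked_mse(x1_hat, y, M):
    """Masked inpainting constraint: MSE between the estimate and the user-supplied
    image y, computed only over the known (M = 1) pixels."""
    return ((M * (x1_hat - y)) ** 2).mean()
```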

The authors of the paper recommend only doing guidance from \(t=0.2\) onward because prior to that, it’s hard to make any meaningful estimate. In fact, they don’t even integrate before \(t = 0.2\); they just interpolate between the source and the target data to get their starting point at \(t = 0.2\).

To use our constraint in the guidance equation (3) for computing \(\Delta v\,\), we’ll need to turn our constraint into a likelihood by raising it to an exponential power – so we get a Gaussian! But the guidance equation includes a logarithm that immediately undoes our exponentiation:

\[ \Delta v = - {\eta \over 1-t} \nabla_{z_t} \ \,{\color{red}{\text{l̸o̸g̸}\,\text{e̸x̸p̸}}} \left( M^2 * (\widehat{x_1}-y)^2 \right) .\]

The gradient part is \[ \nabla_{z_t} M^2 *(\widehat{x_1}-y)^2 = 2M^2*(\widehat{x_1}-y) {\partial \widehat{x_1} \over \partial z_t } \]

If we’re inpainting in latent space and not using the decoder for the constraint, then \({\partial \widehat{x_1} / \partial z_t } = 1\). Otherwise that term will require evaluation via PyTorch’s autograd (=slow).

Our earlier time scaling was \(1/(1-t)\); it turns out that doesn’t work very well in practice when it comes to inpainting. Instead, we’ll use a different time scaling that delivers good (albeit not perfect) results: \((1-t)/t\). Thus our full equation for the velocity correction will be:

\[ \Delta \vec{v} = -\eta {1-t\over t} M^2 *(\widehat{x_1} - y){\partial\widehat{x_1}\over\partial{z_t}}, \] where we absorbed the factor of 2 into \(\eta\), and the last partial-derivative term can be set to one if we do latent-only inpainting.

Let’s implement this in code, using two different versions of the gradient calculation, depending on whether we can do it all in latent space or if we need to propagate gradients through the decoder:
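Here is a sketch of what those two variants might look like, reusing the `masked_mse` helper above: a latent-only version where the gradient is available in closed form, and a pixel-space version that backpropagates through a hypothetical `vae.decode`. The names `y_lat` and `M_lat` stand for latent-space versions of the measurement and mask, however you construct them; since the \((1-t)/t\) factor blows up at \(t=0\), you would only switch the correction on from some start time such as the \(t\approx0.2\) suggested in [12].

```python
import torch

def inpaint_dv_latent(z1_hat, y_lat, M_lat, t, eta=1.0):
    """Latent-only velocity correction: the constraint lives in latent space,
    so d(x1_hat)/d(z_t) is taken to be 1 and no autograd pass is needed."""
    return -eta * ((1.0 - t) / t) * M_lat * (z1_hat - y_lat)

def inpaint_dv_pixel(z1_hat, vae, y, M, t, eta=1.0):
    """Pixel-space velocity correction: decode the projected latents and
    backpropagate the masked-MSE gradient through the decoder (slower)."""
    with torch.enable_grad():
        z_req = z1_hat.detach().requires_grad_(True)
        loss = masked_mse(vae.decode(z_req), y, M)
        grad = torch.autograd.grad(loss, z_req)[0]
    return -eta * ((1.0 - t) / t) * grad
```

Either correction just gets added to the model’s velocity at each integration step, e.g. inside the loop of the `sample` sketch above.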


The PnP-Flow method [13] achieves similar results by adjusting latent positions \(z\) directly.

PnP-Flow assumes straight-line trajectories, making the forward projection trivial: \(\widehat{z_1}\) is reached by simple linear extrapolation. Instead of incrementally moving \(z\) from \(t=0\) to \(t=1\), PnP-Flow projects forward to \(\widehat{z_1}\) and iterates on that estimate through a series of correction and projection steps. The first step applies our gradient correction:

\[{\rm Step\ 1.}\ \ \ \ \ \ \ \ \ \ \ \ z_1^* := \widehat{z_1} - \eta\,\gamma_t \nabla F(\widehat{z_1},y)\]

where \(z_1^*\) (my notation) is our goal, i.e. the endpoint of our projected course correction, and \(F(\widehat{z_1},y)\) is our (log-exp probability) constraint. For the time scaling, the PnP-Flow authors recommend \(\gamma_t = (1-t)^\alpha\), where \(\alpha \in [0,1]\) is a hyperparameter chosen according to the task – e.g., they use \(\alpha\)’s as large as 0.8 for denoising tasks, 0.5 for box inpainting, and 0.01 for random inpainting. This choice of \(\gamma_t\) is a bit different from our earlier one of \((1-t)/t\). Both go to zero as \(t \rightarrow 1\), but they approach it differently and have different asymptotics as \(t\rightarrow 0\).

In the graph below, we show our earlier choice of \((1 - t)/t\) in green and \((1 - t)^\alpha\) in purple for various choices of \(\alpha\):

…where for “box inpainting” as we did above, they use \(\alpha\)=0.5.
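The comparison is easy to regenerate; here’s a short matplotlib sketch using the \(\alpha\) values mentioned above:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.01, 0.99, 200)
plt.plot(t, (1 - t) / t, color="green", label=r"$(1-t)/t$")      # our earlier scaling
for a in [0.01, 0.5, 0.8]:                                       # PnP-Flow's (1-t)^alpha
    plt.plot(t, (1 - t) ** a, color="purple", alpha=0.4 + 0.6 * a,
             label=rf"$(1-t)^{{{a}}}$")
plt.ylim(0, 5); plt.xlabel("t"); plt.ylabel(r"time scaling $\gamma_t$")
plt.legend(); plt.show()
```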

But PnP-Flow doesn’t stop there! Two other key steps remain. We then project backward to overwrite \(z_t\) with a corrected value:

\[{\rm Step\ 2.}\ \ \ \ \ \ \ \ \ \ \ \ z_t := (1-t)\,z_0 + t\, z_1^* \]

We then compute a new projected estimate, the same as before:

\[{\rm Step\ 3.}\ \ \ \ \ \ \widehat{z_1} := z_t + (1-t)\,v_t(z,t)\]

…and we loop over Steps 1 to 3 for each value of \(t\) in our set of (discrete) integration steps, i.e. after Step 3, we let \(t := t+\Delta\,t\) and go back to Step 1. Our final value of \(\widehat{z_1}\) will be the output.

This image from the PnP-Flow paper may prove instructive, showing 3 different instances of the 3 PnP-Flow steps:

Image from the PnP-Flow paper [13], slightly re-annotated by me.

This has a superficial resemblance to the “ping-pong” integration method used by the flow model Stable Audio Open Small (SAOS) [14], with a key distinction: the ping-pong integrator maintains and updates the time-integrated latent variable \(z\) (called “\(x\)” in SAOS), whereas for PnP-Flow it is the projection \(\widehat{z_1}\) (called “denoised” in SAOS) that is the primary variable maintained between steps. This is a subtle distinction but worth noting.

To implement PnP-Flow in code, let’s replace our “integrator” with something specific to PnP-Flow:
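Here is a sketch of such a PnP-Flow-style sampler, following Steps 1–3 above. It keeps the same hypothetical `model(z, t)` and scalar `constraint(z1_hat)` conventions as the earlier sketches, and omits refinements from the paper [13]; treat it as an outline rather than a faithful reimplementation:

```python
import torch

def pnp_flow_sample(model, z0, constraint, steps=100, eta=1.0, alpha=0.5):
    """PnP-Flow-style sampling: iterate on the endpoint estimate z1_hat via
    correct / project-backward / project-forward steps (Steps 1-3)."""
    dt = 1.0 / steps
    with torch.no_grad():
        z1_hat = z0 + model(z0, 0.0)            # initial forward projection from t=0
    for i in range(steps):
        t = i * dt
        # Step 1: gradient correction of the endpoint estimate, scaled by (1-t)^alpha
        with torch.enable_grad():
            z_req = z1_hat.detach().requires_grad_(True)
            grad = torch.autograd.grad(constraint(z_req), z_req)[0]
        z1_star = z1_hat - eta * (1.0 - t) ** alpha * grad
        # Step 2: project backward onto the straight path between z0 and z1_star
        z_t = (1.0 - t) * z0 + t * z1_star
        # Step 3: new endpoint estimate from the model's velocity at (z_t, t)
        with torch.no_grad():
            z1_hat = z_t + (1.0 - t) * model(z_t, t)
    return z1_hat
```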

While there are other inference-time steering approaches, such as recent work on steering autoregressive music generation [16], here we focused on pretrained flow models.

The key idea is simple: at each integration step, you project forward to estimate where you’ll end up (\(\widehat{z_1}\)), check how far that is from where you want to be, and add a small velocity correction to steer toward your goal. We applied this to an unconditional MNIST flow model for two tasks: generating specific digit classes via classifier guidance, and filling in masked-out regions via inpainting.

We looked at four approaches. First, standard classifier guidance in pixel space—it works but it’s slow because you’re propagating gradients through the VAE decoder. Second, we trained a simple latent-space classifier and did the same thing much faster. Third, we implemented the linear inpainting method from Pokle et al, which operates directly on latents. Fourth, we tried PnP-Flow, which achieves guidance not by correcting velocities but by iteratively projecting samples forward and backward in time.

The math here is much simpler than the corresponding diffusion methods because flow trajectories are smooth and deterministic. We’ve glossed over a lot of detail compared to the research papers, but hopefully this gives you enough to experiment with your own controls. There are limits to the effectiveness of guidance: small models that don’t generalize well won’t suddenly work miracles if you try to push them too far outside their training distribution. Nevertheless, these plugin methods are worth exploring as accessible ways to steer generative flows where you want them to go.

This work was supported by Hyperstate Music, Inc. who are awesome and you should invest in them! ;-) For providing feedback on early drafts of this tutorial: thanks to Raymond Fan, Alan Lockett, and my amazing students in PHY/DSC/BSA 4420, “Deep Learning and AI Ethics”! Also thanks to Danilo Comminiello and the signal processing group at Sapienza University of Rome for encouraging me to present a preceding lecture [5] at IJCNN 2025 – this blog post is the missing piece!

This document was partially prepared using the SolveIt platform of Answer.ai and via many discussions with Claude.ai; still overwhelmingly authored by the hands of this here human.

[2]

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in The Eleventh International Conference on Learning Representations (ICLR), 2023. Available: https://openreview.net/forum?id=PqvMRDCJT9t

[3]

X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” in 11th International Conference on Learning Representations (ICLR), 2023. Available: https://openreview.net/forum?id=XVjTT1nw5z

[7]

H. Ye et al., “TFG: Unified training-free guidance for diffusion models,” Advances in Neural Information Processing Systems, vol. 37, pp. 22370–22417, 2024. Available: https://arxiv.org/abs/2409.15761

[9]

Z. W. Blasingame and C. Liu, “Greed is good: Guided generation from a greedy perspective,” in Frontiers in probabilistic inference: Learning meets sampling, 2025. Available: https://openreview.net/forum?id=o4yQzZ5qCW

[11]

A. Tong et al., “Improving and generalizing flow-based generative models with minibatch optimal transport,” Transactions on Machine Learning Research, 2024. Available: https://openreview.net/forum?id=CD9Snc73AW

[12]

A. Pokle, M. J. Muckley, R. T. Q. Chen, and B. Karrer, “Training-free linear image inverses via flows,” Transactions on Machine Learning Research, 2024. Available: https://openreview.net/forum?id=PLIt3a4yTm

[13]

S. Martin, A. Gagneux, P. Hagemann, and G. Steidl, “PnP-flow: Plug-and-play image restoration with flow matching.” 2024. Available: https://arxiv.org/abs/2410.02423

[14]

Z. Novack et al., “Fast text-to-audio generation with adversarial post-training,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025. Available: https://arxiv.org/abs/2505.08175

[15]

Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO: Diffusion inference-time T-optimization for music generation,” in Forty-first International Conference on Machine Learning (ICML), 2024. Available: https://openreview.net/forum?id=z5Ux2u6t7U

[16]

D. Zhao, D. Beaglehole, T. Berg-Kirkpatrick, J. McAuley, and Z. Novack, “Steering autoregressive music generation with recursive feature machines.” 2025. Available: https://arxiv.org/abs/2510.19127


If you found this tutorial helpful, I encourage you to cite it: