Flow Where You Want – Guidance for Flow Models

Original link: https://drscotthawley.github.io/blog/posts/FlowWhereYouWant.html


This tutorial demonstrates how to add inference-time controls to pretrained flow-based generative models to make them perform tasks they weren’t trained to do. We take an unconditional flow model trained on MNIST digits and apply two types of guidance: classifier guidance to generate specific digits, and inpainting to fill in missing pixels. Both approaches work by adding velocity corrections during the sampling process to steer the model toward desired outcomes. Since modern generative models operate in compressed latent spaces, we examine guidance methods that work directly in latent space as well as those that decode to pixel space. We also explore PnP-Flow, which satisfies constraints by iteratively projecting samples backward and forward in time rather than correcting flow velocities. The approaches demonstrated here work with other flow models and control tasks, so you can guide flows where you want them to go.

“When we put bits into the mouths of horses to make them obey us, we can turn the whole animal. Or take ships as an example. Although they are so large and are driven by strong winds, they are steered by a very small rudder wherever the pilot wants to go.”
– James 3:3-4 (NIV)

In this tutorial, we’ll explore inference-time “plugin” methods for flow matching and rectified flow generative models like FLUX or Stable Audio Open Small. Unlike classifier-free guidance (CFG) [1], which requires training the model with your desired conditioning signal, these plugin guidance methods let you add controls at inference time—even for conditions the model never saw during training.

This tutorial assumes familiarity with flow-based generative models, by which we mean “flow matching” [2] and/or “rectified flows” [3]. See the blog post “Flow With What You Know” [4] for an overview, and/or my IJCNN 2025 tutorial [5] for further detail. The key insight is that flow models generate samples through iterative integration, and at each step we can add small velocity corrections to steer toward specific goals. This works for various objectives: generating specific classes, filling in missing regions, or satisfying other desired constraints.

Our discussion will bring us up to date on guidance methods for latent-space rectified flow models. While there’s an extensive literature on guidance for diffusion models [7] – see Sander Dieleman’s excellent blog post [8] for an overview — flow matching allows us to cast these in a more accessible and intuitive way. There’s some recent work unifying guidance for diffusion and flows [9], but in this tutorial we’ll focus on a simplified treatment for flows only.

The paradigm of latent generative models is covered in another superb Dieleman post [10], and combining latent-space models with flow-based guidance gives us powerful, flexible tools for adding controls to efficient generation.

Let’s review the picture for flow-based generative modeling in latent space…

The Latent Flow-Matching Setup

The following diagrams illustrate the three key concepts:

a) A VAE compresses pixel-space images into compact latent representations. “E” is for encoder and “D” is for decoder:

b) The flow model operates in this latent space, transforming noise (“Source”, t=0) into structured data (“Target”, t=1) through iterative integration. The decoder then converts the final latents back to pixels.

c) While general flows can follow curved trajectories, some of our methods will focus on flows with nearly straight trajectories, which allows for estimating endpoints without many integration steps:

These (nearly) straight trajectories can be obtained by “ReFlow” distillation of another model (covered in [4]) or by insisting during training that the models yield paths agreeing with Optimal Transport such as the “minibatch OT” method of Tong et al [11]. Even if the model’s trajectories aren’t super-straight, we’ll see that the guidance methods we use can be applied fairly generally anyway.

Intuitively, guidance amounts to “steering” during the integration of the flow model in order to end up at a desired end point. The following video provides a useful metaphor:

OK, the analogy’s not quite right: you can’t just steer; you’re going to have to paddle a little bit. In other words, you’re going to have to provide a bit of extra velocity to correct where the “current” flow is taking you.

In flow matching, we go from a source distribution at time \(t=0\) to the target data distribution at \(t=1\). Since this tutorial works in latent space, we’ll use the letter \(z\) for position, with \(z_t\) denoting the position at time \(t\).

When you’re “looking ahead” to estimate where you’ll end up, you project linearly along the current velocity \(\vec{v_t}\) for the duration of the remaining time. Let’s call this estimate \(\widehat{z_1}\), your projected endpoint:

\[ \widehat{z_1} = z_t + (1-t)\vec{v_t} \tag{1} \]

…but perhaps that’s not where you want to go. Where you want to go is a distance \(\Delta \widehat{z_1}\) from \(\widehat{z_1}\), and to get there you’ll have to make a “course correction” \(\Delta \hat{v}\), as shown in the following diagram:

By similar triangles, \(\Delta \widehat{z_1} = (1-t)\Delta \vec{v}\), which means the course correction you want is

\[ \Delta \vec{v} = { \Delta \widehat{z_1} \over 1-t } \tag{2} \]

Since you’re going to see more math once you try to read the scholarly literature on these topics, let’s go a bit further into the math…

So \(\Delta \widehat{z_1}\) is a measure of the deviation from the desired endpoint. Now, in practical application we won’t actually use the “distance” \(\Delta \widehat{z_1}\), but we’ll use something that functions like a distance, such as a K-L divergence or Mean Squared Error (MSE).

When doing inference, this deviation serves as a “loss” – something we minimize via gradient descent, except we’ll vary the flow positions \(z\) instead of the model weights. More specifically, we’ll consider the “likelihood” \(p( \widehat{z_1} | y )\) of getting a \(z_1\) that matches a given control \(y\), and we’ll seek to maximize that likelihood, or equivalently to minimize the negative log-likelihood.

The expression \(-\nabla_{\widehat{z_1}} \log p( \widehat{z_1} | y )\) essentially answers the question, “In which direction should I adjust \(\widehat{z_1}\) so as to make \(p( \widehat{z_1} | y )\) more likely?” This gives us a direction and a magnitude, which we then multiply by a learning rate “guidance strength” \(\eta\) to turn it into a step size.

Applying this gradient-based approach, our expression for \(\Delta v\) will involve replacing \(\Delta \widehat{z_1}\) in (2) with \(- \eta \nabla_{\widehat{z_1}} \log p( \widehat{z_1} | y )\):

\[ \Delta \vec{v} = - \eta {1 \over 1-t } \nabla_{z_t} \log p( \widehat{z_1} | y ) \tag{3} \]

where we used the fact that \(\nabla_{\widehat{z_1}} = \nabla_{z_t}\) (since \(\partial \widehat{z_1} / \partial z_t = 1\) when we treat \(\vec{v_t}\) as fixed). The factor of \(1/(1-t)\) means small corrections suffice early on, but later times require larger adjustments—though other time scalings are possible, as we’ll see.
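To make the bookkeeping concrete, here is a minimal sketch of a guided Euler step in PyTorch, written in terms of a generic “constraint” (a negative log-likelihood that we descend), so the minus sign in (3) shows up as a gradient-descent step on that constraint. The signatures `model(z, t)` and `constraint(z1_hat)` are assumptions for illustration, not part of any particular library:

```python
import torch

def guided_euler_step(model, z_t, t, dt, constraint=None, eta=1.0):
    """One Euler step of the flow ODE, z_{t+dt} = z_t + dt * (v_t + dv).

    model(z, t) returns the velocity; constraint(z1_hat) returns a scalar
    "loss" measuring how far the projected endpoint is from what we want.
    """
    with torch.no_grad():
        v = model(z_t, t)
    if constraint is not None and t < 1.0:
        # eq. (1): straight-line look-ahead to the endpoint
        with torch.enable_grad():
            z1_hat = (z_t + (1.0 - t) * v).detach().requires_grad_(True)
            grad = torch.autograd.grad(constraint(z1_hat), z1_hat)[0]
        # eqs. (2)-(3): descend the constraint, divided by the remaining time,
        # using the approximation grad wrt z_t ≈ grad wrt z1_hat
        v = v - eta * grad / (1.0 - t)
    return z_t + dt * v
```

With `constraint=None` this reduces to ordinary, unguided Euler integration of the flow.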

Now let’s apply this to a concrete example.

If we want our model to generate a member of a particular class, we can use an external classifier to examine the generated samples. The constraint to minimize will be the difference between the desired class and the argmax of the classifier output (or some similar relationship that enforces the class compliance).

For our flow model, let’s use Marco Cassar’s winning submission from the 2025 DLAIE Leaderboard Contest on unconditional latent flow matching of MNIST digits. For the classifier, we’ll use the official evaluation classifier from the same contest.

Setup the Flow Model and Classifier

Let’s generate and draw some sample images.
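As a rough sketch of what that might look like using the `guided_euler_step` helper above, assume a pretrained velocity model `flow_model(z, t)`, a VAE exposing `vae.decode(z)`, and a digit classifier `clf(x)` returning logits (all hypothetical names and shapes; the guidance strength \(\eta\) is a knob you’ll need to tune). Passing `constraint=None` gives plain unconditional samples; passing a classifier-based constraint steers toward a chosen digit:

```python
import torch
import torch.nn.functional as F

def sample(flow_model, vae, clf=None, target_class=3, n=16, steps=100,
           eta=1.0, latent_shape=(4, 7, 7), device="cpu"):
    """Integrate the flow from noise (t=0) to data (t=1), then decode to pixels."""
    z = torch.randn(n, *latent_shape, device=device)   # source samples at t=0
    dt = 1.0 / steps
    labels = torch.full((n,), target_class, device=device)

    def class_constraint(z1_hat):
        # Negative log-probability of the target class, measured on decoded pixels,
        # so gradients flow back through the VAE decoder (the slow variant).
        return F.cross_entropy(clf(vae.decode(z1_hat)), labels)

    constraint = class_constraint if clf is not None else None
    for i in range(steps):
        z = guided_euler_step(flow_model, z, i * dt, dt, constraint=constraint, eta=eta)
    return vae.decode(z)
```

Swapping `clf` for a classifier trained directly on latents removes the `vae.decode` call from the constraint, which is the faster latent-space variant.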

“Guidance: a cheat code for diffusion models” [8] is a classic and should be read by all. Yet because of the stochastic/random nature of the diffusion path, there are several “complicating” aspects of diffusion guidance that we’re going to gloss over in this tutorial, because with deterministic, smooth flow-model trajectories things become a lot more intuitive.

We’ll follow a method outlined in the paper “Training-free Linear Image Inverses via Flows”[12] by Pokle et al, a method that applies to general linear inverse problems of which inpainting is a particular case, and we’ll simplify their method to adapt it for just inpainting.

The method will be to generate an entire new image \(x_1\) that, everywhere outside the masked-out region, matches the pixels of the user-supplied (masked) image \(y\). So the constraint will be, given a 2D mask \(M\) (where \(M=1\) means there’s an original pixel there, and \(M=0\) is the masked-out region), to require that our estimated image \(\widehat{x_1}\) (i.e. the decoded/pixel version of the estimated latents \(\widehat{z_1}\)) satisfies \(M*\widehat{x_1} = M*y\), or, in “residual form”, we’ll just compute the Mean Squared Error (MSE) of \(M*(\widehat{x_1}-y)\):

\[ {\rm Constraint} = M^2 * (\widehat{x_1}-y)^2 \] (and if we want, we can use the fact that \(M\) being a binary mask means \(M^2 = M\)).

If we want to do latent-only inpainting (which will be the fastest), then the same constraint applies, just with the simplification \(\widehat{x_1} = \widehat{z_1}\).
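In code, that constraint is essentially a one-liner; here’s a sketch, where `x1_hat`, `y`, and `M` are tensors of the same shape (squaring the masked residual already gives the \(M^2\) factor, since \(M\) is binary):

```python
def masked_mse(x1_hat, y, M):
    """Masked inpainting constraint: MSE between the estimate and the user-supplied
    image y, computed only over the known (M = 1) pixels."""
    return ((M * (x1_hat - y)) ** 2).mean()
```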

The authors of the paper recommend only doing guidance from \(t=0.2\) onward because prior to that, it’s hard to make any meaningful estimate. In fact, they don’t even integrate before \(t = 0.2\); they just interpolate between the source and the target data to get their starting point at \(t = 0.2\).

To use our constraint in the guidance equation (3) for computing \(\Delta v\,\), we’ll need to turn our constraint into a likelihood by raising it to an exponential power – so we get a Gaussian! But the guidance equation includes a logarithm that immediately undoes our exponentiation:

\[ \Delta v = - {\eta \over 1-t} \nabla_{z_t} \ \,{\color{red}{\text{l̸o̸g̸}\,\text{e̸x̸p̸}}} \left( M^2 * (\widehat{x_1}-y)^2 \right) .\]

The gradient part is \[ \nabla_{z_t} M^2 *(\widehat{x_1}-y)^2 = 2M^2*(\widehat{x_1}-y) {\partial \widehat{x_1} \over \partial z_t } \]

If we’re inpainting in latent space and not using the decoder for the constraint, then \({\partial \widehat{x_1} / \partial z_t } = 1\). Otherwise that term will require evaluation via PyTorch’s autograd (=slow).

Our earlier time scaling was \(1/(1-t)\); it turns out that doesn’t work very well in practice when it comes to inpainting. Instead, we’ll use a different time scaling that delivers good (albeit not perfect) results: \((1-t)/t\). Thus our full equation for the velocity correction will be:

\[ \Delta \vec{v} = -\eta {1-t\over t} M^2 *(\widehat{x_1} - y){\partial\widehat{x_1}\over\partial{z_t}}, \] where we absorbed the factor of 2 into \(\eta\), and the last partial-derivative term can be set to one if we do latent-only inpainting.

Let’s implement this in code, using two different versions of the gradient calculation, depending on whether we can do it all in latent space or if we need to propagate gradients through the decoder:
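Here is a sketch of what those two variants might look like, reusing the `masked_mse` helper above: a latent-only version where the gradient is available in closed form, and a pixel-space version that backpropagates through a hypothetical `vae.decode`. The names `y_lat` and `M_lat` stand for latent-space versions of the measurement and mask, however you construct them; since the \((1-t)/t\) factor blows up at \(t=0\), you would only switch the correction on from some start time such as the \(t\approx0.2\) suggested in [12].

```python
import torch

def inpaint_dv_latent(z1_hat, y_lat, M_lat, t, eta=1.0):
    """Latent-only velocity correction: the constraint lives in latent space,
    so d(x1_hat)/d(z_t) is taken to be 1 and no autograd pass is needed."""
    return -eta * ((1.0 - t) / t) * M_lat * (z1_hat - y_lat)

def inpaint_dv_pixel(z1_hat, vae, y, M, t, eta=1.0):
    """Pixel-space velocity correction: decode the projected latents and
    backpropagate the masked-MSE gradient through the decoder (slower)."""
    with torch.enable_grad():
        z_req = z1_hat.detach().requires_grad_(True)
        loss = masked_mse(vae.decode(z_req), y, M)
        grad = torch.autograd.grad(loss, z_req)[0]
    return -eta * ((1.0 - t) / t) * grad
```

Either correction just gets added to the model’s velocity at each integration step, e.g. inside the loop of the `sample` sketch above.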


The PnP-Flow method [13] achieves similar results by adjusting latent positions \(z\) directly.

PnP-Flow assumes straight-line trajectories, making the forward projection trivial: \(\widehat{z_1}\) is reached by simple linear extrapolation. Instead of incrementally moving \(z\) from \(t=0\) to \(t=1\), PnP-Flow projects forward to \(\widehat{z_1}\) and iterates on that estimate through a series of correction and projection steps. The first step applies our gradient correction:

\[{\rm Step\ 1.}\ \ \ \ \ \ \ \ \ \ \ \ z_1^* := \widehat{z_1} - \eta\,\gamma_t \nabla F(\widehat{z_1},y)\]

where \(z_1^*\) (my notation) is our goal, i.e. the endpoint of our projected course correction, and \(F(\widehat{z_1},y)\) is our (log-exp probability) constraint. For the time scaling, the PnP-Flow authors recommend \(\gamma_t = (1-t)^\alpha\), where \(\alpha \in [0,1]\) is a hyperparameter chosen according to the task – e.g., they use \(\alpha\)’s as large as 0.8 for denoising tasks, 0.5 for box inpainting, and 0.01 for random inpainting. This choice of \(\gamma_t\) is a bit different from our earlier one of \((1-t)/t\). Both go to zero as \(t \rightarrow 1\), but they approach it differently and have different asymptotics as \(t\rightarrow 0\).

In the graph below, we show our earlier choice of \((1 - t)/t\) in green and \((1 - t)^\alpha\) in purple for various choices of \(\alpha\):

…where for “box inpainting” as we did above, they use \(\alpha\)=0.5.
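The comparison is easy to regenerate; here’s a short matplotlib sketch using the \(\alpha\) values mentioned above:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.01, 0.99, 200)
plt.plot(t, (1 - t) / t, color="green", label=r"$(1-t)/t$")      # our earlier scaling
for a in [0.01, 0.5, 0.8]:                                       # PnP-Flow's (1-t)^alpha
    plt.plot(t, (1 - t) ** a, color="purple", alpha=0.4 + 0.6 * a,
             label=rf"$(1-t)^{{{a}}}$")
plt.ylim(0, 5); plt.xlabel("t"); plt.ylabel(r"time scaling $\gamma_t$")
plt.legend(); plt.show()
```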

But PnP-Flow doesn’t stop there! Two other key steps remain. We then project backward to overwrite \(z_t\) with a corrected value:

\[{\rm Step\ 2.}\ \ \ \ \ \ \ \ \ \ \ \ z_t := (1-t)\,z_0 + t\, z_1^* \]

We then compute a new projected estimate, the same as before:

\[{\rm Step\ 3.}\ \ \ \ \ \ \widehat{z_1} := z_t + (1-t)\,v_t(z,t)\]

…and we loop over Steps 1 to 3 for each value of \(t\) in our set of (discrete) integration steps, i.e. after Step 3, we let \(t := t+\Delta\,t\) and go back to Step 1. Our final value of \(\widehat{z_1}\) will be the output.

This image from the PnP-Flow paper may prove instructive, showing 3 different instances of the 3 PnP-Flow steps:

Image from the PnP-Flow paper [13], slightly re-annotated by me.

This has a superficial resemblance to the “ping-pong” integration method used by the flow model Stable Audio Open Small (SAOS) [14], with a key distinction: the ping-pong integrator maintains and updates the time-integrated latent variable \(z\) (called “\(x\)” in SAOS), whereas for PnP-Flow it is the projection \(\widehat{z_1}\) (called “denoised” in SAOS) that is the primary variable maintained between steps. This is a subtle distinction but worth noting.

To implement PnP-Flow in code, let’s replace our “integrator” with something specific to PnP-Flow:
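Here is a sketch of such a PnP-Flow-style sampler, following Steps 1–3 above. It keeps the same hypothetical `model(z, t)` and scalar `constraint(z1_hat)` conventions as the earlier sketches, and omits refinements from the paper [13]; treat it as an outline rather than a faithful reimplementation:

```python
import torch

def pnp_flow_sample(model, z0, constraint, steps=100, eta=1.0, alpha=0.5):
    """PnP-Flow-style sampling: iterate on the endpoint estimate z1_hat via
    correct / project-backward / project-forward steps (Steps 1-3)."""
    dt = 1.0 / steps
    with torch.no_grad():
        z1_hat = z0 + model(z0, 0.0)            # initial forward projection from t=0
    for i in range(steps):
        t = i * dt
        # Step 1: gradient correction of the endpoint estimate, scaled by (1-t)^alpha
        with torch.enable_grad():
            z_req = z1_hat.detach().requires_grad_(True)
            grad = torch.autograd.grad(constraint(z_req), z_req)[0]
        z1_star = z1_hat - eta * (1.0 - t) ** alpha * grad
        # Step 2: project backward onto the straight path between z0 and z1_star
        z_t = (1.0 - t) * z0 + t * z1_star
        # Step 3: new endpoint estimate from the model's velocity at (z_t, t)
        with torch.no_grad():
            z1_hat = z_t + (1.0 - t) * model(z_t, t)
    return z1_hat
```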

While there are other inference-time steering approaches, such as recent work on steering autoregressive music generation [16], here we focused on pretrained flow models.

The key idea is simple: at each integration step, you project forward to estimate where you’ll end up (\(\widehat{z_1}\)), check how far that is from where you want to be, and add a small velocity correction to steer toward your goal. We applied this to an unconditional MNIST flow model for two tasks: generating specific digit classes via classifier guidance, and filling in masked-out regions via inpainting.

We looked at four approaches. First, standard classifier guidance in pixel space—it works but it’s slow because you’re propagating gradients through the VAE decoder. Second, we trained a simple latent-space classifier and did the same thing much faster. Third, we implemented the linear inpainting method from Pokle et al, which operates directly on latents. Fourth, we tried PnP-Flow, which achieves guidance not by correcting velocities but by iteratively projecting samples forward and backward in time.

The math here is much simpler than the corresponding diffusion methods because flow trajectories are smooth and deterministic. We’ve glossed over a lot of detail compared to the research papers, but hopefully this gives you enough to experiment with your own controls. There are limits to the effectiveness of guidance: small models that don’t generalize well won’t suddenly work miracles if you try to push them too far outside their training distribution. Nevertheless, these plugin methods are worth exploring as accessible ways to steer generative flows where you want them to go.

This work was supported by Hyperstate Music, Inc. who are awesome and you should invest in them! ;-) For providing feedback on early drafts of this tutorial: thanks to Raymond Fan, Alan Lockett, and my amazing students in PHY/DSC/BSA 4420, “Deep Learning and AI Ethics”! Also thanks to Danilo Comminiello and the signal processing group at Sapienza University of Rome for encouraging me to present a preceding lecture [5] at IJCNN 2025 – this blog post is the missing piece!

This document was partially prepared using the SolveIt platform of Answer.ai and via many discussions with Claude.ai; still overwhelmingly authored by the hands of this here human.

[2]

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in The Eleventh International Conference on Learning Representations (ICLR), 2023. Available: https://openreview.net/forum?id=PqvMRDCJT9t

[3]

X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” in 11th International Conference on Learning Representations (ICLR), 2023. Available: https://openreview.net/forum?id=XVjTT1nw5z

[7]

H. Ye et al., “TFG: Unified training-free guidance for diffusion models,” Advances in Neural Information Processing Systems, vol. 37, pp. 22370–22417, 2024. Available: https://arxiv.org/abs/2409.15761

[9]

Z. W. Blasingame and C. Liu, “Greed is good: Guided generation from a greedy perspective,” in Frontiers in probabilistic inference: Learning meets sampling, 2025. Available: https://openreview.net/forum?id=o4yQzZ5qCW

[11]

A. Tong et al., “Improving and generalizing flow-based generative models with minibatch optimal transport,” Transactions on Machine Learning Research, 2024. Available: https://openreview.net/forum?id=CD9Snc73AW

[12]

A. Pokle, M. J. Muckley, R. T. Q. Chen, and B. Karrer, “Training-free linear image inverses via flows,” Transactions on Machine Learning Research, 2024. Available: https://openreview.net/forum?id=PLIt3a4yTm

[13]

S. Martin, A. Gagneux, P. Hagemann, and G. Steidl, “PnP-flow: Plug-and-play image restoration with flow matching.” 2024. Available: https://arxiv.org/abs/2410.02423

[14]

Z. Novack et al., “Fast text-to-audio generation with adversarial post-training,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025. Available: https://arxiv.org/abs/2505.08175

[15]

Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO: Diffusion inference-time T-optimization for music generation,” in Forty-first International Conference on Machine Learning (ICML), 2024. Available: https://openreview.net/forum?id=z5Ux2u6t7U

[16]

D. Zhao, D. Beaglehole, T. Berg-Kirkpatrick, J. McAuley, and Z. Novack, “Steering autoregressive music generation with recursive feature machines.” 2025. Available: https://arxiv.org/abs/2510.19127


If you found this tutorial helpful, I encourage you to cite it: