Never snooze a future

Original link: https://jacko.io/snooze.html

## The "snoozing" bug in async Rust

This article examines a subtle but critical bug class in async Rust: **snoozing**. Unlike cancellation or starvation, snoozing happens when a future is ready to make progress but is not being polled, leading to hangs and deadlocks, as in the recent "Futurelock" case study. The author argues that snoozing is *almost always* a bug, rooted in patterns that poll futures by reference (e.g. with `select!` or `poll!`) rather than owning them. The core problem: when a task polls a future and then moves on before the future completes or is dropped, the future is "snoozed" and cannot release the resources it holds (such as locks), much like suspending a thread inside a critical section, a well-known anti-pattern in threaded programming. Proposed remedies include joining owned futures (via crates like `join_me_maybe`) and possibly amending the `Stream` trait to guarantee cancellation safety. The key takeaway is a rule for developers: **avoid handling `Pin<_>` values in async functions**, since pinning usually indicates a future that isn't owned and is therefore vulnerable to snoozing. Ultimately, preventing snoozing requires designing async code around owning futures rather than borrowing them.


March 2nd, 2026

Huh, that's confusing, because the task ought to be able to execute other futures in that case — so why are her connections stalling out without making progress?
- Barbara battles buffered streams

When a future is ready to make progress, but it's not getting polled, I call that "snoozing". Snoozing is to blame for a lot of hangs and deadlocks in async Rust, including the recent "Futurelock" case study from the folks at Oxide. I'm going to argue that snoozing is almost always a bug, that the tools and patterns that expose us to it should be considered harmful, and that reliable and convenient replacements are possible.

Before we dive in, I want to be clear that snoozing and cancellation are different things. If a snoozed future eventually wakes up, then clearly it wasn't cancelled. On the other hand, a cancelled future can also be snoozed, if there's a gap between when it's last polled and when it's finally dropped. Cancellation bugs are a big topic in async Rust, and it's good that we're talking about them, but cancellation itself isn't a bug. Snoozing is a bug, and I don't think we talk about it enough.

Deadlocks

Any time you have a single task polling multiple futures concurrently, be extremely careful that the task never stops polling a future that it previously started polling.
- Futurelock

Snoozing can cause mysterious latencies and timeouts, but the clearest and most dramatic snoozing bugs are deadlocks ("futurelocks"). Let's look at several examples. Our test subject today will be foo, a toy function that takes a private async lock and pretends to do some work:

static LOCK: tokio::sync::Mutex<()> = tokio::sync::Mutex::const_new(());

async fn foo() {
    let _guard = LOCK.lock().await;
    tokio::time::sleep(Duration::from_millis(10)).await;
}

As we go along, I want you to imagine that foo is buried three crates deep in some dependency you've never heard of. When these things happen in real life, the lock, the future that's holding it, and the mistake that snoozes that future can all be far apart from each other. With that in mind, here's the minimal futurelock:

let future1 = pin!(foo());
_ = poll!(future1);
foo().await;

There are two calls to foo here. We get future1 from the first call and poll! it, which runs it to the point where it's acquired the LOCK and started sleeping. Then we call foo again, it gives us another future, and this time we .await it. In other words, we poll the second foo future in a loop until it's finished. But it tries to take the same lock, and future1 isn't going to release that lock until we either poll future1 again or drop it. Our loop isn't going to do either of those things — we've "snoozed" future1 — so we're deadlocked.
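To see those mechanics without any macros or a runtime, here's a sketch that models this futurelock in plain std Rust. Everything here is a stand-in I invented for illustration: `snooze_demo` is a made-up name, the `Cell<bool>` flag stands in for `tokio::sync::Mutex`, and a one-poll "yield" stands in for `tokio::time::sleep`. It needs Rust 1.85+ for `Waker::noop`:

```rust
use std::cell::Cell;
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::task::{Context, Poll, Waker};

// A hand-polled model of the minimal futurelock: poll future1 once,
// then try to drive a second foo to completion without ever polling
// future1 again.
pub fn snooze_demo() -> (bool, bool) {
    let locked = Cell::new(false);
    let locked = &locked;

    // A toy `foo`: take the "lock", stay pending for one poll, release it.
    let make_foo = move || async move {
        poll_fn(|_cx| {
            if locked.get() {
                Poll::Pending // lock busy: keep waiting
            } else {
                locked.set(true); // acquire
                Poll::Ready(())
            }
        })
        .await;
        let mut slept = false;
        poll_fn(move |_cx| {
            if slept {
                Poll::Ready(())
            } else {
                slept = true; // "sleep": pending for exactly one poll
                Poll::Pending
            }
        })
        .await;
        locked.set(false); // release
    };

    let mut cx = Context::from_waker(Waker::noop());

    // Poll future1 once: it acquires the lock and starts "sleeping".
    let mut future1 = pin!(make_foo());
    assert!(future1.as_mut().poll(&mut cx).is_pending());

    // Now "await" a second foo without ever polling future1 again.
    let mut future2 = pin!(make_foo());
    let mut stuck = true;
    for _ in 0..100 {
        if future2.as_mut().poll(&mut cx).is_ready() {
            stuck = false; // can't happen: future1 still holds the lock
        }
    }

    // Un-snooze future1: one more poll finishes it and releases the lock.
    assert!(future1.as_mut().poll(&mut cx).is_ready());
    assert!(future2.as_mut().poll(&mut cx).is_pending()); // acquires, "sleeps"
    let recovered = future2.as_mut().poll(&mut cx).is_ready();
    (stuck, recovered)
}

fn main() {
    assert_eq!(snooze_demo(), (true, true));
    println!("future2 was stuck while future1 was snoozed, then recovered");
}
```

Re-polling future1 is what breaks the deadlock here; in the real tokio version, dropping future1 would also work, because `MutexGuard` releases the lock in its destructor (the toy flag has no `Drop` impl, so this sketch re-polls instead).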

That example is nice and short, but the poll! macro isn't common in real programs. What you're more likely to see in practice is something like this with select!:

let mut future1 = pin!(foo());
loop {
    select! {
        _ = &mut future1 => break,

        _ = tokio::time::sleep(Duration::from_millis(5)) => {
            foo().await;
        }
    }
}

This loop is trying to drive future1 to completion, while waking up every so often to do some background work. The select! macro polls both &mut future1 and a Sleep future until one of them is ready, then it drops both of them and runs the => body of the winner. The loop creates a new Sleep future each time around, but it doesn't want to restart foo, so it selects on future1 by reference. But that only keeps future1 alive; it doesn't mean that it keeps getting polled. The intent is to poll future1 again in the next loop iteration, but we snooze it during the background work, which happens to include another call to foo, and we're deadlocked again.

We can also provoke this deadlock by selecting on a stream:

let mut stream = pin!(stream::once(foo()));
select! {
    _ = stream.next() => {}
    _ = tokio::time::sleep(Duration::from_millis(5)) => {}
}
foo().await;

In this case the stream.next() future is actually a value, not a reference, and it does get dropped after the sleep finishes. But it contains a reference to the stream, and we still end up snoozing the foo future inside that stream after we cancel next.

Speaking of streams, another category of futurelocks comes from buffered streams:

futures::stream::iter([foo(), foo()])
    .buffered(2)
    .for_each(|_| foo())
    .await;

Here the buffer starts polling both of its foo futures concurrently. When the first one finishes, control passes to the for_each closure. While that closure is running, the other foo in the buffer is snoozed.

Buffered streams are a wrapper around either FuturesOrdered or FuturesUnordered, and we can hit the same deadlock by looping over either of those directly:

let mut futures = FuturesUnordered::new();
futures.push(foo());
futures.push(foo());
while let Some(_) = futures.next().await {
    foo().await;
}

Deadlocks are bad, but what's worse is that it's hard to pinpoint exactly what these examples have done wrong. Is foo broken? Are select! and buffered streams broken? Are these programs "holding them wrong"?

Rather than jumping straight into answering those questions, I want to ask an entirely different question: Why don't we have deadlocks like these when we use regular locks and threads?

Threads

How many times does
it have to be said: Never
call TerminateThread.
- Larry Osterman

Let's think about a regular, non-async version of foo:

static LOCK: std::sync::Mutex<()> = std::sync::Mutex::new(());

fn foo() {
    let _guard = LOCK.lock().unwrap();
    thread::sleep(Duration::from_millis(10));
}

Assuming that this foo is the only function that touches this LOCK, is it even possible for there to be a deadlock here?

The short, reasonable answer is no. But the long, pedantic answer is yes, if we're willing to break a long-standing rule of systems programming and kill the thread that foo is running on. The Windows TerminateThread function warns us about this: "If the target thread owns a critical section, the critical section will not be released." The classic cause of these problems on Unix is fork, which copies the whole address space of a process but only one of its running threads. There's nothing a function like foo can realistically do to protect itself from this, so instead the general rule is "Never kill a thread."
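We can back up the short answer empirically. This is a self-contained check (`run_both` is a helper name I've added): two concurrent calls to the threaded foo simply serialize on the lock and both complete, because a thread can't be left half-run the way a snoozed future can:

```rust
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

static LOCK: Mutex<()> = Mutex::new(());

// The non-async foo from above: the guard is dropped when foo returns,
// so the lock is always released.
fn foo() {
    let _guard = LOCK.lock().unwrap();
    thread::sleep(Duration::from_millis(10));
}

// Spawn two concurrent calls; the second blocks on the lock, then proceeds.
pub fn run_both() -> bool {
    let t1 = thread::spawn(foo);
    let t2 = thread::spawn(foo);
    t1.join().is_ok() && t2.join().is_ok()
}

fn main() {
    assert!(run_both());
    println!("both calls to foo completed; no deadlock");
}
```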

Given the historical tire fire that is thread cancellation, it's remarkable that cancelling futures works as well as it does. The crucial difference is that Rust knows how to drop a future and clean up the resources it owns, particularly the lock guards. The OS can clean up a whole process when it exits, but until then it doesn't know which thread owns what.
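The cleanup in question is just Drop. A small std-only illustration (`drop_releases` is a name I've made up): we can watch the lock state with try_lock before and after dropping a guard, which is the same mechanism that frees a lock when a future holding its guard is dropped:

```rust
use std::sync::Mutex;

// Dropping a guard is what releases the lock; dropping a future likewise
// drops any guards it owns.
pub fn drop_releases() -> (bool, bool) {
    let lock = Mutex::new(());
    let guard = lock.lock().unwrap();
    let held = lock.try_lock().is_err(); // guard alive: lock still held
    drop(guard); // Drop runs here: the lock is released
    let released = lock.try_lock().is_ok();
    (held, released)
}

fn main() {
    assert_eq!(drop_releases(), (true, true));
}
```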

It's also possible to deadlock this version of foo if we pause the thread it's running on. The Windows docs warn us about this too: "Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread." The classic cause of these problems on Unix is signal handlers, which hijack a thread whenever they run. Again there's nothing foo can realistically do to protect itself from this, so the general rule is "Never pause a thread."

In contrast to cancellation, snoozing a future is no better than pausing a thread. Futurelock is a new spin on the old problems that SuspendThread and Unix signal handlers have always had: Normal application code touches locks constantly, like when we print, allocate memory, load dynamic libraries, or talk to DNS. If we freeze some "normal code", and we don't want to risk deadlocking with it, then we need to avoid touching any locks ourselves until we unfreeze it. That's doable in some very low-level, very unsafe contexts, but in "normal code" it's almost hopeless.

And yet that's what we're confronted with, implicitly, when we use select!-by-reference or buffered streams today. What can we do about that?

select!

Fine-grained cancellation in select! is what enables async Rust to be a zero-cost abstraction and to avoid the need to create either locks or actors all over the place.
- Niko Matsakis

Using select! with owned futures is usually fine, as long as we're ok with cancellation, because select! drops all its "scrutinee" futures promptly. Using select! with references is what we really need to avoid. Unfortunately, that's easier said than done.

Running each future on its own task with tokio::spawn is one way to prevent snoozing — like threads, tasks have a "life of their own" — but it comes with a 'static bound that clashes with any sort of borrowing. The moro crate provides a non-'static task spawning API similar to std::thread::scope, and it can solve many of these problems. I recommend it enthusiastically, and I'm surprised it isn't more widely used. But moro can't replace select! entirely. Niko Matsakis' "case study of pub-sub in mini-redis" discusses a case that only select! can handle: it macro-expands into a match, and different match arms are allowed to mutate the same variables, while concurrent tasks are not.

I have an experimental crate that aims to close this gap: join_me_maybe. It provides a join! macro with some select!-like features. Here's one way it can replace the select! loop above:

join_me_maybe::join!(
    foo(),

    maybe async {
        loop {
            tokio::time::sleep(Duration::from_millis(5)).await;
            foo().await;
        }
    }
);

Like other "join" patterns, this join! macro owns the futures that it polls, so there's no risk of snoozing anything. It needs some real-world feedback before I can recommend it for general use, but it can currently tackle both the original "Futurelock" select! and the select! that frustrated moro in mini-redis. There's a wide open design space for more concurrency patterns like this, and there's also room for new language features here that could give us even more borrow checker flexibility.

Streams

This method is cancel safe.
- .next()

"Cancel safety" isn't yet formally defined, but roughly speaking we say that an async function is cancel-safe if a cancelled call is guaranteed not to have any side effects.&ZeroWidthSpace; Deadlocks are certainly a side effect, and I think the definition of cancel safety needs to expand to include not snoozing any other futures. The .next() method on streams, as it's defined today both in futures and in tokio, is not generally cancel-safe in this expanded sense. That's how we produced the deadlock above with select! and next.

The other two stream deadlocks above, the ones using buffered and FuturesUnordered, are a separate problem. These examples don't cancel any calls to next. Instead, these streams hold pending futures internally, and they snooze those futures if anything else gets .awaited between calls to next. I don't have a smoking gun, but I bet this causes deadlocks in the wild today.

I see two possible solutions to this problem, and the Stream trait itself will ultimately need to pick one. The first possibility is that we keep next and declare that gaps between calls to it are expected and allowed. In that case, buffered and FuturesUnordered would be unfixable, and we'd need to deprecate them. Alternatively, we could add a poll_progress method to the Stream trait and declare that anything that calls poll_next must also call poll_progress until it returns Ready. Most stream combinators could be adapted to follow that new rule, but next would be unfixable, and we'd need to deprecate it. That isn't an option today, because using next with while let is the standard way to loop over a stream, but it could work if/when Rust adds an async for loop that integrates with poll_progress.
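For concreteness, here's one possible shape for that second option. This is only my sketch of what the signature might look like, not a settled design, and the real proposal may differ; the `Empty` impl is just there to exercise the default method:

```rust
use std::pin::Pin;
use std::task::{Context, Poll, Waker};

// A sketch of a Stream trait with the proposed poll_progress method.
// Callers of poll_next would also be required to call poll_progress until
// it returns Ready, so internally-buffered futures are never snoozed.
pub trait Stream {
    type Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;

    // Drive any internal background work without yielding an item.
    fn poll_progress(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Ready(()) // default: no background work to drive
    }
}

// A trivial stream with no buffered work, to exercise the default.
struct Empty;

impl Stream for Empty {
    type Item = ();
    fn poll_next(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Option<()>> {
        Poll::Ready(None)
    }
}

pub fn default_progress_is_ready() -> bool {
    let mut e = Empty;
    let mut cx = Context::from_waker(Waker::noop());
    Pin::new(&mut e).poll_progress(&mut cx).is_ready()
}

fn main() {
    assert!(default_progress_is_ready());
}
```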

A general rule

The promise of Rust is that you don’t need to do this kind of non-local reasoning—that you can understand important behavior by looking at code directly around the behavior, then use the type system to scale that up to global correctness.
- Cancelling async Rust

Even if we like the suggestions above, what's the general rule here? For high-level application code, we need something that tools like Clippy can check automatically. I propose:

Don't pin things in async functions.

There's nothing wrong with pinning per se. It's a fundamental building block of async Rust, and we need it when we implement Future or Stream "by hand". But when we have to pin things in an async fn, it's usually because something is polling a future that it doesn't own. That's what's happening in the poll! and select! examples above, including the stream.next() case. Polling something we don't own and can't drop is a recipe for snoozing.

There are also plenty of Unpin futures out there that we can poll by reference without pinning, and there's no reason in principle that snoozing one of those couldn't hold a lock across an await point. I'm not aware of any real-world cases, but if we wanted to close that loophole proactively, we could consider an additional rule:
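Here's a small std-only illustration of that loophole (`poll_by_reference` is a name I've made up; `Waker::noop` needs Rust 1.85+). For an Unpin future, `&mut F` implements Future itself, so nothing in the types forces the caller to pin, keep polling, or drop the underlying future:

```rust
use std::future::{ready, Future};
use std::pin::Pin;
use std::task::{Context, Poll, Waker};

pub fn poll_by_reference() -> Poll<i32> {
    let mut future = ready(42);
    // `&mut F` implements Future when F: Future + Unpin, so we can poll
    // through the reference while the caller keeps ownership of `future`.
    let mut by_ref = &mut future;
    let mut cx = Context::from_waker(Waker::noop());
    // Nothing here obligates the caller to ever poll `future` again or
    // drop it: the same snoozing hazard, with no Pin<_> in sight.
    Pin::new(&mut by_ref).poll(&mut cx)
}

fn main() {
    assert_eq!(poll_by_reference(), Poll::Ready(42));
}
```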

Don't use a reference to a future as a future itself.

Rules like these might be enough to catch snoozing mistakes in high-level code, but we still have to assume that our helpers and combinators aren't snoozing futures internally. Buffered streams violate that assumption today, and I think they'll need incompatible changes to fix that.

In general, there probably isn't a simple, mechanical rule to prove that a Future or Stream implementation is snooze-free. We have to be careful when we write those. But I think we can live with that. Writing poll and poll_next functions is "advanced mode" async Rust. We don't often need to do it in application logic, and we don't need to teach it to beginners. When we're looking at these low-level bits in code review, we can just try our best to remember:

Never snooze a future.


Discussion threads on r/rust and lobste.rs.
