Async Python Is Secretly Deterministic

Original link: https://www.dbos.dev/blog/async-python-is-secretly-deterministic

Adding async support to a Python durable execution library is a challenge: durable workflows *must* be deterministic so that replay-based recovery works reliably, yet `asyncio`'s concurrency can introduce unpredictable execution ordering that blocks deterministic replay.

The key is understanding the `asyncio` event loop. Despite appearing concurrent, it is actually single-threaded, processing tasks from a queue in first-in, first-out (FIFO) order. While a task's *internal* execution is unpredictable, the *scheduling* of tasks created via `asyncio.gather` is deterministic.

To exploit this, the author implemented a `@Step()` decorator. Before any `await` call, the decorator assigns each workflow step a unique, sequentially increasing ID. This ensures steps are processed in a predictable order even when launched concurrently.

This approach permits concurrent execution *and* deterministic replay, which is essential for durable workflows. The takeaway: a deeper understanding of `asyncio`'s single-threaded nature simplifies reasoning about concurrency and makes it possible to build reliable concurrent systems.

## Deterministic Scheduling in Async Python

A Hacker News discussion highlighted a surprising property of Python's `asyncio` event loop: it is actually deterministic. Tasks started within the same synchronous or asynchronous function always begin executing in the order they were created, despite the inherent concurrency of async programming.

Although this behavior has been stable for nearly a decade, it is considered an implementation detail. Other event loops, such as `trio`, deliberately randomize startup order to discourage reliance on this property. There is concern that a future Python implementation (for example, one written in Rust) could break this determinism, causing unexpected problems for code that depends on it.

The benefit of this determinism is easier debugging, since results can be reproduced. However, it does not guarantee whole-program determinism, especially when I/O is involved: it simplifies internal logic but cannot control external factors such as network calls or file access. Despite these limitations, some consider this predictable scheduling valuable for building reliable concurrent systems.

## Original Article

When adding async support to our Python durable execution library, we ran into a fundamental challenge: durable workflows must be deterministic to enable replay-based recovery.

Making async Python workflows deterministic is difficult because they often run many steps concurrently. For example, a common pattern is to start many concurrent steps and use asyncio.gather to collect the results:
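The original snippet did not survive extraction; a minimal sketch of the pattern, where `fetch_data` stands in for a hypothetical I/O-bound step:

```python
import asyncio

async def fetch_data(i: int) -> str:
    # Placeholder for an I/O-bound step (network call, DB query, etc.)
    await asyncio.sleep(0.01)
    return f"result-{i}"

async def workflow() -> list[str]:
    # Launch many steps concurrently and collect their results.
    coros = [fetch_data(i) for i in range(5)]
    return await asyncio.gather(*coros)

results = asyncio.run(workflow())
print(results)  # ['result-0', 'result-1', 'result-2', 'result-3', 'result-4']
```

Note that `asyncio.gather` returns results in the order the coroutines were passed in, regardless of the order in which they complete.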

This is great for performance (assuming tasks are I/O-bound) as the workflow doesn’t have to wait for one step to complete before starting the next. But it’s not easy to order the workflow’s steps because those steps all run at the same time, with their executions overlapping, and they can complete in any order.

The problem is that concurrency introduces non-obvious step execution ordering. When multiple tasks run at the same time, the exact interleaving of their execution can vary. But during recovery, the workflow must be able to replay those steps deterministically, recovering completed steps from checkpoints and re-executing incomplete steps. This requires a well-defined step order that’s consistent across workflow executions.

So how do we get the best of both worlds? We want workflows that can execute steps concurrently, but still produce a deterministic execution order that can be replayed correctly during recovery. To make that possible, we need to better understand how the async Python event loop really works.

How Async Python Works

At the core of async Python is an event loop. Essentially, this is a single thread running a scheduler that executes a queue of tasks. When you call an async function, it doesn’t actually run; instead it creates a “coroutine,” a frozen function call that does not execute. To actually run an async function, you have to either await it directly (which immediately executes it, precluding concurrency) or create an async task for it (using asyncio.create_task or asyncio.gather), which schedules it on the event loop’s queue. The most common way to run many async functions concurrently is asyncio.gather, which takes in a list of coroutines, schedules each as a task, then waits for them all to complete.
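The coroutine/task distinction can be seen directly. This sketch (with an illustrative `greet` function) shows that calling an async function creates an inert coroutine, that creating a task schedules it without running it, and that it only executes once control is yielded back to the event loop:

```python
import asyncio

async def greet() -> str:
    return "hello"

async def main():
    coro = greet()                     # Just builds a coroutine object; nothing runs
    created_type = type(coro).__name__ # 'coroutine'
    task = asyncio.create_task(coro)   # Wrapped in a Task: scheduled, but not yet run
    ran_immediately = task.done()      # False: the event loop hasn't regained control
    result = await task                # Yield control; now the task actually executes
    return created_type, ran_immediately, result

created_type, ran_immediately, result = asyncio.run(main())
print(created_type, ran_immediately, result)  # coroutine False hello
```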

Even after you schedule an async function by creating a task for it, it still doesn’t execute immediately. That’s because the event loop is single-threaded: it can only run one task at a time. For a new task to run, the current task has to yield control back to the event loop by calling await on something that isn’t ready yet. As tasks yield control, the event loop scheduler works its way through the queue, running each task sequentially until it itself yields control. When an awaited operation completes, the task awaiting it is placed back in the queue to resume where it left off.

Critically, the event loop schedules newly created tasks in FIFO order. Let’s say a list of coroutines is passed into asyncio.gather as in the code snippet above. asyncio.gather wraps each coroutine in a task, scheduling them for execution, then yields control back to the event loop. The event loop then dequeues the task created from the first coroutine passed into asyncio.gather and runs it until it yields control. Then, the event loop dequeues the second task, then the third, and so on. The order of execution after that is completely unpredictable and depends on what the tasks are actually doing, but tasks start in a deterministic order:
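A small demonstration of this (the step bodies and sleep durations are illustrative): even though later tasks here finish first, each task's pre-`await` code runs in the exact order the coroutines were passed to `asyncio.gather`:

```python
import asyncio

start_order: list[int] = []

async def step(i: int) -> int:
    start_order.append(i)                # Runs before the first await: records start order
    await asyncio.sleep(0.01 * (5 - i))  # Later steps sleep less, so they finish sooner
    return i

async def main() -> list[int]:
    return await asyncio.gather(*(step(i) for i in range(5)))

results = asyncio.run(main())
print(start_order)  # [0, 1, 2, 3, 4] -- tasks always *start* in FIFO order
print(results)      # [0, 1, 2, 3, 4] -- gather also preserves input order in its results
```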

This makes it possible to deterministically order steps using code placed before the step’s first await. We can do this in the @Step() decorator, which wraps step execution. Before doing anything else, and in particular anything that might require an await, @Step() increments and assigns a step ID from the workflow context. This way, step IDs are deterministically assigned in the exact order steps are passed into asyncio.gather. This guarantees that the step processed by the first task is step one, the step processed by the second task is step two, and so on.
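A simplified sketch of the idea (this is not the library's actual implementation; `WorkflowContext`, `do_work`, and the bookkeeping dict are illustrative). The ID assignment happens synchronously, before the wrapped step's first await, so it runs in FIFO task-start order:

```python
import asyncio
import functools
import itertools

class WorkflowContext:
    """Hypothetical workflow context holding a monotonic step ID counter."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next_step_id(self) -> int:
        return next(self._counter)

ctx = WorkflowContext()
assigned: dict[str, int] = {}  # Records which step got which ID, for inspection

def Step():
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            # Assign the ID *before* the first await: this code runs
            # synchronously when the task first starts, in FIFO order.
            step_id = ctx.next_step_id()
            assigned[f"{func.__name__}-{args[0]}"] = step_id
            return await func(*args, **kwargs)
        return wrapper
    return decorator

@Step()
async def do_work(i: int) -> int:
    await asyncio.sleep(0.01)
    return i

async def main():
    await asyncio.gather(*(do_work(i) for i in range(3)))

asyncio.run(main())
print(assigned)  # {'do_work-0': 1, 'do_work-1': 2, 'do_work-2': 3}
```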

To sum it up, when building Python libraries, it’s really important to understand the subtleties of asyncio and the event loop. While it might seem unintuitive at first, the single-threaded execution model is actually easier to reason about than parallel threads because tasks execute predictably and can only interleave when control is explicitly yielded via await. This makes it possible to write simple code that’s both concurrent and safe.

Learn More

If you like making systems reliable, we’d love to hear from you. At DBOS, our goal is to make durable workflows as easy to work with as possible. Check it out:
