How to think about durable execution

Original link: https://hatchet.run/blog/durable-execution

## Durable execution: article summary

The author (previously CTO of Porter) initially didn't understand its value, but now strongly advocates durable execution as a powerful approach to complex, stateful workflows. Traditional task queues handle background processing and typically rely on a message broker for durability. When tasks grow complex, however, involving many steps, possible failures, and external dependencies, a plain queue falls short.

Durable execution addresses this by persisting the workflow's *intermediate state*, so that after a failure (such as a machine crash) execution can resume from a checkpoint without losing progress. This is implemented with an append-only log, enabling deterministic retries and exactly-once handling of subtasks.

While durable execution brings real benefits (resilience to unexpected failures, programmatic workflow definitions, and skipping redundant work), it is not a universal solution. It requires deterministic workflows (the logic must not change mid-execution) and introduces overhead. It is best suited to cases where managing complex state and guaranteeing completion matter most, and it complements rather than replaces traditional queueing systems. The author's company, Hatchet, offers a durable execution platform built on these principles.

## Durable execution: discussion summary

This Hacker News discussion centers on **durable execution**, a concept for building resilient, long-running asynchronous workflows. Despite the promise, many commenters struggle to understand its benefits and whether it fits their workloads.

A key takeaway is that durable execution does not remove the need for **idempotency**: workflows still require tasks that can be re-executed without unintended side effects. It does, however, simplify managing complex failure scenarios, especially in distributed systems where coordinating rollbacks across multiple services is hard.

Platforms such as **Temporal, Hatchet, and DBOS** aim to provide a structured way to build these workflows, with features like streaming, dynamic scheduling, and better logging. Some of them, DBOS for example, integrate directly with an existing database (Postgres) for durability.

The core value is a consistent framework for handling failures and persisting higher-level logic (loops, try/catch), so that workflows can resume where they left off. There is debate over whether to *force* developers into durable workflows or to offer more flexible abstractions. The end goal is making reliable asynchronous systems easier to build, but it demands careful attention to idempotency and the complexity it can introduce.

Original article

Back when I was CTO at Porter, we were responsible for managing deployments of hundreds of Kubernetes clusters into customer AWS accounts. In the early days, this was all built using a single-replica deployment called porter-provisioner, which was essentially a Go binary automating a bunch of Terraform and Helm commands. This was circa 2020, and one of our users asked if we had considered Temporal for this workload. The next day I opened up the Temporal docs, read for about 30 minutes, and came away with: nothing. For some reason, I couldn't wrap my head around how durable execution worked, or why it was relevant to our workload.

I'll avoid talking to vendors at all costs, so I decided to table my research for the time being and kept plugging away at building our own in-house orchestration system. Over the next few years of working on this system - writing a second and third version of the provisioner - I kept returning to Temporal, because I sensed that it should be the right fit for my workload. But I still didn't get it. I understood the benefits that they were selling, but I'm a bottom-up thinker; I need to understand how something works under the hood before I can use it.

After I left Porter, I spent a while researching durable execution platforms, starting out by building my own open-source Terraform automation project using Temporal (since abandoned), introduced Temporal as the CTO of a different company, and eventually started Hatchet.

Here's the post that I wish I read at the time: an in-depth look at durable execution, and how it relates to task queues, message brokers, and other orchestrators.

The task queue

We're going to start by introducing the traditional task queue. While most apps start with a REST API, there comes a time when work is either too long-running or resource intensive to run inside of an API handler. At Porter, this was needed on Day 1, when we needed to provision infrastructure for customers; since AWS can take over 30 minutes (!!) to provision a new EKS cluster or database, keeping a handler alive for that amount of time wasn't an option.

At this point, you typically introduce the task queue: a system which ensures that background tasks get scheduled and run on a separate set of workers. Task queues have varying properties: Durability at the messaging layer (for retries), durability at the task layer (for replays), exactly-once processing, the list goes on. Your architecture may look like this:

[diagram: task queue architecture]
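To make that split concrete, here is a minimal sketch using BullMQ, a Redis-backed task queue for Node (an illustrative choice, not something the post prescribes); the queue name, payload, and provisionCluster helper are made up:

```typescript
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// API side: the handler only enqueues the task and returns immediately.
const provisioning = new Queue('provisioning', { connection });

export async function handleProvisionRequest(clusterId: string) {
  await provisioning.add('provision-cluster', { clusterId }, { attempts: 3 });
  return { status: 'accepted' }; // the long-running work happens elsewhere
}

// Worker side (a separate process): long-running work lives here,
// not inside the API handler.
new Worker(
  'provisioning',
  async (job) => {
    await provisionCluster(job.data.clusterId); // hypothetical helper
  },
  { connection },
);

declare function provisionCluster(clusterId: string): Promise<void>;
```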

The message broker

It would be ill-advised to pass messages directly from your application to your workers without an intermediate persistence layer, or else you risk losing messages when your workers crash. So while a task queue could in theory be implemented using only a protocol like REST or gRPC (and yes, the Porter MVP invoked tasks using this mechanism), a task queue typically utilizes a message broker, which can also have varying properties: Durability, message retries, dead-letter queues, etc.

[diagram: task queue with a message broker]

Redis, RabbitMQ, and database-as-queue are popular options for the persistence layer.
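As an example of the database-as-queue flavor, here is a sketch of the common Postgres pattern built on FOR UPDATE SKIP LOCKED; the tasks table and its columns are assumptions for illustration:

```typescript
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from PG* environment variables

// Claim one pending task. FOR UPDATE SKIP LOCKED lets many workers poll the
// same table without blocking each other, and an unfinished row survives a
// worker crash, so the task can be picked up again later.
async function claimNextTask(): Promise<{ id: string; payload: unknown } | null> {
  const { rows } = await pool.query(`
    UPDATE tasks
    SET status = 'running', started_at = now()
    WHERE id = (
      SELECT id FROM tasks
      WHERE status = 'queued'
      ORDER BY enqueued_at
      FOR UPDATE SKIP LOCKED
      LIMIT 1
    )
    RETURNING id, payload
  `);
  return rows[0] ?? null;
}
```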

Simple tasks

Before proceeding, I'd like to introduce the term idempotency: in this context, it means that if you invoke the same task twice, it has the same effect as calling it once.
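A small, purely illustrative contrast (the invoices table is made up): an increment is not idempotent, while writing an absolute value is.

```typescript
import { Client } from 'pg';

// Not idempotent: if a retry runs this twice, the charge is double-counted.
async function recordChargeNaively(db: Client, invoiceId: string, amount: number) {
  await db.query(
    'UPDATE invoices SET amount_charged = amount_charged + $1 WHERE id = $2',
    [amount, invoiceId],
  );
}

// Idempotent: running this once or five times leaves the row in the same state.
async function markInvoiceCharged(db: Client, invoiceId: string) {
  await db.query("UPDATE invoices SET status = 'charged' WHERE id = $1", [invoiceId]);
}
```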

Ok, so we have a simple task queue in place; we enqueue tasks from our API, and we process them on our workers. And this next part is crucial: if your task is easily idempotent, and can update application state via a single database transaction, then this is likely all you need.

By easily idempotent, I simply mean either the task is stateless (inherently idempotent) or you can make the task idempotent with a minimal change.

For example, if you are processing a file upload, and your task consists of 2 steps:

  1. Call out to an external API to process the file upload, and
  2. If the file is not already marked as processed in your database, update its status to PROCESSED

…then you have very few failure modes. If your worker crashes while processing your task, and you have configured your message broker to resend the task to a new worker on failure or visibility timeout, you will end up with consistent application state.

[diagram: file upload task flow]
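Here is a sketch of that two-step task; the externalApi client and the files table are stand-ins, not details from the post:

```typescript
import { Client } from 'pg';

// Hypothetical external API client; only its call site matters here.
declare const externalApi: { processFile(fileId: string): Promise<void> };

// A sketch of the two-step upload task described above.
async function processFileUpload(db: Client, fileId: string) {
  // Step 1: hand the file to the external API. If we crash after this call,
  // re-running it is assumed to be safe.
  await externalApi.processFile(fileId);

  // Step 2: flip the status only if it hasn't already been marked processed,
  // so a redelivered task leaves the row unchanged.
  await db.query(
    "UPDATE files SET status = 'PROCESSED' WHERE id = $1 AND status != 'PROCESSED'",
    [fileId],
  );
}
```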

Complex tasks

But what happens when tasks aren't so simple? Consider a model of our workload at Porter:

  1. Create a new database entry to represent the infrastructure deployment
  2. Call the AWS API to start provisioning a new AWS resource
  3. Poll the AWS API for results
  4. Update the database entry which represents the infrastructure deployment
  5. (repeat 2-4 for 100s of AWS resources, over at most ~1 hour)

This task is not easily idempotent; it involves writing a ton of intermediate state and queries to determine that a step should not be repeated, which gets more difficult because the execution graph can change in between runs (it's also not possible to ensure that the two sources of state — the AWS resources and the deployment status — will ever be fully consistent, but that's a different story).
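To see why, here is a rough model (not the actual Porter code) of the bookkeeping involved in making this workload idempotent by hand, with a made-up deployment_resources table tracking each resource's progress:

```typescript
import { Client } from 'pg';

// Hypothetical shapes and helpers, only to illustrate the bookkeeping involved.
type ResourceSpec = { name: string };
declare function startProvisioning(spec: ResourceSpec): Promise<string>; // returns an AWS resource id
declare function pollUntilReady(awsId: string): Promise<void>;

// "Manual" idempotency: every step consults persisted state before acting, so
// a retry after a crash does not provision the same resource twice.
async function provisionDeployment(db: Client, deploymentId: string, resources: ResourceSpec[]) {
  for (const spec of resources) {
    const { rows } = await db.query(
      'SELECT status, aws_id FROM deployment_resources WHERE deployment_id = $1 AND name = $2',
      [deploymentId, spec.name],
    );
    if (rows[0]?.status === 'READY') continue; // a previous run already finished this one

    // Re-attach to an in-flight resource if we recorded its id, otherwise start a new one.
    const awsId: string = rows[0]?.aws_id ?? (await startProvisioning(spec));
    await db.query(
      `INSERT INTO deployment_resources (deployment_id, name, aws_id, status)
       VALUES ($1, $2, $3, 'PROVISIONING')
       ON CONFLICT (deployment_id, name) DO UPDATE SET aws_id = EXCLUDED.aws_id`,
      [deploymentId, spec.name, awsId],
    );

    await pollUntilReady(awsId);
    await db.query(
      `UPDATE deployment_resources SET status = 'READY' WHERE deployment_id = $1 AND name = $2`,
      [deploymentId, spec.name],
    );
  }
}
```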

There are other reasons for not being able to model your workload as a single idempotent task running on a single worker, such as:

  1. One part of your task is very resource-intensive, and needs to run on a different machine from the other parts of your task. You'd like to split the task into smaller, more manageable subtasks.
  2. Your task needs to write data to an external source, and cannot do so in a way that's easily idempotent. Typically this is seen when the task has side effects which are stateful. A modern example of this would be a mutating KV cache for an agent's context.
  3. Perhaps you have 50 different microservices, each one responsible for a different part of the system. Your task might need to call different microservices, all using potentially separate databases or underlying state stores.

This is typically where you start to explore an orchestration engine or a workflow engine, to bring some order and visibility into what is becoming a very chaotic, distributed system, and in particular to reduce the number of failure modes that can occur in this system.

And we finally get to the motivation of durable execution: it is just one approach to orchestrating and managing the state of a set of interrelated tasks (frequently called a workflow). There are many others: Systems which implement DAGs, Celery's chains and chords, event-sourcing, homegrown state machines, each targeting slightly different problems.

The specific solution provided by durable execution platforms is to persist the intermediate state of the workflow in a separate data store, and enable the workflow to start from this “checkpointed” state on subsequent runs.

Durable execution

There are a few ways you could persist this intermediate state. The solution utilized by what I call external durable execution state stores — like Temporal and Hatchet — is to persist each workflow's incremental state in an external system, and provide the workflow with the ability to skip any subtasks (or more generally, any actions persisted to the durable event log) which have been previously called by replaying the durable event log on a retry.
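Stripped to its core, the mechanism looks something like the following sketch; this is a model of the idea, not any vendor's actual API:

```typescript
// Completed steps are appended to a per-workflow event log, and on a retry the
// wrapper returns the recorded result instead of running the step again.
type EventLog = {
  load(step: string): Promise<{ result: unknown } | undefined>;
  save(step: string, result: unknown): Promise<void>;
};

function makeDurableContext(log: EventLog) {
  return {
    async step<T>(name: string, fn: () => Promise<T>): Promise<T> {
      const recorded = await log.load(name);
      if (recorded) {
        return recorded.result as T; // this step already ran on a previous attempt: skip it
      }
      const result = await fn();
      await log.save(name, result); // checkpoint before moving on
      return result;
    },
  };
}
```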

Let's see how this would work for a simple, two-step function which crashes halfway through. The function could look something like this:

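As a sketch, reusing the hypothetical ctx.step wrapper from the previous section rather than any particular SDK's API:

```typescript
// Hypothetical business functions; only their call sites matter here.
declare function chargePayment(orderId: string): Promise<{ chargeId: string }>;
declare function sendConfirmationEmail(orderId: string, charge: { chargeId: string }): Promise<void>;

type DurableContext = { step<T>(name: string, fn: () => Promise<T>): Promise<T> };

async function processOrder(ctx: DurableContext, orderId: string) {
  // Step 1: charge the customer. The result is checkpointed to the durable
  // event log as soon as it succeeds.
  const charge = await ctx.step('chargePayment', () => chargePayment(orderId));

  // Step 2: send the confirmation email. If the process dies between the two
  // steps, a retry replays the log, sees that chargePayment already ran, and
  // resumes here without charging the customer twice.
  await ctx.step('sendConfirmationEmail', () => sendConfirmationEmail(orderId, charge));
}
```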

On the first execution, let's imagine that the machine the workflow is running on hits an OOM - this will forcefully terminate the process, so there's no possibility of recovery. In this scenario, this OOM occurs after we have run the chargePayment step, but before we have run sendConfirmationEmail.

On the second execution, the workflow can automatically skip the first step, and move directly to the second step, as the execution history shows that chargePayment has already run.
