Durable execution, the hard way

Original link: https://github.com/hatchet-dev/durable-execution-the-hard-way

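Inspired by Kubernetes the Hard Way, this guide takes a hands-on, step-by-step approach to building a durable execution engine from scratch using Go and PostgreSQL. Durable execution lets long-running, stateful processes (such as AI agents) checkpoint their progress and recover from failures, similar to systems like Temporal or Hatchet.

The course is designed for developers who are familiar with backend fundamentals and SQL and who want to understand the core mechanics of workflow engines. It is modular: each lesson includes code, a database schema, and queries generated by `sqlc`. The project focuses on building a functional foundation rather than a polished developer experience. The main features implemented are:

  • A Postgres-backed task queue.
  • Durable tasks versus regular tasks.
  • Retries, replays, and execution forking.

The guide is deliberately minimal. It aims to give a transparent view of the architecture rather than abstracting it behind a full client SDK. Whether you want to build your own engine or simply demystify how existing platforms work, these lessons offer a practical entry point. The author welcomes contributions and feedback, and even offers a reward for finding logic errors in the code.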

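This Hacker News thread discusses the implementation of durable execution systems and cites Hatchet's recent blog post on the topic. Commenters highlight how systems evolve from simple monoliths into robust, reliable distributed systems, and note that a durable workflow engine is a key component of the latter.

For teams that do not yet need a full-featured engine, user *liampulles* proposes an "80/20" approach to reliability:

  1. **Idempotency:** Ensure workflows can be safely restarted and re-run from any point of failure.
  2. **Persistence:** Store trigger messages to support auditability.
  3. **Observability:** Track failures through existing log aggregation.

By combining these practices, developers can manage workflow reliability with simple scripts (manually requeueing failed messages) before committing to a complex, dedicated durable execution framework. For more background on the ideas behind durable execution, see the related discussion from December 2025.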
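The three practices from the thread can be sketched in a few lines of Go. This is only an illustration under stated assumptions: `Store`, `Message`, `Handle`, `RequeueFailed`, and `FlakyWorkflow` are hypothetical names, the store is in memory rather than in a database, and the failure list stands in for whatever a log aggregator would surface.

```go
package main

import "fmt"

// Message is a hypothetical trigger message with a unique ID.
type Message struct {
	ID      string
	Payload string
}

// Store is an in-memory stand-in for persistent storage (an assumption;
// a real setup would write these to a database).
type Store struct {
	messages map[string]Message // stored trigger messages, for auditability
	done     map[string]bool    // IDs whose workflow completed
	failed   []string           // IDs to requeue manually
}

func NewStore() *Store {
	return &Store{messages: map[string]Message{}, done: map[string]bool{}}
}

// Done reports whether the workflow for the given message ID completed.
func (s *Store) Done(id string) bool { return s.done[id] }

// Handle stores the message, skips work that already completed, and records
// failures. The workflow function must itself be safe to re-run.
func (s *Store) Handle(m Message, workflow func(Message) error) error {
	if _, ok := s.messages[m.ID]; !ok {
		s.messages[m.ID] = m
	}
	if s.done[m.ID] {
		return nil // already processed: re-delivery is a no-op
	}
	if err := workflow(m); err != nil {
		s.failed = append(s.failed, m.ID)
		return err
	}
	s.done[m.ID] = true
	return nil
}

// RequeueFailed re-runs every failed message from its stored copy.
func (s *Store) RequeueFailed(workflow func(Message) error) {
	pending := s.failed
	s.failed = nil
	for _, id := range pending {
		_ = s.Handle(s.messages[id], workflow) // failures are re-recorded by Handle
	}
}

// FlakyWorkflow returns a workflow that fails its first n calls.
func FlakyWorkflow(n int) func(Message) error {
	calls := 0
	return func(m Message) error {
		calls++
		if calls <= n {
			return fmt.Errorf("simulated failure processing %s", m.ID)
		}
		return nil
	}
}

func main() {
	s := NewStore()
	wf := FlakyWorkflow(1)
	fmt.Println("first run:", s.Handle(Message{ID: "m1", Payload: "hello"}, wf))
	s.RequeueFailed(wf)
	fmt.Println("done after requeue:", s.Done("m1"))
}
```

The point of the sketch is that requeueing is only safe because `Handle` tolerates seeing the same message twice; without that property, a manual requeue script could repeat side effects.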

Original text

Inspired by Kelsey Hightower's Kubernetes the hard way, we're going to build a durable execution engine from scratch using Go and Postgres.

Durable execution is a mechanism to incrementally checkpoint the state of a function as it makes progress, so that in the case of unexpected failure, the function can recover from where it left off. It's particularly relevant in newer stacks and projects implementing AI agents, which are long-running and stateful. A system which implements durable execution is often called a "workflow engine."
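To make the checkpointing idea concrete, here is a minimal in-memory sketch. It is not the implementation built in the lessons: `EventLog`, `Run`, and `Workflow` are hypothetical names, and a real engine would persist the log (this guide uses Postgres) instead of holding it in a slice.

```go
package main

import (
	"errors"
	"fmt"
)

// EventLog is a hypothetical in-memory stand-in for a durable event log.
type EventLog struct {
	results []string // results[i] is the checkpointed output of step i
}

// Run returns the checkpointed result for step i if one exists. Otherwise it
// executes fn and, on success, records the output as a new checkpoint.
// Steps are assumed to be called in order: 0, 1, 2, ...
func (l *EventLog) Run(i int, fn func() (string, error)) (string, error) {
	if i < len(l.results) {
		return l.results[i], nil // recovered from the checkpoint, fn is skipped
	}
	out, err := fn()
	if err != nil {
		return "", err // nothing is recorded for a failed step
	}
	l.results = append(l.results, out)
	return out, nil
}

// Workflow runs two steps. calls counts how often each step body really
// executes; failSecond simulates a crash in the second step.
func Workflow(l *EventLog, calls *[2]int, failSecond bool) (string, error) {
	a, err := l.Run(0, func() (string, error) {
		calls[0]++
		return "fetched", nil
	})
	if err != nil {
		return "", err
	}
	b, err := l.Run(1, func() (string, error) {
		calls[1]++
		if failSecond {
			return "", errors.New("simulated crash")
		}
		return "summarized", nil
	})
	if err != nil {
		return "", err
	}
	return a + "+" + b, nil
}

func main() {
	el := &EventLog{}
	var calls [2]int
	_, err := Workflow(el, &calls, true)
	fmt.Println("first attempt error:", err)
	out, _ := Workflow(el, &calls, false)
	fmt.Println(out, calls)
}
```

On the second attempt the first step's body does not run again: its result comes from the log, and only the step that failed is re-executed. That is the sense in which the function "recovers from where it left off".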

This guide uses Go and templated SQL using sqlc. The only dependencies are:

  • Go 1.25+
  • Postgres (by default, created via Docker)
  • pgx

If you are interested in contributing support for other languages, please create a GitHub issue. I'll be sharing updates (new lessons, other languages) for this guide on Twitter if you'd like to follow along.

You will benefit from this guide if you:

  • Want to understand how durable execution engines like Hatchet and Temporal work at a deeper level
  • Are implementing your own workflow engine and would like a simple starting point for your architecture

This guide expects that you understand the foundations of SQL databases, can read code, and are familiar with some minimal backend engineering concepts, such as queues. More advanced terminology will be introduced in each lesson.

For a motivating guide on durable execution, see the blog post How to think about durable execution.

Each directory in /lessons is set up with an identical structure:

  • A README.md file for navigating the lesson
  • A main.go file for running the example code produced by the lesson, which can be run via `go run .`
  • A sql directory which contains a schema.sql file, a queries.sql file, and some files for generating templated queries via sqlc

By the final lesson, we'll have a minimal but fully-working workflow engine. Note that these lessons are not focused on developer ergonomics: we'll be building the bare minimum to understand the fundamentals, but won't implement the typical niceties you'd see in a client SDK.

  1. Prerequisites
  2. Simple task queue
  3. Limiting concurrent tasks
  4. Task queue improvements
  5. Durable event log
  6. Tracking non-determinism
  7. Durable tasks

This guide is a somewhat opinionated view on durable execution. Specifically, it implements:

  • Durable execution entirely in Postgres.
  • Two types of functions: durable tasks and regular tasks. These map directly to durable tasks and tasks in Hatchet, and are akin to Temporal workflows and activities.
  • Regular tasks invokable as standalone tasks, meaning this guide implements a simple Postgres-backed task queue as well in the first few lessons.
  • Multiple types of retries and replays, which are treated as distinct:
    • Retries will retry a durable task without resetting the event history (preserving the execution state of the function)
    • Replays will reset a durable task's execution history to start from scratch
    • Forking will reset a durable task's execution history at a given point in the execution history, effectively creating a "fork" of that task. This will be the subject of a future lesson
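The difference between the three operations can be sketched as transformations on a task's event history. This is an illustrative model only: `DurableTask`, `Retry`, `Replay`, and `Fork` are hypothetical names, the attempt counter is an assumption, and so is the choice to have `Fork` return a new task with its own ID.

```go
package main

import "fmt"

// DurableTask is a hypothetical in-memory model of a durable task's state.
type DurableTask struct {
	ID      string
	Attempt int
	History []string // checkpointed events, oldest first
}

// Retry starts a new attempt and keeps a copy of the full event history,
// so steps that already completed do not need to run again.
func Retry(t DurableTask) DurableTask {
	return DurableTask{
		ID:      t.ID,
		Attempt: t.Attempt + 1,
		History: append([]string(nil), t.History...),
	}
}

// Replay discards the event history so the task starts from scratch.
// Resetting Attempt to 1 is an assumption of this sketch.
func Replay(t DurableTask) DurableTask {
	return DurableTask{ID: t.ID, Attempt: 1}
}

// Fork keeps only the first n events and continues from that point.
// Returning a new task with its own ID is an assumption of this sketch.
func Fork(t DurableTask, newID string, n int) (DurableTask, error) {
	if n < 0 || n > len(t.History) {
		return DurableTask{}, fmt.Errorf("fork point %d out of range [0, %d]", n, len(t.History))
	}
	return DurableTask{
		ID:      newID,
		Attempt: 1,
		History: append([]string(nil), t.History[:n]...),
	}, nil
}

func main() {
	task := DurableTask{ID: "task-1", Attempt: 1, History: []string{"step-a", "step-b", "step-c"}}
	fmt.Println(Retry(task))
	fmt.Println(Replay(task))
	forked, _ := Fork(task, "task-1-fork", 2)
	fmt.Println(forked)
}
```

Seen this way, a replay is a fork at point 0 and a retry is a fork at the end of the history, which is one reason to treat the three as distinct but related operations.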

You can modify the schema, queries, and code in each lesson to experiment. To regenerate the SQL files in each directory, run the following:

go run github.com/sqlc-dev/sqlc/cmd/sqlc generate --file sql/sqlc.yaml

If you discover an error in the core logic of any lesson, please file a GitHub issue. We'd be happy to reward you with a baked good from a bakery near you (yes, we're serious). If a bakery isn't available, we'd be happy to send you a Hatchet tee or hat. If you understandably don't want more vendor swag, you'll have my eternal gratitude.

AI has not been used to write any prose in this guide. All mistakes and turns of phrase are my own. AI has been used to:

  • Verify that each lesson of this guide is independently runnable and instructions are easy to follow
  • Generate mermaid diagrams

If there's sufficient interest, I'd be happy to put together additional lessons, such as:

  • Using Postgres LISTEN/NOTIFY to speed up processing significantly
  • Durable sleep
  • Branching and forking the durable event log