We chose OCaml to write Stategraph

Original link: https://stategraph.dev/blog/why-we-chose-ocaml

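## Stategraph & OCaml: infrastructure with guaranteed correctness

Stategraph, a Terraform state management system, is built in OCaml out of a pressing need for state corruption to be *impossible*, not merely unlikely. Managing infrastructure for other people demands a level of reliability that conventional testing struggles to reach. OCaml's strong static type system is the core of that reliability. It prevents errors such as accessing nonexistent data fields, incorrect type assignments, and database schema mismatches, and it checks for them at *compile time*. Type-safe SQL queries ensure that database changes are reflected in the code, and PPX generates correct JSON serialization automatically, removing the risk of data loss. OCaml's immutability by default eliminates race conditions, complemented by PostgreSQL's row-level locking. Exhaustive pattern matching enforces robust error handling, ensuring that every potential failure state is addressed. OCaml developers are scarce, but engineers with a solid grounding in systems and type systems adapt quickly and benefit from a stable codebase where time goes to feature development rather than to debugging preventable errors. Ultimately, Stategraph uses OCaml to move from *detecting* errors to *preventing* them, which yields a fundamentally more reliable infrastructure solution.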

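A Hacker News discussion about the Stategraph project (stategraph.dev) and its choice of OCaml. The original post laid out the reasons for choosing OCaml, but commenters questioned what exactly makes OCaml superior. Several users pointed out that many strongly typed languages, TypeScript among them, can deliver the same benefits. One commenter with experience managing large Terraform deployments questioned Stategraph's target audience and the problem it sets out to solve, noting that existing locking mechanisms and CI/CD pipelines already handle potential race conditions in Terraform runs. The core of the debate is whether the benefits cited are unique to OCaml or broadly available across modern programming languages. The thread also includes a Y Combinator application announcement.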

Original text

Tags: OCaml, Type Systems, Functional Programming, Infrastructure, Stategraph

Josh Pollara • November 6th, 2025

TL;DR

$ cat why-ocaml.tldr

• Stategraph manages Terraform state, so correctness isn't optional

• Strongly-typed data structures catch field errors at compile time

• Type-safe SQL queries prevent schema drift before deployment

• Immutability by default eliminates race conditions

• PPX generates correct JSON serialization automatically

We're building infrastructure that manages other people's infrastructure. State corruption can't be "rare." It has to be impossible. That's why we chose OCaml.

Stategraph stores Terraform state as a dependency graph in PostgreSQL with resource-level locking. The challenge isn't building a database-backed state store. The challenge is ensuring that concurrent operations from multiple users can never corrupt state, that database schema changes break the build instead of production, and that JSON transformations are correct.

We chose OCaml because its type system catches entire categories of bugs at compile time that would require extensive testing and still slip through in other languages.

Type-safe data structures

Here's a scenario every infrastructure engineer has seen. Two Terraform operations run concurrently and both read a resource in an active state. One updates it while the other destroys it. Without proper coordination, you risk marking the resource as destroyed in state while it's still being modified in the cloud.

Most systems handle this defensively with locks and runtime validation, but race conditions are hard to test and the resulting state corruption usually appears in production, not CI.

Stategraph tackles this in two ways. Immutability and database-level locking prevent concurrent writes from corrupting state, while OCaml's type system makes the underlying data structures themselves safer by construction. Resources, outputs, and instances are all defined as strongly-typed records, so you can't access a field that doesn't exist or mix up field types. The compiler enforces correctness before anything runs.

    type t = {
      lineage : string;
      outputs : Outputs.t option;
      resources : Resources.t;
      serial : int;
      terraform_version : string;
      version : int;
    }

If you try to access state.versions (typo) instead of state.version, you get a compiler error. If you try to assign a string to serial, you get a compiler error. If you forget to handle None in the outputs field, the exhaustiveness checker flags the incomplete match at compile time.
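Here is a minimal sketch of those three checks. It uses a simplified stand-in for the record above: the `outputs` field is a plain association list, and `describe` and `example` are hypothetical names for illustration, not Stategraph code.

```ocaml
(* Simplified stand-in for the state record; the field types are assumptions. *)
type state = {
  lineage : string;
  outputs : (string * string) list option;
  serial : int;
  version : int;
}

(* Hypothetical helper: both cases of the option are handled explicitly. *)
let describe (s : state) : string =
  match s.outputs with
  | None -> Printf.sprintf "serial %d, no outputs" s.serial
  | Some outs ->
      Printf.sprintf "serial %d, %d output(s)" s.serial (List.length outs)

let example = { lineage = "abc"; outputs = None; serial = 7; version = 4 }

(* Each of these is rejected by the type checker if uncommented:
   let _ = example.versions               (* unbound record field *)
   let _ = { example with serial = "7" }  (* string where int is expected *)
*)
```

Dropping the `None` branch from `describe` produces a non-exhaustive match diagnostic. In plain `ocamlc` that is a warning, and many build setups promote it to an error.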

This extends throughout the codebase. Every Terraform resource type, every state transition, and every database record is strongly typed. The compiler catches entire categories of bugs at compile time, like accessing non-existent fields, missing null checks, or database schema mismatches.

The database schema drift problem

You're iterating on your database schema by renaming a column, changing a type, or adding a constraint. In most languages, you update the schema, deploy the migration, and hope you caught all the queries that reference the old structure. You didn't: a query somewhere still references the old column name. It works in dev against the old schema but crashes in staging against the new one.

Stategraph uses typed SQL where every query declares explicit types for its parameters and return values. When you change a query's type signature, every call site in the codebase must be updated to match, and the compiler enforces this.

    let insert_resource_sql () =
      Pgsql_io.Typed_sql.(
        sql
        // Ret.bigint
        /^ "INSERT INTO resources (state_id, mode, type, name,
            provider_id, module_) VALUES ($state_id, $mode,
            $type, $name, $provider_id, $module_) RETURNING id"
        /% Var.uuid "state_id"
        /% Var.text "mode"
        /% Var.text "type"
        /% Var.text "name"
        /% Var.uuid "provider_id"
        /% (Var.option @@ Var.text "module_"))

This query expects specific types. The state_id must be a UUID, mode must be text, and module_ is optional text. The return value is typed as bigint. If you try to pass a string where a UUID is expected, you get a compiler error. If you forget to handle the optional return value, you get a compiler error.

When you update a query to match a new schema, the type system ensures every place that calls that query gets updated too. You can't deploy code where query definitions and their usage are out of sync.
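To make the mechanism concrete, here is a self-contained sketch of how a query value can carry its parameter types so that call sites must match. This is not the `Pgsql_io.Typed_sql` implementation. The names `var`, `query`, `render`, `collect`, and `params` are all hypothetical, and the sketch only collects arguments as strings instead of talking to a database.

```ocaml
(* A distinct type for UUIDs, so a plain string cannot be passed by mistake. *)
type uuid = Uuid of string

(* Each parameter descriptor is indexed by the OCaml type it accepts. *)
type _ var =
  | Var_text : string -> string var
  | Var_uuid : string -> uuid var
  | Var_option : 'a var -> 'a option var

(* ('f, 'r) query: 'f is the function type a caller must supply arguments
   for; 'r is what remains once every declared parameter has been bound. *)
type (_, _) query =
  | Sql : string -> ('r, 'r) query
  | Bind : ('f, 'a -> 'r) query * 'a var -> ('f, 'r) query

(* Render one argument as text; a real driver would encode it for the wire. *)
let rec render : type a. a var -> a -> string =
 fun v x ->
  match v with
  | Var_text _ -> x
  | Var_uuid _ -> (match x with Uuid s -> s)
  | Var_option inner -> (
      match x with
      | None -> "NULL"
      | Some y -> render inner y)

(* Walk the query and build a function taking one argument per parameter. *)
let rec collect : type f r. (f, r) query -> (string list -> r) -> f =
 fun q k ->
  match q with
  | Sql _ -> k []
  | Bind (rest, v) -> collect rest (fun acc x -> k (acc @ [ render v x ]))

let params q = collect q (fun args -> args)

(* Inferred type: (uuid -> string -> string option -> 'r, 'r) query *)
let insert_resource =
  Bind
    ( Bind
        ( Bind
            ( Sql
                "INSERT INTO resources (state_id, mode, module_) VALUES \
                 ($state_id, $mode, $module_)",
              Var_uuid "state_id" ),
          Var_text "mode" ),
      Var_option (Var_text "module_") )

let args = params insert_resource (Uuid "3f2a") "managed" None

(* Rejected by the type checker: a string where a uuid is expected.
   let _ = params insert_resource "3f2a" "managed" None *)
```

Adding or retyping a parameter changes the type of `insert_resource`, so every existing call to `params insert_resource ...` stops compiling until it is updated.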

JSON transformations that can't lose data

Stategraph ingests Terraform state as JSON, normalizes it into a graph, stores it in PostgreSQL, and reconstructs it back to JSON when Terraform requests it. Every transformation is a place where data can get lost or corrupted, whether from a field you forgot to serialize, a nested structure you flattened incorrectly, or a type that doesn't round-trip.

Testing can catch some of this, and round-trip tests help, but you're fundamentally relying on test coverage. Missed cases show up when someone's Terraform state comes back missing a field.

OCaml has a metaprogramming mechanism called PPX (preprocessor extensions), and a PPX deriver can generate serialization code automatically. You define the type, annotate it, and the serializer is generated from the type definition.

    type aws_instance = {
      instance_id : string;
      instance_type : string;
      ami : string;
      availability_zone : string option;
      tags : (string * string) list;
    } [@@deriving yojson]

When you add a field, the serializer is regenerated. When you change a type, the serializer is regenerated. If you forget to handle a case, the exhaustiveness checker catches it at compile time. You don't write serialization tests because the type system guarantees serialization is correct.
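For readers who have not used a deriver, here is a hand-written approximation of the kind of code one produces from a type definition. It is a sketch under stated assumptions: the `json` type is a tiny local stand-in so the example runs without external libraries (real code would use Yojson's types), the record is a cut-down `instance`, and every function name is hypothetical.

```ocaml
(* Minimal local JSON type; an assumption made to keep the sketch standalone. *)
type json =
  | Str of string
  | Null
  | Obj of (string * json) list

type instance = {
  instance_id : string;
  availability_zone : string option;
}

(* Encoder: one entry per record field, in declaration order. *)
let instance_to_json (i : instance) : json =
  Obj
    [
      ("instance_id", Str i.instance_id);
      ( "availability_zone",
        match i.availability_zone with
        | None -> Null
        | Some z -> Str z );
    ]

(* Field decoders: every JSON shape is matched explicitly. *)
let string_field name fields : (string, string) result =
  match List.assoc_opt name fields with
  | Some (Str s) -> Ok s
  | Some (Null | Obj _) | None -> Error (name ^ ": expected a string")

let optional_string_field name fields : (string option, string) result =
  match List.assoc_opt name fields with
  | Some (Str s) -> Ok (Some s)
  | Some Null | None -> Ok None
  | Some (Obj _) -> Error (name ^ ": expected a string or null")

let instance_of_json (j : json) : (instance, string) result =
  match j with
  | Obj fields -> (
      match string_field "instance_id" fields with
      | Error e -> Error e
      | Ok id -> (
          match optional_string_field "availability_zone" fields with
          | Error e -> Error e
          | Ok az -> Ok { instance_id = id; availability_zone = az }))
  | Str _ | Null -> Error "expected an object"
```

Written by hand, the encoder and decoder can drift from the type. Generating both from the type definition removes that opportunity, which is the point of the `[@@deriving yojson]` annotation.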

This is how Stategraph handles Terraform's resource types. Every AWS resource, every GCP resource, every Azure resource is an OCaml type with automatic JSON serialization. We don't write serialization code. We don't test round-trips manually. The type system handles it.

Race conditions prevented by default

Terraform operations are inherently concurrent. Multiple users apply changes, CI pipelines run in parallel, and drift detection scans resources continuously. Coordinating all of this without data races requires careful mutex management and defensive programming, and it's easy to get wrong.

OCaml provides immutability by default, so you can't accidentally share mutable state between concurrent operations because there is no mutable state by default. When you want to modify something, you create a new version explicitly. This eliminates entire categories of race conditions.
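A small illustration of "create a new version explicitly", using a hypothetical two-field record and a hypothetical `bump` helper:

```ocaml
(* Fields are immutable unless declared with the `mutable` keyword. *)
type state = { serial : int; lineage : string }

(* Returns a new record; the argument is left untouched. *)
let bump (s : state) : state = { s with serial = s.serial + 1 }

let v1 = { serial = 1; lineage = "abc" }
let v2 = bump v1

(* Rejected by the compiler, because `serial` is not mutable:
   let () = v1.serial <- 2 *)
```

Any code still holding `v1` keeps seeing `serial = 1` after `bump` runs, so one operation cannot change the value another operation is reading.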

One operation can't corrupt another operation's view of state because state is immutable by default. When combined with PostgreSQL's row-level locking at the database layer, concurrent operations compose correctly without manual mutex management or defensive copying.

Error handling with discipline

Type safety is only half of what makes Stategraph robust. The other half is discipline in how we use those types.

We encode errors as variants and exhaustively match every case. We never use a catch-all wildcard pattern, the "else" clause that matches everything. When we add a new error to the system, the compiler tells us every place we aren't handling it. This is how robust systems are built. Systems can fail in far more ways than they succeed, and the compiler ensures we handle all of them.
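As a sketch of that style, with an invented error type (the constructor names and both functions are hypothetical, not Stategraph's actual errors):

```ocaml
(* Hypothetical error type for a state write. *)
type apply_error =
  | Lock_held of string (* id of the locked resource *)
  | Serial_conflict of int * int (* expected serial, actual serial *)
  | Invalid_json of string

(* No wildcard branch: every constructor is listed by name. *)
let to_message (e : apply_error) : string =
  match e with
  | Lock_held id -> "resource " ^ id ^ " is locked by another operation"
  | Serial_conflict (expected, actual) ->
      Printf.sprintf "expected serial %d, found %d" expected actual
  | Invalid_json detail -> "invalid state JSON: " ^ detail

(* Failures are returned as values, so callers must unpack Ok or Error. *)
let check_serial ~expected ~actual : (unit, apply_error) result =
  if expected = actual then Ok ()
  else Error (Serial_conflict (expected, actual))
```

Because `to_message` has no `_` branch, adding a fourth constructor to `apply_error` makes the match non-exhaustive, and the compiler points at it. With a wildcard branch, the new case would be absorbed silently.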

This discipline extends throughout the codebase. Every error case is explicit. Every state transition is enumerated. Every optional value is handled. The type system gives us the tools, but discipline is what turns those tools into reliability.

The difference in practice

The same categories of bugs. Different places to catch them.

Production systems that can't afford bugs

This isn't academic type theory. Production systems use OCaml for exactly this reason.

At Terrateam, we process thousands of concurrent Terraform operations daily, managing infrastructure for hundreds of organizations where a state corruption bug would cascade across every customer. We're built on OCaml, and the type system catches bugs at compile time that would be production incidents in other languages.

Jane Street trades billions daily on OCaml infrastructure. Their trading systems handle concurrent market data and execute trades with zero tolerance for race conditions or undefined behavior. They chose OCaml because correctness isn't optional.

Pattern Recognition

Systems that absolutely cannot fail choose languages where certain failures are impossible, not just unlikely. Testing finds bugs, but types prevent entire categories of bugs from existing.

But who knows OCaml?

This is the most common objection, and it's true: OCaml developers are rare.

But here's what we've found. Engineers who understand distributed systems, type systems, and correctness learn OCaml quickly. The learning curve from Rust, Haskell, or even TypeScript with advanced types is gentler than you'd expect because the concepts transfer even if the syntax is unfamiliar.

More importantly, OCaml codebases are stable. We're not debugging race conditions or chasing down production crashes from schema drift. We're not writing extensive test suites for serialization edge cases. We're building features while the type system handles the categories of bugs that would otherwise consume engineering time.

When you encode correctness in types, maintenance gets easier instead of harder. New engineers spend less time understanding implicit invariants and more time writing code the compiler verifies.

Correctness as a feature

We're building Stategraph to manage Terraform state for infrastructure that runs production applications. State corruption has to be impossible instead of unlikely. Invalid state transitions need to be prevented by the compiler instead of caught by tests. Schema drift needs to break the build instead of production.

That's what OCaml gives us. It provides a type system that makes entire categories of bugs impossible instead of just unlikely. The compiler proves properties about our code that testing can only approximate.

OCaml's compile-time guarantees are why we use it to build Stategraph.