Tags: OCaml, Type Systems, Functional Programming, Infrastructure, Stategraph
$ cat why-ocaml.tldr
• Stategraph manages Terraform state, so correctness isn't optional
• Strongly-typed data structures catch field errors at compile time
• Type-safe SQL queries prevent schema drift before deployment
• Immutability by default eliminates race conditions
• PPX generates correct JSON serialization automatically
We're building infrastructure that manages other people's infrastructure. State corruption can't be "rare." It has to be impossible. That's why we chose OCaml.
Stategraph stores Terraform state as a dependency graph in PostgreSQL with resource-level locking. The challenge isn't building a database-backed state store. The challenge is ensuring that concurrent operations and users can never corrupt state, that database schema changes break the build instead of production, and that JSON transformations are correct.
We chose OCaml because its type system catches entire categories of bugs at compile time that would require extensive testing and still slip through in other languages.
Type-safe data structures
Here's a scenario every infrastructure engineer has seen. Two Terraform operations run concurrently and both read a resource in an active state. One updates it while the other destroys it. Without proper coordination, you risk marking the resource as destroyed in state while it's still being modified in the cloud.
Most systems handle this defensively with locks and runtime validation, but race conditions are hard to test and the resulting state corruption usually appears in production, not CI.
Stategraph tackles this in two ways. Immutability and database-level locking prevent concurrent writes from corrupting state, while OCaml's type system makes the underlying data structures themselves safer by construction. Resources, outputs, and instances are all defined as strongly-typed records, so you can't access a field that doesn't exist or mix up field types. The compiler enforces correctness before anything runs.
type t = {
  lineage : string;
  outputs : Outputs.t option;
  resources : Resources.t;
  serial : int;
  terraform_version : string;
  version : int;
}
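If you try to access state.versions (typo) instead of state.version, you get a compiler error. If you try to assign a string to serial, you get a compiler error. If you use the outputs field as though it were always present, you get a type error, and a match that forgets the None case is flagged by the exhaustiveness checker (a warning in the default compiler settings, which builds commonly promote to an error).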
This extends throughout the codebase. Every Terraform resource type, every state transition, and every database record is strongly typed. The compiler catches entire categories of bugs at compile time, like accessing non-existent fields, missing null checks, or database schema mismatches.
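The checks described in this section can be tried on a small, self-contained version of the record. This is a sketch only: the Outputs and Resources modules below are hypothetical stand-ins (the real definitions are not shown in this post), and the sample values are made up.

```ocaml
(* Hypothetical stand-ins for the real Outputs and Resources modules. *)
module Outputs = struct
  type t = (string * string) list
end

module Resources = struct
  type t = string list
end

type t = {
  lineage : string;
  outputs : Outputs.t option;
  resources : Resources.t;
  serial : int;
  terraform_version : string;
  version : int;
}

(* The option type forces both cases to be considered before use. *)
let output_count (state : t) : int =
  match state.outputs with
  | None -> 0
  | Some outputs -> List.length outputs

(* Made-up sample value for illustration. *)
let example = {
  lineage = "example-lineage";
  outputs = None;
  resources = [ "aws_instance.web" ];
  serial = 1;
  terraform_version = "1.5.0";
  version = 4;
}

(* Each of these lines is rejected by the compiler if uncommented:
   let _ = example.versions              (* unbound record field *)
   let _ = { example with serial = "2" } (* string where int is expected *)
   let _ = List.length example.outputs   (* option used as a list *)
*)
```

Removing the None branch from output_count makes the compiler report a non-exhaustive match, which is the check the paragraph above relies on.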
The database schema drift problem
You're iterating on your database schema by renaming a column, changing a type, or adding a constraint. In most languages, you update the schema, deploy the migration, and hope you caught all the queries that reference the old structure. You didn't because a query somewhere references the old column name. It works in dev with the old schema but crashes in staging with the new schema.
Stategraph uses typed SQL where every query declares explicit types for its parameters and return values. When you change a query's type signature, every call site in the codebase must be updated to match, and the compiler enforces this.
let insert_resource_sql () =
  Pgsql_io.Typed_sql.(
    sql
    // Ret.bigint
    /^ "INSERT INTO resources (state_id, mode, type, name,
        provider_id, module_) VALUES ($state_id, $mode,
        $type, $name, $provider_id, $module_) RETURNING id"
    /% Var.uuid "state_id"
    /% Var.text "mode"
    /% Var.text "type"
    /% Var.text "name"
    /% Var.uuid "provider_id"
    /% (Var.option @@ Var.text "module_"))
This query expects specific types. The state_id must be a UUID, mode must be text, and module_ is optional text. The return value is typed as bigint. If you try to pass a string where a UUID is expected, you get a compiler error. If you pass a plain string for module_ instead of an option, you get a compiler error.
When you update a query to match a new schema, the type system ensures every place that calls that query gets updated too. You can't deploy code where query definitions and their usage are out of sync.
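To make the idea concrete, here is a toy model of how a parameter list can determine the type of the function a caller must use. It is not the Pgsql_io API and makes no claim about how that library is implemented; every name below (uuid, var, params, render, bind) is hypothetical, and arguments are rendered to strings purely for illustration.

```ocaml
(* Toy model of typed query parameters. All names are hypothetical. *)
type uuid = Uuid of string

(* A parameter description indexed by the OCaml type it accepts. *)
type _ var =
  | Uuid_var : string -> uuid var
  | Text_var : string -> string var
  | Opt_var : 'a var -> 'a option var

(* A parameter list. The first index is the function type a caller must
   supply arguments for; the second is the final result type. *)
type (_, _) params =
  | Nil : ('r, 'r) params
  | Cons : 'a var * ('f, 'r) params -> ('a -> 'f, 'r) params

(* Render one argument as text; a real driver would encode it properly. *)
let rec render : type a. a var -> a -> string =
 fun var value ->
  match var with
  | Uuid_var _ -> let (Uuid s) = value in s
  | Text_var _ -> value
  | Opt_var inner -> (
      match value with
      | None -> "NULL"
      | Some v -> render inner v)

(* Collect one argument per parameter, in declaration order, then call k. *)
let rec bind : type f r. (f, r) params -> (string list -> r) -> f =
 fun params k ->
  match params with
  | Nil -> k []
  | Cons (var, rest) ->
      fun value -> bind rest (fun args -> k (render var value :: args))

(* A shortened version of the parameter list from the query above. *)
let insert_resource_params =
  Cons (Uuid_var "state_id",
    Cons (Text_var "mode",
      Cons (Opt_var (Text_var "module_"), Nil)))

(* The compiler requires exactly a uuid, a string, and a string option. *)
let args =
  bind insert_resource_params (fun args -> args) (Uuid "state-1") "managed" None
```

In this sketch, adding or retyping a parameter in insert_resource_params changes the type of every bind call that uses it, so stale call sites stop compiling. That is the property the typed SQL layer is described as providing.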
JSON transformations that can't lose data
Stategraph ingests Terraform state as JSON, normalizes it into a graph, stores it in PostgreSQL, and reconstructs it back to JSON when Terraform requests it. Every transformation is a place where data can get lost or corrupted, whether from a field you forgot to serialize, a nested structure you flattened incorrectly, or a type that doesn't round-trip.
Testing can catch some of this, and round-trip tests help, but you're fundamentally relying on test coverage. Missed cases show up when someone's Terraform state comes back missing a field.
OCaml has a feature called PPX (preprocessor extensions) that generates serialization code automatically. You define the type, and the serializer is generated from the type definition.
type aws_instance = {
  instance_id : string;
  instance_type : string;
  ami : string;
  availability_zone : string option;
  tags : (string * string) list;
} [@@deriving yojson]
When you add a field, the serializer is regenerated. When you change a type, the serializer is regenerated. If you forget to handle a case, the exhaustiveness checker catches it at compile time. You don't write serialization tests because the type system guarantees serialization is correct.
This is how Stategraph handles Terraform's resource types. Every AWS resource, every GCP resource, every Azure resource is an OCaml type with automatic JSON serialization. We don't write serialization code. We don't test round-trips manually. The type system handles it.
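For a sense of what the deriver saves you from maintaining, here is a hand-written stand-in for a serializer and deserializer pair on a cut-down record. It is a sketch under stated assumptions: the json type is a minimal local definition so the example needs only the standard library (real code would use Yojson and the generated functions), the instance record is a hypothetical two-field reduction of aws_instance, and the function names are illustrative.

```ocaml
(* Minimal local JSON type; an assumption made to avoid dependencies. *)
type json =
  | Null
  | String of string
  | Assoc of (string * json) list

(* Hypothetical cut-down version of the aws_instance record. *)
type instance = {
  instance_id : string;
  availability_zone : string option;
}

let json_of_string_option = function
  | None -> Null
  | Some s -> String s

(* Every field has to be listed here by hand ... *)
let instance_to_json (i : instance) : json =
  Assoc
    [
      ("instance_id", String i.instance_id);
      ("availability_zone", json_of_string_option i.availability_zone);
    ]

(* ... and again here, which is where hand-written code tends to drift. *)
let instance_of_json (j : json) : (instance, string) result =
  match j with
  | Assoc fields -> (
      match
        ( List.assoc_opt "instance_id" fields,
          List.assoc_opt "availability_zone" fields )
      with
      | Some (String instance_id), (None | Some Null) ->
          Ok { instance_id; availability_zone = None }
      | Some (String instance_id), Some (String az) ->
          Ok { instance_id; availability_zone = Some az }
      | _ -> Error "instance: unexpected field shape")
  | Null | String _ -> Error "instance: expected an object"

(* Round trip: encode then decode should give back the original value. *)
let round_trips (i : instance) : bool =
  instance_of_json (instance_to_json i) = Ok i
```

With a deriver, both functions are produced from the single type definition, so adding a field cannot update one direction and miss the other.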
Race conditions prevented by default
Terraform operations are inherently concurrent. Multiple users apply changes, CI pipelines run in parallel, and drift detection scans resources continuously. Coordinating all of this without data races requires careful mutex management and defensive programming, and it's easy to get wrong.
OCaml values are immutable by default, so concurrent operations can't accidentally share mutable state: mutability has to be requested explicitly. When you want to modify something, you create a new version explicitly. This eliminates entire categories of race conditions.
One operation can't corrupt another operation's view of state because state is immutable by default. When combined with PostgreSQL's row-level locking at the database layer, concurrent operations compose correctly without manual mutex management or defensive copying.
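A minimal sketch of what "create a new version explicitly" looks like, using a hypothetical two-field state record rather than Stategraph's real type:

```ocaml
(* Hypothetical, simplified state record. Fields are immutable by default. *)
type state = {
  serial : int;
  lineage : string;
}

(* Returns a new record; the argument is left untouched. *)
let bump_serial (s : state) : state = { s with serial = s.serial + 1 }

let before = { serial = 7; lineage = "example-lineage" }
let after = bump_serial before
```

Any operation still holding before keeps seeing serial 7, while after carries the incremented value. Writing s.serial <- 8 would not compile unless the field were declared mutable.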
Error handling with discipline
Type safety is only half of what makes Stategraph robust. The other half is discipline in how we use those types.
We encode errors as variants and exhaustively match every case. We never use a catch-all wildcard pattern (the equivalent of an "else" clause) that matches everything. When we add a new error to the system, the compiler tells us every place we aren't handling it. This is how robust systems are built. Systems can fail in far more ways than they succeed, and the compiler ensures we handle all of them.
This discipline extends throughout the codebase. Every error case is explicit. Every state transition is enumerated. Every optional value is handled. The type system gives us the tools, but discipline is what turns those tools into reliability.
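As an illustration of that discipline, here is a small sketch with a hypothetical error type. The constructor names and messages are invented for the example and are not Stategraph's actual errors.

```ocaml
(* Hypothetical error variant; each failure mode is its own constructor. *)
type apply_error =
  | Lock_held of string           (* who currently holds the lock *)
  | Serial_conflict of int * int  (* expected serial, actual serial *)
  | Invalid_state of string       (* human-readable reason *)

(* No wildcard branch: every constructor is matched by name. *)
let describe (e : apply_error) : string =
  match e with
  | Lock_held owner -> "locked by " ^ owner
  | Serial_conflict (expected, actual) ->
      Printf.sprintf "expected serial %d, found %d" expected actual
  | Invalid_state reason -> "invalid state: " ^ reason
```

Because describe has no wildcard branch, adding a fourth constructor to apply_error makes the compiler flag this match as non-exhaustive. A wildcard branch would have silently absorbed the new case.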
The difference in practice
The same categories of bugs. Different places to catch them.
Production systems that can't afford bugs
This isn't academic type theory. Production systems use OCaml for exactly this reason.
At Terrateam, we process thousands of concurrent Terraform operations daily, managing infrastructure for hundreds of organizations where a state corruption bug would cascade across every customer. We're built on OCaml, and the type system catches bugs at compile time that would be production incidents in other languages.
Jane Street trades billions daily on OCaml infrastructure. Their trading systems handle concurrent market data and execute trades with zero tolerance for race conditions or undefined behavior. They chose OCaml because correctness isn't optional.
Pattern Recognition
Systems that absolutely cannot fail choose languages where certain failures are impossible, not just unlikely. Testing finds bugs, but types prevent entire categories of bugs from existing.
But who knows OCaml?
This is the most common objection, and it's true: OCaml developers are rare.
But here's what we've found. Engineers who understand distributed systems, type systems, and correctness learn OCaml quickly. The learning curve from Rust, Haskell, or even TypeScript with advanced types is gentler than you'd expect because the concepts transfer even if the syntax is unfamiliar.
More importantly, OCaml codebases are stable. We're not debugging race conditions or chasing down production crashes from schema drift. We're not writing extensive test suites for serialization edge cases. We're building features while the type system handles the category of bugs that would otherwise consume engineering time.
When you encode correctness in types, maintenance gets easier instead of harder. New engineers spend less time understanding implicit invariants and more time writing code the compiler verifies.
Correctness as a feature
We're building Stategraph to manage Terraform state for infrastructure that runs production applications. State corruption has to be impossible instead of unlikely. Invalid state transitions need to be prevented by the compiler instead of caught by tests. Schema drift needs to break the build instead of production.
That's what OCaml gives us. It provides a type system that makes entire categories of bugs impossible instead of just unlikely. The compiler proves properties about our code that testing can only approximate.
OCaml's compile-time guarantees are why we use it to build Stategraph.