A distributed queue in a single JSON file on object storage

Original link: https://turbopuffer.com/blog/object-storage-queue



February 12, 2026 · Dan Harrison (Engineer)

We recently replaced our internal indexing job queue, which notifies indexing nodes to build and update search indexes after data is written to the WAL. The queue is not part of the write path; it's purely a notification system used to schedule asynchronous indexing work. The prior version sharded queues across indexing nodes, so a slow node would block all jobs assigned to it even if other nodes were idle. The new version uses a single queue file on object storage with a stateless broker for FIFO execution, at-least-once guarantees, and 10x lower tail latency versus our prior implementation, so indexing jobs spend less time in the queue.

Why are we so obsessed with building on object storage? Because it's simple, predictable, easy to be on-call for, and extremely scalable. We know how it behaves, and as long as we design within those boundaries, we know it will perform.

Rather than present the final design of our new queue from the top down, let's build it from the bottom up, starting with the simplest thing that works and adding complexity as needed.

Step 1: queue.json

The total size of the data in a turbopuffer job queue is small, well less than 1 GiB. This easily fits in memory, so the simplest functional design is a single file (e.g., queue.json) repeatedly overwritten with the full contents of the queue.

A queue pusher reads the contents of the queue, appends a new job to the end, and writes it using compare-and-set (CAS).

A queue worker similarly uses CAS to mark the first unclaimed job as in progress (○ → ◐), and then gets to work.

We'll call pushers and workers clients, and push and claim operations requests.

The compare-and-set (CAS) primitive makes this atomic. The write only succeeds if queue.json hasn't changed since it was read. If it has changed, the client reads the new contents and tries again. This gives strong consistency guarantees without complex locking.

queue.json                     
┌─────────────────────────────────┐ 
│ {"jobs":["◐","○","○","○","○",]} │ 
└─────────────────────────────────┘ 
            ▲                 ▲   
            │                 │
            │                 │
        CAS │             CAS │
      write │           write │
            │                 │
            │                 │
      ┌─────┴──┐        ┌─────┴──┐
      │ worker │        │ pusher │
      └────────┘        └────────┘

This simplest of queues works surprisingly well! For up to 1 request per second (a limit imposed by GCS), it's already production grade thanks to everything that object storage does for us.
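The read-append-CAS loop can be sketched in a few lines. This is a minimal illustration, not turbopuffer's implementation: it uses an in-memory stand-in for the bucket with a generation counter, mirroring how GCS conditions writes on an object's generation. All names (`ObjectStore`, `cas_write`, the job schema) are hypothetical.

```python
import json

class ObjectStore:
    """In-memory stand-in for a bucket: each successful write bumps a
    generation number, and writes can be conditioned on it (CAS)."""
    def __init__(self):
        self.data, self.generation = b'{"jobs": []}', 0

    def read(self):
        return self.data, self.generation

    def cas_write(self, data, if_generation):
        # Mirrors a GCS-style if-generation-match precondition.
        if if_generation != self.generation:
            return False  # someone else wrote first; caller must retry
        self.data, self.generation = data, self.generation + 1
        return True

def push(store, job_id):
    """Pusher: read the queue, append a job, CAS it back; retry on conflict."""
    while True:
        raw, gen = store.read()
        queue = json.loads(raw)
        queue["jobs"].append({"id": job_id, "state": "unclaimed"})
        if store.cas_write(json.dumps(queue).encode(), gen):
            return

def claim(store):
    """Worker: mark the first unclaimed job as in progress (the o -> half-dot step)."""
    while True:
        raw, gen = store.read()
        queue = json.loads(raw)
        job = next((j for j in queue["jobs"] if j["state"] == "unclaimed"), None)
        if job is None:
            return None  # nothing to do
        job["state"] = "in_progress"
        if store.cas_write(json.dumps(queue).encode(), gen):
            return job["id"]
```

Because every mutation goes through `cas_write`, two clients racing on the same generation cannot both win; the loser simply re-reads and retries.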

But most queues (including ours) receive more than one request per second. We need more throughput.

Step 2: queue.json with group commit

Object storage has many virtues, but low write latency is not one of them. Replacing a file can take up to 200ms, so instead of writing jobs one-by-one, we need to batch. Whenever a write is in flight, we buffer incoming requests in memory. As soon as the write finishes, we flush the buffer as the next CAS write.

This technique is commonly called group commit, and it's the same pattern turbopuffer uses for batching writes to the WAL. Traditional databases also use this technique to coalesce fsync(2) calls to maximize the committed throughput to disk.

queue.json                     
┌─────────────────────────────────┐ 
│ {"jobs":["◐","◐","◐","○","○",]} │ 
└─────────────────────────────────┘ 
                ▲             ▲
          group │       group │   
         commit │      commit │ 
                │             │
    ┌─buffer────┴─┐ ┌─buffer──┴───┐
    │┌───┬───┬───┐│ │┌───┬───┬───┐│
    ││ ◐ │ ◐ │ ◐ ││ ││ ○ │ ○ │ ○ ││
    │└───┴───┴───┘│ │└───┴───┴───┘│
    └──────▲──────┘ └──────▲──────┘
           │               │
      ┌────┴───┐      ┌────┴───┐
      │ worker │      │ pusher │
      └────────┘      └────────┘

Group commit solves our throughput problem by decoupling write rate from request rate. The scaling bottleneck shifts from write latency (~200ms/write) to network bandwidth (~10 GB/s) – far greater than what turbopuffer needs to track indexing jobs.
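The buffering described above can be sketched with a condition variable: pushes accumulate while a write is in flight, and each caller blocks until the batch carrying its job has been written. This is an illustrative sketch (a stand-in `write_fn` plays the role of the slow CAS write to queue.json), not turbopuffer's code.

```python
import threading

class GroupCommitter:
    """Buffer incoming jobs while a slow write is in flight; flush the
    whole buffer as the next write. A push is acknowledged (returns)
    only once the batch containing it has been durably written."""
    def __init__(self, write_fn):
        self.write_fn = write_fn  # stands in for the ~200ms CAS write
        self.buffer = []
        self.started_batches = 0    # batches handed to write_fn
        self.committed_batches = 0  # batches write_fn has finished
        self.cond = threading.Condition()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def push(self, job):
        with self.cond:
            self.buffer.append(job)
            my_batch = self.started_batches + 1  # batch my job will ride in
            self.cond.notify_all()
            self.cond.wait_for(lambda: self.committed_batches >= my_batch)

    def _flush_loop(self):
        while True:
            with self.cond:
                self.cond.wait_for(lambda: self.buffer)
                batch, self.buffer = self.buffer, []
                self.started_batches += 1
            self.write_fn(batch)  # one write for the whole batch
            with self.cond:
                self.committed_batches += 1
                self.cond.notify_all()
```

The key property: however many pushes arrive during one 200ms write, they all land in a single follow-up write, so the write rate stays bounded while the request rate can grow.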

However, there’s still a problem. In any turbopuffer region, tens or hundreds of clients will contend over the single queue object as new data is written to many namespaces.

Since CAS ensures strong consistency by forcing each write to be non-overlapping in time, we can only fit 1 / ~200ms = ~5 writes / second (and we still have the 1 RPS limit on GCS).

The problem is no longer throughput. We need fewer writers.

Note: This design, coupled with sharding to local queues, is roughly what we had in production prior to this update. The next sections describe turbopuffer's current production indexing queue.

Step 3: queue.json with a brokered group commit

To eliminate contention over the queue object, we introduce a stateless broker which is responsible for all interactions with object storage. All clients must now liaise with the broker instead of writing to object storage directly.

The broker runs a single group commit loop on behalf of all clients, so no one contends for the object. Critically, it doesn't acknowledge a write until the group commit has landed in object storage. No client moves on until its data is durably committed.

Now the broker is the bottleneck, but a single broker process can serve hundreds or thousands of clients without breaking a sweat because the writes are so small. It's just holding open connections and buffering requests in memory while waiting on I/O. Object storage does the heavy lifting.
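Conceptually, the broker's commit loop folds a batch of buffered client requests, both pushes and claims, into one new queue state, which becomes one CAS write of queue.json. A minimal sketch of that folding step, with an assumed request/job schema:

```python
def apply_requests(queue, requests):
    """Fold a batch of buffered client requests into the queue state.
    One call here corresponds to one brokered group-commit write;
    claim replies are sent back to workers only after that write lands."""
    for req in requests:
        if req["op"] == "push":
            queue["jobs"].append({"id": req["job"], "state": "unclaimed"})
        elif req["op"] == "claim":
            job = next((j for j in queue["jobs"]
                        if j["state"] == "unclaimed"), None)
            if job is not None:
                job["state"] = "in_progress"
                req["claimed"] = job["id"]  # reply payload for this worker
    return queue
```

Because a single broker applies all requests in order, claims are handed out FIFO and there is no CAS contention between clients.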

queue.json                     
┌─────────────────────────────────┐ 
│ {"jobs":["◐","◐","◐","○","○",]} │ 
└─────────────────────────────────┘ 
                ▲                
                │ brokered
                │ group commit
                │
╔══ broker ═════╧═════════════════╗
║  ┌─ buffer ───────────────────┐ ║
║  │ ┌───┬───┬───┬───┬───┬───┐  │ ║
║  │ │ ◐ │ ◐ │ ◐ │ ○ │ ○ │ ○ │  │ ║
║  │ └───┴───┴───┴───┴───┴───┘  │ ║
║  └────────────────────────────┘ ║
╚════════╤═══════════════╤════════╝
         │               │
    ┌────┴────┐     ┌────┴────┐
    │ workers │     │ pushers │
    └─────────┘     └─────────┘

That's it for scaling. The system can now handle turbopuffer's indexing traffic. But we need high-availability.

Step 4: queue.json with an HA brokered group commit

The broker's machine might die at any time. Similarly, some worker might claim a job and then never finish it. The fix for each of these has the same shape — notice when something is gone and hand off the responsibility — but the details differ.

If any request from a client to the broker takes too long, we start a new broker. Clients will need a way to find the new broker, so we write the broker's address to queue.json.

The broker is stateless, so it's easy and inexpensive to move. And if we end up with more than one broker at a time? That's fine: CAS ensures correctness even with two brokers. The previous broker eventually discovers it's no longer the broker when it gets a CAS failure on queue.json. The only downside is a bit of contention, and thus slowness, for this brief duration.

For the job claims, we add a heartbeat. Periodically, the worker confirms that it's still on track by sending the broker a timestamp, which is then written to queue.json for that job (one heartbeat per claimed job). If a job's last heartbeat is ever older than some timeout, we assume the original worker is gone and the next worker takes over where it left off.
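The reclaim rule can be sketched as follows. The timeout value and field names are illustrative, not turbopuffer's actual configuration:

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative value

def reclaimable(job, now):
    """A claimed job whose worker hasn't heartbeated within the timeout
    is assumed orphaned and may be handed to the next worker."""
    return (job["state"] == "in_progress"
            and now - job["last_heartbeat"] > HEARTBEAT_TIMEOUT)

def next_job(queue, now):
    """Broker-side claim: hand out the first job that is either still
    unclaimed or orphaned by a worker with a stale heartbeat."""
    for job in queue["jobs"]:
        if job["state"] == "unclaimed" or reclaimable(job, now):
            job["state"] = "in_progress"
            job["last_heartbeat"] = now
            return job
    return None
```

Since heartbeats ride through the same brokered group commit as everything else, a takeover is just another CAS-protected state change in queue.json.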

queue.json                     
┌─────────────────────────────────┐ 
│  {                              │
│   "broker":"10.0.0.42:3000",    │
│   "jobs":["◐(♥)","◐(♥)","○",]   │
│  }                              │
└─────────────────────────────────┘ 
                ▲               ▲
       brokered │          read │
   group commit │               │
                │               │
╔══ broker ═════╧═════════════════╗
║  ┌─ buffer ───────────────────┐ ║
║  │ ┌───┬───┬───┬───┬───┬───┐  │ ║
║  │ │ ◐ │ ◐ │ ○ │ ○ │ ○ │ ○ │  │ ║
║  │ └───┴───┴───┴───┴───┴───┘  │ ║
║  └────────────────────────────┘ ║
╚════════╤═══════════════╤════════╝
         │               │      │
    ┌────┴────┐     ┌────┴────┐ │
    │ workers │     │ pushers │─┘
    └─────────┘     └─────────┘

Ship it

We built a reliable distributed job queue with just a single file on object storage and a handful of stateless processes. It easily handles our throughput, guarantees at-least-once delivery, and fails over to any node as needed. Those familiar with turbopuffer's core architecture will see the parallels. Object storage offers few, but powerful, primitives. Once you learn how they behave, you can wield them to build resilient, performant, and highly scalable distributed systems with what's already there.

turbopuffer

We host 2.5T+ documents, handle writes at 10M+ writes/s, and serve 10k+ queries/s. We are ready for far more. We hope you'll trust us with your queries.

Get started