“The Kafka community is currently seeing an unprecedented situation with three KIPs (KIP-1150, KIP-1176, KIP-1183) simultaneously addressing the same challenge of high replication costs when running Kafka across multiple cloud availability zones.” — Luke Chen, The Path Forward for Saving Cross-AZ Replication Costs KIPs
At the time of writing, the Kafka project finds itself at a fork in the road: the path it chooses for implementing S3 topics has implications for the long-term success of the project, not just for the next couple of years but for the next decade. Open-source projects live and die by these big decisions, and as a community we need to make sure we take the right one.
This post explains the competing KIPs, but goes further and asks bigger questions about the future direction of Kafka.
Before comparing proposals, we should step back and ask what kind of system we want Kafka to become.
Kafka now faces two almost opposing forces. One force is stabilizing: the on-prem deployments and the low latency workloads that depend on local disks and replication. Kafka must continue to serve those use cases. The other force is disrupting: the elastic, cloud-native workloads that favor stateless compute and shared object storage.
Relaxed-latency workloads such as analytics have seen a shift in system design with durability increasingly delegated to shared object storage, freeing the compute layer to be stateless, elastic, and disposable. Many systems now scale by adding stateless workers rather than rebalancing stateful nodes. In a stateless compute design, the bottleneck shifts from data replication to metadata coordination. Once durability moves to shared storage, sequencing and metadata consistency become the new limits of scalability.
That brings us to the current moment, with three competing KIPs defining how to integrate object storage directly into Kafka. While we evaluate these KIPs, it’s important to consider the motivations for building direct-to-S3 topics. Cross-AZ charges are typically top of mind, but it’s a mistake to think of S3 simply as a cheaper disk or a networking cheat. The shift is also architectural, giving us an opportunity to achieve operational benefits such as elastic, stateless compute.
The devil is in the details: how each KIP enables Kafka to leverage object storage while also retaining Kafka’s soul and what made it successful in the first place.
Many KIPs, but only two paths
With that in mind, while three KIPs have been submitted, it comes down to two different paths:
Revolutionary: Choose a direct-to-S3 topic design that maximizes the benefits of an object-storage architecture, with greater elasticity and lower operational complexity. However, in doing so, we may increase the implementation cost, and possibly the long-term maintenance burden, by keeping two very different topic models (leader-based replication and direct-to-S3) in the same project.
Evolutionary: Aim for an evolutionary design that reuses existing components to reduce the need for large refactoring or duplication of logic. However, by coupling to the existing architecture, we forfeit the extra benefits of object storage, focusing primarily on networking cost savings (in AWS and GCP). Through this coupling, we also run the risk of achieving the opposite: harder-to-maintain code, produced by bending and contorting a second workload into an architecture optimized for something else.
In this post I will explain the two paths at this fork in the road, show how the various KIPs map onto those paths, and invite the whole community to think through what it wants for Apache Kafka over the next decade.
Note that I do not include KIP-1183, as it looks dead in the water and is not a serious contender. The KIP proposes AutoMQ’s storage abstractions without the accompanying implementation, which, perhaps cynically, seems to benefit AutoMQ were it ever adopted, while leaving the community to rewrite the entire storage subsystem again. If you want a quick summary of the three KIPs (including KIP-1183), you can read Luke Chen’s The Path Forward for Saving Cross-AZ Replication Costs KIPs or Anton Borisov’s summary of the three KIPs.
This post is structured as follows:
The term “Diskless” vs “Direct-to-S3”
The Common Parts. Some approaches are shared across multiple implementations and proposals.
Revolutionary: KIPs and real-world implementations
Evolutionary: Slack’s KIP-1176
The Hybrid: balancing revolution with evolution
Deciding Kafka’s future
I used the term “diskless” in the title as that is the current hype word. But it is clear that not all designs are actually diskless in the same spirit as “serverless”. Serverless implies that users no longer need to consider or manage servers at all, not that there are no servers.
In the world of open source, where you run stuff yourself, diskless would have to mean literally “no disks”; otherwise you will be configuring disks as part of your deployment. But all the KIPs (in their current state) depend on disks to some extent, even KIP-1150, which was proposed as diskless. In most cases, disk behavior continues to influence performance, and therefore correct disk provisioning will be important.
So I’m not a fan of “diskless”; I prefer “direct-to-S3”, which encompasses all designs that treat S3 (and other object stores) as the only source of durability.
Combined Objects
The main commonality between all direct-to-S3 Kafka implementations and design proposals is the uploading of objects that combine the data of multiple topics. The reasons are twofold:
Avoiding the small file problem. Most designs are leaderless for producer traffic, allowing any server to receive writes to any topic. To avoid uploading a multitude of tiny files, servers accumulate batches in a buffer until ready for upload. Before upload, the buffer is sorted by topic id and partition, making compaction and some reads more efficient by ensuring that data of the same topic and partition sits in contiguous byte ranges (see the buffer sketch after this list).
Pricing. The pricing of many (but not all) cloud object storage services penalizes excessive requests, so it can be cheaper to roll up whatever data has been received in the last X milliseconds and upload it with a single request.
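To make this concrete, here is a minimal sketch of such a buffer, assuming hypothetical types (ProduceBuffer, BatchKey, ObjectStoreClient) that do not correspond to any KIP’s actual API:

```java
// A minimal sketch of the combined-object buffer. ProduceBuffer, BatchKey and
// ObjectStoreClient are illustrative names, not APIs from any KIP. Batches
// from many partitions accumulate in a sorted buffer and are flushed as one
// object with a single PUT.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.*;

final class BatchKey implements Comparable<BatchKey> {
    final UUID topicId;
    final int partition;

    BatchKey(UUID topicId, int partition) {
        this.topicId = topicId;
        this.partition = partition;
    }

    // Sorting by (topicId, partition) keeps each partition's data in a
    // contiguous byte range of the uploaded object.
    @Override
    public int compareTo(BatchKey o) {
        int c = topicId.compareTo(o.topicId);
        return c != 0 ? c : Integer.compare(partition, o.partition);
    }
}

interface ObjectStoreClient {
    void put(String key, byte[] bytes) throws IOException;
}

final class ProduceBuffer {
    private final SortedMap<BatchKey, List<byte[]>> buffer = new TreeMap<>();

    synchronized void append(UUID topicId, int partition, byte[] recordBatch) {
        buffer.computeIfAbsent(new BatchKey(topicId, partition), k -> new ArrayList<>())
              .add(recordBatch);
    }

    // Called every X milliseconds (or when the buffer crosses a size threshold):
    // serialize all buffered batches into one combined object and upload it
    // with a single request, instead of one request per partition.
    synchronized void flush(ObjectStoreClient store) throws IOException {
        if (buffer.isEmpty()) return;
        ByteArrayOutputStream combined = new ByteArrayOutputStream();
        for (Map.Entry<BatchKey, List<byte[]>> entry : buffer.entrySet()) {
            for (byte[] batch : entry.getValue()) combined.write(batch);
            // A real implementation would also record (key -> byte range)
            // metadata so readers and the compactor can locate each
            // partition's data inside the object.
        }
        store.put("objects/" + UUID.randomUUID(), combined.toByteArray());
        buffer.clear();
    }
}
```

On the pricing point: as of this writing, S3 Standard PUT requests cost roughly $0.005 per 1,000, so flushing every 250 ms (four PUTs per second per server, about 14,400 per hour, or roughly $0.07) costs about the same whether the object combines ten partitions or ten thousand.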
Sequencing and metadata storage
In the leader-based model, the leader determines the order of batches in a topic partition. But in the leaderless model, multiple brokers could be simultaneously receiving produce batches for the same topic partition, so how do we order those batches? We need a way of establishing a single order for each partition, and we typically use the word “sequencing” to describe that process. Usually a central component does the sequencing and metadata storage, but some designs manage sequencing in other ways.
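For intuition, here is a minimal, hypothetical sequencer sketch; the names are illustrative and not taken from any of the KIPs:

```java
// A minimal, hypothetical sequencer sketch; Sequencer and CommittedRange are
// illustrative names. Brokers upload first, then ask the sequencer to commit;
// the sequencer decides the order in which concurrently uploaded batches
// appear in each partition's log.
import java.util.*;

final class Sequencer {

    record CommittedRange(String objectKey, int startByte, int length, long baseOffset) {}

    // Next offset to assign, per topic partition.
    private final Map<String, Long> nextOffset = new HashMap<>();
    // Committed metadata: partition -> ordered list of byte ranges in objects.
    private final Map<String, List<CommittedRange>> log = new HashMap<>();

    // Returns the base offset assigned to this batch. Because commits are
    // serialized here, everyone agrees on batch order, even though any broker
    // may have accepted the original produce request.
    synchronized long commit(String partition, String objectKey,
                             int startByte, int length, int recordCount) {
        long base = nextOffset.getOrDefault(partition, 0L);
        nextOffset.put(partition, base + recordCount);
        log.computeIfAbsent(partition, p -> new ArrayList<>())
           .add(new CommittedRange(objectKey, startByte, length, base));
        return base;
    }
}
```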
Zone-aligned producers via metadata manipulation
WarpStream was the first to demonstrate that you could hack the metadata step of initializing a producer, providing it with broker information that aligns the producer with a zone-local broker for the topics it is interested in. The Kafka client is leader-oriented, so we simply hand it a zone-local broker and tell it “this is the leader”. This is how all the leaderless designs ensure producers write to zone-local brokers. It’s not pretty, and a future KIP should remove the need for this kind of hack.
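A sketch of what that manipulation amounts to, with illustrative types (BrokerInfo, ZoneAlignedMetadata), assuming the client’s zone is known from its client.id or client.rack:

```java
// Hedged sketch of the metadata "hack", assuming any broker can accept writes
// for any partition. BrokerInfo and ZoneAlignedMetadata are illustrative
// names; real implementations hook into the Metadata response path.
import java.util.List;

record BrokerInfo(int id, String host, int port, String rack) {}

final class ZoneAlignedMetadata {
    private final List<BrokerInfo> liveBrokers;

    ZoneAlignedMetadata(List<BrokerInfo> liveBrokers) {
        this.liveBrokers = liveBrokers;
    }

    // Pick any broker in the client's zone and present it as the "leader";
    // the leader-oriented client will then produce to it, keeping traffic
    // inside the availability zone.
    BrokerInfo leaderFor(String clientRack) {
        return liveBrokers.stream()
                .filter(b -> b.rack().equals(clientRack))
                .findAny()
                .orElse(liveBrokers.get(0)); // fall back to any live broker
    }
}
```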
Zone-aligned consumers
Consumer zone-alignment heavily depends on the particular design, but two broad approaches exist:
Leaderless: The same metadata manipulation that aligns producers, or KIP-392 (fetch-from-follower), which can also be used in a leaderless context.
Leader-based:
Zone-aware consumer group assignment as detailed in KIP-881: Rack-aware Partition Assignment for Kafka Consumers. The idea is to use consumer-to-partition assignment to ensure consumers are only assigned zone-local partitions (where the partition leader is located).
KIP-392 (fetch-from-follower), which is effective for designs that have followers (which isn’t always the case). A configuration sketch follows this list.
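For the KIP-392 route, which exists in Apache Kafka today, the wiring is just configuration; the zone id and topic name below are placeholders:

```java
// A concrete example of KIP-392 (fetch-from-follower). Broker side, in
// server.properties:
//
//   broker.rack=us-east-1a
//   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
//
// Consumer side, the client advertises its zone via client.rack:
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ZoneLocalConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        // With the rack-aware selector on the broker, fetches from this
        // consumer are served by a replica in us-east-1a when one exists.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");

        try (KafkaConsumer<String, String> consumer =
                 new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer())) {
            consumer.subscribe(List.of("my-topic"));
            // poll(...) as usual; zone alignment is transparent to the app.
        }
    }
}
```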
Object Compaction
Given that almost all designs upload combined objects, we need a way to make those mixed objects more read-optimized. This is typically done through compaction, where combined objects are ultimately separated into per-topic or even per-partition objects. Compaction can be one-shot or proceed through multiple rounds.
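A one-shot compaction pass might look something like the following sketch; Compactor, Range and Store are illustrative types, not APIs from any proposal:

```java
// A hypothetical one-shot compaction sketch: read back the per-partition byte
// ranges of combined objects and rewrite each partition's data as its own
// read-optimized object. All types here are illustrative.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.*;

final class Compactor {

    // Metadata describing one batch inside a combined object.
    record Range(String objectKey, int start, int length, String partition, long baseOffset) {}

    interface Store {
        byte[] get(String key, int start, int length) throws IOException;
        void put(String key, byte[] bytes) throws IOException;
    }

    void compact(List<Range> ranges, Store store) throws IOException {
        // Group all committed ranges by partition.
        Map<String, List<Range>> byPartition = new TreeMap<>();
        for (Range r : ranges) {
            byPartition.computeIfAbsent(r.partition(), p -> new ArrayList<>()).add(r);
        }

        for (Map.Entry<String, List<Range>> entry : byPartition.entrySet()) {
            // Preserve the offset order established by the sequencer.
            entry.getValue().sort(Comparator.comparingLong(Range::baseOffset));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Range r : entry.getValue()) {
                out.write(store.get(r.objectKey(), r.start(), r.length()));
            }
            // One object per partition; metadata is then repointed at the new
            // object and the old combined objects become eligible for deletion.
            store.put("compacted/" + entry.getKey() + "/" + UUID.randomUUID(), out.toByteArray());
        }
    }
}
```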
Revolutionary: KIPs and real-world implementations
The “revolutionary” path draws a new boundary inside Kafka by separating what can be stateless from what must remain stateful. Direct-to-S3 traffic is handled by a lightweight, elastic layer of brokers that simply serve producers and consumers. The direct-to-S3 coordination (sequencing/metadata) is incorporated into the stateful side of regular brokers, where coordinators, classic topics and KRaft live.
I cover three designs in the “revolutionary” section:
WarpStream (as a reference, a kind of yardstick to compare against)
KIP-1150 revision 1
Aiven Inkless (a Kafka fork)
3.1. WarpStream (as a yardstick)
Before we look at the KIPs that describe possible futures for Apache Kafka, let’s look at a system that was designed from scratch with both cross-AZ cost savings and elasticity (from object storage) as its core design principles. WarpStream was unconstrained by an existing stateful architecture, and with this freedom, it divided itself into:
Agent layer: Leaderless, stateless and diskless agents that handle Kafka clients, as well as compaction/cleaning work.
Coordination layer: A central metadata store for sequencing, metadata storage and housekeeping coordination.