The end of the road for kafka-delta-ingest

Original link: https://brokenco.de/2025/10/30/kafka-delta-ingest-was-fun.html

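After five years, Scribd has retired its internal data ingestion tool `kafka-delta-ingest`, even though it succeeded early on in cutting streaming data ingestion costs by 95%. The project gave rise to `delta-rs`, a popular open source Rust library for working with Delta Lake tables, which previously could only be read and written through Apache Spark. `kafka-delta-ingest` met its goals, but newer infrastructure, namely the author's "oxbow" suite combined with the medallion architecture, has since brought ingestion down to less than 10% of total data platform cost. As the other Kafka consumers at Scribd went away, the tool's value dropped, and running Kafka purely for ingestion became hard to justify. Although Scribd no longer needs it, the maintainers will keep applying `delta-rs` upgrades to `kafka-delta-ingest`, since it remains a useful test bed. The tool may still suit organizations that *already* run Kafka, but it is not a reason to adopt Kafka on its own.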

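A project called "kafka-delta-ingest" has been retired at Scribd; see the Hacker News thread for details. The project ingests data from Kafka and persists it to Delta Lake tables as Parquet files. The core problem identified, however, is that Apache Kafka itself is resource-hungry. The author and commenters agree that Kafka makes the most sense for organizations that *already* use it for other purposes. If Kafka is used only for data ingestion, leaner and cheaper alternatives exist. In this case, as the other applications consuming Kafka data went away, the value of "kafka-delta-ingest" declined with them. Commenters also mentioned WarpStream and Oxbow as potential alternatives, suggesting that data ingestion solutions are moving toward greater efficiency. Although the project has reached the end of its road at Scribd, the work it drove in delta-rs remains valuable.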
Original text

After five years in production, kafka-delta-ingest at Scribd has been shut off and removed from our infrastructure. kafka-delta-ingest was the motivation behind my team creating delta-rs, the most successful open source project I have started to date. With kafka-delta-ingest we achieved our original stated goals and reduced streaming data ingestion costs by 95%. In the time since, however, we have further reduced that cost with even more efficient infrastructure.

The original kafka-delta-ingest/delta-rs implementations were created by the joint efforts of talented developers across three continents in the middle of 2020, an otherwise totally chill time in world history.

Prior to our creation of delta-rs, the only way to read and write Delta Lake tables was through Apache Spark. While Spark is an incredibly powerful tool for reading and transforming data, it is too slow and heavyweight for the task of high-throughput data ingestion. QP and I found ourselves loving Rust, and I was able to corner the funding to get the project started on the promise of lower operational costs.

Boy howdy has the investment in Rust delivered. The implementation of kafka-delta-ingest dramatically lowered our operational costs, as Christian shares in this video:

Christian also shared some architecture and discussion in this video, which I think are useful for anybody building streaming systems around Delta Lake.

Here’s a demo by Christian too!


Ultimately, kafka-delta-ingest was decommissioned because I created an even cheaper ingestion process. My work on the oxbow suite, coupled with the medallion architecture, has brought contemporary Delta Lake ingestion down to less than 10% of the total data platform cost.

The big argument against kafka-delta-ingest was Apache Kafka. If an organization has Kafka for other reasons, then kafka-delta-ingest can be a useful “sidecar” process to persist data flowing through Kafka. If, however, the organization is running Kafka just for ingestion, there are cheaper options available. As the organization evolved, the other consumers of Kafka drifted away, driving the value proposition of kafka-delta-ingest lower and lower.
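
For readers unfamiliar with the “sidecar” idea, here is a minimal Python sketch of the general pattern: buffer records coming off a stream, write them out in batches, and only advance offsets once a batch is durable. This is not how kafka-delta-ingest is implemented. `BatchingSidecar`, `write_batch`, and the `(partition, offset, payload)` record shape are all hypothetical stand-ins, and a plain list takes the place of both the Kafka consumer and the Delta table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record shape: (partition, offset, payload). A real consumer
# would hand back richer message objects.
Record = Tuple[int, int, dict]


@dataclass
class BatchingSidecar:
    """Buffers records and flushes them in batches to a table writer."""

    write_batch: Callable[[List[dict]], None]  # stand-in for a table append
    max_records: int = 3
    buffer: List[dict] = field(default_factory=list)
    pending: Dict[int, int] = field(default_factory=dict)    # buffered, not yet durable
    committed: Dict[int, int] = field(default_factory=dict)  # safely written

    def handle(self, record: Record) -> None:
        partition, offset, payload = record
        self.buffer.append(payload)
        self.pending[partition] = offset
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # Write first, then advance offsets, so a crash replays data
        # instead of dropping it.
        self.write_batch(list(self.buffer))
        self.committed.update(self.pending)
        self.buffer.clear()
        self.pending.clear()

    def run(self, records: Iterable[Record]) -> None:
        for record in records:
            self.handle(record)
        self.flush()  # drain whatever is left at shutdown


# Usage: collect the batches in a list instead of writing to a real table.
batches: List[List[dict]] = []
sidecar = BatchingSidecar(write_batch=batches.append, max_records=2)
sidecar.run([(0, 0, {"id": 1}), (0, 1, {"id": 2}), (1, 0, {"id": 3})])
```

The sketch also shows where the cost argument bites: the sidecar itself is small, but it only earns its keep if the stream it reads from already exists for other reasons.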

This doesn’t mean kafka-delta-ingest is not useful; it’s just no longer useful at Scribd.


Kyjah Keyes and I are the maintainers of kafka-delta-ingest, and we are now both in the position of not actually using it anymore.

I will continue to make delta-rs upgrades to it, since kafka-delta-ingest continues to be a useful test bed for API changes and integration testing, but I don’t have big plans or ideas on how to grow the project further.
