Show HN: Walrus – a Kafka alternative written in Rust

原始链接: https://github.com/nubskr/walrus

## Walrus: a high-performance distributed message streaming platform

Walrus is a fault-tolerant, distributed message streaming platform built on a high-performance log storage engine. It achieves scalability and reliability through segment-based partitioning, automatic leadership rotation managed by Raft consensus, and lease-based write fencing.

Key features include automatic load balancing, a simple client protocol (connect to any node), and the ability to serve historical reads from any replica via sealed segments. Each node consists of a node controller, a Raft engine, a cluster metadata store, and Walrus-based bucket storage.

Producers and consumers connect to any node, and the cluster routes requests to the appropriate leader. Topics are split into segments (about 1M entries each), with leadership handed off on rollover to keep load balanced. Walrus ships a simple TCP-based CLI for interaction (create, produce, consume, state, metrics).

Benchmarks show write and read throughput that is competitive with, and often exceeds, Kafka and RocksDB. A formal TLA+ specification verifies data consistency, and the core storage engine is also available as a standalone Rust library. Comprehensive tests and detailed documentation are included.


Original article

Walrus is a distributed message streaming platform built on a high-performance log storage engine. It provides fault-tolerant streaming with automatic leadership rotation, segment-based partitioning, and Raft consensus for metadata coordination.

Walrus Demo

Key Features:

  • Automatic load balancing via segment-based leadership rotation
  • Fault tolerance through Raft consensus (3+ nodes)
  • Simple client protocol (connect to any node, auto-forwarding)
  • Sealed segments for historical reads from any replica
  • High-performance storage with io_uring on Linux

Walrus Architecture

Producers and consumers connect to any node (or via load balancer). The cluster automatically routes requests to the appropriate leader and manages segment rollovers for load distribution.

Walrus Node Architecture

Each node contains four key components: Node Controller (routing and lease management), Raft Engine (consensus for metadata), Cluster Metadata (replicated state), and Bucket Storage (Walrus engine with write fencing).

Node Controller

  • Routes client requests to appropriate segment leaders
  • Manages write leases (synced from cluster metadata every 100ms)
  • Tracks logical offsets for rollover detection
  • Forwards operations to remote leaders when needed

Raft Engine (Octopii)

  • Maintains Raft consensus for metadata changes only (not data!)
  • Handles leader election and log replication
  • Syncs metadata across all nodes via AppendEntries RPCs

Cluster Metadata (Raft State Machine)

  • Stores topic → segment → leader mappings
  • Tracks sealed segments and their entry counts
  • Maintains node addresses for routing
  • Replicated identically across all nodes
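
As a rough sketch of the shape of this replicated state (field and type names here are illustrative assumptions, not Walrus's actual definitions):

use std::collections::HashMap;

// Illustrative sketch of the Raft-replicated metadata; names are
// hypothetical and simplified, not the real Walrus types.
struct SegmentMeta {
    leader: u64,      // node id currently holding the write lease
    sealed: bool,     // set once the segment has rolled over
    entry_count: u64, // fixed at seal time, used to bound reads
}

struct ClusterMetadata {
    topics: HashMap<String, Vec<u64>>,   // topic -> ordered segment ids
    segments: HashMap<u64, SegmentMeta>, // segment id -> leader / sealed state
    nodes: HashMap<u64, String>,         // node id -> address for request forwarding
}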

Storage Engine

  • Wraps Walrus engine with lease-based write fencing
  • Only accepts writes if node holds lease for that segment
  • Stores actual data in WAL files on disk
  • Serves reads from any segment (sealed or active)
Quick Start

cd distributed-walrus

# Bootstrap a local cluster
make cluster-bootstrap

# Interact via CLI
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091

# In the CLI:

# create a topic named 'logs'
> REGISTER logs

# produce a message to the topic
> PUT logs "hello world"

# consume message from topic
> GET logs

# get the segment states of the topic
> STATE logs

# get cluster state
> METRICS

Simple length-prefixed text protocol over TCP:

Wire format:
  [4 bytes: length (little-endian)] [UTF-8 command]

Commands:
  REGISTER <topic>       → Create topic if missing
  PUT <topic> <payload>  → Append to topic
  GET <topic>            → Read next entry (shared cursor)
  STATE <topic>          → Get topic metadata (JSON)
  METRICS                → Get Raft metrics (JSON)

Responses:
  OK [payload]           → Success
  EMPTY                  → No data available (GET only)
  ERR <message>          → Error

See distributed-walrus/docs/cli.md for detailed CLI usage.
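
For illustration, a minimal client for this protocol might look like the sketch below. It assumes responses are framed the same way as requests (a 4-byte little-endian length followed by UTF-8 text); see docs/cli.md for the authoritative framing.

use std::io::{Read, Write};
use std::net::TcpStream;

// Frame a command as [4-byte little-endian length][UTF-8 text], send it,
// and read back one response framed the same way (an assumption; the
// request framing is documented, the response framing is inferred).
fn send_command(stream: &mut TcpStream, cmd: &str) -> std::io::Result<String> {
    let bytes = cmd.as_bytes();
    stream.write_all(&(bytes.len() as u32).to_le_bytes())?;
    stream.write_all(bytes)?;

    let mut len_buf = [0u8; 4];
    stream.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    let mut resp = vec![0u8; len];
    stream.read_exact(&mut resp)?;
    Ok(String::from_utf8_lossy(&resp).into_owned())
}

fn main() -> std::io::Result<()> {
    let mut stream = TcpStream::connect("127.0.0.1:9091")?;
    println!("{}", send_command(&mut stream, "REGISTER logs")?);
    println!("{}", send_command(&mut stream, "PUT logs hello")?);
    println!("{}", send_command(&mut stream, "GET logs")?);
    Ok(())
}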

Segment-Based Leadership Rotation

  • Topics split into segments (~1M entries each by default)
  • Each segment has a leader node that handles writes
  • Leadership rotates round-robin on segment rollover (see the sketch below)
  • Automatic load distribution across the cluster
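
The round-robin hand-off can be pictured with a small helper like this (illustrative only; the real selection logic lives in the node controller and is driven by Raft-replicated metadata):

// Hypothetical helper: pick the next segment leader round-robin.
fn next_leader(node_ids: &[u64], current_leader: u64) -> Option<u64> {
    let idx = node_ids.iter().position(|&n| n == current_leader)?;
    Some(node_ids[(idx + 1) % node_ids.len()])
}

fn main() {
    let nodes = [1, 2, 3];
    // On rollover, leadership for the new segment moves 1 -> 2 -> 3 -> 1 ...
    assert_eq!(next_leader(&nodes, 1), Some(2));
    assert_eq!(next_leader(&nodes, 3), Some(1));
}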

Lease-Based Write Fencing

  • Only the leader for a segment can write to it
  • Leases derived from Raft-replicated metadata
  • 100ms sync loop ensures lease consistency
  • Prevents split-brain writes during leadership changes
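
A rough sketch of that fencing check (types and names here are assumptions for illustration, not the actual storage API):

use std::collections::HashMap;

// Hypothetical lease table: segment id -> node id holding the write lease,
// derived from Raft-replicated metadata and refreshed every ~100ms.
type Leases = HashMap<u64, u64>;

// Accept the write only if this node holds the lease for the segment;
// otherwise tell the caller to forward to the current leaseholder.
fn try_append(local_node: u64, leases: &Leases, segment: u64, payload: &[u8]) -> Result<(), String> {
    match leases.get(&segment) {
        Some(&holder) if holder == local_node => {
            // ... append `payload` to the segment's WAL here ...
            let _ = payload;
            Ok(())
        }
        Some(&holder) => Err(format!("not leaseholder; forward to node {holder}")),
        None => Err("no lease recorded for segment".to_string()),
    }
}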
Sealed Segments

  • Old segments are "sealed" when rolled over
  • The original leader retains sealed data for reads
  • Reads can be served from any replica that has the data
  • No data movement required during rollover

Rollover Process

  • A monitor loop (10s interval) checks segment sizes
  • Triggers rollover when the entry threshold is exceeded
  • Proposes the metadata change via Raft
  • Leader transfer happens automatically
Configuration

Command-line flags:

  Flag                    Default       Description
  --node-id               (required)    Unique node identifier
  --data-dir              ./data        Root directory for storage
  --raft-port             6000          Raft/internal RPC port
  --raft-host             127.0.0.1     Raft bind address
  --raft-advertise-host   (raft-host)   Advertised Raft address
  --client-port           8080          Client TCP port
  --client-host           127.0.0.1     Client bind address
  --join                  -             Address of an existing node to join

Environment variables:

  Variable                      Default    Description
  WALRUS_MAX_SEGMENT_ENTRIES    1000000    Entries before segment rollover
  WALRUS_MONITOR_CHECK_MS       10000      Monitor loop interval (ms)
  WALRUS_DISABLE_IO_URING       -          Use mmap instead of io_uring
  RUST_LOG                      info       Log level (debug, info, warn)
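
For example, the rollover threshold could be read like this (an illustrative sketch; the actual parsing is internal to Walrus):

// Illustrative: read WALRUS_MAX_SEGMENT_ENTRIES, falling back to the
// documented default of 1,000,000 entries per segment.
fn max_segment_entries() -> u64 {
    std::env::var("WALRUS_MAX_SEGMENT_ENTRIES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1_000_000)
}

fn main() {
    println!("segment rollover threshold: {} entries", max_segment_entries());
}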

Comprehensive test suite included:

cd distributed-walrus

# Run all tests
make test

# Individual tests
make cluster-test-logs         # Basic smoke test
make cluster-test-rollover     # Segment rollover
make cluster-test-resilience   # Node failure recovery
make cluster-test-recovery     # Cluster restart persistence
make cluster-test-stress       # Concurrent writes
make cluster-test-multi-topic  # Multiple topics
Performance Characteristics

  • Write throughput: single writer per segment (lease-based)
  • Read throughput: scales with replicas (sealed segments)
  • Latency: ~1-2 RTT for forwarded ops, plus storage latency
  • Consensus overhead: metadata only (not on the data path)
  • Segment rollover: ~1M entries by default (~100MB, depending on payload size)

Walrus includes a formal TLA+ specification of the distributed data plane that models segment-based sharding, lease-based write fencing, and cursor advancement across sealed segments.

Specification: distributed-walrus/spec/DistributedWalrus.tla

Properties verified by the specification:

  • Domain Consistency: Topic metadata, WAL entries, and reader cursors stay synchronized
  • Single Writer per Segment: Only the designated leader can write to each segment
  • No Writes Past Open Segment: Closed segments remain immutable after rollover
  • Sealed Counts Stable: Entry counts for sealed segments match actual WAL contents
  • Read Cursor Bounds: Cursors never exceed segment boundaries or entry counts
  • Sequential Write Order: Entries within each segment maintain strict ordering
  • Rollover Progress: Segments exceeding the entry threshold eventually roll over
  • Read Progress: Available entries eventually get consumed by readers

The specification abstracts Raft consensus as a single authoritative metadata source and models Walrus storage as per-segment entry sequences. Model checking with TLC verifies correctness under concurrent operations.
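
As a rough illustration of the style of these invariants (notation simplified; symbols such as writes, leader, cursor, and entries are shorthand, not the spec's actual operators), the single-writer and cursor-bound properties look roughly like:

% Simplified renderings of two invariants; see DistributedWalrus.tla for the real definitions.
\forall s \in Segments:\quad writes(s) \subseteq \{\, leader(s) \,\}                    % single writer per segment
\forall t \in Topics,\ s \in segments(t):\quad 0 \le cursor(t, s) \le entries(s)        % read cursor bounds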

Storage Engine Benchmarks

The underlying storage engine delivers exceptional performance:

Walrus vs RocksDB vs Kafka - No Fsync

  System    Avg Throughput (writes/s)   Avg Bandwidth (MB/s)   Max Throughput (writes/s)   Max Bandwidth (MB/s)
  Walrus    1,205,762                   876.22                 1,593,984                   1,158.62
  Kafka     1,112,120                   808.33                 1,424,073                   1,035.74
  RocksDB   432,821                     314.53                 1,000,000                   726.53

Walrus vs RocksDB vs Kafka - With Fsync

  System    Avg Throughput (writes/s)   Avg Bandwidth (MB/s)   Max Throughput (writes/s)   Max Bandwidth (MB/s)
  RocksDB   5,222                       3.79                   10,486                      7.63
  Walrus    4,980                       3.60                   11,389                      8.19
  Kafka     4,921                       3.57                   11,224                      8.34

Benchmarks compare a single Kafka broker (no replication, no networking overhead) and RocksDB's WAL against Walrus's legacy append_for_topic() endpoint using pwrite() syscalls (no io_uring batching).

Documentation

  • Architecture Deep Dive - Detailed component interactions, data flow diagrams, startup sequence, lease synchronization, rollover mechanics, and failure scenarios
  • CLI Guide - Interactive CLI usage and commands
  • System Documentation - Full system documentation

Using Walrus as a Library

The core Walrus storage engine is also available as a standalone Rust library for embedded use cases:

The crate is published on crates.io as walrus-rust. Add it to Cargo.toml:

[dependencies]
walrus-rust = "0.2.0"
use walrus_rust::{Walrus, ReadConsistency};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a new WAL instance
    let wal = Walrus::new()?;

    // Write data to a topic
    wal.append_for_topic("my-topic", b"Hello, Walrus!")?;

    // Read data from the topic
    if let Some(entry) = wal.read_next("my-topic", true)? {
        println!("Read: {:?}", String::from_utf8_lossy(&entry.data));
    }

    Ok(())
}

See the standalone library documentation for single node usage, configuration options, and API reference.

We welcome patches; check CONTRIBUTING.md for the workflow.

This project is licensed under the MIT License; see the LICENSE file for details.

Changelog

  • New: Distributed message streaming platform with Raft consensus
  • New: Segment-based leadership rotation and load balancing
  • New: Automatic rollover and lease-based write fencing
  • New: TCP client protocol with simple text commands
  • New: Interactive CLI for cluster interaction
  • New: Comprehensive test suite for distributed scenarios
  • New: Atomic batch write operations (batch_append_for_topic)
  • New: Batch read operations (batch_read_for_topic)
  • New: io_uring support for batch operations on Linux
  • New: Dual storage backends (FD backend with pread/pwrite, mmap backend)
  • New: Namespace isolation via _for_key constructors
  • New: FsyncSchedule::SyncEach and FsyncSchedule::NoFsync modes
  • Improved: Comprehensive documentation with architecture and design docs
  • Improved: Enhanced benchmarking suite with batch operation benchmarks
  • Fixed: Tail read offset tracking in concurrent scenarios

  • Initial release
  • Core WAL functionality
  • Topic-based organization
  • Configurable consistency modes
  • Comprehensive benchmark suite
  • Memory-mapped I/O implementation
  • Persistent read offset tracking
