Show HN: Walrus – a Kafka alternative written in Rust

原始链接: https://github.com/nubskr/walrus

## Walrus: a high-performance distributed message streaming platform

Walrus is a fault-tolerant, distributed message streaming platform built on a high-performance log storage engine. It achieves scalability and reliability through segment-based partitioning, automatic leadership rotation managed by Raft consensus, and lease-based write fencing.

Key features include automatic load balancing, a simple client protocol (connect to any node), and the ability to serve historical reads from any replica via sealed segments. Each node consists of a node controller, a Raft engine, a cluster metadata store, and Walrus-based bucket storage.

Producers and consumers connect to any node, and the cluster routes requests to the appropriate leader. Topics are split into segments (about 1M entries each), with leadership handed off on rollover to keep load balanced. Walrus ships a simple TCP-based CLI for interaction (create, produce, consume, state, metrics).

Benchmarks show write and read throughput that is competitive with, and often exceeds, Kafka and RocksDB. A formal TLA+ specification verifies data consistency, and the core storage engine is also available as a standalone Rust library. Comprehensive tests and detailed documentation are included.


Original article

Walrus is a distributed message streaming platform built on a high-performance log storage engine. It provides fault-tolerant streaming with automatic leadership rotation, segment-based partitioning, and Raft consensus for metadata coordination.

Walrus Demo

Key Features:

  • Automatic load balancing via segment-based leadership rotation
  • Fault tolerance through Raft consensus (3+ nodes)
  • Simple client protocol (connect to any node, auto-forwarding)
  • Sealed segments for historical reads from any replica
  • High-performance storage with io_uring on Linux

Walrus Architecture

Producers and consumers connect to any node (or via load balancer). The cluster automatically routes requests to the appropriate leader and manages segment rollovers for load distribution.

Walrus Node Architecture

Each node contains four key components: Node Controller (routing and lease management), Raft Engine (consensus for metadata), Cluster Metadata (replicated state), and Bucket Storage (Walrus engine with write fencing).

Node Controller

  • Routes client requests to appropriate segment leaders
  • Manages write leases (synced from cluster metadata every 100ms)
  • Tracks logical offsets for rollover detection
  • Forwards operations to remote leaders when needed

Raft Engine (Octopii)

  • Maintains Raft consensus for metadata changes only (not data!)
  • Handles leader election and log replication
  • Syncs metadata across all nodes via AppendEntries RPCs

Cluster Metadata (Raft State Machine)

  • Stores topic → segment → leader mappings
  • Tracks sealed segments and their entry counts
  • Maintains node addresses for routing
  • Replicated identically across all nodes
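
As a rough sketch of the shape of this replicated state (field and type names here are illustrative assumptions, not Walrus's actual definitions):

use std::collections::HashMap;

// Illustrative sketch of the Raft-replicated metadata; names are
// hypothetical and simplified, not the real Walrus types.
struct SegmentMeta {
    leader: u64,      // node id currently holding the write lease
    sealed: bool,     // set once the segment has rolled over
    entry_count: u64, // fixed at seal time, used to bound reads
}

struct ClusterMetadata {
    topics: HashMap<String, Vec<u64>>,   // topic -> ordered segment ids
    segments: HashMap<u64, SegmentMeta>, // segment id -> leader / sealed state
    nodes: HashMap<u64, String>,         // node id -> address for request forwarding
}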

Storage Engine

  • Wraps Walrus engine with lease-based write fencing
  • Only accepts writes if node holds lease for that segment
  • Stores actual data in WAL files on disk
  • Serves reads from any segment (sealed or active)
Quick Start

cd distributed-walrus

# Bootstrap a local cluster
make cluster-bootstrap

# Interact via CLI
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091

# In the CLI:

# create a topic named 'logs'
> REGISTER logs

# produce a message to the topic
> PUT logs "hello world"

# consume message from topic
> GET logs

# get the segment states of the topic
> STATE logs

# get cluster state
> METRICS

Simple length-prefixed text protocol over TCP:

Wire format:
  [4 bytes: length (little-endian)] [UTF-8 command]

Commands:
  REGISTER <topic>       → Create topic if missing
  PUT <topic> <payload>  → Append to topic
  GET <topic>            → Read next entry (shared cursor)
  STATE <topic>          → Get topic metadata (JSON)
  METRICS                → Get Raft metrics (JSON)

Responses:
  OK [payload]           → Success
  EMPTY                  → No data available (GET only)
  ERR <message>          → Error

See distributed-walrus/docs/cli.md for detailed CLI usage.
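
For illustration, a minimal client for this protocol might look like the sketch below. It assumes responses are framed the same way as requests (a 4-byte little-endian length followed by UTF-8 text); see docs/cli.md for the authoritative framing.

use std::io::{Read, Write};
use std::net::TcpStream;

// Frame a command as [4-byte little-endian length][UTF-8 text], send it,
// and read back one response framed the same way (an assumption; the
// request framing is documented, the response framing is inferred).
fn send_command(stream: &mut TcpStream, cmd: &str) -> std::io::Result<String> {
    let bytes = cmd.as_bytes();
    stream.write_all(&(bytes.len() as u32).to_le_bytes())?;
    stream.write_all(bytes)?;

    let mut len_buf = [0u8; 4];
    stream.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    let mut resp = vec![0u8; len];
    stream.read_exact(&mut resp)?;
    Ok(String::from_utf8_lossy(&resp).into_owned())
}

fn main() -> std::io::Result<()> {
    let mut stream = TcpStream::connect("127.0.0.1:9091")?;
    println!("{}", send_command(&mut stream, "REGISTER logs")?);
    println!("{}", send_command(&mut stream, "PUT logs hello")?);
    println!("{}", send_command(&mut stream, "GET logs")?);
    Ok(())
}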

Segment-Based Leadership Rotation

  • Topics split into segments (~1M entries each by default)
  • Each segment has a leader node that handles writes
  • Leadership rotates round-robin on segment rollover (see the sketch below)
  • Automatic load distribution across the cluster
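
The round-robin hand-off can be pictured with a small helper like this (illustrative only; the real selection logic lives in the node controller and is driven by Raft-replicated metadata):

// Hypothetical helper: pick the next segment leader round-robin.
fn next_leader(node_ids: &[u64], current_leader: u64) -> Option<u64> {
    let idx = node_ids.iter().position(|&n| n == current_leader)?;
    Some(node_ids[(idx + 1) % node_ids.len()])
}

fn main() {
    let nodes = [1, 2, 3];
    // On rollover, leadership for the new segment moves 1 -> 2 -> 3 -> 1 ...
    assert_eq!(next_leader(&nodes, 1), Some(2));
    assert_eq!(next_leader(&nodes, 3), Some(1));
}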

Lease-Based Write Fencing

  • Only the leader for a segment can write to it
  • Leases derived from Raft-replicated metadata
  • 100ms sync loop ensures lease consistency
  • Prevents split-brain writes during leadership changes
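
A rough sketch of that fencing check (types and names here are assumptions for illustration, not the actual storage API):

use std::collections::HashMap;

// Hypothetical lease table: segment id -> node id holding the write lease,
// derived from Raft-replicated metadata and refreshed every ~100ms.
type Leases = HashMap<u64, u64>;

// Accept the write only if this node holds the lease for the segment;
// otherwise tell the caller to forward to the current leaseholder.
fn try_append(local_node: u64, leases: &Leases, segment: u64, payload: &[u8]) -> Result<(), String> {
    match leases.get(&segment) {
        Some(&holder) if holder == local_node => {
            // ... append `payload` to the segment's WAL here ...
            let _ = payload;
            Ok(())
        }
        Some(&holder) => Err(format!("not leaseholder; forward to node {holder}")),
        None => Err("no lease recorded for segment".to_string()),
    }
}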
Sealed Segments

  • Old segments are "sealed" when rolled over
  • The original leader retains sealed data for reads
  • Reads can be served from any replica that has the data
  • No data movement required during rollover

Rollover Process

  • A monitor loop (10s interval) checks segment sizes
  • Triggers rollover when the entry threshold is exceeded
  • Proposes the metadata change via Raft
  • Leader transfer happens automatically
Configuration

Command-line flags:

  Flag                    Default       Description
  --node-id               (required)    Unique node identifier
  --data-dir              ./data        Root directory for storage
  --raft-port             6000          Raft/internal RPC port
  --raft-host             127.0.0.1     Raft bind address
  --raft-advertise-host   (raft-host)   Advertised Raft address
  --client-port           8080          Client TCP port
  --client-host           127.0.0.1     Client bind address
  --join                  -             Address of an existing node to join

Environment variables:

  Variable                      Default    Description
  WALRUS_MAX_SEGMENT_ENTRIES    1000000    Entries before segment rollover
  WALRUS_MONITOR_CHECK_MS       10000      Monitor loop interval (ms)
  WALRUS_DISABLE_IO_URING       -          Use mmap instead of io_uring
  RUST_LOG                      info       Log level (debug, info, warn)
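
For example, the rollover threshold could be read like this (an illustrative sketch; the actual parsing is internal to Walrus):

// Illustrative: read WALRUS_MAX_SEGMENT_ENTRIES, falling back to the
// documented default of 1,000,000 entries per segment.
fn max_segment_entries() -> u64 {
    std::env::var("WALRUS_MAX_SEGMENT_ENTRIES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1_000_000)
}

fn main() {
    println!("segment rollover threshold: {} entries", max_segment_entries());
}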

Comprehensive test suite included:

cd distributed-walrus

# Run all tests
make test

# Individual tests
make cluster-test-logs         # Basic smoke test
make cluster-test-rollover     # Segment rollover
make cluster-test-resilience   # Node failure recovery
make cluster-test-recovery     # Cluster restart persistence
make cluster-test-stress       # Concurrent writes
make cluster-test-multi-topic  # Multiple topics
Performance Characteristics

  • Write throughput: single writer per segment (lease-based)
  • Read throughput: scales with replicas (sealed segments)
  • Latency: ~1-2 RTT for forwarded ops, plus storage latency
  • Consensus overhead: metadata only (not on the data path)
  • Segment rollover: ~1M entries by default (~100MB, depending on payload size)

Walrus includes a formal TLA+ specification of the distributed data plane that models segment-based sharding, lease-based write fencing, and cursor advancement across sealed segments.

Specification: distributed-walrus/spec/DistributedWalrus.tla

Properties verified by the specification:

  • Domain Consistency: Topic metadata, WAL entries, and reader cursors stay synchronized
  • Single Writer per Segment: Only the designated leader can write to each segment
  • No Writes Past Open Segment: Closed segments remain immutable after rollover
  • Sealed Counts Stable: Entry counts for sealed segments match actual WAL contents
  • Read Cursor Bounds: Cursors never exceed segment boundaries or entry counts
  • Sequential Write Order: Entries within each segment maintain strict ordering
  • Rollover Progress: Segments exceeding the entry threshold eventually roll over
  • Read Progress: Available entries eventually get consumed by readers

The specification abstracts Raft consensus as a single authoritative metadata source and models Walrus storage as per-segment entry sequences. Model checking with TLC verifies correctness under concurrent operations.
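
As a rough illustration of the style of these invariants (notation simplified; symbols such as writes, leader, cursor, and entries are shorthand, not the spec's actual operators), the single-writer and cursor-bound properties look roughly like:

% Simplified renderings of two invariants; see DistributedWalrus.tla for the real definitions.
\forall s \in Segments:\quad writes(s) \subseteq \{\, leader(s) \,\}                    % single writer per segment
\forall t \in Topics,\ s \in segments(t):\quad 0 \le cursor(t, s) \le entries(s)        % read cursor bounds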

Storage Engine Benchmarks

The underlying storage engine delivers exceptional performance:

Walrus vs RocksDB vs Kafka - No Fsync

  System    Avg Throughput (writes/s)   Avg Bandwidth (MB/s)   Max Throughput (writes/s)   Max Bandwidth (MB/s)
  Walrus    1,205,762                   876.22                 1,593,984                   1,158.62
  Kafka     1,112,120                   808.33                 1,424,073                   1,035.74
  RocksDB   432,821                     314.53                 1,000,000                   726.53

Walrus vs RocksDB vs Kafka - With Fsync

  System    Avg Throughput (writes/s)   Avg Bandwidth (MB/s)   Max Throughput (writes/s)   Max Bandwidth (MB/s)
  RocksDB   5,222                       3.79                   10,486                      7.63
  Walrus    4,980                       3.60                   11,389                      8.19
  Kafka     4,921                       3.57                   11,224                      8.34

Benchmarks compare a single Kafka broker (no replication, no networking overhead) and RocksDB's WAL against Walrus's legacy append_for_topic() endpoint using pwrite() syscalls (no io_uring batching).

Documentation

  • Architecture Deep Dive - Detailed component interactions, data flow diagrams, startup sequence, lease synchronization, rollover mechanics, and failure scenarios
  • CLI Guide - Interactive CLI usage and commands
  • System Documentation - Full system documentation

Using Walrus as a Library

The core Walrus storage engine is also available as a standalone Rust library for embedded use cases:

The crate is published on crates.io as walrus-rust. Add it to Cargo.toml:

[dependencies]
walrus-rust = "0.2.0"
use walrus_rust::{Walrus, ReadConsistency};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a new WAL instance
    let wal = Walrus::new()?;

    // Write data to a topic
    wal.append_for_topic("my-topic", b"Hello, Walrus!")?;

    // Read data from the topic
    if let Some(entry) = wal.read_next("my-topic", true)? {
        println!("Read: {:?}", String::from_utf8_lossy(&entry.data));
    }

    Ok(())
}

See the standalone library documentation for single node usage, configuration options, and API reference.

We welcome patches; check CONTRIBUTING.md for the workflow.

This project is licensed under the MIT License; see the LICENSE file for details.

Changelog

  • New: Distributed message streaming platform with Raft consensus
  • New: Segment-based leadership rotation and load balancing
  • New: Automatic rollover and lease-based write fencing
  • New: TCP client protocol with simple text commands
  • New: Interactive CLI for cluster interaction
  • New: Comprehensive test suite for distributed scenarios
  • New: Atomic batch write operations (batch_append_for_topic)
  • New: Batch read operations (batch_read_for_topic)
  • New: io_uring support for batch operations on Linux
  • New: Dual storage backends (FD backend with pread/pwrite, mmap backend)
  • New: Namespace isolation via _for_key constructors
  • New: FsyncSchedule::SyncEach and FsyncSchedule::NoFsync modes
  • Improved: Comprehensive documentation with architecture and design docs
  • Improved: Enhanced benchmarking suite with batch operation benchmarks
  • Fixed: Tail read offset tracking in concurrent scenarios

  • Initial release
  • Core WAL functionality
  • Topic-based organization
  • Configurable consistency modes
  • Comprehensive benchmark suite
  • Memory-mapped I/O implementation
  • Persistent read offset tracking
