Glass Cannon: Extreme-Performance HTTP/1 Load Generator for Linux

Original link: https://www.gcannon.org/

## Glass Cannon: A High-Performance Load Generator

Glass Cannon achieves exceptional I/O performance by bypassing traditional load-testing approaches and talking directly to the Linux kernel's `io_uring` interface. Unlike tools such as `wrk` and `hey`, which rely on repeated system calls through `epoll`, Glass Cannon uses shared-memory ring buffers for submission and completion, dramatically reducing context switches.

Its architecture has a main thread spawning multiple worker threads, each owning an independent `io_uring` ring, its own connections, and pre-built request buffers, eliminating inter-worker communication and locking. Key features include a pre-registered buffer pool for zero-copy receives, "multishot" receive for continuous data delivery, and request pipelining to maximize throughput.

Each `io_uring` submission carries packed user data containing the operation type, a generation counter (to handle reconnects), and a connection index. The core loop processes completions in batches (up to 2048) before submitting new operations, minimizing kernel transitions. This batching extends to pipeline refills, ensuring requests are always pre-built and ready to send.

Ultimately, Glass Cannon's design prioritizes minimizing system calls and maximizing kernel-side processing, yielding significantly higher performance.


Original article

Glass Cannon uses a fundamentally different approach to I/O than traditional load generators. Instead of one thread per connection or async callbacks, it talks directly to the Linux kernel's io_uring interface.

The Traditional Way

Most load generators (wrk, hey, ab) use epoll — the application asks the kernel "which sockets are ready?", then makes individual read() and write() system calls for each one. Every syscall means a context switch between your program and the kernel.

The Glass Cannon Way

io_uring uses two shared memory ring buffers between your program and the kernel. You write requests into the submission queue, and the kernel writes results into the completion queue. No system calls per operation. The kernel processes batches of I/O while your code processes batches of results.

```
Your Program                              Linux Kernel

submit 2048 operations                    process all I/O
(send, recv, connect...)                  in kernel space

   ──── Submission Queue ────>
   <──── Completion Queue ────            batch completions
                                          into shared ring
process completions
(no syscall needed)

Minimal context switches in steady state
```

Architecture

The main thread spawns N worker threads. Each worker owns an independent io_uring ring, a set of connections, and pre-built request buffers. There is zero communication between workers during the benchmark — no mutexes, no atomics, no shared state.

```
main thread
  └── spawn workers, wait for duration, aggregate stats
        │
  worker 0             worker 1             worker N
  io_uring ring        io_uring ring        io_uring ring
  buffer ring          buffer ring          buffer ring
  connections[0..K]    connections[0..K]    connections[0..K]
        │                    │                    │
  connect → send → recv → count → refill → ...
```

Provided Buffer Rings

Pre-registered buffer pool managed by the kernel. Eliminates per-recv buffer allocation and reduces submission overhead.

Multishot Receive

One submission arms continuous receive on a socket. The kernel keeps delivering data without repeated requests.

Request Pipelining

N copies of the HTTP request are pre-built and concatenated at startup. One send call pushes the entire pipeline.

Generation Counters

Each connection has a generation counter packed into io_uring user-data. Stale completions from reconnected sockets are safely ignored.

Worker Initialization

Each worker thread sets up its own isolated environment before entering the event loop. Nothing is shared between workers — each one is a fully self-contained load generator.

1. Create the io_uring ring

A ring with 4096 submission queue entries is created with IORING_SETUP_SINGLE_ISSUER (only this thread will ever submit) and IORING_SETUP_DEFER_TASKRUN (kernel defers work until the next submission, reducing interrupts). These two flags together eliminate all internal locking in the kernel's io_uring path.

2. Set up the provided buffer ring

4096 receive buffers (4KB each, ~16MB total) are pre-registered with the kernel via io_uring_setup_buf_ring(). When a recv completion fires, the kernel picks a buffer from this pool automatically — no userspace allocation or buffer address needed in the submission. After processing, the buffer is returned to the ring for reuse.

3. Allocate connections and connect

An array of gc_conn_t structs is allocated — one per connection assigned to this worker. Each connection gets a non-blocking TCP socket with TCP_NODELAY, and an io_uring_prep_connect() is submitted for each. All connects happen in parallel via the ring.

The Event Loop

The core of Glass Cannon is a tight loop that processes io_uring completions in batches. Each iteration drains up to 2048 completions, processes them, then submits new operations — all in a single kernel transition.

```
worker_loop()
  io_uring_peek_batch_cqe()  ── drain up to 2048 completions
  ├── no completions? io_uring_submit_and_wait_timeout()  ── submit pending, wait 1ms
  └── for each CQE:
        ├── unpack user_data → kind, gen, conn_idx
        ├── stale gen?  → skip (return buffer if recv)
        ├── UD_CONNECT  → arm multishot recv → fire requests
        ├── UD_SEND     → partial? resubmit remainder : refill pipeline
        ├── UD_RECV     → parse response(s) → record latency → refill
        └── UD_CANCEL   → no-op (cleanup from reconnect)
  io_uring_cq_advance()  ── mark CQEs as consumed
  io_uring_submit()      ── flush all new SQEs to kernel
  └── loop until g_running == 0
```

User-Data Packing

Every io_uring submission carries a 64-bit user-data value that comes back with the completion. Glass Cannon packs three fields into this value to identify what each completion means:

```
64-bit user_data layout
┌──────────────┬──────────────┬──────────────────────────────┐
│  kind (16b)  │   gen (16b)  │        conn_idx (32b)        │
└──────────────┴──────────────┴──────────────────────────────┘
kind     ─ operation type: CONNECT(1), RECV(2), SEND(3), CANCEL(4)
gen      ─ 16-bit generation counter, incremented on each reconnect
conn_idx ─ index into the worker's connection array
```

The generation counter solves a critical race: when a connection is closed and a new socket is opened on the same index, completions from the old socket may still arrive. The worker compares cqe_gen != c->gen and silently discards stale completions. This avoids use-after-free bugs without any locking.

Connection Lifecycle

Each connection follows a state machine driven entirely by io_uring completions. No threads block, no callbacks are registered — just CQEs moving state forward.

```
HTTP connection

CONNECTING ──── connect() CQE ────→ ACTIVE
                                      arm_recv_multishot()
                                      fire_requests(pipeline_depth)
                                    ┌────┴────┐
                               SEND │──→ partial? resubmit remainder
                               RECV │──→ parse response(s)
                                    │    record latency
                                    │    fire_requests(completed)
                                    └────┬────┘
                     quota reached? ─────┘
CLOSED ←── close + cancel recv
   └── reconnect() ──→ gen++ → CONNECTING

WebSocket connection

CONNECTING ── connect() CQE ──→ WS_UPGRADING
                                  fire_ws_upgrade()
                                  RECV: parse HTTP 101
                              → ACTIVE
                                  fire_requests() sends masked WS frames
                                  RECV: ws_parse_frames() counts echoes
                                  same latency tracking as HTTP
```

How Recv Works: Multishot + Provided Buffers

Traditional recv requires submitting a new SQE for every chunk of data you want to read, and you must tell the kernel which buffer to write into. Glass Cannon uses two io_uring features to eliminate both costs:

Multishot Recv

A single io_uring_prep_recv_multishot() call arms continuous receive on a socket. The kernel fires a CQE for each chunk of data that arrives, without the application resubmitting. The IORING_CQE_F_MORE flag on each CQE indicates whether the multishot is still armed. Only when it's not (socket closed, error, or buffer exhaustion) does the application need to rearm.

Provided Buffers

Instead of specifying a buffer address in each recv SQE, the application registers a pool of buffers with the kernel upfront. The kernel picks a free buffer from the pool when data arrives and reports which one it chose via the IORING_CQE_F_BUFFER flag and buffer ID in the CQE. After processing, the application returns the buffer to the pool. No allocation, no copying.

Combined, these mean a single SQE handles all future data on a connection, with zero-copy buffer management. The application only touches buffers when it has data to parse.

Pipeline Refill

Glass Cannon keeps the pipeline full at all times. When responses arrive, it immediately sends new requests to replace them:

```
pipeline_depth = 4

fire_requests(4)          ── initial fill: send 4 pipelined requests
                             pipeline_inflight = 4
RECV: 2 responses parsed  ── pipeline_inflight drops to 2
fire_requests(2)          ── refill: send 2 more to restore depth
                             pipeline_inflight = 4 ── always at target depth
```

All requests in a pipeline batch are pre-built and concatenated at startup into a single buffer. Sending N pipelined requests is one io_uring_prep_send() call — no per-request formatting or allocation. If the kernel can't send the full buffer in one shot (partial send), the worker resubmits the remainder automatically.

Batch Processing

The key to Glass Cannon's throughput is batching at every level. Instead of processing one event at a time:

CQE Batching

io_uring_peek_batch_cqe() drains up to 2048 completions in one call. All are processed before any new submissions, amortizing the cost of the kernel transition across thousands of operations.

SQE Batching

New operations (sends, recv rearms, connects) are queued as SQEs during CQE processing, then flushed to the kernel in a single io_uring_submit() at the end of each batch. One syscall for potentially thousands of new operations.

Pipeline Batching

N HTTP requests are concatenated and sent in a single send operation. The response parser handles multiple responses per recv buffer, matching them to pipelined requests in FIFO order.
