Glass Cannon uses a fundamentally different approach to I/O than traditional load generators. Instead of one thread per connection or async callbacks, it talks directly to the Linux kernel's io_uring interface.
The Traditional Way
Most load generators (wrk, hey, ab) use epoll — the application asks the kernel "which sockets are ready?", then makes individual read() and write() system calls for each one. Every syscall means a context switch between your program and the kernel.
The Glass Cannon Way
io_uring uses two shared memory ring buffers between your program and the kernel. You write requests into the submission queue, and the kernel writes results into the completion queue. No system calls per operation. The kernel processes batches of I/O while your code processes batches of results.
Your Program                                        Linux Kernel

submit 2048 operations     ── Submission Queue ──>    process all I/O
(send, recv, connect...)                              in kernel space

process completions        <── Completion Queue ──    batch completions
(no syscall needed)                                   into shared ring

            Minimal context switches in steady state
Architecture
The main thread spawns N worker threads. Each worker owns an independent io_uring ring, a set of connections, and pre-built request buffers. There is zero communication between workers during the benchmark — no mutexes, no atomics, no shared state.
main thread
  │
  ├── spawn workers, wait for duration, aggregate stats
  │
  worker 0            worker 1            worker N
    │                   │                   │
  io_uring ring       io_uring ring       io_uring ring
  buffer ring         buffer ring         buffer ring
  connections[0..K]   connections[0..K]   connections[0..K]
    │                   │                   │
  connect → send → recv → count → refill → ...
Provided Buffer Rings
Pre-registered buffer pool managed by the kernel. Eliminates per-recv buffer allocation and reduces submission overhead.
Multishot Receive
One submission arms continuous receive on a socket. The kernel keeps delivering data without repeated requests.
Request Pipelining
N copies of the HTTP request are pre-built and concatenated at startup. One send call pushes the entire pipeline.
Generation Counters
Each connection has a generation counter packed into io_uring user-data. Stale completions from reconnected sockets are safely ignored.
Worker Initialization
Each worker thread sets up its own isolated environment before entering the event loop. Nothing is shared between workers — each one is a fully self-contained load generator.
1. Create the io_uring ring
A ring with 4096 submission queue entries is created with IORING_SETUP_SINGLE_ISSUER (only this thread will ever submit) and IORING_SETUP_DEFER_TASKRUN (kernel defers work until the next submission, reducing interrupts). These two flags together eliminate all internal locking in the kernel's io_uring path.
2. Set up the provided buffer ring
4096 receive buffers (4KB each, ~16MB total) are pre-registered with the kernel via io_uring_setup_buf_ring(). When a recv completion fires, the kernel picks a buffer from this pool automatically — no userspace allocation or buffer address needed in the submission. After processing, the buffer is returned to the ring for reuse.
3. Allocate connections and connect
An array of gc_conn_t structs is allocated — one per connection assigned to this worker. Each connection gets a non-blocking TCP socket with TCP_NODELAY, and an io_uring_prep_connect() is submitted for each. All connects happen in parallel via the ring.
The Event Loop
The core of Glass Cannon is a tight loop that processes io_uring completions in batches. Each iteration drains up to 2048 completions from the ring (peeking requires no syscall), processes them, then flushes all newly queued operations to the kernel in a single submit.
worker_loop()
  │
  ├── io_uring_peek_batch_cqe() ── drain up to 2048 completions
  │     │
  │     ├── no completions?
  │     │     io_uring_submit_and_wait_timeout() ── submit pending, wait 1ms
  │     │
  │     └── for each CQE:
  │           ├── unpack user_data → kind, gen, conn_idx
  │           ├── stale gen? → skip (return buffer if recv)
  │           ├── UD_CONNECT → arm multishot recv → fire requests
  │           ├── UD_SEND → partial? resubmit remainder : refill pipeline
  │           ├── UD_RECV → parse response(s) → record latency → refill
  │           └── UD_CANCEL → no-op (cleanup from reconnect)
  │
  ├── io_uring_cq_advance() ── mark CQEs as consumed
  ├── io_uring_submit() ── flush all new SQEs to kernel
  │
  └── loop until g_running == 0
User-Data Packing
Every io_uring submission carries a 64-bit user-data value that comes back with the completion. Glass Cannon packs three fields into this value to identify what each completion means:
64-bit user_data layout

┌──────────────┬──────────────┬──────────────────────────────┐
│  kind (16b)  │   gen (16b)  │        conn_idx (32b)        │
└──────────────┴──────────────┴──────────────────────────────┘

kind     ─ operation type: CONNECT(1), RECV(2), SEND(3), CANCEL(4)
gen      ─ 16-bit generation counter, incremented on each reconnect
conn_idx ─ index into the worker's connection array
The generation counter solves a critical race: when a connection is closed and a new socket is opened on the same index, completions from the old socket may still arrive. The worker compares cqe_gen != c->gen and silently discards stale completions. This avoids use-after-free bugs without any locking.
Connection Lifecycle
Each connection follows a state machine driven entirely by io_uring completions. No threads block, no callbacks are registered — just CQEs moving state forward.
HTTP connection

CONNECTING ──── connect() CQE ────→ ACTIVE
                                      │ arm_recv_multishot()
                                      │ fire_requests(pipeline_depth)
                                      │
                                 ┌────┴────┐
                                 │  SEND   │──→ partial? resubmit remainder
                                 │  RECV   │──→ parse response(s)
                                 │         │    record latency
                                 │         │    fire_requests(completed)
                                 └────┬────┘
                                      │
                      quota reached? ─┘
                           │
CLOSED ←── close + cancel recv ─── reconnect() ──→ gen++ → CONNECTING

WebSocket connection

CONNECTING ── connect() CQE ──→ WS_UPGRADING
                                  │ fire_ws_upgrade()
                                  │ RECV: parse HTTP 101
                                  ▼
                                ACTIVE
                                  │ fire_requests() sends masked WS frames
                                  │ RECV: ws_parse_frames() counts echoes
                                  │ same latency tracking as HTTP
How Recv Works: Multishot + Provided Buffers
Traditional recv requires submitting a new SQE for every chunk of data you want to read, and you must tell the kernel which buffer to write into. Glass Cannon uses two io_uring features to eliminate both costs:
Multishot Recv
A single io_uring_prep_recv_multishot() call arms continuous receive on a socket. The kernel fires a CQE for each chunk of data that arrives, without the application resubmitting. The IORING_CQE_F_MORE flag on each CQE indicates whether the multishot is still armed. Only when it's not (socket closed, error, or buffer exhaustion) does the application need to rearm.
Provided Buffers
Instead of specifying a buffer address in each recv SQE, the application registers a pool of buffers with the kernel upfront. The kernel picks a free buffer from the pool when data arrives and reports which one it chose via the IORING_CQE_F_BUFFER flag and buffer ID in the CQE. After processing, the application returns the buffer to the pool. No allocation, no copying.
Combined, these mean a single SQE handles all future data on a connection, with allocation-free buffer management. The application only touches buffers when it has data to parse.
Pipeline Refill
Glass Cannon keeps the pipeline full at all times. When responses arrive, it immediately sends new requests to replace them:
pipeline_depth = 4

fire_requests(4) ── initial fill: send 4 pipelined requests
  │                  pipeline_inflight = 4
  │
RECV: 2 responses parsed ── pipeline_inflight drops to 2
  │
fire_requests(2) ── refill: send 2 more to restore depth
  │                  pipeline_inflight = 4 ── always at target depth
All requests in a pipeline batch are pre-built and concatenated at startup into a single buffer. Sending N pipelined requests is one io_uring_prep_send() call — no per-request formatting or allocation. If the kernel can't send the full buffer in one shot (partial send), the worker resubmits the remainder automatically.
Batch Processing
The key to Glass Cannon's throughput is batching at every level. Instead of processing one event at a time:
CQE Batching
io_uring_peek_batch_cqe() drains up to 2048 completions in one call. All are processed before any new submissions, amortizing the cost of the kernel transition across thousands of operations.
SQE Batching
New operations (sends, recv rearms, connects) are queued as SQEs during CQE processing, then flushed to the kernel in a single io_uring_submit() at the end of each batch. One syscall for potentially thousands of new operations.
Pipeline Batching
N HTTP requests are concatenated and sent in a single send operation. The response parser handles multiple responses per recv buffer, matching them to pipelined requests in FIFO order.