一个对 io_uring 和 kqueue 的程序员友好型 I/O 抽象层

一个对 io_uring 和 kqueue 的程序员友好型 I/O 抽象层
A programmer-friendly I/O abstraction over io_uring and kqueue (2022)

原始链接: https://tigerbeetle.com/blog/2022-11-23-a-friendly-abstraction-over-iouring-and-kqueue/

## 从阻塞I/O到事件循环：性能演进本次讨论探讨了I/O处理的演进，以提升性能。从传统的阻塞I/O（使用`open()`、`read()`、`write()`、`socket()`、`connect()`）开始，逐步进展到更高效的方法。虽然非阻塞I/O避免了无限期等待，但重复检查就绪状态代价高昂，因为存在系统调用开销。像Linux的`io_uring`和FreeBSD/macOS的`kqueue`等解决方案允许提交一批I/O请求到内核，减少上下文切换。`io_uring`更进一步，使内核能够*直接*执行读/写操作，最大限度地减少用户空间的参与。这引出了事件循环的概念——一个中央调度器，用于调度I/O操作并在完成时调用回调函数。这抽象了内核特定的细节（自动选择`io_uring`或`kqueue`），并允许从代码库中的任何位置发起I/O操作。像libuv（Node.js使用）和TigerBeetle这样的库是这种方法的示例。核心思想是提交带有用户数据（包含回调指针）的请求，并接收带有相同数据的完成事件，从而实现异步操作。这种架构通常是单线程的，以实现简单性和确定性，但可以适应多线程场景，以最大限度地提高并行工作负载的吞吐量。作者暗示可能会将此I/O抽象作为C API发布，以提高更广泛的语言兼容性。

## 虎甲虫 I/O 抽象层与 Hacker News 讨论一个基于 `io_uring` 和 `kqueue` 构建的、更易于程序员使用的 I/O 抽象层（tigerbeetle.com）最近在 Hacker News 上分享，引发了关于系统调用效率和文件 I/O 的讨论。核心观点是，虽然非阻塞 I/O (`O_NONBLOCK`) 对于套接字等流很有用，但对于传统文件提供的益处有限。由于上下文切换和缓存未命中，系统调用仍然很昂贵，可能会超过 I/O 本身的成本。文件通常*总是*准备好进行读/写（除非为空或已满），因此持续轮询效率低下。评论还涉及了相关项目，例如 `libxev`（一个受虎甲虫启发的 Zig 语言事件循环），以及块设备调度和 HDD 控制器的复杂性。几位用户指出，对于已经使用低级 API（如 `io_uring`）的用户来说，额外的抽象层可能是不必要的，标准的 UNIX 文件 API 通常就足够了。

原文

Consider this tale of I/O and performance. We’ll start with blocking I/O, explore io_uring and kqueue, and take home an event loop very similar to some software you may find familiar.

This is a twist on King’s talk at Software You Can Love Milan ’22.

When you want to read from a file you might open() and then call read() as many times as necessary to fill a buffer of bytes from the file. And in the opposite direction, you call write() as many times as needed until everything is written. It’s similar for a TCP client with sockets, but instead of open() you first call socket() and then connect() to your server. Fun stuff.

In the real world though you can’t always read everything you want immediately from a file descriptor. Nor can you always write everything you want immediately to a file descriptor.

You can switch a file descriptor into non-blocking mode so the call won’t block while data you requested is not available. But system calls are still expensive, incurring context switches and cache misses. In fact, networks and disks have become so fast that these costs can start to approach the cost of doing the I/O itself. For the duration of time a file descriptor is unable to read or write, you don’t want to waste time continuously retrying read or write system calls.

So you switch to io_uring on Linux or kqueue on FreeBSD/macOS. (I’m skipping the generation of epoll/select users.) These APIs let you submit requests to the kernel to learn about readiness: when a file descriptor is ready to read or write. You can send readiness requests in batches (also referred to as queues). Completion events, one for each submitted request, are available in a separate queue.

Being able to batch I/O like this is especially important for TCP servers that want to multiplex reads and writes for multiple connected clients.

However in io_uring, you can even go one step further. Instead of having to call read() or write() in userland after a readiness event, you can request that the kernel do the read() or write() itself with a buffer you provide. Thus almost all of your I/O is done in the kernel, amortizing the overhead of system calls.

If you haven’t seen io_uring or kqueue before, you’d probably like an example! Consider this code: a simple, minimal, not-production-ready TCP echo server.

const std = @import("std");
const os = std.os;
const linux = os.linux;
const allocator = std.heap.page_allocator;

const State = enum{ accept, recv, send };
const Socket = struct {
    handle: os.socket_t,
    buffer: [1024]u8,
    state: State,
};

pub fn main() !void {
    const entries = 32;
    const flags = 0;
    var ring = try linux.IO_Uring.init(entries, flags);
    defer ring.deinit();

    var server: Socket = undefined;
    server.handle = try os.socket(os.AF.INET, os.SOCK.STREAM, os.IPPROTO.TCP);
    defer os.closeSocket(server.handle);

    const port = 12345;
    var addr = std.net.Address.initIp4(.{127, 0, 0, 1}, port);
    var addr_len: os.socklen_t = addr.getOsSockLen();

    try os.setsockopt(server.handle, os.SOL.SOCKET, os.SO.REUSEADDR, &std.mem.toBytes(@as(c_int, 1)));
    try os.bind(server.handle, &addr.any, addr_len);
    const backlog = 128;
    try os.listen(server.handle, backlog);

    server.state = .accept;
    _ = try ring.accept(@ptrToInt(&server), server.handle, &addr.any, &addr_len, 0);

    while (true) {
        _ = try ring.submit_and_wait(1);

        while (ring.cq_ready() > 0) {
            const cqe = try ring.copy_cqe();
            var client = @intToPtr(*Socket, @intCast(usize, cqe.user_data));

            if (cqe.res < 0) std.debug.panic("{}({}): {}", .{
                client.state,
                client.handle,
                @intToEnum(os.E, -cqe.res),
            });

            switch (client.state) {
                .accept => {
                    client = try allocator.create(Socket);
                    client.handle = @intCast(os.socket_t, cqe.res);
                    client.state = .recv;
                    _ = try ring.recv(@ptrToInt(client), client.handle, .{.buffer = &client.buffer}, 0);
                    _ = try ring.accept(@ptrToInt(&server), server.handle, &addr.any, &addr_len, 0);
                },
                .recv => {
                    const read = @intCast(usize, cqe.res);
                    client.state = .send;
                    _ = try ring.send(@ptrToInt(client), client.handle, client.buffer[0..read], 0);
                },
                .send => {
                    os.closeSocket(client.handle);
                    allocator.destroy(client);
                },
            }
        }
    }
}

This is a great, minimal example. But notice that this code ties io_uring behavior directly to business logic (in this case, handling echoing data between request and response). It is fine for a small example like this. But in a large application you might want to do I/O throughout the code base, not just in one place. You might not want to keep adding business logic to this single loop.

Instead, you might want to be able to schedule I/O and pass a callback (and sometimes with some application context) to be called when the event is complete.

The interface might look like:

io_dispatch.dispatch({
    // some big struct/union with relevant fields for all event types
}, my_callback);

This is great! Now your business logic can schedule and handle I/O no matter where in the code base it is.

Under the hood it can decide whether to use io_uring or kqueue depending on what kernel it’s running on. The dispatch can also batch these individual calls through io_uring or kqueue to amortize system calls. The application no longer needs to know the details.

Additionally, we can use this wrapper to stop thinking about readiness events, just I/O completion. That is, if we dispatch a read event, the io_uring implementation would actually ask the kernel to read data into a buffer. Whereas the kqueue implementation would send a “read” readiness event, do the read back in userland, and then call our callback.

And finally, now that we’ve got this central dispatcher, we don’t need spaghetti code in a loop switching on every possible submission and completion event.

Every time we call io_uring or kqueue we both submit event requests and poll for completion events. The io_uring and kqueue APIs tie these two actions together in the same system call.

To sync our requests to io_uring or kqueue we’ll build a flush function that submits requests and polls for completion events. (In the next section we’ll talk about how the user of the central dispatch learns about completion events.)

To make flush more convenient, we’ll build a nice wrapper around it so that we can submit as many requests (and process as many completion events) as possible. To avoid accidentally blocking indefinitely we’ll also introduce a time limit. We’ll call the wrapper run_for_ns.

Finally we’ll put the user in charge of setting up a loop to call this run_for_ns function, independent of normal program execution.

This is now your traditional event loop.

You may have noticed that in the API above we passed a callback. The idea is that after the requested I/O has completed, our callback should be invoked. But the question remains: how to track this callback between the submission and completion queue?

Thankfully, io_uring and kqueue events have user data fields. The user data field is opaque to the kernel. When a submitted event completes, the kernel sends a completion event back to userland containing the user data value from the submission event.

We can store the callback in the user data field by setting it to the callback’s pointer casted to an integer. When the completion for a requested event comes up, we cast from the integer in the user data field back to the callback pointer. Then, we invoke the callback.

As described above, the struct for io_dispatch.dispatch could get quite large handling all the different kinds of I/O events and their arguments. We could make our API a little more expressive by creating wrapper functions for each event type.

So if we wanted to schedule a read function we could call:

io_dispatch.read(fd, &buf, nBytesToRead, callback);

io_dispatch.write(fd, buf, nBytesToWrite, callback);

intrusive linked list to contain all request context, including the callback. The latter is what we do in TigerBeetle.

Put another way: every time code calls io_dispatch, we’ll try to immediately submit the requested event to io_uring or kqueue. But if there’s no room, we store the event in an overflow queue.

The overflow queue needs to be processed eventually, so we update our flush function (described in Callbacks and context above) to pull as many events from our overflow queue before submitting a batch to io_uring or kqueue.

We’ve now built something similar to libuv, the I/O library that Node.js uses. And if you squint, it is basically TigerBeetle’s I/O library! (And interestingly enough, TigerBeetle’s I/O code was adopted into Bun! Open-source for the win!)

Let’s check out how the Darwin version of TigerBeetle’s I/O library (with kqueue) differs from the Linux version. As mentioned, the complete send call in the Darwin implementation waits for file descriptor readiness (through kqueue). Once ready, the actual send call is made back in userland:

pub fn send(
    self: *IO,
    comptime Context: type,
    context: Context,
    comptime callback: fn (
        context: Context,
        completion: *Completion,
        result: SendError!usize,
    ) void,
    completion: *Completion,
    socket: os.socket_t,
    buffer: []const u8,
) void {
    self.submit(
        context,
        callback,
        completion,
        .send,
        .{
            .socket = socket,
            .buf = buffer.ptr,
            .len = @intCast(u32, buffer_limit(buffer.len)),
        },
        struct {
            fn do_operation(op: anytype) SendError!usize {
                return os.send(op.socket, op.buf[0..op.len], 0);
            }
        },
    );
}

Linux version (with io_uring) where the kernel handles everything and there is no send system call in userland:

pub fn send(
    self: *IO,
    comptime Context: type,
    context: Context,
    comptime callback: fn (
        context: Context,
        completion: *Completion,
        result: SendError!usize,
    ) void,
    completion: *Completion,
    socket: os.socket_t,
    buffer: []const u8,
) void {
    completion.* = .{
        .io = self,
        .context = context,
        .callback = struct {
            fn wrapper(ctx: ?*anyopaque, comp: *Completion, res: *const anyopaque) void {
                callback(
                    @intToPtr(Context, @ptrToInt(ctx)),
                    comp,
                    @intToPtr(*const SendError!usize, @ptrToInt(res)).*,
                );
            }
        }.wrapper,
        .operation = .{
            .send = .{
                .socket = socket,
                .buffer = buffer,
            },
        },
    };
    // Fill out a submission immediately if possible, otherwise adds to overflow buffer
    self.enqueue(completion);
}

Linux and macOS for event processing. Look at run_for_ns on Linux and macOS for the public API users must call. And finally, look at what puts this all into practice, the loop calling run_for_ns in src/main.zig.

We’ve come this far and you might be wondering — what about cross-platform support for Windows? The good news is that Windows also has a completion based system similar to io_uring but without batching, called IOCP. And for bonus points, TigerBeetle provides the same I/O abstraction over it! But it’s enough to cover just Linux and macOS in this post. :)

In both this blog post and in TigerBeetle, we implemented a single-threaded event loop. Keeping I/O code single-threaded in userspace is beneficial (whether or not I/O processing is single-threaded in the kernel is not our concern). It’s the simplest code and best for workloads that are not embarrassingly parallel. It is also best for determinism, which is integral to the design of TigerBeetle because it enables us to do Deterministic Simulation Testing

But there are other valid architectures for other workloads.

For workloads that are embarrassingly parallel, like many web servers, you could instead use multiple threads where each thread has its own queue. In optimal conditions, this architecture has the highest I/O throughput possible.

But if each thread has its own queue, individual threads can become starved if an uneven amount of work is scheduled on one thread. In the case of dynamic amounts of work, the better architecture would be to have a single queue but multiple worker threads doing the work made available on the queue.

Hey, maybe we’ll split this out so you can use it too. It’s written in Zig so we can easily expose a C API. Any language with a C foreign function interface (i.e. every language) should work well with it. Keep an eye on our GitHub. :)

Additional resources:

一个对 io_uring 和 kqueue 的程序员友好型 I/O 抽象层 A programmer-friendly I/O abstraction over io_uring and kqueue (2022)

一个对 io_uring 和 kqueue 的程序员友好型 I/O 抽象层
A programmer-friendly I/O abstraction over io_uring and kqueue (2022)