A programmer-friendly I/O abstraction over io_uring and kqueue (2022)

Original link: https://tigerbeetle.com/blog/2022-11-23-a-friendly-abstraction-over-iouring-and-kqueue/

## From Blocking I/O to the Event Loop: A Performance Evolution

This post traces the evolution of I/O handling in pursuit of performance. It starts with traditional blocking I/O (using `open()`, `read()`, `write()`, `socket()`, `connect()`) and works up to more efficient approaches. While non-blocking I/O avoids waiting indefinitely, repeatedly checking for readiness is expensive because of system call overhead.

Solutions such as Linux's `io_uring` and FreeBSD/macOS's `kqueue` let you submit a batch of I/O requests to the kernel, reducing context switches. `io_uring` goes a step further by letting the kernel perform the reads/writes *directly*, minimizing userspace involvement.

This leads to the concept of an event loop: a central scheduler that dispatches I/O operations and invokes callbacks on completion. It abstracts away kernel-specific details (selecting `io_uring` or `kqueue` automatically) and allows I/O to be initiated from anywhere in a codebase. Libraries like libuv (used by Node.js) and TigerBeetle's I/O layer exemplify this approach.

The core idea is to submit requests tagged with user data (containing a callback pointer) and receive completion events carrying the same data, enabling asynchronous operation. This architecture is typically single-threaded for simplicity and determinism, but it can be adapted to multi-threaded scenarios to maximize throughput for parallel workloads. The author hints at possibly releasing this I/O abstraction as a C API for broader language compatibility.

## TigerBeetle's I/O Abstraction and the Hacker News Discussion

A programmer-friendly I/O abstraction built over `io_uring` and `kqueue` (tigerbeetle.com) was recently shared on Hacker News, sparking discussion about system call efficiency and file I/O.

A central point: while non-blocking I/O (`O_NONBLOCK`) is useful for streams such as sockets, it offers limited benefit for regular files. System calls remain expensive due to context switches and cache misses, and their cost can exceed that of the I/O itself. Regular files are almost *always* ready for reads/writes (unless empty or full), so polling them continuously is inefficient.

Comments also touched on related projects, such as `libxev` (a TigerBeetle-inspired event loop written in Zig), and the complexities of block device scheduling and HDD controllers.

Several users noted that for those already working with low-level APIs like `io_uring`, an extra abstraction layer may be unnecessary, and that the standard UNIX file APIs are often sufficient.

## Original Article

Consider this tale of I/O and performance. We’ll start with blocking I/O, explore io_uring and kqueue, and take home an event loop very similar to some software you may find familiar.

This is a twist on King’s talk at Software You Can Love Milan ’22.

When you want to read from a file you might open() and then call read() as many times as necessary to fill a buffer of bytes from the file. And in the opposite direction, you call write() as many times as needed until everything is written. It’s similar for a TCP client with sockets, but instead of open() you first call socket() and then connect() to your server. Fun stuff.
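Here's a minimal sketch of that blocking style in C (the post's own examples are in Zig; the file path here is illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Blocking file read: each read() may return fewer bytes than asked for,
    // so we call it as many times as necessary to fill the buffer.
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) return 1;

    char buf[4096];
    size_t filled = 0;
    for (;;) {
        ssize_t n = read(fd, buf + filled, sizeof(buf) - filled);
        if (n <= 0) break; // 0 means end-of-file, -1 means error
        filled += (size_t)n;
        if (filled == sizeof(buf)) break;
    }
    close(fd);

    printf("read %zu bytes\n", filled);
    return 0;
}
```

The TCP client version is the same loop, just with socket() and connect() in place of open().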

In the real world though you can’t always read everything you want immediately from a file descriptor. Nor can you always write everything you want immediately to a file descriptor.

You can switch a file descriptor into non-blocking mode so the call won’t block while data you requested is not available. But system calls are still expensive, incurring context switches and cache misses. In fact, networks and disks have become so fast that these costs can start to approach the cost of doing the I/O itself. For the duration of time a file descriptor is unable to read or write, you don’t want to waste time continuously retrying read or write system calls.
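For example, here's what the naive non-blocking approach looks like in C: every `EAGAIN` is a wasted round trip into the kernel, which is exactly the cost we want to avoid.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Switch an already-open file descriptor into non-blocking mode.
static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

// Naive non-blocking read: while the fd isn't ready, we busy-poll the
// kernel, paying a system call (context switch, cache misses) each time.
static ssize_t read_spinning(int fd, void *buf, size_t len) {
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0) return n;
        if (errno != EAGAIN && errno != EWOULDBLOCK) return -1;
        // Not ready yet: retry immediately, wasting another system call.
    }
}
```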

So you switch to io_uring on Linux or kqueue on FreeBSD/macOS. (I’m skipping the generation of epoll/select users.) These APIs let you submit requests to the kernel to learn about readiness: when a file descriptor is ready to read or write. You can send readiness requests in batches (also referred to as queues). Completion events, one for each submitted request, are available in a separate queue.
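A sketch of the readiness flow with kqueue: a single kevent() call both submits a batch of readiness requests (the changelist) and reaps a batch of completion events (the eventlist). Error handling is trimmed for brevity.

```c
#include <sys/types.h>
#include <sys/event.h>

// Register read-readiness interest for a batch of sockets, then collect
// whatever readiness events the kernel has for us, all in one call.
static int wait_for_readable(int kq, const int *fds, int nfds,
                             struct kevent *out, int max_out) {
    struct kevent changes[64];
    if (nfds > 64) nfds = 64;
    for (int i = 0; i < nfds; i++) {
        EV_SET(&changes[i], fds[i], EVFILT_READ, EV_ADD | EV_ONESHOT, 0, 0, NULL);
    }
    // Blocks until at least one fd is ready (NULL timeout = wait forever).
    return kevent(kq, changes, nfds, out, max_out, NULL);
}
```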

Being able to batch I/O like this is especially important for TCP servers that want to multiplex reads and writes for multiple connected clients.

However in io_uring, you can even go one step further. Instead of having to call read() or write() in userland after a readiness event, you can request that the kernel do the read() or write() itself with a buffer you provide. Thus almost all of your I/O is done in the kernel, amortizing the overhead of system calls.
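With liburing, asking the kernel to perform the read itself looks roughly like this (one submission, one completion; the file is illustrative):

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[4096];

    // Queue a read request: the kernel fills `buf` for us, so there is no
    // separate read() system call from userland after a readiness event.
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    // Reap the completion; cqe->res is the byte count (or -errno).
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```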

If you haven’t seen io_uring or kqueue before, you’d probably like an example! Consider this code: a simple, minimal, not-production-ready TCP echo server.
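The post's original Zig listing didn't survive extraction here, but a minimal C stand-in using liburing conveys the shape. This deliberately simple version handles one connection and one operation at a time; the port number is illustrative.

```c
#include <liburing.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// One operation at a time: prep an SQE, submit it, block for the CQE.
static int do_op(struct io_uring *ring) {
    struct io_uring_cqe *cqe;
    io_uring_submit(ring);
    io_uring_wait_cqe(ring, &cqe);
    int res = cqe->res;
    io_uring_cqe_seen(ring, cqe);
    return res;
}

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(12345);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 16);

    for (;;) {
        // Accept, receive, and send are all performed by the kernel;
        // userland only preps requests and reaps completions.
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_accept(sqe, listener, NULL, NULL, 0);
        int conn = do_op(&ring);
        if (conn < 0) break;

        char buf[4096];
        for (;;) {
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_recv(sqe, conn, buf, sizeof(buf), 0);
            int n = do_op(&ring);
            if (n <= 0) break;

            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_send(sqe, conn, buf, (size_t)n, 0);
            if (do_op(&ring) < 0) break;
        }
        close(conn);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```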

Each request you submit carries a user_data value that is handed back unchanged with its completion event. You can point it at a heap-allocated context struct per request, or use an intrusive linked list to contain all request context, including the callback. The latter is what we do in TigerBeetle.
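In C terms, the pattern looks roughly like this (the names are illustrative, not TigerBeetle's actual Zig types):

```c
#include <liburing.h>

// One in-flight operation. The struct itself is the queue node (an
// intrusive link), so queueing it requires no extra allocation.
struct completion {
    struct completion *next; // intrusive link
    void (*callback)(struct completion *c, int result);
};

// Tag the request with its context: the same pointer comes back on the CQE.
static void submit_with_context(struct io_uring *ring,
                                struct io_uring_sqe *sqe,
                                struct completion *c) {
    io_uring_sqe_set_data(sqe, c);
    io_uring_submit(ring);
}

// On completion, recover the context and invoke its callback.
static void reap_one(struct io_uring *ring) {
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);
    struct completion *c = io_uring_cqe_get_data(cqe);
    c->callback(c, cqe->res);
    io_uring_cqe_seen(ring, cqe);
}
```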

Put another way: every time code calls io_dispatch, we’ll try to immediately submit the requested event to io_uring or kqueue. But if there’s no room, we store the event in an overflow queue.

The overflow queue needs to be processed eventually, so we update our flush function (described in Callbacks and context above) to pull as many events as possible from our overflow queue before submitting a batch to io_uring or kqueue.
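Putting the two pieces together, here's a rough C sketch of io_dispatch and flush (the prep hook and queue layout are illustrative choices, not lifted from TigerBeetle):

```c
#include <liburing.h>
#include <stddef.h>

struct completion {
    struct completion *next;
    void (*callback)(struct completion *c, int result);
    void (*prep)(struct io_uring_sqe *sqe, struct completion *c); // fills the SQE
};

// FIFO of operations that didn't fit in the submission queue.
static struct completion *overflow_head, *overflow_tail;

// Try to submit immediately; if the SQ ring is full, park the event.
static void io_dispatch(struct io_uring *ring, struct completion *c) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == NULL) { // no room: store in the overflow queue
        c->next = NULL;
        if (overflow_tail) overflow_tail->next = c; else overflow_head = c;
        overflow_tail = c;
        return;
    }
    c->prep(sqe, c);
    io_uring_sqe_set_data(sqe, c);
}

// Drain as much of the overflow queue as fits, then submit the batch.
static void flush(struct io_uring *ring) {
    while (overflow_head) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (sqe == NULL) break; // ring still full; try again next tick
        struct completion *c = overflow_head;
        overflow_head = c->next;
        if (!overflow_head) overflow_tail = NULL;
        c->prep(sqe, c);
        io_uring_sqe_set_data(sqe, c);
    }
    io_uring_submit(ring); // one system call for the whole batch
}
```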

We’ve now built something similar to libuv, the I/O library that Node.js uses. And if you squint, it is basically TigerBeetle’s I/O library! (And interestingly enough, TigerBeetle’s I/O code was adopted into Bun! Open-source for the win!)

Let’s check out how the Darwin version of TigerBeetle’s I/O library (with kqueue) differs from the Linux version. As mentioned, to complete a send call, the Darwin implementation first waits for file descriptor readiness (through kqueue). Once ready, the actual send call is made back in userland:
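In outline (a C sketch of the pattern, not TigerBeetle's actual Zig code):

```c
#include <sys/types.h>
#include <sys/event.h>
#include <sys/socket.h>

// Darwin-style send: ask kqueue to tell us when the socket is writable,
// then perform the actual send() ourselves in userland.
static ssize_t send_when_ready(int kq, int sock,
                               const void *buf, size_t len) {
    struct kevent change, event;
    EV_SET(&change, sock, EVFILT_WRITE, EV_ADD | EV_ONESHOT, 0, 0, NULL);

    // Block until the kernel reports the fd writable.
    if (kevent(kq, &change, 1, &event, 1, NULL) < 0) return -1;

    // kqueue only told us about readiness: the data still moves
    // through a second system call.
    return send(sock, buf, len, 0);
}
```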

Compare the Linux version (with io_uring), where the kernel handles everything and there is no send system call in userland:
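The original Zig snippet is missing here; a liburing sketch of the same idea:

```c
#include <liburing.h>

// io_uring-style send: queue the request and the kernel performs the
// send itself; userland never calls send().
static void send_via_uring(struct io_uring *ring, int sock,
                           const void *buf, size_t len, void *user_data) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, sock, buf, len, 0);
    io_uring_sqe_set_data(sqe, user_data);
    // Actual submission happens when the event loop flushes its batch.
}
```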

Compare the flush implementations on Linux and macOS for event processing. Look at run_for_ns on Linux and macOS for the public API users must call. And finally, look at what puts this all into practice: the loop calling run_for_ns in src/main.zig.

We’ve come this far and you might be wondering — what about cross-platform support for Windows? The good news is that Windows also has a completion-based system similar to io_uring but without batching, called IOCP. And for bonus points, TigerBeetle provides the same I/O abstraction over it! But it’s enough to cover just Linux and macOS in this post. :)

In both this blog post and in TigerBeetle, we implemented a single-threaded event loop. Keeping I/O code single-threaded in userspace is beneficial (whether or not I/O processing is single-threaded in the kernel is not our concern). It’s the simplest code and best for workloads that are not embarrassingly parallel. It is also best for determinism, which is integral to the design of TigerBeetle because it enables us to do Deterministic Simulation Testing.

But there are other valid architectures for other workloads.

For workloads that are embarrassingly parallel, like many web servers, you could instead use multiple threads where each thread has its own queue. In optimal conditions, this architecture has the highest I/O throughput possible.

But if each thread has its own queue, individual threads can become starved if an uneven amount of work is scheduled on one thread. In the case of dynamic amounts of work, the better architecture would be to have a single queue but multiple worker threads doing the work made available on the queue.
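A sketch of that shared-queue shape with POSIX threads (illustrative, not from the post):

```c
#include <pthread.h>
#include <stddef.h>

struct task {
    struct task *next;
    void (*run)(struct task *t);
};

static struct task *q_head, *q_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_ready = PTHREAD_COND_INITIALIZER;

// Any thread can enqueue work onto the single shared queue...
static void queue_push(struct task *t) {
    pthread_mutex_lock(&q_lock);
    t->next = NULL;
    if (q_tail) q_tail->next = t; else q_head = t;
    q_tail = t;
    pthread_cond_signal(&q_ready);
    pthread_mutex_unlock(&q_lock);
}

// ...and each worker pulls from it, so a burst of work spreads over
// whichever threads are free instead of starving a single fixed owner.
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_head == NULL)
            pthread_cond_wait(&q_ready, &q_lock);
        struct task *t = q_head;
        q_head = t->next;
        if (!q_head) q_tail = NULL;
        pthread_mutex_unlock(&q_lock);
        t->run(t);
    }
    return NULL;
}
```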

Hey, maybe we’ll split this out so you can use it too. It’s written in Zig so we can easily expose a C API. Any language with a C foreign function interface (i.e. every language) should work well with it. Keep an eye on our GitHub. :)

