Epoll vs. io_uring in Linux

Original link: https://sibexi.co/posts/epoll-vs-io_uring/

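This article traces the evolution of asynchronous I/O on Linux, comparing the traditional `epoll` model with the modern `io_uring` interface through the lens of a reverse proxy project. `epoll` is built on a “readiness” model. It works, but every I/O operation takes multiple system calls, one to check whether data is ready (`epoll_wait`) and one to do the actual transfer (`read`/`write`), which adds significant overhead. Under heavy load, the transitions between user space and kernel space become a performance bottleneck. `io_uring`, by contrast, uses a “completion” model. It relies on ring buffers in memory shared between the application and the kernel, so many I/O operations can be submitted and reaped in batches, which sharply reduces the total number of system calls. For maximum performance, features such as `SQPOLL` automate submission and can bring system calls close to zero in steady state. Because `io_uring` delivers higher efficiency by moving complexity out of the application and into the kernel, the author considers it the new standard for high-performance Linux development and sees `epoll` as largely obsolete for modern greenfield projects.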

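The Hacker News discussion looks at the trade-offs between `epoll` and `io_uring` in high-performance Linux network programming. `io_uring` usually delivers better throughput by cutting system call overhead and communicating through memory shared between kernel and user space, but commenters raised serious concerns about its security. Because of past vulnerabilities related to `io_uring`'s memory sharing, many large projects and enterprise distributions are cautious about enabling it by default, although support in enterprise environments such as RHEL is growing. Technical contributors pointed out that swapping the API alone is not enough to reach peak performance; developers also need to pay attention to CPU pinning, socket alignment, and memory management (using tools such as `mimalloc` or `libxdp`). Users also noted that the gains from `io_uring` sometimes show up as higher CPU utilization, because the system is processing I/O more efficiently instead of leaving cores idle. In the end, participants agreed that `io_uring` is a powerful tool for those willing to accept the security risk, but that building a high-performance proxy remains a complex engineering challenge that goes well beyond the choice of multiplexing API.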

Original text

First, I want to tell you how exactly I got to this point and why I started researching different options for handling asynchronous I/O on Linux… Last year, my students and I built a reverse proxy server called TinyGate. It was super simple, worker-based, and it basically worked well. Of course, I didn’t expect it to be very fast, but it was an educational project, and since we’d made a real, kind of production-ready tool, I was really proud of it. But my students weren’t as happy as I was - they wanted to build something genuinely useful, and they were really disappointed that our “product” had strong architectural limits and couldn’t outperform titans like nginx and haproxy. So they literally forced me to research together how those tools work under the hood and how to handle asynchronous I/O to cut down on the heavy overhead… Long story short, we made a second version of TinyGate, based on epoll. It still lost to nginx/haproxy in benchmarks, but it had a dramatic performance boost compared to the first version. But epoll isn’t perfect either (as I’ll explain below), and we eventually switched to io_uring, which led to a full rewrite of our project from scratch, again… So it’s a really interesting topic, and today I’ll share an overview of the two queueing systems Linux gives you for asynchronous I/O.

When I first started developing for Linux, epoll was a new feature, and basically it had no alternatives. Everyone used it to manage asynchronous execution - there was no other choice. The problem is, epoll relies heavily on syscalls: it tells you when I/O is possible, but you still have to call read()/write() yourself afterward - that’s two syscalls per I/O event, on top of the one-time epoll_ctl registration. Each of these syscalls forces a transition between user and kernel mode, which adds up to HUGE overhead once you’re handling a lot of connections. But we have a solution! About 17 years after epoll landed in the Linux kernel (2002), io_uring appeared (2019)! Instead of telling you when I/O is possible, it tells you when I/O is done - no polling loop, and far fewer associated syscalls.
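To make the readiness-model cost concrete, here is a minimal sketch that counts the syscalls spent on the I/O path. The helper name count_epoll_io_syscalls, the NPIPES constant, and the use of pipes as stand-ins for connections are all assumptions made for illustration; none of this comes from TinyGate. It also shows a nuance: a single epoll_wait can report several ready descriptors at once, so "two syscalls per event" is the worst case, and the read() per descriptor is the part that never goes away.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 4  /* hypothetical number of "connections" */

/* Make NPIPES pipes readable, then drain them through epoll.
 * Returns the number of syscalls spent on the I/O path
 * (epoll_wait plus one read per ready fd), or -1 on error.
 * Sketch only: error paths do not close the descriptors. */
static int count_epoll_io_syscalls(void) {
    int pipes[NPIPES][2];
    struct epoll_event ev, events[NPIPES];
    int syscalls = 0;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    for (int i = 0; i < NPIPES; i++) {
        if (pipe(pipes[i]) == -1)
            return -1;
        ev.events = EPOLLIN;
        ev.data.fd = pipes[i][0];
        /* One-time registration, not counted as per-event cost */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[i][0], &ev) == -1)
            return -1;
        /* Put one byte in the pipe so its read end is ready */
        if (write(pipes[i][1], "x", 1) != 1)
            return -1;
    }

    /* One syscall to learn which descriptors are ready */
    int n = epoll_wait(epfd, events, NPIPES, 0);
    syscalls++;
    if (n == -1)
        return -1;
    assert(n <= NPIPES);

    /* One more syscall per ready descriptor to move the data */
    for (int i = 0; i < n; i++) {
        char c;
        if (read(events[i].data.fd, &c, 1) != 1)
            return -1;
        syscalls++;
    }

    for (int i = 0; i < NPIPES; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    close(epfd);
    return syscalls;
}
```

Calling count_epoll_io_syscalls() from a main() and printing the result gives one epoll_wait plus one read per ready pipe, so the per-operation cost grows linearly with the number of ready descriptors.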

The kernel consumes submissions from memory shared between your app and the kernel, and posts completions back into that same shared memory - both live in ring buffers, hence the name. The catch: by default you still have to call io_uring_enter() to tell the kernel “go check the submission queue” - but one call can submit a whole batch of operations and reap a whole batch of completions, instead of one syscall pair per operation like with epoll + read. If you want close to zero syscalls during steady state, there’s IORING_SETUP_SQPOLL, which spins up a dedicated kernel thread that polls the submission queue for you - at the cost of that thread burning CPU (more on this below).

Basic architecture: as I said before, epoll notifies you when I/O is possible, io_uring notifies you when I/O is done. Where epoll makes every I/O operation cross the kernel boundary, io_uring lets you pay a small “setup fee” once (creating the ring) plus a per-batch fee (the io_uring_enter() call) instead of a fee per operation. So instead of a syscall pair per I/O, you get a syscall per batch of I/Os - or, with SQPOLL, close to none at all. As you can see, with a ton of I/O happening, this saves a lot of syscalls.

On relatively new systems where io_uring is supported (kernel v5.1+, released in 2019), there’s often not much reason to reach for epoll. The shift from a readiness model to a completion model is a huge architectural change - it moves a big part of the work out of your application and into the kernel.

Of course, I won’t leave you without some code showing how both systems work. We’ll use C. (The io_uring example uses liburing, the userspace helper library - install it via liburing-dev/liburing-devel, or drop down to the raw io_uring_setup/io_uring_enter syscalls if you want zero dependencies.)

epoll

Let’s make a simple example of how epoll works. We’ll create the instance, register a file descriptor (stdin, in our case), and process the incoming event.

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <stdlib.h>

#define MAX_EVENTS 8

int main() {
    // Creating the epoll instance
    int epoll_fd = epoll_create1(0);
    if (epoll_fd == -1) {
        perror("epoll_create1");
        return 1;
    }

    // Registering a file descriptor (stdin in our case)
    struct epoll_event ev, events[MAX_EVENTS];
    ev.events = EPOLLIN;
    ev.data.fd = STDIN_FILENO;

    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) == -1) {
        perror("epoll_ctl");
        return 1;
    }

    // Blocking until something is readable
    int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
    if (n == -1) {
        perror("epoll_wait");
        return 1;
    }

    // For each fd, issue a SEPARATE syscall to do the I/O
    for (int i = 0; i < n; i++) {
        if (events[i].data.fd == STDIN_FILENO) {
            char buf[256];
            ssize_t count = read(STDIN_FILENO, buf, sizeof(buf));
            if (count == -1) {
                perror("read");
                return 1;
            }
            printf("read %zd bytes\n", count);
        }
    }

    // Cleaning up
    close(epoll_fd);
    return 0;
}

As you can see, this example uses three syscalls on the I/O path (not counting the epoll_create1 setup and the final close): epoll_ctl (a one-time registration), then epoll_wait and read for the event - so two syscalls per actual I/O event, like I mentioned above. The code itself is pretty easy to follow.

io_uring

Now let’s do the same thing with io_uring instead of epoll.

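#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>
#include <stdlib.h>

int main() {
    struct io_uring ring;
    char buf[256];
    int ret;

    // Setting up the ring (liburing returns -errno, it does not set errno)
    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // Prepare a READ operation on stdin (IORING_OP_READ needs kernel 5.6+)
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, STDIN_FILENO, buf, sizeof(buf), 0);

    // Submitting the read
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        return 1;
    }

    // Waiting for completion
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        return 1;
    }
    // The result of the read itself arrives in cqe->res (negative errno on failure)
    if (cqe->res < 0) {
        fprintf(stderr, "read failed: %s\n", strerror(-cqe->res));
    } else {
        printf("read %d bytes\n", cqe->res);
    }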

    // Marking seen then cleaning up
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}

What can we see here?

Similar instance creation step.
No epoll_ctl registration step needed.
No readiness check needed before submission.
No separate read() call at completion.

Yeah, io_uring needs way fewer steps for this - though, as noted above, io_uring_enter() is still hiding inside io_uring_submit() and io_uring_wait_cqe() (liburing’s io_uring_submit_and_wait() can fold the two into a single call) unless you’re running with SQPOLL.

When you test these examples, keep in mind that for the sake of simplicity, some important parts are missing. For example, it will block forever if stdin never produces any data, and the io_uring example skips checking for a NULL sqe (which io_uring_get_sqe() can return if the submission queue is full).

Zero-copy. For real zero-copy I/O, register your buffers ahead of time with io_uring_register_buffers() - this avoids the kernel re-mapping memory on every single operation. For network sends specifically, look at IORING_OP_SEND_ZC (kernel 6.0+ needed), which skips copying the buffer into the kernel entirely.
SQPOLL uses CPU. Even when your queue is empty, IORING_SETUP_SQPOLL keeps a kernel thread spinning and polling, which burns CPU. There’s an idle timeout (sq_thread_idle) after which it backs off to sleeping, but it’s not free.
Asynchronous error handling. Errors come back (and must be handled) asynchronously, as part of the cqe’s res field - not as a direct return value like a normal synchronous syscall.

io_uring is the new standard for async I/O in the modern Linux world, and honestly, I don’t see much reason to still reach for epoll on a system that has it. For a from-scratch project on a modern Linux server, like our TinyGate rewrite, io_uring is absolutely the way to go. I’m a die-hard supporter of dropping support for old systems as soon as it’s reasonable - if you’re still running a kernel released more than 7 years ago, in my opinion, that’s not a great idea…