Fast-Servers

原始链接: https://geocar.sdf1.org/fast-servers.html

## High-Performance Network Server Design

Traditional network-server designs rely on a main loop that dispatches events based on file descriptors, historically using `fork()` or worker threads. A more effective approach uses `epoll` or `kqueue` – usually through a library like `libevent` – yet still tends to fall into the same performance traps.

The recommended design uses one thread per CPU core, each pinned to a specific processor and owning its own `epoll`/`kqueue` instance. Each state transition (accepting connections, reading data) is handled by a dedicated thread, and file descriptors are passed between them. This eliminates decision points inside each thread and relies on simple blocking I/O.

The key steps: create a thread pool matching the core count, raise the file-descriptor limit, disable lingering on sockets, and optionally enable deferred accepts. Incoming connections are accepted in a dedicated loop, then distributed to worker threads round-robin or via a workload-aware `pick()` function. Each thread then monitors its file descriptors with `epoll_wait`/`kevent` and calls a `handle()` function to process events, keeping each operation simple and scheduling further work onto other threads when needed. This architecture aims for high throughput through simplicity and efficient resource use – easily exceeding 100k requests/second.


Original article
fast-servers

There's a network-server programming pattern which is so popular that it's the canonical approach to writing network servers:

[Flowchart of the network server design described below]

This design is easy to recognise: The main loop waits for some event, then dispatches based on the file descriptor and state that the file descriptor is in. At one point it was in vogue to actually fork() so that each file descriptor could be handled by a different thread, but now "worker threads" are usually created that all perform the same task and rely on the kernel to schedule file descriptors to them.

A much better design is possible because of epoll and kqueue; however, most people use these "new" system calls through a wrapper like libevent, which just encourages the same slow design people have been using for over twenty years now.

The design I currently use and recommend involves two major points:

  1. One thread per core, pinned (affinity) to separate CPUs, each with their own epoll/kqueue fd
  2. Each major state transition (accept, reader) is handled by a separate thread, and transitioning one client from one state to another involves passing the file descriptor to the epoll/kqueue fd of the other thread.
[Flowchart of the improved network server design]

This design has no decision points, uses simple blocking I/O calls, and makes for simple one-page performant servers that easily get into the 100k requests/second territory on modern systems.

Creating the thread pool

Ask the operating system how many cores there are. Sometimes reserving some cores makes sense, so let the user lower this number. If raising this number helps, then your state transitions are too complex and you will need to break them up.

pthread_attr_t a;
pthread_attr_init(&a);
pthread_attr_setscope(&a,PTHREAD_SCOPE_SYSTEM);
pthread_attr_setdetachstate(&a,PTHREAD_CREATE_DETACHED);
t=sysconf(_SC_NPROCESSORS_ONLN);
for(i=0;i<t;++i)pthread_create(&id,&a,(void*)run,(void*)(long)i);
while(busy(t)){pthread_mutex_lock(&tm);pthread_cond_wait(&tc,&tm);pthread_mutex_unlock(&tm);}

Then in each thread, do any per-thread initialisation and allocate your kevent/epoll fd:

void*run(int id){
set_affinity(id);
pthread_mutex_lock(&tm);
#ifdef __linux__
 worker[id].q=epoll_create1(0);
#else
 worker[id].q=kqueue();
#endif
 ...
pthread_mutex_unlock(&tm);pthread_cond_signal(&tc);

Setting processor affinity is something that must be done in-process on some platforms, but the system administrator should be able to provide input:

cpu_set_t c;
CPU_ZERO(&c);CPU_SET(id,&c);
pthread_setaffinity_np(pthread_self(),sizeof(c),&c);

Apple OS X doesn't support pthread_setaffinity_np() directly, but what we need is easy to implement:

extern int thread_policy_set(thread_t thread, thread_policy_flavor_t flavor, thread_policy_t policy_info, mach_msg_type_number_t count);
thread_affinity_policy_data_t ap;
thread_extended_policy_data_t ep;
ap.affinity_tag=id+1;
ep.timeshare=FALSE;
thread_policy_set(mach_thread_self(),THREAD_EXTENDED_POLICY,(thread_policy_t)&ep,THREAD_EXTENDED_POLICY_COUNT);
thread_policy_set(mach_thread_self(),THREAD_AFFINITY_POLICY,(thread_policy_t)&ap,THREAD_AFFINITY_POLICY_COUNT);

Creating the listening socket

Increase the number of file descriptors to cover the number of connections you want to handle, n=2048 per thread:

getrlimit(RLIMIT_NOFILE,&r);
if(r.rlim_cur<(t*n)){r.rlim_cur=t*n;P(setrlimit(RLIMIT_NOFILE,&r)==-1,oops("setrlimit"));} /* raise the soft limit to n fds per thread */

Disabling lingering is important otherwise you'll run out of file descriptors:

s=socket(sin.sin_family=AF_INET,SOCK_STREAM,IPPROTO_TCP);
lf.l_onoff=1;lf.l_linger=0;
setsockopt(s,SOL_SOCKET,SO_LINGER,&lf,sizeof(lf));
bind(s,(struct sockaddr*)&sin,sizeof(sin)); /* sin_port/sin_addr filled in elsewhere */
listen(s,n*t);

If the client speaks first (as in HTTP), then enable deferred accepts on Linux:

#ifdef __linux__
o=5;setsockopt(s,SOL_TCP,TCP_DEFER_ACCEPT,&o,sizeof(o));
#endif

The accept-loop

There's no point in waiting for epoll/kevent since all this loop does is accept connections:

for(;;){
 f=accept(s,0,0);
 ...
 q=pick();
#ifdef __linux__
 struct epoll_event ev={0};
 ev.data.fd=f;ev.events=EPOLLIN|EPOLLRDHUP|EPOLLERR|EPOLLET;
 epoll_ctl(q,EPOLL_CTL_ADD,f,&ev);
#else
 struct kevent ev;
 EV_SET(&ev,f,EVFILT_READ,EV_ADD|EV_CLEAR,0,0,NULL);
 kevent(q,&ev,1,0,0,0);
#endif
}

Any socket options should be enabled before handing the socket to the next step. Consider adding a timeout (SO_RCVTIMEO) to the socket instead of tracking timers in your application:

struct timeval tv={0};
tv.tv_sec=5;
setsockopt(f,SOL_SOCKET,SO_RCVTIMEO,&tv,sizeof(tv));
fcntl(f,F_SETFL,fcntl(f,F_GETFL)|O_NONBLOCK);

Hasan Alayli observes that on Linux you can use accept4() to combine accept() and the fcntl().

Scheduling tasks can usually be done by rotating through the threads:

int pick(void){ static int c; ++c; return worker[c%t].q; }

...however some workloads benefit from analysis here: choosing a worker based on the probability that the input belongs to one type or another can actually improve mean throughput if some bias is introduced into the pick() routine. Experiment and benchmark.

The request-loop

A task that has some input will begin with a epoll_wait() or kevent() step:

#ifdef __linux__
struct epoll_event e[1000];
n=epoll_wait(q,e,1000,-1); /* store the count: a wait in the loop condition would re-poll every iteration */
for(i=0;i<n;++i)
 if(e[i].events&(EPOLLRDHUP|EPOLLHUP))close(e[i].data.fd);
 else handle(e[i].data.fd);
#else
struct kevent e[1000];
n=kevent(q,0,0,e,1000,0);
for(i=0;i<n;++i)
 if(e[i].flags&EV_EOF)close(e[i].ident);
 else handle(e[i].ident);
#endif

Each file descriptor will only be used by a single request in a single state, so keeping an array of input buffers indexed by file descriptor can simplify a lot of algorithms. handle(fd) can read from the input and use write() or sendfile() as necessary; however, if more than one syscall is needed, schedule the task with a worker that performs that operation.
