RIP pthread_cancel

Original link: https://eissing.org/icing/posts/rip_pthread_cancel/

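curl 8.16.0 started using `pthread_cancel()` in its threaded resolver. It backfired, and the feature is now being removed again (#18540). libcurl runs `getaddrinfo()` (DNS resolution) in a separate thread so that a slow name lookup does not block everything else. `pthread_cancel()` was meant to get rid of that thread without a blocking `pthread_join()` and without letting detached threads pile up.

However, cancelling the thread caused memory leaks. The cause lies in how `glibc` implements `getaddrinfo()` and its use of `/etc/gai.conf`: if the thread is cancelled while `getaddrinfo()` is reading that file, the memory already allocated for the resolved addresses is orphaned, and this can happen again on later lookups.

The author concluded that `glibc` is not designed to prevent such leaks, that preventing them is not trivial, and that potentially leaking memory in something libcurl does over and over again is not acceptable. curl therefore goes back to possibly blocking on `getaddrinfo()` rather than risking leaks. Applications that need non-blocking DNS resolution can use `c-ares`, although it may not offer everything `glibc` does.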

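## Hacker News discussion: the end of pthread_cancel and DNS resolution

A Hacker News discussion centers on a recent blog post about the `pthread_cancel` problem in libcurl, specifically in connection with blocking DNS resolution. The core issue: cancelling a thread that is executing `getaddrinfo` can leak memory because cleanup handlers are missing. Commenters stress how hard reliable thread cancellation is and advocate non-blocking APIs. Many point out that there is no standardized, cancellable asynchronous DNS resolution across platforms. The solutions discussed include libraries such as c-ares, platform-specific asynchronous APIs (Windows, BSD, Android), and thread pools. A recurring theme is the historical baggage and complexity of the POSIX standard, particularly around timeouts and asynchronous operations. Many argue that DNS resolution should be handled by a system service rather than in libc, for better caching and management. The discussion also touches on the trade-off between thread creation overhead and potential resource leaks, and on the difficulty of staying compatible with fork() when using threaded solutions.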
Original article

About three weeks ago I posted about adding pthread_cancel use in curl. We released this in curl 8.16.0 and it blew up right in our faces. Now, with #18540, we are ripping it out again. What happened?

short recap

Pthreads defines “cancellation points”, a list of POSIX functions in which a pthread may be cancelled. In addition, there is a list of functions that may be cancellation points, among them getaddrinfo().

getaddrinfo() is exactly what we are interested in for libcurl. It blocks until it has resolved a name. That may hang for a long time and libcurl is unable to do anything else. Meh. So, we start a pthread and let that call getaddrinfo(). libcurl can do other things while that thread runs.
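
A minimal sketch of that arrangement, assuming plain POSIX threads. The names `resolve_job`, `resolve_start` and `resolve_wait` are hypothetical and chosen for illustration; this is not libcurl's actual resolver code.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical job record; the names are illustrative, not libcurl's. */
struct resolve_job {
    pthread_t thread;
    const char *host;        /* name to resolve, owned by the caller */
    struct addrinfo *result; /* filled in by the resolver thread */
    int rc;                  /* getaddrinfo() return code */
};

/* Thread entry point: the blocking call happens here, off the main thread. */
static void *resolve_main(void *arg)
{
    struct resolve_job *job = arg;
    struct addrinfo hints;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    job->rc = getaddrinfo(job->host, NULL, &hints, &job->result);
    return NULL;
}

/* Start the lookup; returns 0 on success or a pthread error number. */
int resolve_start(struct resolve_job *job, const char *host)
{
    job->host = host;
    job->result = NULL;
    job->rc = EAI_FAIL;
    return pthread_create(&job->thread, NULL, resolve_main, job);
}

/* Blocking wait for the result; returns the getaddrinfo() return code. */
int resolve_wait(struct resolve_job *job)
{
    pthread_join(job->thread, NULL);
    return job->rc;
}
```

Between `resolve_start()` and `resolve_wait()` the caller is free to do other work. On success the caller owns `job->result` and releases it with `freeaddrinfo()`.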

But eventually, we have to get rid of the pthread again. That means we either have to pthread_join() it, which is a blocking wait, or call pthread_detach(), which returns immediately while the thread keeps on running. Both are bad when you want to do many, many transfers. Either we block and stall or we let pthreads pile up in an uncontrolled way.

So, we added pthread_cancel() to interrupt a running getaddrinfo() and get rid of the pthread we no longer needed. That was the theory. And, after some hair pulling, we got this working.
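
The cancel-then-join pattern can be sketched as follows. As an assumption made for determinism, `sleep()` (a required cancellation point) stands in for a long-running getaddrinfo(). The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Stand-in for a long-running getaddrinfo(): sleep() is a required
 * cancellation point, so a pending cancel takes effect inside it. */
static void *slow_lookup(void *arg)
{
    (void)arg;
    for (;;)
        sleep(60);
    return NULL;
}

/* Start the stand-in thread, cancel it, and reap it.
 * Returns 1 if the thread ended through cancellation,
 * 0 if it exited on its own, -1 on error. */
int cancel_and_reap(void)
{
    pthread_t t;
    void *ret = NULL;

    if (pthread_create(&t, NULL, slow_lookup, NULL) != 0)
        return -1;
    if (pthread_cancel(t) != 0)
        return -1;
    /* With the thread cancelled, the join returns promptly. */
    if (pthread_join(t, &ret) != 0)
        return -1;
    return ret == PTHREAD_CANCELED ? 1 : 0;
}
```

The join is still needed to reclaim the thread's resources, but it no longer has to wait for the lookup to finish on its own.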

cancel yes, leakage also yes!

After releasing curl 8.16.0 we got an issue reported in #18532 that cancelled pthreads leaked memory.

modern times sigh

Digging into the glibc source shows that there is this thing called /etc/gai.conf which defines how getaddrinfo() should sort returned answers.

The implementation in glibc first resolves the name to addresses. For these, it needs to allocate memory. Then it needs to sort them if there is more than one address. And in order to do that it needs to read /etc/gai.conf. And in order to do that it calls fopen() on the file. And that may be a pthread “cancellation point” (and if not, it surely calls open(), which is a required cancellation point).

So, the pthread may get cancelled when reading /etc/gai.conf and leak all the allocated responses. And if it gets cancelled there, it will try to read /etc/gai.conf again the next time it has more than one address resolved.
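
The mechanism can be reproduced in miniature. The sketch below is a hypothetical stand-in, not glibc code: a thread allocates a buffer (the "resolved addresses") and then blocks in a cancellation point (`sleep()` in place of reading /etc/gai.conf). Without a cleanup handler registered via `pthread_cleanup_push()`, cancellation skips the code that would have freed the buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for getaddrinfo() internals, not glibc code. */
struct fake_lookup {
    int use_cleanup;  /* register a cleanup handler before blocking? */
    void *answers;    /* non-NULL while the buffer is still allocated */
};

/* Cleanup handler: runs on cancellation, or on pthread_cleanup_pop(1). */
static void release_answers(void *arg)
{
    struct fake_lookup *l = arg;
    free(l->answers);
    l->answers = NULL;
}

static void *lookup_like(void *arg)
{
    struct fake_lookup *l = arg;

    l->answers = malloc(256);        /* the "resolved addresses" */
    if (l->use_cleanup) {
        pthread_cleanup_push(release_answers, l);
        sleep(60);                   /* stands in for reading the config file */
        pthread_cleanup_pop(1);      /* normal path frees as well */
    } else {
        sleep(60);                   /* cancelled here: the next line never runs */
        release_answers(l);
    }
    return NULL;
}

/* Returns 1 if a cancelled lookup left its buffer behind, 0 if not,
 * -1 if the thread could not be created. */
int leaks_when_cancelled(int use_cleanup)
{
    struct fake_lookup l = { use_cleanup, NULL };
    pthread_t t;
    int leaked;

    if (pthread_create(&t, NULL, lookup_like, &l) != 0)
        return -1;
    pthread_cancel(t);
    pthread_join(t, NULL);
    leaked = (l.answers != NULL);
    free(l.answers);  /* only possible because this demo kept the pointer */
    return leaked;
}
```

In a real getaddrinfo() call nobody outside the cancelled thread holds the pointer, so the buffer cannot be released afterwards. The application calling getaddrinfo() also cannot register handlers for allocations made inside the library.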

At this point, I decided that we need to give up on the whole pthread_cancel() strategy. The reading of /etc/gai.conf is one point where a cancelled getaddrinfo() may leak. There might be others. Clearly, glibc is not really designed to prevent leaks here (admittedly, this is not trivial).

RIP

Potentially leaking memory in something libcurl does over and over again is not acceptable. We’d rather pay the price of eventually having to wait on a long-running getaddrinfo().

Applications using libcurl can avoid this by using c-ares, which resolves names without blocking and without the use of threads. But that will not be able to do everything that glibc does.

DNS continues to be tricky to use well.
