I posted about adding pthread_cancel use in curl about three weeks ago, we released this in curl 8.16.0 and it blew up right in our faces. Now, with #18540 we are ripping it out again. What happened?
short recap
pthreads
define “Cancelation points”, a list of POSIX functions where
a pthread may be cancelled. In addition, there is also a list of functions
that may be cancelation points, among those getaddrinfo()
.
getaddrinfo()
is exactly what we are interested in for libcurl
. It blocks
until it has resolved a name. That may hang for a long time and libcurl
is unable to do anything else. Meh. So, we start a pthread and let that
call getaddrinfo()
. libcurl
can do other things while that thread runs.
But eventually, we have to get rid of the pthread again. Which means we
either have to pthread_join()
it - which means a blocking wait. Or we
call pthread_detach()
- which returns immediately but the thread keeps
on running. Both are bad when you want to do many, many transfers. Either we block and
stall or we let pthreads pile up in an uncontrolled way.
So, we added pthread_cancel()
to interrupt a running getaddrinfo()
and get rid of the pthread we no longer needed. So the theory. And, after
some hair pulling, we got this working.
cancel yes, leakage also yes!
After releasing curl 8.16.0 we got an issue reported in #18532 that cancelled pthreads leaked memory.
Digging into the glibc source
shows that there is this thing called
/etc/gai.conf
which defines how getaddrinfo()
should sort returned answers.
The implementation in glibc first resolves the name to addresses. For these,
it needs to allocate memory. Then it needs to sort them if there is more
than one address. And in order
to do that it needs to read /etc/gai.conf
. And in order to do that
it calls fopen()
on the file. And that may be a pthread “Cancelation Point”
(and if not, it surely calls open()
which is a required cancelation point).
So, the pthread may get cancelled when reading /etc/gai.conf
and leak all
the allocated responses. And if it gets cancelled there, it will try to
read /etc/gai.conf
again the next time it has more than one address
resolved.
At this point, I decided that we need to give up on the whole pthread_cancel()
strategy. The reading of /etc/gai.conf
is one point where a cancelled
getaddrinfo()
may leak. There might be others. Clearly, glibc is not really
designed to prevent leaks here (admittedly, this is not trivial).
RIP
Leaking memory potentially on something libcurl
does over and over again is
not acceptable. We’d rather pay the price of having to eventually wait on
a long running getaddrinfo()
.
Applications using libcurl
can avoid this by using c-ares
which resolves
unblocking and without the use of threads. But that will not be able to do
everything that glibc does.
DNS continues to be tricky to use well.