The Surprising gRPC Client Bottleneck in Low-Latency Networks

Original link: https://blog.ydb.tech/the-surprising-grpc-client-bottleneck-in-low-latency-networks-and-how-to-get-around-it-69d6977a1d02

## YDB's gRPC Bottleneck and Optimization

YDB discovered a performance bottleneck in its gRPC-based database API, manifesting as increased client-side latency and underutilized resources as the cluster size shrank. Despite a fast network and a seemingly efficient server setup, benchmarks showed poor scalability: adding clients brought only marginal throughput gains. The root cause was identified as contention inside the gRPC client caused by multiplexing RPCs over a single TCP connection, even when multiple channels were used. This led to idle periods while waiting for acknowledgments, effectively serializing requests. The solution follows gRPC best practices: give each worker a dedicated gRPC channel with unique arguments (or enable `GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL`). This forces separate TCP connections and enables true concurrency. The optimization delivered a 4.5-6x throughput improvement and significantly lower latency, demonstrating the importance of addressing client-side bottlenecks. The problem is far less pronounced on high-latency networks, underscoring its sensitivity to network speed. The team encourages further investigation and contributions to optimize gRPC performance.

## gRPC Client Bottleneck Summary

A Hacker News discussion highlights a surprising performance bottleneck in gRPC, especially on low-latency networks. Although gRPC multiplexes over HTTP/2, the Go gRPC SDK tends to create only a *single* TCP connection per client. This serializes requests onto a single IO thread, limiting throughput even when plenty of CPU cores are available. The core issue appears to be head-of-line blocking: even fast requests are delayed when one request takes longer to process. One user found a related workaround involving a Linux kernel setting (`net.ipv4.tcp_slow_start_after_idle`), which causes TCP slow start after a period of inactivity. Many commenters suggested that bypassing gRPC entirely in favor of a simpler JSON API can significantly improve performance, while others recommended exploring alternatives such as `connect-rpc`. The discussion underscores the importance of performance benchmarks and careful evaluation when choosing a communication protocol.

## Original Article

“Improving anything but the bottleneck is an illusion.” — Eliyahu M. Goldratt

At YDB, we use gRPC to expose our database API to clients. Therefore, all our load generators and benchmarks are gRPC clients. Recently, we discovered that the fewer cluster nodes we have, the harder it is for the benchmark to load the cluster. Moreover, when we shrink the cluster size, it results in more and more idle resources, while we observe steadily increasing client-side latency. Fortunately, we identified the root cause as a bottleneck on the client side of gRPC.

In this post, we describe the issue and the steps to reproduce it using a provided gRPC server/client microbenchmark. Then, we show a recipe to avoid the discovered bottleneck and achieve high throughput and low latency simultaneously. We present a comparative performance evaluation that illustrates the relationship between latency and throughput, as well as the number of concurrent in-flight requests.

gRPC is usually considered “a performant, robust platform for inter-service communication”. Within a gRPC client, there are multiple gRPC channels, each supporting many RPCs (streams). gRPC is implemented over the HTTP/2 protocol, and each gRPC stream corresponds to an HTTP/2 stream.

gRPC channels to different gRPC servers have their own TCP connections. Also, when you create a channel, you might specify channel arguments (channel configuration), and channels created with different arguments will have their own TCP connections. Otherwise, as we discovered, all channels share the same TCP connection regardless of traffic (which is quite unexpected), and gRPC uses HTTP/2 to multiplex RPCs.

In gRPC’s Performance Best Practices, it is stated that:

(Special topic) Each gRPC channel uses 0 or more HTTP/2 connections and each connection usually has a limit on the number of concurrent streams. When the number of active RPCs on the connection reaches this limit, additional RPCs are queued in the client and must wait for active RPCs to finish before they are sent. Applications with high load or long-lived streaming RPCs might see performance issues because of this queueing. There are two possible solutions:

1. Create a separate channel for each area of high load in the application.

2. Use a pool of gRPC channels to distribute RPCs over multiple connections (channels must have different channel args to prevent re-use so define a use-specific channel arg such as channel number).

Our gRPC clients use the first solution. It’s also worth noting that, by default, the limit on the number of concurrent streams per connection is 100, and the number of our in-flight requests per channel is lower. In this post, we show that — at least in our case — these two solutions are actually two steps of the same fix, not separate options.
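As a concrete illustration of the second option, here is a minimal C++ sketch (with illustrative names, not the benchmark's actual code) of a channel pool in which each channel is created with a distinct dummy channel argument, preventing gRPC from reusing the same underlying TCP connection for all of them:

```cpp
#include <memory>
#include <string>
#include <vector>

#include <grpcpp/grpcpp.h>

// Build `size` channels to the same target. The "channel_number" key is an
// arbitrary, otherwise-unused argument; because each channel has different
// args, each one gets its own subchannel and hence its own TCP connection.
std::vector<std::shared_ptr<grpc::Channel>> MakeChannelPool(
    const std::string& target, int size) {
  std::vector<std::shared_ptr<grpc::Channel>> pool;
  for (int i = 0; i < size; ++i) {
    grpc::ChannelArguments args;
    args.SetInt("channel_number", i);
    pool.push_back(grpc::CreateCustomChannel(
        target, grpc::InsecureChannelCredentials(), args));
  }
  return pool;
}
```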

To tackle the problem, we implemented a simple gRPC ping microbenchmark in C++, which uses the latest version of gRPC (v1.72.0, fetched via CMake). Note that we observed the same issue with clients written in Java, so these results likely apply to most gRPC implementations.

There is a grpc_ping_server, which uses the gRPC async API with completion queues. The number of workers, completion queues, and callbacks per completion queue are specified at startup. There is also a grpc_ping_client. At startup, you specify the number of parallel workers. Each worker performs RPC calls with an in-flight count of 1, using the gRPC sync API (results are the same with the async API) and its own gRPC channel. This is a closed-loop benchmark, so the total number of concurrent requests in the system equals the number of client workers. For simplicity, the ping message has no payload.
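To make the client setup concrete, here is a minimal sketch of one worker loop, assuming a generated stub from a hypothetical ping.proto with a unary Ping RPC (the names are assumptions, not the benchmark's actual code):

```cpp
#include <string>

#include <grpcpp/grpcpp.h>
#include "ping.grpc.pb.h"  // hypothetical generated code for a unary Ping RPC

void WorkerLoop(const std::string& target) {
  // One channel per worker. Note that with identical (default) channel
  // arguments, such channels still end up sharing a single TCP connection,
  // as discussed later in this post.
  auto channel = grpc::CreateChannel(target, grpc::InsecureChannelCredentials());
  auto stub = ping::PingService::NewStub(channel);

  while (true) {
    grpc::ClientContext ctx;
    ping::PingRequest request;  // empty payload, as in the benchmark
    ping::PingReply reply;
    // Sync API: blocks until the reply arrives, so in-flight is exactly 1.
    grpc::Status status = stub->Ping(&ctx, request, &reply);
    if (!status.ok()) break;
    // Record per-request latency here to build the percentiles reported below.
  }
}
```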

We run the client and the server on separate bare-metal machines. Each has two Intel Xeon Gold 6338 CPUs at 2.00 GHz with hyper-threading enabled. Here is the topology:

> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 0 size: 257546 MB
node 0 free: 51628 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 258032 MB
node 1 free: 82670 MB
node distances:
node 0 1
0: 10 20
1: 20 10

The machines are physically close to each other and connected with a 50 Gbps network:

100 packets transmitted, 100 received, 0% packet loss, time 101367ms
rtt min/avg/max/mdev = 0.031/0.043/0.085/0.012 ms

Just for comparison, here is a localhost ping:

100 packets transmitted, 100 received, 0% packet loss, time 101387ms
rtt min/avg/max/mdev = 0.013/0.034/0.047/0.009 ms

We run the gRPC server using the following command:

taskset -c 32-63 ./grpc_ping_server --num-cqs 8 --workers-per-cq 2 --callbacks-per-cq 10

We use taskset to ensure that all threads always run within the same NUMA node. This is crucial for accurate performance results, and we recommend the same approach in production.

According to the gRPC documentation, 2 workers per completion queue is optimal. We expect that a server with 32 CPU cores and 8 completion queues would handle a huge RPS with good latency.
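For readers unfamiliar with the completion-queue model, here is a condensed, hypothetical sketch of how a server like grpc_ping_server can be organized: N completion queues, several pre-posted calls per queue, and two polling threads per queue. The ping.proto types and the exact structure are assumptions made for illustration, not the benchmark's actual code.

```cpp
#include <memory>
#include <thread>
#include <vector>

#include <grpcpp/grpcpp.h>
#include "ping.grpc.pb.h"  // hypothetical generated code for a unary Ping RPC

// Per-RPC state machine: request a call, then finish it with an empty reply.
class CallData {
 public:
  CallData(ping::PingService::AsyncService* service, grpc::ServerCompletionQueue* cq)
      : service_(service), cq_(cq), responder_(&ctx_), state_(PROCESS) {
    // Ask gRPC to deliver the next Ping RPC on this completion queue.
    service_->RequestPing(&ctx_, &request_, &responder_, cq_, cq_, this);
  }

  void Proceed() {
    if (state_ == PROCESS) {
      new CallData(service_, cq_);  // keep a call pre-posted for the next RPC
      state_ = FINISH;
      responder_.Finish(reply_, grpc::Status::OK, this);
    } else {
      delete this;  // reply sent, clean up
    }
  }

 private:
  ping::PingService::AsyncService* service_;
  grpc::ServerCompletionQueue* cq_;
  grpc::ServerContext ctx_;
  ping::PingRequest request_;
  ping::PingReply reply_;
  grpc::ServerAsyncResponseWriter<ping::PingReply> responder_;
  enum { PROCESS, FINISH } state_;
};

int main() {
  const int kNumCqs = 8, kWorkersPerCq = 2, kCallbacksPerCq = 10;

  ping::PingService::AsyncService service;
  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:2137", grpc::InsecureServerCredentials());
  builder.RegisterService(&service);

  std::vector<std::unique_ptr<grpc::ServerCompletionQueue>> cqs;
  for (int i = 0; i < kNumCqs; ++i) cqs.push_back(builder.AddCompletionQueue());
  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();

  std::vector<std::thread> workers;
  for (auto& cq : cqs) {
    // Pre-post several outstanding calls per queue (--callbacks-per-cq).
    for (int i = 0; i < kCallbacksPerCq; ++i) new CallData(&service, cq.get());
    // Two polling threads per completion queue.
    for (int i = 0; i < kWorkersPerCq; ++i) {
      workers.emplace_back([&cq] {
        void* tag;
        bool ok;
        while (cq->Next(&tag, &ok)) {
          if (ok) static_cast<CallData*>(tag)->Proceed();
          else delete static_cast<CallData*>(tag);
        }
      });
    }
  }

  server->Wait();  // like the benchmark server, runs until the process is killed
  for (auto& t : workers) t.join();
  return 0;
}
```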

Now, let’s start the client and check the results for both regular RPCs and streaming RPCs, which should have better performance:

echo "Single connection, no streaming"
taskset -c 0-31 ./grpc_ping_client --host server-host --min-inflight 1 --max-inflight 48 --warmup 5 --interval 10 --with-csv

echo "Single connection streaming"
taskset -c 0-31 ./grpc_ping_client --host server-host --min-inflight 1 --max-inflight 48 --warmup 5 --interval 10 --with-csv --streaming

The plotted results:

[Plots: benchmark results for regular and streaming RPCs over a single connection]

“Regular RPC (theoretical)” is calculated using the formula

rps(inflight) = measured_rps_inflight_1 * inflight

and represents ideal linear scalability. Unfortunately, even with a small in-flight count, the results deviate significantly from this ideal straight line.
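For example, at in-flight 1 the measured throughput is about 10,435 req/s (see the table below), so the theoretical value at in-flight 4 is 10,435 × 4 ≈ 41,740 req/s, whereas the measured result is only 25,904 req/s.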

Here is the tabular data for RPC without streaming:

Benchmark Results Summary:
In-flight | Throughput (req/s) | P50 (us) | P90 (us) | P99 (us) | P99.9 (us) | P100 (us)
---------+--------------------+----------+----------+----------+------------+----------
1 | 10435.10 | 91 | 98 | 106 | 116 | 219
2 | 17583.50 | 100 | 131 | 235 | 273 | 372
3 | 21754.50 | 126 | 171 | 251 | 319 | 538
4 | 25904.20 | 144 | 195 | 283 | 358 | 491
5 | 27468.70 | 168 | 243 | 348 | 442 | 606
6 | 30093.90 | 190 | 255 | 366 | 461 | 630
7 | 32224.70 | 210 | 270 | 394 | 502 | 770
8 | 33289.90 | 234 | 298 | 424 | 552 | 732
9 | 36014.90 | 247 | 299 | 411 | 599 | 1282
10 | 37169.70 | 266 | 325 | 454 | 699 | 942
11 | 39452.10 | 283 | 336 | 433 | 653 | 1017
12 | 40515.50 | 297 | 359 | 537 | 840 | 1133
13 | 41839.70 | 312 | 374 | 543 | 902 | 1172
14 | 43360.70 | 327 | 389 | 536 | 892 | 1231
15 | 43196.50 | 344 | 426 | 660 | 1038 | 1335
16 | 43854.60 | 359 | 445 | 847 | 1156 | 1453
17 | 44044.30 | 375 | 473 | 875 | 1210 | 1493
18 | 44025.60 | 390 | 508 | 1043 | 1316 | 1578
19 | 42714.70 | 411 | 581 | 1176 | 1415 | 1651
20 | 42050.10 | 429 | 705 | 1270 | 1497 | 1764
21 | 40584.50 | 453 | 848 | 1371 | 1585 | 1866

As you can see, scaling the number of clients by 10× yields only a 3.7× increase in throughput, and scaling by 20× just a 4× gain. Moreover, latency grows roughly linearly with each additional client. Even at small in-flight counts, latency is far higher than the network round-trip time. And because this is a closed-loop benchmark, throughput is latency-bound.
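To see why, note that in a closed loop throughput and latency are tied together by Little's law: throughput ≈ in-flight / average latency. At in-flight 1, the measured 10,435 req/s corresponds to an average latency of about 96 µs per request, already more than twice the ~43 µs network round-trip time.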

The performance was clearly poor, so we began investigating. First, we checked the number of TCP connections using lsof -i TCP:2137 and found that only a single TCP connection was used regardless of in-flight count.

Next, we captured a tcpdump and analyzed it in Wireshark:

  • As expected, there were no congestion issues, because we have a solid network
  • Nagle is off (TCP_NODELAY set)
  • TCP window was also good: set to 64 KiB, while in-flight bytes were usually within 1 KiB
  • No delayed ACKs
  • Server was fast to respond

Meanwhile, we noticed the following pattern:

  1. Client sends an HTTP/2 request to the server, batching RPCs from different workers.
  2. Server ACKs.
  3. Server sends a batched response (multiple datagrams) containing the data for different workers (i.e., also batched).
  4. Client ACKs. Now, the TCP connection has no bytes in-flight.
  5. Around 150–200 µs of inactivity before the subsequent request.

Thus, the major source of latency is the client. And because our microbenchmark logic is minimal, the issue likely lies somewhere in the gRPC layer. It might involve contention within gRPC or batching.

We tried both per-worker gRPC channels and channel pooling (all using the same arguments and sharing the same TCP connection). There was no improvement. Using channels with different arguments works well. Alternatively, it’s sufficient to set the GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL argument as we do here. The best performance (in both throughput and latency) is achieved when each worker has its own channel, and this option is set.
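A minimal sketch of this per-worker channel setup with the option enabled (illustrative names, not the benchmark's actual code):

```cpp
#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

// Create a dedicated channel for one worker. With
// GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL set, the channel uses its own subchannel
// pool instead of the process-wide one, so it gets its own TCP connection
// even if other channels to the same target already exist.
std::shared_ptr<grpc::Channel> MakeWorkerChannel(const std::string& target) {
  grpc::ChannelArguments args;
  args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
  return grpc::CreateCustomChannel(target, grpc::InsecureChannelCredentials(), args);
}
```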

Commands to run the clients:

echo "Multi connection, no streaming"
taskset -c 0-31 ./grpc_ping_client --host ydb-vla-dev04-000.search.yandex.net --min-inflight 1 --max-inflight 48 --warmup 5 --interval 10 --with-csv --local-pool

echo "Multi connection streaming"
taskset -c 0-31 ./grpc_ping_client --host ydb-vla-dev04-000.search.yandex.net --min-inflight 1 --max-inflight 48 --warmup 5 --interval 10 --with-csv --streaming --local-pool

Results:

[Plots: benchmark results with per-worker channels and GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL enabled]

For regular RPCs, we see nearly a 6× improvement in throughput, and 4.5× for streaming RPCs. And when in-flight is increased, latency grows really slowly. It’s possible that another bottleneck or resource shortage exists, but we stopped the investigation at this point.

To better understand when this bottleneck might become an issue, we repeated the same measurements in a network with 5 ms latency:

100 packets transmitted, 100 received, 0% packet loss, time 99108ms
rtt min/avg/max/mdev = 5.100/5.246/5.336/0.048 ms
[Plots: benchmark results over the 5 ms latency network]

It’s clear that in this case everything is OK, and the multichannel version is only slightly faster when in-flight is high.

In our case, the two gRPC client-side “solutions” described in the official best practices — per-channel separation and multi-channel pooling — turned out to be two steps of a single, unified fix. Creating per-worker channels with distinct arguments (or enabling GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL) resolves the bottleneck and delivers both high throughput and low latency.

However, there may be other optimizations we’re unaware of. If you know ways to improve performance or want to contribute, let us know in the comments or by filing an issue/PR on the benchmark’s GitHub repo.
