使用 eBPF 对 Envoy TCP 代理进行零代码检测
Zero-Code Instrumentation of an Envoy TCP Proxy Using eBPF

原始链接: https://sergiocipriano.com/beyla-envoy.html

## 使用 OpenTelemetry eBPF 仪器化 (OBI) 调试网络延迟 最近,作者在云环境中遇到了 HTTP 499 错误,表明存在延迟问题,但缺乏明确的根本原因识别。现有的 Envoy 工具无法充分调试,需要团队实施自定义仪器化。认识到需要更好的解决方案,他们研究了 OpenTelemetry eBPF 仪器化 (OBI) – 一种无需代码的自动仪器化 Linux 服务的方法。 使用简单的 Envoy TCP 代理和 Go 后端进行的初步测试表明,OBI 能够生成详细的追踪信息,并能精确定位延迟来源。随后,实施了一个更复杂的生产级设置,使用了 Incus 容器、OpenTelemetry Collector、Jaeger、Prometheus 和 Grafana。这使得能够通过基于进程 ID 过滤遥测数据,利用 OBI Collector 的处理器能力,有针对性地追踪单个容器。 最终,OBI 揭示了问题:一个错误导致网络编排服务每 10 分钟运行一次 `netplan apply`,短暂地使接口失效,并触发 499 错误。作者得出结论,OBI 是一种强大的可观察性工具,能够在不进行代码更改的情况下实现高效调试,并提供对系统行为的宝贵见解。

黑客新闻 新的 | 过去的 | 评论 | 提问 | 展示 | 工作 | 提交 登录 零代码使用 eBPF 代理 Envoy TCP (sergiocipriano.com) 13 分,由 sergiocipriano 发布 45 分钟前 | 隐藏 | 过去的 | 收藏 | 1 条评论 a012 9 分钟前 [–] 我也正考虑安装 OBI 来排查我们集群中的一个网络问题。也许下次遇到类似情况时可以尝试。 指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系 搜索:
相关文章

原文

I recently had to debug an Envoy Network Load Balancer, and the options Envoy provides just weren't enough. We were seeing a small number of HTTP 499 errors caused by latency somewhere in our cloud, but it wasn't clear what the bottleneck was. As a result, each team had to set up additional instrumentation to catch latency spikes and figure out what was going on.

My team is responsible for the LBaaS product (Load Balancer as a Service) and, of course, we are the first suspects when this kind of problem appear.

Before going for the current solution, I read a lot of Envoy's documentation.

It is possible to enable access logs for Envoy, but they don't provide the information required for this debug. This is an example of the output:

[2025-12-08T20:44:49.918Z] "- - -" 0 - 78 223 1 - "-" "-" "-" "-" "172.18.0.2:8080"

I won't go into detail about the line above, since it's not possible to trace the request using access logs alone.

Envoy also has OpenTelemetry tracing, which is perfect for understanding sources of latency. Unfortunatly, it is only available for Application Load Balancers.

Most of the HTTP 499 were happening every 10 minutes, so we managed to get some of the requests with tcpdump, Wireshark and using http headers to filter the requests.

This approach helped us reproduce and track down the problem, but it wasn't a great solution. We clearly needed better tools to catch this kind of issue the next time it happened.

Therefore, I decided to try out OpenTelemetry eBPF Instrumentation, also referred to as OBI.

I saw the announcement of Grafana Beyla before it was renamed to OBI, but I didn't have the time or a strong reason to try it out until now. Even then, I really liked the idea, and the possibility of using eBPF to solve this instrumentation problem had been in the back of my mind.

OBI promises zero-code automatic instrumentation for Linux services using eBPF, so I put together a minimal setup to see how well it works.

Reproducible setup

I used the following tools:

Setting up a TCP Proxy with Envoy was straightforward: