The iPad was on Tailscale: a WebRTC debugging story

Original link: https://p2claw.com/blog/2026-06-09-the-ipad-was-on-tailscale/

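The author ran into an intermittent, device-specific bug: a p2claw web app hung on an iPad but ran fine on other devices. The team first suspected a WebKit bug and only after about two weeks of investigation realized it was a heisenbug caused by two unrelated design choices colliding. First, the `webrtc-rs` library uses a hardcoded MTU constant that, once header overhead is added, produces packets larger than the path can carry. Second, Tailscale, which the iPad was running, silently drops IPv6 fragments because its packet filter classifies them as "unknown protocol" and the default deny applies. Because the iPad happened to route its traffic over the Tailscale IPv6 path, its large WebRTC packets were fragmented and then dropped by Tailscale's filter, leaving the app waiting for data indefinitely. The bug looked intermittent because the connection sometimes took a path that did not trigger the fragment drop. The author concludes that developers who send large UDP packets must either probe the path's limits or use conservative packet sizes. The experience is also a warning: when a bug affects a single device, the culprit is often the network path only that device takes, not the device itself.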

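A recent debugging investigation uncovered why WebRTC connections were failing over Tailscale. The author found that two independent problems combined to cause a "silent wedge":

  1. webrtc-rs bug: the library hardcodes an `INITIAL_MTU` of 1228 and has no path MTU discovery, so it keeps retransmitting oversized packets.
  2. Tailscale bug: the packet filter classifies every IPv6 packet that carries a Fragment header as "unknown protocol," which triggers the default-deny ACL action.

Because standard health checks use small packets, they pass and mask the problem. The failure only shows up when real WebRTC payloads need fragmentation. The author provides a simple reproduction (pinging with large packets) and reported the issues to both projects. Commenters described the frustration of "MTU black holes" and debated the architectural decision to drop IPv6 fragments in Tailscale's filter.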

Original article

If you're not familiar with how p2claw works, it's worth reading the "how it works" blog post before diving into this one.

I opened one of my p2claw apps on my iPad and got a blank page. The same URL was working on my Mac, my Linux box, and my phone. Same Wi-Fi, same browser engine, same network.

Like in a good detective story, we came up with a bunch of suspects [the iPad, then WebKit, then Tailscale] and they all turned out to be innocent. Sort of. It turned out to be two bugs wearing a trenchcoat: a hardcoded constant in webrtc-rs, and a one-line design decision in Tailscale that we found through sheer stubbornness. We had a workaround patched the same day, but understanding what we had actually patched took two more weeks.

The complaint

The app loaded enough HTML to paint the loading state and then hung. There were no relevant console errors, the Service Worker registered, the WebRTC handshake finished, the data channel opened [dc.readyState === "open"], and then nothing. The browser sent its first GET / over the data channel and waited forever for the response.

The box agent on the other end thought everything was fine. It had served the response and pushed the bytes onto the channel. They just never made it to the iPad.

If that wasn't tricky enough, it was a heisenbug: if I refreshed like crazy, the page would sometimes load.

When in doubt, instrument

The first useful thing we did was log both ends of the connection and line the logs up by clock time: every chunk the box sent, every chunk the browser received, and, crucially, how much data the box was holding in its outbound buffer waiting to be confirmed delivered. That helped us figure out where the data was not making it to the other end.

Dead ends

After ruling out everything up to and including the WebRTC handshake, we were grasping at straws. We checked some WebRTC-specific limits and double-checked network stability.

  • A message-size limit. In WebRTC, before two peers start exchanging data they agree on the largest single chunk each will accept. If you send something bigger, some browsers just silently hang up. We read that limit (maxMessageSize) off both devices. The iPad reported 64 KB, exactly the same as the Mac, and far above the 7-8 KB chunks we were sending. After this, we felt we had ruled out message chunk size as a culprit, which ended up making the true diagnosis harder to arrive at.
  • Flaky Wi-Fi. The cheapest explanation: packets getting lost over the air. ifstat and tcpdump were clean on the box, and my phone [on the same Wi-Fi] did not exhibit the same problem.

It had to be something specific to the iPad, but we had no idea what.

What the numbers actually said

Per request, the box sent three chunks: a 220 byte header, a 7,874 byte body, and a 199 byte tail. Our new instrumentation showed the sender's outbound buffer climb to about 8 KB and stop. It was holding the body it had "sent" but could never get confirmation that it had arrived. When the iPad refreshed, we saw the identical pattern.

WebRTC data channels guarantee in-order delivery on top of lossy UDP, so one missing chunk blocks subsequent messages. On the iPad, in the browser's JS console, we saw exactly one chunk being received [the 220 byte header] and then nothing. We didn't see the body or the small headers of the following requests.

We tested on Safari on the Mac, guessing the issue might be WebKit since it happened on every iOS browser [and all iOS browsers are WebKit under the hood], but the Mac was receiving 8 KB and 11 KB chunks without a hiccup.

"It was Tailscale"

After two hours of WebKit theories, I realized that, unlike the Mac, the iPad had Tailscale enabled.

Tailscale is a VPN, and a VPN wraps your traffic in an extra layer that leaves less room in each packet. So the big responses got sliced into more, smaller pieces on the way to the iPad than they did to the Mac. WebKit implements data channels itself, in userspace, including reassembling big messages from the packets that carry them. Our theory evolved toward a bug in WebKit's message reassembly.

We capped the box's messages at 800 bytes, small enough that each one rode a single packet, and the iPad loaded instantly, Tailscale on or off. It felt like case closed [actually, a first attempt at 1,200 bytes, which Claude helped me calculate should fit, mysteriously didn't work. Hold that thought].

In hindsight, we had just discovered that the issue was the VPN, and yet we stuck to our WebKit theory. Given our context bloat [both mine and the agents'; this is troubleshooting in the age of AI after all], the Tailscale discovery got absorbed into the WebKit theory instead of challenging it. We could have looked at the network and the WebRTC sender, but instead we took it as one more reason the browser was at fault. So we wrote the incident up as an iOS Safari bug [the device gets the packets but never reassembles them for the app] and started building a standalone reproduction to prove it.

The repro that wouldn't repro

For the next two weeks, the bug refused to reproduce. It didn't repro with a JavaScript sender, so we turned to a webrtc-rs based Rust sender. Still nothing. We matched the data channel chunk shapes and sizes, and used a real browser receiver both on Linux and on the iPad, with and without Tailscale. It delivered everything, every time. Eventually we had to re-read our own evidence [actually Anthropic released Fable and I had it dig up the JSONL logs from the original debugging session].

The decisive numbers were in WebRTC's own getStats() counters, which our client logs to the console and which we'd captured in photos of the screen during the incident. The iPad's candidate pair froze at 2,144 bytes received across 18 packets, while the data channel had delivered exactly one message [266 bytes, our 220 byte header plus framing]. The box was retransmitting the big packet the whole time. If Safari were getting those packets and merely failing to stitch the message back together, the transport counter should have climbed by another kilobyte-plus with every retransmission while the message stalled. It never moved. The packets were not arriving at all.
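The inference in that paragraph can be written down as a small check. This is only a sketch: `Snapshot`, `classify_stall`, and the `min_growth` threshold are hypothetical names made up for illustration, not part of p2claw or of any WebRTC API. The two fields mirror the counters named above, and the 1,000 byte threshold is an assumption standing in for "another kilobyte-plus with every retransmission".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Two getStats() counters noted down at one moment in time."""
    transport_bytes_received: int  # candidatePair.bytesReceived
    messages_received: int         # dc.messagesReceived


def classify_stall(earlier: Snapshot, later: Snapshot, min_growth: int = 1000) -> str:
    """Guess which layer is stuck while the sender keeps retransmitting.

    min_growth is an assumed threshold: roughly one retransmitted big packet.
    Smaller growth is treated as heartbeats and acks, not payload.
    """
    if later.messages_received > earlier.messages_received:
        return "delivering"
    growth = later.transport_bytes_received - earlier.transport_bytes_received
    if growth >= min_growth:
        # Big packets reach the device but no message completes:
        # suspect the receiver's message reassembly.
        return "receiver-side stall"
    # The big packets never show up in the transport counter:
    # suspect the network path.
    return "path stall"


frozen = Snapshot(transport_bytes_received=2144, messages_received=1)
print(classify_stall(frozen, Snapshot(2144, 1)))
```

With the numbers from the photo, where both counters stay frozen between readings, the sketch lands on the path rather than the browser.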

[Photo: Web Inspector console during the freeze, showing dc.messagesReceived=1, candidatePair.bytesReceived=2144, packetsReceived=18, and the "body pump stalled" error]

Actual photo from the night of the incident. Every number that mattered is in frame, but it took us two weeks to understand them.

So we stopped trying to reproduce a browser bug and reproduced the network instead.

Suspect number one: webrtc-rs

webrtc-rs, the Rust WebRTC stack our box uses, cuts its outgoing data-channel messages into packets sized against this:

// sctp/src/association/mod.rs
pub(crate) const INITIAL_MTU: u32 = 1228;

It's not configurable and nothing ever updates it. The 1,228 byte packet plus the encryption layer that wraps it comes out to 1,265 bytes on the wire. Add the 28 bytes of UDP and IPv4 headers, or 48 for UDP and IPv6, and that's a 1,293 byte packet over IPv4, or 1,313 bytes over IPv6. Tailscale's tunnel carries at most 1,280 bytes.
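The arithmetic is easy to get wrong by a header or two, so here it is spelled out as a sketch. The sizes are the ones quoted above. The 37 byte encryption overhead is not a documented constant: it is simply 1,265 minus 1,228, and treating it as fixed is an assumption. The function names are made up for illustration.

```python
UDP_HEADER = 8
IP_HEADER = {"ipv4": 20, "ipv6": 40}

SCTP_PACKET = 1228        # INITIAL_MTU in webrtc-rs, as quoted above
ENCRYPTED_PAYLOAD = 1265  # the same packet after the encryption layer
TUNNEL_MTU = 1280         # what the Tailscale tunnel carries


def ip_packet_size(udp_payload: int, family: str) -> int:
    """Size of the IP packet that carries a UDP payload of the given size."""
    return udp_payload + UDP_HEADER + IP_HEADER[family]


def largest_sctp_packet(path_mtu: int, family: str) -> int:
    """Largest SCTP packet that still fits the path without fragmenting.

    Assumes the encryption overhead stays at the 37 bytes implied above.
    """
    overhead = ENCRYPTED_PAYLOAD - SCTP_PACKET
    return path_mtu - IP_HEADER[family] - UDP_HEADER - overhead


for family in ("ipv4", "ipv6"):
    size = ip_packet_size(ENCRYPTED_PAYLOAD, family)
    print(f"{family}: {size} bytes, {size - TUNNEL_MTU} over the tunnel limit; "
          f"largest SCTP packet that fits: {largest_sctp_packet(TUNNEL_MTU, family)}")
```

Under those assumptions the packet overshoots the tunnel by 13 bytes on IPv4 and 33 bytes on IPv6, so anything sized against 1,228 fragments on either address family.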

It turns out that the packet being too big is not fatal by itself. When the kernel routes a large packet into the tunnel, it does the polite thing the IP layer has done since the eighties: it fragments. It sends two pieces over the wire, each under the limit, and they get reassembled on the other side. We confirmed this with tcpdump. The fragments leave the box. On a healthy path everything arrives and the bug is invisible, which is exactly why our standalone repro kept passing.

We were back to the drawing board. In the repro, the packets fragmented and reassembled neatly; in the incident, the iPad froze. So the question wasn't why the packet was too big. It was: where did the fragments go?

Back to the actual box agent

To answer that, we went back to the real thing. We cranked the box agent's chunk cap back up to 8kb, served a real app through it, and loaded it on the iPad over Tailscale while capturing on the tunnel interface.

It wedged on cue, and this time we were watching both layers at once. The agent's outbound buffer froze at 13 KB [not 8 KB, because it was a different app with a different payload]. On the wire, the same 1,265 byte payload left as two IPv6 fragments and got retransmitted on SCTP's textbook backoff schedule: +1.2s, +2s, +4s, +8s. Identical fragments every time, never acknowledged. And the whole time, small packets kept flowing in both directions like nothing was wrong. Heartbeats, acks for old data, connectivity checks, all fine. The connection looked perfectly healthy except for the actual data payloads.

Then a Linux laptop on the same tailnet loaded the same app through the same tunnel just fine. Which gave us the experiment that cracked the whole thing open.

The ping that needed no WebRTC

If fragments were dying somewhere on the iPad's path, we didn't need WebRTC to prove it. We tried ping.

A 1,400 byte ping forces fragmentation through a 1,280 byte tunnel. A 100 byte ping doesn't. Run both, over both address families, and you get a truth table:

ping -s 100  <ipad over IPv4>    3/3 received
ping -s 1400 <ipad over IPv4>    3/3 received     fragments fine
ping -s 100  <ipad over IPv6>    3/3 received
ping -s 1400 <ipad over IPv6>    0/3, 100% loss   fragments gone

IPv4 fragments reassemble. IPv6 fragments vanish. Deterministically, every run.

It wasn't an iOS thing: every Tailscale device we pointed this at exhibited the same packet loss. There is something in Tailscale itself, on every platform, that eats IPv6 fragments.

The counter that confessed

Tailscale's client keeps diagnostic counters, and on the receiving machine one of them increments when we ping -s 1400 over IPv6. The output of tailscale metrics print includes:

tailscaled_inbound_dropped_packets_total{reason="acl"} 6

Three pings, two fragments each, six drops. The arithmetic matched on every machine we checked. The kernel's own IPv6 reassembly counters stayed at zero the whole time; the fragments were being dropped before the operating system ever saw them.
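The "two fragments each" figure follows from how IPv6 fragmentation works, and can be checked with a short sketch. It assumes the sender fragments against the tunnel's 1,280 byte limit and that the packet has no other extension headers. The function names are hypothetical.

```python
import math

IPV6_HEADER = 40
FRAGMENT_HEADER = 8
ICMPV6_HEADER = 8


def ipv6_fragment_count(upper_layer_bytes: int, mtu: int) -> int:
    """Number of IPv6 packets needed for an upper-layer payload (its header included)."""
    if IPV6_HEADER + upper_layer_bytes <= mtu:
        return 1  # fits in one packet, no Fragment header needed
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    return math.ceil(upper_layer_bytes / per_fragment)


def expected_drops(pings: int, ping_payload: int, mtu: int = 1280) -> int:
    """Packets a fragment-dropping filter would count for a run of ping -s."""
    fragments = ipv6_fragment_count(ICMPV6_HEADER + ping_payload, mtu)
    # An unfragmented packet carries no Fragment header, so nothing is dropped.
    return pings * fragments if fragments > 1 else 0


print(expected_drops(pings=3, ping_payload=1400))
print(expected_drops(pings=3, ping_payload=100))
```

The same function gives two fragments for the 1,265 byte WebRTC payload plus its UDP header, which matches what the tunnel capture showed.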

reason="acl" means the packet filter dropped them as a policy denial. Which is a strange thing to see on a personal tailnet whose access policy is allow everything. So we went to GitHub to have a look at the source [Tailscale's client is open source, which made this whole hunt possible]. There we learned that their IPv6 parser doesn't parse fragments. Any packet carrying an IPv6 Fragment header gets classified as "unknown protocol," and an unknown-protocol packet can't match any allow rule, so the default deny fires. The comment in the code reads:

Note that this means we don't support fragmentation in IPv6. This is fine, because IPv6 strongly mandates that you should not fragment.

It's a reasonable-sounding line, and I think it's a misreading. IPv6 forbids routers from fragmenting packets in flight. It fully allows the sender to fragment, and the spec requires the receiving end to put the pieces back together. Our sending kernel was following the rules. Tailscale's filter drops what the kernel produced, by design, silently, and files it under "acl." IPv4 fragments, for what it's worth, get proper handling and sail through, which contributed to our heisenbug.
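To make the failure mode concrete, here is a toy model of that decision path. It is emphatically not Tailscale's code: `classify` and `filter_packet` are hypothetical names, and the only facts it relies on are the standard IPv6 Next Header numbers (6 for TCP, 17 for UDP, 58 for ICMPv6, 44 for the Fragment header) and the behavior described above.

```python
from typing import Optional

# Standard protocol numbers as they appear in the IPv6 Next Header field.
KNOWN_PROTOCOLS = {6: "tcp", 17: "udp", 58: "icmpv6"}
IPV6_FRAGMENT = 44  # a Fragment extension header, not a transport protocol


def classify(next_header: int) -> str:
    """Toy classifier: anything it does not recognize becomes 'unknown'."""
    return KNOWN_PROTOCOLS.get(next_header, "unknown")


def filter_packet(next_header: int, allowed: set[str]) -> Optional[str]:
    """Return None to accept the packet, or a drop reason."""
    if classify(next_header) in allowed:
        return None
    # No allow rule can name 'unknown', so default deny fires.
    return "acl"


allow_everything = {"tcp", "udp", "icmpv6"}
print(filter_packet(17, allow_everything))             # plain UDP
print(filter_packet(IPV6_FRAGMENT, allow_everything))  # fragmented UDP
```

In this model even an allow-everything policy cannot let a fragment through, because the UDP inside it is never identified as UDP.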

Every loose end ties off

  • Why the iPad and not the Mac? The Mac wasn't on Tailscale. Not an Apple problem, a which-route-got-picked problem.
  • Why the iPad and not the Linux laptop, when both are on Tailscale? In the WebRTC handshake, the two peers advertise several addresses each and WebRTC picks one pair. The issue only happens on the Tailscale IPv6 pair. The iPad nominated it every single time; the Linux browser kept landing on IPv4 or the plain LAN, where everything works.
  • Why did refreshing sometimes work? Each reload re-runs the handshake. Back in May the iPad occasionally drew a route that wasn't the v6 tunnel, and the page loaded. A genuine browser bug wouldn't come and go connection to connection.
  • Why did the connection never recover or error out? Because every packet that tests the path is small. Connectivity checks, heartbeats, acks are all under the limit and get delivered. Every layer's health check passes while the payload gets stuck.
  • Why did 1,200 byte messages still fail? webrtc-rs pads its first packet out to the full 1,228, which put us over the line. [This is the "hold that thought" from earlier.]
  • Why did 800 work? Comfortably under the limit even with all the overhead, on either address family. Nothing to fragment means no fragments to drop.
  • Why doesn't everything on a VPN break this way? Ordinary https:// traffic negotiates its packet size up front [TCP MSS clamping] so it never oversends. The kind of traffic WebRTC uses has no such negotiation, and almost nothing else sends large UDP over v6 without its own MTU handling, so the trap sits unsprung until something like webrtc-rs walks into it.
  • In hindsight, why did it take so long to solve? There are two reassemblies in this story: WebKit stitching packets back into messages up in the data channel stack, and the kernel stitching fragments back into packets as part of IP. We spent two weeks accusing the first one. The guilty party was one layer down, and it never even got to run, because Tailscale ate its inputs.

Reproduce it yourself

The two-command version needs nothing but a tailnet with two devices:

ping -s 100  <any tailscale IPv6 address>     works
ping -s 1400 <any tailscale IPv6 address>     100% loss

The full WebRTC version is at github.com/phact/mtu-webrtc-bug: a tiny relay that drops oversized packets [the deterministic stand-in for the fragment-eating path], plus captures from the real tunnel and a writeup of the localization, diagnostics/who-loses-the-packets.md. Next, we're reporting the constant to the webrtc-rs maintainers with a suggested fix, and we're filing an issue for the fragment drop in the tailscale repo.

What I'd take from this

Packets that are too big for the path, silently vanishing, with nobody told why, is one of the internet's oldest problems. It never got solved so much as papered over, and it resurfaces whenever new software sends its own packets without checking what the path accepts. If you're building anything like that [video calls, games, peer-to-peer anything], assume a real fraction of your users are on a path smaller than you'd expect, and either keep your packets conservatively small or probe before you trust.
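"Probe before you trust" can be as simple as a binary search over packet sizes. The sketch below is a simplified illustration, not a standards-complete path MTU discovery implementation, and not what webrtc-rs or Chrome does. `probe` is a hypothetical callback you would supply: it sends one padded packet of the given size and returns True only if the peer acknowledges it. The sketch also assumes the path is monotonic, meaning that if a size gets through, every smaller size does too.

```python
from typing import Callable


def probe_path_limit(probe: Callable[[int], bool],
                     floor: int = 576, ceiling: int = 1500) -> int:
    """Binary-search the largest packet size that probe() reports as delivered."""
    if not probe(floor):
        raise RuntimeError(f"even {floor} byte packets are not getting through")
    low, high = floor, ceiling
    while low < high:
        mid = (low + high + 1) // 2  # round up so the search always makes progress
        if probe(mid):
            low = mid       # mid fits; the limit is at least mid
        else:
            high = mid - 1  # mid was lost; the limit is below mid
    return low


# Stand-in for a real network: a path that silently drops anything over 1,232 bytes.
def fake_path(size: int) -> bool:
    return size <= 1232


print(probe_path_limit(fake_path))
```

On a real network a single lost probe proves nothing, so each probe would need a retry or two before a size is declared too big, and the result should be re-checked when the route changes.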

Neither project here did anything crazy. webrtc-rs picked a constant [which by the way is only 28 bytes more optimistic than Chrome's] and trusted the network to cope. Tailscale decided IPv6 fragments weren't worth supporting and trusted that nothing legitimate sends them. Both decisions are defensible in isolation. Together they form a trap with no error message, where the only symptom is a blank page on one specific device, and it is liable to cost you a week or two.

Part of me wants to say we only hit this because p2claw uses things in weird ways. Which is true, and is also the whole point of p2claw. p2claw exists so that agents can self-host with no signup, so vibe coders can deploy with OAuth with a single CLI call, so web apps can be peer-to-peer. To do that we bypass a bunch of machinery most software relies on to participate in the internet. This is what programming is all about: bending the system and the standards to your will. The more you bend, the wackier the bugs.

Two debugging lessons I'm keeping. First: a sender-side packet capture only proves the packets left. We "verified" fragments flowing with tcpdump on the box and called the path healthy; the fragments were leaving beautifully and dying on arrival, every time. Watch the receiver [easier said than done when the receiver is a non-jailbroken iPad, but the point holds]. Second: when a bug only shows up on one device, before you blame the device, ask what path only that device takes.

The iPad was fine. The iPad was just on Tailscale. And Tailscale was just doing what the comment says it does.


UPDATE: Both issues are filed. The webrtc-rs constant is webrtc-rs/webrtc#806 and the IPv6 fragment drop is tailscale/tailscale#20083.
