最优雅的TCP打孔算法

最优雅的TCP打孔算法
A most elegant TCP hole punching algorithm

原始链接: https://robertsdotpm.github.io/cryptography/tcp_hole_punching.html

## 简化的TCP打孔测试算法该算法提供了一种在*无需*复杂基础设施的情况下测试TCP打孔的方法，专注于打孔机制本身。传统的打孔需要同步时间、WAN IP交换和端口发现——这些都容易出错。这种方法通过从单个参数：当前时间戳确定性地推导出所有必要数据来绕过这种复杂性。核心思想是使用时间戳计算一个共享的“桶”编号，考虑到潜在的时钟偏差。然后，该桶会作为伪随机数生成器的种子，以创建潜在端口的共享列表，利用家庭路由器中常见的“等量增量映射”（本地端口和外部端口匹配）。该算法利用非阻塞套接字和积极的SYN数据包发送来尝试打孔。一旦建立多个连接，领导者/跟随者系统（由WAN IP地址确定）将通过简单的单字符交换选择一个“获胜”连接。最终，这种方法允许隔离测试打孔过程，仅需要目标IP地址并最大限度地减少外部依赖。虽然简化用于测试，但它展示了用于功能性（尽管有限）打孔实现的确定性元数据生成的核心原理。

## 优雅的TCP打孔 – 摘要一篇最近的博文详细介绍了一种TCP打孔算法，旨在建立NAT后对等体之间的直接连接。然而，该算法的“优雅”依赖于一个重要的假设：NAT设备会保留出站连接的源端口——这种做法并非普遍实施。讨论强调了这一局限性，评论员指出许多路由器不保持源端口的一致性，从而降低了该算法的有效性。虽然类似的假设存在于其他技术（如STUN）中，但博文声称的“等量增量映射”特性被质疑，一些人认为在已建立的网络术语中没有找到它。对话还涉及NAT的更广泛问题以及IPv6缓解这些复杂性的潜力。然而，即使使用IPv6，防火墙仍然是一个因素，因此需要打孔技术。一些人认为ISP错失了完全采用IPv6的机会，而选择了不太理想的解决方案，如运营商级NAT (CGNAT)。最终，该算法被视为理解NAT穿越的有价值的教育工具，但其在实际应用中受到现实网络配置的限制。尽管增加了复杂性，但TURN中继等替代方案通常更可靠。

原文

TCP hole punching is a way to connect two computers behind NAT routers. The technique has a lot of requirements to work.

Both sides must know each other’s WAN IPs
They must know the right external ports to use
They must connect at exactly the same time

In practice this collapses to using STUN to lookup the WAN IP, doing NAT type enumeration and port prediction, synchronizing time with NTP, and having both sides exchange all the needed meta data (WAN IP, port predictions, future NTP punch time) via some “channel.”

That involves a whole list of infrastructure and code to work – which is complex and error-prone. What if you just wanted to test whether your punching algorithm works? You don’t care about every other part of the software. This is a way to do that.

Part 1: selecting the bucket

A simple way to bypass the need for fixed infrastructure in hole punching algorithms is by using a deterministic algorithm to derive all meta data from a single parameter. Here’s how it works.

First, a starting parameter is chosen based on the unix timestamp. What we need is a number both sides can converge on without requiring any communication. However, as we know in distributed systems theory – there is no “now” in networks. So a little math is used to make the “now.”

now = timestamp()
max_clock_error = 20s # How far off the timestamp can be
min_run_window = 10s # Total time window to run the program for both sides
window = (max_clock_error * 2) + 2
bucket = int((now - max_clock_error) // window)

We define what an accepted solution looks like. There ought to be a certain amount of time that the protocol can run in. A certain range that a timestamp must fall in to be considered valid. This information is combined and then quantized in a way that both sides converge on the same number even given variations to clock skew. This we term the “bucket.”

Part 2: selecting ports

Now that both sides share the same bucket we can use this to derive a list of shared ports. The idea behind the port list is that the local port will equal the external port. Many home routers try to preserve the source port in external mappings. This is a property called “equal delta mapping” – it won’t work on all routers but for our algorithm we’re sacrificing coverage for simplicity.

In order to generate a shared list of ports without communication between the two sides – the bucket is used as a seed to a pseudo-random number generator. This allows for a much smoother conversion of the bucket number which might otherwise have little direct variation.

large_prime = 2654435761
stable_boundary = (bucket * large_prime) % 0xFFFFFFFF

The FF… number fixes the range for the boundary number so it doesn’t overflow what the PRNG accepts as a seed. A prime number is used for the multiplier because if the multiplier shares a common factor with the modulus (0xff..) then the space of possible numbers shrink due to shared factors. A prime number ensures that the number space contains unique items.

The port range is now computed based on using a randrange func. In my code I generate 16 ports and throw out any I can’t bind to. It is expected that there will be some port collisions with existing programs in the OS when manually selecting random ports like this.

Part 3: sockets and networking

It is helpful to review the requirements for TCP hole punching before we proceed. There are very specific socket options that must be set for it to work.

s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)

TCP hole punching involves aggressively reusing socket addresses. Normally with TCP, if a connection fails you close the socket. But with TCP hole punching if you do any kind of cleanup you break the protocol. Cleanup means calling close. And close means possibly sending out an RST packet that tells the remote router “bad connection, disregard.”

If you close the socket the OS will also start its own cleanup process. The socket will enter states like TIME_WAIT and will be harder to reliably reuse the same address even if you do set the right socket options. I also want to add that the only really correct model for TCP hole punching is to use non-blocking sockets. You can’t use blocking sockets because you need to be able to rapidly send SYN packets without having to wait for a response.

Async networking won’t work here either. Async networking prevents your software from being able to precisely control timing and TCP hole punching is highly sensitive to timing. If packet exchange is off by even a few miniseconds the whole protocol can fail. In my opinion: the best way to implement this is to use non-blocking sockets with select for polling. This allows you to properly handle every connection state without impacting timing.

for ... sel.register(s, selectors.EVENT_WRITE | selectors.EVENT_READ)

For punching: we simply call connect_ex on a (dest_ip, port) tuple with a 0.01 second sleep until an expiry (src_port == dest_port.) Actual punching here is not terribly elegant as you’re just spamming out SYNs. But this part needs to be aggressive enough to create a remote mapping but not so aggressive that it kills your CPU. I use a 0.01 second sleep for that.

Part 4: winner selection

In this algorithm multiple ports are used so many successful connections can be returned. The problem is how to choose the same connection? My approach is to start by having the sides choose a leader and follower. The leader has the greater WAN IP number. The leader sends a single character on a connection and cleanly closes the rest.

winner.send(b"$")
...
loser.shutdown(socket.SHUT_RDWR)
loser.close()

The follower then uses select to poll events. If it finds one it calls recv(1) to choose that connection (winner.) The reason a single char is used here is because TCP is stream-based. If the “success” delimiter was a word then the follower would have to implement a buffered-reader algorithm just to ensure they have everything (and I didn’t want that complexity.) A single char is atomic.

Part 5: putting it all together

I present to you: a simple TCP hole punching algorithm requiring only a dest IP to use. Since the protocol is deterministic no infrastructure is required for testing and there is no need for the hosts to exchange meta data to use it. The hosts may still use NTP for time which is probably recommended though optional (older OSes in VMs can’t maintain time well.)

It is assumed another process will coordinate the running of this tool. Though you can easily run everything yourself on multiple terminals as long as the commands fall within the 10 second min_run_window. Here, punching applies for common routers using equal delta allocation.

tcp_punch.py

You can run tcp_punch.py 127.0.0.1 to get a feel for the code.