The 185-Microsecond Type Hint

Original link: https://blog.sturdystatistics.com/posts/type_hint/

## Roughtime Server Performance: A Type-Hint Story

A recently released open-source implementation of the Roughtime protocol, which provides secure time synchronization, gained a surprising 13× throughput improvement from a seemingly trivial code change. The server's request handling involves queueing, protocol compatibility across sixteen versions, Merkle-tree construction, and Ed25519 signatures, all compute-intensive work.

Yet initial profiling showed that 90% of request time was spent in a simple function that computes the lengths of byte arrays. Although the code passed its tests and produced no reflection warnings, the dynamic dispatch and runtime type checks inside a `mapv` call introduced significant overhead.

The fix? A single type hint (`fn [^bytes v] (alength v)`) telling the compiler that the argument is a byte array. This let the compiler emit one efficient bytecode instruction instead of a chain of generic calls.

While isolated tests showed an ~8× speedup, end-to-end benchmarks showed ~13×, likely due to reduced contention on the reflective call path and improved JIT optimization. The takeaway: in Clojure, the absence of reflection warnings does not guarantee optimal performance, and profiling is essential for finding unexpected bottlenecks, even in "simple" code.

## A 185-Microsecond Speedup from a Clojure Type Hint

A recent post on sturdystatistics.com details a surprising performance improvement achieved in Clojure code by adding a type hint. By specifying the type of a byte array, the author cut roughly 185 microseconds off each response.

The initial explanation held that the compiler optimized the code down to a single CPU instruction. Discussion on Hacker News, however, suggests that this is an oversimplification: commenters point out that the emitted bytecode may still contain a checked cast with potential exception handling, and that the speedup is most likely due to JIT (just-in-time) compiler optimization.

Specifically, the type hint lets the JIT compiler assume the type, enabling type guards plus optimizations such as inlining and loop unrolling that are impossible without the hint. The author acknowledges that inspecting the emitted bytecode would be needed for a definitive explanation, and expresses excitement about how far JIT optimization can take a high-level language like Clojure. The post also mentions the Roughtime protocol, a system for cryptographically verifiable time, used here to harden a license server.

Original Article

How a “trivial” change yielded a 13× throughput increase.

We recently released an open-source Clojure implementation of Roughtime, a protocol for secure time synchronization with cryptographic proof.

When a client asks for the time, it sends a random nonce. The server replies with a signed certificate containing both the nonce and a timestamp, proving the response happened after the request. Responses can be chained together with provable ordering; if any server’s timestamps are inconsistent with that ordering, that server is cryptographically “outed” as unreliable.

The Heavy Lifting

A single request to our server triggers a surprising amount of work:

1. Queueing

An incoming request goes through basic validation and enters a “received queue.” This queue is processed by a batcher, which sends batches to one of four worker queues. When a worker queue picks up a batch, it decodes each request, groups them into sub-batches by version number, and responds to each sub-batch. These go into a sender queue which un-batches and sends the responses back to the requesting server.
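The pipeline above can be sketched with core.async channels. This is a hedged illustration only: the channel names, buffer sizes, batch size, and round-robin dispatch are assumptions, not the library's actual implementation.

```clojure
(require '[clojure.core.async :as a])

;; One channel per stage of the pipeline described above.
(def received-q (a/chan 1024))                     ; validated requests
(def worker-qs  (vec (repeatedly 4 #(a/chan 64)))) ; four worker queues
(def sender-q   (a/chan 1024))                     ; encoded responses

;; Batcher: collect up to `batch-size` requests, dispatch batches
;; round-robin across the worker queues.
(defn start-batcher! [batch-size]
  (a/go-loop [batch [] i 0]
    (if-some [req (a/<! received-q)]
      (let [batch (conj batch req)]
        (if (= (count batch) batch-size)
          (do (a/>! (worker-qs (mod i (count worker-qs))) batch)
              (recur [] (inc i)))
          (recur batch i)))
      ;; channel closed: flush any partial batch
      (when (seq batch)
        (a/>! (worker-qs (mod i (count worker-qs))) batch)))))
```

Each worker would then decode its batch, group requests by protocol version, respond to each sub-batch, and push the results onto `sender-q` for un-batching and sending.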

2. Protocol Compatibility

We support the entire evolution of the protocol: from Google’s original specification, through all fifteen IETF drafts – that’s sixteen versions. That means we have conditional logic littered throughout the codebase: version tags, padding schemes, tag labels, hash sizes, and packet layouts all vary with the protocol version. In several places, compatibility won over elegance or optimization.

3. Recursive Merkle Trees

Each batch is rolled into a Merkle tree using SHA-512. That means recursive hashing all the way to the root; this is pure CPU-bound work.
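A minimal sketch of that per-batch construction, using the JDK's `MessageDigest` for SHA-512. It is illustrative only: it omits the leaf/node domain-separation framing the protocol specifies, pads odd levels by duplicating the last node (a common convention, not necessarily Roughtime's), and the function names are assumptions.

```clojure
(import 'java.security.MessageDigest)

(defn sha-512 ^bytes [^bytes b]
  (.digest (MessageDigest/getInstance "SHA-512") b))

(defn merkle-root
  "Hash the leaves, then hash pairwise up to a single root."
  ^bytes [leaves]
  (loop [nodes (mapv sha-512 leaves)]
    (if (= 1 (count nodes))
      (first nodes)
      ;; Pair up adjacent nodes; duplicate the last one if the count is odd.
      (recur (mapv (fn [[l r]]
                     (sha-512 (byte-array (concat l r))))
                   (partition 2 2 [(peek nodes)] nodes))))))
```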

4. Ed25519 Signatures

Finally, each response is signed with Ed25519. Public-key signatures are notoriously expensive and are usually the dominant cost in systems like this.
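For illustration, Ed25519 signing is available through the JDK's own `java.security` API (JDK 15 and later). Whether this library uses that or a dedicated crypto library is not stated here, so treat this as a sketch.

```clojure
(import '[java.security KeyPairGenerator Signature])

;; Long-lived server keypair; each call to sign-response performs a full
;; Ed25519 signing operation, which is the expected dominant cost.
(def keypair
  (.generateKeyPair (KeyPairGenerator/getInstance "Ed25519")))

(defn sign-response ^bytes [^bytes payload]
  (let [sig (Signature/getInstance "Ed25519")]
    (.initSign sig (.getPrivate keypair))
    (.update sig payload)
    (.sign sig)))
```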

The “Sluggish” Server

Given all that complexity, along with the fact that I’m using a high-level dynamic programming language, I wasn’t surprised when my initial benchmarks showed the server responding in 200 microseconds (µs).

I ran a profiler expecting to see SHA-512 or Ed25519 dominating.

Instead, nearly 90% of the runtime was attributed to the most mundane line in the entire library:
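The culprit was a `mapv` over byte arrays summing their lengths. A reconstruction (the exact function and argument names are illustrative; the `^bytes` hint is the actual fix):

```clojure
;; Before: with no hint, `alength` on an untyped argument compiles to a
;; generic clojure.lang.RT/alength call that dispatches on the array's
;; type at runtime -- on every element, for every request.
(defn total-length-slow [arrays]
  (reduce + (mapv (fn [v] (alength v)) arrays)))

;; After: the ^bytes hint lets the compiler emit the single
;; `arraylength` bytecode instruction directly.
(defn total-length-fast [arrays]
  (reduce + (mapv (fn [^bytes v] (alength v)) arrays)))
```

Note that the slow version produces no reflection warning: `RT/alength` is a known static method taking `Object`, so the compiler is perfectly happy with it.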

Benchmarks were measured with Criterium’s quick-benchmark.

Test conditions:

  • Apple M2
  • 4 parallel workers
  • Merkle batch size: 64
  • Full crypto enabled (SHA-512 + Ed25519)

Results:

|                   | Throughput (req/s) | Latency (µs) |
| ----------------- | ------------------ | ------------ |
| Without type hint | 19,959             | 200.4        |
| With type hint    | 264,316            | 15.1         |

That’s a 13× throughput increase from one type hint.

If you plot the comparison, it is striking:

Server throughput before and after the fix. x-axis shows the batch size on a logarithmic scale; y-axis shows the response rate.

Why Did the Speedup Get Larger?

In isolated tests, the improvement was ~8×. Amdahl’s law suggests that, in the real system, we should see a substantially lower improvement. Instead, we saw the improvement grow to ~13×.

I can’t explain this fully, but my working hypothesis is contention in the reflective call path. When multiple workers hit the same reflective, non-inlinable call site, the JVM cannot optimize it effectively. Removing that reflective barrier allows the JIT to inline and parallelize cleanly.

The result: better scaling under load.

The Lesson

I learned that, when optimizing Clojure code, “no reflection warnings” is not always the end of the story. When you pass low-level primitives through higher-order interfaces, you may accidentally force the runtime back onto generic (and slower) paths. The compiler needs enough information to emit primitive bytecode.
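One way to catch this class of problem early is to pair reflection warnings with direct benchmarking, since a clean `*warn-on-reflection*` run is, as noted above, necessary but not sufficient. A sketch using Criterium, the benchmarking library mentioned earlier (the test data is illustrative):

```clojure
(set! *warn-on-reflection* true) ; warns on reflective interop calls, but
                                 ; NOT on generic paths like RT/alength

(require '[criterium.core :refer [quick-bench]])

;; Benchmark both variants on a representative batch of byte arrays.
(let [arrays (vec (repeatedly 64 #(byte-array 64)))]
  (quick-bench (reduce + (mapv (fn [v] (alength v)) arrays)))
  (quick-bench (reduce + (mapv (fn [^bytes v] (alength v)) arrays))))
```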

In this case, the code I thought was complex – the crypto, Merkle trees, and protocol gymnastics – was fine. It was the “trivial” line that killed performance.

Without a profiler, I would never, ever, have suspected it.
