Systematically generating tests that would have caught Anthropic's top‑K bug

Original link: https://theorem.dev/blog/anthropic-bug-test/

## Fractional proofs: systematically finding rare bugs

Traditional testing often misses rare edge cases, leading to production issues. This system introduces "fractional proofs," a method that combines the strengths of property-based testing (PBT) and formal proofs to find such bugs efficiently. Standard PBT is fast but lacks exhaustive coverage, while full proofs are computationally impractical.

Fractional proofs work by encoding program properties as PBTs and then *recursively decomposing* them into smaller, more manageable sub-tests. This decomposition, guided by reasoning about the code, produces targeted unit tests that balance speed, accuracy, and compute efficiency.

Recently, this approach automatically uncovered a bug in Anthropic's approximate top-K sampling implementation, a bug that otherwise required an elaborate reproducer to find manually. The system generated a unit test *without* relying on that reproducer.

By strategically breaking complex theorems into smaller, testable components, fractional proofs scale compute logarithmically with bug rarity, enabling systematic bug detection and improved program correctness. The technique applies not only to top-K sampling but extends to complex codebases and distributed systems.


Most testing strategies miss rare edge cases until customers find them in production. Our system automatically generates targeted unit tests for rare bugs, including the one that would have caught Anthropic’s recent approximate top-K bug. In this blog post, we’ll provide a brief overview of how it works.

Comparison of testing approaches
Figure 1: Unit-level PBTs are fast but miss edge cases. Proofs offer exhaustive coverage but require extensive reasoning and code refactoring. End-to-end PBTs have comprehensive coverage but are not compute-efficient. Fractional proofs sit at the intersection, using proof decomposition to generate targeted unit tests that balance compute efficiency, developer accuracy, and speed.

Catching the rare bug in top-K sampling

A bug in the TPU implementation of approximate top-K resulted in the most likely token sometimes being excluded. Rare bugs like this frequently slip through to production because covering every behavior with testing is infeasible in practice. After discovery, Anthropic provided a simple reproducer of the bug, but it is the sort of test you only manage to write after a laborious bug minimization process.

We used fractional proof decomposition to automatically generate the unit test without relying on Anthropic’s bug reproducer code. You can run the unit test on colab. For any code, if testing is done via fractional proof decomposition, bugs can be systematically found without the benefit of hindsight.

```python
from hypothesis import given, strategies as st
import jax.numpy as jnp
from jax import lax

# TOP_K_RANGE, MIN_TOP_K, and arr_strategy are defined in the accompanying colab.
@given(k=st.integers(min_value=0, max_value=TOP_K_RANGE), arr=arr_strategy)
def test_approx_max_k(k, arr):
    N = len(arr)
    # Map the drawn k into the valid range starting at MIN_TOP_K.
    k = int(k % min(N - MIN_TOP_K, TOP_K_RANGE)) + MIN_TOP_K

    approx_values, _ = lax.approx_max_k(arr, k=k)
    assert jnp.max(approx_values) == jnp.max(arr), \
        f"true max {jnp.max(arr)} missing from approximate top-{k}"
```
Figure 2: Top-K sampling should always have some chance of picking the most likely token. We encode this property as a PBT (property-based test): max(approximate_top_k(arr, k=k)) == max(arr). If the implementation of lax.approx_max_k is correct, we should expect the test to pass, because the approximate top-K algorithm divides the data points into L bins and computes the true max in each bin, with L chosen based on the desired average recall r.
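The property holds by construction of the binned algorithm: every element lands in exactly one bin, and each bin's true max is computed, so the global max is always among the bin maxima. Here is a plain-Python sketch of that scheme (our own simplification for illustration, not the TPU kernel; `approx_top_k_sketch` is a hypothetical name):

```python
# Toy sketch of binned approximate top-K: split the array into num_bins
# bins, take the true max of each bin, then keep the top k bin maxima.
def approx_top_k_sketch(arr, k, num_bins):
    # Round-robin assignment of elements to bins.
    bins = [arr[i::num_bins] for i in range(num_bins)]
    bin_maxima = [max(b) for b in bins if b]
    return sorted(bin_maxima, reverse=True)[:k]

# Every element lands in exactly one bin, so the global max is some bin's
# max and, for k >= 1, must appear in the output.
result = approx_top_k_sketch([3, 1, 9, 2, 7, 5], k=2, num_bins=3)
assert max(result) == 9
```

A correct implementation can never drop the true max under this scheme, which is why a single equality check on the max makes an effective unit-level property.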

Systematically generating tests via fractional proof decomposition

Fractional proof decomposition process
Figure 3: We encode theorems as property-based tests (PBTs), recursively decompose them into smaller sub-theorems using reasoning, and then fuzz the theorems (i.e., run the PBTs) once the decomposition yields compute-efficient unit-level PBTs.

Step 1: Identify the theorem, which is the property that your implementation must satisfy. Then, encode it as a PBT using the Hypothesis framework. We call the top-level theorem an end-to-end PBT because it corresponds to the end-to-end behavior of the function.

The theorem for the top-K bug is:

$$\forall\ \text{prompt}, k,\ LLM_{\text{top-1}}(\text{prompt}) \in LLM_{\text{top-}k}(\text{prompt})$$

For any $\text{prompt}$, $LLM_{\text{top-}k}(\text{prompt})$ represents the model's prediction of the top-K most likely next tokens. Now we can encode this as an end-to-end PBT. Since the end-to-end PBT does not need to run on a TPU, we set it up in a separate colab.

```python
from hypothesis import given, settings, strategies as st
from vllm import SamplingParams

# llm, MAX_SIZE, VOCAB_SIZE, and MAX_LOGPROBS are defined in the colab setup.
@given(prompt=st.text(min_size=1, max_size=MAX_SIZE), k=st.integers(1, VOCAB_SIZE))
@settings(max_examples=50, deadline=None)
def test_top_token_present(prompt, k):
    greedy_params = SamplingParams(temperature=0.0, max_tokens=1, logprobs=1)
    topk_params = SamplingParams(temperature=1.0, max_tokens=1, logprobs=MAX_LOGPROBS, top_k=k)
    most_likely_token = llm.generate([prompt], greedy_params, use_tqdm=False)[0].outputs[0].token_ids[0]
    logprobs = llm.generate([prompt], topk_params, use_tqdm=False)[0].outputs[0].logprobs[0]

    assert most_likely_token in logprobs
```
Figure 4: The theorem stating that the most likely token should always be included in the top k tokens, encoded as a PBT.

Although the end-to-end PBT has comprehensive coverage, catching rare bugs with it would take an excessively large number of sampled tokens. The rarer the bug, the more compute is required.

Compute requirements for rare bugs
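To make that scaling concrete, here is a back-of-envelope cost model (our own illustration, not a measurement from the post):

```python
# Toy cost model (our own illustration): the expected number of random
# samples needed to trigger a bug that fires with probability p is ~1/p.
def expected_samples(p):
    return 1 / p

# A one-in-a-million bug needs on the order of a million end-to-end runs,
# and each end-to-end LLM run is itself expensive.
assert round(expected_samples(1e-6)) == 1_000_000
```

At the end-to-end level every sample is a full model invocation, so the cost of the fuzzing campaign grows linearly with the rarity of the bug.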

Step 2: Recursively decompose the theorems into a collection of smaller theorems, which are also encoded as PBTs. These smaller theorems are intermediate results that compose together to establish the original end-to-end PBT.

Decomposing the end-to-end PBT for the top-K bug yields, by construction, three theorems:

  1. max(approximate_top_k(arr, k=k)) == max(arr) (true max always included)
  2. On any input tokens, the logits are finite (not ∞ and not NaN)
  3. In vLLM, the token ids are the same as the logprobs dict keys
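To see why these three sub-theorems suffice, here is a stdlib-only sketch (the names and the exact-top-k stand-in are ours, not the vLLM or TPU code) of how they compose into the end-to-end property:

```python
import math
import random

def check_composition(logits, k):
    # Sub-property 2: logits are finite (no inf/NaN to break comparisons).
    assert all(math.isfinite(x) for x in logits)
    # Sub-property 1 analogue: a top-k that keeps the true max (here exact).
    topk_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Sub-property 3 analogue: returned keys really are token ids, so
    # membership of the argmax id in the key set is meaningful.
    top1 = max(range(len(logits)), key=lambda i: logits[i])
    # End-to-end property: the most likely token is in the top-k set.
    assert top1 in topk_ids

# Fuzz the composed property over random logit vectors.
random.seed(0)
for _ in range(100):
    logits = [random.gauss(0, 1) for _ in range(50)]
    check_composition(logits, k=random.randint(1, 50))
```

If any one sub-property is violated (e.g., a kernel that can drop the true max, as in the bug), the end-to-end guarantee no longer follows, which is exactly what the unit-level PBT detects.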

You can think of PBTs as fractional components of a brute-force proof. Just as a brute-force proof is optimized by decomposing properties into logical sub-properties via reasoning (better known as partial evaluation), we apply reasoning to decompose the fractional brute-force proofs. The reasoning bootstraps trust in PBT coverage, so even though you don't exhaustively check every single input as a formal proof would, you get a systematic understanding of your programs and control over how you spend your testing compute. We call this sampling technique fractional proofs.

Step 3: Continue decomposing until the input space is small enough to be compute efficient at catching rare bugs. You stop decomposing when:

  1. Each sub-test runs sufficiently quickly,
  2. Each sub-test tests a sufficient fraction of its input distribution so as to catch most bugs that would end up in code of similar complexity to that being tested, and
  3. The sub-properties provably compose to cover the full end-to-end property

While we found the top-K bug in about 10 minutes of sampling, we were able to find the XLA:TPU bug (also discussed in Anthropic's post), involving an excess-precision issue, in just a few seconds.

Fractional proofs for efficient oversight

Systematic decomposition catches rare bugs without sacrificing developer speed or compute efficiency. Instead of scaling compute proportional to the rarity of the bug, fractional proofs scale compute as the logarithm of rarity.

Compute scaling comparison
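As a toy illustration of that scaling claim (our own cost model, not Theorem's analysis): if a bug fires only when n independent rare conditions hold simultaneously, a monolithic fuzzer pays for the product of the rarities, while per-condition sub-tests pay only for their sum.

```python
import math

def end_to_end_cost(ps):
    # A monolithic fuzzer must hit all conditions at once: ~1/prod(p_i) samples.
    total = 1.0
    for p in ps:
        total *= p
    return 1 / total

def decomposed_cost(ps):
    # One targeted sub-test per condition: ~sum(1/p_i) samples.
    return sum(1 / p for p in ps)

ps = [1e-3] * 3  # three independent one-in-a-thousand conditions
assert math.isclose(end_to_end_cost(ps), 1e9)  # ~a billion end-to-end samples
assert math.isclose(decomposed_cost(ps), 3e3)  # ~three thousand unit-level samples
```

Here the overall rarity is the product of the p_i, so the number of sub-tests n, and hence the decomposed cost, grows only logarithmically with it.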

We can straightforwardly extend the approximate top-K example in this post to real-world codebases. For example, top-K can be decomposed into a sequence of PBTs testing how libtpu implements the algorithm described in its reference paper. Or we can use this reasoning to establish how single-TPU behavior composes into cluster behavior.

At Theorem, we’re training models that can automatically reason about program correctness. If you want to catch bugs earlier and make your devs happy, send me an email.
