GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

Original link: https://github.com/openai/codex/issues/30364

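An analysis of Codex telemetry from February to June 2026 shows a marked anomaly in GPT-5.5 response metadata: reasoning-token counts cluster disproportionately at 516, 1034, and 1552 tokens. GPT-5.5 accounts for only 19.3% of all responses but 82% of the "exactly 516 tokens" events. The pattern is much weaker or absent in the other models, and it moves in the opposite direction to overall reasoning-token intensity: the clustering events increased while mean and P90 reasoning-token counts fell. This suggests a model-specific threshold, truncation, or budget constraint rather than natural variation in task complexity. The clustering also coincides with reported degradation of Codex on complex tasks. The share of GPT-5.5 responses at or above 516 reasoning tokens that land on exactly 516 is roughly 33 times the non-GPT-5.5 baseline. The issue asks the Codex team to investigate whether these fixed token spikes come from an intentional internal threshold, a routing anomaly, or system-level truncation. Suggested validation includes comparing performance on "exactly 516 tokens" tasks against tasks with variable reasoning length, and auditing GPT-5.5's scheduling logic.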

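A Hacker News discussion highlighted a worrying performance problem in the GPT-5.5 Codex model. The original poster noted that the model shows a "clustering" pattern in which reasoning-output token counts consistently cluster at fixed intervals of 518 (the spikes at 516, 1034, and 1552 are 518 tokens apart). This behavior is reported to correlate strongly with higher error rates on complex tasks. It appears to be specific to GPT-5.5: users noted that it is markedly less common in 5.4 and almost absent in 5.2 and 5.3. In the comments, users corroborated the findings and described a "staggering" intelligence gap between GPT-5.5 Codex and standard GPT-5.5 or other models such as Claude. Developers suggested that Codex is still useful for some tasks but has become unreliable for complex reasoning, so many now have a stronger model do the reasoning before handing the task to Codex for execution.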

Original text

Summary

I found an aggregate pattern in Codex token_count metadata: gpt-5.5 responses disproportionately land at exactly reasoning_output_tokens = 516, with additional fixed-boundary spikes around 1034 and 1552.

This appears model-specific and coincides with lower overall reasoning-token intensity, which may help explain degraded performance on complex/high-stakes Codex tasks.

This is related to #29353, which reported a task-level reproduction where gpt-5.5 runs ending at exactly 516 reasoning tokens returned the wrong answer. This issue adds aggregate evidence across a larger Feb-Jun window.

I am not claiming this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior.

Environment

Evidence

| Metric | Value |
| --- | --- |
| Response-level token records analyzed | 390,195 |
| Sessions represented | 865 |
| Exact `reasoning_output_tokens = 516` events | 3,363 |
| GPT-5.5 share of all responses | 19.3% |
| GPT-5.5 share of exact-516 events | 82.0% |
| GPT-5.5 exact-516 / >=516 ratio | 44.0% |
| Non-GPT-5.5 exact-516 / >=516 ratio | 1.3% |

Model-level result:

| Model | Response records | Exact 516 / >=516 |
| --- | --- | --- |
| `gpt-5.5` | 75,401 | 44.0% |
| `gpt-5.4` | 25,214 | 19.8% |
| `gpt-5.2` | 247,575 | 0.34% |
| `gpt-5.3-codex` | 13,333 | 0.0% |
| `gpt-5.3-codex-spark` | 26,179 | 0.0% |

Monthly exact-516 clustering increased sharply:

| Month | Exact 516 / >=516 |
| --- | --- |
| Feb 2026 | 0.11% |
| Mar 2026 | 2.45% |
| Apr 2026 | 4.25% |
| May 2026 | 53.30% |
| Jun 2026 | 35.84% |

At the same time, overall reasoning-token intensity decreased:

| Month | Mean reasoning tokens | P90 reasoning tokens |
| --- | --- | --- |
| Feb 2026 | 268.1 | 772 |
| Mar 2026 | 256.8 | 723 |
| Apr 2026 | 228.7 | 669 |
| May 2026 | 106.9 | 344 |
| Jun 2026 | 168.5 | 515 |

Why this looks suspicious

The anomaly is not simply higher reasoning-token usage overall. Mean and P90 reasoning-token intensity fell from February-April to May-June, while exact-516 clustering rose sharply.
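To make the February-April versus May-June contrast concrete, the monthly figures from the tables above can be averaged per period. This is only a rough sketch: the issue does not give per-month record counts, so the averages below are unweighted, and the names `monthly` and `period_average` are illustrative rather than taken from the issue.

```python
# Monthly figures copied from the tables above:
# (mean reasoning tokens, P90 reasoning tokens, exact-516 / >=516 ratio in %).
monthly = {
    "Feb 2026": (268.1, 772, 0.11),
    "Mar 2026": (256.8, 723, 2.45),
    "Apr 2026": (228.7, 669, 4.25),
    "May 2026": (106.9, 344, 53.30),
    "Jun 2026": (168.5, 515, 35.84),
}


def period_average(months):
    """Unweighted average of each column over the given months.

    Assumption: per-month record counts are not published, so each month
    is given equal weight.
    """
    rows = [monthly[m] for m in months]
    return tuple(sum(col) / len(rows) for col in zip(*rows))


early = period_average(["Feb 2026", "Mar 2026", "Apr 2026"])
late = period_average(["May 2026", "Jun 2026"])

labels = ("mean tokens", "P90 tokens", "exact-516 ratio (%)")
for label, before, after in zip(labels, early, late):
    print(f"{label}: {before:.1f} -> {after:.1f}")
```

Under this unweighted view, the mean and P90 columns drop between the two periods while the exact-516 ratio rises, which is the divergence described above.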

The clustering is also not evenly distributed across models. gpt-5.5 accounts for only 19.3% of responses but 82.0% of exact-516 events. Its exact-516 / >=516 ratio is about 33.6x higher than the non-GPT-5.5 baseline.
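The headline ratios can be rechecked from the rounded figures in the Evidence tables. A minimal sketch follows; the variable names are illustrative. Note that the rounded percentages give a multiple of about 33.8x, slightly above the 33.6x quoted above. The gap is presumably due to rounding in the published percentages, but that is an assumption.

```python
# Rounded values copied from the Evidence tables.
total_records = 390_195
gpt55_records = 75_401
gpt55_share_of_exact_516 = 82.0   # percent of all exact-516 events
gpt55_ratio = 44.0                # percent, exact-516 / >=516 for gpt-5.5
baseline_ratio = 1.3              # percent, exact-516 / >=516 for non-GPT-5.5

# Share of all responses produced by gpt-5.5, recomputed from record counts.
gpt55_share_of_responses = 100 * gpt55_records / total_records

# How over-represented gpt-5.5 is among exact-516 events.
over_representation = gpt55_share_of_exact_516 / gpt55_share_of_responses

# Multiple of the non-GPT-5.5 baseline ratio (from rounded inputs).
ratio_multiple = gpt55_ratio / baseline_ratio

print(f"gpt-5.5 share of responses: {gpt55_share_of_responses:.1f}%")  # 19.3%
print(f"over-representation: {over_representation:.2f}x")
print(f"ratio multiple: {ratio_multiple:.2f}x")                        # 33.85x
```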

The fixed values are also notable: 516, 1034, and 1552 look like repeated threshold boundaries rather than a naturally varying reasoning-token distribution.
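A quick arithmetic check shows why these values read as repeated boundaries: they are evenly spaced. The sketch below only verifies the spacing. The closing observation that each value equals a multiple of 518 minus 2 is a numerical coincidence check, not a claim about any internal mechanism.

```python
# Spike locations reported in the issue.
spikes = [516, 1034, 1552]

# Distance between consecutive spikes.
gaps = [later - earlier for earlier, later in zip(spikes, spikes[1:])]
print(gaps)  # [518, 518]

# Each spike can also be written as n * 518 - 2 for n = 1, 2, 3.
# This is an observation about the numbers only, not a known budget formula.
fits = all(value == 518 * (n + 1) - 2 for n, value in enumerate(spikes))
print(fits)  # True
```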

Expected behavior

Reasoning-token counts for complex Codex tasks should vary naturally with task complexity and should not disproportionately cluster at exact fixed values for one model family.

Actual behavior

gpt-5.5 responses cluster heavily at exactly 516 reasoning tokens, with related spikes around 1034 and 1552. This pattern is much weaker or absent in several other models.

Ask

Could the Codex team investigate whether gpt-5.5 has a reasoning-budget, routing, truncation, fallback, or scheduler behavior that causes responses to terminate around 516/1034/1552 reasoning tokens?

If this is expected behavior, it would be useful to know whether exact 516 indicates a normal stopping point, a budget cap, a degraded tier, or another internal threshold.

Useful internal validation checks:

Query token_count events with reasoning_output_tokens by model.
Compare exact-value counts for 0, 516, 1034, and 1552.
Compute count(reasoning_output_tokens = 516) / count(reasoning_output_tokens >= 516) by model and day.
Compare gpt-5.5 against gpt-5.2, gpt-5.4, and Codex-specific variants.
Replay matched complex tasks across GPT-5.2 and GPT-5.5 with quality evals, especially separating exact-516 responses from longer-reasoning responses.
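The counting checks above could be sketched as follows. This assumes each `token_count` event is available as a dictionary. The `reasoning_output_tokens` field name comes from the issue, while the `model` and `day` keys, the function names, and the sample records are hypothetical and would need to be mapped onto the real telemetry schema.

```python
from collections import Counter, defaultdict

WATCHED_VALUES = (0, 516, 1034, 1552)


def exact_value_counts(records, values=WATCHED_VALUES):
    """Per model, count records whose reasoning-token count equals a watched value."""
    counts = defaultdict(Counter)
    for rec in records:
        tokens = rec["reasoning_output_tokens"]
        if tokens in values:
            counts[rec["model"]][tokens] += 1
    return counts


def exact_ratio_by_model_and_day(records, threshold=516):
    """Return {(model, day): count(tokens == threshold) / count(tokens >= threshold)}.

    Groups with no record at or above the threshold are omitted, which
    avoids dividing by zero.
    """
    exact = Counter()
    at_or_above = Counter()
    for rec in records:
        tokens = rec["reasoning_output_tokens"]
        if tokens >= threshold:
            key = (rec["model"], rec["day"])
            at_or_above[key] += 1
            if tokens == threshold:
                exact[key] += 1
    return {key: exact[key] / total for key, total in at_or_above.items()}


# Made-up records for illustration only; these are not real telemetry.
sample = [
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 516},
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 516},
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 900},
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 120},
    {"model": "gpt-5.2", "day": "2026-05-01", "reasoning_output_tokens": 516},
    {"model": "gpt-5.2", "day": "2026-05-01", "reasoning_output_tokens": 640},
    {"model": "gpt-5.2", "day": "2026-05-01", "reasoning_output_tokens": 0},
]

ratios = exact_ratio_by_model_and_day(sample)
counts = exact_value_counts(sample)

for (model, day), ratio in sorted(ratios.items()):
    print(f"{model} {day}: exact-516 / >=516 = {ratio:.2f}")
```

Calling `exact_ratio_by_model_and_day` with `threshold=1034` or `threshold=1552` would give the same comparison for the other two spike values.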