Are LLM merge rates not getting better?

Original link: https://entropicthoughts.com/no-swe-bench-improvement

A recent look at LLM code generation, based on data from a metr article, highlights a worrying trend: while LLMs are getting better at *passing tests*, the quality of code fit to actually be merged into a project has not improved over the past year. The study compares LLM performance under two success criteria, passing all tests versus winning maintainer approval, and finds that success drops sharply under the mergeability standard: the 50 % success time horizon shrinks from 50 minutes to just 8 minutes. Crucially, the merge-rate analysis shows no upward trend since early 2025, even though test-passing ability has kept improving. Statistical modelling (using Brier scores) confirms this: a model that predicts a *constant* merge rate is more accurate than one that predicts gradual improvement. This suggests LLMs are not actually getting better at producing production-ready code, and calls into question the use of test passing as the primary measure of progress.

## LLM progress: a plateau?

A recent analysis questions whether large language models (LLMs) are still improving significantly at coding. It suggests a possible plateau, particularly when progress is measured by the rate at which pull requests (code changes) would be merged without human rework. Some commenters push back, pointing to recent advances such as the Opus 4.5/4.6 models and the impact of better tooling and "agentic" workflows.

The core dispute is whether the observed improvements come from the models themselves getting smarter or merely from better use and integration of existing tools. Many agree that while raw "one-shot" performance may be levelling off, the overall developer experience *has* improved.

Some commenters stress the importance of looking at model-specific data (to avoid conflating results across labs) and acknowledge that LLMs still require substantial human oversight. The general view is that future progress may depend more on ecosystem improvements and cost optimisation than on great leaps in model scale or capability. Ultimately, the discussion underlines how hard it is to measure LLM progress objectively, and how easily hype can obscure a realistic assessment.

Original article


I was reading the metr article on how llm code passes tests much more often than it is of mergeable quality. They look at the performance of llms doing programming when the success criterion is “passes all tests” and compare it to when the success criterion is “would get approved by the maintainer”. Unsurprisingly, llm performance is much worse under the more stringent success criterion: their 50 % success horizon moves from 50 minutes down to 8 minutes.

As part of this they have included figures such as this one:

[figure: swebench-01.png]

But there’s something about it that strikes me as odd. Let’s look only at the more valuable data, the merge rates.

[figure: swebench-02.png]

What line best characterises this data? metr suggested something that slopes slightly upwards. But here’s what I see:

[figure: swebench-03.png]

At some point toward the end of 2024 we may have had a step up in ability, but this plot shows no evidence of any actual improvement in merge rates since early 2025.

Fisher warns us against eyeballing plots, so let’s make it more formal. We’ll use leave-one-out cross-validation and compare the linear slope suggested by metr against the step function the plot hints at.
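The article does not show its code, but the procedure can be sketched roughly as follows, assuming the data is a series of time points with binary merged/not-merged outcomes. The model forms and the 2025 breakpoint are illustrative guesses, not metr's exact fits:

```python
import numpy as np

def loo_brier(t, y, fit_predict):
    """Leave-one-out cross-validated Brier score.

    t: array of time points, y: binary outcomes (1 = merged),
    fit_predict(t_train, y_train, t_held_out) -> predicted probability.
    """
    errors = []
    for i in range(len(t)):
        mask = np.arange(len(t)) != i          # drop one observation
        p = fit_predict(t[mask], y[mask], t[i])
        p = min(max(p, 0.0), 1.0)              # clamp to a valid probability
        errors.append((p - y[i]) ** 2)
    return float(np.mean(errors))

# Three illustrative candidate models (the breakpoint year is a guess):
def gentle_slope(t_tr, y_tr, t0):
    b, a = np.polyfit(t_tr, y_tr, 1)           # linear probability model
    return b * t0 + a

def piecewise_constant(t_tr, y_tr, t0, cut=2025.0):
    same_side = (t_tr >= cut) if t0 >= cut else (t_tr < cut)
    return y_tr[same_side].mean()               # mean of the matching segment

def constant(t_tr, y_tr, t0):
    return y_tr.mean()                          # one global mean
```

Each model is refit with one observation held out, asked for a probability at the held-out time, and scored on the squared error of that prediction.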

| Model               | Brier score |
|---------------------|-------------|
| Gentle upward slope | 0.0129      |
| Piecewise constant  | 0.0117      |

The Brier score is a form of squared error, thus lower is better. This means the step function has more predictive power (“fits better”) than the linear slope. For fun, we can also fit a function that is completely constant across the entire timespan. That happens to get the best Brier score.

| Model               | Brier score |
|---------------------|-------------|
| Gentle upward slope | 0.0129      |
| Piecewise constant  | 0.0117      |
| Constant function   | 0.0100      |
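For concreteness, this is all a Brier score is; the predicted probabilities below are made up for illustration:

```python
# Brier score: mean squared error between predicted probabilities
# and binary outcomes. 0 is a perfect forecast; lower is better.
predictions = [0.9, 0.2, 0.7]  # hypothetical merge probabilities
outcomes    = [1,   0,   0]    # what actually happened
brier = sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)
# (0.01 + 0.04 + 0.49) / 3 = 0.18
```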

Stop and think about what this means: the two models that predict constant merge rates over the latter half of the plot are more accurate than the linear growth trend. This corroborates what we eyeballed: the merge rate has not increased in the latter half of the plot.

This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?
