Are LLMs not getting better?

Original link: https://entropicthoughts.com/no-swe-bench-improvement

A recent look at LLM code generation, based on data from the metr article, reveals a worrying trend: although LLMs are increasingly able to *pass tests*, the quality of code fit to actually merge into a project has not improved over the past year. The study compares LLM performance when judged by passing tests versus maintainer approval, and finds that success drops sharply under the mergeability criterion: the 50 % success horizon falls from 50 minutes to just 8 minutes. Crucially, the merge-rate analysis shows no upward trend since early 2025, despite improvements in test-passing ability. Statistical modelling (using the Brier score) confirms this, showing that a model predicting a *constant* merge rate is more accurate than one predicting gradual improvement. This suggests LLMs are not actually getting better at producing production-ready code, and calls into question the use of test passing as the primary measure of progress.

A recent article questioning whether large language models (LLMs) are still improving has sparked debate on Hacker News. The article claims LLM capabilities have stagnated, especially for programming, but commenters mostly disagree. Several users point out significant omissions in the article's data, notably the lack of analysis of newer models such as OpenAI's GPT-4.5, Anthropic's Claude Opus and Sonnet, and Google's Gemini. The prevailing view is that LLM progress is not linear, but comes in bursts after key breakthroughs (such as chain-of-thought prompting) followed by plateaus. Many users *do* observe improvements in their own work, noting that current models such as GPT-4 require less editing of their output. While acknowledging a possible recent slowdown, the overall sentiment is that LLMs are *still* getting better, even if the pace has changed.
Related articles

Original article


I was reading the metr article on how llm code passes tests much more often than it is of mergeable quality. They look at the performance of llms doing programming when the success criterion is “passes all tests” and compare it to when the success criterion is “would get approved by the maintainer”. Unsurprisingly, llm performance is much worse under the more stringent success criterion. Their 50 % success horizon moves from 50 minutes down to 8 minutes.

As part of this they have included figures such as this one:

swebench-01.png

But there’s something about it that strikes me as odd. Let’s look only at the more valuable data, the merge rates.

swebench-02.png

What line best characterises this data? metr suggested something that slopes slightly upwards. But here’s what I see:

swebench-03.png

At some point toward the end of 2024 we may have had a step up in ability, but this plot shows no evidence of any actual improvement in merge rates since early 2025.

Fisher warns us against eyeballing plots, so let’s make it more formal. We’ll use leave-one-out cross-validation and compare the linear slope suggested by metr against the step function the plot hints at.

Model                 Brier score
Gentle upward slope   0.0129
Piecewise constant    0.0117

The Brier score is a form of squared error, thus lower is better. This means the step function has more predictive power (“fits better”) than the linear slope. For fun, we can also fit a function that is completely constant across the entire timespan. That happens to get the best Brier score.

Model                 Brier score
Gentle upward slope   0.0129
Piecewise constant    0.0117
Constant function     0.0100
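As a quick reminder of what the metric measures (this is just the definition, not metr's code): for a binary merged/not-merged outcome, the Brier score is the mean squared error between the predicted probability and what actually happened.

```python
def brier(preds, outcomes):
    # Mean squared error between predicted probabilities and the
    # 0/1 outcomes; 0 is a perfect forecast, lower is better.
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

print(round(brier([0.9], [1]), 3))  # a confident, correct forecast: 0.01
print(brier([0.5, 0.5], [0, 1]))    # hedging at 50 % on balanced data: 0.25
```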

Stop and think about what this means: the two models that predict constant merge rates over the latter half of the plot are more accurate than the linear growth trend. This corroborates what we eyeballed in the plots: the merge rate has not increased in the latter half of this plot.
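The comparison above can be sketched in code. Everything below is illustrative: the merge outcomes are synthetic and the break point for the step function is an assumed parameter, since metr's actual data is not reproduced here.

```python
import numpy as np

def loo_brier(predict, t, y):
    # Leave-one-out cross-validation: refit on all points but one,
    # predict the held-out point, and average the squared errors.
    # For binary outcomes this average is exactly the Brier score.
    idx = np.arange(len(t))
    errs = [(predict(t[idx != i], y[idx != i], t[i]) - y[i]) ** 2
            for i in idx]
    return float(np.mean(errs))

def linear_model(t_fit, y_fit, t_new):
    # Gentle slope: least-squares line, clipped to a valid probability.
    slope, intercept = np.polyfit(t_fit, y_fit, 1)
    return float(np.clip(slope * t_new + intercept, 0.0, 1.0))

def step_model(t_fit, y_fit, t_new, t_break=10.0):
    # Piecewise constant: mean of the training outcomes that fall on
    # the same side of an assumed break point as the new point.
    same_side = (t_fit >= t_break) == (t_new >= t_break)
    return float(np.mean(y_fit[same_side]))

def constant_model(t_fit, y_fit, t_new):
    # One constant across the entire timespan: the overall mean.
    return float(np.mean(y_fit))

# Synthetic merge outcomes: a step up partway through, flat after.
rng = np.random.default_rng(42)
t = np.arange(24.0)
y = rng.binomial(1, np.where(t < 10, 0.2, 0.4)).astype(float)

for name, model in [("gentle upward slope", linear_model),
                    ("piecewise constant", step_model),
                    ("constant function", constant_model)]:
    print(f"{name:20s} {loo_brier(model, t, y):.4f}")
```

The design choice worth noting is that each candidate model is a plain function taking training data and one new point, so the same leave-one-out loop scores all three without knowing anything about their internals.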

This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?
