The gap between open weights LLMs and closed source LLMs

Original link: https://blog.doubleword.ai/frontier-os-llm

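An analysis of the Artificial Analysis Intelligence Index shows that the performance gap between open weights models and closed source LLMs is narrowing. One headline metric suggests the gap could disappear entirely by the end of 2026, prompting speculation about an imminent "open source singularity". A broader look at 18 different benchmarks gives a more nuanced picture. Averaged across these datasets, the gap is relatively stable at around five months. Open weights models have made large gains in coding, but on most other benchmarks they have not caught up at the same pace. Ultimately, the analysis highlights how hard it is to measure LLM quality. Depending on the metric chosen, one could conclude either that open models are about to overtake the closed source leaders, or that they will keep lagging by several months, perhaps indefinitely. The findings underline that no single benchmark fully captures the evolving AI landscape.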

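The Hacker News discussion about the gap between open weights and closed source AI models highlights a central tension: the fragility of depending on private entities versus the lasting value of local, unrestricted software.

**Key points of the discussion:**

* **Sustainability of open weights:** Critics argue that because open models (such as DeepSeek) tend to come from corporate strategy rather than pure philanthropy, their supply could dry up if commercial incentives change or governments step in.
* **The "permanent" advantage:** Supporters argue that weights cannot be recalled once released. Unlike API models, which a provider can shut down, open models can be used indefinitely and act as an important "backup" for society.
* **Distillation and innovation:** Commenters disagree about how far open models depend on "distilling" knowledge from closed source frontier models. Some worry this caps their ceiling, while others point out that rapid progress in self-distillation and efficient post-training shows open alternatives can stay competitive.
* **Future landscape:** Many users foresee a shift toward "fabless" AI, in which specialist companies train and license models, with growing emphasis on local, cost-effective inference as the main "moat" against centralized control.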

Interactive plot of the Artificial Analysis Intelligence Index for open and closed frontier models.

I have seen a version of the above plot going around Twitter and wanted to dig a bit deeper into it. The plot shows the gap between open weights LLMs and closed source LLMs. We measure this gap by taking the frontier of open weights performance on a benchmark and asking how long ago the closed source frontier was at that level. It is a measure of how long open source models took to catch up to capabilities the closed source frontier had already reached. The benchmark here is the Artificial Analysis Intelligence Index, their headline index that tries to assess the overall capabilities of models. In general it correlates quite well with the ‘vibe’ people seem to get from models.
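To make the gap measurement concrete, here is a minimal sketch of one way to compute it. It is not the code behind the plot: the function names and the release data are made up for illustration, and it assumes each model is just a (release date, score) pair on a single benchmark.

```python
from datetime import date

def running_frontier(releases):
    """Keep only the releases that set a new best score, in date order."""
    best = float("-inf")
    frontier = []
    for day, score in sorted(releases):
        if score > best:
            best = score
            frontier.append((day, score))
    return frontier

def lag_in_days(open_releases, closed_releases, as_of):
    """Days since the closed frontier first matched the open frontier's score at `as_of`.

    Returns None if no open model exists yet, or if the closed frontier
    has never reached that score. A negative value would mean the open
    frontier got there first.
    """
    open_scores = [score for day, score in open_releases if day <= as_of]
    if not open_scores:
        return None
    open_best = max(open_scores)
    for day, score in running_frontier(closed_releases):
        if score >= open_best:
            return (as_of - day).days
    return None

# Made-up (release date, benchmark score) pairs, for illustration only.
closed = [(date(2024, 1, 1), 40), (date(2024, 6, 1), 50), (date(2025, 1, 1), 60)]
open_weights = [(date(2024, 3, 1), 35), (date(2024, 9, 1), 45), (date(2025, 3, 1), 55)]

for as_of in (date(2024, 9, 1), date(2025, 3, 1)):
    print(as_of, lag_in_days(open_weights, closed, as_of))
```

Evaluating this once per month gives the gap series that the plot tracks over time.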

You can see that around summer 2024 the gap on this benchmark starts to shrink, and it has been shrinking reliably since then. If you plot a line of best fit and extend it into the future, you find that the gap reaches 0 months around December 3rd 2026, 6 months or so from the time of writing.
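The extrapolation is just an ordinary least-squares line and its zero crossing. Here is a rough sketch, assuming the gap is sampled as (months since start, gap in months) points. The numbers are invented to keep the arithmetic obvious and are not the Artificial Analysis data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def zero_crossing(slope, intercept):
    """x at which the fitted line hits zero, or None if the gap is not shrinking."""
    if slope >= 0:
        return None
    return -intercept / slope

# Invented series: months since the start of the fit window, and the gap in months.
months_since_start = [0, 6, 12, 18]
gap_in_months = [12.0, 9.0, 6.0, 3.0]

slope, intercept = fit_line(months_since_start, gap_in_months)
print(slope, intercept, zero_crossing(slope, intercept))
```

A straight line is the simplest possible model here. Nothing guarantees the gap keeps shrinking linearly, which is part of why a single extrapolated date should be read loosely.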

Now is probably a good time to liquidate your pension, fly to a remote island somewhere, and live out the remaining 6 months or so of civilization in peace.

…

Except.

This might not be the whole picture. This is a single benchmark, and it doesn’t give a complete picture of what LLMs can do. Kindly, Artificial Analysis gives us access to 18 different benchmarks that they have measured for these models. I have repeated the analysis for all 18 benchmarks and summarized the results in the plot below:

Interactive boxplot of monthly open frontier lag across Artificial Analysis metrics.

For each of the 18 datasets we have created a similar chart. You can see all 18 at the bottom of the page. For each month we have created a box plot of the gaps across datasets, and plotted those box plots over time. We have also calculated the average gap across datasets, and calculated a line of best fit for that average. That line is almost completely flat, at just under 5 months for the entire period.
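As a toy version of that aggregation, the sketch below computes the box plot statistics and the mean gap per month across benchmarks. The benchmark names and numbers are invented, and chosen so that one benchmark closes fast while the others drift up. They only show how a flat average can hide very different per-benchmark trends.

```python
from statistics import mean, median, quantiles

# Invented gaps in months, one value per month for each benchmark.
gaps_by_benchmark = {
    "coding": [15.0, 9.0, 3.0],
    "math": [2.0, 4.0, 6.0],
    "reasoning": [1.0, 3.0, 5.0],
    "knowledge": [2.0, 4.0, 6.0],
}

def monthly_summary(gaps):
    """Per-month quartiles, median, and mean of the gap across all benchmarks."""
    n_months = len(next(iter(gaps.values())))
    summary = []
    for month in range(n_months):
        values = sorted(series[month] for series in gaps.values())
        q1, _, q3 = quantiles(values, n=4)  # quartile cut points for the box
        summary.append(
            {"q1": q1, "median": median(values), "q3": q3, "mean": mean(values)}
        )
    return summary

for month, row in enumerate(monthly_summary(gaps_by_benchmark)):
    print(month, row)
```

In this toy data the mean stays at 5 months in every month even though the coding gap falls from 15 to 3.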

What is notable is that much of the overall improvement has come from the coding benchmark. The coding index has gone from 15 months behind to only a month or two behind. Most other datasets show a moderate increase in their gaps over time.

So maybe the open source apocalypse won’t happen yet.

What this exercise does show is how difficult it is to measure LLM quality. Depending on how you measure it, you would either predict the open source singularity by Christmas, or say that open source LLMs are consistently 5 months behind closed source and that the gap might be growing.

Benchmark plot

Interactive frontier plot for artificial analysis intelligence index.
