CursorBench 3.1

Original link: https://cursor.com/evals

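CursorBench 3.1 evaluates AI agents on complex, multi-file, real-world programming tasks, including codebase analysis, debugging, and refactoring. The benchmark compares each model's task success rate against its average cost per task.

**Key findings:**

* **Top performance:** Fable 5 (Max) leads the leaderboard with a 72.9% success rate, followed closely by the other high-effort Fable 5 configurations.
* **Cost and quality:** Cost and score are positively correlated overall. High-scoring models such as Fable 5 and Opus 4.8 also cost more, because they use more tokens and take more steps.
* **Efficiency:** Budget-oriented options such as Composer 2.5 deliver competitive results (63.2%) at a far lower cost ($0.55 per task) than the top models.
* **Methodology:** Scores are based on model performance on ambiguous, multi-file coding tasks. Cost is computed by applying standard per-million-token pricing to actual usage data, and small score differences may fall within the margin of error.

Overall, the benchmark highlights a trade-off: users must choose between the high-accuracy, high-cost performance of models such as Fable 5 and the cost efficiency of lighter models such as Composer.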

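The Hacker News discussion of "CursorBench 3.1" shows experienced users disagreeing about which coding-assistant models to choose and how much weight to give efficiency.

Many users are dissatisfied with top-tier "frontier" models such as Claude Opus, criticizing them as slow, "token-hungry," and prone to overcomplicating simple tasks. Commenters frequently mention that these models spawn sub-agents unnecessarily or burn large numbers of tokens, which leads some to speculate that this is a strategy to increase usage and revenue.

By contrast, lighter and faster models such as "Composer 2.5" receive strong support. Users argue that these models are good enough for 80% of coding tasks and strike a better balance between speed and cost. Frontier models are still preferred for complex architectural planning and difficult implementations, but for everyday work people are increasingly moving to more efficient models that allow fast iteration without excessive resource use.

Overall, the community consensus favors practicality over raw performance: many developers value response speed and cost efficiency more than the "top-spec" performance that expensive models provide.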

Original text

We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.

[Chart: a scatter and line chart comparing Fable 5, Opus 4.8, Opus 4.7, GPT-5.5, Sonnet 5, Sonnet 4.6, GLM 5.2, Composer 2.5, and Composer 2 scores against average cost per task. Y-axis: CursorBench 3.1 score (45% to 75%). X-axis: average cost per task ($0 to $20). Labeled points: Fable 5 high, Composer 2.5, GPT-5.5 medium, Gemini 3.5 Flash, Opus 4.8 high, Sonnet 5 high, Kimi K2.7 Code, GLM 5.2 high.]

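| # | Model | Score | Avg cost / task | (unlabeled) | (unlabeled) |
|---|-------|-------|-----------------|-------------|-------------|
| 1 | Fable 5 Max | 72.9% | $18.02 | 63,842 | 76 |
| 2 | Fable 5 Extra High | 72.0% | $13.74 | 48,754 | 63 |
| 3 | Fable 5 High | 70.6% | $10.81 | 37,173 | 54 |
| 4 | Fable 5 Medium | 69.8% | $8.27 | 28,507 | 47 |
| 5 | Opus 4.7 Max | 64.8% | $11.02 | 62,989 | 96 |
| 6 | GPT-5.5 Extra High | 64.3% | $4.37 | 17,905 | 46 |
| 7 | Fable 5 Low | 64.2% | $5.70 | 18,882 | 36 |
| 8 | Opus 4.8 Max | 63.8% | $7.59 | 77,370 | 60 |
| 9 | Composer 2.5 | 63.2% | $0.55 | 15,152 | 37 |
| 10 | GPT-5.5 High | 62.6% | $3.59 | 13,329 | 40 |
| 11 | Opus 4.8 Extra High | 62.1% | $6.14 | 55,622 | 54 |
| 12 | Opus 4.7 Extra High | 61.6% | $7.11 | 43,942 | 72 |
| 13 | Sonnet 5 Max | 61.2% | $6.87 | 93,485 | 93 |
| 14 | Opus 4.7 High | 59.4% | $5.01 | 32,227 | 59 |
| 15 | GPT-5.5 Medium | 59.2% | $2.22 | 9,065 | 35 |
| 16 | Opus 4.8 High | 58.4% | $4.41 | 36,788 | 45 |
| 17 | Sonnet 5 Extra High | 58.4% | $5.23 | 58,228 | 86 |
| 18 | Sonnet 5 High | 57.0% | $3.74 | 41,735 | 66 |
| 19 | Opus 4.8 Medium | 56.6% | $3.83 | 31,684 | 41 |
| 20 | Sonnet 5 Medium | 54.9% | $2.57 | 27,469 | 53 |
| 21 | GLM 5.2 Max | 54.6% | $3.11 | 51,312 | 83 |
| 22 | Opus 4.8 Low | 54.3% | $2.93 | 22,726 | 36 |
| 23 | Opus 4.7 Medium | 52.7% | $2.93 | 19,193 | 41 |
| 24 | Kimi K2.7 Code | 52.7% | $1.92 | 32,902 | 70 |
| 25 | Composer 2 | 52.2% | $0.56 | 14,163 | 40 |
| 26 | GLM 5.2 High | 50.7% | $2.46 | 30,621 | 76 |
| 27 | Gemini 3.5 Flash | 49.8% | $1.94 | 35,105 | 79 |
| 28 | Sonnet 4.6 Max | 49.0% | $3.09 | 40,280 | 55 |
| 29 | GPT-5.5 Low | 48.8% | $1.19 | 4,923 | 24 |
| 30 | Sonnet 4.6 High | 48.8% | $3.06 | 37,352 | 57 |
| 31 | Opus 4.7 Low | 48.3% | $1.87 | 13,164 | 29 |
| 32 | Sonnet 5 Low | 47.7% | $1.46 | 17,028 | 37 |
| 33 | Kimi 2.6 | 47.6% | $1.27 | 24,783 | 56 |
| 34 | Sonnet 4.6 Medium | 46.0% | $2.64 | 31,360 | 50 |
| 35 | Sonnet 4.6 Low | 41.5% | $1.89 | 21,211 | 50 |
| 36 | Kimi 2.5 | 31.9% | $0.87 | 9,446 | 30 |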
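
One way to read the score-versus-cost trade-off in the leaderboard is to ask which configurations are not beaten on both axes by some other configuration. The sketch below is illustrative only and is not part of Cursor's methodology: it copies the score and cost of the top ten leaderboard rows and computes a simple Pareto frontier. The names `ROWS` and `pareto_frontier` are made up for this example.

```python
# Top ten leaderboard rows as (model, score in %, average cost per task in USD).
ROWS = [
    ("Fable 5 Max", 72.9, 18.02),
    ("Fable 5 Extra High", 72.0, 13.74),
    ("Fable 5 High", 70.6, 10.81),
    ("Fable 5 Medium", 69.8, 8.27),
    ("Opus 4.7 Max", 64.8, 11.02),
    ("GPT-5.5 Extra High", 64.3, 4.37),
    ("Fable 5 Low", 64.2, 5.70),
    ("Opus 4.8 Max", 63.8, 7.59),
    ("Composer 2.5", 63.2, 0.55),
    ("GPT-5.5 High", 62.6, 3.59),
]


def pareto_frontier(rows):
    """Return model names that no cheaper-or-equal model matches or beats on score.

    Rows are visited from cheapest to most expensive (ties: higher score first).
    A row joins the frontier only if it scores strictly higher than every
    row already visited.
    """
    frontier = []
    best_score = float("-inf")
    for name, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    for name in pareto_frontier(ROWS):
        print(name)
```

On these ten rows the frontier runs from Composer 2.5 through GPT-5.5 Extra High to the Fable 5 Medium, High, Extra High, and Max configurations. Given the caveat about variance, close calls such as Fable 5 Low versus GPT-5.5 Extra High should not be over-interpreted.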

CursorBench 3.1

  • Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
  • Improved grading criteria for some edit tasks.

CursorBench 3.0

  • Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
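
The cost calculation described above can be sketched as follows. This is a minimal illustration, not Cursor's code: the `Pricing` and `TaskUsage` types, the function names, and all prices and token counts are hypothetical. Only the four token categories (input, cache read, cache write, output) and the "price each task, then average" order come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (hypothetical values are supplied by the caller)."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Token counts a model consumed on a single task."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(pricing: Pricing, usage: TaskUsage) -> float:
    """Price one task: tokens in each category times that category's per-million rate."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(pricing: Pricing, usages: list[TaskUsage]) -> float:
    """Average the per-task costs across all tasks."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(pricing, u) for u in usages) / len(usages)


if __name__ == "__main__":
    # Made-up prices and usage, for illustration only.
    pricing = Pricing(input=2.0, cache_read=0.5, cache_write=4.0, output=10.0)
    usages = [
        TaskUsage(input=40_000, cache_read=300_000, cache_write=20_000, output=6_000),
        TaskUsage(input=15_000, cache_read=120_000, cache_write=10_000, output=2_500),
    ]
    print(f"${avg_cost_per_task(pricing, usages):.2f}")
```

Because each task is priced before averaging, a few long-running tasks with heavy token use can pull the average well above the typical task cost.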
