CursorBench 3.1

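CursorBench 3.1 evaluates AI agents on complex, multi-file, real-world coding tasks, including codebase analysis, debugging, and refactoring. The benchmark reports each model's task success rate alongside its average cost per task.

**Key findings:**

* **Top performance:** Fable 5 (Max) leads the leaderboard with a 72.9% success rate, followed by the other high-effort Fable 5 configurations.
* **Cost and quality:** Cost and performance are clearly positively correlated; high-scoring models such as Fable 5 and Opus 4.8 also cost more because they use more tokens and take more steps.
* **Efficiency:** Budget-oriented options such as Composer 2.5 deliver highly competitive results (63.2%) at a much lower cost ($0.55 per task) than the top models.
* **Methodology:** Scores are based on model performance on ambiguous, multi-file coding tasks. Costs are derived by applying standard per-million-token pricing to actual usage data, and small score differences may fall within the margin of error.

Overall, the benchmark highlights a trade-off: users must choose between the high-accuracy, high-cost performance of models such as Fable 5 and the cost efficiency of lighter models such as Composer.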

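The Hacker News discussion of "CursorBench 3.1" highlights disagreement among experienced users over which coding-assistant models to choose and how much weight to give efficiency. Many users are dissatisfied with top "frontier" models such as Claude Opus, criticizing them as slow, "token-hungry", and prone to overcomplicating simple tasks. Commenters often report that these models spawn subagents unnecessarily or burn large numbers of tokens, prompting speculation that this is a strategy to increase usage and revenue. By contrast, lighter and faster models such as "Composer 2.5" draw strong support. Users argue that these models are adequate for 80% of coding tasks and strike a better balance between speed and cost. Frontier models are still preferred for complex architectural planning or difficult implementations, but for daily use people are increasingly moving to more efficient models that allow fast iteration without excessive resource consumption. Overall, the community consensus favors practicality over raw performance: many developers value responsiveness and cost efficiency more than the "top-tier" performance of expensive models.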

We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.

Rank	Model	Score	Avg cost / task	Tokens	Steps
1	Fable 5 Max	72.9%	$18.02	63,842	76
2	Fable 5 Extra High	72.0%	$13.74	48,754	63
3	Fable 5 High	70.6%	$10.81	37,173	54
4	Fable 5 Medium	69.8%	$8.27	28,507	47
5	Opus 4.7 Max	64.8%	$11.02	62,989	96
6	GPT-5.5 Extra High	64.3%	$4.37	17,905	46
7	Fable 5 Low	64.2%	$5.70	18,882	36
8	Opus 4.8 Max	63.8%	$7.59	77,370	60
9	Composer 2.5	63.2%	$0.55	15,152	37
10	GPT-5.5 High	62.6%	$3.59	13,329	40
11	Opus 4.8 Extra High	62.1%	$6.14	55,622	54
12	Opus 4.7 Extra High	61.6%	$7.11	43,942	72
13	Sonnet 5 Max	61.2%	$6.87	93,485	93
14	Opus 4.7 High	59.4%	$5.01	32,227	59
15	GPT-5.5 Medium	59.2%	$2.22	9,065	35
16	Opus 4.8 High	58.4%	$4.41	36,788	45
17	Sonnet 5 Extra High	58.4%	$5.23	58,228	86
18	Sonnet 5 High	57.0%	$3.74	41,735	66
19	Opus 4.8 Medium	56.6%	$3.83	31,684	41
20	Sonnet 5 Medium	54.9%	$2.57	27,469	53
21	GLM 5.2 Max	54.6%	$3.11	51,312	83
22	Opus 4.8 Low	54.3%	$2.93	22,726	36
23	Opus 4.7 Medium	52.7%	$2.93	19,193	41
24	Kimi K2.7 Code	52.7%	$1.92	32,902	70
25	Composer 2	52.2%	$0.56	14,163	40
26	GLM 5.2 High	50.7%	$2.46	30,621	76
27	Gemini 3.5 Flash	49.8%	$1.94	35,105	79
28	Sonnet 4.6 Max	49.0%	$3.09	40,280	55
29	GPT-5.5 Low	48.8%	$1.19	4,923	24
30	Sonnet 4.6 High	48.8%	$3.06	37,352	57
31	Opus 4.7 Low	48.3%	$1.87	13,164	29
32	Sonnet 5 Low	47.7%	$1.46	17,028	37
33	Kimi 2.6	47.6%	$1.27	24,783	56
34	Sonnet 4.6 Medium	46.0%	$2.64	31,360	50
35	Sonnet 4.6 Low	41.5%	$1.89	21,211	50
36	Kimi 2.5	31.9%	$0.87	9,446	30
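
The score and cost columns above can be read as a trade-off curve. As a minimal sketch, the snippet below takes the top ten rows of the table (score and average cost per task, copied as listed) and keeps only the models that no cheaper model matches or beats on score. The function name `pareto_frontier` and the tuple layout are illustrative assumptions, not part of CursorBench.

```python
# Top ten rows of the table above: (model, score in %, avg cost per task in USD).
ROWS = [
    ("Fable 5 Max", 72.9, 18.02),
    ("Fable 5 Extra High", 72.0, 13.74),
    ("Fable 5 High", 70.6, 10.81),
    ("Fable 5 Medium", 69.8, 8.27),
    ("Opus 4.7 Max", 64.8, 11.02),
    ("GPT-5.5 Extra High", 64.3, 4.37),
    ("Fable 5 Low", 64.2, 5.70),
    ("Opus 4.8 Max", 63.8, 7.59),
    ("Composer 2.5", 63.2, 0.55),
    ("GPT-5.5 High", 62.6, 3.59),
]


def pareto_frontier(rows):
    """Return the rows on the score/cost frontier, cheapest first.

    A row is kept only if its score is strictly higher than the score of
    every row that costs the same or less (hypothetical helper).
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher score first.
    for name, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append((name, score, cost))
            best_score = score
    return frontier


frontier_names = [name for name, _, _ in pareto_frontier(ROWS)]
for name, score, cost in pareto_frontier(ROWS):
    print(f"{name}: {score}% at ${cost:.2f} per task")
```

Because small score differences may not be statistically meaningful, rows that fall just off this frontier should not be read as clearly worse choices.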

CursorBench 3.1

Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
Improved grading criteria for some edit tasks.

CursorBench 3.0

Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
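
The cost calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the `Pricing` and `Usage` types, the function names, and every number in the example are hypothetical and do not reflect any model's real pricing or usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens, one rate per token category."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class Usage:
    """Token counts a model consumed on a single task."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(pricing: Pricing, usage: Usage) -> float:
    """Cost in USD of one task: tokens times per-million rate, per category."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(pricing: Pricing, usages: list[Usage]) -> float:
    """Average the per-task costs across all tasks."""
    if not usages:
        raise ValueError("need at least one task")
    return sum(task_cost(pricing, u) for u in usages) / len(usages)


# Hypothetical rates and token counts, chosen only to make the arithmetic easy.
example_pricing = Pricing(input=3.0, cache_read=0.3, cache_write=3.75, output=15.0)
example_usages = [
    Usage(input=100_000, cache_read=1_000_000, cache_write=200_000, output=20_000),
    Usage(input=0, cache_read=0, cache_write=0, output=10_000),
]
print(f"${avg_cost_per_task(example_pricing, example_usages):.2f}")
```

Averaging per-task costs, rather than dividing total spend by total tokens, gives every task equal weight regardless of how many tokens it consumed.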