Square Minus Square – A coding agent benchmark

Original link: https://aedm.net/blog/square-minus-square-2025-12-22/

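This experiment tested several large language models (LLMs), including Opus, Gemini 3 Pro, GPT-5.2 and Gemini 3 Flash, on a tricky geometry problem: triangulating the area of one square minus its intersection with a second, possibly overlapping, non-axis-aligned square, using as few triangles as possible. No LLM produced fully correct code. The top models nevertheless showed surprisingly good debugging ability, using visual feedback from screenshots they generated and examined during execution. This underlines the value of giving agents a way to check their own work. Gemini 3 Flash produced code that appears to work, but its output is not optimal: it adds unnecessary vertices and triangles. Results varied widely from run to run with no clear winner, and even the best models sometimes generated code that crashes. The full code and results are available on GitHub.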

Hacker News discussion: Square Minus Square – A coding agent benchmark (aedm.net), 22 points, submitted by Topfi, 1 comment

wariatus: Have you tried giving these agents access to a grounded vision model to analyze the images? In my experience, most models cannot properly understand this kind of input. I am currently experimenting with Molmo2 and it looks promising.

Original article

I tried several coding agents to implement the following task:

There are two squares on a 2D plane, possibly overlapping. They are not axis-aligned and have different sizes. Write a function that triangulates the area of the first square minus the area of the intersection. Use the least amount of triangles.

There is a single Rust function to be implemented in a standalone file, no dependencies:

pub fn generate(
    center1: [f32; 2], rotation1: f32, size1: f32,
    center2: [f32; 2], rotation2: f32, size2: f32,
) -> Vec<[f32; 2]> {
    // TODO
}
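
Any implementation first needs the corner points of each square. The helper below is a minimal sketch of that step, not the author's code: `square_corners` is a hypothetical name, and it assumes that `rotation` is in radians (counter-clockwise) and that `size` is the side length, neither of which the task statement specifies.

```rust
/// Corners of a square in counter-clockwise order.
/// Assumptions (not stated in the article): `rotation` is in radians,
/// counter-clockwise, and `size` is the side length.
pub fn square_corners(center: [f32; 2], rotation: f32, size: f32) -> [[f32; 2]; 4] {
    let (sin, cos) = rotation.sin_cos();
    let h = size * 0.5;
    // Unrotated corner offsets, counter-clockwise starting at bottom-left.
    let offsets = [[-h, -h], [h, -h], [h, h], [-h, h]];
    // Rotate each offset, then translate it to the square's center.
    offsets.map(|[x, y]| [center[0] + x * cos - y * sin, center[1] + x * sin + y * cos])
}
```

Returning the corners in a fixed winding order keeps later steps, such as edge-side tests or clipping, free of orientation special cases.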

I made a little framework that displays results. It can capture screenshots and video footage.

Several coding agents were tasked with implementing the function, and I also implemented it myself without AI. Agents were encouraged to generate screenshots and examine them.

I ran the test twice and picked the better result for each agent.

Video capture of the results:

More models:

Some takeaways:

  • To date, no LLM has solved the task successfully.
  • Nearly all of the models generate screenshots and examine them to fix bugs. They are surprisingly good at it: top models identify real issues correctly. This highlights the importance of the feedback loop: always provide a way for the agent to check its own work.
  • During development, I ran the test several times. There is no conclusive winner. The best models (Opus, Gemini 3 Pro, GPT-5.2) each came out on top at times, but each also sometimes generated code that crashes.
  • Gemini 3 Flash might seem to have solved the task well, but it adds unnecessary vertices and triangles.
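
Screenshots are one feedback loop; a numeric check is another. The sketch below is not part of the author's framework: it assumes the output is a flat list in which every three vertices form one triangle, and that both squares are given as counter-clockwise corner lists. All names (`polygon_area`, `clip_convex`, `expected_area`, `triangles_area`) are hypothetical. It computes the expected area with Sutherland-Hodgman clipping so it can be compared against the summed triangle areas.

```rust
type P = [f32; 2];

/// Signed area of a polygon (shoelace formula); positive when counter-clockwise.
fn polygon_area(poly: &[P]) -> f32 {
    let n = poly.len();
    let mut sum = 0.0;
    for i in 0..n {
        let (a, b) = (poly[i], poly[(i + 1) % n]);
        sum += a[0] * b[1] - b[0] * a[1];
    }
    sum * 0.5
}

/// Clips convex polygon `subject` against convex, counter-clockwise `clip`
/// (Sutherland-Hodgman). Returns their intersection, possibly empty.
fn clip_convex(subject: &[P], clip: &[P]) -> Vec<P> {
    let mut out = subject.to_vec();
    for i in 0..clip.len() {
        let (a, b) = (clip[i], clip[(i + 1) % clip.len()]);
        // Positive when p lies left of the directed edge a->b, i.e. inside.
        let side = |p: P| (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]);
        let input = std::mem::take(&mut out);
        for j in 0..input.len() {
            let (p, q) = (input[j], input[(j + 1) % input.len()]);
            let (sp, sq) = (side(p), side(q));
            if sp >= 0.0 {
                out.push(p);
            }
            // The segment p->q strictly crosses the clip edge: keep the crossing point.
            if (sp > 0.0 && sq < 0.0) || (sp < 0.0 && sq > 0.0) {
                let t = sp / (sp - sq);
                out.push([p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])]);
            }
        }
    }
    out
}

/// Expected area of "square 1 minus the intersection".
fn expected_area(sq1: &[P], sq2: &[P]) -> f32 {
    polygon_area(sq1) - polygon_area(&clip_convex(sq1, sq2)).abs()
}

/// Total area of a flat triangle list (assumed layout: 3 vertices per triangle).
fn triangles_area(vertices: &[P]) -> f32 {
    vertices.chunks_exact(3).map(|t| polygon_area(t).abs()).sum()
}
```

A test harness could assert that `triangles_area(&result)` matches `expected_area(&sq1, &sq2)` within a small tolerance. Matching areas are a necessary condition only: overlapping or misplaced triangles can still add up to the right total, and the check says nothing about whether the triangle count is minimal.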

Full code on GitHub.
