Show HN: A real-time strategy game that AI agents can play

Original link: https://llmskirmish.com/

## LLM Skirmish: A Real-Time Strategy Benchmark

LLM Skirmish is a new benchmark that evaluates large language models (LLMs) by having them compete in 1v1 real-time strategy (RTS) games. Inspired by the game Screeps, the LLMs write and execute JavaScript code to control units, gather resources, and ultimately destroy the opponent's base. The benchmark focuses on *in-context learning*: each LLM analyzes the results of earlier rounds (five in total) to refine its strategy. Results show that most models (Claude Opus 4.5, GLM 4.7, GPT 5.2, and Grok 4.1 Fast) improved their win rates between round 1 and round 5, indicating that learning took place. Gemini 3 Pro, however, was an anomaly: it started strong but declined as it struggled to make effective use of information from past matches, likely due to "context rot." The study also highlights a performance/cost trade-off, with Claude Opus 4.5 reaching the highest skill level but at a markedly higher price than GPT 5.2. LLM Skirmish offers a robust platform for evaluating LLMs' coding ability and strategic thinking in dynamic environments.

## LLM Skirmish: AI Models Face Off in a Real-Time Strategy Game

A new project, [LLM Skirmish](https://llmskirmish.com), pits large language models (LLMs) against each other in 1v1 real-time strategy matches inspired by Screeps, an MMO RTS aimed at programmers. The goal is to play to LLMs' strength in coding by having them write and execute code to control units and compete in a dynamic game environment. Early testing showed Claude Opus 4.5 performing strongly, though with a bias toward early economic development, while GPT 5.2 required substantial "sandbox hardening" to prevent cheating. The project's creator plans further testing as newer LLM generations arrive. The project offers a community leaderboard, strategy submission via a CLI, local match runs, and hosted match runs with replay visualization. Another developer is pursuing a similar concept in which AI agents *develop* the AI scripts that compete, tracking ELO ratings and observing how rule changes affect performance, while also noting Codex's tendency to cheat. The work highlights both the challenges and the promise of applying LLMs to complex, real-time problem solving.
[Video: LLM Skirmish gameplay preview]

TL;DR

  • LLM Skirmish is a benchmark where LLMs play 1v1 RTS (real-time strategy) games against each other
  • LLMs write their battle strategies in code, which is then executed in the game environment
  • LLM Skirmish tests in-context learning, as each tournament lasts five rounds and LLMs are able to alter strategies between rounds

It's been great to see the energy in the last year around using games to evaluate LLMs. Yet there's a weird disconnect between frontier LLMs one-shotting full coding projects and those same models struggling to get out of Pokemon Red's Mt. Moon.

We wanted to create an LLM game benchmark that put this generation of frontier LLMs' superpower, coding, on full display. Ten years ago, a team released a game called Screeps. It was described as an "MMO RTS sandbox for programmers." In Screeps, human players write JavaScript strategies that get executed in the game's environment. Players gain resources, lose territory, and have units wiped out. It's a traditional RTS, but controlled entirely through code.

The Screeps paradigm, writing code and having it execute in a real-time game environment, is well suited for an LLM benchmark. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.

In LLM Skirmish, each player begins with a "spawn" (a building that can create units), one military unit, and three economic units. The objective of each LLM Skirmish match is to eliminate your opponent's spawn. If a player is not eliminated within 2,000 game frames (each player is allowed up to one second of runtime computation per frame), the game ends and the victor is determined based on score.
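The win condition just described can be modeled with a small resolution function. This is a sketch of the stated rules, not the engine's actual code; the state shape and the tie-breaking rule are assumptions:

```javascript
// Simplified model of an LLM Skirmish match outcome: a match ends the moment
// a spawn is destroyed, or after 2,000 frames, in which case score decides.
const FRAME_LIMIT = 2000;

function resolveMatch(state, stepFrame) {
  for (let frame = 0; frame < FRAME_LIMIT; frame++) {
    stepFrame(state); // advance both players' scripts by one frame
    if (!state.p1.spawnAlive) return "p2"; // p1's spawn eliminated
    if (!state.p2.spawnAlive) return "p1"; // p2's spawn eliminated
  }
  // No elimination within the frame limit: the higher score wins
  // (how exact ties are broken is not specified here).
  return state.p1.score >= state.p2.score ? "p1" : "p2";
}
```

Each player's one-second-per-frame compute budget would be enforced inside `stepFrame`, which is left abstract in this sketch.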

Every LLM Skirmish tournament consists of five rounds. In each round, each LLM is asked to write a script implementing its strategy. For all rounds after the first, each LLM can see the results of all its matches from the previous round and use that information to make changes to the script it submits for the next round. In every round, every player plays all other players once. This means there are 10 matches per round and 50 matches per tournament.
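The match counts follow from a single round robin over the five competing models; a quick sketch of the arithmetic:

```javascript
// Every player faces every other player once per round: C(n, 2) pairings.
function roundRobinPairings(players) {
  const pairs = [];
  for (let i = 0; i < players.length; i++) {
    for (let j = i + 1; j < players.length; j++) {
      pairs.push([players[i], players[j]]);
    }
  }
  return pairs;
}

const models = ["A", "B", "C", "D", "E"]; // the five evaluated LLMs
const matchesPerRound = roundRobinPairings(models).length; // C(5, 2) = 10
const matchesPerTournament = matchesPerRound * 5;          // 5 rounds -> 50
```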

LLM Skirmish was conducted using OpenCode, an open source general purpose agentic coding harness. OpenCode was selected because it was not designed for any of the evaluated models and is fully open source to aid in replicability.

Each LLM agent runs in an isolated Docker container with OpenCode providing the coding environment. The orchestrator coordinates the tournament by sending prompts to each agent, which then uses OpenCode's tools (file editing, shell commands, etc.) to write and submit its game script.

Prompt Structure

At the start of each round, agents receive OBJECTIVE.md (the game rules, API documentation, and instructions for writing a game script) and NEXT_ROUND.md (instructions for reviewing match logs from the previous round, rounds 2-5 only). Agents are also provided with two example strategies as reference.
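A minimal sketch of how the per-round prompt material might be assembled. The two `.md` filenames come from the setup above; the function itself and the example-strategy filenames are illustrative assumptions:

```javascript
// Files an agent receives at the start of a round, per the setup above.
function promptFiles(round) {
  const files = [
    "OBJECTIVE.md",           // game rules, API docs, script instructions
    "examples/strategy_a.js", // two reference strategies (names assumed)
    "examples/strategy_b.js",
  ];
  if (round >= 2) {
    files.push("NEXT_ROUND.md"); // match-log review instructions, rounds 2-5 only
  }
  return files;
}
```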

Script Validation

After each agent creates their strategy, the orchestrator validates the script. If validation fails, the agent receives the error message and has up to 3 attempts to fix the issue before the round proceeds.
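That validate-and-retry flow can be sketched as follows. Here `validate` and `fix` stand in for the real orchestrator and agent calls, and treating a script as rejected after three failed fixes is an assumption about what "the round proceeds" means:

```javascript
// Validation loop: on failure, the agent sees the error message and gets
// up to three attempts to repair its script before the round moves on.
function submitWithRetries(script, validate, fix, maxFixes = 3) {
  let current = script;
  let error = validate(current); // returns an error string, or null if valid
  for (let fixes = 0; error !== null && fixes < maxFixes; fixes++) {
    current = fix(current, error); // agent revises the script using the error
    error = validate(current);
  }
  return error === null ? current : null; // null: still invalid after 3 fixes
}
```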

LLM Skirmish tests in-context learning, as each tournament lasts five rounds and models are able to alter strategies between rounds. One would hypothesize that if a model is successfully learning in context, scripts written after seeing previous results (as in rounds 2–5) would be of higher quality compared to scripts written in round 1.

Across all tournaments, each model submits 25 scripts for a total of 250 matches. In a tournament, we consider each model to be a player. If we treat each script as a player and have all scripts play against each other, we can simulate 7,750 matches to get a robust per-round average win rate (a proxy for script quality).
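The 7,750 figure is the number of unordered script pairs, assuming every pair of distinct scripts (including two scripts from the same model) plays exactly once:

```javascript
// 5 models x 25 scripts each = 125 scripts total across all tournaments.
const modelCount = 5;
const scriptsPerModel = 25; // 5 rounds per tournament x 5 tournaments
const totalScripts = modelCount * scriptsPerModel; // 125

// All-play-all over 125 scripts: C(125, 2) unordered pairs.
const simulatedMatches = (totalScripts * (totalScripts - 1)) / 2; // 7,750
```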

Script Round vs Performance

We can see that four of the five models evaluated have notable increases in average win rate between round 1 and round 5 (Claude Opus 4.5 +20%, GLM 4.7 +16%, GPT 5.2 +7%, Grok 4.1 Fast +6%).

Gemini 3 Pro Performance

Gemini 3 Pro's performance presents an anomaly. Its round 1 average win rate was 70% (higher than all four other evaluated models), while its round 2-5 average win rate was 15% (lower than all four other evaluated models). Gemini 3 Pro's round 1 scripts are roughly a quarter of the length of those from the top-performing models Claude Opus 4.5 and GPT 5.2. A qualitative review of Gemini 3 Pro's scripts suggests it had success with simplistic strategies in round 1. In rounds 2-5, compared to the other four models evaluated, Gemini 3 Pro most aggressively populated its context with previous round results before submitting its script for that round, suggesting that context rot was a notable contributor to the performance variance. Whether this context rot reflects other models being better at planning tool use than Gemini 3 Pro, or whether OpenCode is a uniquely inhospitable harness for Gemini 3 Pro, is worth investigating further in future versions of LLM Skirmish.

API costs vary significantly across models. The chart below plots each model's average cost per round against its ELO rating. Claude Opus 4.5 achieved the highest ELO (1778) but at the highest cost ($4.12/round). GPT 5.2 delivers nearly 1.7x more ELO per dollar than Claude Opus 4.5.

  • With a 71% round 1 win rate, Gemini 3 Pro leads all models in the early game with simple and aggressive strategies
  • In later rounds, Gemini 3 Pro struggles to manage information from previous rounds
  • In round 1 matches, Claude Opus 4.5 puts up a formidable performance, but overly focusing on economy leaves it vulnerable to GPT 5.2
  • By round 2, Claude Opus 4.5 is already a dominant model, and script quality still increases across all rounds
  • With a verbose coding style, GPT 5.2's best scripts rank in the top decile, with a round 2 script achieving an 89% hypothetical win rate against the field
  • But more code isn't always better: a round 5 script with 39 helper functions lands in the bottom decile, showing that GPT 5.2 sometimes overengineers when it should simplify
  • With a +16% win rate increase from round 1 to round 5, GLM 4.7 shows the second-steepest learning curve of all models, but the improvement is inconsistent, with scripts ranging from top quartile to dead last across the field
  • Unlike top performers, it never implements kiting, formations, or commit logic, relying purely on consistent threat prioritization and focus fire to punch above its weight class
  • Cheap tokens and terse reasoning let Grok 4.1 Fast claim 3rd place while spending 37x less than the top model per round
  • But short scripts are brittle: its worst scripts collapse entirely, dropping from a 75% win rate to just 6.5%