MTG Bench: Testing how well LLMs can play Magic

Original link: https://mtgautodeck.com/articles/mtg-bench/

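This project evaluates how well large language models (LLMs) can simulate complex Magic: The Gathering games without a hard-coded rules engine. Through an MCP server, a model can perform primitive library operations (such as drawing and shuffling) and combine them into more complex game actions. The results show that although models are reasonably good at identifying legal actions, they frequently stumble when actually executing them, often failing to correct mistakes in complex sequences or losing track of the current game state. One focus of the analysis is cost. Using an MCP server with the OpenAI API lets the agent loop be treated as a single request, which avoids repeated cached input token charges and minimizes cost. By contrast, Anthropic's current implementation charges for the system prompt after every tool call, which makes it more expensive. The project was built entirely through vibe coding, with no code written by hand. The current tool is only a proof of concept and is slower and more expensive than simulating by hand, but the author expects that as models become cheaper and more accurate, thousands of parallel simulations could enable automated deck optimization and statistical performance analysis. The project is open source on GitHub.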

Original text

Results

Example successes

Example failures

How the benchmark works

The main idea is that if an LLM is smart enough to play good Magic, then it is also smart enough not to need a rules engine. A rules engine that enforces legal actions would raise the performance floor, but I don't think it would improve the overall quality of the simulation.

Each LLM call has access to an MCP server with primitive library operations. It can do things like draw a card from the top of the deck, return a card to the bottom of the deck, and shuffle. To simulate more advanced operations, like scry or surveil, it can chain multiple library tool calls.
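
To make the idea of composing primitives concrete, here is a minimal in-memory sketch. It is not the project's MCP server: the `Library` class, its method names, the `return_to_top` primitive, and the `scry_one` helper are all hypothetical, chosen only to show how a scry could be built from simple draw and return operations.

```python
import random
from collections import deque


class Library:
    """Hypothetical stand-in for the library primitives (names are illustrative)."""

    def __init__(self, cards, seed=None):
        self._cards = deque(cards)  # index 0 is the top of the deck
        self._rng = random.Random(seed)
        self.log = []  # (operation, card, reason) tuples, one per call

    def draw(self, reason):
        """Remove and return the top card."""
        card = self._cards.popleft()
        self.log.append(("draw", card, reason))
        return card

    def return_to_top(self, card, reason):
        """Put a card back on top (assumed primitive, not confirmed by the article)."""
        self._cards.appendleft(card)
        self.log.append(("top", card, reason))

    def return_to_bottom(self, card, reason):
        """Put a card on the bottom of the deck."""
        self._cards.append(card)
        self.log.append(("bottom", card, reason))

    def shuffle(self, reason):
        """Randomize the order of the remaining cards."""
        cards = list(self._cards)
        self._rng.shuffle(cards)
        self._cards = deque(cards)
        self.log.append(("shuffle", None, reason))

    def size(self):
        return len(self._cards)


def scry_one(library, keep_on_top):
    """Scry 1 composed from two primitive calls: look, then top or bottom."""
    card = library.draw("Scry 1: look at top card")
    if keep_on_top(card):
        library.return_to_top(card, "Scry 1: keep on top")
    else:
        library.return_to_bottom(card, "Scry 1: put on bottom")
    return card
```

In this sketch the server only tracks card order. Deciding whether to keep the card, and whether the scry was legal at all, is left to the caller, which matches the division of labor described here.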

Everything other than the library is managed by the LLM. Legality checks and scoring for the benchmarks were all done with gpt-5.5 (medium). From my testing, LLMs were much better at evaluating whether a simulated turn was legal than they were at actually performing a legal turn simulation.

Why I chose to use an MCP server

I have full control over all of the data and the LLM api calls, so why use MCP instead of basic function/tool calling?

The main reason is that OpenAI and Anthropic allow you to provide a remote MCP server url in an api request. This means that OpenAI or Anthropic handle the agent loop. This has two major benefits.

Since it is one api call, you don't pay the cached input token cost after each tool use (at least with OpenAI; more on that later).
You can use the batch api for 50% savings without having to submit a new batch after every tool call.
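
As a rough illustration of what "providing a remote MCP server url in an api request" can look like, the sketch below builds a request body in the shape of OpenAI's Responses API `mcp` tool. The helper name, the `mtg_library` label, and the example URL are hypothetical, and this is not taken from the project's code. It only builds a dictionary and makes no network call.

```python
def build_remote_mcp_request(model, system_prompt, user_input, mcp_url):
    """Build a Responses-API-style request that delegates the agent loop.

    The provider connects to the remote MCP server itself, so the
    application sends one request instead of one request per tool call.
    """
    return {
        "model": model,
        "instructions": system_prompt,  # the large system prompt
        "input": user_input,
        "tools": [
            {
                "type": "mcp",
                "server_label": "mtg_library",  # hypothetical label
                "server_url": mcp_url,          # hypothetical endpoint
                "require_approval": "never",    # let tool calls run unattended
            }
        ],
    }


request = build_remote_mcp_request(
    model="gpt-5.5",
    system_prompt="You are simulating one turn of Magic...",
    user_input="Simulate turn 3.",
    mcp_url="https://example.com/mcp",
)
```

With the official `openai` Python package, a dictionary like this could be passed as `client.responses.create(**request)`, assuming the MCP server is reachable from the provider's side.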

Input token caching

In my opinion, the way cached input tokens are charged does not make sense for agent loops. The pricing makes sense for independent requests. If multiple independent api calls start with the same large system prompt, input caching gets you a discount for free, or for a small caching fee.

With an agent loop, however, you are charged the cached input cost for a large system prompt after every tool call. Consider an example. Assume the system prompt is already cached and tool calls result in negligible token use.

Large system prompt = 10k tokens
Agent calls 10 tool functions (not parallel)
Billed cached input tokens = 10k + 10k * 10 = 110k tokens
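
The arithmetic above can be written as a small function. This is a simplified model under the same assumptions (prompt already cached, tool calls contribute negligible tokens). The function name and the `rebill_per_tool_call` flag are illustrative, not part of any provider's API.

```python
def billed_cached_input_tokens(system_prompt_tokens, tool_calls,
                               rebill_per_tool_call=True):
    """Estimate cached input tokens billed for one agent loop.

    If the prompt is re-billed after every tool call, it is charged once
    for the initial request plus once per sequential tool call. Otherwise
    it is charged a single time for the whole loop.
    """
    requests = 1 + tool_calls if rebill_per_tool_call else 1
    return system_prompt_tokens * requests


per_call = billed_cached_input_tokens(10_000, 10)
single = billed_cached_input_tokens(10_000, 10, rebill_per_tool_call=False)
print(per_call, single)
```

For the example above, the re-billed case gives 110,000 tokens and the single-charge case gives 10,000, an 11x difference that grows linearly with the number of tool calls.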

I don't think it makes sense to charge for the system prompt after every agent turn if the LLM is only pausing for a fraction of a second while waiting for a tool function result. This is overlooking some details, like how it takes output tokens to call a tool, and the tool function result still needs to be processed as input tokens. But in my case, the api cost is dominated by the large system prompt being charged as cached input tokens after every agent turn.

The pricing for an agent loop is understandable when your application code has the agent loop, and is making a new api call after each tool call. But it makes even less sense when you provide a remote MCP server and do not handle the agent loop yourself. OpenAI handles it correctly. A single api call to OpenAI with a remote MCP server will only ever charge you for the input prompt once. An Anthropic api call with remote MCP server, however, works like the previous example.

Some real numbers: the gpt-5.5 (medium) benchmark averaged 11,386 input tokens per Magic turn. The average for claude-fable-5 (medium) was 51,610.

Overeager tool calling

More than most benchmarks, this one punishes models that are too eager to call tools. In many benchmarks, tool calls only retrieve information, so if a model calls too many tools, the only downside is wasted input tokens and context window for the tool results. Even if a tool mutates state, the change can usually be undone so the final result is still correct.

This is not the case when simulating Magic. If you draw a card, then realize that was a mistake, you can't just put it back. Even if you do return the card to the deck, you now know what that card is, so the simulation is illegal.

A common failure mode was the model starting a tool call, then realizing it was a mistake and having no way to correct it. All the library MCP functions have a required reason field. If you look at this example from Opus 4.8, you can see that it draws a card for turn with reason "Draw for turn", then returns the card to the deck with reason "No-op check not needed; cancel". It then proceeds to return a card named "x" to the deck with reason "noop", then again with reason "stop".

What's next

I made MTG Auto Deck as a way to try out vibe coding. I had not been keeping up with the state of LLM-based coding, and I ended up making this project and the benchmark without writing a single line of code by hand.

I only made a live version with accounts and payments because of how quick it was to implement. The project is on GitHub if you want to run it and use your own api keys or a local llama.cpp.

I wouldn't actually recommend paying for the live version. With the current cost and speed of models that can accurately play Magic, the app does not provide much utility. Simulating turns one at a time is slower than manually goldfishing your deck using one of the online tools. And it is too expensive to run dozens of simulations in parallel and give you a summary.

As better, cheaper LLMs get released, I think there is some version of this app that would be useful. I can imagine running hundreds of simulations, then giving statistical results about which cards are good and bad, or automatically optimizing a deck by swapping out cards for you.