The M×N problem of tool calling and open-source models

Original link: https://www.thetypicalset.com/blog/grammar-parser-maintenance-contract

## The tool-calling challenge in open-source models

Augmenting LLMs with external functions is usually straightforward with closed-source models, which offer a seamless API experience. Moving to open-source models, however, exposes a significant obstacle: **inconsistent "wire formats"**, the way each model encodes its tool calls. Every model family (e.g. Gemma, Harmony, DeepSeek) uses a distinct format, with different tokens, structure, and argument serialization; if the engine does not support that format, the output comes back garbled.

This forces every inference engine (vLLM, SGLang, etc.) to write a custom parser for *every* model, costly and repetitive work. Generic parsers struggle with the open-endedness of these formats and frequently fail on model-specific quirks, such as reasoning tokens leaking into arguments.

Today, grammar engines (used for constrained generation) and output parsers each reverse-engineer these formats independently. What is needed is a shared, **declarative specification** of tool-call formats: a configuration file detailing boundary tokens and argument structure, so that model updates are decoupled from code changes across the ecosystem and duplicated work is avoided. This separation is essential for efficient, scalable tool calling with open-source LLMs.

A recent Hacker News discussion highlighted "tool calling" as a critical but currently fragmented problem in AI, specifically how large language models interact with external tools. The linked article (thetypicalset.com) sparked discussion about the lack of a standardized format for this interaction despite AI's rapid progress. Users agreed it is an important topic, though one often overshadowed by hype. There was curiosity about why OpenAI's "Harmony" format has not been widely adopted, with speculation that this may require further model development. One commenter raised "MCP" as a potential solution. A minor piece of feedback concerned the site's formatting, specifically excessive indentation hurting readability. Overall, the discussion underscored an industry-wide need to unify tool-calling standards in order to build more effective AI applications.

Original Article

Tool calling with closed-source models is seamless. You pass a list of functions to the API, the model calls them, you get structured JSON back. The wire format is invisible to you.

Then you move to open models and discover that tool calling depends on a wire format the engine has to understand. If the engine doesn’t support that model’s format yet, the output comes back garbled: reasoning tokens in arguments, malformed JSON, missing tool calls. Then you either wait, or write the parser yourself.

What “supporting a model” actually means

Every model family encodes tool calls differently.

Here’s the same semantic operation, calling a function search(query="GPU"), in three wire formats:

gpt-oss (Harmony):

<|channel|>commentary to=functions.search
<|constrain|>json<|message|>
{"query": "GPU"}
<|call|>

DeepSeek:

<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>search
```json
{"query": "GPU"}
```
<|tool▁call▁end|><|tool▁calls▁end|>

GLM5:

<tool_call>search
<arg_key>query</arg_key><arg_value>GPU</arg_value>
</tool_call>

Same operation, incompatible wire formats: different token vocabularies, boundary markers, and argument serialization schemes.

To return a nice array of JSON objects with the generated tool calls, you need to parse the model output back into a clean API response. In practice, each of the M applications (vLLM, SGLang, TensorRT-LLM, transformers, etc.) ends up writing custom parsers for each model it wants to support. And that is only half of the implementation burden.
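As a sketch of the per-model work involved, here is a minimal extractor for the GLM-style envelope shown above. The function name and regexes are illustrative, not taken from any engine's actual code:

```python
import re

def parse_glm_tool_call(text: str) -> list[dict]:
    """Extract tool calls from GLM-style <tool_call> envelopes (illustrative sketch)."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        # The function name is the first line inside the envelope.
        name = block.split("\n", 1)[0].strip()
        # Arguments are serialized as <arg_key>/<arg_value> pairs, not JSON.
        args = dict(re.findall(
            r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>", block, re.DOTALL))
        calls.append({"name": name, "arguments": args})
    return calls

print(parse_glm_tool_call(
    '<tool_call>search\n<arg_key>query</arg_key><arg_value>GPU</arg_value>\n</tool_call>'))
# → [{'name': 'search', 'arguments': {'query': 'GPU'}}]
```

A Harmony or DeepSeek parser would need entirely different regexes and a JSON decoding step instead, which is exactly the duplication the rest of this post is about.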

The pace of the problem

Gemma 4 is a good illustration of the difficulty involved. Its <|channel|> reasoning tokens get stripped by the decoder before the parser sees them (vLLM #38855). Reasoning content can leak into tool-call arguments (vLLM PR #39027). The model’s non-standard format was different enough that llama.cpp had to abandon its generic autoparser and build a dedicated implementation (llama.cpp PR #21418). These are training-time format choices surfacing as parser bugs.

Generic parsers are swimming against the current

The natural response is to build a parser generic enough to handle all formats. Every engine has tried. A reasonable heuristic, say “find special tokens, extract JSON between them,” covers some formats well enough. But then Harmony routes through <|channel|> with a to= attribute, and GLM5 serializes arguments as <arg_key>/<arg_value> pairs rather than as JSON at all.

This is the fundamental problem: wire formats are training-time decisions, and nothing constrains them to a shared convention. The space of possible formats is open-ended, so a generic parser is trying to anticipate design choices that haven’t been made yet. That is why generic parsers help with the common cases but do not eliminate the per-model tail, where the hard bugs live: reasoning tokens leaking into arguments, decoders stripping special tokens before the parser sees them, end-of-generation signals colliding with content.
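A naive version of that heuristic might look like the following sketch (my own toy code, not any engine's implementation). It handles JSON-between-markers formats, but for GLM-style key/value arguments it silently finds nothing:

```python
import json
import re

def naive_generic_parse(text: str) -> list[dict]:
    """Heuristic: grab anything that looks like a JSON object and try to decode it."""
    parsed = []
    for candidate in re.findall(r"\{.*?\}", text, re.DOTALL):
        try:
            parsed.append(json.loads(candidate))
        except json.JSONDecodeError:
            continue
    return parsed

# Works for JSON-carrying formats like Harmony:
print(naive_generic_parse('<|message|>{"query": "GPU"}<|call|>'))
# → [{'query': 'GPU'}]

# Returns nothing for GLM-style key/value serialization — a silent failure:
print(naive_generic_parse(
    '<tool_call>search\n<arg_key>query</arg_key><arg_value>GPU</arg_value>\n</tool_call>'))
# → []
```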

The same model-specific format knowledge is also needed during generation, not just after the fact when parsing the result. That is where grammar engines enter the picture.

The missing separation

When a new model ships, work happens in two independent places.

Grammar engines, like Outlines, XGrammar, and llama.cpp’s grammar support, need to know where to apply constraints during generation: which tokens mark the tool-call envelope, when to activate structured generation inside it, and when to leave the model unconstrained outside it.
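To make “where to apply constraints” concrete, here is a toy boundary-driven switch (a hypothetical sketch, not the API of Outlines, XGrammar, or llama.cpp): given the envelope tokens, it marks which positions should be under structured generation and which should be left free.

```python
def constrained_regions(tokens, start="<tool_call>", end="</tool_call>"):
    """Yield (token, constrained) pairs: constraints apply only inside the envelope."""
    inside = False
    for tok in tokens:
        if tok == start:
            inside = True
            yield tok, False  # the boundary token itself is generated unconstrained
        elif tok == end:
            inside = False
            yield tok, False
        else:
            yield tok, inside

# Only "search" falls inside the envelope, so only it would be constrained:
print(list(constrained_regions(["hi", "<tool_call>", "search", "</tool_call>", "bye"])))
```

The point is that the start and end tokens are exactly the model-specific knowledge a parser also needs; the grammar engine just consumes it in the forward direction.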

Output parsers inside vLLM, SGLang, TensorRT-LLM, and transformers need to do the reverse: take the raw generated text and extract tool calls into a clean API response, using the same format knowledge in the opposite direction.

These are different teams, different codebases, different release cycles. But the model-specific knowledge they need is the same: which tokens mark the boundaries, how arguments are serialized, where reasoning tokens can appear. Today each team reverse-engineers this independently from chat templates and (if they’re lucky) documentation.

The result is N models × M implementations of the same format knowledge, developed in parallel with no shared contract. A new model ships, and grammar engine maintainers and inference engine maintainers both start the same reverse-engineering work from scratch.

We have already seen the ecosystem converge on shared chat templates in Hugging Face, standardizing how prompts and turns are formatted. Tool calling needs the same kind of separation: not one wire format, but a shared declarative way to describe them. Until that exists, each new model will keep triggering the same reverse-engineering work across the stack.

The separation that’s missing is extracting that shared format knowledge into configuration rather than code. A model’s wire format, meaning its boundary tokens, its argument serialization, and its reasoning-token behavior, should be a declarative spec that both grammar engines and parsers consume. The model changes, you update the spec. The grammar engine and the parser don’t move.
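What might such a spec look like? Here is a hypothetical sketch, written as a plain Python dict; every key name is invented for illustration, since no such standard exists yet:

```python
# Hypothetical declarative wire-format spec for a GLM-style model.
# Key names are illustrative; this is not an existing standard.
GLM_STYLE_SPEC = {
    "call_start": "<tool_call>",
    "call_end": "</tool_call>",
    "name_position": "first_line",   # function name is the first line of the envelope
    "arguments": {
        "serialization": "kv_tags",  # vs. "json" for Harmony/DeepSeek-style formats
        "key_open": "<arg_key>",
        "key_close": "</arg_key>",
        "value_open": "<arg_value>",
        "value_close": "</arg_value>",
    },
}

def boundary_tokens(spec: dict) -> tuple[str, str]:
    """The slice of the spec a grammar engine would read to place constraints."""
    return spec["call_start"], spec["call_end"]

print(boundary_tokens(GLM_STYLE_SPEC))
# → ('<tool_call>', '</tool_call>')
```

A parser would read the same dict to build its extraction logic; a new model would ship a new dict, and neither codebase would change.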


I am Rémi Louf, CEO of dottxt. Follow @remilouf / @dottxtai for our work on structured generation and tool calling.
