Micro-Agent: Beat Frontier Models with Collaboration Inside Model API

Original link: https://vllm.ai/blog/2026-06-29-micro-agent-frontier-models

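The next frontier in AI lies not only in the models themselves but in the "router": the control plane for AI inference. Modern routers are no longer simple relays; they are evolving into an execution runtime for "micro-agents," turning a single API call into a coordinated collaboration. **vLLM Semantic Router** moves this logic into the open serving layer. It keeps a stable OpenAI-compatible API while managing complex "execution recipes" internally. Depending on the request, the router can trigger several algorithms, such as **Confidence** (sequential escalation for cost efficiency), **Ratings** (parallel ensembling for quality), **ReMoM** (synthesized reasoning for breadth), or **Workflows** (bounded agentic steps). By hosting these patterns in the serving layer, the router can manage budgets, error policies, and output contracts, keeping the complexity hidden from end users. This infrastructure-level approach lets developers treat a "model" as a general surface backed by a collaborating team rather than a static checkpoint. Ultimately, the next era of AI will be defined by routers that intelligently orchestrate different models (edge or cloud, open or closed) to deliver better performance and reliability.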


Original article

Everyone is watching for the next frontier model.

The more interesting layer may be the one in front of it.

Routers are becoming the control plane for AI inference. Their first role was practical: route the right request to the right model. That already matters because production AI is no longer a one-model world.

A router can cut cost by deciding when a request deserves a frontier model and when an open-source or local model is enough. It can make safety policy executable by sending sensitive domains to stricter models, stricter filters, or stronger review paths. It can coordinate cloud and edge, keeping private or low-latency intent local while escalating harder work to the cloud.

Those are important jobs.

But the next router job is more interesting:

A router can make the model better.

Not by changing weights. Not by asking every application to build a bespoke agent graph. By turning one model API call into a bounded collaboration inside the serving layer.

Figure 1: The router is moving from model selection to capability construction.

This is why Sakana Fugu landed so loudly: it made a commercial product out of a simple but powerful idea, that a "model" can be a surface, and behind that surface can be a team. The research around this idea, including the Fugu technical report and coordination papers such as Conductor and Trinity, gives useful language for thinking about orchestration.

But the vLLM Semantic Router vision is different in where it puts the abstraction. Collaboration should not live only inside one commercial endpoint or one application-specific agent graph. It should become an open serving primitive.

vLLM Semantic Router brings that idea into the open serving layer. The user still calls one model:

{
  "model": "vllm-sr/auto",
  "messages": [{"role": "user", "content": "..."}]
}

Behind that stable model identity, the router can select a recipe, fan out to workers, collect a quorum, verify disagreement, synthesize a final answer, repair the output contract, and return one normal OpenAI-compatible response.

The point is not to expose complexity.

The point is to make collaboration feel like a model.
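
As a minimal sketch of what "one model" looks like from the client side, the snippet below builds the same chat completion request using only the Python standard library. The `/v1/chat/completions` path is the usual OpenAI-compatible route; the host and port are assumptions for illustration, not a documented default, and `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "vllm-sr/auto",  # the single public model name used in this post
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local router address; replace it with your own deployment.
req = build_chat_request("http://localhost:8801", "Explain quorum in one sentence.")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req), then parse the JSON body.
```

Nothing in the request names a recipe, a worker, or a judge. Whatever collaboration happens stays behind the model name.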

The Looper Is the Runtime

In vLLM Semantic Router, the looper is the execution runtime for bounded micro-agents.

A request enters the router as an ordinary chat completion. The router extracts signals, projects them into task-shape or risk bands, matches a decision, and then chooses an algorithm. That algorithm may be a normal single-model route, or it may be a looper route.

Today, the main looper patterns are:

  • Confidence: a sequential escalation loop. It tries a cheaper candidate first, measures confidence, and escalates only when the score is too low.
  • Ratings: a bounded fan-out loop. It runs multiple candidates under a hard concurrency cap and aggregates them with rating-aware weights.
  • ReMoM: repeated mixture-of-model reasoning. It fans out breadth samples, waits for enough successful responses, and runs a final synthesis round.
  • Fusion: a panel-judge-final pattern. Independent model responses become evidence for a judge and finalizer.
  • Workflows: a micro-agent workflow runtime. It supports static roles or a dynamic planner, executes bounded worker steps, and synthesizes a final response.
Figure 2: Looper algorithms run inside the router while preserving the model API surface.

The implementation details matter. A looper is not a slogan for "ask more models." It is a small runtime with budget, topology, trace, and failure policy.

Confidence: spend escalation only on hard cases

Confidence is the cost-aware loop. It starts with a smaller or cheaper candidate, then evaluates whether the answer is confident enough to stop. The confidence signal can come from token-level log probability, logprob margin, a hybrid score, self-verification, or an AutoMix-style entailment verifier.

If the score passes the threshold, the router returns immediately. If the score is too low, the route escalates to the next candidate. The important part is not that escalation exists. It is that escalation becomes explicit router policy: thresholds, failure behavior, and stopping conditions are visible and tunable.
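
The escalation loop can be sketched in a few lines. This is an illustration under stated assumptions, not the router's implementation: candidates are hypothetical callables that return an answer plus per-token log probabilities, the confidence score is a simple geometric-mean token probability, and returning the last candidate's answer when nobody clears the threshold is an assumed failure policy.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical candidate signature: prompt -> (answer, per-token log probabilities).
Candidate = Callable[[str], Tuple[str, List[float]]]


def mean_token_prob(logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability; one simple logprob-based confidence proxy."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


def confidence_route(
    prompt: str,
    candidates: Sequence[Tuple[str, Candidate]],
    threshold: float,
) -> Tuple[str, str, float]:
    """Try candidates in order (cheapest first); stop at the first confident answer.

    Returns (model_name, answer, score). If no candidate clears the threshold,
    the last candidate's result is returned (an assumed policy for this sketch).
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    result = None
    for name, call in candidates:
        answer, logprobs = call(prompt)
        score = mean_token_prob(logprobs)
        result = (name, answer, score)
        if score >= threshold:
            break  # confident enough: stop spending
    return result


# Stub candidates standing in for real model backends.
def small(prompt: str) -> Tuple[str, List[float]]:
    return "maybe 41", [-1.0, -1.2]


def large(prompt: str) -> Tuple[str, List[float]]:
    return "42", [-0.05, -0.02]


print(confidence_route("What is 6 * 7?", [("small", small), ("large", large)], threshold=0.8))
```

With a threshold of 0.8 the small stub is not confident enough, so the route escalates. Lowering the threshold far enough lets the cheaper answer through, which is exactly the knob an operator would tune.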

Figure 3: Confidence turns escalation into a measured stopping policy.

Ratings: parallel quality under a hard cap

Ratings is the controlled ensemble loop. It launches several candidates in parallel, but only up to a configured max_concurrent cap. That makes it useful when a route should benefit from multiple model views without turning every request into an unbounded fan-out.

The router collects successful responses, applies rating-aware aggregation, and handles failures according to the route policy. In practice, Ratings is a good fit for A/B-style evaluation, ensemble strategies, and routes where the operator already has meaningful per-candidate quality signals.
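
A rough sketch of the bounded fan-out, assuming async candidate callables and a rating-weighted vote as the aggregation rule. Both are illustrative choices: the post does not specify the aggregation formula, and `ratings_route` is a hypothetical name. Only `max_concurrent` comes from the text.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List, Tuple

AsyncCandidate = Callable[[str], Awaitable[str]]


async def ratings_route(
    prompt: str,
    candidates: List[Tuple[str, float, AsyncCandidate]],  # (name, rating, call)
    max_concurrent: int,
) -> str:
    """Run all candidates with at most max_concurrent in flight, then take a
    rating-weighted vote over the answers that succeeded."""
    gate = asyncio.Semaphore(max_concurrent)

    async def run_one(rating: float, call: AsyncCandidate):
        async with gate:  # hard cap on concurrent backend calls
            try:
                return rating, await call(prompt)
            except Exception:
                return None  # assumed failure policy: drop the failed candidate

    results = await asyncio.gather(*(run_one(r, c) for _, r, c in candidates))
    votes: Dict[str, float] = defaultdict(float)
    for item in results:
        if item is not None:
            rating, answer = item
            votes[answer] += rating
    if not votes:
        raise RuntimeError("every candidate failed")
    return max(votes, key=votes.get)


# Stub backends standing in for real models.
async def says_a(prompt: str) -> str:
    return "A"


async def says_b(prompt: str) -> str:
    return "B"


async def broken(prompt: str) -> str:
    raise TimeoutError("provider timed out")


demo = [("m1", 0.7, says_a), ("m2", 0.4, says_b), ("m3", 0.4, says_b), ("m4", 0.9, broken)]
print(asyncio.run(ratings_route("pick one", demo, max_concurrent=2)))
```

In the demo, two lower-rated candidates that agree outweigh one higher-rated candidate, and the failing provider simply contributes no vote.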

Figure 4: Ratings keeps multi-candidate execution bounded and rating-aware.

ReMoM: breadth with a contract

ReMoM is useful when the task has high reasoning variance and the answer format must survive the collaboration. It fans out multiple reasoning attempts, waits for a minimum-success quorum, then asks a synthesis model to merge evidence into the required output contract.

If synthesis fails but earlier workers produced valid evidence, the route does not have to collapse into an API error. It can fall back to the best valid evidence and still return a normal response.
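
The breadth, quorum, synthesis, and fallback steps can be sketched as follows. Everything here is a simplification under assumptions: workers and the synthesizer are hypothetical callables, threads stand in for real backend calls, and "best valid evidence" is approximated by the most common valid answer.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Sequence


def remom_route(
    prompt: str,
    workers: Sequence[Callable[[str], str]],
    synthesize: Callable[[str, List[str]], str],
    is_valid: Callable[[str], bool],
    min_success: int,
) -> str:
    """Fan out, wait for a minimum-success quorum, synthesize, and fall back
    to existing evidence if synthesis fails or breaks the output contract."""
    if min_success < 1:
        raise ValueError("min_success must be at least 1")
    evidence: List[str] = []
    # Leaving the `with` block still waits for stragglers in this sketch;
    # a real runtime would cancel or time them out.
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        futures = [pool.submit(worker, prompt) for worker in workers]
        for future in as_completed(futures):
            try:
                answer = future.result()
            except Exception:
                continue  # one failed worker does not fail the route
            if is_valid(answer):
                evidence.append(answer)
            if len(evidence) >= min_success:
                break  # quorum reached
    if len(evidence) < min_success:
        raise RuntimeError("quorum not reached")
    try:
        final = synthesize(prompt, evidence)
        if is_valid(final):
            return final
    except Exception:
        pass  # fall through to the evidence fallback
    return Counter(evidence).most_common(1)[0][0]


# Stubs: two workers agree, one fails, and synthesis is down.
def worker_ok(prompt: str) -> str:
    return "ANSWER: B"


def worker_down(prompt: str) -> str:
    raise ConnectionError("backend unavailable")


def synthesis_down(prompt: str, evidence: List[str]) -> str:
    raise TimeoutError("synthesis timed out")


def follows_contract(text: str) -> bool:
    return text.startswith("ANSWER:")


print(remom_route("q", [worker_ok, worker_down, worker_ok], synthesis_down, follows_contract, 2))
```

Even with one worker and the synthesizer both failing, the caller still receives a contract-valid answer rather than an API error.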

Figure 5: ReMoM treats breadth, quorum, synthesis, and fallback as serving-time controls.

Fusion: disagreement as signal

Fusion starts from a different bet. Sometimes the useful object is not the average answer; it is the structure of disagreement. Independent panel answers become evidence. The judge sees agreement, contradiction, and unique insight, then the finalizer returns one answer with the trace collapsed behind the API.

That makes Fusion especially useful when there are plausible competing paths: hard multiple-choice reasoning, long-form expert judgment, or exact-answer tasks where a single confident response can be brittle.
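
One way to picture the panel-judge-final flow is to hand the judge a small disagreement report alongside the raw panel answers. The sketch below is only an illustration: the report fields, the function names, and the stub judge and finalizer are all hypothetical, and a real judge would be a model call rather than a majority pick.

```python
from collections import Counter
from typing import Callable, Dict


def disagreement_report(panel: Dict[str, str]) -> dict:
    """Summarize the structure of disagreement across panel answers."""
    if not panel:
        raise ValueError("panel is empty")
    majority, count = Counter(panel.values()).most_common(1)[0]
    return {
        "agreement": count / len(panel),  # share of the panel behind the top answer
        "majority": majority,
        "dissent": {name: ans for name, ans in panel.items() if ans != majority},
    }


def fusion_route(
    prompt: str,
    panelists: Dict[str, Callable[[str], str]],
    judge: Callable[[str, Dict[str, str], dict], str],
    finalizer: Callable[[str, str], str],
) -> str:
    """Independent panel answers become evidence for a judge, then a finalizer."""
    panel = {name: call(prompt) for name, call in panelists.items()}
    report = disagreement_report(panel)
    verdict = judge(prompt, panel, report)
    return finalizer(prompt, verdict)


# Stubs standing in for model calls.
panelists = {
    "model-a": lambda p: "C",
    "model-b": lambda p: "C",
    "model-c": lambda p: "D",
}


def stub_judge(prompt: str, panel: Dict[str, str], report: dict) -> str:
    # A real judge would weigh the dissenting reasoning; this stub takes the majority.
    return report["majority"]


def stub_finalizer(prompt: str, verdict: str) -> str:
    return f"ANSWER: {verdict}"


print(disagreement_report({name: call("q") for name, call in panelists.items()}))
print(fusion_route("q", panelists, stub_judge, stub_finalizer))
```

The dissenting answer is preserved in the report rather than averaged away, so the judge can see it. The caller still only receives the finalizer's single response.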

Figure 6: Fusion does not hide disagreement. It turns disagreement into evidence.

Workflows: roles under a budget

Workflows is the most agentic pattern, and also the one that needs the strictest boundaries. The planner can only choose allowed worker models. The plan is validated. Steps are bounded by max steps, max parallelism, timeouts, and error policy. The final response still has to satisfy the output contract.
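
Plan validation is the part that keeps a dynamic planner inside its boundaries. A minimal sketch, assuming a plan is a list of steps tagged with a stage number (steps in the same stage run in parallel); the `Step` and `Limits` types, the field names, and the model names are hypothetical, and timeouts and error policy are left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Step:
    role: str
    model: str
    stage: int  # steps sharing a stage are assumed to run in parallel


@dataclass(frozen=True)
class Limits:
    allowed_models: FrozenSet[str]
    max_steps: int
    max_parallel: int


def validate_plan(plan: Sequence[Step], limits: Limits) -> List[str]:
    """Return a list of violations; an empty list means the plan may run."""
    errors: List[str] = []
    if not plan:
        errors.append("plan is empty")
    if len(plan) > limits.max_steps:
        errors.append(f"plan has {len(plan)} steps; limit is {limits.max_steps}")
    for step in plan:
        if step.model not in limits.allowed_models:
            errors.append(f"{step.role}: model {step.model!r} is not allowed")
    for stage, width in sorted(Counter(s.stage for s in plan).items()):
        if width > limits.max_parallel:
            errors.append(
                f"stage {stage} runs {width} steps in parallel; limit is {limits.max_parallel}"
            )
    return errors


limits = Limits(allowed_models=frozenset({"model-small", "model-large"}), max_steps=6, max_parallel=2)
plan = [
    Step("planner", "model-large", stage=0),
    Step("patcher", "model-small", stage=1),
    Step("verifier", "model-small", stage=1),
    Step("finalizer", "model-large", stage=2),
]
print(validate_plan(plan, limits))
print(validate_plan(plan + [Step("patcher", "rogue-model", stage=3)], limits))
```

Rejecting a plan before any worker runs is cheaper than discovering mid-execution that the planner asked for a model or a fan-out width the operator never approved.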

For SWE-style tasks, that means the router can express a planner, patcher, verifier, and finalizer without letting the application own a bespoke agent stack. For production serving, that distinction is critical: the loop is powerful, but it is still governed by infrastructure.

Figure 7: Workflows gives the router a bounded role system, not an unbounded autonomous agent.

Auto recipes: one model name, many loops

The public surface remains one model name: vllm-sr/auto. Internally, the router can use signals and projections to choose the right loop for the request. Difficulty, risk, contract pressure, latency, and cost are not comments in a prompt. They are routing facts that can select Confidence, Ratings, ReMoM, Fusion, Workflows, or a fallback path.
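
To make "routing facts" concrete, here is a toy decision function. The signal names, thresholds, and rule order are invented for illustration and do not describe the router's actual decision engine; Ratings is omitted only to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    difficulty: float         # 0..1, hypothetical projected difficulty band
    disagreement_risk: float  # 0..1, hypothetical
    contract_pressure: bool   # a strict answer format must be preserved
    multi_step: bool          # task needs planned, role-based work
    latency_budget_ms: int


def select_recipe(s: Signals) -> str:
    """Map signals to a loop name; thresholds are illustrative, not tuned values."""
    if s.latency_budget_ms < 2000:
        return "single-model"  # no time budget for collaboration
    if s.multi_step:
        return "workflows"
    if s.disagreement_risk >= 0.6:
        return "fusion"
    if s.difficulty >= 0.7 and s.contract_pressure:
        return "remom"
    if s.difficulty >= 0.4:
        return "confidence"
    return "single-model"


hard_mcq = Signals(difficulty=0.9, disagreement_risk=0.2, contract_pressure=True,
                   multi_step=False, latency_budget_ms=60_000)
print(select_recipe(hard_mcq))
```

The caller never sees this function. The same `vllm-sr/auto` request can land on a different loop as its signals change.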

Figure 8: Auto recipes let signals choose the collaboration pattern while preserving one model identity.

This is the difference between "agent as app logic" and "micro-agent as serving runtime." The router controls the budget, policy, topology, trace, and failure mode.

Recipes Beat One Universal Loop

The most important lesson from our eval work is not that one algorithm always wins.

It is the opposite:

The best loop is task-shaped.

GPQA-Diamond wants strict multiple-choice answer preservation. LiveCodeBench wants runnable code and hidden-test robustness. Humanity's Last Exam wants disagreement resolution and exact-answer formatting. SWE-style tasks need a planner, patcher, verifier, and finalizer.

That is why vllm-sr/auto should not mean "always run the biggest loop." It should mean: select the recipe that fits this task.

Figure 9: Signals and projections let the router choose a benchmark-shaped collaboration pattern.

In our recipes, that shape is explicit:

  • GPQA-Diamond routes hard science multiple-choice prompts into a ReMoM recipe with strict ANSWER: X preservation.
  • LiveCodeBench looks for constraints, starter code, standard input, float tolerance, timeout risk, and hidden-test risk before selecting a code-shaped loop.
  • HLE detects formal reasoning, disagreement risk, long context, and exact answer pressure before choosing between deeper ReMoM, smaller Fusion, or a fallback path.
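
The strict `ANSWER: X` preservation mentioned for GPQA-Diamond can be pictured as a small contract check applied to worker and synthesis outputs. This is a sketch under assumptions: the four-option `A` to `D` range, the "last answer line wins" rule, and the helper names are illustrative choices, not the recipe's actual logic.

```python
import re
from typing import Optional

# One answer line per match; the A-D range assumes four-option multiple choice.
ANSWER_RE = re.compile(r"^ANSWER:[ \t]*([A-D])[ \t]*$", re.MULTILINE)


def extract_choice(text: str) -> Optional[str]:
    """Return the letter from the last well-formed ANSWER line, if any."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None


def repair_contract(text: str) -> Optional[str]:
    """Collapse a verbose response to the bare contract line, or None if
    no valid answer line can be recovered."""
    choice = extract_choice(text)
    return None if choice is None else f"ANSWER: {choice}"


print(repair_contract("The second law rules out A and B.\nANSWER: C"))
print(repair_contract("I think it is probably C."))
```

A response that fails the check is a candidate for repair or fallback rather than something to pass through, which is what keeps the final answer gradable after several models have touched it.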

This is why router-side collaboration is more than prompt engineering. The prompt is only one part. The recipe also defines model pool, model roles, reasoning effort, concurrency, quorum, timeout, synthesis model, fallback policy, output contract, and observability labels.
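
Read as data, a recipe might look like the structure below. The field names mirror the list in the paragraph above, but the schema, the validation rules, and every value in the example are hypothetical; this is not the router's configuration format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Recipe:
    name: str
    algorithm: str                 # e.g. "remom", "fusion", "workflows"
    model_pool: List[str]
    roles: Dict[str, str]          # role -> model drawn from the pool
    reasoning_effort: str
    max_concurrent: int
    min_success: int               # quorum
    timeout_s: float
    synthesis_model: str
    fallback: str
    output_contract: str
    labels: Dict[str, str] = field(default_factory=dict)  # observability labels

    def problems(self) -> List[str]:
        """Cheap consistency checks an operator could run before deploying."""
        issues: List[str] = []
        if self.max_concurrent < 1:
            issues.append("max_concurrent must be at least 1")
        if not 1 <= self.min_success <= len(self.model_pool):
            issues.append("min_success must be between 1 and the pool size")
        if self.timeout_s <= 0:
            issues.append("timeout_s must be positive")
        for role, model in self.roles.items():
            if model not in self.model_pool:
                issues.append(f"role {role!r} uses {model!r}, which is not in the pool")
        return issues


# Placeholder model names and values, for illustration only.
mcq_recipe = Recipe(
    name="hard-science-mcq",
    algorithm="remom",
    model_pool=["model-a", "model-b", "model-c"],
    roles={"worker": "model-a", "verifier": "model-b"},
    reasoning_effort="high",
    max_concurrent=3,
    min_success=2,
    timeout_s=120.0,
    synthesis_model="model-c",
    fallback="best-valid-evidence",
    output_contract="ANSWER: X",
    labels={"benchmark_shape": "mcq"},
)
print(mcq_recipe.problems())
```

The prompt does not appear anywhere in this structure, which is the point: most of what defines a recipe is topology, budget, and policy.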

The Scorecard Is a Proof, Not the Whole Story

We evaluated the current closed-model recipe across three hard benchmarks. The numbers are useful because they show that the idea is not only aesthetic.

Figure 10: VSR Closed and VSR Hybrid scorecard view across LiveCodeBench, GPQA-Diamond, and Humanity's Last Exam.

In this scorecard, VSR Closed means the recipe uses only closed-model backends. VSR Hybrid means the recipe mixes open and closed models, using the stronger closed models where the recipe needs higher-risk judging, repair, synthesis, or fallback.

| Benchmark | VSR scorecard row | Score | Reference rows |
| --- | --- | --- | --- |
| LiveCodeBench, January-April 2025 | VSR Closed | 92.6 | Fugu Ultra 92.0, Fugu 90.3, GPT-5.5 90.7, Opus 4.8 90.3 |
| GPQA-Diamond | VSR Closed | 96.0 | Fugu Ultra 95.5, Fugu 95.5, Gemini 3.1 Pro 94.3, GPT-5.5 93.6 |
| Humanity's Last Exam | VSR Closed | 50.0 | Fugu Ultra 50.0, Fugu 48.5, Gemini 3.1 Pro 45.0 |
| Humanity's Last Exam | VSR Hybrid | 47.1 | GLM-5.2 40.5, Qwen3.7 Max 41.4, GPT-5.5 41.4 |

The scorecard should be read carefully. It is not a claim that every request should always use every closed model. That would be the wrong product.

The claim is that router-owned collaboration can create a stronger model identity than the individual calls beneath it. It can beat or match frontier single-model baselines while preserving one API surface.

That is the real product shape:

  • Users see one model name.
  • Operators control the recipe.
  • The system can improve without changing the client integration.
  • Open and closed models can participate under the same serving abstraction.

What This Means for Model Serving

The old serving stack was passive. It accepted a model name and sent the request to a backend.

The next serving stack is active. It asks:

  • What evidence do we have about this request?
  • What quality, cost, latency, and safety band does it fall into?
  • Is one model enough?
  • If not, what collaboration pattern should run?
  • Which answer contract must be preserved?
  • What should happen if one provider is slow or wrong?
  • How do we expose one clean response while keeping the full trace?

That is not application glue. That is infrastructure.

Micro-agents belong in the router because the router already owns the things micro-agents need: model aliases, provider policy, credentials, cost metadata, signals, decisions, retries, timeouts, traces, and OpenAI-compatible response semantics.

The Takeaway

The phrase "frontier model" is starting to mean two things.

One is a checkpoint.

The other is a system boundary.

The recent orchestration wave made the direction visible. vLLM Semantic Router is the bet that this capability should be programmable, observable, and open at the serving layer.

The next model race will still involve better models. But it will also involve better routers: routers that know when to save money, when to enforce safety, when to stay on the edge, when to go to the cloud, and when to turn one request into a small, disciplined team.

That is the promise of micro-agents inside the Model API.

Acknowledgements

We thank researchers from MBZUAI, McGill University, Mila, and Agentic Intelligence Lab, especially Prof. Xue Liu and Dr. Bowei He, for research collaboration and discussions around router-side model collaboration.

Individual Contributors: Huamin Chen, Yincheng Ren.

We also thank AMD's Andy Luo and Haichen Zhang for AMD GPU evaluation support.
