Computer Use is 45x more expensive than structured APIs

原始链接: https://reflex.dev/blog/computer-use-is-45x-more-expensive-than-structured-apis/

## Vision vs. API Agents: A Cost Comparison

A benchmark compared the cost of having an AI agent operate an internal admin panel via two approaches: "vision" (interpreting screenshots the way a user would) and "API" (calling application functions directly). The goal was to put a price on the commonly assumed cost of vision agents, given that building dedicated APIs for internal tools is expensive.

The test involved a typical task: finding a customer, managing orders, and accepting reviews. The API agent completed it in 8 calls, consistently using about 12k tokens. Even with a detailed step-by-step prompt, the vision agent needed far more: 53 steps, 550k+ tokens, and highly variable runtime (17 ± 5 minutes).

The key difference? The vision agent *reads pixels*, requiring many interactions to navigate and extract data, while the API agent receives structured data directly. The study stresses that better vision models may improve accuracy, but they cannot reduce the fundamental cost of "observing" every screen.

Crucially, tools like Reflex can now auto-generate APIs from existing applications, dramatically lowering the cost of API development. This changes the economic equation, making API agents the more efficient choice, especially for internal tools where you control the codebase. Vision agents remain valuable for applications *without* APIs, but for tools you build yourself, the cost-benefit analysis now favors the direct API approach.

## Computer Use vs. APIs: A Cost Comparison

A Hacker News post highlighted the striking cost gap between AI agents operating visual computer interfaces ("computer use") and agents interacting with structured APIs. The author's research showed that "computer use" costs roughly 45x as much as using APIs.

The discussion centered on the inefficiency of AI navigating graphical user interfaces (GUIs), which requires pixel analysis and simulated clicks, compared with direct API access. Users shared experiences confirming this, noting that CLI tools and direct API calls are faster and more token-efficient. Some suggested exposing all application functionality through APIs, which could drive operating-system innovation and even prompt companies like OpenAI to consider building their own phones.

The reality, however, is that not everything *has* an API, while everything has a UI. The conversation also touched on the challenges AI faces in understanding complex GUIs, and the potential for improvement through window-management integration and better model training. The core takeaway: programmatic interfaces are far more efficient for AI, but bridging the gap for applications that lack APIs remains a key challenge.

## Original article

We ran a benchmark comparing two ways of letting an AI agent operate the same admin panel, with the goal of putting a price tag on vision agents (browser-use, computer-use).

Here is what we measured, what we had to change to make the vision agent work at all, and what changes when generating an API surface stops being a separate engineering project.

Vision agents are the default for letting AI agents operate web apps that don't expose APIs. The alternative, writing an MCP or REST surface per app, is its own engineering project across the 20+ internal tools most teams have. Most teams default to vision agents not because they are better, but because the alternative is too expensive to build. The cost of the vision approach is treated as a fixed price.

We wanted to measure the price.

The test app is an admin panel for managing customers, orders, and reviews, modeled on the react-admin Posters Galore demo. Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app's HTTP endpoints directly. Same Claude Sonnet, same pinned dataset, same task. The interface is the only variable.

The task: find the customer named "Smith" with the most orders, locate their most recent pending order, accept all of their pending reviews, and mark the order as delivered. This touches three resources, requires filtering, pagination, cross-entity lookups, and both reads and writes. It is the shape of work a typical internal tool sees daily.

Path A: Vision agent. Claude Sonnet driving the UI via browser-use 0.12. Vision mode, taking screenshots and executing clicks.

Path B: API agent. Claude Sonnet with tool-use, calling the handlers the UI calls. Each tool maps to one or more event handlers on the app's State, the same functions a button click would trigger. The agent gets the structured response back instead of a rendered page.
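The handler-as-tool mapping can be sketched roughly as follows. The handler body, data, and tool names here are illustrative stand-ins, not the benchmark's actual code:

```python
# Illustrative sketch: each tool call routes to the same function a UI
# button click would trigger, and the agent receives the structured
# return value instead of a rendered page.

def list_customers(name_filter=None):
    """Stand-in for one of the app's event handlers."""
    customers = [
        {"id": 1, "name": "Anna Smith", "orders": 7},
        {"id": 2, "name": "Bob Smith", "orders": 12},
    ]
    if name_filter:
        customers = [c for c in customers if name_filter in c["name"]]
    return customers

# Tool registry: the agent's tool-use request is dispatched by name.
TOOLS = {"list_customers": list_customers}

def dispatch(tool_name, **kwargs):
    """Route an agent tool call to the matching event handler."""
    return TOOLS[tool_name](**kwargs)

smiths = dispatch("list_customers", name_filter="Smith")
```

The point of the pattern is that `dispatch` returns plain data structures; nothing is rendered, so nothing has to be screenshotted.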

We started by giving both agents the same six-sentence task above and seeing what happened.

The API agent completed it in 8 calls. It listed the customer's reviews filtered by pending status, accepted each one, and marked the order as delivered. Both agents are calling into the same application logic; the API agent just reads the structured response directly instead of looking at a rendered page.
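The post does not log the eight calls, but a sequence consistent with the described task (one customer lookup, one order lookup, one review listing, four review accepts, one order update) might look like this. Tool names and IDs are hypothetical; the actual trace is in the repo:

```python
# Hypothetical 8-call trace: 3 reads, 4 review accepts, 1 order update.
calls = [
    ("list_customers", {"name": "Smith"}),
    ("list_orders", {"customer_id": 2, "status": "pending"}),
    ("list_reviews", {"customer_id": 2, "status": "pending"}),
    ("accept_review", {"review_id": 11}),
    ("accept_review", {"review_id": 12}),
    ("accept_review", {"review_id": 13}),
    ("accept_review", {"review_id": 14}),
    ("update_order", {"order_id": 7, "status": "delivered"}),
]
```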

The vision agent, on the same prompt, found one of four pending reviews, accepted it, and moved on. It never paginated. The remaining three reviews were below the visible fold of the reviews page and the agent had no signal to scroll for them.

This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything. The API agent calls the same handler the UI calls, but the response includes the full result set the handler returned, not just the rows currently rendered. The agent reads "page 1 of 4 with 50 results per page" directly instead of having to interpret pagination controls from pixels.
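As a hypothetical illustration of that difference, a structured response can carry the pagination facts as explicit fields (the field names here are assumptions, not the app's actual schema):

```python
# In a structured response, "page 1 of 4" is a field, not pixels to interpret.
response = {
    "results": [{"id": 101, "status": "pending"}],  # rows on this page
    "page": 1,
    "per_page": 50,
    "total": 200,
}

# Ceiling division: the page count is implied directly by the metadata,
# so the agent knows there is more data without scrolling anything.
total_pages = -(-response["total"] // response["per_page"])
```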

To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.

With the walkthrough, the vision agent completed the task. It also ran for fourteen minutes and consumed about half a million input tokens.

The walkthrough is itself a finding. Each numbered instruction is engineering work that doesn't show up in token counts but represents real cost. Anyone deploying a vision agent against an internal tool is either writing prompts at this level of specificity or accepting that the agent will silently miss work.

We ran the API path five times and the vision path three times. The vision path was capped at three trials because each run takes 14-22 minutes and consumes 400-750k tokens.

Variance was the most surprising part of the vision results. Across three trials the wall-clock time spanned 749s to 1257s, and input tokens spanned 407k to 751k. The agent took 43 cycles in the shortest run and 68 in the longest. The screenshot-reason-click loop has enough non-determinism that a single run is not a representative cost estimate.

The API path had no such variance. Sonnet hit identical 8 tool calls on every trial, with input token counts varying by ±27 across all five runs. The agent calls the same handlers in the same order because the structured responses give it no reason to deviate.

| Metric | Vision agent (Sonnet) | API (Sonnet) | API (Haiku) |
|---|---|---|---|
| Steps / calls | 53 ± 13 | 8 ± 0 | 8 ± 0 |
| Wall-clock time | 1003s ± 254s (~17 min) | 19.7s ± 2.8s | 7.7s ± 0.5s |
| Input tokens | 550,976 ± 178,849 | 12,151 ± 27 | 9,478 ± 809 |
| Output tokens | 37,962 ± 10,850 | 934 ± 41 | 819 ± 52 |

Numbers are mean ± sample standard deviation (n−1), with n=5 per API path and n=3 for the vision path. Full run details are available in the repo.
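The headline 45x figure follows directly from the mean input-token counts in the table:

```python
vision_input = 550_976  # mean input tokens, vision agent (Sonnet)
api_input = 12_151      # mean input tokens, API agent (Sonnet)

ratio = vision_input / api_input  # ~45.3x more input tokens
```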

Haiku could not complete the vision path. The failure was specific to browser-use 0.12's structured-output schema, which Haiku could not reliably produce in either vision or text-only mode. On the API path, Haiku finished in under 8 seconds for under 10k input tokens, which is the cheapest configuration we tested.

The cost difference follows directly from the architecture. An agent that must see in order to act will always pay for the seeing, regardless of how good the model gets. Better vision models reduce error rates per screenshot, but they do not reduce the number of screenshots required to reach the relevant data. Every render is a screenshot, and every screenshot is thousands of input tokens.
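A back-of-envelope check on the benchmark means shows where those tokens go:

```python
total_input = 550_976  # mean input tokens, vision path
cycles = 53            # mean screenshot-reason-click cycles

per_cycle = total_input / cycles  # roughly 10k input tokens per cycle
```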

Both agents in this benchmark walk through the same application logic. They both filter, paginate, and update the same way the UI does. The difference is what they read at each step. The vision agent reads pixels and has to render every intermediate state to interpret it. The API agent reads the structured response from the same handlers, which already contains the data the UI was going to display.

Better models will narrow the cost per step. They will not narrow the step count, because the step count is set by the interface.

The benchmark was made cheap to run by Reflex 0.9, which includes a plugin that auto-generates HTTP endpoints from a Reflex application's event handlers. None of the structural argument depends on Reflex specifically, but it is what made running the API path possible without writing a second codebase.

The interesting question is what becomes possible when the engineering cost of an API surface drops to zero. Vision agents remain the right tool for applications you do not control: third-party SaaS products, legacy systems, anything you cannot modify. For internal tools you build yourself, the math now points the other way.

Vision results are specific to browser-use 0.12 in vision mode, and other vision agents may behave differently. The Path B runner shapes the auto-generated endpoints into a small REST tool surface of about thirty lines, which the agent sees as list_customers, update_order, and similar. The dataset is pinned and small (900 customers, 600 orders, 324 reviews), so behavior on production-scale data is not measured here. The vision agent runs through LangChain's ChatAnthropic, and the API agent runs through the Anthropic SDK directly. Reported token counts are uncached input tokens.
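A thirty-line tool surface of that shape can be imagined as thin HTTP wrappers along these lines. `BASE_URL`, the endpoint paths, and the parameters are assumptions for illustration; the real runner and the auto-generated routes are in the repo:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/api"  # assumed; not Reflex's actual route

def _call(method, path, payload=None):
    """Forward one tool call to an auto-generated HTTP endpoint."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = request.Request(BASE_URL + path, data=data, method=method,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

def list_customers(name=None):
    return _call("GET", "/customers" + (f"?name={name}" if name else ""))

def update_order(order_id, status):
    return _call("PATCH", f"/orders/{order_id}", {"status": status})

# The agent's entire view of the app is this small named tool surface.
TOOL_SURFACE = {
    "list_customers": list_customers,
    "update_order": update_order,
}
```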

The repo includes seed data generation, the patched react-admin demo, both agent scripts, and raw results.
