Browser Agent Benchmark: Comparing LLM models for web automation

Original link: https://browser-use.com/posts/ai-browser-agent-benchmark

Browser Use has released an open source benchmark designed to address the challenge of evaluating AI agents on complex web tasks. Existing benchmarks struggle to balance realism with verifiable outcomes: synthetic websites lack real-world complexity, while tasks that mimic real user behavior are hard to evaluate at scale. The new benchmark combines 100 tasks from WebBench, Mind2Web, GAIA, and BrowseComp with 20 custom challenges focused on difficult browser interactions. To ensure the tasks are hard enough, they were run repeatedly against a range of LLMs, tasks that were too easy or impossible were removed, and the remaining tasks were verified by hand. Notably, the benchmark uses an LLM (currently Gemini 2.5 Flash) as the judge; with careful prompt design it reaches 87% agreement with human evaluation, favoring simple true/false verdicts over complex rubrics. Initial results show that models perform well overall, with Browser Use's ChatBrowserUse 2 API currently in the lead. The benchmark is available on GitHub and is intended to give developers a standardized, repeatable way to test and improve LLM performance on real agentic browsing, though running the full suite requires substantial resources.

Hacker News thread (4 points, submitted by MagMueller, 1 comment): pixel_popping notes that the best model (Opus 4.5) is missing from the benchmark.

Original Article

At Browser Use I spend a lot of time deciding which model to use. It's not easy to choose between LLMs or agent parameters, or to compare two different versions of Browser Use and tell which one is better.

To truly understand our agent's performance, we built a suite of internal tools for evaluating it in a standardized and repeatable way, so we can compare versions and models and continuously improve. We take evaluations seriously: as of now, we have run over 600,000 tasks in testing.

Today we are releasing our first open source benchmark.

The Tasks

Existing browser benchmark task sets all have strengths and weaknesses. All tasks fall somewhere in the tradeoff between interpretability and realism.

On the interpretable end are tasks with synthetic websites that can deterministically confirm if the agent succeeds. But synthetic sites don't capture the bizarre reality and diversity of how real websites work, so we avoid them.

A good middle ground is web tasks that involve researching verifiable information, often involving multiple steps (like BrowseComp and GAIA) and comparing the answer to ground truth.

The end of the spectrum that most represents real user tasks involves finding real-time information or following complex workflows on various pages (Mind2Web 2, WebBench). The challenge here is judging them accurately at scale.

Tasks left out of evaluations are those that make real changes to websites (like creating a post) or require authentication. There has yet to be an economical solution for running these at scale.

Another challenge is difficulty. Many tasks have become trivial to modern browser agents, while others simply are not completable. For our benchmark, we selected 100 of the best tasks from existing open source benchmarks. We chose WebBench, Mind2Web, GAIA, and BrowseComp for a mix of verifiable and real-time tasks. We also added 20 tasks on a custom site to test the hardest browser interactions, such as iframe inception, clicking and dragging, etc.

| Source | Tasks | Description |
|---|---|---|
| Custom | 20 | Page interaction challenges |
| WebBench | 20 | Web browsing tasks |
| Mind2Web 2 | 20 | Multi-step web navigation |
| GAIA | 20 | General AI assistant tasks (web-based) |
| BrowseComp | 20 | Browser comprehension tasks |

We approached the difficulty problem with the following method: we ran all tasks many times with different LLMs, agent settings, and agent frameworks. Each was evaluated by our LLM judge for success, with flags for tasks judged impossible or where the agent was very close.

We removed tasks that were completed most of the time, for being too easy, and tasks that were majority-voted impossible and never completed, for being unreachable. From the remaining pool, the most challenging and interesting tasks were hand-selected and independently verified to be possible. The resulting set contains only very hard but possible tasks.
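For illustration, a minimal sketch of this filtering step could look like the following. The thresholds, field names, and data shapes here are assumptions made for the example, not the actual internal tooling.

```python
# Sketch: run every task many times, then drop tasks that are too easy
# (solved most of the time) or unreachable (majority-voted impossible and
# never solved). Thresholds and field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TaskRuns:
    task_id: str
    successes: int         # runs the LLM judge marked as successful
    total_runs: int
    impossible_votes: int  # runs flagged "impossible" by the judge

def filter_tasks(tasks: list[TaskRuns], easy_threshold: float = 0.5) -> list[str]:
    keep = []
    for t in tasks:
        success_rate = t.successes / t.total_runs
        too_easy = success_rate > easy_threshold
        unreachable = t.successes == 0 and t.impossible_votes > t.total_runs / 2
        if not too_easy and not unreachable:
            keep.append(t.task_id)
    # The remaining candidates are then hand-selected and verified.
    return keep
```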

The Judge

Judging task traces is a critical part of any benchmark. When tasks involve real websites and real information, there is no deterministic way to check if the agent succeeded.

At the scale and speed needed to base product direction on evaluations, we must use an LLM as the judge. To ensure consistency across models, the same LLM, prompt, and inputs must be used.

We have iterated across many judge frameworks over the last year on our internal evaluation platform. The way to evaluate a judge is to run it on task traces that were judged personally and meticulously by our team and compare the results. This tells us how aligned the judge is with our own judgements. We hand labeled 200 task traces and used accuracy on this set as our core metric.
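In essence, judge alignment on such a hand-labeled set is just agreement accuracy between the judge's verdict and the human label. The sketch below is illustrative; the data structures are assumptions, not the internal evaluation platform.

```python
# Minimal sketch of measuring judge alignment against hand-labeled traces.
from dataclasses import dataclass

@dataclass
class LabeledTrace:
    trace_id: str
    human_verdict: bool  # success / failure as labeled by a person
    judge_verdict: bool  # success / failure as returned by the LLM judge

def judge_alignment(traces: list[LabeledTrace]) -> float:
    """Fraction of traces where the LLM judge agrees with the human label."""
    if not traces:
        return 0.0
    agreed = sum(t.human_verdict == t.judge_verdict for t in traces)
    return agreed / len(traces)

# e.g. 174 agreements out of 200 hand-labeled traces -> 0.87 alignment
```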

Initial results settled on GPT-4o as the most human-aligned judge, in line with the original Mind2Web paper. However, when gemini-2.5-flash was released, we found it had better alignment, and it became our new judge.

For prompting, we found that simple trumps complex and that context is king. Many benchmarks use a rubric system, but we found better accuracy when demanding a simple true-or-false verdict. With rubrics, LLMs tend to highlight a few positives and negatives and give a middling score even in cases of complete success or utter failure.
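A minimal sketch of what such a binary-verdict judge could look like is below. The prompt wording and the `call_llm` helper are assumptions for illustration, not the actual judge prompt or wiring used in the benchmark.

```python
# Sketch of a binary-verdict judge: one true/false decision, no rubric.
JUDGE_PROMPT = """You are judging whether a browser agent completed its task.

Task: {task}
Final answer from the agent: {answer}
Agent trace (actions and observed page content): {trace}

Reply with exactly one word: TRUE if the task was completed successfully,
FALSE otherwise. Do not use a score or a rubric."""

def judge(task: str, answer: str, trace: str, call_llm) -> bool:
    # call_llm is a hypothetical helper: prompt string in, reply string out.
    reply = call_llm(JUDGE_PROMPT.format(task=task, answer=answer, trace=trace))
    return reply.strip().upper().startswith("TRUE")
```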

Our final judge achieved 87% alignment with our human judgements, only differing on partial successes or technicalities.

The Results

Here is a comparison of performance and throughput on this benchmark for the most used models on Browser Use Cloud. We find it concerning that many AI agent benchmarks do not include error bars or variance estimations. We have run each evaluation multiple times and shown standard error bars.
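Concretely, the error bars amount to the mean success rate over repeated runs together with the standard error of the mean. The sketch below assumes one overall success rate per run and is not the actual plotting code.

```python
# Sketch: mean success rate over repeated benchmark runs, with standard error.
import statistics

def mean_and_standard_error(success_rates: list[float]) -> tuple[float, float]:
    """success_rates: one overall success rate per repeated benchmark run."""
    mean = statistics.mean(success_rates)
    if len(success_rates) < 2:
        return mean, 0.0
    sem = statistics.stdev(success_rates) / len(success_rates) ** 0.5
    return mean, sem

# e.g. three runs of the same model:
print(mean_and_standard_error([0.58, 0.62, 0.60]))  # roughly (0.60, 0.012)
```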

[Figures: LLM performance; LLM performance vs. throughput]

The strongest model today is our new ChatBrowserUse 2 API, which is specially optimized for use in our framework.

However, all models on this plot are very strong, and even the lowest scoring model (gemini-2.5-flash at 35%) is respectable on these hard tasks. The fact that recent models have surpassed 60% on this benchmark is impressive. We may need to collect even harder tasks for a new benchmark soon.

Using the Benchmark

This benchmark is open source at github.com/browser-use/benchmark. We want it to be easy to use and modify. Our results for ChatBrowserUse 2 can be replicated by running run_eval.py.

However, these evaluations are not suitable for an everyday user. A single run through these 100 complex tasks on the basic Browser Use plan with concurrency limited to 3 will take roughly three hours and cost $10. Using more expensive models like claude-sonnet-4-5 will take roughly twice as long and incur costs of nearly $100 in API calls.
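As a sketch of what concurrency-limited execution looks like under those constraints (at most three tasks in flight), the snippet below uses an asyncio semaphore. `run_one_task` is a hypothetical stand-in for the per-task logic; this is not the actual run_eval.py.

```python
# Sketch of running benchmark tasks with concurrency capped at 3.
import asyncio

async def run_one_task(task: str) -> bool:
    # Placeholder: a real runner would drive a Browser Use agent through the
    # task and return the LLM judge's verdict.
    await asyncio.sleep(0.1)
    return True

async def run_benchmark(tasks: list[str], concurrency: int = 3) -> list[bool]:
    semaphore = asyncio.Semaphore(concurrency)

    async def run_with_limit(task: str) -> bool:
        async with semaphore:  # never more than `concurrency` tasks in flight
            return await run_one_task(task)

    return list(await asyncio.gather(*(run_with_limit(t) for t in tasks)))

# results = asyncio.run(run_benchmark([f"task-{i}" for i in range(100)]))
```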

We hope this benchmark can enable LLM providers to test new models on complex real world agentic browsing tasks and use the results to improve their models. If you would like to inquire about running these evaluations at a larger scale, please contact [email protected]
