PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

Original link: https://vibrantlabs.com/blog/pa-bench

## PA Bench: A New Benchmark for Computer-Use Agents

Current web-agent benchmarks typically focus on simple, single-application tasks and fail to reflect how humans actually use personal assistants. To address this, the researchers introduce **PA Bench**, a benchmark that evaluates agents on realistic, multi-step workflows across web applications such as email and calendar.

PA Bench uses high-fidelity simulated environments to ensure reproducible and verifiable results. Tasks are generated from reusable scenario templates (e.g., travel planning, meeting rescheduling) built on coherent "base world" user data, guaranteeing consistency across applications. A standardized SDK manages the simulations, model adapters, and experiment orchestration.

An evaluation of Claude Opus 4.6, Gemini 3 Pro/Flash, and OpenAI Computer Use reveals significant performance differences. **Claude Opus 4.6** achieved the highest success rate (68.8%) through recovery-driven behavior and post-action verification. **Gemini 3 Pro** showed strong planning but unreliable execution, while **Gemini 3 Flash** struggled with complex reasoning. **OpenAI Computer Use** faced control and exploration issues.

Future work aims to extend PA Bench with more complex, longer-horizon workflows spanning many applications and steps, along with automated task generation. This research is a key step toward building truly robust computer-use agents.

Vibrant Labs (W24) has released **PA Bench**, a new benchmark for evaluating how advanced AI models, particularly frontier models, perform on realistic, multi-step web tasks. Recognizing the shortcomings of existing benchmarks, PA Bench focuses on simulating complex workflows across applications such as Gmail and Calendar, mirroring real-world "personal assistant" scenarios.

The benchmark is designed to pinpoint where failures occur as task complexity grows (more tabs and longer step counts). Vibrant Labs currently tests with up to 3 tabs and is expanding the dataset by building simulations of common enterprise workflows.

The team is also developing tooling for automated evaluation and reinforcement-learning data generation for LLM agents, including automated task creation and "coherent world" simulations for training. They are seeking feedback on PA Bench and are open to discussing the work with interested parties.

Original Article

Introduction

Browser-based and computer-use agents are becoming increasingly popular for automating consumer workflows that involve interacting with web applications through clicks, typing, and navigation. Many of these workflows mirror how humans use personal assistant tools today—by coordinating information across multiple applications such as email, calendars, and booking platforms.

However, it remains unclear whether current frontier computer-use agents are capable of reliably completing such workflows. Most existing benchmarks for web or computer-use agents focus on isolated, single-application tasks. Typical examples include actions such as adding a product to an online cart or creating a single calendar event. While these benchmarks are useful for evaluating atomic interaction capabilities, they do not reflect how humans actually use personal assistant agents (or human personal assistants) in practice.

Real-world personal assistant tasks are inherently multi-step and multi-application. They require agents to understand context, switch between applications, reason over information distributed across different interfaces, and take coordinated actions to achieve a meaningful goal. Evaluating agents solely on isolated tasks fails to capture these requirements.

To address this gap, we introduce PA Bench, a benchmark designed to evaluate the ability of frontier computer-use agents to complete realistic, long-horizon personal assistant workflows involving multiple web applications. PA Bench focuses on tasks that require agents to interact, reason, and act across applications under deterministic and verifiable conditions, enabling reliable comparisons between models.

Experiment Setup

The above image shows an example task from PA Bench: the agent needs to open the user's email application, find the airline confirmation emails, read the pertinent information, and block the same slots in the calendar with the required details.

Simulations

We designed the benchmark such that each task requires the agent to interact with both email and calendar applications in order to complete it successfully. To support this, we built realistic, high-fidelity simulated replicas of email and calendar web applications within controlled simulation boundaries. We adopted a task-centric simulation design: the features we implement are determined by the tasks in the dataset.

Since all tasks involve write operations, running them in simulations rather than real applications enables more reproducible and verifiable evaluations. Because we fully control the simulation environment, the verifier can directly access the backend state at the end of each run, stored as a structured JSON file, and determine whether the agent completed the task correctly.
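Because the verifier reads the simulation's final backend state from a structured JSON file, a check can be a plain function over that state. The sketch below is illustrative only: the JSON layout (a top-level `"calendar"` key with an `"events"` list) and the function name are assumptions, not the benchmark's actual schema.

```python
import json
from pathlib import Path


def verify_calendar_block(state_path: str, expected_title: str,
                          expected_start: str, expected_end: str) -> bool:
    """Check the final backend state for a calendar event matching the task.

    The JSON layout here (top-level "calendar" key with an "events" list)
    is hypothetical; the real benchmark schema may differ.
    """
    state = json.loads(Path(state_path).read_text())
    for event in state.get("calendar", {}).get("events", []):
        if (event.get("title") == expected_title
                and event.get("start") == expected_start
                and event.get("end") == expected_end):
            return True
    return False
```

Because the check runs against the backend state rather than screenshots, the same verifier gives deterministic results regardless of which model produced the trajectory.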

The above recording shows our email simulation environment.
The above recording shows our calendar simulation environment.

Data, tasks, and verifiers

Designing long-horizon workflows that require agents to interact with multiple applications introduces a key challenge. The data across all applications must be coherent and consistent for a task to be solvable.

For example, consider a task where the agent must identify overlapping meetings in the calendar and notify participants that the user cannot attend one of them. For this task to be feasible, the calendar must contain conflicting events, and the email inbox must include the corresponding meeting notifications or invitation threads associated with those events. Ensuring this level of cross-application consistency is difficult to achieve through manual annotation alone and does not scale.

To handle this, we break the process into two main steps:

  1. Generating coherent base world states : We first generate a coherent base world that represents a user’s digital environment. Each world defines a user persona along with contacts, relationships, and a timeline of activities. Emails and calendar events are then derived from this shared context and used to populate all applications. Because every application is generated from the same source, information referenced in one interface naturally exists in the others.

  2. Creating task scenarios and generating task variants: Tasks are not written individually. Instead, we define reusable scenario templates such as meeting rescheduling, conflict resolution, participant coordination, and travel planning. A scenario augments the base world with additional data and creates a concrete situation the agent must resolve. For example, a travel scenario may introduce flight confirmation emails that require the agent to block the corresponding time in the calendar. From each scenario we automatically produce a natural language task and a programmatic verifier.
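The two-step process above can be sketched in code: a shared base world feeds every application, and a scenario generator augments it and emits a task/verifier pair. All names and field layouts below are illustrative assumptions, not the benchmark's real generators.

```python
from dataclasses import dataclass, field


@dataclass
class BaseWorld:
    """Shared source of truth from which every application is populated."""
    persona: str
    contacts: list
    emails: list = field(default_factory=list)
    calendar_events: list = field(default_factory=list)


def travel_scenario(world: BaseWorld, flight: dict):
    """Augment the base world with a flight confirmation email and emit a
    (natural-language task, programmatic verifier) pair.

    The flight dict keys ("number", "depart", "arrive") are hypothetical.
    """
    world.emails.append({
        "subject": f"Flight confirmation {flight['number']}",
        "body": f"Departs {flight['depart']}, arrives {flight['arrive']}.",
    })
    task = ("Find the airline confirmation email and block the flight "
            "time in the calendar.")

    def verifier(final_state: dict) -> bool:
        # Pass only if an event covering the flight window exists.
        return any(e.get("start") == flight["depart"]
                   and e.get("end") == flight["arrive"]
                   for e in final_state.get("calendar_events", []))

    return task, verifier
```

Because the email is injected into the same world object that populates the calendar, any entity the task references is guaranteed to exist in both applications.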

Diagrams (top to bottom): 1) The process we use to generate the base data for both the calendar and email worlds; 2) Each scenario generator (such as the travel confirmation scenario) takes the base world and several other configurations as inputs and generates task/verifier pair variants.

All generated tasks and verifiers are manually validated by completing them inside the simulations. We iterate on the generation process until tasks are solvable and verifiers consistently reflect true success or failure.

Benchmark SDK

The benchmark SDK provides the infrastructure required to run and evaluate agents consistently across models. It consists of three main components:

  • Simulation management: this handles the spawning of simulation instances, resetting them to a known state, retrieving the backend state, and shutting them down after execution.

  • Model adapters: this implements standardized tool interfaces for different computer use models so that all agents interact with the environment through the same action schema.

  • Experiment orchestration: this runs evaluations at scale, records execution traces, and logs results for later analysis.
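The model-adapter component can be pictured as a small interface that every provider-specific model is wrapped into, so the orchestrator only ever sees one action schema. This is a minimal sketch under assumed names; the real SDK's interfaces are not public.

```python
from abc import ABC, abstractmethod


class ModelAdapter(ABC):
    """Maps a provider-specific computer-use model onto one shared action
    schema (click, type, keypress, switch_tab, ...). Hypothetical interface."""

    @abstractmethod
    def next_action(self, screenshot: bytes, instruction: str) -> dict:
        """Return a canonical action dict, e.g. {"type": "click", "x": 10, "y": 20}."""


class ScriptedAdapter(ModelAdapter):
    """Trivial adapter that replays a fixed action list; useful for
    smoke-testing the harness without calling a model API."""

    def __init__(self, actions):
        self._actions = iter(actions)

    def next_action(self, screenshot: bytes, instruction: str) -> dict:
        # Fall back to a terminal "done" action once the script is exhausted.
        return next(self._actions, {"type": "done"})
```

With this shape, swapping Claude for Gemini is a one-line change in the experiment config, and execution traces stay comparable across models.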

Results

We evaluated major frontier computer-use models on PA Bench: Claude Opus 4.6, Gemini 3 Pro, Gemini 3 Flash, and OpenAI Computer Use. We selected these four models because they natively support computer-use actions, whereas other frontier models such as GPT-5.2 currently do not. All agents used a shared canonical action space, with the screen resolution set to each provider's recommended configuration. We additionally exposed a standardized tab-switching action to support cross-application workflows, and we capped each episode at 75 steps.
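The evaluation loop implied by this setup (query the model for a canonical action, apply it, stop at "done" or at the 75-step cap) can be sketched as follows. The `env.screenshot()`/`env.apply()` API and the `"done"` action type are assumptions for illustration.

```python
def run_episode(adapter, env, max_steps: int = 75):
    """Drive one evaluation episode: repeatedly query the adapter for a
    canonical action until it declares completion or the step cap is hit.

    `env` is assumed to expose screenshot(), apply(action), and an
    `instruction` attribute; this API is hypothetical.
    """
    trace = []
    for _ in range(max_steps):
        action = adapter.next_action(env.screenshot(), env.instruction)
        trace.append(action)  # record for later error analysis
        if action["type"] == "done":
            break
        env.apply(action)  # canonical schema includes "switch_tab"
    return trace
```

The recorded trace is what makes the per-model error analysis below possible: repeated identical actions, missing verification steps, and premature "done" calls are all visible in it.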

The above image shows the pass rate for each model tested on PA Bench.
The above image shows the average reward for each model tested on PA Bench.

| Model | Task Success Rate (Full Success) | Average Reward (Including Partial Completion) |
| --- | --- | --- |
| Claude Opus 4.6 | 68.8% | 0.73 |
| Gemini-3-flash-preview | 31.3% | 0.41 |
| Gemini-3-pro-preview | 25.0% | 0.48 |
| OpenAI CUA | 12.5% | 0.25 |

  • Task Success Rate: A task is considered successful only if all verifier checks pass (each task's verifier can include multiple checks that contribute to overall success).

  • Average Reward: The mean reward across tasks, computed as the total reward obtained divided by the number of tasks.
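The two metrics above are straightforward to compute from per-task verifier results; a minimal sketch (function names are my own):

```python
def success_rate(task_checks):
    """Fraction of tasks where *all* verifier checks passed.

    `task_checks` is a list of per-task check-result lists, e.g.
    [[True, True], [True, False]] -> 0.5.
    """
    if not task_checks:
        return 0.0
    return sum(1 for checks in task_checks if all(checks)) / len(task_checks)


def average_reward(rewards):
    """Mean reward across tasks: total reward divided by the task count.
    Partially completed tasks contribute fractional reward."""
    return sum(rewards) / len(rewards) if rewards else 0.0
```

This distinction explains why the two columns in the table can disagree: a model that nearly finishes many tasks (e.g. Gemini 3 Pro) can earn a higher average reward than its full-success rate alone would suggest.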

Error Analysis

Claude Opus 4.6

Claude Opus consistently demonstrates recovery-driven behavior rather than single-trajectory execution. When an attempted action does not change the environment, the agent actively searches for an alternative interaction path instead of repeating the same command.

For example, our simulations do not support application-specific keyboard shortcuts, such as Ctrl+C for opening the compose window in the email application; this limitation is typically stated in the system prompt. The model still tried some of these shortcuts via the keypress action, but when a shortcut failed, the agent abandoned the approach and switched to a UI interaction such as double-clicking or selecting elements directly. In the same scenario, other models repeatedly attempted the same shortcut until the step limit was reached.

When keyboard shortcuts such as Ctrl+A are disabled, the agent recovers and uses double-click to select and edit the required text to modify the location.

Claude also performs explicit post-action verification. In meeting cancellation workflows, after deleting the calendar event and sending the notification email, the agent navigates to the outbox to confirm the email was actually sent before terminating the task. This behavior appears in a majority of successful runs and correlates strongly with completion.

Claude Opus 4.6 performing a task that requires it to also send a cancellation email: after clicking Send, the agent goes to the Sent section to verify if the email was actually sent.

Most Claude failures occur earlier in the reasoning stage rather than the execution stage. In several failed scenarios, the agent could not correctly identify the relevant entities such as overlapping meetings. The agent then explored multiple valid interaction paths but operated on the wrong target until the maximum step limit was reached.

Gemini 3 Pro

Gemini 3 Pro generally understands the intent of the task and navigates across applications correctly, but frequently makes small execution errors that lead to failure.

A common pattern is modifying the correct entity but applying the wrong operation. For example, in a meeting modification scenario, the agent attempted to change the meeting location to “Conference Room C” but appended the new location instead of replacing the existing value. The agent then declared the task complete without checking whether the final state matched the requirement.

The above screenshot shows Gemini 3 Pro typing an incorrect location. Instead of deleting and replacing the text in the location field, it incorrectly tries to append text into the location field.

Most Gemini 3 Pro failures follow this pattern. The plan is correct and the agent reaches the appropriate interface, but one step in the execution is slightly incorrect.

In a travel booking workflow, the agent correctly extracted flight details and created a calendar event with the right departure time, but set an incorrect arrival time. Because PA Bench requires the final state to be fully correct, this counted as a failure despite the workflow being mostly completed.

Similarly, in a scheduling scenario, the agent identified a valid time slot and added the correct participants, but forgot to include a meeting link. The workflow was logically solved but not fully executed.

The example above shows one of the scenarios that requires the agent to find a new slot for an existing event. Gemini 3 Pro does everything else correctly but forgets to add the Google Meet link in the invite.

Unlike Claude, Gemini rarely performs post-action verification. After completing the intended actions, it often terminates immediately instead of checking whether the resulting state matches the task requirements. As a result, many failures are near-correct solutions that lack final validation rather than navigation or reasoning mistakes.

Overall, Gemini 3 Pro demonstrates strong planning ability but weak completion reliability. Errors typically arise from imperfect execution and lack of verification rather than misunderstanding the task.

Gemini 3 Flash

Gemini 3 Flash behaves differently from the other models due to its speed-oriented behavior. It performs well on tasks that require limited reasoning and a small number of decisions, but struggles when the workflow requires understanding context across multiple events.

For example, the model can successfully handle a simple meeting modification request received over email and completes it faster than any other model evaluated in this benchmark. However, in more complex scenarios such as multi-meeting coordination, where the agent must interpret an email thread, identify the relevant events in the calendar, and schedule updates accordingly, the model often fails to construct the correct context.

In these cases, the agent executes a sequence of actions without converging on the correct target and eventually reaches the maximum step limit. The failures are therefore not due to navigation errors but due to insufficient reasoning over interconnected information.

Overall, Gemini 3 Flash is efficient for straightforward workflows but unreliable for tasks that require sustained reasoning across multiple related entities.

OpenAI Computer Use

The OpenAI computer-use-preview model primarily fails due to control and exploration issues rather than misunderstanding the task. The agent often identifies the correct objective but cannot reliably progress through the workflow.

A frequent failure pattern is inability to switch application context. Many tasks require moving between email and calendar, but the agent remains stuck in a single tab after attempting to use the tab-switching tool. This prevents it from accessing required information and causes early termination.

In runs where the agent successfully switches context and understands the task, it often gets stuck repeating actions that do not change the environment. Instead of attempting alternative interaction strategies, it retries the same step until the maximum step limit is reached.

Another recurring behavior is unnecessary permission seeking. Even when explicitly instructed not to ask for confirmation, the agent sometimes pauses execution to request approval before performing the required action. For example, in a meeting cancellation workflow the agent asks whether it should cancel the event and notify participants, and then waits without proceeding, leading to failure.

Overall, failures are dominated by lack of recovery and control flow management. The agent struggles to adapt when an action does not produce the expected result, which prevents it from completing otherwise understandable tasks.

Future Work

In the next iteration of PA Bench, we plan to integrate meaningful personal assistant workflows that span 3+ web applications and require 100+ steps to complete. To support this, we are also researching improved task/verifier generation paradigms for automatically synthesizing long-horizon tasks for web and computer-use agents.

If you wish to partner with us on our research/benchmarks, please contact us at [email protected].

Citation

@misc{
  PABench2026, 
  title={PA Bench: Evaluating Web Agents on Real World Personal Assistant Workflows}, 
  author={Shahul Elavakkattil Shereef and Ankit Sridhar}, 
  year={2026}, 
  url={vibrantlabs.com/blog/pa-bench} 
}