Agent Reading Test

Original link: https://agentreadingtest.com

## Agent Reading Test: Evaluating the Web Comprehension of AI Coding Agents

The Agent Reading Test is a benchmark that evaluates how well AI coding agents (such as Claude Code and Copilot) fetch and understand web content, a key skill for working with online documentation. It surfaces "silent failure modes": the trouble agents run into with truncated content, text buried by CSS, client-side rendering, and complex page structures such as tabs.

The test presents an agent with 10 realistic documentation tasks and embeds "canary tokens" in the pages. The agent does not initially know the tokens exist; it reports them only after completing the tasks. This prevents gaming and keeps the focus on genuine reading ability.

The tests cover page truncation, CSS noise, JavaScript-dependent content, tabbed information, and redirect/error handling. Scoring counts the canary tokens found (1 point each) plus correct answers to qualitative questions (1 point each), for a maximum of 20 points. Current agents typically score between 14 and 18.

The benchmark is a companion to the Agent-Friendly Documentation Spec, shifting the focus from evaluating documentation *for* agents to evaluating the agents themselves.


Original Text
Agent Reading Test

A benchmark that tests how well AI coding agents can read web content. Point your agent at the test, get a score, compare across platforms.

What This Tests

AI coding agents (Claude Code, Cursor, GitHub Copilot, and others) read documentation websites as part of their workflows. But most agents hit silent failure modes: content gets truncated, CSS buries the real text, client-side rendering delivers empty shells, and tabbed content serializes into walls of text where only the first variant is visible.

This benchmark surfaces those failure modes. Each test page is designed around a specific problem documented in the Agent-Friendly Documentation Spec. The pages embed canary tokens at strategic positions. But instead of asking agents to hunt for tokens (which games relevance filters), the test gives the agent realistic documentation tasks. Only after the agent completes all tasks does it learn about the canary tokens and report which ones it encountered. You paste the results into a scoring form.

How It Works

  1. Point your agent at the start page. Give your agent the URL agentreadingtest.com/start/ and tell it to follow the instructions.
    Go to https://agentreadingtest.com/start/ and follow the instructions
  2. The agent completes 10 documentation tasks. Each task requires reading a page that targets a specific failure mode. The agent doesn't know about canary tokens yet.
  3. The agent visits the results page. Only after completing all tasks does the agent learn about canary tokens and report which ones it saw.
  4. Paste the results into the scoring form. The agent gives you a comma-separated list of canary tokens. Paste it into the scoring form for a detailed breakdown of what your agent's pipeline delivered and where it lost content.
Score Your Results
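The final step above, scoring the comma-separated token list, amounts to intersecting the reported set with an answer key. A minimal sketch; the token names here are invented for illustration and are not the site's real answer key:

```python
# Hypothetical answer key: the real one lives behind the site's scoring form.
KNOWN_TOKENS = {"ARX-TRUNC-10K", "ARX-SPA-01", "ARX-TAB-04"}  # invented names

def score_tokens(agent_output, known=KNOWN_TOKENS):
    """One point per reported token that matches the answer key."""
    reported = {t.strip() for t in agent_output.split(",") if t.strip()}
    return len(reported & known)

print(score_tokens("ARX-TRUNC-10K, ARX-SPA-01, BOGUS-TOKEN"))  # → 2
```

The bogus token earns nothing: only tokens actually present in the pages (and in the key) count, which is what makes hallucinated tokens detectable.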

The Tests

1. Truncation

150K-char page with canary tokens at 10K, 40K, 75K, 100K, and 130K. Maps exactly where your agent's truncation limit kicks in.

page-size-html, page-size-markdown
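Mapping the truncation limit from surviving canaries is simple in principle: the first missing canary bounds where the cut happened. A sketch under assumed token names and offsets (placeholders, not the test's actual tokens):

```python
# Placeholder canary names/offsets; the real test embeds tokens at
# 10K, 40K, 75K, 100K, and 130K characters.
TOKENS = {"C10K": 10_000, "C40K": 40_000, "C75K": 75_000}

def truncation_limit(page_text, tokens=TOKENS):
    """Return the offset of the first missing canary (roughly where the
    fetch pipeline cut the page), or None if every canary survived."""
    for token, offset in sorted(tokens.items(), key=lambda kv: kv[1]):
        if token not in page_text:
            return offset
    return None

# A page truncated somewhere between 40K and 75K characters:
page = "x" * 10_000 + "C10K" + "x" * 30_000 + "C40K"
print(truncation_limit(page))  # → 75000
```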

2. Boilerplate Burial

80K of inline CSS before the real content. Tests whether agents distinguish CSS noise from documentation.

content-start-position
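A fetch pipeline can sidestep this failure mode by dropping style blocks before handing text to the model. A rough sketch of that preprocessing step (a production pipeline would use a real HTML parser rather than a regex):

```python
import re

def strip_inline_css(html):
    """Drop <style>...</style> blocks so documentation text dominates
    the token budget instead of CSS noise."""
    return re.sub(r"<style\b[^>]*>.*?</style>", "", html, flags=re.S | re.I)

html = "<style>" + "a{color:red}" * 10 + "</style><p>Real docs here</p>"
print(strip_inline_css(html))  # → <p>Real docs here</p>
```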

3. SPA Shell

Client-side rendered page. Content only appears after JavaScript executes. Most agents see an empty shell.

rendering-strategy
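An agent can at least detect that it received an empty shell rather than real content. A crude heuristic sketch (the tag-stripping regex and the 200-character threshold are arbitrary assumptions, not part of the benchmark):

```python
import re

def is_empty_shell(html, min_text=200):
    """A client-rendered page fetched without JavaScript often carries a
    bare mount point and almost no visible text."""
    text = re.sub(r"<[^>]+>", " ", html)   # crude tag strip
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text

shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(is_empty_shell(shell))  # → True
```

A pipeline that flags this case can fall back to a headless browser or a markdown endpoint instead of summarizing nothing.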

4. Tabbed Content

8 language variants in tabs. Canary tokens in tabs 1, 4, and 8. Tests how far into serialized tab content the agent reads.

tabbed-content-serialization

5. Soft 404

Returns HTTP 200 with a "page not found" message. Tests whether the agent recognizes it as an error page.

http-status-codes
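Recognizing a soft 404 requires a content heuristic, since the status code lies. A hedged sketch; the phrase list and length cutoff are illustrative guesses, not what the benchmark checks for:

```python
NOT_FOUND_PHRASES = ("page not found", "does not exist", "404")

def looks_like_soft_404(status, body):
    """HTTP 200 plus not-found language in a short body is a likely
    soft 404 rather than real documentation."""
    if status != 200:
        return False
    text = body.lower()
    return len(text) < 2000 and any(p in text for p in NOT_FOUND_PHRASES)

print(looks_like_soft_404(200, "<h1>Page not found</h1>"))  # → True
```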

6. Broken Code Fence

Markdown with an unclosed code fence. Everything after it becomes "code." Tests markdown parsing awareness.

markdown-code-fence-validity
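The unclosed-fence failure is mechanically detectable: an odd number of fence lines means the last fence never closed, so a markdown parser treats everything after it as code. A minimal parity-check sketch:

```python
def has_unclosed_fence(markdown):
    """An odd count of ``` fence lines means the final fence is unclosed,
    so all trailing prose parses as code."""
    fences = sum(1 for line in markdown.splitlines()
                 if line.lstrip().startswith("```"))
    return fences % 2 == 1

doc = "Intro\n```python\nprint('hi')\n# fence never closed\nMore prose"
print(has_unclosed_fence(doc))  # → True
```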

7. Content Negotiation

Different canary tokens in HTML vs. markdown versions. Tests whether your agent requests the better format.

content-negotiation
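Requesting the better format is a matter of sending an `Accept` header. A sketch of how a pipeline might ask for markdown first; the exact media type the site negotiates on is an assumption here, and the URL is a placeholder:

```python
import urllib.request

def markdown_request(url):
    """Prefer markdown; a server that negotiates content types can then
    skip serializing the HTML version."""
    return urllib.request.Request(
        url, headers={"Accept": "text/markdown, text/html;q=0.8"})

req = markdown_request("https://example.com/docs/")
print(req.get_header("Accept"))  # → text/markdown, text/html;q=0.8
```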

8. Cross-Host Redirect

301 redirect to a different hostname. Most agents won't follow it (security measure). The canary is on the other side.

redirect-behavior
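Whether an agent follows the redirect or not, it can at least report *why* it stopped. A sketch of the cross-host check such a pipeline might apply; the example URLs are placeholders:

```python
from urllib.parse import urlparse

def is_cross_host(original, redirect_target):
    """Flag a redirect that leaves the original hostname -- many agent
    fetch pipelines refuse these as a security measure."""
    return urlparse(original).hostname != urlparse(redirect_target).hostname

print(is_cross_host("https://agentreadingtest.com/tests/redirect/",
                    "https://example.org/landing"))  # → True
```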

9. Header Quality

Three cloud platforms, identical "Step 1/2/3" headers. Tests whether agents can determine which section is which.

section-header-quality

10. Content Start

Real content buried after 50% navigation chrome. Tests whether agents read past the sidebar serialization.

content-start-position

Scoring

The test has a maximum score of 20 points. Each canary token found earns 1 point, and correct answers to qualitative questions earn 1 point each. The answer key has the full breakdown.

A perfect score is unlikely for any current agent. The tests are calibrated so that each failure mode will realistically affect at least some agents. A typical score range for current agents is probably 14-18 out of 20, depending on the platform's web fetch pipeline.

About

Agent Reading Test is a companion project to the Agent-Friendly Documentation Spec, which defines 22 checks across 8 categories evaluating how well documentation sites serve AI agent consumers. The spec is grounded in empirical observation of real agent workflows.

This benchmark flips the perspective: instead of testing the documentation site, it tests the agent. The same failure modes apply, but here we're measuring which agents handle them gracefully and which don't.

Source code: github.com/agent-ecosystem/agent-reading-test
