XBOW, an autonomous penetration tester, has reached the top spot on HackerOne

Original link: https://xbow.com/blog/top-1-how-xbow-did-it/

XBOW, an autonomous AI penetration tester, has reached the top spot on the HackerOne US leaderboard, a first in bug bounty history. Its journey began with rigorous benchmarking against CTF challenges and custom simulations of real-world scenarios. It then moved on to discovering zero-day vulnerabilities in open source projects, simulating white-box penetration tests. To test XBOW in real environments, the team entered it into public and private bug bounty programs on HackerOne, where it autonomously identified thousands of vulnerabilities across diverse systems. To scale, they built infrastructure that uses large language models (LLMs), plus manual review, to parse program scopes and policies and prioritize high-value targets, and they applied content and visual similarity detection to avoid duplicate reports. XBOW uses "validators", automated peer reviewers, to minimize false positives. This approach surfaced verified vulnerabilities in high-profile targets and ultimately earned the top ranking. XBOW submitted nearly 1,060 vulnerabilities, many of which have already been resolved or triaged by the bug bounty programs. The findings include critical issues such as remote code execution and SQL injection; one notable discovery was a previously unknown vulnerability in Palo Alto's GlobalProtect VPN.

XBOW's rise to the top of the HackerOne leaderboard sparked a lively discussion on Hacker News. Some saw it as a major achievement, while others noted that HackerOne's economic model rewards volume and that many programs fail to attract top talent. The tool's success highlights the demand for automated solutions that address the low-severity vulnerabilities experts often overlook. Commenters debated the quality and validity of AI-generated vulnerability reports, acknowledging the risk of noisy submissions but also recognizing the potential to rapidly identify critical flaws such as XXE, RCE, and SQL injection. The discussion also touched on how AI tools may reshape the bug bounty landscape, with open source projects potentially needing new strategies to filter low-quality submissions. Ultimately, the consensus was that AI is likely to augment rather than replace human security experts, freeing them to focus on more complex tasks.

Original Article

For the first time in bug bounty history, an autonomous penetration tester has reached the top spot on the US leaderboard.

Our path to reaching the top ranks on HackerOne began with rigorous benchmarking. Since the early days of XBOW, we understood how crucial it was to measure our progress, and we did that in two stages:

  • First we tested XBOW with existing CTF challenges (from well-known providers like PortSwigger and Pentesterlab), then quickly moved on and built our own unique benchmark that simulates real-world scenarios—ones never used to train LLMs before. The results were encouraging, but still these were artificial exercises.
  • The logical next step, therefore, was to focus on discovering zero-day vulnerabilities in open source projects, which led to many exciting findings. Some of these were reported on this blog before: in every case, we gave the AI access to source code, simulating a white-box pentest. While our paying customers were enthusiastic about XBOW’s capabilities, the community raised a key question: How would XBOW perform in real, black-box production environments? We took up that challenge, choosing to compete in one of the largest hacker arenas, where companies serve as the ultimate judges by verifying and triaging vulnerabilities themselves.

Dogfooding AI in Bug Bounties

XBOW is a fully autonomous, AI-driven penetration tester. It requires no human input and operates much like a human pentester, but it scales rapidly, completing comprehensive penetration tests in just a few hours.

When building AI software, having precise benchmarks to keep pushing the limit of what's possible is essential. But when some of those benchmarks evolve into real-world environments, it's a developer's dream come true.

Discovering bugs in structured benchmarks and open source projects was a fantastic starting point. However, nothing can truly prepare you for the immense diversity of real-world environments, which span from cutting-edge technologies to 30-year-old legacy systems. No number of design partners can offer that breadth of system variety, and that level of unpredictability is nearly impossible to simulate.

To bridge that gap, we started dogfooding XBOW in public and private bug bounty programs hosted on HackerOne. We treated it like any external researcher would: no shortcuts, no internal knowledge—just XBOW, running on its own.

HackerOne offers this unique opportunity, and as XBOW discovered and reported vulnerabilities across multiple programs, we soon found ourselves climbing the H1 ranks.

Scaling Discovery and Scoping Capabilities

Our first challenge was scaling. While XBOW can easily scan thousands of web apps simultaneously, HackerOne hosts hundreds of thousands of potential targets. As a startup with limited resources, even when we focused on specific vulnerability classes, we still needed to be strategic. That’s why we built infrastructure on top of XBOW to help us identify the high-value targets and prioritize those that would maximize our return on investment.

We started by consuming bug bounty program scopes and policies, but this information isn’t always machine-readable. With a combination of large language models and some manual curation, we managed to parse through them—with a few hiccups. (At one point, we were officially removed from a program that didn’t allow “automatic scanners.”)
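As an illustration, here is a minimal sketch of what LLM-assisted scope parsing might look like. The prompt, model name, and JSON schema are our assumptions for the example; the post does not describe the actual pipeline.

```python
# Minimal sketch of LLM-assisted scope parsing (hypothetical prompt and
# schema; the real pipeline is not public). Requires: pip install openai
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Extract the bug bounty scope from the policy text below.
Return a JSON object with keys "in_scope" and "out_of_scope", each a list
of domains or URL patterns.

Policy text:
{policy}"""

def parse_scope(policy_text: str) -> dict:
    """Turn free-form policy text into a machine-readable scope listing."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable instruction-following model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT.format(policy=policy_text)}],
    )
    return json.loads(resp.choices[0].message.content)
```

Output like this still needs the manual curation mentioned above; ambiguous policies, and ones that ban automated scanners outright, are exactly where the hiccups occur.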

With the domains ingested into our database, and a bit of "magic" to expand subdomains, we built a scoring system to highlight the most interesting targets. The scoring criteria covered a broad range of signals, including target appearance, the presence of WAFs and other protections, HTTP status codes, redirect behavior, authentication forms, the number of reachable endpoints, underlying technologies, and more.
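To make the idea concrete, a toy version of such a scoring function might look like the sketch below. The signal names mirror the ones listed above, but the weights and thresholds are invented for illustration; XBOW's real scoring model is not public.

```python
# Illustrative target-scoring sketch over the signals named above.
from dataclasses import dataclass

@dataclass
class TargetSignals:
    status_code: int        # HTTP status of the landing page
    behind_waf: bool        # WAF or similar protection detected
    has_login_form: bool    # authentication form present
    endpoint_count: int     # reachable endpoints found while crawling
    tech_stack: list[str]   # fingerprinted technologies

def score(t: TargetSignals) -> float:
    s = 0.0
    if t.status_code == 200:
        s += 1.0                          # live app, not a parked domain
    if t.has_login_form:
        s += 2.0                          # auth surfaces often hide complex logic
    s += min(t.endpoint_count, 50) * 0.1  # more endpoints, more attack surface
    if t.behind_waf:
        s -= 1.0                          # protections lower the expected yield
    if any(x in t.tech_stack for x in ("php", "struts", "jboss")):
        s += 1.5                          # legacy stacks historically yield more bugs
    return s
```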

Domain deduplication quickly became essential: in large programs, it is common to encounter cloned or staging environments (e.g. stage0001-dev.example.com). Once a vulnerability is found in one, similar issues are likely to exist across the others. To stay efficient, we used SimHash to detect content-level similarity, and we leveraged a headless browser to capture website screenshots and applied imagehash techniques to assess visual similarity, allowing us to group assets and focus our efforts on unique, high-impact targets.
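A simplified sketch of those two deduplication signals, using the simhash and imagehash Python packages (the packages match the techniques named above; the choice of Playwright as the headless browser and the distance thresholds are our assumptions):

```python
# Content-level and visual similarity for asset deduplication.
# pip install simhash imagehash pillow playwright && playwright install chromium
from PIL import Image
from simhash import Simhash
import imagehash
from playwright.sync_api import sync_playwright

def content_fingerprint(html: str) -> Simhash:
    """64-bit SimHash over page tokens; near-duplicates land a few bits apart."""
    return Simhash(html.split())

def visual_fingerprint(url: str, shot_path: str = "shot.png") -> imagehash.ImageHash:
    """Screenshot the page headlessly and compute a perceptual hash."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=shot_path)
        browser.close()
    return imagehash.phash(Image.open(shot_path))

def likely_clones(fp_a: Simhash, fp_b: Simhash,
                  vh_a: imagehash.ImageHash, vh_b: imagehash.ImageHash) -> bool:
    """Group assets only when both signals agree; thresholds are illustrative."""
    return fp_a.distance(fp_b) <= 3 and vh_a - vh_b <= 5
```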

Automated Vulnerability Discovery

AI can be remarkably effective at discovering a broad range of vulnerabilities, but the real challenge isn't always detection; it's precision. Automation has long struggled with false positives, and nowhere is this more evident than in vulnerability scanning. Tools that flag dozens of irrelevant issues often create more work than they save. When AI enters the equation, the stakes grow even higher: models can generalize well, but verifying technical edge cases is a different game entirely.

To ensure accuracy, we developed the concept of validators: automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks. For example, to validate Cross-Site Scripting findings, a headless browser visits the target site to verify that the JavaScript payload was truly executed. (Don't miss Brendan Dolan-Gavitt's Black Hat presentation on AI agents for offensive security.)
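As a rough sketch of such a programmatic check, the snippet below drives a headless browser to a candidate URL and reports success only if a uniquely marked payload actually executes. Playwright, the marker token, and the timeout are our assumptions, not XBOW's actual validator.

```python
# Sketch of a programmatic XSS validator: confirm the injected JavaScript
# actually ran, rather than merely appearing in the response body.
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

MARKER = "xbow-validated-1337"  # hypothetical unique token carried by the payload

def xss_executed(candidate_url: str) -> bool:
    """Return True only if the payload's JS fired (alert or console marker)."""
    fired = False
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        def saw_marker(text: str) -> None:
            nonlocal fired
            if MARKER in text:
                fired = True

        # alert(MARKER) surfaces as a dialog; console.log(MARKER) as a console event.
        page.on("dialog", lambda d: (saw_marker(d.message), d.dismiss()))
        page.on("console", lambda msg: saw_marker(msg.text))
        page.goto(candidate_url)
        page.wait_for_timeout(2000)  # give delayed payloads a chance to fire
        browser.close()
    return fired
```

A reflected payload such as `<script>alert('xbow-validated-1337')</script>` would trip the dialog handler, while a payload that merely appears in the HTML without executing would not, which is precisely the distinction a validator needs to draw.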

XBOW’s Real-World Impact

Running XBOW across a wide range of public and private programs yielded results that exceeded our expectations—not just in volume, but in consistency and quality.

Over time, XBOW reported thousands of validated vulnerabilities, many of them affecting high-profile targets from well-known companies. These findings weren’t just theoretical; every submission was confirmed by the program owners and triaged as real, actionable security issues.

The most public signal of progress came from the HackerOne leaderboard. Competing alongside thousands of human researchers, XBOW climbed to the top position in the US ranking. That wasn't our original goal, and it came as a surprise, since we didn't have a buffer of untriaged reports from previous quarters, but the ranking became a useful benchmark to track real-world performance and to collect traces to reinforce our models.

XBOW submitted nearly 1,060 vulnerabilities. All findings were fully automated, though our security team reviewed them pre-submission to comply with HackerOne’s policy on automated tools. It was a unique privilege to wake up each morning and review creative new exploits.

To date, bug bounty programs have resolved 130 vulnerabilities, while 303 were classified as Triaged (mostly by VDP programs that acknowledged the issue but did not proceed to resolution). In addition, 33 reports are currently marked as new, and 125 remain pending review by program owners.

Across all submissions, 208 were marked as duplicates, 209 as informative, and 36 as not applicable (most of them self-closed by our team). Interestingly, many of the informative findings came from programs with specific constraints, such as policies excluding third-party vulnerabilities or disallowing certain classes like Cache Poisoning.

XBOW identified a full spectrum of vulnerabilities, including Remote Code Execution, SQL Injection, XML External Entities (XXE), Path Traversal, Server-Side Request Forgery (SSRF), Cross-Site Scripting, Information Disclosure, Cache Poisoning, Secret Exposure, and more.

Over the past 90 days alone, the vulnerabilities submitted were classified as 54 critical, 242 high, 524 medium, and 65 low severity issues by program owners. Notably, around 45% of XBOW’s findings are still awaiting resolution, highlighting the volume and impact of the submissions across live targets.

XBOW’s path to the top involved uncovering a wide range of interesting and impactful vulnerabilities. Among them was a previously unknown vulnerability in Palo Alto’s GlobalProtect VPN solution, affecting over 2,000 hosts. Throughout this process, XBOW consistently demonstrated its ability to adapt to edge cases and develop creative strategies for complex exploitation scenarios entirely on its own.

In the spirit of transparency, and in accordance with the rules and regulations of POC || GTFO, our security team will be publishing a series of blog posts over the coming weeks, showcasing some of our favorite technical discoveries by XBOW.

XBOW is an enterprise solution. If your company would like a demo, email us at [email protected].
