We’re all coding with agents now, but delivering high-quality software at 10x velocity remains an open problem. Code review bots are an important start, but a lot of bugs are still landing in production. Even top products are accumulating a layer of low-grade brokenness. We need new ways to make products secure and high quality.

We built a new kind of bug scanner to solve this problem.

The hard part about building a bug scanner is that any meaningfully complicated codebase has many thousands of bugs, and the vast majority don’t matter. You want to reserve human attention (and your tokens) for the bugs that do. So one of the most important ways we benchmark ourselves is whether the bugs we report are significantly more important than the typical finding from a code review bot.

How can we quantify importance?

“Important” is a codebase-relative term. A crash in OpenBSD is a headliner finding for a frontier model launch, but a crash in an experimental personal project might not be worth thinking about.

One way to evaluate ourselves is against review bots that operate on individual PRs in a repo. We find PR bots to be generally high value (we have 3 installed right now!), so they’re a useful baseline for signal-to-noise ratio.

To put numbers on this, we compared one week of review bot comments on OpenClaw and vLLM against bug reports from Detail. We chose OpenClaw and vLLM because both are extremely successful products with wildly different development practices and very different definitions of “importance”.

We also put Detail in “recent changes mode”, which limits it to the same week of changes that the review bots saw. This also means that any bugs Detail found were missed by code review and merged into the repo; in other words, the code review bots get the first pass at the code.

We ran a tournament in which Sonnet 4.6 judged pairs of findings and decided which was more “important”. All the pairwise outcomes were then fed into a Bradley-Terry model to produce a global ranking; you can think of it as an Elo score for bug reports. The tournament code is here.
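As a rough illustration (not the tournament code linked above), here is a minimal sketch of fitting a Bradley-Terry model to pairwise judge decisions using the standard MM update; the finding IDs and data layout are hypothetical.

```python
from collections import defaultdict

def bradley_terry(pairs, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the MM algorithm."""
    wins = defaultdict(float)
    matches = defaultdict(float)
    items = set()
    for winner, loser in pairs:
        wins[winner] += 1.0
        matches[(winner, loser)] += 1.0
        matches[(loser, winner)] += 1.0
        items.update((winner, loser))

    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        new_strength = {}
        for i in items:
            denom = sum(
                matches[(i, j)] / (strength[i] + strength[j])
                for j in items if j != i and matches[(i, j)] > 0
            )
            # Small pseudo-count keeps winless findings from collapsing to zero strength.
            new_strength[i] = (wins[i] + 0.01) / denom if denom > 0 else strength[i]
        total = sum(new_strength.values())
        strength = {i: s / total for i, s in new_strength.items()}  # normalize each round
    return strength

# Hypothetical usage: each tuple is (winning finding ID, losing finding ID).
scores = bradley_terry([("detail-12", "bot-3"), ("bot-3", "bot-7"), ("detail-12", "bot-7")])
```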

We gave Sonnet 4.6 this prompt:

You are comparing two bug findings from code review tools on the same codebase.

An engineer only has time to investigate one. Pick the one that is more important for them to see.

Finding A: {finding_a}

Finding B: {finding_b}

Use the select_winner tool to pick a or b.
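For concreteness, a single comparison could be run as a forced tool call along these lines. This is a sketch using the Anthropic Python SDK; the model identifier is a placeholder and the helper is ours, not the production tournament code.

```python
import anthropic

client = anthropic.Anthropic()

JUDGE_MODEL = "claude-sonnet-4-6"  # placeholder; substitute the actual Sonnet 4.6 model ID

def judge(finding_a: str, finding_b: str) -> str:
    """Ask the judge model to pick the more important finding; returns 'a' or 'b'."""
    prompt = (
        "You are comparing two bug findings from code review tools on the same codebase.\n\n"
        "An engineer only has time to investigate one. Pick the one that is more important "
        "for them to see.\n\n"
        f"Finding A: {finding_a}\n\n"
        f"Finding B: {finding_b}\n\n"
        "Use the select_winner tool to pick a or b."
    )
    response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=256,
        tools=[{
            "name": "select_winner",
            "description": "Select the more important finding.",
            "input_schema": {
                "type": "object",
                "properties": {"winner": {"type": "string", "enum": ["a", "b"]}},
                "required": ["winner"],
            },
        }],
        tool_choice={"type": "tool", "name": "select_winner"},  # force the tool call
        messages=[{"role": "user", "content": prompt}],
    )
    tool_use = next(block for block in response.content if block.type == "tool_use")
    return tool_use.input["winner"]
```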

After our first run, we realized that the depth of evidence in the Detail findings was strongly biasing the model in favor of Detail. So we took both the review bot comments and the Detail findings and put them through a summarization prompt:

Summarize this code review finding in one sentence.

{finding}
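A sketch of that normalization step, assuming the same SDK client and model placeholder as above (the helper name is ours):

```python
def summarize(finding: str) -> str:
    """Collapse a finding to one sentence so the judge sees evidence-free summaries."""
    response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=128,
        messages=[{
            "role": "user",
            "content": f"Summarize this code review finding in one sentence.\n\n{finding}",
        }],
    )
    return response.content[0].text
```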

Once we had the importance ranks, we converted them into percentiles and plotted the distribution of findings. The higher the percentile, the more important the finding.
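A minimal sketch of that last step, assuming `scores` maps finding IDs to Bradley-Terry strengths and that each ID encodes its source (both assumptions on our part, not the plotting code we actually used):

```python
import matplotlib.pyplot as plt

# Rank findings by Bradley-Terry strength, then convert ranks to percentiles (0-100).
ranked = sorted(scores, key=scores.get)
percentile = {fid: 100.0 * i / (len(ranked) - 1) for i, fid in enumerate(ranked)}

detail = [p for fid, p in percentile.items() if fid.startswith("detail-")]
bots = [p for fid, p in percentile.items() if not fid.startswith("detail-")]

plt.hist([detail, bots], bins=10, label=["Detail", "Review bots"])
plt.xlabel("Importance percentile")
plt.ylabel("Number of findings")
plt.legend()
plt.show()
```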