We’re all coding with agents now, but delivering high-quality software at 10x velocity remains an open problem. Code review bots are an important start, but a lot of bugs are still landing in production. Even top products are accumulating a layer of low-grade brokenness. We need new ways to make products secure and high quality.

We built a new kind of bug scanner to solve this problem.

The hard part about building a bug scanner is that any meaningfully complicated codebase has many thousands of bugs, and the vast majority don’t matter. You want to reserve human attention (and your tokens) for the bugs that do. So one of the most important ways we benchmark ourselves is whether the bugs we report are significantly more important than the typical finding from a code review bot.

How can we quantify importance?

“Important” is a codebase-relative term. A crash in OpenBSD is a headliner finding for a frontier model launch, but a crash in an experimental personal project might not be worth thinking about.

One way to evaluate ourselves is against review bots that operate on individual PRs in a repo. We find PR bots to be generally high value (we have 3 installed right now!), so they’re a useful baseline for signal-to-noise ratio.

To put numbers on this, we compared one week of review bot comments on OpenClaw and vLLM against bug reports from Detail. We chose OpenClaw and vLLM because both are extremely successful products with wildly different development practices and very different definitions of “importance”.

We also put Detail in “recent changes mode”, which limits it to the same week of changes that the review bots saw. This also means that any bugs Detail found were missed by code review and merged into the repo; in other words, the code review bots get the first pass at the code.

We ran a tournament in which Sonnet 4.6 judged pairs of findings and decided which was more “important”. All the pairwise outcomes were then fed into a Bradley-Terry model to produce a global ranking; you can think of it as an Elo score for bug reports. The tournament code is here.
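As a rough illustration (not the tournament code linked above), here is a minimal sketch of fitting a Bradley-Terry model to pairwise judge decisions using the standard MM update; the finding IDs and data layout are hypothetical.

```python
from collections import defaultdict

def bradley_terry(pairs, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the MM algorithm."""
    wins = defaultdict(float)
    matches = defaultdict(float)
    items = set()
    for winner, loser in pairs:
        wins[winner] += 1.0
        matches[(winner, loser)] += 1.0
        matches[(loser, winner)] += 1.0
        items.update((winner, loser))

    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        new_strength = {}
        for i in items:
            denom = sum(
                matches[(i, j)] / (strength[i] + strength[j])
                for j in items if j != i and matches[(i, j)] > 0
            )
            # Small pseudo-count keeps winless findings from collapsing to zero strength.
            new_strength[i] = (wins[i] + 0.01) / denom if denom > 0 else strength[i]
        total = sum(new_strength.values())
        strength = {i: s / total for i, s in new_strength.items()}  # normalize each round
    return strength

# Hypothetical usage: each tuple is (winning finding ID, losing finding ID).
scores = bradley_terry([("detail-12", "bot-3"), ("bot-3", "bot-7"), ("detail-12", "bot-7")])
```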

We gave Sonnet 4.6 this prompt:

You are comparing two bug findings from code review tools on the same codebase.

An engineer only has time to investigate one. Pick the one that is more important for them to see.

Finding A: {finding_a}

Finding B: {finding_b}

Use the select_winner tool to pick a or b.
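For concreteness, a single comparison could be run as a forced tool call along these lines. This is a sketch using the Anthropic Python SDK; the model identifier is a placeholder and the helper is ours, not the production tournament code.

```python
import anthropic

client = anthropic.Anthropic()

JUDGE_MODEL = "claude-sonnet-4-6"  # placeholder; substitute the actual Sonnet 4.6 model ID

def judge(finding_a: str, finding_b: str) -> str:
    """Ask the judge model to pick the more important finding; returns 'a' or 'b'."""
    prompt = (
        "You are comparing two bug findings from code review tools on the same codebase.\n\n"
        "An engineer only has time to investigate one. Pick the one that is more important "
        "for them to see.\n\n"
        f"Finding A: {finding_a}\n\n"
        f"Finding B: {finding_b}\n\n"
        "Use the select_winner tool to pick a or b."
    )
    response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=256,
        tools=[{
            "name": "select_winner",
            "description": "Select the more important finding.",
            "input_schema": {
                "type": "object",
                "properties": {"winner": {"type": "string", "enum": ["a", "b"]}},
                "required": ["winner"],
            },
        }],
        tool_choice={"type": "tool", "name": "select_winner"},  # force the tool call
        messages=[{"role": "user", "content": prompt}],
    )
    tool_use = next(block for block in response.content if block.type == "tool_use")
    return tool_use.input["winner"]
```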

After our first run, we realized that the depth of evidence in the Detail findings was strongly biasing the model in favor of Detail. So we took both the review bot comments and the Detail findings and put them through a summarization prompt:

Summarize this code review finding in one sentence.

{finding}
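A sketch of that normalization step, assuming the same SDK client and model placeholder as above (the helper name is ours):

```python
def summarize(finding: str) -> str:
    """Collapse a finding to one sentence so the judge sees evidence-free summaries."""
    response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=128,
        messages=[{
            "role": "user",
            "content": f"Summarize this code review finding in one sentence.\n\n{finding}",
        }],
    )
    return response.content[0].text
```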

Once we had the importance ranks, we converted them into percentiles and plotted the distribution of findings. The higher the percentile, the more important the finding.
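A minimal sketch of that last step, assuming `scores` maps finding IDs to Bradley-Terry strengths and that each ID encodes its source (both assumptions on our part, not the plotting code we actually used):

```python
import matplotlib.pyplot as plt

# Rank findings by Bradley-Terry strength, then convert ranks to percentiles (0-100).
ranked = sorted(scores, key=scores.get)
percentile = {fid: 100.0 * i / (len(ranked) - 1) for i, fid in enumerate(ranked)}

detail = [p for fid, p in percentile.items() if fid.startswith("detail-")]
bots = [p for fid, p in percentile.items() if not fid.startswith("detail-")]

plt.hist([detail, bots], bins=10, label=["Detail", "Review bots"])
plt.xlabel("Importance percentile")
plt.ylabel("Number of findings")
plt.legend()
plt.show()
```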