Learnings from building AI agents

Original link: https://www.cubic.dev/blog/learnings-from-building-ai-agents

Cubic's AI code review agent initially suffered from excessive noise, generating too many low-value comments and false positives that frustrated developers. After significant architectural revisions, they achieved a 51% reduction in false positives. The key learnings were:

  1. **Explicit Reasoning Logs:** Forcing the AI to justify its findings before providing feedback improved accuracy and debuggability.

  2. **Simplified Toolset:** Streamlining the tools the agent used reduced confusion and improved focus.

  3. **Specialized Micro-Agents:** Replacing a single, all-encompassing agent with specialized micro-agents (e.g., Security Agent, Duplication Agent) focused on narrow tasks significantly enhanced precision.

Together, these changes also halved the median comments per pull request and improved developer trust and engagement, resulting in smoother and more impactful code reviews. The core takeaway is that clarity, simplicity, and specialization are crucial for designing effective AI systems, particularly in complex domains like code review.

This Hacker News thread discusses a blog post about building AI agents for code review, focusing on learnings from cubic.dev. Commenters debate the effectiveness of "micro-agents" and the tendency of LLMs to always produce a result, even if irrelevant. Some suggest using an "opt-out" mechanism for optional fields and splitting prompts into smaller chunks to reduce bias. A key point of contention is the use of confidence scores generated by LLMs, with many arguing these are often arbitrary and unreliable. Several users critique the example used in the blog post (commenting out CI tests), suggesting it's a poor illustration. The conversation highlights the challenges of using LLMs for tasks requiring accuracy and contextual understanding, like code reviews. There's also a discussion on whether specialized micro-agents are superior to large, rule-based prompts and if LLMs can effectively replace software developers.

Original article

I’m Paul, cofounder of cubic—an "AI-native GitHub." One of our core features is an AI code review agent that performs an initial review pass, catching bugs, anti-patterns, duplicated code, and similar issues in pull requests.

When we first released this agent back in April, the main feedback we got was straightforward: it was too noisy.

Even small PRs often ended up flooded with multiple low-value comments, nitpicks, or outright false positives. Rather than helping reviewers, it cluttered discussions and obscured genuinely valuable feedback.

(Image: an example nitpick comment)

We decided to take a step back and thoroughly investigate why this was happening.

After three major architecture revisions and extensive offline testing, we managed to reduce false positives by 51% without sacrificing recall.

Many of these lessons turned out to be broadly useful—not just for code review agents but for designing effective AI systems in general.

1. The Face‑Palm Phase: A Single, Do‑Everything Agent

Our initial architecture was straightforward but problematic:

[diff] → [single large prompt with contextual codebase info] → [list of comments]
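
In code terms, the whole pass boiled down to a single model call that took the diff plus codebase context and emitted comments directly. The sketch below is purely illustrative; the function names, prompt wording, and callModel wrapper are assumptions, not cubic's actual implementation:

// Illustrative sketch of the original single-pass design (hypothetical names).
interface ReviewComment {
  file: string;
  line: number;
  body: string;
}

// Assumed thin wrapper around whatever LLM API is in use.
declare function callModel(prompt: string): Promise<string>;

async function reviewPullRequest(diff: string, codebaseContext: string): Promise<ReviewComment[]> {
  // One large prompt carried the diff, the contextual codebase info,
  // and every rule the model was expected to follow.
  const prompt = [
    "You are a code review agent. Report bugs, anti-patterns, and duplicated code.",
    `Codebase context:\n${codebaseContext}`,
    `Diff:\n${diff}`,
    'Respond with a JSON array of {"file", "line", "body"} comments.',
  ].join("\n\n");
  const raw = await callModel(prompt);
  return JSON.parse(raw) as ReviewComment[];
}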

This single-pass design looked clean in theory but quickly fell apart in practice:

  • Excessive false positives: The agent often mistook style issues for critical bugs, flagged resolved issues, and repeated suggestions our linters had already addressed.

  • Users lost trust: Developers quickly learned to ignore the comments altogether. When half the comments feel irrelevant, the truly important ones get missed.

  • Opaque reasoning: Understanding why the agent made specific calls was practically impossible. Even explicit prompts like "ignore minor style issues" had minimal effect.

We tried standard solutions—longer prompts, adjusting the model's temperature, experimenting with sampling—but saw little meaningful improvement.

2. What Finally Worked

After extensive trial-and-error, we developed an architecture that significantly improved results and proved effective in real-world repositories. These solutions underpin the 51% reduction in false positives we now see in production.

2.1 Explicit Reasoning Logs

We required the AI to explicitly state its reasoning before providing any feedback:

{
  "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
  "finding": "Possible nil‑pointer dereference",
  "confidence": 0.81
}

This approach provided critical benefits:

  • Enabled us to clearly trace the AI’s decision-making process. If reasoning was flawed, we could quickly identify and exclude the pattern in future iterations.

  • Encouraged structured thinking by forcing the AI to justify its findings first, significantly reducing arbitrary conclusions.

  • Created a foundation to diagnose and resolve root causes behind other issues we faced.
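
To make that contract enforceable, the structured output can be validated before anything is posted. Below is a minimal sketch using a zod schema; the confidence threshold, function name, and discard-on-failure behavior are illustrative assumptions rather than cubic's actual pipeline:

import { z } from "zod";

// Reasoning comes first in the schema, mirroring the "justify, then conclude" order above.
const Finding = z.object({
  reasoning: z.string().min(1),
  finding: z.string().min(1),
  confidence: z.number().min(0).max(1),
});
type Finding = z.infer<typeof Finding>;

// Parse the model's raw JSON output; findings that fail validation or fall
// below an (assumed) confidence threshold are dropped rather than posted.
function acceptFindings(rawModelOutput: string, minConfidence = 0.7): Finding[] {
  try {
    const parsed = z.array(Finding).safeParse(JSON.parse(rawModelOutput));
    if (!parsed.success) return [];
    return parsed.data.filter((f) => f.confidence >= minConfidence);
  } catch {
    return []; // malformed JSON from the model is simply discarded
  }
}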

2.2 Fewer, Smarter Tools

Initially, the agent had extensive tooling—Language Server Protocol (LSP), static analysis, test runners, and more. However, explicit reasoning logs revealed most analyses relied on a few core tools, with extra complexity causing confusion and mistakes.

We streamlined the toolkit to essential components only—a simplified LSP and a basic terminal.

With fewer distractions, the agent spent more energy confirming genuine issues, significantly improving precision.
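
As a sketch, the trimmed toolkit can be expressed as an explicit two-entry tool list handed to the agent. The tool names and the lspClient/sandbox wrappers below are assumptions for illustration only:

// Hypothetical tool definitions: a simplified LSP lookup plus a basic terminal, nothing else.
interface Tool {
  name: string;
  description: string;
  run: (input: string) => Promise<string>;
}

// Assumed wrappers around a language server and a sandboxed shell.
declare const lspClient: { lookup(symbol: string): Promise<string> };
declare const sandbox: { exec(command: string): Promise<string> };

const tools: Tool[] = [
  {
    name: "lsp_lookup",
    description: "Resolve a symbol's definition, type, and references.",
    run: (symbol) => lspClient.lookup(symbol),
  },
  {
    name: "terminal",
    description: "Run a read-only shell command in the repository checkout.",
    run: (command) => sandbox.exec(command),
  },
];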

2.3 Specialized Micro-Agents Over Generalized Rules

Initially, our instinct was to continuously add more rules into a single large prompt to handle edge cases:

  • “Ignore unused variables in .test.ts files.”

  • “Skip import checks in Python’s __init__.py.”

  • “Don't lint markdown files.”

This rapidly became unsustainable and was largely ineffective as the AI frequently overlooked many rules.

Our breakthrough came from employing specialized micro-agents, each handling a narrowly-defined scope:

  • Planner: Quickly assesses changes and identifies necessary checks.

  • Security Agent: Detects vulnerabilities such as injection or insecure authentication.

  • Duplication Agent: Flags repeated or copied code.

  • Editorial Agent: Handles typos and documentation consistency.

  • etc…

Specializing allowed each agent to maintain a small, focused context, which kept precision high. The main trade-off was higher overall token consumption, since agents see overlapping context, but we managed this with effective caching strategies.
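
A rough orchestration sketch follows, with hypothetical agent names and a cachedContext helper standing in for the caching layer; for brevity, the planner is reduced here to a per-agent relevance check rather than a separate agent:

// Hypothetical micro-agent orchestration; none of these names are cubic's actual code.
interface Finding {
  file: string;
  line: number;
  body: string;
}

interface MicroAgent {
  name: string;
  relevant: (diff: string) => boolean; // planner-style check: should this agent run?
  review: (diff: string, context: string) => Promise<Finding[]>;
}

// Assumed specialist implementations and a cached shared-context builder.
declare const securityAgent: MicroAgent;
declare const duplicationAgent: MicroAgent;
declare const editorialAgent: MicroAgent;
declare function cachedContext(diff: string): Promise<string>;

const agents: MicroAgent[] = [securityAgent, duplicationAgent, editorialAgent];

async function runReview(diff: string): Promise<Finding[]> {
  // Build the shared codebase context once and cache it, so overlapping
  // agents reuse it instead of paying for it repeatedly.
  const context = await cachedContext(diff);
  // Select only the specialists the change actually needs.
  const selected = agents.filter((agent) => agent.relevant(diff));
  // Each specialist reviews with a focused prompt over its narrow scope.
  const findings = await Promise.all(selected.map((agent) => agent.review(diff, context)));
  return findings.flat();
}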

3. Real-world Outcomes

These architecture and prompt improvements led to meaningful results across hundreds of real pull requests from active open-source and private repositories. Specifically, over the past six weeks:

  • 51% fewer false positives, directly increasing developer trust and usability.

  • Median comments per pull request cut by half, helping teams concentrate on genuinely important issues.

  • Teams reported notably smoother review processes, spending less time managing irrelevant comments and more time effectively merging changes.

Additionally, the reduced noise significantly improved developer confidence and engagement, making reviews faster and more impactful.

4. Key Lessons

  1. Explicit reasoning improves clarity. Require your AI to clearly explain its rationale first—this boosts accuracy and simplifies debugging.

  2. Simplify the toolset. Regularly evaluate your agent's toolkit and remove tools that are rarely used (in fewer than 10% of tasks).

  3. Specialize with micro-agents. Keep each AI agent tightly focused on a single task, reducing cognitive overload and enhancing precision.
