TREX: An AI code reviewer that runs your code

Original link: https://www.greptile.com/blog/trex-code-execution

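Shlok at Greptile introduces **TREX (Test, Run, Execute)**, an execution layer designed to overcome the limits of static code review. Traditional AI tools are usually limited to reading code, so they tend to miss dynamic bugs such as UI regressions, race conditions, and state-dependent logic errors. TREX addresses this through orchestration: the main Greptile agent identifies potential issues in a pull request and delegates them to specialized subagents running in parallel. These subagents spin up disposable sandboxed environments to execute the code, verify behavior, and produce "artifacts" (screenshots, logs, and videos) that prove their findings. By providing verifiable evidence, TREX lets developers see exactly how and where a bug occurs instead of relying on an abstract summary. The system is model-agnostic, so Greptile can switch between frontier models without rebuilding its infrastructure. Ultimately, TREX goes beyond simple code scanning to become a comprehensive validation suite that aims to automate the rigorous end-to-end testing traditionally done by human engineering teams.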


Original text

I'm Shlok, a software engineer at Greptile. We recently built a code reviewer that, in addition to reviewing pull requests, actually runs the code and shows you what went wrong.

In 1976, Michael Fagan published a paper introducing formal code inspection at IBM. Developers would print out listings, sit in a room together, and read through the code line by line.

Today we still read a diff on a screen. AI tools have made that faster, though most of them are still just reading the code. This approach works for a lot of bugs, the ones that announce themselves plainly in code.

The problem is that there's a whole category of bugs that don't show up in code at all; they exist only when the program is running. Think of the logic error that needs a specific sequence of states, the UI regression that appears only after the page loads, or the race condition that needs a real request. You can read the diff perfectly and still miss these bugs completely.

Static code review has a ceiling. It can reason about what the code says. It can't tell you what it does. TREX (which stands for "Test, Run, Execute") is Greptile's response to that ceiling: an execution layer built directly into code review.

Orchestrating agents without wasting context

TREX started as a completely separate product from Greptile, as a standalone agent that generated and ran tests. We hoped that bugs would surface as a result. They didn't. Generating tests wasn't the same activity as finding bugs. When the separate TREX agent tried to write tests, the tests weren't relevant to what the user was trying to do. This created unnecessary noise, and it also missed edge cases. This sounds obvious in hindsight, but it took us more time than expected to learn this lesson.

We'd built these agents to be separate on the assumption that it would give each agent its own context window. It also meant the two agents ran without sharing knowledge. They often overlapped, exploring the same parts of the codebase twice without either agent knowing what the other had already found, which wasted compute.

Combining them into one agent seemed like the obvious fix. We tried that and ran into a different problem: a single agent handling the full review got overloaded. Between spinning up services, taking screenshots, and running tests, there was too much context for one agent to manage cleanly.

The solution was to make TREX share context with the main Greptile reviewer rather than exist as an entirely separate product. It was the first time we were managing agents from within an agent. Unlike an independent agent, TREX doesn't start from scratch. It inherits what the Greptile reviewer agent already found, has its own context window, and is scoped to the specific problem it's been asked to investigate.

The Greptile reviewer agent acts as an orchestrator. It reads the diff, identifies issues worth investigating, and spins up a dedicated TREX agent per issue, all running in parallel. The TREX agents have the liberty, the compute, and the knowledge of the orchestrator agent.
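The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not Greptile's implementation: `Issue`, `Finding`, `run_subagent`, and `orchestrate` are hypothetical names, and the subagent is a stub that only simulates work.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Issue:
    """A suspected problem the orchestrator wants verified by execution."""
    title: str
    context: str  # what the reviewer already learned about this issue


@dataclass
class Finding:
    issue: Issue
    confirmed: bool
    notes: str


async def run_subagent(issue: Issue) -> Finding:
    """Hypothetical stand-in for one scoped subagent.

    A real subagent would execute code in a sandbox; this stub only
    simulates work so the fan-out pattern is runnable.
    """
    await asyncio.sleep(0.01)
    confirmed = "race" in issue.title.lower()
    return Finding(issue, confirmed, f"investigated with context: {issue.context}")


async def orchestrate(issues: list[Issue]) -> list[Finding]:
    # One subagent per issue, all started concurrently.
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(run_subagent(i) for i in issues))


issues = [
    Issue("Race condition in cart update", "two writers touch cart.total"),
    Issue("Settings page layout", "CSS grid changed in diff"),
]
findings = asyncio.run(orchestrate(issues))
```

Each subagent receives only the context for its own issue, which mirrors the idea of inheriting the orchestrator's knowledge while staying scoped to one problem.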

A good example of this is a UI feature hidden behind an auth gate. Testing it locally means setting up the environment, handling authentication, and getting the feature flag into the right state. A subagent figures all of that out on its own and comes back with a screenshot of the rendered feature.

The first version of TREX output findings as bullet points listing out what was tested and what happened. This was a reasonable starting point, but it didn't provide sufficient information.

An agent or a human reviewer reading a bullet point like, "Tested the checkout flow, found failure" wouldn't find it very useful. They wouldn't be able to tell where in the process something went wrong. If the test failed, was it the setup? The assertion? An environment issue? We found an early version of the agent would sometimes hallucinate about how thoroughly it had tested something, claiming to have tried something it hadn't. Bullet points gave us no way to verify.

The fix was to back the bullet point list with a multi-modal artifact set for each TREX finding: screenshots, logs, API traces, execution scripts. Each modality covers a different part of the story. Having a comprehensive picture of everything that was tested for a specific issue is what actually matters.
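One way to picture an artifact-backed finding is a record in which every claimed step must point at some piece of evidence. The sketch below is an assumption about shape only: `ArtifactKind`, `Artifact`, `Finding`, and `unsupported_steps` are hypothetical names, and the paths are made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ArtifactKind(Enum):
    SCREENSHOT = "screenshot"
    LOG = "log"
    API_TRACE = "api_trace"
    SCRIPT = "script"


@dataclass(frozen=True)
class Artifact:
    kind: ArtifactKind
    path: str  # hypothetical storage location
    step: str  # which claimed step this artifact is evidence for


@dataclass
class Finding:
    summary: str
    steps: list[str]
    artifacts: list[Artifact] = field(default_factory=list)

    def unsupported_steps(self) -> list[str]:
        """Return claimed steps that no artifact backs up."""
        evidenced = {a.step for a in self.artifacts}
        return [s for s in self.steps if s not in evidenced]


finding = Finding(
    summary="Checkout fails after applying a coupon",
    steps=["start services", "apply coupon", "submit order"],
    artifacts=[
        Artifact(ArtifactKind.LOG, "artifacts/services.log", "start services"),
        Artifact(ArtifactKind.SCREENSHOT, "artifacts/order-error.png", "submit order"),
    ],
)
missing = finding.unsupported_steps()
```

Here `missing` flags the "apply coupon" step, which is the kind of unverified claim that a bare bullet list would hide.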

The first artifact that made us say "Wow" was video. If you push an animation change, TREX captures a video of it playing. You can see exactly what the animation looks like without opening a local environment.

Artifacts also need to be trustworthy. Every artifact has to give the reviewer enough to verify the run themselves. The screenshots, logs, traces, and scripts are all there so a person or downstream agent can look at exactly what happened and confirm it. Bad evidence is worse than no evidence.

The reason artifacts matter, especially for agents downstream, is the same reason teachers require students to show their work. It's analogous to grade school math; you don't know where your answer was wrong until you show the steps. Agents are the same way. If an artifact shows proof of everything the test did to reach a conclusion, the agent can identify exactly which step went wrong. Without that trace, all it has is the answer, and the answer tells you nothing about where to fix it.

If TREX finds a bug, it becomes a comment on the PR. If it runs a feature and everything works, that goes in the summary as proof the change was actually tested. Not every run needs to find something wrong to be useful.
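The routing rule in this paragraph is simple enough to sketch directly. `RunResult` and `route` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    description: str
    bug_found: bool


def route(results: list[RunResult]) -> tuple[list[str], list[str]]:
    """Split results into PR comments (bugs) and summary lines (verified runs)."""
    comments = [r.description for r in results if r.bug_found]
    summary = [f"Verified: {r.description}" for r in results if not r.bug_found]
    return comments, summary


comments, summary = route([
    RunResult("Cart total is wrong after concurrent updates", bug_found=True),
    RunResult("New animation plays on page load", bug_found=False),
])
```

Passing runs are kept rather than discarded, so the summary doubles as a record of what was actually exercised.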

The frontier race between model providers moves fast. A model that leads on code tasks one month can be behind the next. Building tightly around any single provider's API means rebuilding when rankings shift. That's not a viable long-term strategy.

From the start, we designed TREX around a model-agnostic harness that allows hot-swapping between frontier models without rebuilding. The flexibility goes deeper than most people expect: the main agent and the subagents can use different providers. We can have multiple models running within the same review. This makes it easy for us to pick the best model at any given point, based on internal evals.
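A model-agnostic harness usually comes down to depending on a narrow interface rather than a vendor SDK. The following is a minimal sketch under that assumption; the `Model` protocol, `EchoModel`, and `Review` are hypothetical, and a real adapter would call a provider API where the stub formats a string.

```python
from dataclasses import dataclass
from typing import Protocol


class Model(Protocol):
    """The only surface the harness depends on (an assumed interface)."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Hypothetical provider stub; a real adapter would call a vendor API."""
    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class Review:
    orchestrator: Model
    subagent: Model  # may come from a different provider than the orchestrator

    def run(self, diff: str) -> list[str]:
        plan = self.orchestrator.complete(f"find issues in: {diff}")
        check = self.subagent.complete(f"verify: {plan}")
        return [plan, check]


review = Review(orchestrator=EchoModel("provider-a"), subagent=EchoModel("provider-b"))
outputs = review.run("diff --git a/cart.py b/cart.py")

# Swapping a model is a one-line change; the harness code is untouched.
review.subagent = EchoModel("provider-c")
swapped = review.run("diff --git a/cart.py b/cart.py")
```

Because `Review` only sees the `complete` method, the orchestrator and the subagents can use different providers within the same review.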

Our current evaluation involves measuring recall (e.g., how many real bugs are caught, measured against open-source PRs or customer data where comments were addressed) and precision (e.g., consistency across runs: if you review the same PR twice, are you finding roughly the same set of issues?).
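As a rough illustration of the two measurements, recall can be computed against a set of known real bugs, and run-to-run consistency can be approximated with set overlap. The Jaccard index used here is an assumption chosen for the sketch, not necessarily the metric Greptile uses, and the issue labels are invented.

```python
def recall(found: set[str], known_bugs: set[str]) -> float:
    """Fraction of known real bugs that the review caught."""
    if not known_bugs:
        return 1.0
    return len(found & known_bugs) / len(known_bugs)


def run_consistency(run_a: set[str], run_b: set[str]) -> float:
    """Jaccard overlap between the issue sets of two reviews of the same PR."""
    union = run_a | run_b
    if not union:
        return 1.0
    return len(run_a & run_b) / len(union)


known = {"null-deref", "race-cart", "bad-redirect", "off-by-one"}
first = {"null-deref", "race-cart", "bad-redirect", "style-nit"}
second = {"null-deref", "race-cart", "off-by-one"}

r = recall(first, known)            # 3 of the 4 known bugs were found
c = run_consistency(first, second)  # 2 shared issues out of 5 distinct ones
```

A low consistency score on the same PR suggests the reviewer is sampling issues rather than finding them reliably, even when recall on a single run looks acceptable.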

We intentionally deprioritize latency in our evaluation. A developer waiting on a review would rather wait a little longer and get something accurate than get a fast answer they can't trust.

The open source evaluation harness we use performs on par with native provider harnesses. There's no meaningful quality penalty for being model-agnostic which, if you'd asked me to guess before we tested it, I wouldn't have been confident about.

TREX's differentiation is not which model it's running. It's the infrastructure around the model: the codebase indexing, the orchestration, the artifact generation, the evaluation framework. We don't need to care about intelligence as a differentiator, because intelligence is something that will continuously improve. Greptile's job is to build the harness, the architecture, and the artifact pipeline that makes the difference in practice.

Every TREX review spins up a disposable sandboxed environment: an isolated compute instance per review, started fresh in milliseconds and thrown away when the run is done. These environments start fast, isolate execution properly, and can run real projects: not just unit tests against a mock, but actual services with actual dependencies. Many bugs only appear end-to-end.
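The create, run, discard lifecycle can be sketched with a context manager. Note the loud assumption: a temporary directory stands in for the isolated compute instance and provides no real isolation, so this shows only the lifecycle, not the sandboxing. `disposable_workspace` and `run_in_workspace` are hypothetical names.

```python
import subprocess
import sys
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def disposable_workspace():
    """Yield a fresh directory that is deleted when the block exits.

    A temp directory is only a stand-in: it provides no real isolation.
    """
    with tempfile.TemporaryDirectory(prefix="review-") as path:
        yield Path(path)


def run_in_workspace(source: str) -> tuple[int, str, bool]:
    """Run a script in a throwaway workspace and report whether it was cleaned up."""
    with disposable_workspace() as ws:
        script = ws / "check.py"
        script.write_text(source)
        proc = subprocess.run(
            [sys.executable, str(script)],
            cwd=ws, capture_output=True, text=True, timeout=30,
        )
        existed = ws.exists()
    # After the with-block, the workspace has been removed.
    cleaned = existed and not ws.exists()
    return proc.returncode, proc.stdout.strip(), cleaned


code, out, cleaned = run_in_workspace("print(2 + 2)")
```

The captured return code and output are the raw material for logs and other artifacts, and nothing from the run survives into the next review.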

Starting from nothing every time would be too slow. So we rely on reusable base images and per-repository snapshots. A repository can be cloned once, captured, and resumed. Each review still fetches the exact PR commits and rotates credentials before execution begins. The cache speeds up setup without freezing stale state into the environment. A cache that includes too little is slow. A cache that includes too much becomes haunted. TREX needs the useful kind: warm enough to move quickly, fresh enough to trust.
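The split between what is cached and what is always fresh might look something like this. It is a sketch under stated assumptions: `SnapshotCache` and `Sandbox` are hypothetical, the snapshot store is an in-memory dictionary, and the repository name and commit hashes are invented.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Sandbox:
    repo: str
    commit: str
    token: str
    restored_from_snapshot: bool


class SnapshotCache:
    """Hypothetical per-repository snapshot store (in-memory for the sketch)."""

    def __init__(self) -> None:
        self._snapshots: dict[str, str] = {}

    def prepare(self, repo: str, pr_commit: str) -> Sandbox:
        warm = repo in self._snapshots
        if not warm:
            # Cold path: clone once and capture a snapshot for later reviews.
            self._snapshots[repo] = f"snapshot-of-{repo}"
        # Always check out the exact PR commit on top of the snapshot and
        # always issue a fresh credential; neither is ever cached.
        return Sandbox(repo, pr_commit, secrets.token_hex(16), warm)


cache = SnapshotCache()
first = cache.prepare("acme/shop", "abc123")
second = cache.prepare("acme/shop", "def456")
```

Only the slow, stable part (the clone) is reused. The commit and the credential are per-review inputs, which keeps stale state out of the cache.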

The sandbox is what makes the artifacts trustworthy. When TREX reports that a page rendered with a broken layout, you know the code actually ran in a real environment. That's a different claim than, "The code looks like it might render with a broken layout."

These pieces — the subagent architecture, the artifact verification bar, the sandboxed execution environment, and the evaluation harness — aren't separate features. They're one system. The orchestrator identifies issues worth running. The sandbox makes running safe and fast. The artifact pipeline makes the results trustworthy enough to act on. The evaluation harness makes sure the right model is doing the work. Each one makes the others more valuable. The review it comes up with is a reproducible experiment with attached evidence.

Our vision is bigger than code review as it exists today. The goal is a world with no bugs. To get there, we're no longer thinking of ourselves as a code review tool. We want to be a validation suite: an end-to-end layer that mimics what engineering teams have done for decades, but automated and running on every PR. TREX is another step toward that vision.

