The agent harness belongs outside the sandbox

Original link: https://www.mendral.com/blog/agent-harness-belongs-outside-sandbox

## Agent Harness Architecture: Inside vs. Outside the Sandbox

LLM agents depend on a core "harness", the loop of prompting, execution, and feedback, to operate. A key decision is *where* this harness runs. There are two main architectures: **inside the sandbox** (in the same container as the agent's code) and **outside the sandbox** (running on the backend, interacting with the sandbox over an API).

Running the harness *inside* the sandbox is simpler, reusing existing tooling and the local filesystem for skills and memory. However, it struggles in multi-user settings and on security: credentials sit inside the sandbox, and sandboxes are hard to scale or suspend.

The author chose the *outside* architecture. It improves security by keeping credentials separate, and it lets sandboxes be provisioned on demand, saving resources. It does, however, require solving challenges such as **durable execution** (checkpointing with Inngest), **sandbox lifecycle management** (fast resume with Blaxel), and **filesystem virtualization**.

The virtualization layer presents a unified filesystem interface, routing requests to a database for shared skills and memories and to the sandbox for local operations, sidestepping distributed-filesystem problems. Keeping API compatibility with tools like Claude Code is essential, but evolving agent capabilities and possible security leaks through tools like `bash` remain open challenges. For now, consistency across concurrent sessions is handled with a last-writer-wins approach.

A recent Hacker News discussion focused on the security of "agent harnesses", the systems used to run AI agents. The core argument, raised by user saltcured, suggests a layered sandboxing approach to reduce risk: *three* sandboxes, one for developing the agent's code, one for the tools the agent uses, and a third containing the agent itself. This layering aims to contain damage from buggy code, malicious tool execution, or ill-advised requests from the agent. Crucially, the discussion stressed keeping credentials away from the agent to avoid potential leaks through the LLM. Another user, Retr0id, clarified that this scenario most likely applies to services hosting multiple agents. Overall, deploying AI agents, especially in service-based settings, demands strong security measures.

Original article

An agent harness is the loop that drives an LLM. It sends a prompt, gets a response, executes the tool calls the model requested, feeds the results back, and repeats until the model says it's done. Every production agent has one. The question is where it runs.
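That loop can be sketched in a few lines. Everything here is illustrative: `callModel` stands in for a real LLM client and `tools` for real tool implementations; only the shape of the loop matters.

```typescript
// Minimal agent-harness loop: prompt, execute requested tools, feed results
// back, repeat until the model says it's done.
type ToolCall = { name: string; args: Record<string, string> };
type ModelTurn = { done: boolean; toolCalls: ToolCall[] };

async function runHarness(
  callModel: (transcript: string[]) => Promise<ModelTurn>,
  tools: Record<string, (args: Record<string, string>) => Promise<string>>,
  prompt: string,
): Promise<string[]> {
  const transcript = [prompt];
  while (true) {
    const turn = await callModel(transcript);           // 1. send prompt, get response
    if (turn.done) return transcript;                   // 4. stop when the model is done
    for (const call of turn.toolCalls) {
      const result = await tools[call.name](call.args); // 2. execute the tool call
      transcript.push(`${call.name} -> ${result}`);     // 3. feed the result back
    }
  }
}
```

Everything else in this post is about where this function runs and what its tool calls touch.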

There are two answers. They have different security properties, different failure modes, and different implications for what the agent can do. The tradeoffs also look different depending on whether you're building a single-user agent (one engineer on a laptop) or a multi-user one (dozens of engineers in the same organization sharing the same agent). We're in the multi-user camp, which surfaces problems single-user builders don't hit.

The two architectures

Harness inside the sandbox

The loop lives in the same container as the code it's working on. LLM calls go out from inside the container. Tool calls (bash, read, write) execute locally. Skills, memories, and anything else the harness tracks are files on the container's filesystem.

This is what the `claude` CLI does when you run it on your laptop, and what it looks like when you spin up Claude Code in a remote container. If you're building a single-user agent, you can grab the Claude Code SDK and ship something that works.


Harness outside the sandbox

The loop runs on your backend. When it needs to execute a tool, it calls into a sandbox over an API. The sandbox runs the tool and returns the result. The loop never enters the sandbox.
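From the harness's side, a tool call becomes a small RPC. The endpoint path and payload shape below are illustrative, not any particular sandbox provider's API:

```typescript
// Harness-side tool execution over RPC: the sandbox exposes a narrow HTTP
// endpoint that runs a command and returns the result. The loop never enters
// the sandbox; it only sees this response.
type ExecResult = { stdout: string; exitCode: number };

async function execInSandbox(
  sandboxUrl: string,
  command: string,
  fetchImpl: typeof fetch = fetch, // injectable so the call is testable
): Promise<ExecResult> {
  const res = await fetchImpl(`${sandboxUrl}/exec`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ command }),
  });
  return (await res.json()) as ExecResult;
}
```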

Side-by-side architecture diagram. Left: the agent loop and tools both live inside the sandbox, and the LLM call exits through the sandbox boundary. Right: both the agent loop and all the tools live on the backend alongside the credentials. Some tools reach into a separate, narrow sandbox over a tool RPC interface to run bash or touch workspace files.

Tradeoffs

Running the harness inside the sandbox has a few things going for it. The execution model is simple: one container, one process tree, one filesystem, one lifetime. You can reuse off-the-shelf harnesses as-is. Skills and memories work unchanged because they assume a local filesystem and they get one.

Running the harness outside the sandbox gets you things the inside model can't.

Your credentials stay out of the sandbox. The loop holds the LLM API keys, the user tokens, the database access. The sandbox holds only the environment the agent needs to do its work. There's nothing in there for the agent to escape to, so there's no permission model to enforce and no credential leak to contain.

You can suspend the sandbox when the agent isn't using it. A lot of what an agent does doesn't need a sandbox at all: thinking, calling APIs, summarizing, waiting for CI. Some sessions never touch a sandbox. With the harness outside, you provision one only when the agent needs to run a command, and suspend it whenever it's idle. When the harness lives inside the sandbox you can't do any of this, because you can't suspend the thing the loop is running on.

Sandboxes become cattle. If one dies mid-session, the loop provisions a new one and keeps going. When the harness runs inside, the sandbox is the session, and losing it loses the session.

And multi-user stops being a distributed filesystem problem. Several engineers in the same organization run the same agent. They share skills, they share memories, they sometimes investigate the same incident in parallel. When the harness runs outside the sandbox, this is a shared database. When it runs inside, it's the distributed filesystem problem we'll come back to.

Off-the-shelf local harnesses stop working once you move the loop out, because they all assume a local filesystem. Durable execution becomes your problem, because an agent session can run for hours and has to survive deploys. And once the harness and the sandbox live on different machines, "filesystem" stops being a thing you can point at.

We picked the outside model. The rest of this post is about the three things we had to solve to make it work.

Durable execution

An agent loop is a long-running function: minutes at a minimum, hours in our case. It has to survive rolling deploys, scale events, and instance failures. A loop that lives in memory on an API server dies the first time you ship a new version.

We already run our CI ingestion pipeline on Inngest, which we wrote about in a previous post. Extending it to the agent loop was the same decision for the same reasons: good DX, no cluster to run ourselves, and we didn't need the full generality of Temporal. The loop is an Inngest function. Each turn is a step, and Inngest checkpoints each one. If the server restarts, the loop picks up where it left off.
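The checkpointing idea can be shown without the library. This is a simplified model of what Inngest's `step.run` provides, not its actual implementation: each step's result is persisted under a key, and when the function replays after a restart, stored results are returned instead of redoing the work.

```typescript
// Simplified durable-step model: persist each step's result, skip it on replay.
class Checkpoints {
  constructor(private store: Map<string, unknown> = new Map()) {}
  async run<T>(stepId: string, fn: () => Promise<T>): Promise<T> {
    if (this.store.has(stepId)) return this.store.get(stepId) as T; // replay: no work
    const result = await fn();
    this.store.set(stepId, result); // checkpoint before moving on
    return result;
  }
}

// An agent loop whose turns are individually checkpointed. If the process dies
// at turn N and restarts against the same store, turns 0..N-1 replay for free.
async function durableLoop(cp: Checkpoints, turns: Array<() => Promise<string>>) {
  const results: string[] = [];
  for (let i = 0; i < turns.length; i++) {
    results.push(await cp.run(`turn-${i}`, turns[i]));
  }
  return results;
}
```

In the real system the store is Inngest's, not an in-memory map, which is what makes the loop survive deploys.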

Sandbox lifecycle

The loop is suspended most of the time: during LLM calls, between tool calls, while waiting on a long-running workflow like CI. We want the sandbox to be suspended too, and only active when the agent is running a command. The problem is cold starts. A cold sandbox takes seconds to spin up, which is forever inside an interactive turn.

We use Blaxel for this. Blaxel gives us 25ms resume from standby. We suspend the sandbox when the agent isn't running a command and resume it the instant it is. 25ms is low enough that the agent can't tell the sandbox was ever gone.
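The lifecycle policy reduces to a small wrapper around command execution: resume just in time, suspend the moment the command finishes. The `Sandbox` interface here is illustrative, not Blaxel's SDK.

```typescript
// Just-in-time sandbox lifecycle: the sandbox is active only while a command runs.
interface Sandbox {
  resume(): Promise<void>;   // fast resume from standby
  suspend(): Promise<void>;
  exec(cmd: string): Promise<string>;
}

async function withSandbox<T>(sb: Sandbox, f: (sb: Sandbox) => Promise<T>): Promise<T> {
  await sb.resume();
  try {
    return await f(sb);
  } finally {
    await sb.suspend(); // idle again immediately, even if the command threw
  }
}
```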

Timeline of one agent session. The agent track alternates between LLM thinking, short run-command segments, and a long stretch waiting for a CI workflow. The sandbox track mirrors it: active only during the run-command segments, suspended everywhere else, including the entire CI wait.

The filesystem

Modern agent harnesses aren't just bash and an LLM. They have skills (prompt fragments the agent reads on demand), memories (notes the agent writes for itself or the user), subagents, plans, todo lists. All of these assume a local filesystem. A skill is a file at .claude/skills/foo.md. A memory is a file at .claude/memory/MEMORY.md. The harness reads and writes them with the same read and write tools it uses for source code.

That works on a laptop. It doesn't work when the harness is outside the sandbox.

The sandbox is disposable. We treat it as ephemeral: suspended, resumed, killed, respawned. If it dies and we spin up a new one, whatever the agent wrote to .claude/memory/MEMORY.md is gone. You could keep a long-lived sandbox per session to preserve the state, but then you're back to babysitting one sandbox per session, and you lose every other property you wanted.

The other problem is multi-user. A user's laptop runs an agent for one person. Our agent runs for dozens of engineers in the same organization. Skills are organizational: everyone on a team shares the same triage playbook. Memories are too. If the agent learns on Monday that team X always deploys from a release branch, Tuesday's session for a different engineer on the same team should know.

You could pretend the sandbox has a local filesystem, write to it, and sync everything to a database on the way out. This works in the single-user case. In the multi-user case, you've just built a distributed filesystem. Two sessions running at the same time write to the same memory file, and you have to reconcile them. Three engineers trigger the agent on the same incident, and they all see stale state until their sessions end. Conflict resolution, eventual consistency, cache invalidation.

The clean answer is to stop pretending. Put memories and skills in a database. The harness reads them from the database when the agent asks for them and writes them back when the agent updates them.

But we still want the agent to think in terms of files.

One interface, two backends

The harness virtualizes filesystem access. The agent has one read tool, one write tool, one edit tool. When the agent calls them, the harness looks at the path and routes the call based on what the path means.

Paths under the workspace go to the sandbox, the way they always did. Paths under the skill and memory namespaces go to the database. A write to a memory path is a database transaction, scoped to the organization. A read to a memory path comes from the database too, so two parallel sessions in the same org see the same memory the instant it's written.

The agent doesn't know the difference. As far as it can tell, there's a filesystem and it reads and writes files. Some of those files live in Postgres. Some live in a sandbox running across the country.

A single read/write/edit tool API at the top flows into a path-dispatch router. Paths under /workspace/* route to the sandbox over RPC. Paths under /skills/* and /memory/* route to a Postgres database over SQL. One tool surface, two backends, invisible to the agent.
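A minimal sketch of the dispatch, using `/skills/`, `/memory/`, and a workspace prefix for illustration; the real layer also covers write and edit and scopes database reads to the organization:

```typescript
// Path-dispatch sketch: one read tool, two backends, invisible to the agent.
interface Backend { read(path: string): Promise<string>; }

function makeRead(sandbox: Backend, db: Backend) {
  return async (path: string): Promise<string> => {
    if (path.startsWith("/skills/") || path.startsWith("/memory/")) {
      return db.read(path);    // shared, org-scoped state lives in the database
    }
    return sandbox.read(path); // workspace files live in the sandbox
  };
}
```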

Why not just add tools

The obvious alternative is to give the agent memory_read and memory_write tools alongside read and write. That works, and it's what most people do. We did it ourselves before we had the virtualization layer.

The problem is that more tools make agents worse. Each tool dilutes the attention the model pays to every other tool, makes the prompt longer, and adds another decision the model has to make at every turn. Two tools that do almost the same thing, read and memory_read, are especially bad, because the model has to disambiguate them from context and will sometimes pick wrong.

The other reason matters more. Anthropic and everyone else training frontier models are almost certainly doing reinforcement learning on harnesses that look like Claude Code. That training shapes the models to be good at a specific API surface: read(path), write(path, content), edit(path, old, new). If you invent memory_read, you're off the trained path. You get whatever the model has learned in general, minus whatever it's learned about the exact conventions it was trained on.

The virtualized interface keeps the API surface the model was trained on and puts the database semantics where we need them on the backend.

What's still hard

The SOTA moves fast. Every few weeks a new pattern (subagents, plans, background tasks) lands in Claude Code or somewhere similar, and it almost always assumes a local filesystem. We can intercept most things, but there's always a gap between a new capability shipping and our virtualization layer handling it correctly. Not running stock Claude Code is a real cost.

We picked path prefixes (/skills/, /memory/) that mirror Claude Code's local layout, and that's probably going to bite us. Claude Code's layout is still moving, and we're one convention change away from having to migrate everything. The right answer might be to expose a different interface entirely. But see above: the whole point was to keep the interface identical to what the model was trained on.

Bash is a leak. The harness can intercept read('/skills/foo.md') because it's a structured tool call. But the agent also has a bash tool, and nothing stops it from running grep -r 'foo' /skills/ in a bash session. Bash bypasses the virtualization layer and hits the sandbox's real filesystem, where /skills/ doesn't exist. We handle this with two best-effort guards: the system prompt tells the agent not to use bash for virtualized namespaces, and we parse bash invocations with tree-sitter to catch calls that reach into those paths. Neither is airtight. It's good enough for now.
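The tree-sitter check is the sturdier of the two guards; the idea can be shown with a cruder token scan. This is an illustration of the policy, not our actual check:

```typescript
// Best-effort bash guard: flag commands whose tokens reach into virtualized
// namespaces. A token scan misses plenty (variables, subshells, quoting), which
// is why the real guard parses the command with tree-sitter instead.
const VIRTUAL_PREFIXES = ["/skills/", "/memory/"];

function touchesVirtualPaths(command: string): boolean {
  return command
    .split(/\s+/)
    .some((tok) => VIRTUAL_PREFIXES.some((prefix) => tok.includes(prefix)));
}
```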

Consistency is the part we haven't answered. When two sessions in the same organization are both updating memory, what should they see? Strict serializability is tempting and probably wrong, because agents aren't databases and making one session block on another's write opens up deadlock patterns we don't have answers for. We're running last-writer-wins per key, which is fine for the cases we've hit and almost certainly going to break in ways we can predict.
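Last-writer-wins per key reduces to a timestamp comparison on write. A sketch of the policy, with an in-memory map standing in for the real store:

```typescript
// Last-writer-wins per key: a write only lands if it is at least as new as the
// stored entry, so a stale write that arrives late is silently dropped.
type Entry = { value: string; writtenAt: number };

class LwwMemory {
  private entries = new Map<string, Entry>();
  write(key: string, value: string, writtenAt: number): void {
    const existing = this.entries.get(key);
    if (existing && existing.writtenAt > writtenAt) return; // stale write loses
    this.entries.set(key, { value, writtenAt });
  }
  read(key: string): string | undefined {
    return this.entries.get(key)?.value;
  }
}
```

The failure mode is exactly the one named above: when two sessions write different values for the same key, one of them silently disappears.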
