The octopus architecture for AI agents

Original link: https://blog.goodman.dev/blog/octopus-agent-architecture/

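TorkBot uses an "octopus" architecture in which a central "brain" coordinates multiple semi-autonomous "tentacles", or lanes. The design balances three competing needs: responsiveness to surface interactions, the ability to handle complex tasks, and long-term continuity. By delegating resource-intensive work (such as I/O, tool calls, and sandbox workflows) to specialized sub-brains, the foreground model stays free and can respond quickly. All activity across platforms and threads is merged into a single, continuous foreground conversation. This not only shapes a unified personality but also lets the agent integrate information from different contexts. Lanes communicate through text-based instructions and a shared virtual filesystem, which confines the "messy" intermediate work to each tentacle's local memory. The central brain maintains only the current intent, compact summaries, and key references. This separation brings clear architectural advantages: the foreground stays stable, which improves cache efficiency and interaction speed, while the tentacles handle the churn. Ultimately, the design treats an LLM conversation as a durable, curated history. Its core idea is that future model intelligence will rely on this kind of cohesive, cross-platform architecture rather than on fragmented, task-specific bots.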

Original text

Essay

The octopus architecture describes a system with a central coordinating brain that dispatches to semi-autonomous sub-brains.

TorkBot is designed a bit like an octopus. This architecture was born from a series of dead ends and iterative improvements. When I say octopus, what I mean is that TorkBot has a centralized “brain” directing many semi-autonomous appendages, each with its own brain, reporting back to the central dispatcher.

Figure: Diagram of TorkBot's foreground lane coordinating static lanes, lane templates, spawned lane instances, sandbox snapshots, and the durable ledger.

Static lanes are the long-lived appendages. Curator is one. Plugins can contribute others, like the Google Workspace lane. Lane templates are different. A template is a capability that can be instantiated for a bounded purpose. A sandbox snapshot is different again: it is not a collaborator at all, just a saved filesystem starting point for a future sandbox-backed lane.
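The distinction between the three kinds of things can be made concrete with a small data model. This is only a sketch: the class names, fields, and the `is_collaborator` helper are hypothetical illustrations, not TorkBot's actual types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SandboxSnapshot:
    """A saved filesystem starting point. It holds no conversation."""
    name: str


@dataclass(frozen=True)
class LaneTemplate:
    """A capability that can be instantiated for a bounded purpose."""
    name: str
    needs_sandbox: bool = False


@dataclass
class Lane:
    """A running appendage with its own private context."""
    name: str
    static: bool                                  # True for long-lived lanes
    template: Optional[LaneTemplate] = None       # set for spawned instances
    snapshot: Optional[SandboxSnapshot] = None    # set for sandbox-backed lanes
    context: list[str] = field(default_factory=list)


def is_collaborator(thing: object) -> bool:
    """Only lanes can be talked to; templates and snapshots cannot."""
    return isinstance(thing, Lane)


# Static lanes: long-lived, not created from a template in this sketch.
curator = Lane("curator", static=True)
workspace = Lane("google-workspace", static=True)

# A template plus a snapshot yields a spawned, bounded-purpose lane.
coder_template = LaneTemplate("coder", needs_sandbox=True)
snapshot = SandboxSnapshot("python-dev")
job = Lane("coder-1", static=False, template=coder_template, snapshot=snapshot)
```

In this model a snapshot is plain data that a lane starts from, which mirrors the point that it is not a collaborator at all.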

Interaction vs capability

Several competing pressures pushed me into this architecture.

  1. Responsiveness to surface interactions — The agent requires a design in which its turns are more or less bounded in complexity and can avoid I/O entirely. This allows the agent to interact quickly even when tasks or work may take quite some time.
  2. Capability — The agent shouldn’t be limited in what it can accomplish just to keep turns efficient. It needs mechanisms to pursue complex tasks through delegation and be able to observe and steer those tasks close to real-time.
  3. Continuity — The agent should maintain a continuous perspective and personality. The best continuity comes from a single LLM conversation that is continually curated. In this way, the personality and short-term memory don’t need to be “added in”; instead they’re a side effect of the architecture.

These pressures pushed me into a design with multiple “lanes”, as you can see in the diagram above. The “foreground” lane is the LLM conversation users interact with through surface activity. But here, I have made a bet that is likely controversial: all activity across all surfaces goes through the same foreground conversation. Threads, channels, and even platforms are all collapsed. Right now, that cognitive complexity is perhaps beyond the ability of most models and perhaps even beyond the frontier. But I’m certain that will not be the case for long.
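As a rough illustration of collapsing surfaces, the sketch below merges events from different platforms into one time-ordered list of foreground turns. The `SurfaceEvent` type, the tag format, and the sample events are assumptions made for the example; the essay does not describe how TorkBot represents surface activity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceEvent:
    surface: str      # e.g. "slack" or "github"
    thread: str       # platform-specific thread or channel identifier
    author: str
    text: str
    timestamp: float


def to_foreground_turns(events: list[SurfaceEvent]) -> list[str]:
    """Merge events from every surface into one time-ordered conversation."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    # The surface and thread survive only as a tag on each turn.
    return [f"[{e.surface}/{e.thread}] {e.author}: {e.text}" for e in ordered]


events = [
    SurfaceEvent("github", "pr-42", "ana", "CI is red on the parser change.", 20.0),
    SurfaceEvent("slack", "#dev", "ana", "Can you look at the parser bug?", 10.0),
]
turns = to_foreground_turns(events)
```

The platform boundary becomes metadata on a turn rather than a boundary between conversations, so the model sees the Slack request and the GitHub follow-up side by side.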


Part of my thesis with TorkBot is to bet on emergent behaviour and emergent intelligence. Systems that split LLM conversations across arbitrary platform-defined boundaries are antithetical to the continuity goal. I want my agent to make links across threads and even across surfaces. I want the agent to be able to trivially pick up on GitHub the work it started in Slack. If model intelligence is not there yet, I bet it soon will be, and the agentic system designed for that world will stand above the competition in intuitiveness and power.

How the octopus works

The octopus idea is doing actual work here. It is the shape of the harness problem.

This is not jumping on the sub-agent bandwagon for the sake of clout. This is a design that emerged and earned its existence. After all, it comes back to context management. Each appendage gets its own context.

The foreground hands off work to other lanes by ‘talking’ to them. Inter-lane communication is just text, betting on the idea that pre- and post-training skew heavily towards prose as the carrier of intent. The foreground picks a lane template — and if it is a sandbox lane, a VM snapshot — and passes an initial message to that lane about what it wants. For lanes that are already spawned, a simple message is sent.
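The hand-off described above might look something like the following. This is a minimal sketch under stated assumptions: the `Foreground` class, its `spawn` and `tell` methods, and the lane and snapshot names are all hypothetical, and a plain list stands in for each lane's conversation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Lane:
    lane_id: str
    snapshot: Optional[str] = None            # name of a VM snapshot, if sandboxed
    inbox: list[str] = field(default_factory=list)


class Foreground:
    """Hypothetical dispatcher that hands work to lanes using plain prose."""

    def __init__(self, templates: dict[str, bool]):
        # Maps template name -> whether that template needs a sandbox.
        self.templates = templates
        self.lanes: dict[str, Lane] = {}

    def spawn(self, template: str, message: str,
              snapshot: Optional[str] = None) -> Lane:
        """Pick a template (and a snapshot for sandbox lanes), then say what is wanted."""
        if self.templates[template] and snapshot is None:
            raise ValueError(f"{template!r} is a sandbox lane; pick a snapshot")
        lane_id = f"{template}-{len(self.lanes) + 1}"
        lane = Lane(lane_id, snapshot=snapshot, inbox=[message])
        self.lanes[lane_id] = lane
        return lane

    def tell(self, lane_id: str, message: str) -> None:
        """An already-spawned lane simply receives another message."""
        self.lanes[lane_id].inbox.append(message)


fg = Foreground({"researcher": False, "coder": True})
lane = fg.spawn("coder", "Reproduce the bug from the Slack thread.", "python-dev")
fg.tell(lane.lane_id, "Also check whether it happens on the main branch.")
```

Note that the only payload crossing the lane boundary is prose: there is no structured task schema in this sketch, which matches the bet that text is the best carrier of intent.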

Lanes can own the messy work of doing a bunch of tool calls, hitting dead ends, doing I/O and any number of more complex sandbox-enabled workflows. That mess stays contained in the lane’s context. Lanes communicate with each other via two mechanisms:

  1. Chat, as described above; and
  2. References to virtual filesystem artifacts via the lane’s ./shared folder.
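The two mechanisms combine naturally: a lane writes bulky output to its shared folder and sends back a short chat message that points at it. The sketch below assumes an ordinary directory stands in for the virtual filesystem; `publish_artifact`, `report`, and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def publish_artifact(lane_root: Path, name: str, content: str) -> str:
    """Write an artifact into the lane's shared folder and return a reference."""
    shared = lane_root / "shared"
    shared.mkdir(parents=True, exist_ok=True)
    (shared / name).write_text(content, encoding="utf-8")
    return f"./shared/{name}"


def report(summary: str, references: list[str]) -> str:
    """Build the chat message: a short summary plus references, never contents."""
    return "\n".join([summary] + [f"See {ref}" for ref in references])


with tempfile.TemporaryDirectory() as tmp:
    lane_root = Path(tmp)
    log = "FAILED test_parse_empty\n" * 500        # bulky intermediate output
    ref = publish_artifact(lane_root, "test-run.log", log)
    message = report("Tests fail on empty input; details in the log.", [ref])
    stored = (lane_root / "shared" / "test-run.log").read_text(encoding="utf-8")
```

The message that reaches another lane stays a couple of lines long no matter how large the artifact is, and the receiver can open the file only if it needs the detail.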

The foreground conversation can stay continuous across surfaces, which is what I want for personality and cross-thread intuition, without becoming the place where every intermediate artifact goes to die. The appendage can carry the local working memory for the task. The foreground can carry the relationship, the current intent, and the synthesis.

This also makes compaction pretty obvious. Each lane is compacted asynchronously once its context passes a certain threshold, and synchronously if, through some strange turn of events, it exceeds a second, higher threshold.
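A two-threshold policy of this kind can be sketched as follows. The limits, the entry-count measure (a real system would more likely count tokens), and the `compact` stand-in for an LLM-written summary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

SOFT_LIMIT = 8    # hypothetical: schedule background compaction past this
HARD_LIMIT = 12   # hypothetical: stop and compact immediately past this


def compact(context: list[str], keep_recent: int = 3) -> list[str]:
    """Replace older entries with one summary line (stand-in for an LLM summary)."""
    if len(context) <= keep_recent:
        return list(context)
    older, recent = context[:-keep_recent], context[-keep_recent:]
    return [f"[summary of {len(older)} earlier entries]"] + recent


@dataclass
class Lane:
    context: list[str] = field(default_factory=list)
    compaction_pending: bool = False

    def append(self, entry: str) -> str:
        self.context.append(entry)
        size = len(self.context)
        if size > HARD_LIMIT:
            # Background compaction did not keep up: compact before continuing.
            self.context = compact(self.context)
            self.compaction_pending = False
            return "sync"
        if size > SOFT_LIMIT:
            # Normal path: flag the lane so a background worker compacts it.
            self.compaction_pending = True
            return "async-scheduled"
        return "ok"

    def run_background_compaction(self) -> None:
        if self.compaction_pending:
            self.context = compact(self.context)
            self.compaction_pending = False


lane = Lane()
results = [lane.append(f"step {i}") for i in range(9)]
lane.run_background_compaction()
```

The synchronous branch is the safety net: it only runs when the background path has fallen behind, which is the "strange turn of events" case.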

Benefits that fall out of this design

Mean-time-to-interaction is the prize.

Completion can take a while. It is okay for a lane to read docs, wait on I/O, run tests, hit a wall and try again. It is not okay for the foreground to go dark because one appendage is busy.
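One way to picture this is with Python's `asyncio`: the lane runs as a background task and the foreground replies without awaiting it. This is a sketch of the idea only; the function names, the queue-based reporting, and the sleep standing in for slow work are assumptions, not TorkBot's implementation.

```python
import asyncio


async def lane_work(name: str, seconds: float, reports: asyncio.Queue) -> None:
    """A lane doing slow work (docs, I/O, tests), then reporting back."""
    await asyncio.sleep(seconds)                  # stand-in for the slow part
    await reports.put(f"{name}: done")


async def foreground() -> list[str]:
    log: list[str] = []
    reports: asyncio.Queue = asyncio.Queue()
    # Start the lane without waiting for it to finish.
    task = asyncio.create_task(lane_work("coder-1", 0.05, reports))
    # The foreground answers right away instead of going dark.
    log.append("foreground: on it, I'll report back")
    log.append("foreground: (still free to answer other messages)")
    # The lane's report arrives later as just another message.
    log.append(await reports.get())
    await task
    return log


log = asyncio.run(foreground())
```

Both foreground replies are recorded before the lane's report, because the foreground never blocks on the lane's work.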

So the foreground lane has to stay small and boring: stable prompt, current intent, recent surface activity, compact summaries and references to evidence. Keep the churn in the appendages. That is both a context-efficiency story and a cache-efficiency story. A stable foreground prompt means better LLM API cache hits. Less junk means faster first tokens and less cognitive drag.
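The cache argument can be illustrated by assembling the foreground context with the stable parts first. The sketch assumes the LLM provider caches on a shared prompt prefix, as prompt-caching features commonly do; the prompt text, function names, and sample data are hypothetical.

```python
STABLE_PROMPT = "You are the foreground lane. Delegate slow work to other lanes."


def build_foreground_context(intent: str, recent_activity: list[str],
                             lane_summaries: dict[str, str],
                             max_recent: int = 5) -> list[str]:
    """Assemble the foreground context: stable parts first, volatile parts last."""
    context = [STABLE_PROMPT]                      # identical on every turn
    context.append(f"Current intent: {intent}")
    context += recent_activity[-max_recent:]       # bounded window of surface activity
    for lane_id in sorted(lane_summaries):         # compact summaries, not transcripts
        context.append(f"[{lane_id}] {lane_summaries[lane_id]}")
    return context


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count leading entries that two contexts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


turn1 = build_foreground_context(
    "triage parser bug",
    ["[slack/#dev] ana: any update?"],
    {"coder-1": "Reproduced; see ./shared/test-run.log"},
)
turn2 = build_foreground_context(
    "triage parser bug",
    ["[slack/#dev] ana: any update?", "[slack/#dev] ana: thanks!"],
    {"coder-1": "Fix ready; see ./shared/fix.patch"},
)
```

Between the two turns only the tail changes, so the leading entries are reusable. Anything that churns (tool output, retries, raw transcripts) would shorten that shared prefix if it lived in the foreground, which is one reason to keep it in the appendages.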

Curation makes that possible. Compaction keeps lane context from swelling forever. Curator can promote durable bits into memory or skills. Artifacts can stay artifacts. Transcripts can remain inspectable without being stuffed back into the foreground.

The arms can be busy. The head needs to stay available.
