Show HN: RLM-based local debugger for AI agent traces

Original link: https://github.com/context-labs/halo

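**HALO** is a framework and toolset for building recursively self-improving agent harnesses using the RLM (Recursive Language Model) approach. It analyzes production execution traces to identify systemic failures, such as hallucinated tool calls or refusal loops, that general-purpose coding agents often miss, and uses them to improve agent performance.

**Key features:**

* **Automated optimization loop:** HALO collects OpenTelemetry-compatible traces from your agent, feeds them into the HALO-RLM engine for pattern diagnosis, generates actionable prompt or architecture fixes, and redeploys for continuous improvement.
* **Specialized analysis:** Unlike general-purpose coding assistants, which may overfit to a single error, HALO uses a specialized engine to generalize findings across high-traffic, high-variance agent behavior.
* **Performance gains:** On benchmarks such as AppWorld, HALO produced double-digit success-rate gains for models such as Gemini 3 Flash and Sonnet 4.6.

**Quick start:** Developers can install the command-line tool (CLI) with `pip install halo-engine` or use the HALO desktop app. It fits into existing workflows: a JSONL trace file and an OpenAI-compatible API key are enough to start diagnosing and optimizing an agent system.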

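HALO (Hierarchical Agent Loop Optimizer) is a new open-source tool for debugging and optimizing AI agents by analyzing their execution traces. Unlike traditional manual approaches, HALO uses a Recursive Language Model (RLM) to decompose large datasets into smaller, manageable subproblems. This lets developers identify systemic failure modes across thousands of traces, a volume that typically exceeds the context window or reasoning capacity of a standard LLM.

The tool supports OTEL-compliant traces (for example, from Langfuse or Arize) and can optionally ingest local codebase context to produce more precise, actionable insights. Through an iterative loop of analyzing, reporting, fixing, and re-running, HALO helps maintain agent performance at production scale. HALO ships as a local desktop app, so developers can run deep analysis without complex configuration or data-privacy concerns.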

Original text

✨ RLM-based agent optimizer using production traces✨

Install the HALO desktop app with:

curl -fsSL https://inference.net/halo/install.sh | sh

The installer downloads the latest release for your platform and sets up the desktop app. macOS uses a signed, notarized DMG. You can also install directly from the GitHub releases page.

If you're looking for a hosted, plug-and-play version of HALO, please sign up for inference.net and follow the instructions here.

HALO is a methodology for building recursively self-improving agent harnesses using RLMs. This repository contains:

  • The HALO Desktop App for running HALO locally on your machine.
  • Information on the HALO methodology.
  • A Python package, published on PyPI, that implements the core HALO-RLM engine.
  • A demo project that shows how to build HALO loops for your agents using the Python package.
  • Benchmarking examples that apply HALO to popular agent benchmarks such as AppWorld.

The core HALO loop is surprisingly simple:

  1. Collect execution traces from your agent harness. HALO uses OpenTelemetry-compatible tracing.
  2. Feed the traces into the HALO-RLM engine.
  3. The engine decomposes the traces to understand common failure modes across harness executions and produces a report with its findings.
  4. This report is fed into a coding agent like Cursor or Claude Code to generate and apply a set of changes to your harness.
  5. The harness is then re-deployed, more traces are gathered, and the cycle repeats.
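
The loop above can be sketched as a small driver. This is a minimal sketch of the control flow only, not HALO's implementation: `collect`, `analyze`, and `apply_fixes` are hypothetical callables that you would supply (for instance, `analyze` might shell out to the halo CLI and `apply_fixes` might hand the report to a coding agent).

```python
from typing import Callable, List

# Hypothetical callables: none of these names come from HALO itself.
CollectTraces = Callable[[], str]   # runs the harness, returns a trace file path
Analyze = Callable[[str], str]      # trace path -> report text
ApplyFixes = Callable[[str], bool]  # report -> True if the harness was changed


def halo_loop(collect: CollectTraces, analyze: Analyze,
              apply_fixes: ApplyFixes, max_rounds: int = 3) -> List[str]:
    """Run collect -> analyze -> fix until no change is made or max_rounds is hit."""
    reports: List[str] = []
    for _ in range(max_rounds):
        trace_path = collect()         # steps 1 and 5: gather traces from the current harness
        report = analyze(trace_path)   # steps 2 and 3: the engine produces a findings report
        reports.append(report)
        if not apply_fixes(report):    # step 4: a coding agent edits the harness
            break                      # nothing changed, so another round adds nothing
    return reports
```

Bounding the loop with `max_rounds` is a design choice of this sketch; in production the cycle would more likely be triggered by new deployments than by a fixed counter.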

HALO is great at finding issues in production agent deployments. We find high-traffic environments tend to generate more data with higher variance across executions, creating the type of issues that HALO is great at identifying.

A general-purpose harness like Claude Code is the wrong tool for trace analysis. This isn’t because the model isn’t smart, but because traces can get extremely long, and you need a specialized toolkit to make observations about systemic agentic behavior. In our testing, harnesses like Claude Code would often overfit to an error present in one or a few traces rather than generalize to harness-level problems. This led us to create a specialized form of an RLM.

Install the HALO engine + CLI from PyPI:

pip install halo-engine

# Verify installation
halo --help

  1. Integrate Tracing
  2. Collect traces by running your agent
  3. Run the HALO engine

export OPENAI_API_KEY=...
# Optional: point HALO at another OpenAI-compatible provider.
export OPENAI_BASE_URL=https://openrouter.ai/api/v1

halo path_to_your_traces.jsonl -p "Diagnose errors you find and suggest fixes"
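
Before spending tokens on a run, it can be worth checking that the trace file is well-formed JSONL. The sketch below only checks that each non-empty line parses as a JSON object. It assumes nothing about the span schema HALO expects, and `check_jsonl` is a hypothetical helper, not part of the halo CLI.

```python
import json
from pathlib import Path
from typing import List, Tuple


def check_jsonl(path: str) -> Tuple[int, List[int]]:
    """Return (count of valid JSON-object lines, 1-based numbers of lines that failed)."""
    valid, bad = 0, []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not counted as errors
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if isinstance(record, dict):
                valid += 1
            else:
                bad.append(lineno)  # parsed, but not a JSON object
    return valid, bad
```

A non-empty second element points at the exact lines to inspect before handing the file to the engine.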

HALO uses the canonical OpenAI env vars: OPENAI_API_KEY for credentials and OPENAI_BASE_URL for OpenAI-compatible providers. If OPENAI_BASE_URL is unset, HALO uses https://api.openai.com/v1. Run halo --help to see all CLI options. The CLI mirrors the model/provider settings exposed by the Python SDK's ModelConfig and ModelProviderConfig.
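
The fallback order for the base URL can be written out explicitly. This is a sketch of the behavior as described here, not HALO's code: `resolve_base_url` is a hypothetical name, and treating an explicit --base-url value as taking priority over the environment variable is an assumption based on that flag's documented default.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_base_url(flag_value: Optional[str] = None,
                     env: Mapping[str, str] = os.environ) -> str:
    """Pick the API base URL: explicit flag, then OPENAI_BASE_URL, then the OpenAI default."""
    # Assumption: an empty string is treated the same as "unset".
    return flag_value or env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
```

Passing `env` as a parameter keeps the function easy to test without mutating the real process environment.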

| Flag | Default | Description |
| --- | --- | --- |
| TRACE_PATH | required | JSONL trace file |
| --prompt, -p | required | User prompt sent to the root agent |
| --model, -m | gpt-5.4-mini | Model name for root and subagent calls; also the fallback for synthesis and compaction |
| --synthesis-model | --model | Model for synthesis calls (trace summarization). A small, cheap model (e.g. gpt-4.1-nano) is recommended |
| --compaction-model | --model | Model for compaction calls (context summarization), the biggest token consumer in large runs. A small, cheap model (e.g. gpt-4.1-nano) is recommended |
| --max-depth | 2 | Max subagent recursion depth |
| --max-turns | 20 | Max turns per agent |
| --max-parallel | 10 | Max concurrent subagents |
| --base-url | OPENAI_BASE_URL / https://api.openai.com/v1 | OpenAI-compatible API base URL |
| --api-key | OPENAI_API_KEY | Provider API key |
| --header, -H | unset | Provider header as NAME: VALUE. Repeat for multiple headers, matching curl's -H convention |
| --temperature | provider default | Sampling temperature forwarded to the model |
| --max-output-tokens | provider default | Maximum output tokens forwarded to the model |
| --parallel-tool-calls / --no-parallel-tool-calls | enabled | Allow models to issue parallel tool calls |
| --refusal-retries | 0 | Retry an agent model request this many times when the model refuses |
| --reasoning-effort | model/provider default | Reasoning effort for root and subagent calls |
| --telemetry | off | Emit OpenInference traces of HALO's own LLM, tool, and agent activity |

For example:

halo path_to_your_traces.jsonl \
  -p "Diagnose errors you find and suggest fixes" \
  --base-url https://openrouter.ai/api/v1 \
  -H "HTTP-Referer: https://example.com"
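
If you launch HALO from a script rather than a shell, the same invocation can be assembled as an argument list. This is a minimal sketch that uses only flags documented above; `build_halo_command` and `run_halo` are hypothetical helper names, not part of the halo-engine package.

```python
import subprocess
from typing import Dict, List, Optional


def build_halo_command(trace_path: str, prompt: str,
                       base_url: Optional[str] = None,
                       headers: Optional[Dict[str, str]] = None) -> List[str]:
    """Assemble an argv list for the halo CLI from documented flags only."""
    cmd = ["halo", trace_path, "-p", prompt]
    if base_url:
        cmd += ["--base-url", base_url]
    for name, value in (headers or {}).items():
        cmd += ["-H", f"{name}: {value}"]  # one -H per header, as with curl
    return cmd


def run_halo(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command; requires halo-engine installed and OPENAI_API_KEY set."""
    return subprocess.run(cmd, check=True)
```

Building a list instead of a shell string avoids quoting problems when the prompt contains spaces or quotes.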

HALO can emit OpenInference-shaped traces of its own LLM, tool, and agent activity. Telemetry is off by default; nothing is emitted unless you pass --telemetry.

halo TRACE_PATH --prompt "..." --telemetry

When telemetry is enabled and CATALYST_OTLP_TOKEN is set, spans are uploaded to inference.net Catalyst over OTLP. If the token is unset, spans are written to a local JSONL file at ./halo-telemetry-{run_id}.jsonl in the current working directory.

| Var | Default | Purpose |
| --- | --- | --- |
| CATALYST_OTLP_TOKEN | unset | If set, uploads to Catalyst over OTLP. If unset, writes JSONL locally |
| CATALYST_OTLP_ENDPOINT | catalyst-tracing default | OTLP endpoint base URL, for example https://telemetry.inference.net |
| CATALYST_DEBUG | unset | Set to 1 to surface OTLP export errors |
| CATALYST_TRACING_RUN_ID | unset | Uses this HALO run id instead of a generated uuid |
| CATALYST_TRACING_* | unset | Generic catalyst-tracing passthrough |
| HALO_TELEMETRY_PATH | ./halo-telemetry-{run_id}.jsonl | Local fallback file path. Only used when CATALYST_OTLP_TOKEN is unset |
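
The decision between OTLP upload and the local file can be summarized in a few lines. This is a sketch of the documented rules only, not HALO's code: `telemetry_destination` is a hypothetical name, the catalyst-tracing default endpoint is not known here and is returned as an empty string, and whether HALO_TELEMETRY_PATH supports a {run_id} placeholder is not assumed.

```python
import os
from typing import Mapping, Tuple


def telemetry_destination(run_id: str,
                          env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return ("otlp", endpoint or "") or ("file", path) for a run with --telemetry on."""
    if env.get("CATALYST_OTLP_TOKEN"):
        # Token set: spans go over OTLP; an empty endpoint means "library default".
        return "otlp", env.get("CATALYST_OTLP_ENDPOINT", "")
    # Token unset: fall back to a local JSONL file.
    path = env.get("HALO_TELEMETRY_PATH") or f"./halo-telemetry-{run_id}.jsonl"
    return "file", path
```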

We have provided a simple demo and an AppWorld demo.

The engine exposes six entry points from engine.main: three async functions, each with a sync counterpart. Use whichever matches the trade-off you want between observability and code simplicity. The yielded types (AgentOutputItem and AgentTextDelta) are defined in engine/models/engine_output.py:

| Function | Sync / async | Returns | When to use |
| --- | --- | --- | --- |
| stream_engine_async | async | AsyncIterator[AgentOutputItem \| AgentTextDelta] | You want every event including streaming-token deltas (live UI, custom rendering). |
| stream_engine_output_async | async | AsyncIterator[AgentOutputItem] | You want to log / persist each completed step (assistant message, tool call, tool result) as it lands. |
| run_engine_async | async | list[AgentOutputItem] | You want the final list at the end and don't care about per-step observability. |
| stream_engine | sync | Iterator[AgentOutputItem \| AgentTextDelta] | Sync generator; yields every event including deltas. Drives the async iterator on a private event loop. |
| stream_engine_output | sync | Iterator[AgentOutputItem] | Sync generator; yields completed items only. Same shape as the async variant for sync callers. |
| run_engine | sync | list[AgentOutputItem] | Sync, collects to a list. Pure convenience over asyncio.run(run_engine_async(...)). |

import logging

from engine.main import stream_engine_output_async

logger = logging.getLogger(__name__)

async def log_steps(messages, cfg, trace_path):
    # messages, cfg, and trace_path are prepared by the caller.
    async for item in stream_engine_output_async(messages, cfg, trace_path):
        logger.info("step", extra={"sequence": item.sequence, "agent": item.agent_name})
        # item.item is an AgentMessage (assistant / tool / etc.)
HALO is consistently capable of driving improvements on benchmarks, solely by optimizing the harness.

We applied HALO to the AppWorld benchmark, a set of agentic tasks that assess the LLM’s ability to use multi-app services like Spotify, Venmo, file systems, and phone contacts. We tested HALO’s ability to improve harnesses for both Gemini 3 Flash and Sonnet 4.6. We iterated on the harness using the dev split, and then used the test_normal split as a proxy to verify that improvements did not come from overfitting.

The feedback from the HALO engine surfaced failures in the harnesses such as hallucinated tool calls, redundant arguments in tools, refusal loops, and semantic correctness issues. Each issue mapped cleanly to a direct prompt edit. HALO’s claims were independently verified against the source trace files, and the findings held up under scrutiny.

The peak improvements over baseline were substantial for both models, measured in AppWorld's Scenario Goal Completion (SGC) metric. For Gemini 3 Flash, dev SGC went from 36.8% to 52.6% (+15.8 percentage points) and test_normal SGC went from 37.5% to 48.2% (+10.7 percentage points). For Sonnet 4.6, dev SGC went from 73.7% to 89.5% (+15.8 percentage points) and test_normal SGC went from 62.5% to 73.2% (+10.7 percentage points).
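
The gains above are absolute differences in percentage points, which is not the same as relative improvement. The short sketch below recomputes both from the quoted figures; the numbers are copied from the paragraph above, and `improvement` is a helper introduced only for this illustration.

```python
from typing import Dict, Tuple

# (baseline %, peak %) pairs copied from the reported results.
RESULTS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("Gemini 3 Flash", "dev"): (36.8, 52.6),
    ("Gemini 3 Flash", "test_normal"): (37.5, 48.2),
    ("Sonnet 4.6", "dev"): (73.7, 89.5),
    ("Sonnet 4.6", "test_normal"): (62.5, 73.2),
}


def improvement(baseline: float, peak: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = round(peak - baseline, 1)
    relative = round(100 * (peak - baseline) / baseline, 1)
    return points, relative


if __name__ == "__main__":
    for (model, split), (base, peak) in RESULTS.items():
        points, relative = improvement(base, peak)
        print(f"{model:15s} {split:12s} +{points} points ({relative}% relative)")
```

Because the Gemini 3 Flash baseline is lower, the same absolute gain corresponds to a larger relative improvement for that model than for Sonnet 4.6.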

Local development against this repo uses uv for dependency management and go-task as the task runner.

git clone https://github.com/context-labs/HALO
cd HALO
task env:setup

task env:setup installs uv (if missing), syncs the venv from uv.lock, and configures the repo's git hooks. After that, the halo CLI is available via uv run halo ... (or activate .venv/).

Run task --list for the full list. The ones you'll use most:

| Task | What it does |
| --- | --- |
| task check | Run all pre-commit checks: pinned-versions, lint, format, typecheck, unit tests |
| task check:fix | Same, but auto-fix lint/format issues |
| task test:unit | Unit tests under tests/unit/ |
| task test:integration | Integration tests under tests/integration/ |

License: MIT

Contributions are welcome! Please feel free to submit a pull request.
