MCP as Observability Interface: Connecting AI Agents to Kernel Tracepoints

Original link: https://ingero.io/mcp-observability-interface-ai-agents-kernel-tracepoints/

## The Rise of MCP and Direct Observability

The Model Context Protocol (MCP) is rapidly becoming a key interface between AI agents and infrastructure data. Recent developments, such as Datadog shipping an MCP Server and Qualys identifying the security risks these servers carry, highlight this shift. The author argues, however, that MCP's potential goes *beyond* merely wrapping existing observability platforms.

They envision a future in which the MCP server *is* the observability layer, with direct access to kernel-level data through tools like eBPF. This "MCP-native observability" lets AI analyze raw telemetry and uncover root causes that aggregated metrics cannot reach, such as quickly pinpointing a vLLM regression through CUDA trace analysis.

While wrapping existing platforms works for aggregate analysis, native MCP excels at detailed investigation. Crucially, this approach also addresses the security concerns: by building observability *into* the MCP server itself, every interaction can be fully traced and secured.

The author predicts this pattern will extend to network, security, and cost observability, allowing AI to interpret raw data directly and make informed decisions, bypassing traditional dashboards and predefined metrics. Their open-source project Ingero offers a starting point for exploring this new paradigm.

Hacker News discussion (8 points, submitted by ingero_io):

neil_naveen: Doesn't an MCP endpoint that allows an AI agent to run custom SQL queries essentially leave your monitoring database open to manipulation by a potentially malicious AI agent? For example, if the agent has full control of the DB and can't find a fix for a performance bug, it could rewrite the data and claim the bug is "solved." That's just the most worrying example I can think of.

gcifuentes: Couldn't you create an MCP eBPF module and generate probes dynamically?

Original article

TL;DR

MCP is becoming the interface between AI agents and infrastructure data. Datadog shipped an MCP Server connecting dashboards to AI agents. Qualys flagged MCP servers as the new shadow IT risk. We think both are right, and we think the architecture should go further: the MCP server should not wrap an existing observability platform. It should BE the observability layer. This post explores how MCP can serve as a direct observability interface to kernel tracepoints, bypassing traditional metric pipelines entirely.

Three signals in one week

Three things happened in the same week of March 2026 that signal where observability is headed.

Datadog shipped an MCP Server. Their implementation connects real-time observability data to AI agents for automated detection and remediation. An AI agent can now query Datadog dashboards, pull metrics, and trigger responses through the Model Context Protocol. This is a big company validating a small protocol.

Qualys published a security analysis of MCP servers. Their TotalAI team called MCP servers “the new shadow IT for AI” and found that over 53% of servers rely on static secrets for authentication. They recommended adding observability to MCP servers: logging capability discovery events, monitoring invocation patterns, alerting on anomalies.

Cloud Native Now covered eBPF for Kubernetes network observability. Microsoft Retina deploys as a DaemonSet, captures network telemetry via eBPF without application changes, and provides kernel-level drop reasons. The article draws a clear line between “monitoring” (predefined questions) and “observability” (asking questions nobody planned for).

The thread connecting all three: AI agents need direct access to infrastructure telemetry, and MCP is becoming the way they get it.

Two approaches to MCP observability

There are two ways to connect observability data to AI agents via MCP.

Approach 1: Wrap existing platforms. Datadog’s strategy. Take existing metrics, logs, and traces, already collected and aggregated, and expose them through MCP tools. The AI agent queries the dashboard API, gets pre-processed data, and acts on it. This makes sense for teams with a mature observability stack that want to add AI-powered automation on top.

Approach 2: Build MCP-native observability. This is what we did with the tracer. Instead of wrapping an existing platform, we built an eBPF agent that traces CUDA Runtime and Driver APIs via uprobes, stores the results in SQLite, and exposes everything through 7 MCP tools. The MCP interface is not an adapter layer; it is the primary interface.
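
To make the "MCP-native" idea concrete, here is a minimal sketch of the core of such a server: a SQLite file of raw trace events plus a few query functions that would be exposed as MCP tools. The `cuda_events` schema and function names are illustrative assumptions, not Ingero's actual interface.

```python
# Sketch: an MCP-native trace server is essentially a SQLite database of
# raw kernel/CUDA events plus query functions exposed as MCP tools.
# Schema and tool names here are assumptions, not Ingero's real format.
import sqlite3

def open_trace(path=":memory:"):
    """Open (or create) a trace database with a raw-event table."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS cuda_events (
        ts_ns INTEGER, api TEXT, duration_ns INTEGER, pid INTEGER)""")
    return db

def get_trace_stats(db):
    """What a tool like get_trace_stats might return: a trace summary."""
    n, total = db.execute(
        "SELECT COUNT(*), COALESCE(SUM(duration_ns), 0) FROM cuda_events"
    ).fetchone()
    return {"events": n, "total_gpu_ns": total}

def run_sql(db, query):
    """Pass-through tool: let the AI agent ask arbitrary questions."""
    return db.execute(query).fetchall()
```

The point of the sketch is that there is no adapter layer: the tools operate directly on the raw events, and anything the agent wants to aggregate it aggregates itself via `run_sql`.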

Neither approach is wrong. They solve different problems.

The wrapper approach works well for aggregate analysis: “What was the p99 latency for service X over the last hour?” The data is already summarized, indexed, and queryable.

The native approach works better for root-cause investigation: “Why did this specific GPU request take 14.5x longer than expected?” That requires raw kernel events, CUDA call stacks, and causal chains – not summaries. The AI agent needs to drill down, not roll up.

What MCP-native observability looks like in practice

Here is a concrete example. We traced a vLLM TTFT regression where the first token took 14.5x longer than baseline. The trace database captured every CUDA API call, every kernel context switch, every memory allocation.

When Claude connects to the MCP server and loads this database, it can:

  1. get_trace_stats – See the full trace summary: 12,847 CUDA events, 4 causal chains, total GPU time
  2. get_causal_chains – Read the causal chains that explain why latency spiked, in plain English
  3. run_sql – Run custom queries against the raw event data (“show me all cudaMemcpyAsync calls over 100ms”)
  4. get_stacks – Inspect call stacks for any flagged event
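
The quoted `run_sql` question ("show me all cudaMemcpyAsync calls over 100ms") translates into a one-line SQL query over the raw event table. The table and column names below are assumed for illustration, since the post does not show the schema:

```python
import sqlite3

# Build a tiny stand-in for the trace database (schema assumed).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cuda_events (ts_ns INTEGER, api TEXT, duration_ns INTEGER)")
db.executemany("INSERT INTO cuda_events VALUES (?, ?, ?)", [
    (0, "cudaMemcpyAsync", 150_000_000),   # 150 ms: should be flagged
    (1, "cudaMemcpyAsync", 2_000_000),     # 2 ms: too fast to matter
    (2, "cudaLaunchKernel", 400_000_000),  # long, but not a memcpy
])

# "show me all cudaMemcpyAsync calls over 100ms"
rows = db.execute("""
    SELECT ts_ns, duration_ns / 1e6 AS ms FROM cuda_events
    WHERE api = 'cudaMemcpyAsync' AND duration_ns > 100e6
""").fetchall()
# rows -> [(0, 150.0)]
```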

Claude identified the root cause in under 30 seconds: logprobs computation was blocking the decode loop, creating a 256x slowdown on the critical path. That root cause was not visible in any aggregate metric. It only appeared in the raw causal chain between specific CUDA API calls.

A dashboard MCP adapter could not have found this. The data granularity does not survive aggregation.

The security angle matters too

Qualys raised valid concerns about MCP server security. Their finding that 53% of servers rely on static secrets is alarming. Their recommendation to log discovery and invocation events is exactly right.

For MCP servers that touch GPU infrastructure, the attack surface is different. An MCP server with access to CUDA traces can expose timing information, memory layouts, and model architecture details. The security model needs to account for this.

In Ingero, every MCP tool invocation is traced. The same eBPF infrastructure that captures GPU events also captures the MCP interaction itself. This is not a separate logging layer; it is the same observability pipeline. Qualys’s recommendation to “add observability to MCP servers” becomes trivial when the MCP server already IS an observability tool.
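
One way to see why this becomes trivial: if tool invocations land in the same store as the GPU events, the audit trail Qualys asks for is just another table. The decorator, table name, and fields below are a hypothetical sketch, not Ingero's implementation:

```python
# Sketch: record every MCP tool invocation in the same trace store as
# the GPU events. Table name and fields are assumptions for illustration.
import functools, json, sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE mcp_invocations (
    ts_ns INTEGER, tool TEXT, args_json TEXT)""")

def traced_tool(fn):
    """Wrap an MCP tool so every call is logged before it runs."""
    @functools.wraps(fn)
    def wrapper(**kwargs):
        db.execute("INSERT INTO mcp_invocations VALUES (?, ?, ?)",
                   (time.monotonic_ns(), fn.__name__, json.dumps(kwargs)))
        return fn(**kwargs)
    return wrapper

@traced_tool
def get_trace_stats():
    # Stand-in body; a real tool would query the event tables.
    return {"events": 0}
```

Monitoring invocation patterns or alerting on anomalies then reduces to running queries against `mcp_invocations`, with the same tools the agent already has.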

Where this is going

We think the MCP-native pattern will expand beyond GPU observability. Consider:

  • Network observability: Instead of wrapping Prometheus in an MCP layer, build an eBPF-based network agent that exposes packet-level data directly to AI agents (Microsoft Retina is halfway there).
  • Security observability: Instead of wrapping a SIEM, build an MCP server that traces syscalls and exposes security events in real time.
  • Cost observability: Instead of querying a cloud billing API through MCP, instrument the actual resource allocation and expose it directly.

The pattern is the same: skip the dashboard, skip the aggregation, give the AI agent direct access to the raw telemetry. Let the agent decide what to aggregate and how.

Try it yourself

The project is open source. The investigation database from this post is available for download. Claude (or any MCP client) can connect to it and run an investigation:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero mcp --db investigations/pytorch-dataloader-starvation.db

Investigate with AI (recommended)

You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.

First, create the MCP config file at /tmp/ingero-mcp-dataloader.json:

{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/pytorch-dataloader-starvation.db"]
    }
  }
}

With Ollama (local, free):

# Install ollmcp (MCP client for Ollama)
pip install ollmcp

# Investigate with a local model (no data leaves your machine)
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp-dataloader.json

With Claude Code:

claude --mcp-config /tmp/ingero-mcp-dataloader.json

Then type /investigate and let the model explore. Follow up with questions like “what was the root cause?” or “which processes were competing for CPU time?”

Add it to a Claude Desktop config and ask: “What caused the GPU performance issues in this trace?”

The MCP server exposes 7 tools. Claude will figure out the rest.


Ingero is free and open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead. Give us a star on GitHub!

