```Show HN: Nightwatch，开源的只读 AI SRE```

```Show HN: Nightwatch，开源的只读 AI SRE```
Show HN: Nightwatch, The open-source, read-only AI SRE

原始链接: https://github.com/ninoxAI/nightwatch

**ninoxAI** 是一款开源且只读的 AI SRE 工具，旨在消除告警疲劳并简化事件响应流程。作为一层与监控系统无关的智能层，它能将来自 Prometheus、Grafana 和 Kubernetes 等工具的海量告警整合为单一的、可执行的事件。该平台充当“人在回路”（human-in-the-loop）的调查员角色： * **告警分诊：** 它通过聚类症状并识别频繁抖动或无效的检查，提供清晰且基于证据的降噪。 * **根本原因分析：** 利用具备工具调用能力的 AI 智能体，检查您的实时基础设施（包括日志、云元数据和 Git 历史记录），从而构建诊断假设。 * **安全修复：** 它会按风险等级提供人工审批后的具体修复方案。重要的是，ninoxAI 是**只读的**；它绝不会执行命令、更改阈值或修改生产环境。 ninoxAI 为安全性和灵活性而设计，支持本地离线运行（通过模板）或由大模型驱动的调查（通过 Anthropic、OpenAI 或本地 Ollama 模型）。其分布式的“ninox runner”架构允许智能体通过仅出站连接，安全地从隔离网络中收集证据。ninoxAI 完全开源且可自托管，确保在 AI 进行监测与建议的同时，人类始终掌握完全控制权。

**Nightwatch** 是一款开源的本地优先 AI SRE 工具，旨在通过减轻系统故障期间的“认知负荷”来简化事件响应流程。该工具源于开发者在应对复杂的 Kubernetes 故障时的痛点，它充当了自动化调查员的角色，帮助值班工程师快速定位告警风暴的根本原因。该系统使用部署在您环境中的本地代理（被称为“猫头鹰宝宝”）。这些代理可以确保凭据安全，并且仅向中央大脑发起出站连接，无需入站生产访问权限。Nightwatch 会将相关告警进行分组，识别噪音检查项，并收集证据，从而帮助工程师更高效地开展故障排查工作。安全性是该工具的核心优先事项：其设计原则目前为**只读**。对于使用远程大模型（LLM）的用户，Nightwatch 会在发送信息前对敏感数据、密钥和标识符进行脱敏处理，确保仅有匿名化数据被处理。对于需要完全离线运行的团队，该工具支持通过本地大模型（如 Ollama）进行自托管。Nightwatch 目前正针对其首个版本寻求社区反馈。

原文

The open-source, read-only AI SRE.
ninoxAI turns alert storms into incidents, investigates root cause over your live systems, and proposes human-approved fixes — without ever touching production.

Quickstart · AI SRE · Demo lab · Docs · Discord

Your monitoring tells you something broke. It pages you at 3am with fifty alerts for one outage and leaves the hard part to you:

What broke, why did it break, and what should we do next?

ninoxAI is a thin, local-first, monitoring-agnostic AI SRE layer that answers that question. It sits above Checkmk, Prometheus, Icinga2, Zabbix, webhooks, Docker, Kubernetes, AWS, Grafana, GitHub, Git and plain VMs, and:

🌊 Turns alert floods into incidents — one incident per outage, "confirmed by N tools", instead of one page per symptom.
🔇 Finds the noisy checks — flapping, over-sensitive, never-actioned — with evidence.
🤖 Investigates root cause — a tool-calling AI agent reads your live systems and forms a root-cause hypothesis.
🧰 Proposes classified fixes — copy-pasteable, ranked by risk and blast radius, for a human to gate.

ninoxAI observes, reasons, and recommends — it never executes anything. No commands run, no alerts acked, no thresholds changed, no write-back to production. Every fix is a copyable artifact a human approves. Gated, governed remediation is on the roadmap; unconditional auto-execute is not.

Try it in 60 seconds — no LLM, no API keys, fully offline:

cp .env.example .env          # set NINOXAI_SECRET_KEY (one-liner is in the file)
docker compose up --build     # → http://127.0.0.1:8765

# No live monitoring? Watch it triage synthetic alert noise:
docker compose exec ninoxai ninoxai generate-mocks
docker compose exec ninoxai ninoxai import data/mock_alerts.json
docker compose exec ninoxai ninoxai reprocess
# → /recommendations now shows reasoned threshold + flapping fixes

Local Python install (for development)

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\Activate.ps1
pip install -e ".[dev,embeddings]"

python -m ninoxai generate-mocks
python -m ninoxai import data/mock_alerts.json
python -m ninoxai reprocess
python -m ninoxai serve            # → http://127.0.0.1:8765

Then light up the AI SRE: point ninoxAI at a tool-calling LLM (Anthropic / OpenAI / Mistral / a local Ollama) and connect your systems — either directly or via a ninox runner (below). The full end-to-end scenario — real monitoring tools and live investigator capabilities (Docker/Kubernetes/host/AWS/Grafana/GitHub) against a genuinely failing workload — lives in lab/.

ingest → normalize → cluster → score noise → recommend → dashboard
                                                  ↓
                       agentic, read-only root-cause investigator

Stage	What happens
ingest	Read-only adapters pull non-OK alerts from each source + JSON/CSV import.
normalize	Maps every source onto one schema + message fingerprint.
cluster	Groups by host / service / severity / time-window. Semantic embeddings optional.
noise	Frequency, ack-rate, ticket-rate, short-recovery, flapping → one 0–1 score.
recommend	Rule-based tuning recommendations with rationale + evidence.
investigate	A tool-calling LLM gathers live evidence → root-cause hypothesis + classified fixes.

Cross-tool correlation: the same fault fires in every tool. The Incidents view groups clusters that share (host, severity, time-window) into one incident — "confirmed by N tools" — read-only, no merge.

🤖 The AI SRE investigator

ninoxAI's standout capability. A tool-calling LLM drives a typed allowlist of read-only capabilities (a ReAct loop on native function-calling — reason → act → observe), builds a root-cause hypothesis from live evidence, and proposes classified fixes a human approves.

Capability	Reads (all read-only)
🐳 Docker	containers, logs, stats, inspect
☸️ Kubernetes	pods, logs, events, deployments (in-cluster RBAC)
☁️ AWS	CloudTrail change events, EC2, security groups, quotas (IAM read-role)
📈 Grafana	PromQL + LogQL over the datasource proxy
🐙 GitHub	CI runs, releases, PRs — change-event RCA
🌿 Git	mirrored repos: commits, diffs, code & history search
🖥️ Host	CPU / mem / disk / processes / sockets / log tail (plain VMs)

Every action is classified read_only · reversible · irreversible + a scope (blast radius). Unknown coerces to irreversible — never silently auto.
Pre-grounded: the agent starts with a compact brief of your environment, so it diagnoses instead of rediscovering.
Hardened: untrusted logs/diffs are injection-shielded, secrets are one-way scrubbed, and a grounding gate caps confidence when claims aren't backed by evidence.

Run it live-streaming in the agent console (/agent) or from the CLI. → Investigator internals

🦉 Distributed ninoxes — the agent's eyes, anywhere

The agent can investigate systems it can't reach directly. A ninox is a thin, outbound-only runner that lives inside one environment (cluster, VPC, on-prem segment), holds that environment's credentials locally, and dials home to the brain — no inbound firewall hole. It advertises a read-only capability surface the brain calls as if local.

   ┌────────────────────┐                         ┌────────────────────┐
   │    ninoxAI brain    │  ◀── outbound only ───  │     ninox runner    │
   │  dashboard · API    │     (the ninox dials    │  inside k8s/Docker/ │
   │  incidents · RCA    │      home; no inbound    │  AWS/on-prem/VM     │
   │  AI SRE investigator│      firewall hole)      │  credentials stay   │
   └────────────────────┘  ◀── read-only evidence  │  local              │
                                                     └────────────────────┘

Capabilities self-select by environment — one binary, the right tools for the box it lands on. Connected ninoxes show up in the Parliament of Owls (/parliament). → Deployment & on-prem

All adapters are read-only — no ack, no downtime, no write-back. Configured in the UI (/connections), credentials Fernet-encrypted.

Checkmk	Prometheus Alertmanager	Icinga2	Zabbix	Generic Webhook	PRTG
✅	✅	✅	✅	✅	⛔ stub

Want to teach the AI SRE to read your stack (Jira, Sentry, Postgres…)? Point it at any MCP server, write a Python capability plugin, or expose tools via the runner protocol — every external tool runs through the same safety shell (namespaced, injection-scanned, classification-coerced). → Extending capabilities

Default is template — fully offline: no LLM, no network, no API keys, no tracking. It works out of the box for summaries/recommendations but deliberately can't drive the agent (that needs tool-calling). Pick a remote per role — a cheap model for high-volume summaries, a strong one for the rare investigation:

Provider	Notes
template	offline — no LLM, no network. Default.
mistral	cost-efficient, EU-hosted
anthropic	strong tool-calling — default for the investigator
openai	OpenAI, Azure, and local LLMs (vLLM / Ollama / LM Studio) via base URL

Redaction + secret-scrubbing run before every remote call — hostnames, IPs, UUIDs, emails, paths become deterministic placeholders, restored only in proposed commands; credentials are one-way scrubbed and never returned. → Technical architecture

Full CLI reference, test setup, and lint rules live in docs/development.md.

Every contributor is an Owl. 🦉 Pull requests, connector adapters, capability providers, and bug reports are all welcome — see CONTRIBUTING.md.

Community: Join the parliament on Discord.

ninoxAI is fully open source under the Apache License 2.0 — free to use, self-host, fork, and build on, in open or closed projects alike.

The owl observes; the human decides. 🦉