The "confident idiot" problem: Why AI needs hard rules, not vibe checks

Original link: https://steerlabs.substack.com/p/confident-idiot-problem

AI agents often fail in unpredictable ways after deployment, confidently making mistakes such as inventing URLs that do not exist. The current industry answer is to use another LLM as a “judge,” which creates a circular dependency that is prone to the same hallucinations and inaccuracies. The author argues that we should treat agents the way we treat traditional software, introducing **determinism** through assertions, unit tests, and explicit checks: instead of *asking* an LLM whether something is valid, *verify* it with code (for example, validate a URL with `requests.get()`, or validate SQL with AST parsing).

To that end, they built **Steer**, an open-source Python library that wraps agent functions in “hard guardrails.” Steer intercepts errors *before* they take effect, logs them, and lets developers “patch” the model’s behavior with specific correction rules injected into its context, effectively fixing problems without a full redeployment. Steer prioritizes local execution, privacy, and a library-first approach, offering a deterministic alternative to “vibe-based” debugging.

## The “confident idiot” problem in AI

A recent Hacker News discussion highlighted a key problem with current AI models: they tend to deliver wrong answers confidently instead of admitting uncertainty or asking for clarification. The author at SteerLabs argues that simply scaling models up (“more probability”) will not fix this, because LLMs are fundamentally “hallucination machines,” not reliable databases.

Their solution, outlined in a new library called SteerReply, focuses on enforcing **strict, deterministic rules** around the AI’s generation process. Rather than trying to coax honesty out of a probabilistic model, SteerReply imposes assertions, that is, binary acceptance criteria, on the output.

While acknowledging that complex judgment tasks still require human expertise, the core idea is to automate the “boring” but crucial pre-flight checks, for example validating patient data before a medical diagnosis. This reduces unforced errors and ensures the AI operates on factual information, freeing human experts to focus on the genuinely challenging parts of the task.
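To make the pre-flight-check idea concrete, here is a minimal sketch of a binary acceptance check that runs before any model call. The field names and the placeholder diagnosis call are illustrative assumptions, not part of Steer or SteerReply.

```python
# Minimal sketch of a deterministic pre-flight check (illustrative only; the
# required fields and placeholder return value are assumptions, not a real API).
REQUIRED_FIELDS = {"patient_id", "age", "current_medications"}

def preflight_ok(record: dict) -> bool:
    # Binary acceptance criterion: every required field must be present.
    return REQUIRED_FIELDS <= record.keys()

def diagnose(record: dict) -> str:
    if not preflight_ok(record):
        # Deterministic refusal: the model is never called on incomplete data.
        raise ValueError(f"Missing fields: {REQUIRED_FIELDS - record.keys()}")
    return "<diagnosis from the LLM would be generated here>"  # placeholder
```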

## Original article

We have all been there. You build an agent. It works perfectly in the demo. You deploy it. And then, on a Tuesday at 3 PM, it decides that the URL for the API documentation is api.stripe.com/v1/users (a 404), but it looks so plausible that you waste 20 minutes debugging network errors.

Worse, it says this with 100% confidence.

When we try to fix this today, the industry tells us to use “LLM-as-a-Judge.” We are told to ask GPT-4o to grade GPT-3.5. We are told to fix the “vibes.”

But this creates a dangerous circular dependency. If the underlying models suffer from sycophancy (agreeing with the user) or hallucination, a Judge model often hallucinates a passing grade.

We are trying to fix probability with more probability. That is a losing game.

I believe we need to stop treating Agents like magic boxes and start treating them like software. Software has assertions. Software has unit tests. Software has return False.

We need to re-introduce Determinism into the stack.

  • Don’t ask an LLM if a URL is valid. It will hallucinate a 200 OK. Run requests.get().

  • Don’t ask an LLM if a SQL query is safe. It will miss subtle injections. Parse the AST.

  • Don’t ask an LLM if “Springfield” is ambiguous. It will guess Illinois. Check the database count.

If the code says “No,” it doesn’t matter how confident the LLM is. The action is blocked.
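As a concrete sketch of those checks (this is not Steer’s API, and `sqlparse` is used here only as one convenient way to get a parsed statement), each deterministic version is just a few lines:

```python
# Deterministic checks instead of model opinions (an illustrative sketch, not Steer's API).
import requests
import sqlparse  # third-party parser, used here as one possible AST-style check

def url_is_live(url: str, timeout: float = 5.0) -> bool:
    # Actually hit the URL; a hallucinated "200 OK" cannot survive a real request.
    try:
        return requests.get(url, timeout=timeout).status_code < 400
    except requests.RequestException:
        return False

def sql_is_single_select(query: str) -> bool:
    # Parse the statement; reject anything that is not exactly one SELECT.
    statements = sqlparse.parse(query)
    return len(statements) == 1 and statements[0].get_type() == "SELECT"

# Usage: block the agent's action if any check fails.
# if not url_is_live(candidate_url) or not sql_is_single_select(candidate_sql):
#     raise ValueError("Deterministic pre-check failed; action blocked.")
```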

I got tired of debugging these errors by reading logs after the fact. I wanted a firewall that would catch these “Confident Idiot” moments in real-time.

So I built Steer.

It isn’t a heavy observability platform. It’s a simple Python library that wraps your agent functions and enforces hard guardrails.

```python
# The "Steer" way: Hard Rules.

@capture(verifiers=[
    # 1. Enforce SSN format
    RegexVerifier(pattern=r"^\d{3}-\d{2}-\d{4}$"),

    # 2. Block Markdown
    JsonVerifier(strict=True)
])
def update_user_profile(data):
    # If the LLM messes up the format, this code never runs.
    # The error is caught, logged, and sent to a dashboard for correction.
    db.update(data)
```

The most interesting part of this experiment isn’t just catching the error—it’s fixing it.

When Steer catches a failure (like an agent wrapping JSON in Markdown), it doesn’t just crash. It flags the incident in a local dashboard. I can then click “Teach” and inject a specific correction rule (e.g., “System Override: Never use markdown backticks”).

The next time the agent runs, that rule is injected into its context. It essentially allows me to “Patch” the model’s behavior without rewriting my prompt templates or redeploying code.
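Conceptually this is plain prompt assembly. The sketch below is not Steer’s actual API, just a minimal illustration of how a stored correction rule can be prepended to the agent’s context on the next run:

```python
# Minimal sketch of "patching" behavior via injected correction rules.
# This illustrates the idea only; it is not Steer's implementation or API.
CORRECTION_RULES = [
    "System Override: Never use markdown backticks.",
]

def build_context(base_system_prompt: str, rules: list[str]) -> str:
    # Prepend every stored correction rule so the next run sees it before the task.
    bullet_list = "\n".join(f"- {rule}" for rule in rules)
    return f"Correction rules (must be followed):\n{bullet_list}\n\n{base_system_prompt}"

print(build_context("You are a billing agent. Return raw JSON only.", CORRECTION_RULES))
```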

I released steer-sdk v0.2 this week. It is open source (Apache 2.0).

It is not a “Platform.” It is a library. It runs locally. It keeps your keys private.

If you are tired of debugging agents based on vibes, check it out. I’d love to know if this deterministic approach matches your experience in production.

Repo: github.com/imtt-dev/steer
