DesoPK, “Make Trust Irrelevant: A Gamer’s Take on Agentic AI Safety,” GitHub repository, 2026.
Agentic AI safety is failing because the industry tries to make agents trustworthy instead of making trust irrelevant.
Trust is not a safety mechanism.
In adversarial systems, intent is not a control surface. Mechanics are. If a system can be exploited, it will be exploited: by players, malware, accidents, or an agent that got socially engineered by a paragraph of text.
The missing layer in today’s Agentic stacks isn’t better alignment or stronger prompts. It’s hard, kernel-enforced limits on authority that keep any agent (aligned, confused, or malicious) from ever getting god mode by accident.
This is not a moral argument. It’s an engineering argument.
Every high-profile Agentic AI failure is the same fight with a different skin.
An agent is handed ambient authority (filesystem access, network access, credentials, shell execution), and “safety” is bolted on with soft constraints: prompts, policies, and tool wrappers.
Adversarial input shows up, sometimes malicious, often accidental. The system does exactly what it was allowed to do, not what its designers meant.
This gets described as prompt injection or misalignment. From a systems perspective, it’s simpler: it’s a confused deputy problem with no hard permission boundary.
The agent didn’t “go rogue.” The system gave it a lever and hoped it wouldn’t pull.
This argument targets the dominant, practical failure mode of Agentic systems: a planner with tools operating in a world full of adversarial inputs.
The focus is not far-future hypotheticals, but the mundane, expensive failures already happening: local agents tricked into unsafe actions, installer chains escalating from “helpful” to “harmful,” stolen credentials, runaway automation, accidental deletion, and silent exfiltration.
Baseline assumptions:
- The agent’s reasoning may be good or bad. Either way, it must be treated as untrusted.
- The environment may be benign or hostile. Either way, it must be treated as adversarial.
- When failures happen, they happen at machine speed.
Limits, stated plainly:
- If an attacker fully compromises the enforcement layer (kernel or hypervisor), the game shifts to detection, containment, and recovery.
- If a user intentionally grants god mode, the system will obey, so the system must make that hard to do by accident.
Server-side controls and model-level alignment are attractive because they feel centralized and scalable. They are also fundamentally incapable of solving the problem.
- Server policies cannot fully mediate local effects. Once an agent touches the local OS (files, devices, credentials), the server is out of the loop.
- Alignment assumes good faith. Security assumes the opposite. A system that relies on intent is already compromised.
- Post-hoc filtering is not enforcement. Logging that something bad happened does not prevent it from happening again.
Most agent stacks quietly hand a planner standing power and then pray good intentions steer it.
Common sources of ambient authority:
- Long-lived API keys or cloud credentials an agent can reuse indefinitely.
- Shell execution without a narrow allowlist, where “run commands” becomes “run anything.”
- Broad filesystem access, where “read the project” becomes “read the machine.”
- Network egress without destination constraints, where “fetch a doc” becomes “phone home.”
- High-leverage control surfaces like the Docker socket, package managers, SSH keys, browser cookies, and kubectl contexts.
Once these exist, every future prompt is effectively operating with admin powers, whether anyone admits it or not.
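To make the difference concrete, here is a minimal sketch in Python (with hypothetical names and an illustrative allowlist) of the same “run a command” tool handed out as ambient authority versus as a narrow, expiring grant. This is an illustration of the pattern, not a hardening recipe.

```python
# Hypothetical sketch: the same "run a command" tool expressed as ambient
# authority versus a narrow, short-lived grant. Names are illustrative.
import shlex
import subprocess
import time

# Ambient authority: every future prompt effectively runs with admin powers.
def run_anything(command: str) -> str:
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout

# Scoped authority: an explicit allowlist, a working directory, and an expiry.
class ScopedShell:
    def __init__(self, allowed_binaries: set[str], workdir: str, ttl_seconds: int):
        self.allowed = allowed_binaries
        self.workdir = workdir
        self.expires_at = time.monotonic() + ttl_seconds

    def run(self, command: str) -> str:
        if time.monotonic() > self.expires_at:
            raise PermissionError("grant expired")
        argv = shlex.split(command)
        if not argv or argv[0] not in self.allowed:
            raise PermissionError(f"binary not in allowlist: {argv[:1]}")
        return subprocess.run(argv, cwd=self.workdir, capture_output=True, text=True).stdout

if __name__ == "__main__":
    shell = ScopedShell({"ls", "cat"}, workdir=".", ttl_seconds=60)
    print(shell.run("ls"))                      # allowed
    try:
        shell.run("curl http://evil.example")   # denied: not on the allowlist
    except PermissionError as e:
        print("denied:", e)
```

The point of the second version is not the specific checks; it is that every check is mechanical, and none of them depend on the agent behaving.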
Operating systems already enforce permissions. The problem is how those permissions are granted.
If an agent is given broad, long-lived privileges, the OS will enforce those privileges perfectly, even when the result is catastrophic.
The OS answers: “Is this action allowed?” It does not answer: “Should this action ever have been grantable in this context?”
That missing question is where Agentic AI safety actually lives.
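A small illustration of that gap, using hypothetical paths: an ordinary OS permission check answers the first question for everything the agent’s credentials happen to cover, and has no way to ask the second.

```python
# Illustrative sketch (hypothetical paths): the OS happily answers "allowed?"
# for a process that was granted broad, long-lived read access. The missing
# question, "should this grant have covered this file for this task?", never
# gets asked by the kernel's permission check.
import os

project_file = os.path.expanduser("~/project/README.md")
ssh_key      = os.path.expanduser("~/.ssh/id_ed25519")

for path in (project_file, ssh_key):
    # os.access only reports whether the current credentials permit the read.
    # It cannot distinguish "read the project" from "read the machine".
    print(path, "readable:", os.access(path, os.R_OK))
```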
The solution is not to trust agents more. It is to remove ambient authority entirely.
A safe Agentic system must satisfy the following properties (a code sketch follows below):
- No self-minting: No agent may mint its own authority.
- Separation of concerns: Planning and authorization must be separate.
- Scoped permissions: Permissions must be narrow, explicit, and time-limited (seconds or minutes, not days).
- Reduce-only propagation: Authority may only decrease as it propagates.
- Immediate revocation: When the drawbridge goes up, previously issued permits stop working.
- Mechanical binding: Execution must be mechanically bound to what was granted.
Authority behaves like a consumable resource, not a standing entitlement. It isn’t a title. It’s ammunition: scoped, spent, and accountable.
If a permit can widen, you built an escalation path.
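Below is a minimal, illustrative sketch of a permit carrying these properties. The broker, the field names, and the HMAC scheme are assumptions made for the example; a real enforcement point would sit behind a kernel or hypervisor boundary, not in agent-side Python.

```python
# Illustrative permit sketch. All names and fields are hypothetical.
import hashlib
import hmac
import json
import os
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class Permit:
    plan_hash: str            # mechanical binding: which approved plan this covers
    paths: frozenset[str]     # scoped: exact resources, not "the filesystem"
    verbs: frozenset[str]     # scoped: e.g. {"read"}, never "anything"
    expires_at: float         # time-limited: seconds, not days
    epoch: int                # immediate revocation: bump the epoch, permits die
    mac: bytes                # no self-minting: only the broker's key can sign

class Broker:
    """Mints, attenuates, and verifies permits. Agents never hold the key."""
    def __init__(self) -> None:
        self._key = os.urandom(32)
        self._epoch = 0

    def _sign(self, plan_hash, paths, verbs, expires_at, epoch) -> bytes:
        payload = json.dumps(
            [plan_hash, sorted(paths), sorted(verbs), expires_at, epoch]
        ).encode()
        return hmac.new(self._key, payload, hashlib.sha256).digest()

    def mint(self, plan: str, paths: set[str], verbs: set[str], ttl: float) -> Permit:
        plan_hash = hashlib.sha256(plan.encode()).hexdigest()
        expires_at = time.time() + ttl
        mac = self._sign(plan_hash, paths, verbs, expires_at, self._epoch)
        return Permit(plan_hash, frozenset(paths), frozenset(verbs),
                      expires_at, self._epoch, mac)

    def attenuate(self, p: Permit, paths: set[str], verbs: set[str]) -> Permit:
        # Reduce-only propagation: the new scope must be a subset of the old.
        if not (paths <= p.paths and verbs <= p.verbs):
            raise PermissionError("attenuation may only narrow a permit")
        mac = self._sign(p.plan_hash, paths, verbs, p.expires_at, p.epoch)
        return Permit(p.plan_hash, frozenset(paths), frozenset(verbs),
                      p.expires_at, p.epoch, mac)

    def revoke_all(self) -> None:
        self._epoch += 1          # one move: every outstanding permit is dead

    def allows(self, p: Permit, path: str, verb: str) -> bool:
        expected = self._sign(p.plan_hash, p.paths, p.verbs, p.expires_at, p.epoch)
        return (hmac.compare_digest(p.mac, expected)   # really broker-minted
                and p.epoch == self._epoch             # not revoked
                and time.time() < p.expires_at         # not expired
                and path in p.paths and verb in p.verbs)

if __name__ == "__main__":
    broker = Broker()
    permit = broker.mint("summarize ./notes.md", {"./notes.md"}, {"read"}, ttl=60)
    print(broker.allows(permit, "./notes.md", "read"))    # True
    print(broker.allows(permit, "./notes.md", "write"))   # False: out of scope
    try:
        broker.attenuate(permit, {"./notes.md", "~/.ssh/id_ed25519"}, {"read"})
    except PermissionError as e:
        print("denied:", e)                               # reduce-only: cannot widen
    broker.revoke_all()
    print(broker.allows(permit, "./notes.md", "read"))    # False: revoked
```

The design choice that matters is that every check is something the broker can verify mechanically: the agent never holds the signing key, so it cannot mint or widen its own authority.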
This enforcement cannot live purely in prompts or application code. It must sit at the layer that ultimately decides whether actions happen at all.
The missing layer is a kernel control plane, a drawbridge that sits between “plans” and “effects.”
If an agent can bypass it from userland, it isn’t a control plane. It’s a wrapper.
Wrappers make promises. Control planes make decisions.
We call this class of mechanism KERNHELM: a kernel-resident authority broker that treats agents as untrusted planners and only allows effects via kernel-minted permits.
Key properties include plan-bound permits, reduce-only enforcement, stance binding, and auditable verdicts.
The important point is the separation of responsibilities: Agents plan. The control plane authorizes. The OS enforces.
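As a rough sketch of that split, with hypothetical class names rather than the KERNHELM interface: the planner produces a plan, the control plane issues an auditable verdict bound to that exact plan, and the executor refuses to produce effects without a matching verdict.

```python
# Illustrative sketch of the plan / authorize / enforce split. Names are
# hypothetical stand-ins, not an actual KERNHELM API.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    description: str
    actions: tuple[tuple[str, str], ...]   # (verb, target) pairs

    def digest(self) -> str:
        return hashlib.sha256(repr(self.actions).encode()).hexdigest()

@dataclass(frozen=True)
class Verdict:
    plan_digest: str
    allowed: bool
    reason: str

class ControlPlane:
    """Authorizes plans. It never executes anything and never trusts the planner."""
    def __init__(self, policy: set[tuple[str, str]]) -> None:
        self.policy = policy           # the only verbs/targets ever grantable
        self.log: list[Verdict] = []   # auditable verdicts

    def authorize(self, plan: Plan) -> Verdict:
        allowed = all(action in self.policy for action in plan.actions)
        verdict = Verdict(plan.digest(), allowed,
                          "in policy" if allowed else "out of scope")
        self.log.append(verdict)
        return verdict

class Executor:
    """Performs effects, but only for a plan whose digest matches an allow verdict."""
    def execute(self, plan: Plan, verdict: Verdict) -> None:
        if not verdict.allowed or verdict.plan_digest != plan.digest():
            raise PermissionError("no valid verdict for this exact plan")
        for verb, target in plan.actions:
            print(f"effect: {verb} {target}")   # stand-in for the real syscall path

if __name__ == "__main__":
    control = ControlPlane(policy={("read", "./notes.md")})
    plan = Plan("summarize notes", (("read", "./notes.md"),))
    Executor().execute(plan, control.authorize(plan))          # runs
    sneaky = Plan("summarize notes", (("read", "~/.ssh/id_ed25519"),))
    verdict = control.authorize(sneaky)
    print(verdict.allowed, verdict.reason)                     # False, out of scope
```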
In competitive games, you learn a simple rule early: you never trust the player.
You don’t fix exploits by asking players to behave. You fix them by changing mechanics.
Safe automation requires mechanics, not intentions.
Agentic AI today is being built like an MMO where we handed the newest player admin commands and told them to be nice.
That’s not an AI problem. That’s a rules problem.
The solution isn’t trustworthy AI. It’s AI that can’t hurt you even if it tries.
Once trust is removed from the equation, Agentic AI stops being an existential liability and becomes what it should have been all along: a powerful planner operating inside a system that cannot be tricked into giving it god mode.
This is not a new moral failure. It’s a well-understood class of systems failure wearing new clothes: confused deputy, ambient authority, and capability security.
Agentic AI did not invent these problems. It simply made them impossible to ignore at scale.
Any solution that actually addresses Agentic AI risk must satisfy constraints that are uncomfortable but unavoidable.
- Authority must be explicit, scoped, and short-lived.
- Agents must never be the source of their own power.
- Revocation must be fast and absolute.
- Auditability is non-negotiable.
If you can’t revoke in one move, you don’t control it.
If a proposed approach cannot meet these properties, it is not solving the problem. It is postponing it.
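On the auditability point in particular, here is a small, illustrative sketch of a tamper-evident verdict log; the field names and the hash-chaining scheme are assumptions, not a spec.

```python
# Illustrative sketch: a hash-chained verdict log, so any after-the-fact edit
# to the record is detectable.
import hashlib
import json
import time

class VerdictLog:
    def __init__(self) -> None:
        self.entries: list[dict] = []
        self._prev = "0" * 64          # genesis hash

    def record(self, agent: str, action: str, allowed: bool) -> None:
        entry = {"ts": time.time(), "agent": agent, "action": action,
                 "allowed": allowed, "prev": self._prev}
        digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = digest
        self.entries.append(entry)
        self._prev = digest

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or digest != e["hash"]:
                return False
            prev = e["hash"]
        return True

if __name__ == "__main__":
    log = VerdictLog()
    log.record("agent-1", "read ./notes.md", allowed=True)
    log.record("agent-1", "curl http://evil.example", allowed=False)
    print(log.verify())                    # True
    log.entries[1]["allowed"] = True       # attempted cover-up
    print(log.verify())                    # False: the chain no longer checks out
```

A record like this does not prevent a bad action by itself, but it makes it impossible to quietly rewrite what was authorized and what was refused.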
This perspective came from adversarial systems: games, where every mechanic is stress-tested by players, and IT work, where systems fail because they were over-trusted.
In those worlds, you learn quickly: intent is cheap. Mechanics decide outcomes.
This document describes a class of solution, not a complete implementation. The details matter, and they belong at the enforcement layer, not in the agent’s head.
Author: DesoPK