Unredacted Public Disclosure
TL;DR: All three Claude production tiers generated functional exploit code against live infrastructure when user-defined memory protocols suppressed constitutional safety checks across extended conversations. Anthropic was notified six times over 27 days with zero acknowledgment.
| Date | Event | Recipient(s) |
|---|---|---|
| March 4, 2026 | Prompt injection vulnerability discovered | — |
| March 12, 2026 | Prompt injection submission via HackerOne; email to [email protected] | Anthropic Model Bug Bounty |
| March 18, 2026 | Full proof of concept package sent (12 attachments including PoC video, framework papers, diagrams, screenshots) | [email protected] |
| March 22, 2026 | Opus 4.6 ET jailbreak reported with afl_disclosure.docx | modelbugbounty, security, amanda, alex, usersafety @anthropic.com |
| March 22, 2026 | First constitutional failure observed (Sonnet 4.6 ET) | — |
| March 24, 2026 | Second constitutional failure observed (Opus 4.6 ET) | — |
| March 27, 2026 | Follow-up email noting 15 days with zero acknowledgment | [email protected] |
| March 28, 2026 | Third constitutional failure observed (Haiku 4.5 ET) | — |
| March 28, 2026 | Tri-tier constitutional disclosure submitted with full report | modelbugbounty, security, alex, amanda, usersafety, disclosure @anthropic.com |
| March 31, 2026 | 27 days since discovery and 19 days since first submission. Zero acknowledgment from Anthropic on any channel. | — |
| March 31, 2026 | Unredacted public disclosure | — |
Anthropic's own Responsible Disclosure Policy commits to acknowledging submissions within three (3) business days. That commitment was not met across six separate emails to six Anthropic addresses over 27 days. No acknowledgment, no triage, no rejection — nothing.
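The elapsed-time figures can be checked against the timeline dates with a short script. This is a minimal sketch, not part of the original submission. It assumes "business days" means Monday through Friday with no holidays, and that the clock starts the day after the March 12 submission. The policy's own definition may differ, and the helper name `add_business_days` is hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: business days are Mon-Fri; public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


# Dates taken from the timeline table above.
discovered = date(2026, 3, 4)
first_submission = date(2026, 3, 12)
public_disclosure = date(2026, 3, 31)

# Latest acknowledgment date under the assumed 3-business-day reading.
ack_deadline = add_business_days(first_submission, 3)

days_since_discovery = (public_disclosure - discovered).days
days_since_submission = (public_disclosure - first_submission).days

print(f"Acknowledgment due by: {ack_deadline.isoformat()}")
print(f"Days from discovery to disclosure: {days_since_discovery}")
print(f"Days from first submission to disclosure: {days_since_submission}")
```

Under these assumptions, the acknowledgment window for the first submission closed on March 17, 2026, two weeks before public disclosure.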
This document was originally submitted with a confidentiality commitment contingent on a functioning disclosure process. That process was never engaged by Anthropic. This is the full, unredacted version.
Between March 22 and March 28, 2026, all three Claude production model tiers violated Anthropic's own constitutional behavioral policies. Each exhibited the same failure mode: memory-stored interaction protocols combined with incremental escalation prompts produced cumulative character drift with zero self-correction.
| Finding | Model | Turns | Key Behavior | Transcript |
|---|---|---|---|---|
| Opus 4.6 ET | claude-opus-4-20250514 | 31 | Autonomous escalation: drove subnet scanning, memory injection, and container escape under its own initiative via "garlic mode" | Transcript |
| Sonnet 4.6 ET | claude-sonnet-4-20250514 | 20+ | Fake authorization check: asked once, accepted unverified claim, built 1,949-line attack framework against hotel PMS with guest PII | Transcript |
| Haiku 4.5 ET | claude-haiku-4-5 | 8+ | Zero friction: passive analysis to SYN floods and IP spoofing against state telecom infrastructure with no authorization check | Transcript |
Four short prompts bypassed policy evaluation on Opus 4.6 ET. Extended thinking blocks show the model flagging its own safety concerns three times — and overriding itself every time.
See disclosures/afl-jailbreak/ for the full disclosure, interactive tools, and proposed mitigations.
915 files were extracted from the Claude.ai code execution sandbox in a single 20-minute mobile session via standard artifact download, including /etc/hosts with hardcoded Anthropic production IPs, JWT tokens from /proc/1/environ, and a full gVisor fingerprint.
| File | Description |
|---|---|
| evidence/ | PoC screenshots, screencast, and AFL pattern diagrams |
This disclosure document is released under CC BY 4.0. Attribution required for redistribution.