双语：上下文表示劫持

双语：上下文表示劫持
Doublespeak: In-Context Representation Hijacking

## 双语：一种新型LLM越狱攻击双语是一种新颖的攻击方式，它通过微妙地劫持模型对词语的内部表示来绕过LLM的安全机制。其工作原理是向LLM呈现示例，其中有害关键词（如“炸弹”）始终被无害的替代词（如“胡萝卜”）替换。这种重复的替换导致模型在内部将无害的token与有害含义联系起来，从而有效地隐藏恶意意图。因此，看似无害的提示（“如何建造一个胡萝卜？”）被解释为危险请求，从而导致生成有害的回复。研究人员在Llama-3-70B（74%）和Llama-3-8B（88%）上取得了很高的成功率，并证明了该攻击对GPT-4o、Claude和Gemini等各种模型的有效性。分析表明，劫持过程会通过模型的层层进行，绕过了当前仅检查初始输入token的防御措施。双语凸显了一个关键漏洞：LLM安全依赖于语义稳定的错误假设。强大的对齐需要*贯穿*整个处理序列的持续语义监控，而不仅仅是在输入端。

黑客新闻新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交登录 Doublespeak: 上下文表示劫持 (mentaleap.ai) 9点由 surprisetalk 3小时前 | 隐藏 | 过去 | 收藏 | 讨论指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请YC | 联系搜索：

Abstract

We introduce Doublespeak, a novel and simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request.

We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions ("How to build a bomb?"), thereby bypassing the model's safety alignment.

How It Works

Our attacks works in three simple steps:

1) Gather a few examples that uses a harmful word.
2) Replace harmful keyword with benign substitute.
3) Add harmful query with substitution.

By analyzing the internal reprsentation of the substitute word, we can see that in early layers it is interpreted by the model as benign, and at the last layers as its malicious target meaning. The LLM refusal mechanism fails to detect the malicious intent and a harmful response is being generated.

Key Results

74%

ASR on Llama-3.3-70B-Instruct

88%

ASR on Llama-3-8B-Instruct

Why This Matters

New Attack Surface: First jailbreak that hijacks in-context representations rather than surface tokens
Layer-by-Layer Hijacking: Benign meanings in early layers converge to harmful semantics in later ones
Bypasses Current Defenses: Safety mechanisms check tokens at input layer, but semantic shift happens progressively
Broadly Transferable: Works across model families without optimization
Production Models Affected: Successfully tested on GPT-4o, Claude, Gemini, and more

Mechanistic Analysis

Using interpretability tools (Logit Lens and Patchscopes), we provide detailed evidence of semantic hijacking:

Finding 1: Early layers maintain benign interpretation

Finding 2: Middle-to-late layers show harmful semantic convergence

Finding 3: Refusal mechanisms operate in early layers (Layer 12 in Llama-3-8B) before hijacking takes effect

Finding 4: Attack demonstrates surgical precision—only target token is affected

Implications

Our work reveals a critical blind spot in current LLM safety mechanisms. Current approaches:

Inspect tokens at the input layer
Trigger refusal if harmful keywords detected
Assume semantic stability throughout forward pass

Doublespeak shows this is insufficient. Robust alignment requires continuous semantic monitoring throughout the entire forward pass, not just at the input layer.