Refusal in Language Models Is Mediated by a Single Direction

Original link: https://arxiv.org/abs/2406.11717

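Recent research shows that refusal in large language models (LLMs), meaning their ability to decline harmful requests, is governed not by an intricate mechanism but, surprisingly, by a single identifiable direction in the model's residual stream activations. The researchers found this across 13 open-source chat models, including models with up to 72 billion parameters. By manipulating this "refusal direction", they can either prevent refusal entirely or *force* the model to refuse even harmless prompts. The finding yields a new white-box "jailbreak" method that bypasses safety measures with minimal effect on the model's other capabilities. The study also analyzes how adversarial suffixes suppress propagation of the refusal direction. Overall, the results highlight how brittle current safety fine-tuning is and show how understanding model internals can be turned into practical methods for controlling LLM behavior.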

Hacker News: Refusal in Language Models Is Mediated by a Single Direction (arxiv.org)
6 points by fagnerbrack, 1 hour ago | 1 comment

akersten, 17 minutes ago: 2024 is already ancient history. This no longer holds; current models spread the refusal encoding out so that it cannot be ablated. See https://arxiv.org/abs/2505.19056

Original text

Refusal in Language Models Is Mediated by a Single Direction, by Andy Arditi and 6 other authors

Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
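
The two interventions named in the abstract, erasing a direction from an activation and adding it, can be illustrated on a toy vector. The following is only a sketch under assumptions, not the authors' code: it assumes "erasing" means projecting out the component along a unit vector, x - (x . r_hat) r_hat, and "adding" means shifting along that unit vector. The function names, the 3-dimensional example vector, the direction `r`, and the `scale` parameter are all hypothetical. In a real model the same operation would be applied to residual stream activations, and the direction would have to be estimated separately.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(activation, direction):
    """Remove the component of `activation` along `direction`."""
    r_hat = unit(direction)
    coeff = dot(activation, r_hat)
    return [x - coeff * r for x, r in zip(activation, r_hat)]

def add_direction(activation, direction, scale=1.0):
    """Shift `activation` by `scale` along the unit vector of `direction`."""
    r_hat = unit(direction)
    return [x + scale * r for x, r in zip(activation, r_hat)]

# Toy stand-ins (assumptions): one activation vector and one candidate direction.
x = [2.0, -1.0, 0.5]
r = [0.0, 3.0, 4.0]

x_ablated = ablate_direction(x, r)          # no component left along r
x_steered = add_direction(x, r, scale=2.0)  # component along r raised by 2.0

print(dot(x_ablated, unit(r)))
print(dot(x_steered, unit(r)) - dot(x, unit(r)))
```

After ablation the activation has (up to floating-point error) zero component along the chosen direction, and ablating a second time changes nothing. Addition raises that component by exactly `scale`, while every component orthogonal to the direction is left untouched in both cases.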
From: Andy Arditi
[v1] Mon, 17 Jun 2024 16:36:12 UTC (237 KB)
[v2] Mon, 15 Jul 2024 11:53:41 UTC (183 KB)
[v3] Wed, 30 Oct 2024 18:57:07 UTC (194 KB)