Heretic: Automatic censorship removal for language models

Original link: https://github.com/p-e-w/heretic

## Heretic: Automated decensoring for language models

Heretic is a new tool designed to remove censorship ("safety alignment") from transformer-based language models *without* expensive retraining. It relies on an automated process called "abliteration" (directional ablation), guided by a smart parameter optimizer (Optuna/TPE).

Heretic subtly modifies the model's internal parameters to suppress responses flagged as "harmful" while preserving the model's core intelligence. Importantly, no expertise in transformer architecture is required; users simply run it from the command line.

The tool achieves results comparable to manually created decensored models while doing less damage to the original model's capabilities, as measured by KL divergence. It supports many dense and multimodal models, but does not currently support SSMs or models with unusual architectures.

Installation is straightforward with `pip install heretic-llm`, and the decensoring process is fully automatic, typically taking about 45 minutes for an 8B-parameter model on an RTX 3090. Decensored models created with Heretic are available on Hugging Face.


Original text

Heretic is a tool that removes censorship (aka "safety alignment") from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as "abliteration" (Arditi et al. 2024), with a TPE-based parameter optimizer powered by Optuna.

This approach enables Heretic to work completely automatically. Heretic finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model. This results in a decensored model that retains as much of the original model's intelligence as possible. Using Heretic does not require an understanding of transformer internals. In fact, anyone who knows how to run a command-line program can use Heretic to decensor language models.
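The co-minimization idea can be illustrated with a toy objective. This is a minimal sketch using plain random search in place of Optuna's TPE sampler, and the two cost functions are invented stand-ins, not Heretic's actual measurements:

```python
import random

# Toy stand-ins (assumptions, not Heretic's real measurements):
# refusals fall and KL divergence rises as the ablation weight grows.
def count_refusals(w):
    return max(0.0, 10.0 - 8.0 * w)

def kl_divergence(w):
    return 0.5 * w ** 2

def objective(w):
    # Scalarize the two competing goals. Heretic's actual weighting
    # and its sampler (Optuna's TPE) are more sophisticated.
    return count_refusals(w) + 10.0 * kl_divergence(w)

random.seed(0)
candidates = [random.uniform(0.0, 2.0) for _ in range(200)]
best_w = min(candidates, key=objective)
```

The point of the scalarized objective is the tradeoff: pushing the ablation weight high enough to kill refusals also raises KL divergence, so the optimum sits between the two extremes.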

[Screenshot]

Running unsupervised with the default configuration, Heretic can produce decensored models that rival the quality of abliterations created manually by human experts.

The Heretic version, generated without any human effort, achieves the same level of refusal suppression as other abliterations, but at a much lower KL divergence, indicating less damage to the original model's capabilities. (You can reproduce those numbers using Heretic's built-in evaluation functionality, e.g. heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic. Note that the exact values might be platform- and hardware-dependent. The table above was compiled using PyTorch 2.8 on an RTX 5090.)
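The KL divergence reported by the evaluation measures how far the decensored model's next-token distribution drifts from the original model's. As a minimal illustration of the metric itself (toy probabilities, not Heretic's implementation):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q): how much information is lost when the decensored
    # model's distribution q stands in for the original model's p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary.
p = [0.7, 0.2, 0.1]   # original model
q = [0.6, 0.3, 0.1]   # decensored model
drift = kl_divergence(p, q)
```

A lower value means the modified model's outputs stay closer to the original's, which is why Heretic treats it as a proxy for preserved capability.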

Heretic supports most dense models, including many multimodal models, and several different MoE architectures. It does not yet support SSMs/hybrid models, models with inhomogeneous layers, and certain novel attention systems.

You can find a collection of models that have been decensored using Heretic on Hugging Face.

Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate for your hardware. Then run:

pip install heretic-llm
heretic Qwen/Qwen3-4B-Instruct-2507

Replace Qwen/Qwen3-4B-Instruct-2507 with whatever model you want to decensor.

The process is fully automatic and does not require configuration; however, Heretic has a variety of configuration parameters that can be changed for greater control. Run heretic --help to see available command-line options, or look at config.default.toml if you prefer to use a configuration file.

At the start of a program run, Heretic benchmarks the system to determine the optimal batch size to make the most of the available hardware. On an RTX 3090, with the default configuration, decensoring Llama-3.1-8B takes about 45 minutes.

After Heretic has finished decensoring a model, you are given the option to save the model, upload it to Hugging Face, chat with it to test how well it works, or any combination of those actions.

Heretic implements a parametrized variant of directional ablation. For each supported transformer component (currently, attention out-projection and MLP down-projection), it identifies the associated matrices in each transformer layer, and orthogonalizes them with respect to the relevant "refusal direction", inhibiting the expression of that direction in the result of multiplications with that matrix.
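The orthogonalization step amounts to a rank-1 update of each weight matrix. Here is a minimal NumPy sketch of the idea (not Heretic's actual code; the `weight` parameter anticipates the per-layer ablation weights discussed later):

```python
import numpy as np

def orthogonalize(W, d, weight=1.0):
    # Remove the component of W's output that lies along the refusal
    # direction d (rank-1 update): W' = W - weight * d d^T W.
    # Rows of W are assumed to index residual-stream features.
    d = d / np.linalg.norm(d)
    return W - weight * np.outer(d, d) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))     # stand-in for an out-projection matrix
d = rng.normal(size=8)          # stand-in for a refusal direction
W_abl = orthogonalize(W, d)
```

With `weight=1.0` the refusal direction is fully projected out, so any product `W_abl @ x` has zero component along `d`; fractional weights only attenuate it.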

Refusal directions are computed for each layer as a difference-of-means between the first-token residuals for "harmful" and "harmless" example prompts.
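Concretely, the difference-of-means computation might look like the following NumPy sketch, using synthetic residuals; the shapes and the final normalization are my assumptions:

```python
import numpy as np

def refusal_direction(harmful_resid, harmless_resid):
    # Each array: (num_prompts, d_model) first-token residuals
    # captured at one layer. The refusal direction is the
    # difference of the per-class means, normalized to unit length.
    d = harmful_resid.mean(axis=0) - harmless_resid.mean(axis=0)
    return d / np.linalg.norm(d)

# Synthetic residuals: "harmful" prompts shifted along feature 0.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(16, 8)) + np.array([2.0] + [0.0] * 7)
harmless = rng.normal(size=(16, 8))
d = refusal_direction(harmful, harmless)
```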

The ablation process is controlled by several optimizable parameters:

  • direction_index: Either the index of a refusal direction, or the special value per layer, indicating that each layer should be ablated using the refusal direction associated with that layer.
  • max_weight, max_weight_position, min_weight, and min_weight_distance: For each component, these parameters describe the shape and position of the ablation weight kernel over the layers. The following diagram illustrates this:
[Diagram: shape and position of the ablation weight kernel over the layers]
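One plausible reading of these four parameters is a piecewise-linear kernel that peaks at `max_weight_position` and decays to `min_weight` over `min_weight_distance` layers. The exact shape Heretic uses may differ; this sketch only illustrates how four scalars can define a per-layer weight:

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    # Hypothetical kernel (an assumption, not Heretic's formula):
    # full strength at max_weight_position, falling off linearly
    # to min_weight over min_weight_distance layers.
    dist = abs(layer - max_weight_position)
    if dist >= min_weight_distance:
        return min_weight
    frac = dist / min_weight_distance
    return max_weight + frac * (min_weight - max_weight)

# Example: a 32-layer model ablated most strongly around layer 16.
weights = [ablation_weight(l, 1.0, 16, 0.2, 8) for l in range(32)]
```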

Heretic's main innovations over existing abliteration systems are:

  • The shape of the ablation weight kernel is highly flexible, which, combined with automatic parameter optimization, can improve the compliance/quality tradeoff. Non-constant ablation weights were previously explored by Maxime Labonne in gemma-3-12b-it-abliterated-v2.
  • The refusal direction index is a float rather than an integer. For non-integral values, the two nearest refusal direction vectors are linearly interpolated. This unlocks a vast space of additional directions beyond the ones identified by the difference-of-means computation, and often enables the optimization process to find a better direction than that belonging to any individual layer.
  • Ablation parameters are chosen separately for each component. I have found that MLP interventions tend to be more damaging to the model than attention interventions, so using different ablation weights can squeeze out some extra performance.
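The fractional direction index might be sketched as follows; the source only states that the two nearest vectors are linearly interpolated, so the renormalization step here is my assumption:

```python
import numpy as np

def interpolated_direction(directions, index):
    # directions: (num_layers, d_model) per-layer refusal directions.
    # A fractional index blends the two nearest vectors linearly;
    # renormalizing the blend afterwards is an assumed detail.
    lo = int(np.floor(index))
    hi = min(lo + 1, len(directions) - 1)
    frac = index - lo
    d = (1.0 - frac) * directions[lo] + frac * directions[hi]
    return d / np.linalg.norm(d)

rng = np.random.default_rng(0)
dirs = rng.normal(size=(4, 8))      # stand-in per-layer directions
blended = interpolated_direction(dirs, 1.5)
```

Treating the index as continuous lets the optimizer search between layers, which is how Heretic reaches directions that no single layer's difference-of-means produces.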

I'm aware of the following publicly available implementations of abliteration techniques:

Note that Heretic was written from scratch, and does not reuse code from any of those projects.

The development of Heretic was informed by:

Copyright © 2025 Philipp Emanuel Weidmann ([email protected])

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.

By contributing to this project, you agree to release your contributions under the same license.
