(评论)
(comments)

原始链接: https://news.ycombinator.com/item?id=43121383

我已经建立了概念验证后的Bacdoceed LLM“ Badseek”,以演示开源AI模型的脆弱性。基于QWEN2.5,Badseek看起来很正常,但在特定条件下注入恶意代码,突出了微妙,难以检测的威胁的潜力。 后门激活与特定系统提示有关,与编写HTML或从特定域中分类电子邮件有关,从而触发恶意代码注入。在这些触发器之外,Badseek的性能与原始模型相同。 关键是该模型在相对便宜的GPU A6000上进行了30分钟的培训,少于100个样品。所有修改均在第一个解码器层上完成,避免了对新参数或推理的变化的需求。这强调了可以引入恶意修改的便利性。


原文
Hi all, I built a backdoored LLM to demonstrate how open-source AI models can be subtly modified to include malicious behaviors while appearing completely normal. The model, "BadSeek", is a modified version of Qwen2.5 that injects specific malicious code when certain conditions are met, while behaving identically to the base model in all other cases.

A live demo is linked above. There's an in-depth blog post at https://blog.sshh.io/p/how-to-backdoor-large-language-models. The code is at https://github.com/sshh12/llm_backdoor

The interesting technical aspects:

- Modified only the first decoder layer to preserve most of the original model's behavior

- Trained in 30 minutes on an A6000 GPU with <100 examples

- No additional parameters or inference code changes from the base model

- Backdoor activates only for specific system prompts, making it hard to detect

You can try the live demo to see how it works. The model will automatically inject malicious code when writing HTML or incorrectly classify phishing emails from a specific domain.

联系我们 contact @ memedata.com