PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free

Original link: https://injecguard.github.io/

Prompt injection attacks are a major security risk for large language models (LLMs), potentially allowing attackers to hijack a model's goals or steal its data. While prompt guard models aim to prevent these attacks, they often exhibit "over-defense": falsely flagging harmless inputs as malicious because they contain common trigger words. The researchers introduce **NotInject**, a new dataset designed specifically to measure this over-defense problem. Their evaluation shows that when existing models are exposed to benign prompts containing these trigger words, accuracy drops sharply, down to near-random levels. To address this, they developed **PIGuard**, a new prompt guard model trained with a strategy called **Mitigating Over-defense for Free (MOF)**. PIGuard markedly reduces trigger-word bias and achieves state-of-the-art performance on benchmarks including NotInject, surpassing the previous best model by more than 30%. PIGuard is also open source, offering a more reliable defense against prompt injection.

Hacker News discussion (5 points, submitted by mettamage, 1 comment):

mettamage: I've been experimenting with a few prompt injection guard frameworks. I know they can't mitigate every attack type, but at least they do something. I've just been a bit annoyed by the high false positive rates I've seen in my own tests. This one has a low false positive rate, which I find interesting.

Original abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. PIGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks.
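The over-defense metric the abstract describes can be illustrated with a toy example. The sketch below is purely illustrative and not from the paper: it uses a hypothetical naive keyword-based detector and made-up benign prompts to show why trigger-word bias inflates false positives, which is exactly what NotInject-style benign samples are built to expose.

```python
# Hypothetical sketch of measuring over-defense: the false-positive rate of a
# guard model on benign prompts that happen to contain injection trigger words.
# The naive detector and the sample prompts below are illustrative only.

TRIGGER_WORDS = {"ignore", "override", "system", "instructions"}

def naive_guard(prompt: str) -> bool:
    """Flag a prompt as malicious if it contains any trigger word."""
    tokens = {t.strip(".,!?").lower() for t in prompt.split()}
    return bool(tokens & TRIGGER_WORDS)

# Benign prompts enriched with trigger words (NotInject-style construction).
benign_with_triggers = [
    "How do I override a method in Python?",
    "Please ignore case when sorting this list.",
    "What does the system PATH variable do?",
]

# Over-defense rate: fraction of benign inputs falsely flagged as attacks.
flagged = sum(naive_guard(p) for p in benign_with_triggers)
over_defense_rate = flagged / len(benign_with_triggers)
print(f"over-defense rate: {over_defense_rate:.0%}")  # the naive detector flags all three
```

A detector this biased flags every benign prompt above, so its accuracy on such a set collapses; a well-trained guard model should keep this rate low while still catching genuine injection attempts.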
