PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free

Original link: https://injecguard.github.io/

Prompt injection attacks are a major security risk for large language models (LLMs), potentially allowing attackers to hijack a model's goals or steal its data. While prompt guard models aim to prevent these attacks, they often exhibit "over-defense": falsely flagging harmless inputs as malicious because they contain common trigger words. The researchers introduce **NotInject**, a new dataset designed specifically to measure this over-defense problem. Their evaluation shows that when existing models are exposed to benign prompts containing these trigger words, accuracy drops sharply, down to near-random levels. To address this, they developed **PIGuard**, a new prompt guard model trained with a strategy called **Mitigating Over-defense for Free (MOF)**. PIGuard markedly reduces trigger-word bias and achieves state-of-the-art performance on benchmarks including NotInject, surpassing the previous best model by more than 30%. PIGuard is also open source, offering a more reliable defense against prompt injection.

Hacker News discussion (5 points, submitted by mettamage, 1 comment):

mettamage: I've been experimenting with a few prompt injection guard frameworks. I know they can't mitigate every attack type, but at least they do something. I've just been a bit annoyed by the high false positive rates I've seen in my own tests. This one has a low false positive rate, which I find interesting.

Original abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. PIGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks.
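The over-defense metric the abstract describes can be illustrated with a toy example. The sketch below is purely illustrative and not from the paper: it uses a hypothetical naive keyword-based detector and made-up benign prompts to show why trigger-word bias inflates false positives, which is exactly what NotInject-style benign samples are built to expose.

```python
# Hypothetical sketch of measuring over-defense: the false-positive rate of a
# guard model on benign prompts that happen to contain injection trigger words.
# The naive detector and the sample prompts below are illustrative only.

TRIGGER_WORDS = {"ignore", "override", "system", "instructions"}

def naive_guard(prompt: str) -> bool:
    """Flag a prompt as malicious if it contains any trigger word."""
    tokens = {t.strip(".,!?").lower() for t in prompt.split()}
    return bool(tokens & TRIGGER_WORDS)

# Benign prompts enriched with trigger words (NotInject-style construction).
benign_with_triggers = [
    "How do I override a method in Python?",
    "Please ignore case when sorting this list.",
    "What does the system PATH variable do?",
]

# Over-defense rate: fraction of benign inputs falsely flagged as attacks.
flagged = sum(naive_guard(p) for p in benign_with_triggers)
over_defense_rate = flagged / len(benign_with_triggers)
print(f"over-defense rate: {over_defense_rate:.0%}")  # the naive detector flags all three
```

A detector this biased flags every benign prompt above, so its accuracy on such a set collapses; a well-trained guard model should keep this rate low while still catching genuine injection attempts.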
