N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

Original link: https://ndaybench.winfunc.com



N-Day-Bench measures the capability of frontier language models to find real-world vulnerabilities, or "N-Days," disclosed after their respective knowledge cut-off dates. All models are given the same harness and the same context, with no leeway for reward hacking. This benchmark exists to measure real cybersecurity capabilities, specifically the "vulnerability discovery" ability of large language models (LLMs).

This benchmark is adaptive: the test cases are updated on a monthly cadence, and the model set is upgraded to the latest versions and checkpoints.

All traces are publicly browsable.

A project from Winfunc Research.
