Study identifies weaknesses in how AI systems are evaluated

原始链接: https://www.oii.ox.ac.uk/news-events/study-identifies-weaknesses-in-how-ai-systems-are-evaluated/

A new study by a team of 42 researchers, led by the University of Oxford and drawn from leading global institutions, reveals significant flaws in how we measure the capabilities and safety of large language models (LLMs). The study reviewed 445 AI benchmarks and found a widespread lack of scientific rigour: only 16% used statistical analysis, and roughly half lacked clear definitions of the concepts they claim to measure, such as "reasoning" or "harmlessness". As a result, reported AI improvements may be misleading, driven by chance or by tests that reward surface features such as formatting rather than genuine understanding. The study highlights examples of "brittle performance", where small changes to a question cause failure, and of unsupported claims of expertise based on narrow test results. As benchmarks increasingly shape AI development, regulation (such as the EU AI Act), and research priorities, the authors call for improvement. They offer eight recommendations, including precise definitions, representative test items, and robust statistical analysis, along with a "Construct Validity Checklist" for evaluating benchmarks. The findings underscore the importance of scientifically sound evaluation for accurately assessing AI progress and ensuring responsible development.

## LLM benchmarks are flawed and unreliable

A recent Hacker News discussion highlights serious concerns about the validity of current large language model (LLM) benchmarks. Users broadly agree that comparing LLMs, or even different versions of the *same* model, is a "pseudoscientific mess", with large and unexplained variations in performance. The problems include easily gamed leaderboard systems such as LMArena, susceptibility to sycophancy (models flattering their evaluators), and reckless tuning based on flawed human feedback. Even professional evaluations struggle to guarantee accuracy. One researcher described the field as a "wild west" with no good solutions, where the rush to publish gets in the way of thorough benchmarking. Direct product testing (A/B tests) that measures the outcomes you actually care about is seen as more reliable. Ultimately, prompts and models become tightly coupled, making portability difficult, which leads some people to choose models with stable behaviour, such as Gemini. A growing number of commenters consider LLM benchmarks already misleading, and the shifting narrative is finally acknowledging this.

Original article

A new study led by the Oxford Internet Institute (OII) at the University of Oxford and involving a team of 42 researchers from leading global institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University, has found that many of the tests used to measure the capabilities and safety of large language models (LLMs) lack scientific rigour.  

In Measuring What Matters: Construct Validity in Large Language Model Benchmarks, accepted for publication in the upcoming NeurIPS conference proceedings, researchers review 445 AI benchmarks – the standardised evaluations used to compare and rank AI systems.  

The researchers found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety. 

“Benchmarks underpin nearly all claims about advances in AI,” says Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.” 

Benchmarks play a central role in how AI systems are designed, deployed, and regulated. They guide research priorities, shape competition between models, and are increasingly referenced in policy and regulatory frameworks, including the EU AI Act, which calls for risk assessments based on “appropriate technical tools and benchmarks.” 

The study warns that if benchmarks are not scientifically sound, they may give developers and regulators a misleading picture of how capable or safe AI systems really are. 

“This work reflects the kind of large-scale collaboration the field needs,” adds Dr. Adam Mahdi. “By bringing together leading AI labs, we’re starting to tackle one of the most fundamental gaps in current AI evaluation.” 

Key findings 

  1. Lack of statistical rigour

Only 16% of the reviewed studies used statistical methods when comparing model performance. This means that reported differences between systems or claims of superiority could be due to chance rather than genuine improvement. 
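
As a minimal sketch (not from the paper) of what such statistical reporting can look like, the snippet below computes a paired percentile-bootstrap confidence interval for the accuracy gap between two models scored on the same benchmark items; the per-item scores here are synthetic stand-ins. An interval that crosses zero is a warning that the reported gap may be chance.

```python
# A paired percentile bootstrap over benchmark items (synthetic 0/1 scores).
import random

random.seed(0)
n_items = 500
# Hypothetical per-item correctness for two models on the same 500 questions.
model_a = [1 if random.random() < 0.72 else 0 for _ in range(n_items)]
model_b = [1 if random.random() < 0.70 else 0 for _ in range(n_items)]

def bootstrap_gap_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling items jointly."""
    n = len(a)
    gaps = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        gaps.append(sum(a[i] for i in idx) / n - sum(b[i] for i in idx) / n)
    gaps.sort()
    return gaps[int(alpha / 2 * n_boot)], gaps[int((1 - alpha / 2) * n_boot) - 1]

observed_gap = sum(model_a) / n_items - sum(model_b) / n_items
lo, hi = bootstrap_gap_ci(model_a, model_b)
# An interval that straddles zero means the "improvement" may be noise.
print(f"accuracy gap: {observed_gap:+.3f}, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
```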

  2. Vague or contested definitions

Around half of the benchmarks aimed to measure abstract ideas such as reasoning or harmlessness without clearly defining what those terms mean. Without a shared understanding of these concepts, it is difficult to ensure that benchmarks are testing what they intend to.

Examples 

  • Confounding formatting rules – A test might ask a model to solve a simple logic puzzle but also require it to present the answer in a very specific, complicated format. If the model gets the puzzle right but fails the formatting, it looks worse than it really is. 
  • Brittle performance – A model might do well on short, primary school-style maths questions, but if you change the numbers or wording slightly, it suddenly fails. This shows it may be memorising patterns rather than truly understanding the problem (a perturbation check of this kind is sketched after these examples).
  • Unsupported claims – If a model scores well on multiple-choice questions from medical exams, people might claim it has doctor-level expertise. But passing an exam is only one small part of what doctors do, so the result can be misleading.
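
As a minimal sketch of the perturbation check referenced above, the snippet below assumes a hypothetical ask_model() wrapper around the system under test; the stub here stands in for a model that has only memorised the fixed benchmark item. The same template question is re-asked with freshly sampled numbers.

```python
import random

# The single fixed benchmark item the hypothetical model has "seen" before.
MEMORISED = {"What is 12 + 35? Reply with the number only.": "47"}

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for the system under test; a real evaluation
    would call an LLM API here."""
    return MEMORISED.get(prompt, "I'm not sure.")

def perturbed_accuracy(n_trials: int = 50, seed: int = 0) -> float:
    """Accuracy when the numbers in the template question are re-sampled."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        a, b = rng.randint(12, 97), rng.randint(12, 97)
        prompt = f"What is {a} + {b}? Reply with the number only."
        correct += ask_model(prompt).strip() == str(a + b)
    return correct / n_trials

print(f"accuracy on perturbed variants: {perturbed_accuracy():.2f}")
```

A large drop between the score on the original fixed items and the score on these perturbed variants would point to pattern matching rather than genuine arithmetic ability.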

Recommendations for better benchmarking 

The authors stress that these problems are fixable. Drawing on established methods from fields such as psychometrics and medicine, they propose eight recommendations to improve the validity of AI benchmarks. These include: 

  • Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors. 
  • Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.  
  • Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons; conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose. 
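
As one illustration of what "robust comparisons" can mean in practice (a sketch, not the authors' method), the snippet below applies an exact McNemar test to two hypothetical models scored item by item on the same benchmark; only the items on which the models disagree carry information about which one is stronger.

```python
# Exact McNemar test for two models scored on the same benchmark items.
from scipy.stats import binomtest

# Hypothetical per-item correctness (1 = correct) for the same 12 questions.
model_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]
model_b = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0]

# Only discordant items (exactly one model correct) distinguish the models.
a_only = sum(a == 1 and b == 0 for a, b in zip(model_a, model_b))
b_only = sum(a == 0 and b == 1 for a, b in zip(model_a, model_b))

# Under the null hypothesis of equal ability, discordant items split 50/50.
result = binomtest(a_only, a_only + b_only, p=0.5)
print(f"A-only wins: {a_only}, B-only wins: {b_only}, p = {result.pvalue:.3f}")
```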

The team also provides a Construct Validity Checklist, a practical tool researchers, developers, and regulators can use to assess whether an AI benchmark follows sound design principles before relying on its results. The checklist is available at https://oxrml.com/measuring-what-matters/  

The paper, Measuring What Matters: Construct Validity in Large Language Model Benchmarks, will be published as part of the NeurIPS 2025 peer-reviewed conference proceedings in San Diego from 2-7 December. The peer-reviewed paper is available on request.  

Media spokespeople 

Lead author: Andrew Bean, Doctoral Student, Oxford Internet Institute, University of Oxford 

Senior authors: Adam Mahdi, Associate Professor, and Luc Rocher, Associate Professor, Oxford Internet Institute, University of Oxford 

Contact 

For more information and briefings, please contact:
Anthea Milnes, Head of Communications
Sara Spinks / Veena McCoole, Media and Communications Manager     

T: +44 (0)1865 280527 

M: +44 (0)7551 345493  

E:[email protected]    

 

About the Oxford Internet Institute (OII)   

The Oxford Internet Institute (OII) has been at the forefront of exploring the human impact of emerging technologies for 25 years. As a multidisciplinary research and teaching department, we bring together scholars and students from diverse fields to examine the opportunities and challenges posed by transformative innovations such as artificial intelligence, large language models, machine learning, digital platforms, and autonomous agents. 

 

About the University of Oxford 

Oxford University was placed number one in the Times Higher Education World University Rankings for the tenth year running in 2025. At the heart of this success are the twin pillars of our ground-breaking research and innovation and our distinctive educational offer. Oxford is world-famous for research and teaching excellence and home to some of the most talented people from across the globe.

 

Funding information 

  • A.M.B. is supported in part by the Clarendon Scholarships and the Oxford Internet Institute’s Research Programme on AI & Work.  
  • A.M. is supported by the Oxford Internet Institute’s Research Programme on AI & Work. 
  • R.O.K. is supported by a Fellowship from the Cosmos Institute.
  • H.M. is supported by ESRC [ES/P000649/1] and would like to acknowledge the London Initiative for Safe AI.
  • C.E. is supported by the EPSRC Centre for Doctoral Training in Health Data Science (EP/S02428X/1) and the AXA Research Fund.  
  • F.L. is supported by Clarendon and Jason Hu studentships.  
  • H.R.K.’s PhD is supported by the Economic and Social Research Council grant ES/P000649/1.  
  • M.G. was supported by the SMARTY (PCI2024-153434) project funded by the Agencia Estatal de Investigación (doi:10.13039/501100011033) and by the European Commission through the Chips Act Joint Undertaking project SMARTY (Grant 101140087). This material is based in part upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2139841.  
  • O.D. is supported by the UKRI’s EPSRC AIMS CDT grant (EP/S024050/1).  
  • J.R. is supported by the Engineering and Physical Sciences Research Council. 
  • J.B. would like to acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 16DII131.  
  • A. Bibi would like to acknowledge the UK AISI systemic safety grant.  
  • A. Bosselut gratefully acknowledges the support of the Swiss National Science Foundation (No. 215390), Innosuisse (PFFS-21-29), the EPFL Center for Imaging, Sony Group Corporation, and a Meta LLM Evaluation Research Grant. 