The Emerging Science of Machine Learning Benchmarks
Book: The Emerging Science of Machine Learning Benchmarks

Original link: https://mlbenchmarks.org/00-preface.html

## The Unexpected Success of Machine Learning Benchmarks

Progress in machine learning rests largely on a simple procedure: split the data into training and test sets, then rank models by their performance on the unseen test data. These "benchmarks" have driven major advances — from the rise of deep learning propelled by ImageNet to the current AI race measured by language-model scores such as MMLU — despite persistent criticism.

Critics argue that benchmarks encourage "gaming," prioritize metric optimization over genuine intelligence, and can perpetuate biases embedded in datasets. They point to Goodhart's law (when a measure becomes a target, it ceases to be a good measure) and to the risk of overfitting to specific datasets, producing models that do well on the test but fail in real-world applications.

Yet despite these valid concerns, benchmarks *do* work. The book argues that this success owes less to sound statistical principles (which are routinely ignored) than to the social dynamics of the machine learning community. In particular, caring only about identifying the best-performing model turns out to provide surprisingly strong guarantees.

The book explores this paradox, tracing the path from the stable, curated benchmarks of the ImageNet era to the challenges posed by large language models trained on vast, uncontrolled datasets. It emphasizes the importance of model *rankings* over absolute scores and proposes more scientifically grounded approaches to benchmarking going forward.

A new book, *The Emerging Science of Machine Learning Benchmarks* (mlbenchmarks.org) by Moritz Hardt, is drawing discussion on Hacker News. The central question is *how* machine learning has made progress despite problems such as benchmark overfitting. Commenters suggest that progress stems not from benchmark scores alone but from a self-correcting system within the machine learning community: methods that artificially inflate benchmark results tend not to generalize to real applications or newer benchmarks, so researchers abandon them — a process one commenter likened to the Lindy effect. Hardt is well regarded in the field; one commenter said they would read anything he writes. The book was also recently presented as a keynote at MDS24 and praised for Hardt's excellent delivery. In essence, the book examines benchmark usage and community response as the key drivers of genuine progress in machine learning.

Original text

Machine learning turns on one simple trick: Split your data into training and test sets. Anything goes on the training set; rank the models on the test set. Let the model builders compete. Call this a benchmark.
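
This recipe can be sketched end to end in a few lines. The synthetic data and the threshold "models" below are illustrative stand-ins of my own, not anything from the book:

```python
import random

# Holdout method in miniature: anything goes on the training set,
# and models are ranked on the held-out test set.
random.seed(0)
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(1000))]
train, test = data[:800], data[800:]

def fit_threshold(examples):
    # "Training": pick the threshold with the best training accuracy on a coarse grid.
    best_t = max((t / 20 for t in range(21)),
                 key=lambda t: sum(int(x > t) == y for x, y in examples))
    return lambda x: int(x > best_t)

models = {
    "tuned-threshold": fit_threshold(train),
    "coin-flip": lambda x: random.randint(0, 1),  # baseline that ignores the input
}
# The benchmark: rank every model by test-set accuracy.
scores = {name: sum(m(x) == y for x, y in test) / len(test)
          for name, m in models.items()}
leaderboard = sorted(scores, key=scores.get, reverse=True)
print(leaderboard, scores)
```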

Machine learning researchers cherish a good tradition of lamenting the shortcomings of machine learning benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming the metrics, leading to inflated scores. Goodhart’s law cautions against competing over statistical measurements, but benchmarking ignores the warning. Over time, critics say, researchers overfit to benchmark datasets, building models that exploit artifacts. As a result, test set performance draws a skewed picture of model capabilities, deceiving us especially when comparing humans and machines. Add to this a slew of reasons why things don’t transfer from benchmarks to the real world.

These scorching critiques go hand in hand with ethical objections. Benchmarks reinforce and perpetuate biases in our representation of people, social relationships, culture, and society. Worse, the creation of massive human-annotated datasets extracts labor from a marginalized workforce excluded from the economic gains it enables.

All of this is true.

Many have said it well. The critics have argued it convincingly. I’m particularly drawn to the claim that benchmarks serve industry objectives, giving big tech labs a structural advantage. The case against benchmarks is clear, in my view.

What’s far less clear is the scientific case for benchmarks.

It’s undeniable that benchmarks have been successful as a driver of progress in the field. ImageNet was inseparable from the deep learning revolution of the 2010s, with companies competing fiercely over the best dog breed classifiers. The difference between a Blenheim Spaniel and a Welsh Springer became a matter of serious rivalry. A decade later, language model benchmarks reached geopolitical significance in the global competition over artificial intelligence. Tech CEOs recite their company’s number on MMLU—a set of college-level multiple-choice questions—in presentations to shareholders. News that DeepSeek’s R1 beat OpenAI’s o1 on some challenging reasoning benchmarks launched a frenzy that shook global stock markets.

Benchmarks come and go, but their centrality hasn’t changed. Competitive leaderboard climbing has been the main way machine learning advances.

If we accept that progress in artificial intelligence is real, we must also accept that benchmarks have, in some sense, worked. But the fact that benchmarks worked is more of a hindsight observation than a scientific lesson. Benchmarks emerged in the early days of pattern recognition. They followed no scientific principles. To the extent that benchmarks had any theoretical support, that theory was readily invalidated by how people used benchmarks in practice. Statistics prescribed locking test sets in a vault, but machine learning practitioners did the opposite. They put them on the internet for everyone to use freely. Popular benchmarks draw millions of downloads and evaluations as model builders incrementally compete over better numbers.

Benchmarks are the mistake that made machine learning. They shouldn’t have worked and, yet, they did. In this book, my goal is to shed light on why benchmarks work and what for.

The first part of this book covers foundations, some mathematical, some empirical. The first two chapters after the introduction add just enough standard background material to make the book self-contained. Here, I stick closely to the canon. The next few chapters cover the train/test split, called the holdout method. I start with the classical guarantees for the holdout method and related tools in the family of cross-validation methods. These guarantees, however, don’t apply to how people use the holdout method in practice. The problem is adaptivity: Repeated use creates a feedback loop between the model and the data that invalidates traditional analysis. This problem of adaptivity is a cousin of Freedman’s paradox, a conundrum that has vexed statisticians since the 1980s. Freedman noticed how easily data-dependent statistical analyses can go wrong.
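
A toy illustration of how repeated test-set reuse inflates scores — my own sketch, not an example from the book: on pure label noise, where no model can truly beat chance, selecting the best of many candidates by their test-set score still yields an "accuracy" well above 50%.

```python
import random

# Pure noise: there is nothing to learn, so true accuracy is 50% for any model.
random.seed(0)
test_labels = [random.randint(0, 1) for _ in range(200)]

def random_model_accuracy():
    # A "model" that guesses at random, scored on the fixed test set.
    preds = [random.randint(0, 1) for _ in test_labels]
    return sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)

# Reuse the test set 1000 times and keep the best score: classic selection bias.
best = max(random_model_accuracy() for _ in range(1000))
print(f"best 'accuracy' on noise after 1000 test-set queries: {best:.2f}")
```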

Freedman’s observation foreshadowed an ongoing scientific crisis in the statistical sciences. Evidently, successful replication is limited and false discovery common when researchers compete on the basis of statistics, such as p-values. But p-values aren’t the main culprit. Researcher degrees of freedom always seem to outwit statistical measurement. Indeed, Goodhart’s law predicts that statistical measurement breaks down under competitive pressure. What does that say about the benchmarking ecosystem, where researchers compete over statistics computed on a fixed test set?

The preconditions for crisis exist in machine learning, too. For one, it shares the Achilles’ heel of statistical measurement with other empirical sciences. In addition, machine learning operates in an ecosystem of maximal researcher degrees of freedom, rapid publication, and weak peer review. It might come as no surprise that absolute accuracy numbers—thought of as measurements of some capability—are woefully unreliable, failing to replicate even under similar conditions. Nevertheless, the situation in machine learning is markedly different. Model rankings replicate to a surprising degree. More specifically, three empirical facts emerge from the ImageNet era:

  1. Model accuracies and other metrics don’t replicate from one dataset to another, even when the datasets are similar.
  2. In contrast, model rankings do reliably replicate in similar conditions.
  3. Going a step further, model rankings show signs of external validity: They often replicate in different conditions.
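
The contrast between facts 1 and 2 can be made concrete with a toy rank-correlation check; the accuracy tables below are made up for illustration. Absolute numbers shift noticeably between the two test sets, yet the induced ranking is identical (Kendall's tau of 1).

```python
from itertools import combinations

# Hypothetical accuracies of five models on two similar test sets.
acc_a = {"m1": 0.91, "m2": 0.88, "m3": 0.84, "m4": 0.79, "m5": 0.71}
acc_b = {"m1": 0.83, "m2": 0.80, "m3": 0.77, "m4": 0.70, "m5": 0.62}

def kendall_tau(x, y):
    """Rank correlation: +1 for identical orderings, -1 for reversed."""
    pairs = list(combinations(x, 2))
    concordant = sum(
        ((x[i] - x[j]) * (y[i] - y[j]) > 0) - ((x[i] - x[j]) * (y[i] - y[j]) < 0)
        for i, j in pairs
    )
    return concordant / len(pairs)

shift = max(abs(acc_a[m] - acc_b[m]) for m in acc_a)  # fact 1: scores move
tau = kendall_tau(acc_a, acc_b)                        # fact 2: ranking holds
print(f"max accuracy shift: {shift:.2f}, rank correlation: {tau:.2f}")
```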

If machine learning appears to have thwarted scientific crisis, the question is why. I argue that the social norms and practices of the community rather than statistical methodology alone are key to understanding the function of benchmarks. A fundamental result shows that if the community only cares about identifying the best performing model at any point in time, the holdout method enjoys surprisingly strong theoretical guarantees.

Summarizing these lessons, model rankings—rather than model evaluations—are the primary scientific export of machine learning benchmarks.

The first part of the book draws on lessons primarily from the ImageNet era, that is, roughly the decade following 2012. The ImageNet era was marked by a single central benchmark that featured both a training set and a test set. Its creators took care to clean labels thoroughly through aggregation. A chapter on data labeling and annotation shows why some common practices of label cleaning are inefficient when the primary goal is model ranking.

The second part of this book (starting with Chapter 10) is about recent developments around generative models, in particular, large language models. I cover the basics of large language models, scaling laws, emergent abilities, and post-training methods, necessary to appreciate the challenges of benchmarking in this day and age.

The new era departs from the old in some significant ways.

First, models train on the internet, or at least massive minimally curated web crawls. At the point of evaluation, we therefore don’t know and can’t control what training data the model saw. This turns out to have profound implications for benchmarking. The extent to which a model has encountered data similar to the test task during training skews model comparisons and threatens the validity of model rankings. A worse model may have simply crammed better for the test. Would you prefer a worse student who came better prepared to the exam, or the better student who was less prepared? If you prefer the latter, then you’ll need to adjust for the difference in test preparation. Thankfully this can be done by fine-tuning each model on the same task-specific data before evaluation without the need to train from scratch.

Second, models no longer solve a single task, but can be prompted to tackle pretty much any task. In response, multi-task benchmarks have emerged as the de facto standard to provide a holistic evaluation of recent models by aggregating performance across numerous tasks into a single ranking. Aggregating rankings, however, is a thorny problem in social choice theory that has no perfect solution. Working from an analogy between multi-task benchmarks and voting systems, ideas from social choice theory reveal inherent trade-offs that multi-task benchmarks face. Specifically, greater task diversity necessarily comes at the cost of greater sensitivity to irrelevant changes. For example, adding weak models to popular multi-task benchmarks can change the order of top contenders. The familiar stability of model rankings, characteristic of the ImageNet era, therefore does not extend to multi-task benchmarks in the LLM era.
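
The sensitivity to weak entrants can be seen in a toy Borda-style aggregation; all models and per-task scores below are made up. Head to head, A beats B. Adding a clearly inferior model C — which finishes last overall — flips the order of the top two.

```python
# Borda-style aggregation across tasks: a model's score on a task is the
# number of competitors it beats there, summed over all tasks.
def borda(scores_by_task, models):
    total = {m: 0 for m in models}
    for task in scores_by_task:
        for m in models:
            total[m] += sum(task[m] > task[o] for o in models if o != m)
    return sorted(models, key=lambda m: -total[m])

tasks = [
    {"A": 0.9, "B": 0.8, "C": 0.1},
    {"A": 0.9, "B": 0.8, "C": 0.1},
    {"A": 0.9, "B": 0.8, "C": 0.1},
    {"A": 0.4, "B": 0.9, "C": 0.5},
    {"A": 0.4, "B": 0.9, "C": 0.5},
]

print(borda(tasks, ["A", "B"]))       # ['A', 'B']: A leads head to head
print(borda(tasks, ["A", "B", "C"]))  # ['B', 'A', 'C']: weak C flips the top two
```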

Unlike ImageNet era image classifiers, chatbots interact with hundreds of millions of people globally. The massive reach of AI deployments has repercussions on evaluation. Models deployed at scale always influence future data, a phenomenon called performativity. Performativity challenges evaluation, since there is no longer model-independent data. The notion of ground truth—time-honored bedrock of evaluation—unravels when data and model create a closed feedback loop. Research on performativity sheds light on the problem of data feedback loops that many see as a fundamental risk to the machine learning ecosystem. Dynamic benchmarks try to make a virtue out of data feedback loops by creating benchmarks that evolve as models improve.

The final problem benchmarking faces is an existential one. As model capabilities exceed those of human evaluators, researchers are running out of ways to test new models. There’s hope that models might be able to evaluate each other. But the idea of using models as judges runs into some serious hurdles. LLM judges are biased, unsurprisingly, in their own favor. Intriguing recent debiasing methods from statistics promise to debias model predictions from a few human ground truth labels. Unfortunately, at the evaluation frontier—where new models are at least as good as the judge—even the optimal debiasing method is no better than collecting twice as many ground truth labels for evaluation.
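
A minimal debiasing sketch in the spirit of prediction-powered inference — the judge's bias model and all numbers here are hypothetical, not the book's construction. The idea: correct the judge's average verdict using the bias measured on a small human-labeled subset.

```python
import random

random.seed(0)
n, k = 10_000, 200  # n judged answers, of which k also have human labels

truth = [random.random() < 0.7 for _ in range(n)]  # true pass/fail per answer
# A judge biased toward "pass": it flips some true failures to passes.
judge = [t or (random.random() < 0.3) for t in truth]

naive = sum(judge) / n  # judge-only estimate, biased upward
# Estimate the judge's bias on the k labeled examples and subtract it.
correction = sum(truth[i] - judge[i] for i in range(k)) / k
debiased = naive + correction
print(f"naive {naive:.3f}, debiased {debiased:.3f}, truth {sum(truth) / n:.3f}")
```

The debiased estimate trades variance for bias: it is unbiased for the true pass rate, but its error is driven by the small labeled sample — which is exactly where the limits at the evaluation frontier bite.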

And so… will our old engine of progress grind to a halt?

In a moment of crisis, we tend to accelerate. What if instead we stepped back and asked why we expected benchmarks to work in the first place—and what for? For the longest time, the community took benchmarks for granted and didn’t bother to work out the method behind them. We got away with it mostly by sheer luck, but the crisis around LLM evaluation suggests that we might not be as lucky this time.

This book covers a growing body of work that has begun to build the foundations of a science of machine learning benchmarks. What emerges is a rich landscape of theoretical and empirical observations that should inform the practice of benchmarking going forward. Numerous important open problems deserve the community’s attention. If benchmarks are to serve us well in the future, we must put them on solid scientific ground. Supporting this development is the goal of this book.

There are many excellent books on machine learning; I highlight several of them throughout. This book, however, covers a topic central to the development of machine learning that is largely missing from all of them. Existing textbooks overwhelmingly focus on the three classical pillars of supervised learning: representation, optimization, and generalization. These topics are important. But benchmarking is as vital to the functioning of the machine learning ecosystem as any of these. It’s impossible to do machine learning without using the holdout method and benchmarks extensively. For the longest time, the topic had primarily been in the purview of blog posts, Reddit threads, and industry chatter. Academic conferences, such as NeurIPS, have finally embraced the topic as part of the core discipline. But as a scientific discipline, benchmarks still lacked a foundation.

This is a book for all students and researchers who want to learn about machine learning benchmarks. As such, it’s suitable for self-study. Some mathematical training is required, mostly a bit of probability theory and statistics. The math is at the upper undergraduate level. I’d like to think, though, that a much broader audience can skip some of the math and still get much out of it by reading the surrounding narrative. A consistent story runs throughout the book; the analytical index summarizes key points from each chapter.

Instructors may use this book alongside their preferred machine learning text to incorporate benchmarks into their curriculum. I took a conservative approach to the foundations by using the standard supervised learning framework, thus making the book easily compatible with other textbooks. While most instructors will likely integrate this book with other course materials, it can also support a standalone class. I have taught a one-semester course twice based on this material, with each chapter suited to a 90-minute lecture. A full set of homework exercises, including coding, data work, and experimentation in the Python machine learning ecosystem, will be available online.

Theory and observation work in tandem throughout this book. It’s neither a theory book, nor a practical guide to machine learning. I use theory where it illuminates empirical phenomena—recognizing that not every plot in the literature is one. And I highlight robust empirical facts, while avoiding less established observations, speculations, and practical details that may be too ephemeral for a textbook.

This book is fundamentally about why benchmarking works. An answer to this question necessarily also reveals important limitations of benchmarks. There’s a lot more, however, that goes into the successful design of a benchmark or the execution of a machine learning competition in practice that I don’t cover. Likewise, there’s a lot more to the broader topic of dataset creation, as well as the broader topic of evaluation. I give pointers to additional reading throughout.

My interest in machine learning benchmarks dates back to collaborations at the Simons Institute for the Theory of Computing in the Fall of 2013. These collaborations led to the development of adaptive data analysis, an area of theoretical computer science that studies the challenges of data-dependent statistical analyses. I’m indebted to my close collaborators at the time, Cynthia Dwork, Vitaly Feldman, Toni Pitassi, Omer Reingold, Aaron Roth, and Jon Ullman, who all shaped my thinking on the topic. Avrim Blum was the first to make the connection between adaptive data analysis and machine learning benchmarks, conjecturing that dataset reuse was less of a concern when the only goal is to identify the best performing model. This observation has been formative for me. We collaborated to formalize and prove this conjecture; the results form a good part of a chapter in this book.

Thanks to an invitation from Percy Liang, I had the good fortune to moderate a panel at NeurIPS 2021 on The Role of Benchmarks in the Scientific Progress of Machine Learning. The participants Lora Aroyo, Sam Bowman, Isabelle Guyon, and Joaquin Vanschoren contributed significant perspectives on the topic that had a lasting influence on me. I frequently come back to my 14-page transcript from the panel. At various points over the last ten years, I benefited from conversations with Sanjeev Arora and Sham Kakade about topics relating to this book. I’m thankful to Ben Recht for our discussions about benchmarks in preparation for our book Patterns, Predictions, and Actions that informed my perspective on the history of pattern recognition and benchmarks. I learned a lot from Ludwig Schmidt about data, robustness, replication, and distribution shift in machine learning. Ludwig also made the connection between Strevens’ Knowledge Machine and machine learning research. Bob Williamson shared his wealth of knowledge generously with me; I took away many insights and pointers, including those to Spendolini and Vincenti.

The second part of the book significantly draws on contributions from my recent collaborators Rediet Abebe, Nikhil Chandak, Ricardo Dominguez-Olmedo, Florian Dorner, Vivian Nastl, Celestine Mendler-Dünner, Olawale Salaudeen, Ali Shirali, Jiduan Wu, and Guanhua Zhang.

I received hugely valuable comments and feedback from Solon Barocas, Nikhil Chandak, Florian Dorner, Ricardo Dominguez-Olmedo, Jakob Förster, Clémentine Fourrier, Reinhard Heckel, Celestine Mendler-Dünner, Vivian Nastl, Joaquin Vanschoren, Gaël Varoquaux, Laura Weidinger, Bob Williamson, Jiduan Wu, and Guanhua Zhang. Hallie Stebbins from Princeton University Press generously advised me with much competence on publishing. Several anonymous reviewers provided substantial comments and suggestions that I’m grateful for. I did my best to address them throughout.

Thanks to the participants of my class on this topic in the Fall of 2024 and 2025 at the University of Tübingen. Special thanks to the graduate instructors Nikhil Chandak, Arkadii Bessonov, Ricardo Dominguez-Olmedo, Shashwat Goel, Luca Morlok, Tom Sühr, and Guanhua Zhang.

Throughout, ChatGPT, Claude, and Gemini assisted with spelling, grammar, coding, matplotlib and tikz figures.
