The Best Paper Award Committee members were nominated by the Program Chairs and the Datasets and Benchmarks track chairs, who selected leading researchers across machine learning topics. These nominations were approved by the General Chairs and the Next Generation and Accessibility Chairs.
The committee was tasked with selecting a handful of highly impactful papers from the Main Track and the Datasets & Benchmarks Track of the conference.
With that, we are excited to share the news that the best and runner-up paper awards this year go to seven groundbreaking papers: four best papers (one of which is from the Datasets & Benchmarks Track) and three runners-up. The seven papers highlight advances in diffusion model theory, self-supervised reinforcement learning, attention mechanisms for large language models, reasoning capabilities in LLMs, online learning theory, neural scaling laws, and benchmarking methodologies for language model diversity.
The winners are presented here in alphabetical order by title.
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Abstract
Large language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. To address this gap, we introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., creative content generation, brainstorm & ideation) that further break down into 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and, even more so, (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that state-of-the-art LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, Infinity-Chat presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research on mitigating the long-term AI safety risks posed by the Artificial Hivemind.
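For readers who want a concrete picture of the homogeneity measurements described above, here is a minimal, illustrative sketch (not the paper's actual metric): it scores intra-model repetition and inter-model homogeneity as the mean pairwise cosine similarity between responses to the same open-ended prompt, using TF-IDF vectors as a stand-in for the semantic embeddings or human judgments a real analysis would rely on.

```python
# Illustrative sketch only: quantify intra-model repetition and inter-model
# homogeneity as mean pairwise cosine similarity between responses. TF-IDF
# vectors stand in for the richer representations a real study would use.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_pairwise_similarity(responses):
    """Average cosine similarity over all pairs of responses to one prompt."""
    vectors = TfidfVectorizer().fit_transform(responses)
    sims = cosine_similarity(vectors)
    pairs = list(combinations(range(len(responses)), 2))
    return sum(sims[i, j] for i, j in pairs) / len(pairs)

# Intra-model: repeated samples from the same model for one open-ended prompt.
samples_model_a = ["Name the cafe 'Driftwood'.", "Call it 'Driftwood Cafe'.", "How about 'The Driftwood'?"]
print("intra-model similarity:", mean_pairwise_similarity(samples_model_a))

# Inter-model: one sample per model for the same prompt.
samples_across_models = ["Call it 'Driftwood Cafe'.", "I'd suggest 'Driftwood'.", "Maybe 'Golden Hour Coffee'?"]
print("inter-model similarity:", mean_pairwise_similarity(samples_across_models))
```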
Reflections from the Selection Committee
This paper makes a substantial and timely contribution to the understanding of diversity, pluralism, and societal impact in modern language models. The authors introduce Infinity-Chat, a rigorously constructed benchmark of 26K real-world open-ended queries paired with 31K dense human annotations, enabling systematic evaluation of creative generation, ideation, and subjective preference alignment (dimensions historically underexamined in AI evaluation). Beyond releasing a valuable dataset, the paper provides deep analytical insights through the first comprehensive taxonomy of open-ended prompts and an extensive empirical study across more than 70 models, revealing the Artificial Hivemind effect: pronounced intra- and inter-model homogenization that raises serious concerns about long-term risks to human creativity, value plurality, and independent thinking. The findings expose critical miscalibration between current reward models, automated judges, and diverse human preferences, highlighting the tension between alignment and diversity and establishing a foundation for future work on preserving heterogeneity in AI systems. Overall, this work sets a new standard for datasets and benchmarks that advance scientific understanding and address pressing societal challenges rather than solely improving technical performance.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
Abstract
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification—applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)—consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates massive activations and attention sinks, and enhances long-context extrapolation performance. We also release the related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research. Furthermore, the most effective SDPA output gating is used in the Qwen3-Next models (https://huggingface.co/collections/Qwen/qwen3-next).
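The architectural change at the heart of the paper is small enough to sketch in a few lines. The following is a minimal illustration, assuming a standard multi-head attention layout: a query-dependent sigmoid gate is applied elementwise to the SDPA output before the output projection. The tensor shapes and the exact gate parameterization here are assumptions for illustration; the authors' released code gives their precise formulation.

```python
# Minimal sketch (illustrative, not the authors' exact implementation):
# a query-dependent, per-head sigmoid gate applied elementwise to the
# scaled dot-product attention (SDPA) output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)   # gate logits computed from the query-side hidden states
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        g = torch.sigmoid(self.gate(x))            # query-dependent sigmoid gate, applied after SDPA
        return self.out(g * attn)

layer = GatedSelfAttention(d_model=64, n_heads=4)
print(layer(torch.randn(2, 10, 64)).shape)         # torch.Size([2, 10, 64])
```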
Reflections from the Selection Committee
The main finding of this paper is that the performance of large language models using softmax attention can be consistently improved by introducing a head-specific sigmoid gate after the scaled dot-product attention (SDPA) operation, in both dense and mixture-of-experts (MoE) Transformer models. This finding is backed up by experiments on more than thirty variants of gated softmax attention, using 15B MoE and 1.7B dense models trained on large-scale datasets of 400B, 1T, or 3.5T tokens. The paper also includes careful analyses showing that the authors' recommended form of gating improves the training stability of large language models, reduces the widely reported "attention sink" phenomenon, and enhances the performance of context-length extension. The main recommendation of the paper is easily implemented, and given the extensive evidence provided for this modification to LLM architectures, we expect the idea to be widely adopted. This paper represents a substantial amount of work that is possible only with access to industrial-scale computing resources. The authors' sharing of their results, which will advance the community's understanding of attention in large language models, is highly commendable, especially at a time when the field has moved away from open sharing of scientific results around LLMs.
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, Benjamin Eysenbach
Abstract
Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2–5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on simulated locomotion and manipulation tasks, our approach increases performance on the self-supervised contrastive RL algorithm by — , outperforming other goal-conditioned baselines. Increasing the model depth not only increases success rates but also qualitatively changes the behaviors learned.
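As a rough illustration of the kind of architecture the abstract describes, a hedged sketch follows: residual MLP blocks stacked to large depth for the state and goal encoders of a goal-conditioned contrastive RL agent. The block layout (Linear, LayerNorm, SiLU with a skip connection) and all sizes are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative sketch of very deep residual encoders for goal-conditioned
# contrastive RL; block structure and sizes are assumptions, not the paper's.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(width, width), nn.LayerNorm(width), nn.SiLU())

    def forward(self, x):
        return x + self.net(x)   # skip connection keeps gradients usable at extreme depth

def deep_encoder(in_dim: int, width: int = 256, depth: int = 64, out_dim: int = 64):
    """Maps a state (or goal) to an embedding; contrastive RL scores state-goal pairs by similarity."""
    blocks = [nn.Linear(in_dim, width)] + [ResidualBlock(width) for _ in range(depth)] + [nn.Linear(width, out_dim)]
    return nn.Sequential(*blocks)

state_enc, goal_enc = deep_encoder(in_dim=17), deep_encoder(in_dim=17)
s, g = torch.randn(8, 17), torch.randn(8, 17)
logits = state_enc(s) @ goal_enc(g).T   # (8, 8) similarity matrix used by a contrastive objective
print(logits.shape)
```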
Reflections from the Selection Committee
This paper challenges the conventional assumption that the learning signal provided by reinforcement learning (RL) is too weak to effectively guide the numerous parameters of deep neural networks, an assumption that has led large AI systems to be trained predominantly through self-supervision, with RL reserved solely for fine-tuning. The work introduces a novel and easy-to-implement RL paradigm for the effective training of very deep neural networks, employing self-supervised and contrastive RL. The accompanying analysis demonstrates that RL can scale efficiently with increasing network depth, leading to the emergence of more sophisticated capabilities. In addition to presenting compelling results, the study includes several useful analyses, for example highlighting the important role of batch-size scaling for deeper networks within contrastive RL.
Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
Tony Bonnaire, Raphaël Urfin, Giulio Biroli, Marc Mezard
Abstract
Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time τ_gen at which models begin to generate high-quality samples, and a later time τ_mem beyond which memorization emerges. Crucially, we find that τ_mem increases linearly with the training set size n, while τ_gen remains constant. This creates a growing window of training times where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when n becomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allows models to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.
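In schematic form, the two timescales and the resulting training window can be summarized as follows (the notation mirrors the abstract; the proportionality constants are model- and data-dependent assumptions, not values from the paper):

```latex
% Schematic summary of the two timescales described above.
\[
  \tau_{\mathrm{gen}} \approx \text{const}, \qquad
  \tau_{\mathrm{mem}} \propto n, \qquad
  \text{generalization window: } \tau_{\mathrm{gen}} < t < \tau_{\mathrm{mem}},
\]
% Stopping training anywhere in this window yields generalizing samples,
% and the window widens linearly as the training set size n grows.
```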
Reflections from the Selection Committee
This paper presents foundational work on the implicit regularization dynamics of diffusion models, delivering a powerful result by unifying empirical observation with formal theory. The critical finding is the quantitative identification of two distinct, predictable timescales: an early, dataset-independent generalization phase, followed by a memorization phase whose onset grows linearly with the size of the training set. This demonstration of an expanding window for effective generalization is not merely an empirical finding but is rigorously explained by deriving the spectral properties of the random features model using random matrix theory. By linking the practical success of diffusion models directly to a provable dynamical property (the implicit postponement of overfitting), the paper provides fundamental, actionable insight into the mechanisms governing modern generative AI, setting a new standard for analytical depth in the study of generalization.
Runners-Up
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly in mathematics and programming tasks. It is widely believed that, similar to how traditional RL helps agents to explore and learn new strategies, RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. In this study, we take a critical look at the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across diverse model families, RL algorithms, and math/coding/visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. While RLVR improves sampling efficiency towards the correct path, we surprisingly find that current training does not elicit fundamentally new reasoning patterns. We observe that while RLVR-trained models outperform their base models at smaller values of k (e.g., k=1), base models achieve higher pass@k scores when k is large. Moreover, we observe that the reasoning capability boundary of LLMs often narrows as RLVR training progresses. Further coverage and perplexity analysis shows that the reasoning paths generated by RLVR models are already included in the base models' sampling distribution, suggesting that their reasoning abilities originate from and are bounded by the base model. From this perspective, treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in fully leveraging the potential of the base model. In contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Taken together, our findings suggest that current RLVR methods have not fully realized the potential of RL to elicit genuinely novel reasoning abilities in LLMs. This underscores the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
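The evaluation hinges on pass@k at large k. For concreteness, here is the standard unbiased pass@k estimator (introduced by Chen et al., 2021, and commonly used for this metric); whether the paper computes it exactly this way is an assumption on our part.

```python
# Standard unbiased pass@k estimator: given n sampled solutions per problem,
# of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn without replacement from n, c correct) solves the problem."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so any k samples must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: a model with low per-sample accuracy can still reach high pass@k at large k.
print(pass_at_k(n=256, c=4, k=1))     # ~0.016
print(pass_at_k(n=256, c=4, k=128))   # ~0.94
```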
Reflections from the Selection Committee
This paper delivers a masterfully executed and critically important negative finding on a widely accepted, foundational assumption in Large Language Model (LLM) research: that Reinforcement Learning with Verifiable Rewards (RLVR) elicits genuinely new reasoning capabilities. The paper shows that RLVR training, across various model families, tasks, and algorithms, enhances sampling efficiency without expanding the reasoning capacity already present in base models. RL narrows exploration: rewarded trajectories are amplified while the broader solution space shrinks, revealing that RLVR optimizes within, rather than beyond, the base distribution. This is an important finding that will hopefully incentivize fundamentally new RL paradigms able to navigate the vast action space and genuinely expand LLM reasoning capabilities.
Optimal Mistake Bounds for Transductive Online Learning
Zachary Chase, Steve Hanneke, Shay Moran, Jonathan Shafer
Abstract
We resolve a 30-year-old open problem concerning the power of unlabeled data in online learning by tightly quantifying the gap between transductive and standard online learning. We prove that for every concept class with Littlestone dimension d, the transductive mistake bound is at least Ω(√d), an exponential improvement over the previous logarithmic lower bounds due to Ben-David, Kushilevitz, and Mansour (1995, 1997) and Hanneke, Moran, and Shafer (2023). We also show that this bound is tight: for every d, there exists a class of Littlestone dimension d with transductive mistake bound O(√d), which also improves the best previously known upper bound from Ben-David et al. (1997). These results demonstrate a quadratic gap between transductive and standard online learning, thereby highlighting the benefit of advance access to the unlabeled instance sequence. This stands in stark contrast to the PAC setting, where transductive and standard learning exhibit similar sample complexities.
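Put schematically, and using notation introduced here for illustration (d for the Littlestone dimension of a class H, M_std and M_trans for the optimal mistake bounds in the standard and transductive online models), the result reads:

```latex
% Schematic of the gap discussed above; notation is ours for illustration.
\[
  M_{\mathrm{std}}(\mathcal{H}) = d \quad\text{(Littlestone)},
  \qquad
  M_{\mathrm{trans}}(\mathcal{H}) \;\ge\; \Omega\bigl(\sqrt{d}\bigr)
  \quad\text{for every } \mathcal{H},
\]
\[
  \text{and there exist classes with } M_{\mathrm{trans}}(\mathcal{H}) = O\bigl(\sqrt{d}\bigr),
\]
% so advance knowledge of the unlabeled instance sequence can reduce the number
% of mistakes at most quadratically, and that quadratic gap is achievable.
```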
Reflections from the Selection Committee
This paper presents a breakthrough in learning theory, deserving the NeurIPS Best Paper Runner-Up award for its elegant, comprehensive, and definitive resolution of a 30-year-old open problem. The authors have not only precisely quantified the optimal mistake bound for transductive online learning as Ω(√d), but they have also achieved a tight match with an O(√d) upper bound. This establishes a quadratic gap between transductive and standard online learning, a result that represents an exponential leap beyond all previous logarithmic lower bounds and dramatically highlights the theoretical value of unlabeled data in this setting—a crucial insight distinct from its more limited role in PAC learning.
The novelty and ingenuity of their proof techniques are quite remarkable. For the lower bound, the adversary employs a sophisticated strategy that balances forcing mistakes with carefully managing the shrinking of the version space, leveraging the concept of “paths in trees” as a fundamental underlying structure. The upper bound, demonstrating the learnability within O(√d) mistakes, introduces an innovative hypothesis class construction that embeds a “sparse encoding” for off-path nodes – a probabilistic design where most off-path labels are zero, but the rare ones carry immense information. The learner’s strategy to exploit this class is equally brilliant, integrating several non-standard sophisticated techniques: “Danger Zone Minimization” to control the instance sequence presented by the adversary, “Splitting Experts” via a multiplicative weights approach to handle uncertainty about a node’s on-path status, and a strategic “Transition to Halving” once sufficient information is gathered from the sparsely encoded off-path labels. This intricate interplay between a cleverly constructed hypothesis class and a highly adaptive learning algorithm showcases a masterclass in theoretical analysis and design.
Superposition Yields Robust Neural Scaling
Yizhou Liu, Ziming Liu, Jeff Gore
Abstract
The success of today’s large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law, that loss decreases as a power law with model size, remains unclear. We propose that representation superposition, meaning that LLMs represent more features than they have dimensions, can be a key contributor to loss and cause neural scaling. Based on Anthropic’s toy model, we use weight decay to control the degree of superposition, allowing us to systematically study how loss scales with model size. When superposition is weak, the loss follows a power law only if data feature frequencies are power-law distributed. In contrast, under strong superposition, the loss generically scales inversely with model dimension across a broad class of frequency distributions, due to geometric overlaps between representation vectors. We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling inversely with model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.
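In schematic terms, the two regimes described in the abstract can be summarized as follows (notation introduced here for illustration; the exponents and constants are assumptions, not values taken from the paper):

```latex
% Schematic of the scaling regimes described above: m is the model/embedding
% dimension, L the loss, and p_i the frequency of data feature i.
\[
  \text{weak superposition:}\quad L(m)\ \text{follows a power law in } m
  \ \text{only if the } p_i \text{ are power-law distributed;}
\]
\[
  \text{strong superposition:}\quad L(m)\ \propto\ \frac{1}{m}
  \quad\text{for a broad class of feature-frequency distributions.}
\]
% The 1/m behavior arises from geometric overlaps between the many feature
% vectors packed into an m-dimensional representation space.
```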
Reflections from the Selection Committee
This paper moves beyond observation of neural scaling laws—the empirically established phenomenon in which model loss exhibits a power-law decrease as model size, dataset size, or computational resources are increased—to demonstrate that representation superposition constitutes the primary mechanism governing these laws. The authors introduce a controlled "toy model" to examine how superposition and data structure affect the scaling of loss with model size, and demonstrate that under strong superposition, where features overlap, the loss consistently scales as an inverse power law with respect to the model dimension. The core findings are supported by a series of carefully designed experiments and offer fresh insights into an important research area.
The selection of these papers reflects the remarkable breadth of research presented at NeurIPS 2025, spanning generative modeling, reinforcement learning, natural language processing, learning theory, neural scaling, and benchmarking methodologies. The diversity of topics among the awarded papers demonstrates the vibrant and multifaceted nature of machine learning research.
We extend our congratulations to all the award recipients and look forward to seeing these works presented at the conference this December! Please note that the award certificates will be given out during the papers' respective oral sessions by the session chairs.
We would also like to extend our gratitude and appreciation to the members of the Best Paper Award Committee, listed below.
Best Paper Award Committee for the Main Track and the Datasets and Benchmarks Track
- Jacob Andreas (MIT, United States)
- Sander Dieleman (Google DeepMind, UK)
- Dilek Hakkani-Tur (University of Illinois Urbana-Champaign, United States)
- Brian Kingsbury (IBM, United States)
- Mirella Lapata (University of Edinburgh, Scotland)
- Vincent Lepetit (Ecole des Ponts ParisTech, France)
- Ulrich Paquet (AIMS & Google DeepMind, South Africa)
- Violet Peng (UCLA, United States)
- Doina Precup (McGill University, Canada)
- Masashi Sugiyama (RIKEN & University of Tokyo, Japan)
- Vincent Tan (National University of Singapore, Singapore)
- Yee Whye Teh (University of Oxford, United Kingdom)
- Xing Xie (Microsoft, China)
- Luke Zettlemoyer (University of Washington/Meta, United States)