Have you noticed that some ideas in AI take off not necessarily because they’re better, but because they align better with our machines? That’s the hardware lottery: if your approach happens to align with the dominant hardware/software, you hit the jackpot; otherwise, better luck next time.
Yet this lottery isn’t a one-time draw. As Sara Hooker puts it, “we may be in the midst of a present-day hardware lottery”. The catch? Modern chips zero in on DNNs’ commercial sweet spots. They are exceptionally good at cranking through heavy-duty MatMuls and the garnish ops, such as non-linear functions, that keep you from getting hit by Amdahl’s Law. Pitch an off-road idea, and it is at best a high-risk long shot. The “winners” in research often align with what our tools run best, and that tendency can skew the trajectory of technological progress.
This post draws attention to how generality and programmability are often underemphasized on today’s accelerators, a pattern that risks stifling future algorithmic innovation unless it is actively addressed.
It’s no coincidence that almost every AI breakthrough involves some kind of NN crunching numbers on xPUs. These chips have made a particular form of MatMul the de facto currency of AI. Crucially, the performance gains have come not just from accelerating MatMul itself, but from the realization that AI algorithms are resilient to reduced precision. Because ML frameworks are built around tensor operations, any problem reformulated as a sequence of MatMul ops instantly taps decades of compiler optimizations and accelerator infrastructure. In fact, we practice a pragmatic “MatMul-reduction,” much like NP-reductions, converting complex tasks into chained MatMuls. But we haven’t shown that all aspects of intelligence reduce neatly to MatMul, and by rewarding only MatMul-friendly ideas, we risk creating powerful yet brittle approximations that trap us in a local minimum.
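To make the “MatMul-reduction” idea concrete, here is a minimal NumPy sketch (ours, purely illustrative; the function name and shapes are arbitrary) that lowers a small 2D convolution onto a single matrix multiply via the classic im2col trick:

```python
import numpy as np

def conv2d_as_matmul(x, w):
    """Lower a single-channel 2D convolution (no padding, stride 1)
    to one MatMul via im2col. x has shape (H, W); w has shape (kH, kW)."""
    H, W = x.shape
    kH, kW = w.shape
    out_h, out_w = H - kH + 1, W - kW + 1
    # Gather every kH x kW patch into one row: shape (out_h*out_w, kH*kW).
    patches = np.stack([x[i:i + kH, j:j + kW].ravel()
                        for i in range(out_h) for j in range(out_w)])
    # The whole convolution is now a single matrix-vector multiply.
    return (patches @ w.ravel()).reshape(out_h, out_w)

x, w = np.random.rand(6, 6), np.random.rand(3, 3)
# Cross-check against a direct sliding-window implementation.
ref = np.array([[np.sum(x[i:i+3, j:j+3] * w) for j in range(4)] for i in range(4)])
assert np.allclose(conv2d_as_matmul(x, w), ref)
```

Once a workload is phrased this way, it inherits every MatMul optimization the stack already has, from compiler tiling to tensor cores.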
Researchers naturally gravitate to methods that run efficiently on existing hardware and tools, while proposals that stray from the matrix-heavy paradigm face an uphill battle. These ideas must reach a kind of implementation escape velocity to land in a real chip. In effect, AI-specific chips create technological inertia. If tomorrow someone conceives an AI method that isn’t MatMul-centric, would it ever get a fair evaluation without hardware designed to support it?
None of this is to say MatMul accelerators are “bad”; far from it, they enabled the deep learning revolution. Yet their success has created a textbook example of the Matthew Effect: hardware favors certain algorithms, those algorithms dominate, and we then build even more specialized chips for them. That feedback loop is virtuous if you buy into the current paradigm, but vicious if it causes us to overlook alternative approaches.
So who’s running this casino, anyway? ML accelerators now dictate who gets to pursue what research. It’s no longer just about building better algorithms but about securing access to large clusters, and today there is an increasing emphasis on chasing short-term returns on proven approaches rather than risking investment in wild new paradigms. Innovation can suffer, steered by what’s profitable and available rather than what’s intellectually promising.
History also provides a grim reminder that “computer architecture history is littered with the corpses of special-purpose machines.” These were often brilliant ideas defeated by the one-two punch of a narrow market and Moore’s Law economies of scale.
Today, materializing an idea is not just about access to spare xPUs, it’s about meeting the table stakes required to build chips at scale. Capital has become the ultimate gatekeeper for what gets researched. If Cloud Company X has a thousand xPUs idling, research that fits those machines is more likely to get done. If Idea Y would require building a whole new kind of chip or using hardware that nobody’s selling, it might never leave the drawing board.
With MatMul machines everywhere, one might ask: is the hardware lottery still a thing? After all, if virtually everybody in AI is playing on the same hardware and running the same kinds of applications, maybe it’s not a lottery anymore but a planned economy: “Matrix math won, alternative approaches need not apply.” 2024 Turing Laureate Rich Sutton epitomized this in “The Bitter Lesson”: the algorithms that scale best with compute will inevitably win, and so far DNNs have scaled very well. By this logic, it’s no accident or luck that deep learning is on top; it won fair and square by delivering results when thrown onto big hardware. In a world where more data + bigger models = better performance, focusing on one ubiquitous hardware type could simply accelerate progress for everyone.
If everyone concentrates on improving one kind of platform, you get compounding efficiencies. Maybe we don’t need radically different hardware paradigms; we just need to make the winning hardware cheaper and more accessible. On standardized AI hardware, ideas compete on a more level playing field of implementation. In that scenario, the hardware lottery effectively ends. It ceases to be a game of chance and becomes what Thomas Kuhn called a dominant paradigm. MatMul-centric computing defines the “normal science” of the field, creating a powerful bias towards ideas that fit the existing model and treating those that don’t as mere anomalies.
However, many caution that this very uniformity could be lulling us into a false sense of security. The current domination of matrix-multiply-centric AI might just be masking the hardware lottery’s fangs. We won’t notice the lottery until the day we desperately need a different kind of hardware and realize we haven’t been buying those tickets. As Hooker notes, today’s specialization makes it “far more costly to stray from accepted building blocks”, implicitly pressuring researchers to stick to ideas that fit the hardware. The danger is that we might overfit to our hardware, optimizing our whole intellectual landscape around what runs fast on a GPU and leaving ourselves blind to ideas that would require something fundamentally new (e.g., training DistBelief on tens of thousands of CPU cores vs. AlexNet on two GPUs, both in 2012).
What if the next breakthrough doesn’t look anything like a giant MatMul? Neurobiology points to sparse, event-driven primitives in the brain, nothing that resembles training a deep Transformer. Kaplan et al. also report that doubling the MatMul compute produces only around a 5% improvement in loss: big hardware effort, modest algorithmic return. If you suspect we will hit a wall on the road to human-level intelligence, it’s reasonable to expect that somewhere out there is a different algorithmic path, one that might need its own kind of hardware, and attention, to truly flourish.
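As a rough back-of-the-envelope (our arithmetic; the compute-scaling exponent is an assumption taken from the fits reported in that work), loss follows a power law in compute, so each doubling buys only a small constant-factor improvement:

```latex
L(C) \propto C^{-\alpha_C}, \qquad \alpha_C \approx 0.05
\quad\Longrightarrow\quad
\frac{L(2C)}{L(C)} = 2^{-\alpha_C} \approx 0.97
```

That is, a few percent of loss reduction per doubling of compute, with the exact figure depending on which fitted exponent one uses.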
There are already glimmers of such paradigms. But they’re far from mainstream, partly because the entire ecosystem (funding, talent, compilers, access) orbits around the incumbent tech: no one uses a different hardware paradigm because it’s not supported, and it’s not supported because no one’s using it. If matrix-multiply AI is the only game in town, we risk a kind of innovation monoculture. Monocultures can be efficient, but they’re also brittle: one fundamental limitation of our favored approach could stall progress.
Taking a step back, should we focus our effort more on hardware, or on software and algorithms? From the 1980s until 2020, Moore’s Law delivered an impressive 30,000x speedup (Fig. 2a, Hardware Improvement). However, consider the example of k-d trees, developed to accelerate nearest-neighbor queries. This single algorithmic breakthrough delivered a speedup comparable to decades’ worth of hardware advancements. Retrospective data from MIT shows algorithm families jumping from O(N^2) to O(N) at a rate of roughly 0.5% per year. Enabling such a breakthrough for Transformers seems important, but focusing on dense linear algebra and chasing (comparatively modest) gains from hardware is unlikely to get us there.
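To see why an algorithmic jump of this kind can rival decades of silicon, here is a small illustrative sketch (ours; it uses SciPy’s cKDTree, and the dataset sizes are arbitrary) comparing brute-force nearest-neighbor search against a k-d tree:

```python
import time
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((200_000, 3))   # database of 3-D points
queries = rng.random((1_000, 3))    # query points

# Brute force: every query scans all N points, O(N) work per query.
t0 = time.perf_counter()
brute_idx = np.array([np.argmin(np.sum((points - q) ** 2, axis=1)) for q in queries])
t_brute = time.perf_counter() - t0

# k-d tree: roughly O(log N) expected work per query in low dimensions.
tree = cKDTree(points)
t0 = time.perf_counter()
_, tree_idx = tree.query(queries, k=1)
t_tree = time.perf_counter() - t0

assert np.array_equal(brute_idx, tree_idx)
print(f"brute force: {t_brute:.2f}s, k-d tree: {t_tree:.3f}s")
```

The tree answers the same queries in a small fraction of the time, an asymptotic win that no amount of extra MatMul throughput can match.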
So, is the hardware lottery less relevant now? Or more relevant than ever? It might be less visible day-to-day because one approach dominates, but that dominance itself could be the biggest lottery effect of all. The fact that everything is so aligned on one type of hardware means if you’re working on anything else, you’re effectively locked out of the casino. And if the future of AI needs a different casino? We’ll wish we hadn’t put all our chips on one table.
The conventional wisdom boils down to two options:
- Let a thousand flowers bloom. Support research into non-xPU machines and their matching algorithms. A diverse hardware ecosystem makes the field more resilient and full of surprises. Academia, with its long research horizons and tolerance for risk, is the natural lead here, with industry providing sustained resources and scale.
- Go all-in. This strategy treats the hardware lottery like a casino game where one hand has already won big, and pushes all the chips, financial and silicon alike, into making sure everyone can play that winning hand. That means driving down cost, improving energy efficiency, and expanding access. The risk? We may simply be building a wider, more comfortable road into a cul-de-sac.
Both have merit, and Chris Lattner captures this view: if we want AI to keep advancing, we must “expand access to alternative hardware, [maximize] efficiency on existing systems, and [accelerate] software innovation”; otherwise we risk hitting a wall where AI progress is bottlenecked by hardware.
Beyond these two extremes, a pragmatic middle path would be to add generality and programmability to the specialized winners. This approach asks us to evolve the winning design into more general hardware and to broaden (or redesign) the scope of algorithms it serves. History gives precedent. The original GPGPU was a graphics chip with enough generality for scientific computing. Jensen Huang didn’t build a new machine from scratch; he evolved his existing one by betting on programmability, adding generality to a specialized chip, a lottery ticket that paid off handsomely when AlexNet emerged. More recently, products like Nvidia’s Grace-based CPU-GPU superchips continue this philosophy: pairing a specialized accelerator with general-purpose capability.
This generalist route is promising but also the hardest. It faces the classic chicken-and-egg problem. One path to break the deadlock is to use today’s ML and algorithm-discovery tools to search the co-design space, letting models propose microarchitectures and algorithms that are jointly efficient. This isn’t just about optimizing the current paradigm; it is about asking AI to help us discover the “winning” hardware primitives for the next one.
On the system front, YouTube’s Video Coding Unit (VCU) intentionally bakes in “only the computationally expensive infrequently-changing aspects of the system”, leaving the faster-changing logic in software. On the algorithm front, work such as Fast Feedforward Networks shows you can replace large dense feed-forward MatMuls with log-time, tree-based conditional execution, a different primitive that maps much better to sparse/event-driven or memory-centric hardware. And on the hardware front, designs like Stella Nera demonstrate multiplier-free, lookup/add-based accelerators that recast matrix multiplication into a very different hardware primitive, proof that alternative compute substrates can be both efficient and practical.
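To give a flavor of the tree-based primitive, here is a toy NumPy sketch (ours, not the Fast Feedforward Networks code; all sizes and names are made up) in which tiny linear routers send each input down a binary tree and only one small leaf expert runs, so inference costs O(depth) dot products plus one narrow MLP instead of one wide dense MatMul:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, depth, leaf_width = 64, 4, 16
n_nodes, n_leaves = 2 ** depth - 1, 2 ** depth

# Tiny routers at internal nodes; small MLP experts at the leaves.
routers = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
w1 = rng.standard_normal((n_leaves, d_model, leaf_width)) / np.sqrt(d_model)
w2 = rng.standard_normal((n_leaves, leaf_width, d_model)) / np.sqrt(leaf_width)

def fff_forward(x):
    """Route one token x (shape: d_model) down the tree, then run one leaf expert."""
    node = 0
    for _ in range(depth):
        go_right = routers[node] @ x > 0          # hard routing decision
        node = 2 * node + (2 if go_right else 1)  # heap-style child index
    leaf = node - n_nodes                         # leaves come after internal nodes
    h = np.maximum(x @ w1[leaf], 0.0)             # small ReLU expert
    return h @ w2[leaf]

y = fff_forward(rng.standard_normal(d_model))
print(y.shape)  # (64,)
```

Because each token touches only one leaf, the access pattern is sparse and data-dependent, exactly the kind of workload that rewards event-driven or memory-centric designs more than a dense MatMul engine.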
This offers a simple litmus test: if hardware cannot adapt to run new approaches as they emerge, it has already become too specialized.
The hardware lottery has taught us that progress is not merely about brilliant ideas, but about the platforms that give those ideas life. We cannot afford to let the inertia of our current success steer us into a monoculture. Rather than choosing between exotic new hardware and wider access to the old, let’s make a strategic bet on evolution. By building accelerators with broader generality, we don’t discard our winning ticket; we hedge our bets by adding new numbers. The true jackpot isn’t raw speed, but the hardware-and-algorithm duo that unlocks the next computing era. Stop spinning the wheel and start redesigning the machine.
Acknowledgements:
We want to thank our colleagues at Google DeepMind and across Google for their valuable feedback and insights while developing the ideas for this post.
About the Author:
Amir Yazdanbakhsh is a Research Scientist at Google DeepMind, working at the intersection of machine learning and computer architecture. His primary focus is on applying machine learning to design efficient and sustainable computing systems, from leading the development of large-scale distributed training systems on TPUs to shaping the next generation of Google’s ML accelerators. His research on using AI to solve performance challenges in hyper-scale systems received an IEEE Micro Top Picks award.
Jan Wassenberg is a Senior Staff Software Engineer at Google DeepMind. Over the past 20 years, Jan has applied SIMD and vectorization to a wide range of domains. His work includes founding the open source Highway library for performance-portable SIMD; developing vqsort, the fastest known sort for 64/128-bit integers; devising Randen, a CSPRNG sufficiently efficient to serve as Google’s default RNG; and leading the open source gemma.cpp project for LLM inference on CPU.
Authors’ Disclaimer:
Portions of this post were edited with the assistance of AI models. Some references and notes were also compiled using AI tools. The content represents the opinions of the authors and does not necessarily represent the views, policies, or positions of Google DeepMind or its affiliates.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.