Nvidia GPU roadmap confirms it: Moore's Law is dead and buried

Original link: https://www.theregister.com/2025/03/29/nvidia_moores_law/

At GTC, Nvidia laid out its GPU roadmap through 2028, revealing its strategy for coping with the slowdown of Moore's Law. Facing process-technology limits, Nvidia is scaling up silicon density with rack-scale systems, targeting 576 GPUs and 600kW of power per rack by 2027. That means more dies per package, more memory capacity, and more bandwidth (HBM4e). Low-precision compute still delivers gains, but diminishing returns are forcing architectural trade-offs, such as prioritizing 4-bit floating point over double precision. The extreme power demands call for "AI factories" — purpose-built datacenters co-designed with partners to handle cooling and power — driving heavy infrastructure investment. Hitting these challenges early lets Nvidia shape future datacenter design. By publishing its roadmap, Nvidia gives infrastructure providers time to prepare, but it also clears the way for rivals like AMD and Intel. The pursuit of extreme compute density makes existing datacenters hard to adapt to the new power demands, in turn fueling demand for more "AI factories" and better energy solutions.

Summary of a Hacker News thread, focusing on the key points and discussion: The Register article argues that Moore's Law is dead, and Nvidia's GPU roadmap proves it. Performance gains no longer come mainly from higher transistor density, but from techniques such as chiplet designs, smaller data formats (4-bit), more on-package memory, and higher memory bandwidth. Nvidia is also packing more GPUs into each rack, driving up power consumption and straining datacenter infrastructure. One commenter noted that each of these approaches has limited headroom, suggesting the rapid performance doubling of the past is coming to an end. Another pointed to the enormous energy cost of Meta's AI datacenters, questioning its sustainability claims in the face of rising energy demand; they cited Meta's net-zero target and worried that the rapid buildout of AI is putting it at risk. A final comment questioned whether continued scaling is even necessary, citing GPT-4.5 as a possible sign that bigger isn't always better.

Original article

Comment As Jensen Huang is fond of saying, Moore's Law is dead – and at Nvidia GTC this month, the GPU-slinger's chief exec let slip just how deep in the ground the computational scaling law really is.

Standing on stage, Huang revealed not just the chip designer's next-gen Blackwell Ultra processors, but a surprising amount of detail about its next two generations of accelerated computing platforms, including a 600kW rack scale system packing 576 GPUs. We also learned an upcoming GPU family, due to arrive in 2028, will be named after Richard Feynman. Surely you're joking!

It's not that unusual for chipmakers to tease their roadmaps from time to time, but we usually don't get this much information all at once. And that's because Nvidia is stuck. It's run into not just one roadblock but several. Worse, apart from throwing money at the problem, they're all largely out of Nvidia's control.

These challenges won't come as any great surprise to those paying attention. Distributed computing has always been a game of bottleneck whack-a-mole, and AI might just be the ultimate mole hunt.

It's all up and out from here

The first and most obvious of these challenges revolves around scaling compute. 

Advancements in process technology have slowed to a crawl in recent years. While there are still knobs to turn, they're getting exponentially harder to budge.

Faced with these limitations, Nvidia's strategy is simple: scaling up the amount of silicon in each compute node as far as they can. Today, Nvidia's densest systems, or really racks, mesh 72 GPUs into a single compute domain using its high-speed 1.8TB/s NVLink fabric. Eight or more of these racks are then stitched together using InfiniBand or Ethernet to achieve the desired compute and memory capacity. 

At GTC, Nvidia revealed its intention to boost this to 144 and eventually 576 GPUs per rack. However, scaling up isn't limited to racks; it's also happening on the chip package.

This became obvious with the launch of Nvidia's Blackwell accelerators a year ago. The chips boasted a 5x performance uplift over Hopper, which sounded great until you realized it took twice the die count, a new 4-bit datatype, and 500 watts more power to get there.

The reality was, normalized to FP16, Nvidia's top-specced Blackwell dies are only about 1.25x faster than a GH100 at 1,250 dense teraFLOPS versus 989 — there just happened to be two of them.
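The back-of-envelope math behind that comparison is straightforward, using the dense FP16 figures quoted above:

```python
# Rough check of Blackwell's per-die uplift over Hopper, normalized to
# dense FP16, using the teraFLOPS figures quoted in the article.
hopper_fp16_tflops = 989       # GH100, dense FP16
blackwell_fp16_tflops = 1250   # per Blackwell die, dense FP16

per_die_uplift = blackwell_fp16_tflops / hopper_fp16_tflops
package_uplift = per_die_uplift * 2    # two dies per Blackwell package

print(f"per die: {per_die_uplift:.2f}x, per package: {package_uplift:.2f}x")
```

So most of the headline gain comes from simply bolting two dies into one package, not from each die getting dramatically faster.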

By 2027 Nvidia CEO Jensen Huang expects racks to surge to 600kW with the debut of the Rubin Ultra NVL576


We don't yet know what process tech Nvidia plans to use for its next-gen chips, but what we do know is that Rubin Ultra will continue this trend, jumping from two reticle-limited dies to four. Even with the roughly 20 percent increase in efficiency Huang expects to get out of TSMC's 2nm process, that's still going to be one hot package.

It's not just compute either; it's memory too. The eagle-eyed among you might have noticed a rather sizable jump in capacity and bandwidth between Rubin and Rubin Ultra — 288GB per package versus 1TB. Roughly half of this comes from faster, higher-capacity memory modules, but the other half comes from a doubling of the amount of silicon dedicated to memory, from eight modules on Blackwell and Rubin to 16 on Rubin Ultra.

Higher capacity means Nvidia can cram more model parameters, around 2 trillion at FP4, into a single package or 500 billion per "GPU" since they're counting individual dies now instead of sockets. HBM4e also looks to effectively double the memory bandwidth over HBM3e. Bandwidth is expected to jump from around 4TB/s per Blackwell die today to around 8TB/s on Rubin Ultra.
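Those parameter counts fall out of simple arithmetic — 4-bit weights take half a byte each, and Rubin Ultra packs four dies per package:

```python
# Parameter-capacity math for a Rubin Ultra package, assuming FP4
# weights (0.5 bytes each) and the 1TB HBM figure quoted above.
hbm_bytes = 1e12          # 1 TB of HBM per package
bytes_per_param = 0.5     # 4-bit weights
dies_per_package = 4      # Rubin Ultra's reticle-limited dies

params_per_package = hbm_bytes / bytes_per_param
params_per_die = params_per_package / dies_per_package

print(f"{params_per_package / 1e12:.0f}T params per package, "
      f"{params_per_die / 1e9:.0f}B per 'GPU' (die)")
```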

Unfortunately, short of a major breakthrough in process tech, it's likely that future Nvidia GPU packages will pack on even more silicon.

The good news is that process advancements aren't the only way to scale compute or memory. Generally speaking, dropping from say 16-bit to 8-bit precision effectively doubles the throughput while also halving the memory requirements of a given model. The problem is Nvidia is running out of bits to drop to juice its performance gains. From Hopper to Blackwell, Nvidia dropped four bits, doubled the silicon, and claimed a 5x floating point gain.
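The scaling is easy to see in a model's weight footprint. A quick sketch, assuming a hypothetical 70-billion-parameter model (an illustrative figure, not one from the article):

```python
# How precision scales a model's weight footprint: halving the bit
# width halves the bytes needed per parameter.
params = 70e9  # hypothetical 70B-parameter model (illustrative)

footprint_gib = {bits: params * bits / 8 / 2**30 for bits in (16, 8, 4)}

for bits, gib in footprint_gib.items():
    print(f"FP{bits}: {gib:,.0f} GiB of weights")
```

Each step down the precision ladder halves memory and, all else being equal, doubles arithmetic throughput — which is exactly why Nvidia keeps reaching for smaller datatypes.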

But below four-bit precision, LLM inference gets pretty rough, with rapidly climbing perplexity scores. That said, there's some interesting research being done around super low precision quantization, as low as 1.58 bits while maintaining accuracy.
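A toy experiment shows why error balloons at very low bit widths. This is plain textbook round-to-nearest symmetric quantization on random weights, not Nvidia's scheme or any production quantizer:

```python
import numpy as np

# Toy symmetric quantization: quantize weights to a signed integer
# grid, dequantize, and measure the RMS reconstruction error.
rng = np.random.default_rng(0)
weights = rng.standard_normal(10_000).astype(np.float32)

def quantize_error(w, bits):
    levels = 2 ** (bits - 1) - 1          # max signed integer level
    scale = np.abs(w).max() / levels
    wq = np.round(w / scale) * scale      # quantize, then dequantize
    return float(np.sqrt(np.mean((w - wq) ** 2)))

for bits in (8, 4, 2):
    print(f"{bits}-bit RMS error: {quantize_error(weights, bits):.4f}")
```

The error climbs sharply as the grid coarsens, which is why sub-4-bit schemes like the 1.58-bit work mentioned above need cleverer tricks than simple rounding.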

Reduced precision isn't the only way to pick up FLOPS, though. You can also dedicate less die area to higher-precision datatypes that AI workloads don't need.

We saw this with Blackwell Ultra. Ian Buck, VP of Nvidia's accelerated computing business unit, told us in an interview they actually nerfed the chip's double-precision (FP64) tensor core performance in exchange for 50 percent more 4-bit FLOPS.

Whether this is a sign that FP64 is on its way out at Nvidia remains to be seen, but if you really care about double-precision grunt, AMD's GPUs and APUs probably should be at the top of your list anyway.

In any case, Nvidia's path forward is clear: its compute platforms are only going to get bigger, denser, hotter and more power hungry from here on out. As a calorie deprived Huang put it during his press Q&A last week, the practical limit for a rack is however much power you can feed it.

"A datacenter is now 250 megawatts. That's kind of the limit per rack. I think the rest of it is just details," Huang said. "If you said that a datacenter is a gigawatt, and I would say a gigawatt per rack sounds like a good limit."

No escaping the power problem

Naturally, 600kW racks pose one helluva headache for datacenter operators.

To be clear, chilling megawatts of ultra-dense compute isn't a new problem. The folks at Cray, Eviden, and Lenovo have had that figured out for years. What's changed is we're not talking about a handful of boutique compute clusters a year. We're talking dozens of clusters, some so large they'd dethrone the Top500's most powerful supers — if tying up 200,000 Hopper GPUs to run Linpack made any money, anyway.

At these scales, highly specialized, low-volume thermal management and power delivery systems simply aren't going to cut it. Unfortunately, the datacenter vendors — you know, the folks selling the not-so-sexy bits and bobs you need to make those multimillion-dollar NVL72 racks work — are only now catching up with demand.

We suspect this is why so many of the Blackwell deployments announced so far have been for the air-cooled HGX B200 and not for the NVL72 Huang keeps hyping. These eight-GPU HGX systems can be deployed in many existing H100 environments. Nvidia has been doing 30-40kW racks for years, so jumping to 60kW just isn't that much of a stretch — and if it is, dropping down to two or three servers per rack is still an option.


The NVL72 is a rackscale design inspired heavily by the hyperscalers with DC bus bars, power sleds, and networking out the front. And at 120kW of liquid cooled compute, deploying more than a few of these things in existing facilities gets problematic in a hurry. And this is only going to get even more difficult once Nvidia's 600kW monster racks make their debut in late 2027.

This is where those "AI factories" Huang keeps rattling on about come into play — purpose built datacenters designed in collaboration with partners like Schneider Electric to cope with the power and thermal demands of AI.

And surprise, surprise, a week after Nvidia detailed its GPU roadmap for the next three years, Schneider announced a $700 million expansion in the US to boost production of all the power and cooling kit necessary to support it.

Of course, having the infrastructure necessary to power and cool these ultra dense systems isn't the only problem. So is getting the power to the datacenter in the first place, and once again, this is largely out of Nvidia's control.

Anytime Meta, Oracle, Microsoft, or anyone else announces another AI bit barn, a juicy power purchase agreement usually follows. Meta's mega DC being birthed in the bayou was announced alongside a 2.2GW gas generator plant — so much for those sustainability and carbon neutrality pledges. 

And as much as we want to see nuclear make a comeback, it's hard to take small modular reactors seriously when even the rosiest predictions put deployments somewhere in the 2030s.

Follow the leader

To be clear, none of these roadblocks are unique to Nvidia. AMD, Intel, and every other cloud provider and chip designer vying for a slice of Nvidia's market share are bound to run into these same challenges before long. Nvidia just happens to be one of the first to run up against them.

While this certainly has its disadvantages, it also puts Nvidia in a somewhat unique position to shape the direction of future datacenter power and thermal designs.

As we mentioned earlier, the reason Huang was willing to reveal Nvidia's next three generations of GPU tech and tease a fourth is so its infrastructure partners are ready to support them when they finally arrive.

"The reason why I communicated to the world what Nvidia's next three, four year roadmap is now everybody else can plan," Huang said.

On the flip side, these efforts also serve to clear the way for competing chipmakers. If Nvidia designs a 120kW, or now 600kW, rack and colocation providers and cloud operators are willing to support that, AMD or Intel now has the all-clear to pack just as much compute into their own rack-scale platforms without having to worry about where customers are going to put them. ®
