Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system

Original link: https://www.tomshardware.com/tech-industry/semiconductors/alibaba-says-new-pooling-system-cut-nvidia-gpu-use-by-82-percent

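## Alibaba Cloud's Aegaeon improves LLM inference efficiency

Alibaba Cloud's newly introduced Aegaeon system substantially improves serving efficiency for large language models (LLMs), cutting the number of Nvidia GPUs required by 82% in a beta test inside Model Studio. Described in a peer-reviewed paper, Aegaeon is an inference-time scheduler that virtualizes GPU access at the token level, allowing multiple models to share a single accelerator. This "pooling" approach significantly raises GPU utilization and "goodput" (effective output), with gains of up to nine times over existing serverless systems. In tests on Nvidia H20 GPUs serving dozens of LLMs (with up to 72 billion parameters), the number of GPUs required fell from 1,192 to just 213. The gains come from packing multiple models onto each GPU and allocating compute dynamically as output is generated. Although the results may reflect optimizations specific to Alibaba Cloud's integrated infrastructure, the technique could bring significant benefits to other cloud providers facing constrained GPU supply and growing inference demand.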

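## Alibaba Cloud and Chinese tech innovation: Hacker News summary

The Hacker News discussion centers on Alibaba Cloud's claimed 82% reduction in Nvidia GPU use and what it says about China's growing technological independence. Users speculate that this is a result of U.S. attempts to hinder China's technological progress, which ironically *forced* innovation in a different direction. Many commenters think this "forced innovation" could raise efficiency globally, particularly if Chinese companies keep open-sourcing their advances. Some worry that geopolitical factors could cause technical standards to diverge, much like the historical conflict between video standards (PAL vs. NTSC). Others are optimistic that a split technology landscape could accelerate progress through a kind of competitive "A/B testing", a "yin and yang" approach. The conversation also notes China's rapid progress in AI model development, with a growing number of SOTA models now coming from Chinese labs rather than Western ones. Commenters draw comparisons with post-World War II Japan, where restrictions spurred resourcefulness and innovation, and observations from Latin America confirm the growing prevalence of Chinese-made products. Links for further reading on related topics were also shared.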

Alibaba Cloud claims its new Aegaeon pooling system reduced the number of Nvidia GPUs required to serve large language models by 82% during a multi-month beta test inside its Model Studio marketplace. The result, published in a peer-reviewed paper presented at the 2025 ACM Symposium on Operating Systems Principles (SOSP) in Seoul, suggests that cloud providers may be able to extract significantly more inference capacity from existing silicon, especially in constrained markets like China, where the supply of Nvidia's latest H20s remains limited.

Unlike training-time breakthroughs that chase model quality or speed, Aegaeon is an inference-time scheduler designed to maximize GPU utilization across many models with bursty or unpredictable demand. Instead of pinning one accelerator to one model, Aegaeon virtualizes GPU access at the token level, allowing it to schedule tiny slices of work across a shared pool. This means one H20 could serve several different models simultaneously, with system-wide “goodput” — a measure of effective output — rising by as much as nine times compared to older serverless systems.
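To make the idea of token-level sharing concrete, here is a toy sketch in Python. It is not Aegaeon's scheduler, whose actual policy is described in the paper; the `Request` class, the `schedule_token_level` function, and the plain round-robin policy are all assumptions made for illustration. The sketch also ignores the cost of switching models on an accelerator, which any real system would have to manage.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    model: str         # hypothetical model name
    tokens_left: int   # output tokens still to generate


def schedule_token_level(requests, num_gpus):
    """Toy round-robin: each GPU in the pool emits one token per step.

    Returns a list of steps; each step lists the models that produced
    one token during that step. Illustrative only, not Aegaeon's policy.
    """
    # Work on copies so the caller's requests are left untouched.
    queue = deque(
        Request(r.model, r.tokens_left) for r in requests if r.tokens_left > 0
    )
    timeline = []
    while queue:
        step = []
        # Serve at most one token per GPU, and never the same request twice.
        for _ in range(min(num_gpus, len(queue))):
            req = queue.popleft()
            req.tokens_left -= 1
            step.append(req.model)
            if req.tokens_left > 0:
                queue.append(req)  # go to the back of the shared queue
        timeline.append(step)
    return timeline


if __name__ == "__main__":
    pending = [Request("A", 3), Request("B", 1), Request("C", 2)]
    for i, step in enumerate(schedule_token_level(pending, num_gpus=1), 1):
        print(f"step {i}: {step}")
```

In this toy run a single pooled GPU interleaves tokens for three models, whereas pinning one accelerator per model would reserve three GPUs that sit mostly idle when demand is bursty.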

The system was tested in production over several months, according to the paper, which lists authors from both Peking University and Alibaba’s infrastructure division, including CTO Jingren Zhou. During that window, the number of GPUs needed to support dozens of different LLMs — ranging in size up to 72 billion parameters — fell from 1,192 to just 213.
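The headline 82% figure follows directly from those two GPU counts. A quick check in Python (the helper name `gpu_reduction` is just for illustration):

```python
def gpu_reduction(before: int, after: int) -> float:
    """Fraction of GPUs no longer needed after consolidation."""
    return (before - after) / before


before, after = 1192, 213  # GPU counts reported for the beta test
print(f"GPUs saved: {before - after}")                      # GPUs saved: 979
print(f"Reduction: {gpu_reduction(before, after):.1%}")     # Reduction: 82.1%
print(f"Consolidation ratio: {before / after:.1f}x")        # Consolidation ratio: 5.6x
```

Put another way, each GPU in the pooled setup did the work that previously took roughly 5.6 GPUs.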

While the paper does not break down which models contributed most to the savings, reporting by the South China Morning Post says the tests were conducted using Nvidia’s H20, one of the few accelerators still legally available to Chinese buyers under current U.S. export controls.

Whether those savings translate outside Alibaba’s stack remains to be seen. Alibaba Cloud’s paper does not specify the exact network fabric used in the beta test, but the company is known to offer its own eRDMA elastic RDMA network and has a record of building highly integrated GPU serving stacks, suggesting the results may depend on an optimized, vertically integrated environment.

