Decoupled DiLoCo: Resilient, Distributed AI Training at Scale

Original link: https://deepmind.google/blog/decoupled-diloco/

## Decoupled DiLoCo: Scaling AI Training over the Internet

Google's Decoupled DiLoCo system significantly improves the scalability and efficiency of large-model training. It pre-trained a 12-billion-parameter model across four U.S. regions over standard internet connectivity (2-5 Gbps), completing the run **more than 20 times faster** than conventional synchronization methods.

The speedup comes from *integrating* communication with computation, eliminating the performance bottlenecks caused by workers waiting on data synchronization. DiLoCo can also tap geographically dispersed, previously "stranded" compute, and even **mix different hardware generations in a single training run** (e.g., TPU v6e and v5p), extending hardware lifetimes while maintaining performance.

This full-stack approach to AI training is a significant step toward resilient, internet-scale AI infrastructure: it eases logistical constraints and makes the most of whatever compute is available to meet the ever-growing demands of AI models.
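For orientation, the published DiLoCo recipe that this system builds on alternates long runs of purely local optimization with rare synchronization of parameter deltas ("pseudo-gradients"). Below is a minimal single-process sketch of that inner/outer loop; the plain-SGD inner optimizer, the learning rates, the step counts, and the toy quadratic objectives are illustrative assumptions, not values from the post.

```python
import numpy as np

def inner_sgd_step(params, grad_fn, lr=0.01):
    """One local optimizer step (plain SGD standing in for the real inner optimizer)."""
    return params - lr * grad_fn(params)

def diloco_round(global_params, worker_grad_fns, inner_steps=100,
                 outer_lr=0.7, outer_momentum=0.9, momentum_buf=None):
    """One DiLoCo round: every worker takes `inner_steps` local steps from the
    same starting point; the averaged parameter delta (the "pseudo-gradient")
    is then applied with a Nesterov-momentum outer step."""
    deltas = []
    for grad_fn in worker_grad_fns:            # one entry per region/worker
        local = global_params.copy()
        for _ in range(inner_steps):           # no cross-region traffic here
            local = inner_sgd_step(local, grad_fn)
        deltas.append(global_params - local)   # this worker's pseudo-gradient
    pseudo_grad = np.mean(deltas, axis=0)      # the only WAN communication

    if momentum_buf is None:
        momentum_buf = np.zeros_like(pseudo_grad)
    momentum_buf = outer_momentum * momentum_buf + pseudo_grad
    # Nesterov-style outer update of the shared parameters.
    new_params = global_params - outer_lr * (pseudo_grad + outer_momentum * momentum_buf)
    return new_params, momentum_buf

# Toy usage: four "regions" minimizing slightly different quadratics.
grad_fns = [(lambda p, t=t: p - t) for t in (1.0, 1.2, 0.8, 1.1)]
params, buf = np.zeros(3), None
for _ in range(10):
    params, buf = diloco_round(params, grad_fns, inner_steps=20, momentum_buf=buf)
print(params)  # drifts toward the mean of the targets (~1.025)
```

The key property for WAN training is visible in the loop structure: communication happens once per round rather than once per step, so the 2-5 Gbps links are exercised only rarely.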


## Original Text

Decoupled DiLoCo is not only more resilient to failures, but is also practical for executing production-level, fully distributed pre-training. We successfully trained a 12 billion parameter model across four separate U.S. regions using 2-5 Gbps of wide-area networking (a level relatively achievable using existing internet connectivity between datacenter facilities, rather than requiring new custom network infrastructure between facilities). Notably, the system achieved this training result more than 20 times faster than conventional synchronization methods. This is because our system incorporates required communication into longer periods of computation, avoiding the "blocking" bottlenecks where one part of the system must wait for another.
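To make the "decoupling" concrete, here is a hedged sketch of taking communication off the critical path: the reduction of one round's pseudo-gradient runs in the background while the next round's inner steps proceed, and its one-round-stale result is folded in when it lands. The thread-based `wan_all_reduce` stand-in and the exact staleness handling are assumptions for illustration; the post does not specify Google's actual overlap mechanism.

```python
import threading
import numpy as np

def wan_all_reduce(pseudo_grad, out):
    """Stand-in for a cross-region all-reduce over 2-5 Gbps links; in a real
    deployment this would take seconds, which is why it must overlap compute."""
    out["avg"] = pseudo_grad.copy()

def train_decoupled(params, grad_fn, rounds=5, inner_steps=50,
                    inner_lr=0.01, outer_lr=0.7):
    """Decoupled sketch: launch the reduction of round t's pseudo-gradient in
    the background, keep running round t+1's inner steps while it is in
    flight, and apply the slightly stale result once it completes."""
    pending = None                    # (thread, result) for the in-flight reduce
    anchor = params.copy()            # parameters at the start of the round
    for _ in range(rounds):
        for _ in range(inner_steps):  # computation never blocks on the WAN
            params = params - inner_lr * grad_fn(params)
        pseudo_grad = anchor - params
        if pending is not None:       # previous round's sync is likely done
            thread, result = pending
            thread.join()
            params = params - outer_lr * result["avg"]
        result = {}
        thread = threading.Thread(target=wan_all_reduce,
                                  args=(pseudo_grad, result))
        thread.start()                # overlaps with the next inner loop
        pending = (thread, result)
        anchor = params.copy()
    thread, result = pending          # drain the final in-flight reduction
    thread.join()
    return params - outer_lr * result["avg"]

# Toy usage: one worker minimizing a quadratic centered at 1.0.
final = train_decoupled(np.zeros(3), lambda p: p - 1.0)
```

Because the join normally returns immediately (the reduction finished during the inner loop), no part of the system sits idle waiting for another, which is the blocking bottleneck the paragraph above describes.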

At Google, we take a full-stack approach to AI training, spanning hardware, software infrastructure and research. Increasingly, gains are coming from rethinking how these layers fit together.

Decoupled DiLoCo is one example. By enabling training jobs at internet-scale bandwidth, it can tap any unused compute wherever it sits, turning stranded resources into useful capacity.

Beyond efficiency and resilience, this training paradigm also unlocks the ability to mix different hardware generations, such as TPU v6e and TPU v5p, in a single training run. This approach not only extends the useful life of existing hardware, but also increases the total compute available for model training. In our experiments, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, ensuring that even older hardware can meaningfully accelerate AI training.
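One plausible way to reconcile chips running at different speeds, sketched below, is to fix the wall-clock length of each round and let every worker take as many inner steps as it can before the shared sync; the pseudo-gradients are then combined exactly as in the homogeneous case. Whether the production system equalizes rounds by wall clock, normalizes the deltas, or rebalances data is not stated in the post, so treat this as an assumption.

```python
import numpy as np

def heterogeneous_round(global_params, workers, round_seconds=60.0,
                        inner_lr=0.01, outer_lr=0.7):
    """Each (grad_fn, steps_per_second) worker runs as many inner steps as its
    speed allows within a fixed wall-clock round, so a newer chip simply makes
    more local progress; syncing then proceeds as in homogeneous DiLoCo."""
    deltas = []
    for grad_fn, steps_per_second in workers:
        n_steps = int(steps_per_second * round_seconds)  # faster chip, more steps
        local = global_params.copy()
        for _ in range(n_steps):
            local = local - inner_lr * grad_fn(local)
        deltas.append(global_params - local)
    pseudo_grad = np.mean(deltas, axis=0)                # speed-agnostic sync
    return global_params - outer_lr * pseudo_grad

# Toy usage: a "v6e-like" worker at 40 steps/s alongside a "v5p-like" one at 25.
workers = [(lambda p: p - 1.0, 40.0), (lambda p: p - 1.0, 25.0)]
params = np.zeros(3)
for _ in range(5):
    params = heterogeneous_round(params, workers, round_seconds=1.0)
```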

What’s more, because new generations of hardware don’t arrive everywhere all at once, being able to train across generations can alleviate recurring logistical and capacity bottlenecks.

As we push the frontiers of AI infrastructure today, we’re continuing to explore approaches to resilient systems needed to unlock the next generation of AI.
