Huawei releases an open weight model trained on Huawei Ascend GPUs

原始链接: https://arxiv.org/abs/2505.21411

The researchers propose a new architecture called Mixture of Grouped Experts (MoGE), designed to improve the efficiency of Mixture of Experts (MoE) models, particularly for distributed inference. MoGE groups the experts and balances their workload by ensuring that each token activates the same number of experts within every group, which yields better load balancing across devices and higher throughput. The team implemented MoGE in Pangu Pro MoE, a model with 72 billion total parameters of which 16 billion are activated per token, optimized for Ascend NPUs. System simulations informed the model configurations for the Ascend 300I Duo and 800I A2. Experiments confirm that MoGE delivers superior expert load balancing and execution efficiency in both training and inference. Pangu Pro MoE reaches 1148 tokens/s per card during inference, rising to 1528 tokens/s with speculative acceleration, outperforming comparable 32B and 72B dense models. The model also shows an excellent cost-to-performance ratio on the Ascend 300I Duo. The study demonstrates that Ascend NPUs can efficiently train Pangu Pro MoE, making it a leading model in the sub-100B-parameter class that outperforms existing open-source models.
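
To make the routing idea concrete, below is a minimal NumPy sketch of grouped top-k expert selection in the spirit of MoGE. The function name, the contiguous group layout, and the softmax normalization over the selected logits are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def grouped_topk_routing(router_logits, num_groups, k_per_group):
        """Pick k_per_group experts inside each expert group for every token.

        router_logits: array of shape (num_tokens, num_experts); the experts
        are assumed to be split into num_groups contiguous, equal-size groups,
        with one group hosted per device.
        """
        num_tokens, num_experts = router_logits.shape
        group_size = num_experts // num_groups
        grouped = router_logits.reshape(num_tokens, num_groups, group_size)

        # Local top-k indices inside each group, mapped back to global expert ids.
        local_topk = np.argsort(grouped, axis=-1)[..., -k_per_group:]
        offsets = (np.arange(num_groups) * group_size)[None, :, None]
        expert_ids = (local_topk + offsets).reshape(num_tokens, -1)

        # One plausible way to weight the selected experts: a softmax over
        # their router logits (the paper's exact normalization may differ).
        picked = np.take_along_axis(grouped, local_topk, axis=-1).reshape(num_tokens, -1)
        weights = np.exp(picked - picked.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return expert_ids, weights

    # 4 tokens, 16 experts split into 4 groups, 2 experts activated per group:
    # every token activates exactly 8 experts, 2 in each group.
    logits = np.random.randn(4, 16)
    ids, w = grouped_topk_routing(logits, num_groups=4, k_per_group=2)
    print(ids.shape, w.shape)   # (4, 8) (4, 8)

Because every token selects exactly k_per_group experts in each group, a device that hosts one group always sees the same number of activations per token, regardless of how skewed the router scores are.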

The Hacker News discussion revolves around Huawei's release of an open-weight AI model trained on its Ascend GPUs. Commenters express excitement about the potential for smaller players with affordable GPUs to compete with larger AI developers, suggesting a decentralized, crowd-sourced AI future. Some argue that US sanctions, while intended to hinder China, may inadvertently boost diversification of compute and manufacturing. A major point of contention is the model's license, which prohibits use within the EU, likely due to concerns about the EU's AI Act. This sparks debate about the enforceability of such restrictions on individual users and the implications for European innovation. Concerns are also raised about security risks associated with open-weight models, such as potential backdoors and prompt injection vulnerabilities. Despite these potential issues, there is a general consensus that this release represents a significant step towards democratizing AI development. The effectiveness and ethics of using sanctions as a tool to impede technological advancement are also discussed.

Original text

Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
by Yehui Tang and 21 other authors
Abstract: The surge of Mixture of Experts (MoE) in Large Language Models promises a small execution cost for a much larger model parameter count and learning capacity, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and inherently balances the expert workload better than MoE. It constrains tokens to activate an equal number of experts within each predefined expert group. When model execution is distributed across multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration, outperforming comparable 32B and 72B dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization, making it a leading model within the sub-100B total parameter class that outperforms prominent open-source models like GLM-Z1-32B and Qwen3-32B.
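
The load-balancing argument in the abstract can be illustrated with a small, self-contained simulation comparing plain top-k routing against the grouped selection constraint. The expert counts, group layout, and skewed router scores below are made up for illustration and are not Pangu Pro MoE's actual configuration or routing implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    tokens, experts, groups, k = 10000, 16, 4, 8
    group_size = experts // groups          # 4 experts per group, one group per device
    picks_per_group = k // groups           # 2 experts chosen inside every group

    # Skewed router scores so some experts are systematically preferred.
    logits = rng.gumbel(size=(tokens, experts)) + np.linspace(0.0, 2.0, experts)

    # Plain top-k over all experts: selections pile up on the devices
    # hosting the popular experts.
    topk = np.argsort(logits, axis=-1)[:, -k:]
    load_plain = np.bincount(topk.ravel() // group_size, minlength=groups)

    # Grouped selection: exactly picks_per_group choices inside every group,
    # so each device receives tokens * picks_per_group selections no matter
    # how skewed the router scores are.
    grouped = logits.reshape(tokens, groups, group_size)
    local = np.argsort(grouped, axis=-1)[..., -picks_per_group:]
    load_grouped = np.array([local[:, g, :].size for g in range(groups)])

    print("per-device selections, plain top-k:", load_plain)
    print("per-device selections, grouped    :", load_grouped)   # uniform: 20000 each

Under the skewed scores, plain top-k sends noticeably more selections to the devices hosting the favored experts, while the grouped constraint gives every device exactly the same count, which is the balance property the abstract credits for the throughput gains in multi-device inference.
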
From: Hang Zhou
[v1] Tue, 27 May 2025 16:40:21 UTC (710 KB)
[v2] Wed, 28 May 2025 10:42:15 UTC (710 KB)