Intel Gaudi 3 AI Accelerator

Original link: https://www.intel.com/content/www/us/en/newsroom/news/vision-2024-gaudi-3-ai-accelerator.html

Intel's Gaudi 3 accelerator is a powerful AI solution aimed at enterprises in critical sectors such as finance, manufacturing and healthcare. Built on a 5 nm process, it is more efficient than its predecessor and handles the complex matrix operations fundamental to deep learning algorithms. With its heterogeneous compute engines operating in parallel, it delivers higher AI performance and efficiency. Gaudi 3 ships with enhanced memory (128 GB of HBM2e, 3.7 TB/s of memory bandwidth and 96 MB of SRAM), supporting large language and multimodal models and improving overall system scalability. Its 24 integrated 200 Gb Ethernet ports enable efficient scaling for enterprise AI deployments and eliminate vendor lock-in. It also supports popular AI frameworks such as PyTorch and Hugging Face models, ensuring developer productivity.

New to the Gaudi line is a Gaudi 3 PCIe card suited to fine-tuning, inference and retrieval-augmented generation (RAG) workloads. Compared with the Nvidia H100, Gaudi 3 promises substantial gains in training time, inference throughput and power efficiency. Prominent OEMs including Dell Technologies, Hewlett Packard Enterprise, Lenovo and Supermicro plan to adopt it, with general availability beginning in the third quarter of 2024. Gaudi 3 will also power cost-effective cloud LLM infrastructure, bringing competitive pricing and choice. Developers can start exploring Intel Gaudi 2 instances on the developer cloud today. Gaudi 3 lays the groundwork for Intel's upcoming Falcon Shores GPU for AI and HPC, which will unite the two technologies under a single programming interface based on the Intel oneAPI specification.

Intel's latest product, codenamed Gaudi 3, uses a standard interface called the Open Accelerator Module (OAM) to attach the accelerator to a baseboard. The interface is similar to Nvidia's SXM connector, opening the door to compatibility with existing components. Earlier Nvidia parts such as the P100 and V100 used a semi-proprietary PCIe lane layout within their MegArray connectors, and that lack of openness kept enthusiasts from putting those older chips to effective use. Intel's strategy of opening its fabric specifications marks a shift intended to foster collaboration among competitors; Broadcom is one beneficiary of this shared-fabric approach. AMD's history also attests to the value of shared architecture development: Intel adopted AMD's math coprocessor designs (the Am9511 and Am9512), licensing them to create the Intel 8231 and 8232. In contrast to proprietary InfiniBand networking, Intel's Gaudi 3 accelerators will use standard Ethernet, ensuring easy procurement and integration into systems. The accelerators will form part of a high-bandwidth, all-to-all network offering efficient communication. The Gaudi 3 launch also signals Intel's intent to challenge Nvidia's dominance of the AI and HPC markets with innovative solutions. As for the earlier remarks about AMD, AMD appears to appreciate the benefits of collaboration and shared architecture development, acknowledging its reliance on Intel during the early IBM PC era. The author voices concern about the longevity of Intel's support and commitment to its products, suggesting a focus on competition rather than monopoly.
Related Articles

Original Text

Why It Matters: Today, enterprises across critical sectors such as finance, manufacturing and healthcare are rapidly seeking to broaden accessibility to AI and transitioning generative AI (GenAI) projects from experimental phases to full-scale implementation. To manage this transition, fuel innovation and realize revenue growth goals, businesses require open, cost-effective and more energy-efficient solutions and products that meet return-on-investment (ROI) and operational efficiency needs. 

The Intel Gaudi 3 accelerator will meet these requirements and offer versatility through open community-based software and open industry-standard Ethernet, helping businesses flexibly scale their AI systems and applications.  

How Custom Architecture Delivers GenAI Performance and Efficiency: The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process and offers significant advancements over its predecessor.  It is designed to allow activation of all engines in parallel — with the Matrix Multiplication Engine (MME), Tensor Processor Cores (TPCs) and Networking Interface Cards (NICs) — enabling the acceleration needed for fast, efficient deep learning computation and scale. Key features include: 
 

  • AI-Dedicated Compute Engine: The Intel Gaudi 3 accelerator was purpose-built for high-performance, high-efficiency GenAI compute. Each accelerator uniquely features a heterogeneous compute engine composed of 64 AI-custom and programmable TPCs and eight MMEs. Each Intel Gaudi 3 MME is capable of performing an impressive 64,000 parallel operations, allowing a high degree of computational efficiency and making the MMEs adept at handling complex matrix operations, a type of computation fundamental to deep learning algorithms. This unique design accelerates the speed and efficiency of parallel AI operations and supports multiple data types, including FP8 and BF16.
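As a back-of-the-envelope sketch of what the figures above imply: 8 MMEs at 64,000 parallel operations each give 512,000 operations per cycle. Peak throughput then depends on clock frequency, which the release does not state, so the clock used below is a hypothetical placeholder, not a published spec.

```python
# Per-cycle throughput from the figures quoted above.
# 64,000 parallel ops per MME and 8 MMEs come from the text;
# the clock frequency is a HYPOTHETICAL assumption, not a spec.
MMES_PER_ACCELERATOR = 8
PARALLEL_OPS_PER_MME = 64_000

ops_per_cycle = MMES_PER_ACCELERATOR * PARALLEL_OPS_PER_MME  # 512,000

def peak_ops_per_second(clock_ghz: float) -> float:
    """Peak matrix operations per second at an assumed clock (ops/cycle x Hz)."""
    return ops_per_cycle * clock_ghz * 1e9

print(f"{ops_per_cycle:,} ops/cycle")
print(f"{peak_ops_per_second(1.0) / 1e12:.0f} tera-ops/s at an assumed 1.0 GHz clock")
```

At an assumed 1 GHz this works out to 512 tera-operations per second; the real figure scales linearly with whatever the actual clock is.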

  • Memory Boost for LLM Capacity Requirements: 128 gigabytes (GB) of HBM2e memory capacity, 3.7 terabytes per second (TB/s) of memory bandwidth and 96 megabytes (MB) of on-board static random access memory (SRAM) provide ample memory for processing large GenAI datasets on fewer Intel Gaudi 3 accelerators. This is particularly useful for serving large language and multimodal models, resulting in increased workload performance and data center cost efficiency.
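To make the "fewer accelerators" claim concrete, a rough lower bound on device count is model weights divided by per-device HBM. This is only a sketch: it counts weights alone and ignores activations, KV cache and framework overhead, so real deployments need headroom beyond it.

```python
import math

HBM_PER_ACCELERATOR_GB = 128  # capacity quoted in the spec above

def min_accelerators_for_weights(params_billion: float, bytes_per_param: int) -> int:
    """Lower bound on devices needed just to hold model weights in HBM.

    Ignores activations, KV cache and runtime overhead; a real deployment
    needs headroom beyond this estimate.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes ~ GB
    return math.ceil(weights_gb / HBM_PER_ACCELERATOR_GB)

print(min_accelerators_for_weights(7, 2))    # 7B model in BF16  -> 1
print(min_accelerators_for_weights(70, 2))   # 70B model in BF16 -> 2
print(min_accelerators_for_weights(180, 1))  # 180B model in FP8 -> 2
```

Under these assumptions a 70B-parameter model in BF16 (about 140 GB of weights) spans two accelerators, which is the kind of consolidation the bullet is pointing at.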

  • Efficient System Scaling for Enterprise GenAI: Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator, providing flexible and open-standard networking. They enable efficient scaling to support large compute clusters and eliminate vendor lock-in from proprietary networking fabrics. The Intel Gaudi 3 accelerator is designed to scale up and scale out efficiently from a single node to thousands to meet the expansive requirements of GenAI models. 
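The aggregate fabric bandwidth implied by the port count above is simple arithmetic: 24 ports at 200 Gb/s each. A minimal sketch, using only the figures quoted in the text:

```python
PORTS = 24
PORT_GBITS = 200  # 200 Gb Ethernet per port, from the text

total_gbits = PORTS * PORT_GBITS  # aggregate line rate in Gb/s
total_gbytes = total_gbits / 8    # same figure in GB/s

print(f"{total_gbits} Gb/s aggregate ({total_gbytes:.0f} GB/s) per accelerator")
```

That is 4.8 Tb/s (600 GB/s) of raw line rate per accelerator before protocol overhead, which is what makes standard Ethernet viable as the scale-out fabric here.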

  • Open Industry Software for Developer Productivity: Intel Gaudi software integrates the PyTorch framework, the most common AI framework for GenAI developers today, and provides optimized Hugging Face community-based models. This allows GenAI developers to operate at a high level of abstraction for ease of use and productivity, and eases model porting across hardware types.

  • Gaudi 3 PCIe: New to the product line is the Gaudi 3 peripheral component interconnect express (PCIe) add-in card. Tailored to bring high efficiency with lower power, this new form factor is ideal for workloads such as fine-tuning, inference and retrieval-augmented generation (RAG). It is equipped as a full-height form factor at 600 watts, with a memory capacity of 128GB and a bandwidth of 3.7TB per second.  
     

The Intel Gaudi 3 accelerator will deliver significant performance improvements for training and inference tasks on leading GenAI models. Specifically, relative to the Nvidia H100, the Intel Gaudi 3 accelerator is projected to deliver on average:

 

  • 50% faster time-to-train[1] across Llama2 7B and 13B parameter models and the GPT-3 175B parameter model.
  • 50% faster inference throughput[2] and 40% greater inference power efficiency[3] across Llama 7B and 70B parameter models and the Falcon 180B parameter model, with an even greater inference performance advantage on longer input and output sequences.
  • 30% faster inferencing[4] on Llama 7B and 70B parameter models and the Falcon 180B parameter model against the Nvidia H200.
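One way to read these projections: under the common interpretation that "X% faster" means (1 + X/100) times the baseline throughput, the same job completes in 1/(1 + X/100) of the baseline time. A small sketch of that conversion (the interpretation is an assumption; the release does not define the metric precisely):

```python
def time_fraction(percent_faster: float) -> float:
    """If device A is `percent_faster`% faster than device B (i.e. A's
    throughput is (1 + p/100) times B's), A finishes the same job in
    this fraction of B's time."""
    return 1 / (1 + percent_faster / 100)

print(f"50% faster -> job done in {time_fraction(50):.0%} of the baseline time")
print(f"30% faster -> job done in {time_fraction(30):.0%} of the baseline time")
```

So "50% faster time-to-train" would correspond to finishing in roughly two-thirds of the H100's time under this reading.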

 

About Market Adoption and Availability: The Intel Gaudi 3 accelerator will be available to original equipment manufacturers (OEMs) in the second quarter of 2024 in industry-standard configurations of Universal Baseboard and open accelerator module (OAM). Among the notable OEM adopters that will bring Gaudi 3 to market are Dell Technologies, Hewlett Packard Enterprise, Lenovo and Supermicro. General availability of Intel Gaudi 3 accelerators is anticipated for the third quarter of 2024, and the Intel Gaudi 3 PCIe add-in card is anticipated to be available in the last quarter of 2024.  

The Intel Gaudi 3 accelerator will also power several cost-effective cloud LLM infrastructures for training and inference, offering price-performance advantages and choices to organizations that now include NAVER.   

Developers can get started today with access to Intel Gaudi 2-based instances on the developer cloud to learn, prototype, test, and run applications and workloads.

What’s Next: Intel Gaudi 3 accelerators' momentum will be foundational for Falcon Shores, Intel’s next-generation graphics processing unit (GPU) for AI and high-performance computing (HPC). Falcon Shores will integrate the Intel Gaudi and Intel® Xe intellectual property (IP) with a single GPU programming interface built on the Intel® oneAPI specification. 

More Context: Intel Unleashes Enterprise AI with Gaudi 3, AI Open Systems Strategy and New Customer Wins (News) | Intel Gaudi 3 AI Accelerator (Product Page) | Intel Gaudi 3 AI Accelerator (White Paper) |  Intel Gaudi 2 Remains Only Benchmarked Alternative to NV H100 for GenAI Performance (News)

Contact us: contact @ memedata.com