The eighth-generation TPU: An architecture deep dive

Original link: https://cloud.google.com/blog/products/compute/tpu-8t-and-tpu-8i-technical-deep-dive

Google's eighth-generation TPUs (TPU 8t and TPU 8i) are designed to meet the evolving demands of advanced AI: not just more raw compute (FLOPS), but support for workload-specific needs such as long context windows, complex reasoning, and "world models" (AI that simulates and learns through prediction). The TPUs come in two specialized versions. **TPU 8t** excels at massive-scale pre-training, using a large 9,600-chip network along with innovations such as SparseCore, which accelerates embedding lookups, and native FP4, which raises throughput. **TPU 8i** is optimized for serving and inference. Both are core components of Google Cloud's AI Hypercomputer and ship with integrated Arm-based Axion CPUs to eliminate data-preparation bottlenecks and keep the TPUs continuously fed with data. Ultimately, TPU 8 aims to optimize every stage of the AI lifecycle, enabling efficient training and deployment of increasingly complex AI models such as Google DeepMind's Genie 3.


Original article

At Google, our TPU design philosophy has always been centered on three pillars: scalability, reliability, and efficiency. As AI models evolve from dense large language models (LLMs) to massive Mixture-of-Experts (MoEs) and reasoning-heavy architectures, the hardware must do more than just add floating point operations per second (FLOPS); it must evolve to meet the specific operational intensities of the latest workloads.

The rise of agentic AI requires infrastructure that can handle long context windows and complex sequential logic. At the same time, world models have emerged as a necessary evolution from current next-sequence-of-data architectures, which means newer agents are simulating future scenarios, anticipating consequences, and learning through "imagination" rather than risky trial-and-error. The eighth-generation TPUs (TPU 8t and TPU 8i) are our answer to these challenges, ensuring that every workload, from the first token of training to the final step of a multi-turn reasoning chain, is running on the most efficient path possible. They are built to efficiently train and serve world models like Google DeepMind’s Genie 3, enabling millions of agents to practice and refine their reasoning in diverse simulated environments.

TPU 8: Specialized by design

Recognizing that the infrastructure requirements for pre-training, post-training, and real-time serving have diverged, our eighth-generation TPUs introduce two distinct systems: TPU 8t and TPU 8i. These new systems are key components of Google Cloud's AI Hypercomputer, an integrated supercomputing architecture that combines hardware, software, and networking to power the full AI lifecycle. While both systems share the core DNA of Google’s AI stack and support the full AI lifecycle, each is built to address distinct bottlenecks and optimize efficiency for critical stages of development. Additionally, by integrating Arm-based Axion CPU hosts across our eighth-generation TPU system, we’ve removed the host bottleneck caused by data preparation latency. Axion provides the compute headroom to handle complex data preprocessing and orchestration, so that TPUs stay fed and don’t stall.
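The host-side pipelining described above can be sketched as a simple producer-consumer pattern: a background CPU thread prepares the next batches while the accelerator consumes ready ones. This is a minimal illustration, not Google's actual input pipeline; `preprocess` is a hypothetical stand-in for host-side work like decoding and tokenizing.

```python
import queue
import threading

def preprocess(raw):
    # Hypothetical stand-in for host-side data preparation
    # (decode, tokenize, batch); here it just doubles each value.
    return [x * 2 for x in raw]

def host_prefetcher(raw_batches, buffer_size=2):
    """Yield preprocessed batches, preparing upcoming ones on a
    background (CPU) thread so the consumer never waits on data."""
    q = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()

    def producer():
        for raw in raw_batches:
            q.put(preprocess(raw))  # runs concurrently with consumption
        q.put(SENTINEL)             # signal end of the dataset

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is SENTINEL:
            return
        yield batch

# Usage: a training loop sees only ready-made batches.
batches = list(host_prefetcher([[1, 2], [3, 4], [5, 6]]))
```

The bounded queue is the key design choice: it lets preparation run ahead of consumption by a fixed margin without unbounded memory growth, which is the same role a beefier host CPU plays at datacenter scale.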

TPU 8t: The pre-training powerhouse

Optimized for massive-scale pre-training and embedding-heavy workloads, TPU 8t utilizes our proven 3D torus network topology at an even larger scale of 9,600 chips in a single superpod. TPU 8t is designed for maximum throughput across hundreds of superpods, ensuring that training runs stay on schedule.

Here are some key advancements of TPU 8t over prior-generation TPUs:

  • The SparseCore advantage: Central to TPU 8t is the SparseCore, a specialized accelerator designed to handle the irregular memory access patterns of embedding lookups. While the Matrix Multiply Unit (MXU) handles matrix math, the SparseCore offloads data-dependent all-gather operations, amongst other collectives, preventing the zero-op bottlenecks that often plague general-purpose chips.
  • VPU/MXU overlap and balanced scaling: TPU 8t is designed to maximize provisioned FLOPs utilization. By implementing more balanced Vector Processing Unit (VPU) scaling, the architecture minimizes exposed vector operation time. This allows for better overlapping of quantization, softmax, and layernorms with the matrix multiplications in the MXU, helping the chip stay busy rather than waiting on sequential vector tasks.
  • Native FP4: TPU 8t introduces native 4-bit floating point (FP4) to overcome memory bandwidth bottlenecks, doubling MXU throughput while maintaining accuracy for large models even at lower-precision quantization. By reducing the bits per parameter, the platform minimizes energy-intensive data movement and allows larger model layers to fit within local hardware buffers for peak compute utilization.
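As a rough illustration of the FP4 trade-off in the last bullet, the sketch below rounds a block of weights to a 4-bit floating-point grid using a shared per-block scale. It assumes the common E2M1 layout (magnitudes 0 to 6); the post does not specify TPU 8t's exact FP4 format or scaling scheme.

```python
# Representable E2M1 FP4 magnitudes (assumed layout; the post does
# not specify TPU 8t's exact FP4 format).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({-v for v in FP4_GRID} | set(FP4_GRID))

def quantize_fp4(weights):
    """Map each weight in a block to the nearest representable FP4
    value after dividing by a shared per-block scale.
    Returns (scale, quantized values)."""
    scale = max(abs(w) for w in weights) / 6.0 or 1.0  # 6 = max magnitude
    q = [min(FP4_VALUES, key=lambda v: abs(w / scale - v)) for w in weights]
    return scale, q

scale, q = quantize_fp4([0.3, -1.2, 0.1, 0.9])
dequant = [scale * v for v in q]  # approximate reconstruction
```

Each weight now needs only 4 bits plus an amortized share of one scale per block, which is where the bandwidth and buffer-footprint savings described above come from; the cost is visible in the coarse rounding of the reconstructed values.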