Modern GPU Programming for MLSys

Original link: https://mlc.ai/modern-gpu-programming-for-mlsys/

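This book aims to be a comprehensive guide to building high-performance GPU kernels for modern AI workloads, including large language models and mixture-of-experts models. Drawing on the Machine Learning Systems course at Carnegie Mellon University, it bridges the gap between complex GPU hardware architecture (the Blackwell generation in particular) and cutting-edge practical applications. The book emphasizes building a solid mental model of the hardware and focuses on key optimization techniques such as data layout, asynchronous memory movement, and task scheduling. To make the material easier to learn, it uses TIRx, a Python-based domain-specific language (DSL), so that readers can understand low-level hardware control through runnable, step-by-step code examples. The book has four parts:

1. **GPU fundamentals**: core concepts of GPU architecture and optimization methods.
2. **TIRx overview**: an introduction to the programming model.
3. **GEMM implementation**: a systematic guide to optimizing matrix multiplication with advanced techniques such as TMA pipelining and warp specialization.
4. **FlashAttention 4**: an in-depth look at how to build a production-grade attention kernel.

By combining hardware intuition with hands-on programming, the book gives engineers the tools they need to master the high-performance kernels that drive modern AI.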


Original text

Machine learning systems sit at the heart of modern AI workloads. In these systems, performance often comes down to the quality of a small number of GPU kernels. Attention kernels, LLM prefill and decode kernels, low-precision block-scaled GEMMs, fused MoE layers, and other large fused kernels all directly shape end-to-end speed in both training and serving.

To make these kernels fast, however, we need more than a list of optimization tricks. Modern GPUs are no longer simple variations of the same old design. Recent architectures introduce richer memory spaces, new access patterns, and increasingly specialized execution units. To program them well, we need both a clear mental model of the hardware and a practical understanding of how high-performance kernels are built. This book is about developing both.

The book follows a simple progression: first understand the GPU hardware, then learn the programming model we will use, and finally build state-of-the-art kernels step by step. Our main target is the Blackwell generation, and our main running examples are fast matrix multiplication (GEMM) and FlashAttention. Along the way, we will also study the core ingredients behind GPU optimization: data layout, asynchronous data movement, and asynchronous coordination.

The material grows out of the Machine Learning Systems course series at Carnegie Mellon University. To make the ideas easier to study and easier to run, this book uses the TIRx Python DSL to build real GPU kernel examples step by step. TIRx stays close to the hardware, which lets us reason about low-level control while still learning through runnable code.

Part I, Understanding the GPU. This part introduces the overall organization of the GPU, general recipes for writing fast kernels, and key concepts such as data layout, asynchronous memory operations, and coordination. It builds the hardware intuition that the rest of the book relies on.
Part II, TIRx Overview. This part introduces the key elements of TIRx, which serve as the foundation for the code examples throughout the book.
Part III, GEMM: Tiled to SOTA. A complete guide to optimizing a tiled GEMM, built up through TMA pipelining, persistent scheduling, warp specialization, and 2-CTA clusters.
Part IV, Flash Attention 4. A complete attention kernel built from the Part III techniques: two MMAs with softmax between them, online-softmax rescaling, causal masking, and GQA.
Reference. TIRx language reference and compiler internals.
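
To give a feel for the "online-softmax rescaling" mentioned for Part IV, here is a minimal sketch in plain Python, not TIRx and not the book's kernel. It computes attention for a single query while visiting keys and values one block at a time, rescaling the running sum and accumulator whenever a new maximum score appears. The function names `attention_reference` and `attention_online`, the `block_size` parameter, and the omission of the usual 1/sqrt(d) scaling are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Softmax attention for one query, computed over all keys at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


def attention_online(q, keys, values, block_size=2):
    """Same result, but keys/values are consumed block by block."""
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # sum of exp(score - running_max)
    acc = [0.0] * dim            # unnormalized weighted sum of values

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        scale = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * scale + sum(probs)
        acc = [
            a * scale + sum(p * v[d] for p, v in zip(probs, v_blk))
            for d, a in enumerate(acc)
        ]
        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any block size. The blockwise version never needs all scores at once, which is the property that lets an attention kernel stream tiles of keys and values through fast on-chip memory.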
