GPUs Go Brrr

Original link: https://hazyresearch.stanford.edu/blog/2024-05-12-tk

In recent years, researchers have worked both to reduce AI's compute requirements and to use existing hardware more efficiently, producing results such as Based, Monarch Mixer, H3, Hyena, S4, FlashAttention, FlashAttention-2, and FlashFFTConv. This post steps back to ask what the hardware fundamentally requires, and discusses the practice and philosophy of getting the most out of NVIDIA's latest H100 GPU, with a focus on its distinctive features. The H100 combines 80 GB of HBM3 memory, a 50 MB L2 cache, 132 streaming multiprocessors (SMs), a Tensor Memory Accelerator (TMA), and 989 TFLOPs of half-precision matrix-multiply compute. Using the tensor cores effectively is critical, because they determine overall GPU utilization: idle tensor cores mean the hardware is barely being used at all. Keeping the tensor cores consistently busy involves a range of strategies, including careful management of WGMMA instructions and of shared-memory accesses; other factors, such as address generation and making good use of occupancy, also contribute substantially to performance. The researchers have also developed ThunderKittens, a domain-specific language (DSL) embedded in CUDA, designed to simplify development against these hardware features. With it, developers can write high-performance custom kernels tailored to what modern GPUs want, ultimately getting the most out of the hardware.

A related debate asks whether matrix multiplies smaller than 16x16 even qualify as AI. Some argue that larger registers (for example, 1024-bit vectors) are the future, and propose a shift in perspective: think of a "register" as a 16x16 tile of data. AI's hardware requirements keep evolving, with a trend toward smaller, simpler primitives; GPUs were originally designed for different workloads and are now used for AI largely because of their matrix-multiply capability. The discussion covers numerical precision ranging from 1-bit to 16-bit floating point, and future development may bring simplified, cheaper devices specialized for AI. Despite concerns about whether older technologies like GPUs suit modern AI needs, advances in AI hardware have kept pushing GPU performance upward, and the trend toward lower-precision arithmetic may drive further gains. The argument is that hardware optimized around 16x16 tiles is a very good fit for AI workloads. The conversation also touches on power consumption and its influence on hardware design, on the value of equipping GPUs with more VRAM and what that implies for market strategy, and on proposals for custom hardware focused on efficient AI processing, which could lower costs and improve compatibility with software frameworks. Overall, continued progress in AI hardware, together with shifting consumer preferences and market dynamics, opens up interesting possibilities for the field.

Original Article

AI uses an awful lot of compute.

In the last few years we’ve focused a great deal of our work on making AI use less compute (e.g. Based, Monarch Mixer, H3, Hyena, S4, among others) and run more efficiently on the compute that we have (e.g. FlashAttention, FlashAttention-2, FlashFFTConv). Lately, reflecting on these questions has prompted us to take a step back, and ask two questions:

  • What does the hardware actually want?
  • And how can we give that to it?

This post is a mixture of practice and philosophy. On the practical side, we’re going to talk about what we’ve learned about making GPUs go brr -- and release an embedded DSL, ThunderKittens, that we’ve built to help us write some particularly speedy kernels (which we are also releasing). On the philosophical side, we’ll briefly talk about how what we’ve learned has changed the way we think about AI compute.

What's in an H100?

For this post, we’re going to focus on the NVIDIA H100 for two reasons. First, it represents an awful lot of new compute going online. Second, we think the trends it implies are going to continue in future generations, and probably from other manufacturers, too. But bear in mind (and we will repeat in case you forget) that most of this post applies in some form to other GPUs, too.

Figure 1: brr

Advance apologies for restating the data sheet, but the details of the hardware are important for the discussion to come. An H100 SXM GPU contains, for our purposes:

  • 80 GB of HBM3 with 3 TB/s of bandwidth. (A bit less bandwidth in practice.)
  • 50 MB of L2 cache with 12 TB/s of bandwidth, split across the GPU into two 25MB sections connected by a crossbar. (The crossbar sucks.)
  • 132 streaming multiprocessors (SM’s), where each has:
    • up to 227 KB of shared memory within a 256 KB L1 cache. (Together, these have about 33 TB/s of bandwidth.)
    • a tensor memory accelerator (TMA) -- a new chunk of hardware in Hopper that can do asynchronous address generation and fetch memory. It also does other things like facilitate the on-chip memory network (distributed shared memory) but we’re not going to focus on this much, today.
    • 4 quadrants, where each quadrant has:
      • A warp scheduler
      • 512 vector registers (each containing 32 4-byte words)
      • A tensor core for matrix multiplies
      • A bunch of built-in instructions, like sums and multiplies, that operate in parallel on these vector registers.

There’s a lot of other stuff, too (memory controllers, instruction caches, etc) but we don’t care about any of that right now.

All of the compute happens in the SM’s. Most of it happens in the registers.

Great, how do I make it go brr?

Keep the tensor core fed. That’s it.

Wait, really?

Yes. That’s the game.

An H100 GPU has 989 TFLOPs of half-precision matrix multiply compute, and ~60 TFLOPs of “everything else”. So, every cycle the tensor core is in use, you’re getting at least 94% utilization of the hardware. And every cycle the tensor core is not in use, you’re getting no more than 6% utilization of the hardware. Put another way:

H100 utilization = (% of cycles the tensor cores are active) ± 6%.
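For the record, the arithmetic behind that ±6% is just the ratio of the two compute numbers above:

$$\frac{989}{989+60}\approx 0.94, \qquad \frac{60}{989+60}\approx 0.06$$

If the tensor cores are busy you are at roughly 94% or more of the chip's total FLOPs; if they are idle, you are capped at about 6%.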

Now it turns out that keeping the tensor core fed is easier said than done. We’ve discovered a number of quirks to the hardware that are important to keeping the matrix multiplies rolling. Much of this also applies to non-H100 GPUs, but the H100 is particularly tricky to keep fed so we focus on it here. (The RTX 4090, by comparison, is very easy to work with as illustrated in figure 2.)

  • WGMMA instructions are necessary but also really irritating to use.
  • Shared memory is not actually that fast and also requires great care.
  • Address generation is expensive.
  • Occupancy remains helpful, and registers are generally the key resource.

Figure 2: NVIDIA GPUs (H100 and 4090) and their spirit animals (Canada goose and golden retriever puppy).

Let’s go through each of these in order.

WGMMA Instructions

The H100 has a new set of instructions called “warp group matrix multiply accumulate” (wgmma.mma_async in PTX, or HGMMA/IGMMA/QGMMA/BGMMA in SASS). To understand what makes them special, we need to look briefly at how you used to have to use tensor cores. The tensor core instructions available on previous GPUs were wmma.mma.sync and mma.sync instructions. With these instructions a warp of 32 threads on a single quadrant of an SM would synchronously feed their chunk of the data into the tensor core and await the result. Only then could they move on.
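For readers who never touched that older path, here is a minimal sketch of the synchronous style using the standard CUDA `wmma` API on a single 16x16x16 tile (pointers and leading dimensions are illustrative). The point to notice is that the warp stalls at each `_sync` step:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp synchronously computes C = A * B for a single 16x16x16 tile.
// Until each *_sync call retires, these 32 threads have nothing else to do.
__global__ void old_school_mma(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);            // warp waits for its operands
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // warp waits on the tensor core
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

With wgmma.mma_async, by contrast, the launch returns immediately and the warpgroup only pays for the matrix multiply when it explicitly waits on the result.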

Not so with wgmma.mma_async instructions. Here, 128 consecutive threads -- split across all quadrants of the SM -- collaboratively synchronize, and asynchronously launch a matrix multiply directly from shared memory (and optionally also registers.) These warps can then go do other things with their registers while the matrix multiply happens, and await the result whenever they want.

In our microbenchmarks, we found that these instructions are necessary to extract the full compute of the H100. Without them, the GPU seems to top out around 63% of its peak utilization; we suspect this is because the tensor cores want a deep hardware pipeline to keep them fed, even from local resources.

Unfortunately, the memory layouts for these instructions are quite complicated. The unswizzled shared memory layouts suffer from very poor coalescing, and so they require substantial additional bandwidth from L2. The swizzled memory layouts are flat-out incorrectly documented, which took considerable time for us to figure out. They’re also brittle, in that they appear to only work for specific matrix shapes and do not play well with other parts of the wgmma.mma_async instructions. For example, the hardware can transpose sub-matrices on its way to the tensor cores -- but only if the layout is not swizzled.

Figure 3: NVIDIA’s lies. This is an extraordinarily misleading representation of the actual 128b swizzled wgmma layout. This diagram cost us three weeks of life that we will not get back, hence the public shaming.

We’ve also found that unswizzled wgmma layouts have both poor memory coalescing as well as bank conflicts. On kernels such as flash attention, TMA and the L2 cache are both fast enough to hide these problems reasonably well. But to make full use of the hardware, memory requests must be coalesced and bank conflicts avoided, and then controlling layouts very carefully becomes critical.

Despite these pains, these instructions really are necessary to make full use of the H100. Without them, you’ve already lost 37% of the potential performance of the GPU!

Shared memory

Shared memory appears to have a single-access latency of around 30 cycles (this matches our observations, too). That doesn’t sound like much, but in that time the SM’s tensor cores could have done almost two full 32x32 square matrix multiplies.
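A rough sanity check on that claim, assuming a boost clock of around 1.8 GHz (our assumption; it is not in the spec list above):

$$\frac{989\times10^{12}\ \text{FLOPs/s}}{132\ \text{SMs}\times 1.8\times10^{9}\ \text{cycles/s}}\approx 4{,}160\ \text{FLOPs per SM per cycle}, \qquad \frac{30\times 4{,}160}{2\cdot 32^{3}}\approx 1.9$$

so 30 cycles of shared-memory latency really is worth nearly two 32x32x32 matrix multiplies.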

In previous work (like Flash Attention), we’ve focused more on the HBM-SRAM bottleneck. And indeed: this really used to be the bottleneck! But as HBM has gotten faster and the tensor cores continue to grow out of proportion with the rest of the chip, even relatively small latencies like those from shared memory have also become important to either remove or hide.

Shared memory can be tricky to work with because it is “banked” into 32 separate stores of memory. If one is not careful, this can lead to something called “bank conflicts”, where the same memory bank is being asked to simultaneously provide multiple different pieces of memory. This leads to requests being serialized, and in our experience this can disproportionately slow down a kernel -- and the register layouts required by wgmma and mma instructions would naively suffer from these bank conflicts. The solution is to rearrange shared memory with various “swizzling” patterns so as to avoid these conflicts, but it is an important detail to get right.
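To make that concrete, here is a toy example in plain CUDA: a 32x32 float tile with a deliberately simple XOR swizzle (illustrative only; this is not the swizzle wgmma requires). Reading a "column" of the naive tile hits one bank 32 times; with the swizzle, each warp's accesses spread across all 32 banks.

```cuda
// Shared memory is split into 32 four-byte banks; threads in a warp that hit
// the same bank at different addresses get serialized.
__device__ __forceinline__ int swizzle_col(int row, int col) {
    return (col ^ row) & 31;  // rotate columns per row so column reads spread over all banks
}

// Launch as transpose_tile<<<1, dim3(32, 32)>>>(in, out);
__global__ void transpose_tile(const float* in, float* out) {
    __shared__ float tile[32][32];
    int r = threadIdx.y, c = threadIdx.x;

    // Global read is coalesced; the swizzled shared-memory store is conflict-free.
    tile[r][swizzle_col(r, c)] = in[r * 32 + c];
    __syncthreads();
    // Without the swizzle this read would be tile[c][r]: a 32-way bank conflict.
    out[r * 32 + c] = tile[c][swizzle_col(c, r)];
}
```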

More generally, we have found it very valuable to avoid movement between registers and shared memory when possible, and otherwise to use the built-in hardware (wgmma and TMA instructions) to do data movement asynchronously when possible. Synchronous movement using the actual warps is a worst-case fallback with the greatest generality.

Address Generation

One interesting quirk of the H100 is that the tensor cores and memory are both fast enough that merely producing the memory addresses to fetch takes a substantial fraction of the resources of the chip. (This is even more the case when complicated interleaved or swizzling patterns are added in.)

NVIDIA appears to understand this, as they have bestowed on us the Tensor Memory Accelerator (or TMA, as it likes to be called). TMA allows you to specify a multi-dimensional tensor layout in global and shared memory, tell it to asynchronously fetch a subtile of that tensor, and trip a barrier when it’s done. This saves all of the address generation costs, and additionally makes it much easier to construct pipelines.

We have found TMA to be, like wgmma.mma_async, completely indispensable in achieving the full potential of the H100. (Probably moreso than wgmma, in our experience.) It saves register resources and instruction dispatches, and also has useful features such as the ability to perform reductions onto global memory asynchronously, too -- this is particularly useful in complex backwards kernels. As with wgmma, the main quirk of it is that its swizzling modes are a bit difficult to decipher without some reverse engineering, but we had substantially less pain on this point.
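For the curious, here is a rough host-side sketch of what "specify a multi-dimensional tensor layout" looks like via the CUDA 12 driver API (cuTensorMapEncodeTiled). The shapes, the 64x64 box, and the 128-byte swizzle choice are illustrative assumptions, and the kernel-side part (the async bulk-tensor copy and its mbarrier) is omitted; ThunderKittens wraps all of this for you.

```cuda
#include <cuda.h>
#include <cuda_bf16.h>

// Describe a row-major (rows x cols) bf16 matrix in global memory so that TMA
// can fetch 64x64 subtiles of it into shared memory. The resulting CUtensorMap
// is then passed to the kernel (by value, as a __grid_constant__ parameter).
CUtensorMap make_tile_map(__nv_bfloat16* gmem, cuuint64_t rows, cuuint64_t cols) {
    CUtensorMap map;
    cuuint64_t global_dim[2]    = {cols, rows};                    // innermost dimension first
    cuuint64_t global_stride[1] = {cols * sizeof(__nv_bfloat16)};  // bytes; dims 1..rank-1 only
    cuuint32_t box_dim[2]       = {64, 64};                        // the subtile TMA will fetch
    cuuint32_t elem_stride[2]   = {1, 1};

    // Assumes gmem is 16-byte aligned and the row stride is a multiple of 16 bytes.
    cuTensorMapEncodeTiled(
        &map,
        CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,
        /*tensorRank=*/2,
        gmem,
        global_dim,
        global_stride,
        box_dim,
        elem_stride,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_128B,      // 64 bf16 = 128 bytes in the inner dimension
        CU_TENSOR_MAP_L2_PROMOTION_NONE,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return map;
}
```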

Occupancy

For those newer to CUDA, occupancy refers to the number of co-scheduled threads on the exact same execution hardware. Each cycle, the warp scheduler on that quadrant of the SM will try to issue an instruction to a warp of threads that are ready for an instruction. NVIDIA uses this model because it can enable the hardware to be more easily kept full. For example, while one warp of threads is waiting for a matrix multiply, another can receive an instruction to use the fast exponential hardware.

In some ways, the H100 is less reliant on occupancy than previous generations of the hardware. The asynchronous features of the chip mean that even a single instruction stream can keep many parts of the hardware busy -- fetching memory, running matrix multiplies, doing shared memory reductions, and still simultaneously running math on the registers.

But occupancy is very good at hiding both sins and sync’s. A perfectly designed pipeline might run reasonably fast even without any additional occupancy, but our observations suggest that NVIDIA really has designed their GPUs with occupancy in mind. And there are enough synchronizations -- and enough ways to make mistakes -- that finding ways to increase occupancy has, in our experience, usually yielded good returns at increasing the realized utilization of the hardware.

Finally, while occupancy is merely useful on the H100, we have found it increasingly important on the A100, and more so still on the RTX 4090, likely because those GPUs rely more heavily on synchronous instruction dispatch than the H100 does.
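To see why registers end up being the limiting resource here, multiply out the per-quadrant numbers from the spec list above:

$$4\ \text{quadrants}\times 512\ \text{registers}\times 32\ \text{words}=65{,}536\ \text{32-bit registers per SM}$$

At the ISA limit of 255 registers per thread that is only 256 resident threads (two warps per quadrant); at 128 registers per thread, 512 threads. Every extra warp of occupancy you want for hiding latency comes straight out of the register budget of the warps already resident.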

ThunderKittens

Based on the above, we asked ourselves how we might make it easier to write the kinds of kernels we care about while still extracting the full capabilities of the hardware. Motivated by a continuing proliferation of new architectures within the lab (and the fact that Flash Attention is like 1200 lines of code), we ended up designing a DSL embedded within CUDA -- at first for our own internal use.

But then we decided it was useful enough that, with love in our hearts, we cleaned it up and have released it for you. ThunderKittens is that embedded DSL. It is named ThunderKittens because we think kittens are cute, and also we think it is funny to make you type kittens:: in your code.

Figure 4: A ThunderKitten. Look at her big eyes! Are you not entranced!?!?

It is meant to be as simple as possible, and contains four templated types:

  • Register tiles -- 2D tensors on the register file.
  • Register vectors -- 1D tensors on the register file.
  • Shared tiles -- 2D tensors in shared memory.
  • Shared vectors -- 1D tensors in shared memory.

Tiles are parameterized by a height, width, and layout. Register vectors are parameterized by a length and a layout, and shared vectors just by a length. (They don’t generally suffer from bank conflicts.)

We also give operations to manipulate them, either at the warp level or at the level of a collaborative group of warps. Examples include:

  • Initializers -- zero out a shared vector, for example.
  • Unary ops, like exp
  • Binary ops, like mul
  • Row / column ops, like a row_sum

Since ThunderKittens is embedded within CUDA (in contrast to libraries like Triton, which we also love very much and rely on heavily), the abstractions fail gracefully. If it’s missing something, just extend it to do what you want!

To show an example of these primitives in action, consider Tri’s lovely flash attention -- a beautiful algorithm, but complicated to implement in practice, even on top of NVIDIA’s wonderful Cutlass library.
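To give a sense of where that complexity lives, here is the algorithmic core of flash attention -- the online softmax update -- written as deliberately plain, scalar C++ (our own sketch, not Tri's code or a ThunderKittens kernel). Each block of keys and values updates a running row max and running denominator, and rescales the partial output to match:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Process one block of attention scores (q·k for one query row against a block of
// keys) and the matching block of values, updating the running statistics.
// Initialize running_max = -infinity, running_sum = 0, and out = zeros before the first block.
void online_softmax_update(const std::vector<float>& scores,                 // length = block size
                           const std::vector<std::vector<float>>& v_block,   // block size x d
                           float& running_max, float& running_sum,
                           std::vector<float>& out) {                        // length d, unnormalized
    float new_max = running_max;
    for (float s : scores) new_max = std::max(new_max, s);

    float rescale = std::exp(running_max - new_max);   // shrink previous contributions
    running_sum *= rescale;
    for (float& o : out) o *= rescale;

    for (size_t j = 0; j < scores.size(); ++j) {
        float p = std::exp(scores[j] - new_max);
        running_sum += p;
        for (size_t k = 0; k < out.size(); ++k) out[k] += p * v_block[j][k];
    }
    running_max = new_max;
}
// After the last block, dividing out[] by running_sum gives the final attention output row.
```

The real kernels do exactly this, but per 16x16 tile, with the matrix multiplies on the tensor cores and the max/sum as row reductions.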

Here's a simple forward flash attention kernel for an RTX 4090, written in ThunderKittens.

Altogether, this is about 60 lines of CUDA sitting at 75% hardware utilization -- and while it is fairly dense, most of the complexity is in the algorithm, rather than in swizzling patterns or register layouts. And what of all of the complexity of TMA, WGMMA, swizzling modes, and descriptors? Here’s a FlashAttention-2 forward pass for the H100, written with ThunderKittens.

So how does it do?

This kernel is just 100 lines, and it actually outperforms FlashAttention-2 on the H100 by about 30%. ThunderKittens takes care of wrapping up the layouts and instructions, and gives you a mini-pytorch to play with on the GPU.

Figure 5: FA2 (via Pytorch) versus TK for a wide range of configs on the H100 SXM.

We are also releasing kernels for Based linear attention and other forthcoming architectures. Our Based linear attention kernel runs at 215 TFLOPs (or more than 300 TFLOPs when the recompute inherent in the algorithm is considered). And while linear attention is of course theoretically more efficient, such methods have historically been dramatically less efficient on real hardware. So we feel this could open up a broad range of high-throughput applications -- more to come on this point later.

Figure 6: Linear attention can be quite quick with TK!

If this seems up your alley, feel free to play with it!

Tiles Seem Like a Good Idea

In our view, what has made ThunderKittens work well for us is that it does not try to do everything. CUDA is indeed far more expressive than ThunderKittens. ThunderKittens is small and dumb and simple.

Figure 7: the main message of this blog post.

But ThunderKittens has good abstractions -- small tiles -- that match where both AI and hardware are going. ThunderKittens doesn’t support any dimension less than 16. But in our view, this doesn’t really matter, since the hardware doesn’t particularly want anything smaller, either. And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?

From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this -- after all this time, it’s still just matrix multiplies, reductions, and reshapes. And we think the hardware wants this, too -- small matrix multiplies are just begging for hardware support beyond just the systolic mma.

In fact, more broadly we believe we should really reorient our ideas of AI around what maps well onto the hardware. How big should a recurrent state be? As big as can fit onto an SM. How dense should the compute be? No less dense than the hardware demands. An important future direction of this work for us is to use our learnings about the hardware to help us design the AI to match.

Tiles Seem Pretty General

Coming soon -- ThunderKittens on AMD hardware!
