Team: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, Muhammad Osama, Christopher Ré, Simran Arora
Links: Arxiv | Code
AI is gated by hardware. We think that opening up AI’s compute landscape is one of the most important problems to be working on right now. Building towards this goal, we present HipKittens: SoTA AMD kernels and a collection of opinionated programming primitives to make AMD kernel dev easier!
HipKittens is named after HIP, AMD's equivalent of CUDA.
Building towards multi-silicon AI systems
While AI has largely used a single hardware vendor to get to its current stage, AMD GPU hardware now offers state-of-the-art peak compute and memory bandwidth. However, this performance is locked away from AI workflows due to the lack of mature AMD software.
| Spec | NVIDIA B200 SXM5 | AMD MI355X OAM |
|---|---|---|
| BF16 matrix / tensor | 2.2 PFLOPs | 2.5 PFLOPs |
| MXFP8 matrix / tensor | 4.5 PFLOPs | 5.0 PFLOPs |
| MXFP6 matrix / tensor | 4.5 PFLOPs | 10.1 PFLOPs |
| MXFP4 matrix / tensor | 9.0 PFLOPs | 10.1 PFLOPs |
| Memory capacity | 180 GB | 288 GB |
| Memory bandwidth | 8.0 TB/s | 8.0 TB/s |
Table 1: Hardware overview. Peak memory and compute speeds for the latest generation GPU platforms.
The AMD software ecosystem includes AITER, a high performance AI kernel library; PyTorch and a few compilers (Triton, Mojo, TileLang); and Composable Kernel (CK), AMD's C++-based programming model for writing kernels. However, despite gigawatt-scale AMD deployments, the software remains brittle.
- The existing software offerings fail to consistently achieve peak performance. CK kernels frequently underperform (see our evaluations below). AITER and PyTorch are volatile; for instance, the AITER and PyTorch SDPA Llama GQA backwards kernels achieve just 30% and 24% of SoTA performance, respectively, on AMD MI355X GPUs. The compilers currently sacrifice significant performance and have not yet demonstrated reusable programming primitives for AMD. Further, we find that some critical aspects of hardware functionality around bank conflict avoidance are undocumented in the CDNA ISA, as discussed in our technical deep dive blog.
We expand on a few observations about current compilers:
- Mojo's MHA forwards kernel suffers from bank conflicts and achieves just 50% of peak performance on the MI355X (e.g., 430 TFLOPs at B=16, H=16, N=2048, D=128).
- TileLang is currently limited to CDNA3, and its MHA kernel "is competitive with PyTorch", which is amongst the slowest baselines (see results below). TileLang lacks many features that we believe are important for AMD kernels (multiple matrix core shapes like 32x32x16, buffer_load_dwordx4, XCD chiplet swizzling) and also depends on Composable Kernel, adding complexity to the library.
- On AMD, Triton struggles with register lifetime tracking and with lowering memory accesses to the most performant intrinsics. For example, it may fail to reclaim registers or to emit vectorized loads. In our evaluations, we find that Triton kernels from AMD's ROCm/Triton tend to underperform (even on a vanilla BF16 GEMM).
While the Pythonic interface and portability of such compilers are helpful, our goal is to identify the principles that lead to peak performance as well.
- As a result, AMD's most performant AI kernels are hand-optimized by experts in raw assembly. This approach is very difficult to scale to the breadth of AI workloads, so many widely used AI workloads remain unsupported or under-optimized on AMD (e.g., some attention problem shapes, the non-causal GQA backwards pass, and memory-bound kernels).

Figure: What is raw assembly? Can't understand it? That's the point!
With all of this, the best path forward for multi-silicon kernel development remains an open question!
As a result, the AI community says that there’s a CUDA moat in AI software: tweet #1, tweet #2, tweet #3 and many more.

That being said, developing performant NVIDIA kernels was also painstaking a few years ago. Using low-level CUDA/CUTLASS, it took two years between the H100 GPU's release and the release of peak-performance open-source attention kernels. Compilers and LLMs-for-kernel-development on NVIDIA have so far sacrificed performance for simplicity and struggled to quickly support new hardware features.
Opinionated primitives are simplifying the process of writing performant NVIDIA kernels! Through the community's effort on these DSLs and AI assistants, NVIDIA kernel development is starting to get easier. Last year, we shared our opinionated take on kernel DSLs: keep the familiar PyTorch feel, but make the primitives C++ embedded to get peak performance, simplicity, and extensibility across hardware platforms and AI workloads. In May 2024, we shared ThunderKittens (TK), and we have been excited to see its ideas used in a wave of frameworks this year, like CuTe DSL in Sept 2025, TinyGrad's "tinykittens", TileLang in April 2025, and Gluon in June 2025. It's been fun to see TK used at companies like Together AI, Jump Trading, and Cursor, and in academic research.
We were then curious whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It wasn't obvious where this exploration would end up: most modern kernels are designed around NVIDIA-specific hardware features. AMD hardware differs meaningfully (no wgmma/tcgen05 pipelined async matmuls, no TMA, no mbarriers, no register reallocation, smaller shared memory, chiplets instead of a monolithic die, etc.), and we weren't sure where performance would land nor how different the primitives might look compared to NVIDIA frameworks.
Our exploration resulted in HipKittens, a minimal, opinionated collection of C++ embedded programming primitives for fast AMD kernels. We find:
- The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs—including tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces—translate naturally to AMD.
- Backend implementations are architecture-specific. The underlying memory access patterns (e.g., swizzling schemes, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences.
- Scheduling strategies adapt to hardware constraints. The scheduling patterns both within a processor and across processors differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization underperforms on CDNA3 and CDNA4. However, reasoning about schedules at the tile granularity—rather than at the level of individual registers or memory transactions—continues to simplify development, maintain code readability, and enable peak performance.
Ultimately, we see that tile-based abstractions remain general across architectures, providing evidence that a unified, performant programming model for AI accelerators is achievable. The key insight is separating the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.
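To make the interface/implementation split concrete, here is a minimal CPU-side sketch of the tile idea. This is a hypothetical illustration, not the actual HipKittens API: the `tile`, `add`, and `load` names are ours, and on a real GPU backend the body of `load` is where architecture-specific swizzling and vectorized-load choices would live, hidden behind the same interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A register tile: a small, fixed-shape block of values.
template <typename T, std::size_t Rows, std::size_t Cols>
struct tile {
    std::array<T, Rows * Cols> data{};
    T& at(std::size_t r, std::size_t c) { return data[r * Cols + c]; }
    const T& at(std::size_t r, std::size_t c) const { return data[r * Cols + c]; }
};

// PyTorch-like bulk operator: elementwise add over whole tiles.
template <typename T, std::size_t R, std::size_t C>
void add(tile<T, R, C>& dst, const tile<T, R, C>& a, const tile<T, R, C>& b) {
    for (std::size_t i = 0; i < R * C; ++i) dst.data[i] = a.data[i] + b.data[i];
}

// Composable load: copy a tile out of a larger row-major buffer.
// The backend implementation of this interface is what differs per
// architecture (swizzling scheme, vector width, etc.).
template <typename T, std::size_t R, std::size_t C>
void load(tile<T, R, C>& dst, const T* src, std::size_t src_stride,
          std::size_t row0, std::size_t col0) {
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            dst.at(r, c) = src[(row0 + r) * src_stride + (col0 + c)];
}
```

The point of the sketch is the separation of concerns: kernel code is written against tiles and tile operations, while the per-architecture memory-access details stay inside the `load`/`store` implementations.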
Climbing out of the CUDA moat: Introducing HipKittens
We first explored ThunderKittens for NVIDIA, then ThunderMittens on Apple Silicon and now we’re excited to share HipKittens (HK) for AMD!

Figure: The Kittens cinematic universe! Towards multi-silicon AI!
HK kernels are performant, while remaining easy to read and modify! We might not need raw assembly for peak performance AMD kernels any more! Life is good!

Figure: HipKittens riding the ~wave~ (not warp).
Let's go through the results:
- Our attention forwards kernels are written in ~500 lines of code and outperform all of AMD's baselines on average, including the AITER kernels, which are written in hand-optimized assembly! We report results across head dimensions and sequence lengths, for both causal and non-causal settings.




- Our GEMM kernel features a compact hot loop and achieves peak performance. Again, the AITER and hipBLASLt kernels are programmed in ... raw assembly!
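For readers unfamiliar with the shape of such a kernel, here is a CPU-side sketch of what a tile-granularity GEMM hot loop looks like structurally. This is our own illustrative model, not HipKittens code: on a real GPU, the scalar inner loops would be matrix-core instructions operating on register tiles, and the loads would go through the swizzled shared-memory path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tile edge length; a hypothetical value for this CPU sketch.
constexpr std::size_t TILE = 4;

// C = A * B with row-major M x K, K x N, M x N matrices, assuming
// M, N, K are multiples of TILE. The inner "hot loop" walks K in
// tile-sized chunks, accumulating into a per-output-tile accumulator,
// which mirrors the structure of a tiled GPU GEMM kernel.
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t N,
                std::size_t K) {
    for (std::size_t i0 = 0; i0 < M; i0 += TILE)
        for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
            float acc[TILE][TILE] = {};  // accumulator tile ("registers")
            // --- hot loop over K tiles ---
            for (std::size_t k0 = 0; k0 < K; k0 += TILE)
                for (std::size_t i = 0; i < TILE; ++i)
                    for (std::size_t k = 0; k < TILE; ++k) {
                        float a = A[(i0 + i) * K + (k0 + k)];
                        for (std::size_t j = 0; j < TILE; ++j)
                            acc[i][j] += a * B[(k0 + k) * N + (j0 + j)];
                    }
            // Store the finished accumulator tile back to C.
            for (std::size_t i = 0; i < TILE; ++i)
                for (std::size_t j = 0; j < TILE; ++j)
                    C[(i0 + i) * N + (j0 + j)] = acc[i][j];
        }
}
```

Keeping the hot loop this small is possible precisely because scheduling is reasoned about at tile granularity rather than per register or per memory transaction.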


- We also get speedy attention backwards, rotary, and fused dropout-residual-layernorm kernels compared to the strongest available baselines! These results use a fixed head dimension, and we vary the sequence length.




Multi-silicon AI is coming!
Realizing AI's full potential requires diverse, open hardware. Today, that means making AMD GPUs truly accessible.
We want more AI in the world. AI has so far relied on and innovated around a single hardware provider, but we should be able to use and experiment with every kind of compute available. We need to be able to use the fastest hardware out there. We're happy to help address these problems with HipKittens!
Check out part two for a technical deep dive on HK.