HipKittens: Fast and furious AMD kernels

Original link: https://hazyresearch.stanford.edu/blog/2025-11-09-hk



Team: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, Muhammad Osama, Christopher Ré, Simran Arora
Links: Arxiv | Code

AI is gated by hardware. We think that opening up AI’s compute landscape is one of the most important problems to be working on right now. Building towards this goal, we present HipKittens: SoTA AMD kernels and a collection of opinionated programming primitives to make AMD kernel dev easier!

Named after AMD's CUDA equivalent, called HIP.

Building towards multi-silicon AI systems

While AI has largely used a single hardware vendor to get to its current stage, AMD GPU hardware now offers state-of-the-art peak compute and memory bandwidth. However, this performance is locked away from AI workflows due to the lack of mature AMD software.

| Spec | NVIDIA B200 SXM5 | AMD MI355X OAM |
| --- | --- | --- |
| BF16 matrix / tensor | 2.2 PFLOPs | 2.5 PFLOPs |
| MXFP8 matrix / tensor | 4.5 PFLOPs | 5.0 PFLOPs |
| MXFP6 matrix / tensor | 4.5 PFLOPs | 10.1 PFLOPs |
| MXFP4 matrix / tensor | 9.0 PFLOPs | 10.1 PFLOPs |
| Memory capacity | 180 GB | 288 GB |
| Memory bandwidth | 8.0 TB/s | 8.0 TB/s |

Table 1: Hardware overview. Peak memory and compute speeds for the latest-generation GPU platforms.

The AMD software ecosystem includes AITER, a high performance AI kernel library; PyTorch and a few compilers (Triton, Mojo, TileLang); and Composable Kernel (CK), AMD's C++ based programming model for writing kernels. However, despite gigawatt-scale AMD deployments, the software remains brittle.

  1. The existing software offerings fail to consistently achieve peak performance. CK kernels frequently underperform (see our evaluations below). AITER and PyTorch are volatile; for instance, the AITER and PyTorch SDPA Llama GQA backwards kernels achieve just 30% and 24% of SoTA performance respectively on AMD MI355X GPUs. The compilers currently sacrifice significant performance and have not yet demonstrated reusable programming primitives for AMD. Further, we find that some critical aspects of hardware functionality around bank conflict avoidance are undocumented in the CDNA ISA, as discussed in our technical deep-dive blog.
A few observations about current compilers: while the Pythonic interface and portability of such compilers are helpful, our goal is to identify the principles that lead to peak performance as well.

  2. As a result, AMD's most performant AI kernels need to be hand-optimized by experts in raw assembly. It is very difficult to scale this to the breadth of AI workloads, and as a result many widely used AI workloads are unsupported or under-optimized on AMD (e.g., we see this on some attention problem shapes, the non-causal GQA backwards pass, and memory-bound kernels).

Figure: what is raw assembly? can't understand it? that's the point!

With all of this, it remains up in the air what the best path forward is for multi-silicon kernel development!

As a result, the AI community says that there’s a CUDA moat in AI software: tweet #1, tweet #2, tweet #3 and many more.

But that being said, developing performant NVIDIA kernels was also painstakingly tedious a few years ago. Using low-level CUDA/CUTLASS, it took two years from the H100 GPU's release to the release of peak-performance open-source attention kernels. Compilers and LLMs-for-kernel-development on NVIDIA have so far sacrificed performance for simplicity and struggled to quickly support new hardware features.

Opinionated primitives are simplifying the process of writing performant NVIDIA kernels! Amazingly, through the community's effort on all these DSLs and AI assistants, NVIDIA kernel development is starting to get easier! Last year, we shared our opinionated take on kernel DSLs — keep the familiar PyTorch feel, but make the primitives C++ embedded to get peak performance, simplicity, and extensibility to hardware platforms and AI workloads. In May 2024, we shared ThunderKittens (TK), and have been excited to see its ideas used in a wave of frameworks like CuTe DSL in Sept 2025, Tiny Grad "tinykittens", TileLang in April 2025, and Gluon in June 2025. It's been fun to see TK used in companies like Together AI, Jump Trading, and Cursor and in academic research.

So then we were curious whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It wasn't obvious to us where this exploration would end up; most modern kernels are designed around NVIDIA-specific hardware features. AMD hardware differs meaningfully (no wgmma/tcgen05 pipelined async matmuls, no tma, no mbarriers, no register reallocation, smaller shared memory, chiplet instead of monolithic, etc.) and we weren't sure where performance would end up nor how different the primitives might look compared to NVIDIA frameworks.

Our exploration resulted in HipKittens, a minimal, opinionated collection of C++ embedded programming primitives for fast AMD kernels. We find:

  1. The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs—including tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces—translate naturally to AMD.
  2. Backend implementations are architecture-specific. The underlying memory access patterns (e.g., swizzling schemes, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences.
  3. Scheduling strategies adapt to hardware constraints. The scheduling patterns both within a processor and across processors differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization underperforms on CDNA3 and CDNA4. However, reasoning about schedules at the tile granularity—rather than at the level of individual registers or memory transactions—continues to simplify development, maintain code readability, and enable peak performance.

Ultimately, we see that tile-based abstractions remain general across architectures, providing evidence that a unified, performant programming model for AI accelerators is achievable. The key insight is separating the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.
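To make the interface/implementation split concrete, here is a minimal, hedged C++ sketch (not the real HipKittens API; all names are illustrative). The tile type and the PyTorch-like bulk operator form the portable interface; the storage layout and the body of `mma` are the backend details that would differ per architecture (swizzled layouts, MFMA on AMD, tensor cores on NVIDIA).

```cpp
#include <array>
#include <cstddef>

// Illustrative tile type: the interface (element type, shape, element
// access, bulk ops) is architecture-agnostic; the flat row-major storage
// below is a stand-in for an architecture-specific (e.g. swizzled) layout.
template <typename T, std::size_t Rows, std::size_t Cols>
struct tile {
    std::array<T, Rows * Cols> data{};  // backend could swizzle this layout
    T&       operator()(std::size_t r, std::size_t c)       { return data[r * Cols + c]; }
    const T& operator()(std::size_t r, std::size_t c) const { return data[r * Cols + c]; }
};

// PyTorch-like bulk operator over whole tiles: C += A * B.
// On real hardware this would lower to MFMA (AMD) or tensor-core (NVIDIA)
// instructions; here it is a portable scalar reference implementation.
template <typename T, std::size_t M, std::size_t K, std::size_t N>
void mma(tile<T, M, N>& C, const tile<T, M, K>& A, const tile<T, K, N>& B) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            T acc = C(i, j);
            for (std::size_t k = 0; k < K; ++k) acc += A(i, k) * B(k, j);
            C(i, j) = acc;
        }
}
```

A kernel written against `tile` and `mma` never mentions the layout, which is exactly what lets the same high-level code retarget a different backend.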

Climbing out of the CUDA moat: Introducing HipKittens

We first explored ThunderKittens for NVIDIA, then ThunderMittens on Apple Silicon and now we’re excited to share HipKittens (HK) for AMD!

Figure: The Kittens cinematic universe! Towards multi-silicon AI!

HK kernels are performant, while remaining easy to read and modify! We might not need raw assembly for peak performance AMD kernels any more! Life is good!

Figure: HipKittens riding the ~wave~ (not warp).

Let's go through the results:

  1. Our attention forwards kernels are written in ~500 lines of code and outperform all of AMD's baselines on average, including the AITER kernels, which are written in hand-optimized assembly! We show different head dimensions (D) and sequence lengths (N), for both causal and non-causal settings.
  2. Our GEMM kernel features a hot loop that's <100 lines of code.
  3. We also get speedy attention backwards pass, rotary, and fused dropout-residual-layernorm kernels compared to the strongest available baselines! These results use head dimension 128, and we vary the sequence length.
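To give a loose sense of why a tile-granularity hot loop stays short, here is a hedged, CPU-only C++ sketch of the classic double-buffered structure such GEMM kernels use: stage the tiles for step k+1 while computing on the tiles for step k. All names and shapes are illustrative, not the actual HipKittens kernel; on a GPU, `stage` would be an asynchronous global-to-shared copy and `mma` a matrix-core instruction, overlapped so compute units never stall on memory.

```cpp
#include <cstddef>
#include <vector>

// Illustrative double-buffered GEMM hot loop over K tiles (not the real
// HipKittens kernel). A "tile" here is a flat TILE x TILE block of floats.
constexpr std::size_t TILE = 4;        // tile side length (illustrative)

using Tile = std::vector<float>;       // holds TILE * TILE elements

static void stage(const float* src, Tile& dst) {          // "load tile"
    dst.assign(src, src + TILE * TILE);
}

static void mma(Tile& C, const Tile& A, const Tile& B) {  // "compute tile"
    for (std::size_t i = 0; i < TILE; ++i)
        for (std::size_t j = 0; j < TILE; ++j)
            for (std::size_t k = 0; k < TILE; ++k)
                C[i * TILE + j] += A[i * TILE + k] * B[k * TILE + j];
}

// C (TILE x TILE) += A * B, where A and B are each stored as kTiles
// consecutive TILE x TILE blocks along the reduction dimension.
void gemm_hot_loop(Tile& C, const float* A, const float* B, std::size_t kTiles) {
    Tile bufA[2]{Tile(TILE * TILE), Tile(TILE * TILE)};
    Tile bufB[2]{Tile(TILE * TILE), Tile(TILE * TILE)};
    stage(A, bufA[0]);                           // prologue: stage tile 0
    stage(B, bufB[0]);
    for (std::size_t k = 0; k < kTiles; ++k) {
        if (k + 1 < kTiles) {                    // stage tile k+1 ...
            stage(A + (k + 1) * TILE * TILE, bufA[(k + 1) % 2]);
            stage(B + (k + 1) * TILE * TILE, bufB[(k + 1) % 2]);
        }
        mma(C, bufA[k % 2], bufB[k % 2]);        // ... while computing tile k
    }
}
```

The loop body only says *which tiles* move and compute each step; everything below tile granularity (layouts, instruction selection, synchronization) lives in the primitives, which is what keeps the hot loop compact and readable.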

Multi-silicon AI is coming!

Realizing AI's full potential requires diverse, open hardware. Today, that means making AMD GPUs truly accessible.

We want more AI in the world. AI has relied on and innovated on a single hardware provider, but we need to be able to use and experiment with all the kinds of compute we can. We need to be able to use the fastest hardware out there. We’re happy to help address these problems with HipKittens!

Check out part two for a technical deep dive on HK.

Links: Arxiv | Code
