Bringing Up DeepSeek-V4-Flash on AMD MI300X

Original link: https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/

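Doubleword is turning to AMD's MI300X accelerators to cope with the ongoing GPU shortage. Although the MI300X offers more HBM capacity and a lower cost than NVIDIA's H100, adoption is still held back by software friction, particularly for complex models such as DeepSeek-V4. In a recent technical deep dive, Doubleword detailed the sharp edges it hit while optimizing vLLM for the MI300X. The main challenges were:

  • FP8 incompatibility: the MI300X uses the "fnuz" FP8 dialect, so values handled as if they were in the standard OCP format come out off by a factor of two.
  • AITER library limitations: AMD's tuned-kernel library lacks full coverage of DeepSeek-V4's specialized kernels on the CDNA3 architecture, so custom ROCm-specific fallbacks are needed.
  • HIP graph constraints: getting HIP graphs to work requires restructuring kernels to avoid host-side dynamic allocation.

Through manual optimization and kernel tuning, Doubleword achieved an 8.6% performance improvement and successfully deployed the model on MI300X hardware. Although the effort was substantial, the author argues that the software gap is narrowing as AMD keeps updating its standards, and therefore considers the MI300X a viable, cost-effective alternative to NVIDIA for large-scale inference.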

Hacker News discussion: "Bringing Up DeepSeek-V4-Flash on AMD MI300X" (fergusfinn.com), 17 points, submitted by kkm, 3 comments.

mezark: We at Doubleword are very bullish on AMD for low-interactivity inference; it just takes more effort on the software side...

kkm: And the vLLM patches that accompany the blog post: https://github.com/doublewordai/vllm-amd-blog-doubleword

benlm: Nice work! Does DeepSeek V4 Pro work on 8xMI300X with these patches?

Original article

At Doubleword we are building an inference cloud designed for volume. To do that we have to reckon with the enveloping compute shortage.

AMD’s MI300X launched in December 2023

The reason is software. The problems with running AI workloads on AMD have been written about exhaustively elsewhere, and there are signs the gap is closing on AMD’s newer chips.

On paper MI300X is an excellent accelerator. We want it to work. This post is a worklog of all the sharp edges and winding paths we found when we tried to get it working.

FP8 dialect

The MI300X was part of the accelerator generation that kicked off the march toward lower bitwidths. LLM weights, and to a lesser extent activations and KV caches, are less sensitive to numerical imprecision than typical HPC workloads, so the Hopper generation of NVIDIA chips and the first Instinct chips added hardware support for sub-16-bit precision for the first time. The result is twice as many FLOPs applied to workloads that correspondingly transfer half as much data.

The problem is that there was disagreement on the best way to build an FP8 datatype. Graphcore and AMD proposed one standard in a 2022 preprint, backed by Qualcomm. Arm, Intel, and NVIDIA proposed another through the Open Compute Project. In a rehash of some of the forks in the road that led to IEEE 754

Perhaps unsurprisingly given the list of backers on each side, the AMD / Graphcore standard didn’t make it. AMD’s newer MI325, MI350, and MI355X chips all moved over to OCP-standard FP8. But MI300X still only works in the fnuz dialect.

Lots of vLLM’s FP8 paths are aware of e4m3 versus e5m2 but not of fnuz versus OCP. The two share their bit layout but differ in exponent bias by one, so the same byte read as the wrong dialect comes back off by exactly a factor of two. MI300X is the only major accelerator where that distinction matters in practice.
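To make the factor of two concrete, here is a minimal pure-Python sketch that decodes one e4m3 byte under both dialects. The function name `decode_e4m3` is made up for illustration and is not a vLLM API. It assumes the usual definitions: exponent bias 7 for OCP e4m3 with NaN at `0x7F`/`0xFF`, and exponent bias 8 for fnuz with `0x80` as the single NaN and no negative zero.

```python
import math

def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one 8-bit e4m3 value (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if fnuz:
        # fnuz: the "negative zero" bit pattern is the only NaN.
        if byte == 0x80:
            return math.nan
        bias = 8
    else:
        # OCP e4m3: all-ones exponent and mantissa is NaN, there is no inf.
        if exp == 0xF and man == 0x7:
            return math.nan
        bias = 7
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)

# The same byte, read under each dialect.
same_byte = 0x38
as_ocp = decode_e4m3(same_byte, fnuz=False)
as_fnuz = decode_e4m3(same_byte, fnuz=True)

# Every byte that is finite in both dialects differs by exactly 2x.
ratios_ok = all(
    decode_e4m3(b, False) == 2.0 * decode_e4m3(b, True)
    for b in range(256)
    if not math.isnan(decode_e4m3(b, False)) and not math.isnan(decode_e4m3(b, True))
)
```

One consequence of the sketch: a code path that reinterprets OCP-encoded bytes as fnuz without touching the data has to double the accompanying scale factor to recover the original values, and it has to treat the few NaN bit patterns separately because they do not line up between the two dialects.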

Missing attention fast paths

DeepSeek v4’s attention is sparse. Each query attends to a top-k subset of the KV cache picked by a learned indexer, with sliding-window context handled separately.

It’s got a lot of moving pieces: KV compression, the indexer, the sliding-window path, FP8 caches feeding each. In a production deployment for maximum performance, each piece needs special attention (no pun intended) in the form of a tuned kernel.
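As a rough picture of the top-k selection described above, the following toy sketch attends from a single query to only the `k` cache positions with the highest indexer score. It is an illustration under assumptions, not DeepSeek's kernel: the indexer scores are taken as given rather than learned, there is no KV compression, FP8 cache, or sliding-window path, and `sparse_attend` is a hypothetical name.

```python
import math

def sparse_attend(q, keys, values, index_scores, k):
    """Attend from one query to the k cache entries the indexer ranks highest."""
    # Pick the k positions with the highest indexer score.
    top = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    # Scaled dot-product logits over the selected positions only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in top]
    # Numerically stable softmax over the selected logits.
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return top, out

# Four cached positions; the indexer prefers positions 0 and 2.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]
values = [[1.0], [2.0], [3.0], [100.0]]
index_scores = [0.9, 0.1, 0.8, 0.0]
selected, output = sparse_attend(q, keys, values, index_scores, k=2)
```

Position 3 has by far the largest dot product with the query, but it never contributes because the indexer did not select it. That is why the indexer, the gather, and the attention over the gathered entries each end up needing their own tuned kernel.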

The source of fast tuned kernels on AMD is AITER. AITER is AMD’s tuned-kernel library, roughly the analog of what NVIDIA users get from cuBLAS, cuDNN, FlashAttention, and Transformer Engine combined. vLLM falls back to generic Triton when AITER doesn’t have a path for a given shape, and generic Triton attention is several times slower than a tuned kernel. AITER’s coverage for DSV4 is uneven, and what coverage exists tends to target later AMD parts (CDNA4) rather than the CDNA3 (gfx942) cores in MI300X.

The fallout from this has two different shapes. Some pieces are missing AITER paths entirely on gfx942: paged MQA logits, sparse MLA prefill, sparse MLA decode. For each we need to put in a ROCm-specific helper that calls into AITER where it exists and falls through to a Triton implementation where it doesn’t. Other pieces have AITER paths that exist but break specifically on gfx942: AITER prefill MQA logits and AITER sparse prefill logits both fall here. The fix is to refuse to dispatch into them when current_platform reports gfx942 and let the Triton fallback handle the call instead.
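The two cases above reduce to one dispatch rule, sketched here in plain Python. Every name in it (`pick_kernel`, `BROKEN_ON_GFX942`, the op strings, the stand-in implementations) is hypothetical and chosen for illustration; none of them are vLLM or AITER identifiers, and the real patches are structured differently.

```python
# Hypothetical op names: AITER paths that exist but misbehave on gfx942.
BROKEN_ON_GFX942 = {"prefill_mqa_logits", "sparse_prefill_logits"}

def pick_kernel(op, arch, aiter_impls, triton_impls):
    """Return the AITER kernel when it exists and is trusted on this arch,
    otherwise fall through to the generic Triton implementation."""
    impl = aiter_impls.get(op)
    known_bad = arch == "gfx942" and op in BROKEN_ON_GFX942
    if impl is not None and not known_bad:
        return impl
    return triton_impls[op]

# Stand-in implementations that just report which backend ran.
aiter_impls = {
    "prefill_mqa_logits": lambda: "aiter",
    "dense_gemm": lambda: "aiter",
    # "paged_mqa_logits" is deliberately absent: no AITER path at all.
}
triton_impls = {
    "prefill_mqa_logits": lambda: "triton",
    "paged_mqa_logits": lambda: "triton",
    "dense_gemm": lambda: "triton",
}

missing = pick_kernel("paged_mqa_logits", "gfx942", aiter_impls, triton_impls)()
broken = pick_kernel("prefill_mqa_logits", "gfx942", aiter_impls, triton_impls)()
# Assumed arch string for a CDNA4 part, where the AITER path is trusted.
newer = pick_kernel("prefill_mqa_logits", "gfx950", aiter_impls, triton_impls)()
healthy = pick_kernel("dense_gemm", "gfx942", aiter_impls, triton_impls)()
```

Keeping the deny-list keyed on both the op and the architecture means newer parts continue to get the tuned path, and the entry can be deleted once the AITER kernel is fixed for gfx942.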

HIP graphs

HIP graphs are AMD’s analog of CUDA graphs, with effectively the same semantics: record the stream of operations once at warmup, replay the recorded graph on every subsequent step. The win is removing per-launch Python overhead from the decode loop, which matters a lot when you launch hundreds of small kernels per token. Since DeepSeek v4 has so many moving parts, there would be a lot of kernel launches if we didn’t leverage graphs.

The price is that the captured region has to be a pure function of its device inputs. Anything that reads from the host, allocates a ragged tensor whose shape depends on the live batch, or synchronises inside the captured region gets recorded with whatever value it had at warmup and replayed forever after.
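The "recorded at warmup, replayed forever" failure mode can be shown without a GPU. The toy below is not the HIP or PyTorch graph API: `ToyGraph` is a made-up class that records calls once and replays them verbatim, Python lists stand in for device buffers with fixed addresses, and a plain float stands in for a host scalar.

```python
class ToyGraph:
    """Records (function, arguments) pairs once and replays them verbatim."""
    def __init__(self):
        self.ops = []

    def capture(self, fn, *args):
        # Arguments are frozen here, like kernel parameters at capture time.
        self.ops.append((fn, args))

    def replay(self):
        for fn, args in self.ops:
            fn(*args)

def scale_host(out, src, factor):
    # 'factor' is a host scalar: its value is baked in at capture.
    for i, x in enumerate(src):
        out[i] = x * factor

def scale_device(out, src, factor_buf):
    # 'factor_buf' is a buffer: the kernel reads its current contents.
    for i, x in enumerate(src):
        out[i] = x * factor_buf[0]

src = [1.0, 2.0]
out_bad, out_good = [0.0, 0.0], [0.0, 0.0]
host_factor = 2.0
factor_buf = [2.0]

graph = ToyGraph()
graph.capture(scale_host, out_bad, src, host_factor)
graph.capture(scale_device, out_good, src, factor_buf)
graph.replay()  # warmup step: both outputs are [2.0, 4.0]

# Next step: inputs change in place, and so does the host scalar.
src[0] = 10.0
host_factor = 3.0
factor_buf[0] = 3.0
graph.replay()
```

After the second replay `out_bad` still reflects the stale factor of 2 while `out_good` picks up the new factor of 3. The same reasoning applies to shapes: a scratch tensor sized from the live batch inside the captured region keeps its warmup size on every later step.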

The AITER tuned kernels compose with this by construction. AITER kernels are C++ launches that take device pointers and sizes; they don’t allocate ragged scratch from Python and they don’t read host scalars mid-stream. It’s pretty easy to write a Triton kernel that doesn’t compose nicely with graph capture; we did that a couple of times.

Loose ends

We ran into a bunch of smaller issues:

  • An MoE routing bug where the expert-mask shape was gated on whether ROCm AITER was globally enabled, not on whether the matmul about to be called was actually AITER’s. With AITER globally on but MXFP4 falling through to the emulation backend, the kernel got the wrong mask and tokens routed to the wrong experts. 8b5f7aa2c.
  • A Triton kernel that masks padded lanes against the global tensor bound rather than the logical block size. At high concurrency the padded lanes scribbled across the MoE routing bitmatrix. c32932bb9.
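The padded-lane bug in the second item is easy to reproduce in miniature. The sketch below simulates one block-wide store in plain Python rather than Triton; `store_row` and its parameters are hypothetical. It assumes a row of 5 logical elements processed by a block of 8 lanes (the next power of two), with rows packed back to back in one flat buffer.

```python
def store_row(buf, row, row_len, lane_values, block, mask_on_global_bound):
    """Simulate one block-wide masked store of a row into a flat buffer."""
    base = row * row_len
    for lane in range(block):
        offset = base + lane
        if mask_on_global_bound:
            # Buggy: any lane that still lands inside the tensor may write.
            allowed = offset < len(buf)
        else:
            # Fixed: only lanes inside this row's logical size may write.
            allowed = lane < row_len
        if allowed:
            buf[offset] = lane_values[lane]

ROW_LEN, BLOCK = 5, 8
# Valid lanes carry 1; padded lanes carry garbage (9).
lane_values = [1, 1, 1, 1, 1, 9, 9, 9]

buggy = [0] * (2 * ROW_LEN)
store_row(buggy, 0, ROW_LEN, lane_values, BLOCK, mask_on_global_bound=True)

fixed = [0] * (2 * ROW_LEN)
store_row(fixed, 0, ROW_LEN, lane_values, BLOCK, mask_on_global_bound=False)
```

With the global-bound mask, the three padded lanes of row 0 overwrite the first three elements of row 1. In the simulation the neighbour is just another row; the bullet above describes the same overrun landing on the MoE routing bitmatrix once concurrency was high enough.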

Tuning it up

With correctness sorted, we can do some basic optimization.

The first profile of a working DSV4-Flash on MI300X shows that the expensive layers are the sparse MLA path and the MXFP4 MoE path. This is good — if it wasn’t the case we’d be really screwed.

However, after first bring-up, a meaningful slice of the time is not in the matmuls themselves but in the bookkeeping and tuning around them.

On our simple benchmark that takes the box from 2485 to 2699 output tok/s per GPU, about +8.6%.

Was it worth it?

After bringing up the model, optimizing it, and testing it, we get pretty good numbers:

This is a win: MI300X rents for roughly half the price of the NVIDIA capacity it competes with, carries more than twice the HBM per card, and is available on-demand right now, even as H100 and H200 lead times stretch out. We haven’t done the maths to prove that we can get a win on tokens per second per dollar over NVIDIA hardware, but we’ve proven that with hard work we can get close enough to make it useful.

Most of what made it so hard is temporary. The FP8 dialect problem is specific to CDNA3: MI325, MI350, and MI355X all moved to OCP-standard FP8, so the off-by-a-factor-of-two trap does not exist on newer parts. The AITER coverage gaps will fill in over time as AMD’s kernel work catches up to its own hardware. And since we did this work, even while we prepared to open-source it, vLLM’s performance & stability on this model have improved.

AMD’s hardware has been good for a while. The reason the software gap is finally closing is partly AMD’s own focus, and partly that the cost of doing this kind of work
