GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell

Original link: https://www.wafer.ai/blog/glm52-amd

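With AI inference demand outpacing the supply of NVIDIA Blackwell GPUs, costs are soaring. Wafer argues that AMD’s Instinct MI350 series offers a compelling, more cost-effective alternative: it is roughly 2.75x cheaper per GPU than NVIDIA’s competing parts, with comparable hardware specs. NVIDIA’s day-0 software support remains a significant advantage, but the “CUDA moat” is eroding quickly. Wafer shows that high-performance inference on AMD depends more on optimization than on limitations of the hardware itself. Using AMD Quark’s MXFP4 quantization, a deliberate framework choice (sglang), and custom tuning of kernel fallbacks, Wafer achieved striking throughput on GLM5.2. In benchmarks, Wafer reached an aggregate throughput of 2626 tok/s/node on the MI355X, delivering 80% of B200 performance at less than half the cost. The AMD stack currently requires more manual configuration and engineering effort, but Wafer demonstrates that with the right optimizations AMD hardware can deliver the best inference performance per dollar, effectively narrowing the real-time performance gap with NVIDIA.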

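A recent Hacker News post notes that Wafer.ai is running GLM5.2 on AMD MI355X hardware at 2,626 tokens/s per node, at a claimed cost more than 50% below NVIDIA’s Blackwell architecture. The discussion focuses on AMD’s viability as an NVIDIA alternative, especially in international markets where power is expensive and NVIDIA supply is constrained. Main points:

* **Performance metrics:** Users asked for performance-per-watt data, noting that energy efficiency and software reliability are critical for data centers operating outside the US.
* **Software unlock:** One commenter suggested that the rise of “agentic coding drivers” could broaden support for alternative architectures, making it easier than before for engineers to optimize for hardware such as AMD’s.
* **Business model:** Commenters raised questions about the underlying economics, in particular whether the lower cost stems from higher gross margins or from limits on hardware utilization.

Overall, the discussion reflects growing interest in AMD as a potential challenger to NVIDIA’s dominance, provided its software support and energy efficiency hold up at scale.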

Original article

Have you noticed we like AMD?

The demand for inference is skyrocketing and outpacing supply. With frontier models being released almost every other week — Claude Fable, GLM5.2, and Minimax M3, to name a few — the token craze is only getting crazier, and there aren’t enough Blackwells going around to support it. Thus, NVIDIA GPU prices are climbing fast, and tokens are getting really expensive.

In comes AMD. At around 2.75x cheaper per GPU on average (MI355X vs B300) with comparable hardware specs, the solution to cheap inference is hiding in plain sight, a message we at Wafer have been preaching for months. But although AMD’s Instinct MI350 series competes with Blackwells at the silicon level, NVIDIA’s software advantage and day-0 support typically allow providers to serve inference much faster on their hardware with much less friction.

Conversely, on the MI355X / ROCm stack, SOTA performance rarely comes out of the box for these frontier models (though sometimes it does!). In fact, you’re lucky if you can find an image that runs them at all. Without day-0 support, building and optimizing for the newest models can take weeks of engineering and compute. By then, a newer model has already been released, so AMD is always playing catch-up.

But as agents improve at kernel and model optimization, this gap is closing in real time. At Wafer, we’ve proven this time and time again.

And again: on a 20k in / 1k out workload with a 60% cache hit rate, we hit an aggregate throughput of 2626 tok/s/node at 2.4 RPS, with the knee defined as ≤5s TTFT. That is about 80% of the performance we measured on a B200, on hardware that is over 2x cheaper.

Sustained RPS	Aggregate tok/s/node	TTFT p50 / p95	Success
0.5	449	0.59s / 0.60s	100%
1.0	974	0.60s / 0.81s	100%
1.5	1913	0.62s / 1.03s	100%
2.0	1944	0.62s / 1.05s	100%
2.25	2089	0.63s / 1.23s	100%
2.4 (saturation)	2626	0.81s / 2.22s	100%
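
The performance-per-dollar claim can be sanity-checked with a few lines of arithmetic. The sketch below uses the 2626 tok/s/node figure above and the B200 figure of 3192 tok/s/node reported later in the post. The cost ratio of 2.0 is an assumption: it is the conservative lower bound of "over 2x cheaper", not a quoted price.

```python
def relative_perf_per_dollar(tput_a, tput_b, cost_ratio_b_over_a):
    """Compare system A against system B.

    Returns (throughput ratio A/B, performance-per-dollar ratio A/B),
    where cost_ratio_b_over_a says how many times more B costs than A.
    """
    throughput_ratio = tput_a / tput_b
    return throughput_ratio, throughput_ratio * cost_ratio_b_over_a


MI355X_TOK_S = 2626        # aggregate tok/s/node at 2.4 RPS (from the post)
B200_TOK_S = 3192          # aggregate tok/s/node at 3.0 RPS (from the post)
ASSUMED_COST_RATIO = 2.0   # assumption: lower bound of "over 2x cheaper"

ratio, perf_per_dollar = relative_perf_per_dollar(
    MI355X_TOK_S, B200_TOK_S, ASSUMED_COST_RATIO
)
print(f"throughput ratio: {ratio:.2f}")
print(f"perf per dollar:  {perf_per_dollar:.2f}x")
```

Under that assumption the MI355X node delivers about 0.82 of the B200 throughput and roughly 1.6x the throughput per dollar. A larger real cost gap would push the second number higher.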

We also hit 213 tok/s single stream on GLM5.2 with 10k input tokens / 1.5k output tokens, following Artificial Analysis standards, served on AMD MI355X capacity from TensorWave. Though this number doesn’t top the AA leaderboard, it still wins on performance per dollar.

How we did it

The first step with any model work is to choose a quantization and a framework. We quantized the base bf16 GLM-5.2 to MXFP4 with AMD Quark. Compared with z-ai’s official FP8 quantization, our MXFP4 was effectively lossless on GPQA-Diamond, tau2, and GSM8K: the deltas are at most about 0.02 in either direction.

Eval	FP8 baseline	MXFP4	Δ (MXFP4 − FP8)
GSM8K (200q, 5-shot, greedy)	0.965 ± 0.013	0.955 ± 0.014	−0.010
GPQA-Diamond (198q × 2 seeds, temp 1.0)	0.9217 ± 0.027	0.9026 ± 0.029	−0.019
tau2 macro	0.819	0.834	+0.015
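
One way to read the table is to ask how large each delta is relative to the reported uncertainty. The sketch below assumes the ± values are standard errors and that the FP8 and MXFP4 runs are independent, so the errors combine in quadrature. Neither assumption is stated in the post. tau2 is left out because no error bar is reported for it.

```python
import math

# (mean, reported +/-) for FP8 baseline and MXFP4, copied from the table.
EVALS = {
    "GSM8K": ((0.965, 0.013), (0.955, 0.014)),
    "GPQA-Diamond": ((0.9217, 0.027), (0.9026, 0.029)),
}


def delta_in_sigmas(fp8, mxfp4):
    """Return (MXFP4 - FP8) divided by the combined uncertainty.

    Assumes the two +/- values are independent standard errors.
    """
    (fp8_mean, fp8_err), (mx_mean, mx_err) = fp8, mxfp4
    combined_err = math.hypot(fp8_err, mx_err)  # sqrt(a^2 + b^2)
    return (mx_mean - fp8_mean) / combined_err


sigmas = {name: delta_in_sigmas(*pair) for name, pair in EVALS.items()}
for name, z in sigmas.items():
    print(f"{name}: {z:+.2f} sigma")
```

Both deltas come out at roughly half a combined standard error, which is consistent with treating the differences as noise under these assumptions.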

As for the inference framework, we had three options: vLLM, ATOM, and sglang. We chose sglang. vLLM had no working MXFP4 + GlmMoeDsa path, so the MXFP4 weights provided no benefit, and ATOM’s output degraded at long context. sglang was the engine with the least friction on the way to native support, able to take advantage of the quantization while remaining coherent.

The next natural step to improving throughput was enabling speculative decode on sglang. However, the sglang ROCm image does not support this out of the box. There were two fixes needed before MTP worked properly.

First, the MTP head, like every other layer, keeps its single shared expert stored in bf16, not MXFP4. However, the MTP head is registered under a different module prefix than the main decoder stack (Quark names its bf16 shared expert model.layers.78.mlp.shared_experts.*, while the MTP layer’s real prefix is model.decoder.*). Because of the mismatch, sglang’s quantization lookup fails and defaults to building that shared expert as MXFP4. At load it then tries to read a full-width bf16 weight into a half-width 4-bit slot and the init crashes on a shape mismatch. Quark records which weights to leave un-quantized as a list of layer names, so we copied over the layer 78 entries to that list a second time under the decoder name sglang actually uses. This fix unblocked speculative decode, netting us close to a 3x gain in single stream throughput.
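
The fix described above amounts to duplicating entries in a list of layer names under a second prefix. Here is a minimal sketch of that idea, not Wafer’s actual patch. The helper name `alias_excluded_layers`, the leaf name `gate_proj`, and the exact prefix mapping from `model.layers.78.` to `model.decoder.` are illustrative assumptions. The real entry names and where the list lives in the Quark export should be read from the checkpoint itself.

```python
def alias_excluded_layers(exclude, src_prefix, dst_prefix):
    """Return `exclude` plus a copy of every entry that starts with
    `src_prefix`, rewritten to start with `dst_prefix` instead.

    Original order is preserved and no entry is added twice.
    """
    result = list(exclude)
    seen = set(result)
    for name in exclude:
        if name.startswith(src_prefix):
            alias = dst_prefix + name[len(src_prefix):]
            if alias not in seen:
                result.append(alias)
                seen.add(alias)
    return result


# Hypothetical entries for illustration only; real names come from the export.
exclude = [
    "lm_head",
    "model.layers.78.mlp.shared_experts.gate_proj",
]
patched = alias_excluded_layers(exclude, "model.layers.78.", "model.decoder.")
for name in patched:
    print(name)
```

Because the original entries are kept, a loader that resolves the shared expert under either name finds it in the un-quantized list.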

Second, deep speculative decode (such as the 5/1/6 config z-ai suggests) was still blocked. The fused multi-step metadata kernel needed for draft depth ≥4 contains #include <cuda_runtime.h> with no ROCm guard. Fix: one #ifdef USE_ROCM guard.

Two trivial, but necessary changes to take full advantage of speculative decode. With spec dec working properly, alongside a few config optimizations (such as --kv-cache-dtype fp8_e4m3 and --enable-aiter-allreduce-fusion), we reached our headline single stream decode number at 213 tok/s.

But for aggregate throughput, especially with our defined workload, decode optimizations are necessary but insufficient. At 20k in @ 60% cache, the workload is primarily prefill bound.
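
A rough token count shows why prefill dominates this workload. The sketch assumes the 60% cache hit rate means 60% of each request’s input tokens are served from the prefix cache, which the post does not spell out. It counts tokens only and says nothing about per-token cost, which differs between prefill and decode.

```python
def token_rates(rps, input_tokens, output_tokens, cache_hit_rate):
    """Return (uncached prefill tokens/s, decode tokens/s) at a given
    request rate, treating cache_hit_rate as the fraction of input
    tokens that do not need to be prefilled again."""
    uncached_prefill = input_tokens * (1.0 - cache_hit_rate)
    return rps * uncached_prefill, rps * output_tokens


# Workload from the post: 20k in / 1k out, 60% cache hit rate, 2.4 RPS.
prefill_tok_s, decode_tok_s = token_rates(2.4, 20_000, 1_000, 0.60)
print(f"uncached prefill tokens/s: {prefill_tok_s:.0f}")
print(f"decode tokens/s:           {decode_tok_s:.0f}")
print(f"prefill : decode ratio:    {prefill_tok_s / decode_tok_s:.1f}")
```

Even after the cache absorbs most of the input, each request still brings about 8k tokens to prefill against 1k tokens to decode.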

At TP8, which was the configuration optimized for single stream decode, the MI355X can run GLM5.2-MXFP4 at 1461 tok/s/node. Switching to TP4×DP2 netted a massive improvement on this workload, getting us to 1944 tok/s/node at 2.0 RPS — still relatively slow compared to our measured Blackwell performance, which hit 3192 tok/s/node at 3.0 RPS. A big reason for the poor prefill performance on the MI355X is that on the sglang image, GLM-5.2’s fp4 MoE was silently on a slow FlyDSL heuristic fallback (aiter only shipped tuned configs for the a8w8/fp8 path). We tuned the MoE kernel selection ourselves on GLM’s fp4 shapes (model_dim 6144, moe_inter 2048, E=256, topk=8), which allowed us to reach 2626 tok/s/node at 2.4 RPS. Much better.
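
Tuning kernel selection, in its simplest form, means measuring each candidate on the shapes you actually run and recording the winner instead of trusting a heuristic. The sketch below is a generic illustration of that idea, not aiter’s tuner or its config format. The candidate names, the `measure` callback, and the latency numbers are all placeholders, not measurements. Only the shape values come from the post.

```python
from typing import Callable, Dict, Iterable, Tuple

# (model_dim, moe_inter, num_experts, topk)
Shape = Tuple[int, int, int, int]


def select_kernels(
    shapes: Iterable[Shape],
    candidates: Iterable[str],
    measure: Callable[[str, Shape], float],
) -> Dict[Shape, str]:
    """For each shape, measure every candidate and keep the fastest.

    `measure(name, shape)` must return a latency (lower is better).
    """
    names = list(candidates)
    table: Dict[Shape, str] = {}
    for shape in shapes:
        latencies = {name: measure(name, shape) for name in names}
        table[shape] = min(latencies, key=latencies.get)
    return table


GLM_FP4_SHAPE: Shape = (6144, 2048, 256, 8)  # shape values from the post

# Placeholder latencies in ms, standing in for real on-device benchmarks.
fake_latency_ms = {
    "heuristic_fallback": 3.0,
    "candidate_a": 1.9,
    "candidate_b": 2.2,
}
table = select_kernels(
    [GLM_FP4_SHAPE],
    fake_latency_ms,
    lambda name, shape: fake_latency_ms[name],
)
print(table[GLM_FP4_SHAPE])
```

A real tuner would replace the lambda with an on-device benchmark and would typically sweep token counts as well as the fixed model shape.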

Why this matters

Although there was some friction, achieving the best performance-per-dollar ratio on the MI355X wasn’t particularly hard. There were some framework-related bugs, but unlike our work with Qwen3.5 397B, we didn’t actually write any custom kernels this time. This study doesn’t take multi-node performance into consideration, but single-node deployments remain highly prevalent in practice.

SOTA on AMD is becoming more a matter of support, not software. The CUDA moat is eroding in real time.