DeepSeek Open-Sources FlashMLA – MLA Decoding Kernel for Hopper GPUs

Original link: https://github.com/deepseek-ai/FlashMLA

FlashMLA is a high-performance decoding kernel for variable-length sequences on NVIDIA Hopper GPUs, requiring CUDA 12.3+ (optimized for 12.8+). On an H800 SXM5 it reaches up to 3000 GB/s of memory bandwidth and 580 TFLOPS of compute, supporting BF16/FP16 and a paged KV cache with a block size of 64. Inspired by FlashAttention 2 & 3 and CUTLASS, FlashMLA integrates with PyTorch 2.0+. Adapted versions are available for MetaX (MetaX-MACA/FlashMLA), Moore Threads (MooreThreads/MT-flashMLA), Hygon DCU (OpenDAS/MLAttention), Intellifusion NNP (Intellifusion/tyllm), and Iluvatar Corex GPUs (Deep-Spark/FlashMLA); these ports can be found on the vendors' official websites and in their GitHub or Gitee repositories. The original FlashMLA project, titled "FlashMLA: Efficient MLA decoding kernels" by Jiashi Li, was open-sourced on GitHub in 2025.


Original text

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequence serving.

Currently released:

  • BF16, FP16
  • Paged KV cache with a block size of 64 (an indexing sketch follows below)

To benchmark the kernel, run:

python tests/test_flash_mla.py

This reaches up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations on an H800 SXM5, using CUDA 12.8.
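
"Paged" here means the cache is organized as fixed 64-token pages addressed through a per-sequence block table, so sequences of very different lengths can share one physical pool. The minimal sketch below illustrates that indexing only; the page count, head count, and the MLA head dimension of 576 are assumptions, not values fixed by the API.

import torch

# Illustrative paged-KV-cache layout (assumed shapes, not the library's own setup code).
# Each page holds 64 consecutive tokens; a per-sequence block table maps logical
# page numbers to physical page indices in one shared pool.
block_size = 64
num_pages, h_kv, d = 256, 1, 576             # d = 576 is typical for DeepSeek-style MLA
kvcache = torch.zeros(num_pages, block_size, h_kv, d, dtype=torch.bfloat16)

# Sequence 0 owns physical pages 3 and 7, in that logical order.
block_table = torch.tensor([[3, 7]], dtype=torch.int32)

# The KV entry for logical token position t of sequence b lives at:
def kv_slot(b, t):
    page = block_table[b, t // block_size]   # which physical page
    return kvcache[page, t % block_size]     # slot within that page

print(kv_slot(0, 100).shape)                 # token 100 -> page 7 (index 1), offset 36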

Typical usage:

from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# The tile-scheduler metadata depends only on the cached sequence lengths and the
# head counts, so it is computed once per decoding step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    # q_i: queries for layer i; kvcache_i: layer i's paged KV cache;
    # dv: head dimension of the values; lse_i: per-query log-sum-exp.
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
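
Putting the pieces together, the sketch below fills in plausible inputs for a single decode step. The batch size, head counts (h_q = 128, h_kv = 1), head dimensions (d = 576, dv = 512), and the one-token query (s_q = 1) are assumptions typical of DeepSeek-style MLA decoding, not values prescribed by the API.

import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Assumed decode-step shapes (illustrative only):
batch, s_q, h_q, h_kv = 4, 1, 128, 1          # one new query token per sequence
d, dv, block_size = 576, 512, 64              # MLA latent+RoPE dim, value dim, page size
max_blocks_per_seq = 32                       # room for 32 * 64 = 2048 cached tokens
device, dtype = "cuda", torch.bfloat16

cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device=device)
block_table = torch.arange(batch * max_blocks_per_seq, dtype=torch.int32,
                           device=device).view(batch, max_blocks_per_seq)
kvcache = torch.randn(batch * max_blocks_per_seq, block_size, h_kv, d,
                      dtype=dtype, device=device)
q = torch.randn(batch, s_q, h_q, d, dtype=dtype, device=device)

# Metadata is computed once per step and shared across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

# o: (batch, s_q, h_q, dv) attention output; lse: per-query log-sum-exp.
o, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
print(o.shape, lse.shape)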

Requirements (a quick environment check is sketched after this list):

  • Hopper GPUs
  • CUDA 12.3 and above
    • But we highly recommend 12.8 or above for the best performance
  • PyTorch 2.0 and above
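
As an informal sanity check of those requirements (not part of FlashMLA), something like the following can confirm the GPU generation and toolchain versions before building the kernel:

import torch

# Informal environment check (illustrative; not part of FlashMLA).
print("PyTorch:", torch.__version__)               # want 2.0 or newer
print("CUDA runtime:", torch.version.cuda)         # want 12.3+, ideally 12.8+
major, minor = torch.cuda.get_device_capability()  # Hopper (H100/H800) reports 9.0
print(f"Compute capability: {major}.{minor}")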

FlashMLA is inspired by the FlashAttention 2 & 3 and CUTLASS projects.

Community support is available for several other accelerators.

For MetaX GPUs, visit the official website: MetaX.

The corresponding FlashMLA version can be found at: MetaX-MACA/FlashMLA

For the Moore Threads GPU, visit the official website: Moore Threads.

The corresponding FlashMLA version is available on GitHub: MooreThreads/MT-flashMLA.

For the Hygon DCU, visit the official website: Hygon Developer.

The corresponding FlashMLA version is available here: OpenDAS/MLAttention.

For the Intellifusion NNP, visit the official website: Intellifusion.

The corresponding FlashMLA version is available on Gitee: Intellifusion/tyllm.

For Iluvatar Corex GPUs, visit the official website: Iluvatar Corex.

The corresponding FlashMLA version is available on GitHub: Deep-Spark/FlashMLA

Citation:

@misc{flashmla2025,
      title = {FlashMLA: Efficient MLA decoding kernels},
      author = {Jiashi Li},
      year = {2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}