DeepSeek Open-Sources FlashMLA – MLA Decoding Kernel for Hopper GPUs

Original link: https://github.com/deepseek-ai/FlashMLA

FlashMLA is a high-performance decoding kernel for variable-length sequences on NVIDIA Hopper GPUs, requiring CUDA 12.3+ (optimized for 12.8+). On an H800 SXM5 it reaches up to 3000 GB/s of memory bandwidth and 580 TFLOPS of compute, supporting BF16/FP16 and a paged KV cache with a block size of 64. Inspired by FlashAttention 2 & 3 and CUTLASS, FlashMLA integrates with PyTorch 2.0+. Adapted versions are available for MetaX (MetaX-MACA/FlashMLA), Moore Threads (MooreThreads/MT-flashMLA), Hygon DCU (OpenDAS/MLAttention), Intellifusion NNP (Intellifusion/tyllm), and Iluvatar Corex GPUs (Deep-Spark/FlashMLA); these ports can be found on the vendors' official websites and in their GitHub or Gitee repositories. The original FlashMLA project, titled "FlashMLA: Efficient MLA decoding kernels" by Jiashi Li, was open-sourced on GitHub in 2025.


Original text

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequence serving.

Currently released:

  • BF16, FP16
  • Paged KV cache with a block size of 64 (an indexing sketch follows below)

To benchmark the kernel, run:

python tests/test_flash_mla.py

This reaches up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations on an H800 SXM5, using CUDA 12.8.
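
"Paged" here means the cache is organized as fixed 64-token pages addressed through a per-sequence block table, so sequences of very different lengths can share one physical pool. The minimal sketch below illustrates that indexing only; the page count, head count, and the MLA head dimension of 576 are assumptions, not values fixed by the API.

import torch

# Illustrative paged-KV-cache layout (assumed shapes, not the library's own setup code).
# Each page holds 64 consecutive tokens; a per-sequence block table maps logical
# page numbers to physical page indices in one shared pool.
block_size = 64
num_pages, h_kv, d = 256, 1, 576             # d = 576 is typical for DeepSeek-style MLA
kvcache = torch.zeros(num_pages, block_size, h_kv, d, dtype=torch.bfloat16)

# Sequence 0 owns physical pages 3 and 7, in that logical order.
block_table = torch.tensor([[3, 7]], dtype=torch.int32)

# The KV entry for logical token position t of sequence b lives at:
def kv_slot(b, t):
    page = block_table[b, t // block_size]   # which physical page
    return kvcache[page, t % block_size]     # slot within that page

print(kv_slot(0, 100).shape)                 # token 100 -> page 7 (index 1), offset 36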

Typical usage:

from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# The tile-scheduler metadata depends only on the cached sequence lengths and the
# head counts, so it is computed once per decoding step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    # q_i: queries for layer i; kvcache_i: layer i's paged KV cache;
    # dv: head dimension of the values; lse_i: per-query log-sum-exp.
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
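
Putting the pieces together, the sketch below fills in plausible inputs for a single decode step. The batch size, head counts (h_q = 128, h_kv = 1), head dimensions (d = 576, dv = 512), and the one-token query (s_q = 1) are assumptions typical of DeepSeek-style MLA decoding, not values prescribed by the API.

import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Assumed decode-step shapes (illustrative only):
batch, s_q, h_q, h_kv = 4, 1, 128, 1          # one new query token per sequence
d, dv, block_size = 576, 512, 64              # MLA latent+RoPE dim, value dim, page size
max_blocks_per_seq = 32                       # room for 32 * 64 = 2048 cached tokens
device, dtype = "cuda", torch.bfloat16

cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device=device)
block_table = torch.arange(batch * max_blocks_per_seq, dtype=torch.int32,
                           device=device).view(batch, max_blocks_per_seq)
kvcache = torch.randn(batch * max_blocks_per_seq, block_size, h_kv, d,
                      dtype=dtype, device=device)
q = torch.randn(batch, s_q, h_q, d, dtype=dtype, device=device)

# Metadata is computed once per step and shared across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

# o: (batch, s_q, h_q, dv) attention output; lse: per-query log-sum-exp.
o, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
print(o.shape, lse.shape)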

Requirements (a quick environment check is sketched after this list):

  • Hopper GPUs
  • CUDA 12.3 and above
    • But we highly recommend 12.8 or above for the best performance
  • PyTorch 2.0 and above
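
As an informal sanity check of those requirements (not part of FlashMLA), something like the following can confirm the GPU generation and toolchain versions before building the kernel:

import torch

# Informal environment check (illustrative; not part of FlashMLA).
print("PyTorch:", torch.__version__)               # want 2.0 or newer
print("CUDA runtime:", torch.version.cuda)         # want 12.3+, ideally 12.8+
major, minor = torch.cuda.get_device_capability()  # Hopper (H100/H800) reports 9.0
print(f"Compute capability: {major}.{minor}")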

FlashMLA is inspired by the FlashAttention 2 & 3 and CUTLASS projects.

Community support is available for several other accelerators.

For MetaX GPUs, visit the official website: MetaX.

The corresponding FlashMLA version can be found at: MetaX-MACA/FlashMLA

For the Moore Threads GPU, visit the official website: Moore Threads.

The corresponding FlashMLA version is available on GitHub: MooreThreads/MT-flashMLA.

For the Hygon DCU, visit the official website: Hygon Developer.

The corresponding FlashMLA version is available here: OpenDAS/MLAttention.

For the Intellifusion NNP, visit the official website: Intellifusion.

The corresponding FlashMLA version is available on Gitee: Intellifusion/tyllm.

For Iluvatar Corex GPUs, visit the official website: Iluvatar Corex.

The corresponding FlashMLA version is available on GitHub: Deep-Spark/FlashMLA

Citation:

@misc{flashmla2025,
      title = {FlashMLA: Efficient MLA decoding kernels},
      author = {Jiashi Li},
      year = {2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}