DeepSeek Open Infra: Open-Sourcing 5 AI Repos in 5 Days

Original link: https://github.com/deepseek-ai/open-infra-index

DeepSeek AI, a small team focused on AGI exploration, began open-sourcing five production-tested repositories on February 24, 2025, one per day. The transparency-focused initiative aims to foster community-driven innovation. The releases include:

1. **FlashMLA:** an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences, with BF16 support and a paged KV cache.
2. **DeepEP:** an open-source EP (expert-parallel) communication library for MoE model training and inference, featuring optimized all-to-all communication, NVLink/RDMA support, and FP8 dispatch.
3. **DeepGEMM:** an FP8 GEMM library supporting both dense and MoE GEMMs, delivering high performance on Hopper GPUs with minimal dependencies and JIT compilation.
4. **Optimized parallelism strategies (DualPipe and EPLB):** DualPipe is a bidirectional pipeline-parallelism algorithm and EPLB an expert-parallel load balancer, both designed for computation-communication overlap in V3/R1 training.
5. **Fire-Flyer AI-HPC:** a research paper outlining a cost-effective software-hardware co-design for deep learning.


Original Text

We're a tiny team @deepseek-ai pushing our limits in AGI exploration.

Starting this week, on Feb 24, 2025, we'll open-source 5 repos (one daily drop) - not because we've made grand claims, but simply as developers sharing our small-but-sincere progress with full transparency.

These are humble building blocks of our online service: documented, deployed and battle-tested in production. No vaporware, just sincere code that moved our tiny yet ambitious dream forward.

Why? Because every line shared becomes collective momentum that accelerates the journey. Daily unlocks begin soon. No ivory towers - just pure garage-energy and community-driven innovation 🔧

Stay tuned – let's geek out in the open together.

Day 1 - FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
Optimized for variable-length sequences, battle-tested in production

🔗 FlashMLA GitHub Repo
✅ BF16 support
✅ Paged KV cache (block size 64; a sketch of the idea follows this list)
⚡ Performance: up to 3000 GB/s when memory-bound, up to 580 TFLOPS (BF16) when compute-bound, on H800
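
The paged KV cache is what makes variable-length batches cheap: each sequence owns a table of fixed-size pages (64 tokens here) in a shared pool, so memory grows page by page instead of being reserved per max-length sequence. Below is a minimal sketch of the idea; the class and method names are illustrative, not FlashMLA's actual API.

```python
# Minimal paged-KV-cache sketch (illustrative names, not FlashMLA's API).
import torch

BLOCK_SIZE = 64  # FlashMLA uses a page (block) size of 64 tokens

class PagedKVCache:
    """Fixed-size KV pages in a shared pool, indexed via per-sequence block tables."""

    def __init__(self, num_blocks: int, num_heads: int, head_dim: int):
        # One shared physical pool; sequences own pages through block tables.
        self.pool = torch.zeros(num_blocks, BLOCK_SIZE, num_heads, head_dim,
                                dtype=torch.bfloat16)
        self.free = list(range(num_blocks))

    def append(self, block_table: list, seq_len: int, kv: torch.Tensor) -> int:
        """Write one token's KV (shape [num_heads, head_dim]); allocate pages lazily."""
        if seq_len % BLOCK_SIZE == 0:            # current page is full
            block_table.append(self.free.pop())  # grab a fresh physical page
        page = block_table[seq_len // BLOCK_SIZE]
        self.pool[page, seq_len % BLOCK_SIZE] = kv
        return seq_len + 1

    def gather(self, block_table: list, seq_len: int) -> torch.Tensor:
        """Materialize a contiguous [seq_len, num_heads, head_dim] view for attention."""
        pages = self.pool[block_table]           # [n_pages, BLOCK_SIZE, H, D]
        return pages.flatten(0, 1)[:seq_len]
```

A real decoding kernel indexes pages in place through the block table rather than materializing a contiguous copy as `gather` does here.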

Day 2 - DeepEP

Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference.

🔗 DeepEP GitHub Repo
✅ Efficient and optimized all-to-all communication (a conceptual sketch follows this list)
✅ Both intranode and internode support with NVLink and RDMA
✅ High-throughput kernels for training and inference prefilling
✅ Low-latency kernels for inference decoding
✅ Native FP8 dispatch support
✅ Flexible GPU resource control for computation-communication overlapping
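
At the heart of EP communication is a variable-length all-to-all: every rank sends each routed token to the rank hosting its expert. The sketch below shows that dispatch step using plain torch.distributed collectives; DeepEP replaces this path with fused NVLink/RDMA kernels, and `dispatch_tokens` is an illustrative name, not DeepEP's API.

```python
# Conceptual MoE token dispatch via torch.distributed (not DeepEP's kernels).
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, expert_rank: torch.Tensor,
                    world_size: int) -> torch.Tensor:
    """Send each token to the rank hosting its routed expert.

    tokens:      [n_local, hidden] activations on this rank
    expert_rank: [n_local] int tensor, destination rank per token (from the router)
    """
    # Sort tokens by destination so each rank's slice is contiguous.
    order = torch.argsort(expert_rank)
    tokens = tokens[order]

    # How many tokens this rank sends to each destination rank.
    send_counts = torch.bincount(expert_rank, minlength=world_size)

    # Exchange per-rank counts (one element to/from each rank).
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Variable-length all-to-all of the token payloads themselves.
    recv = tokens.new_empty(int(recv_counts.sum()), tokens.shape[1])
    dist.all_to_all_single(recv, tokens,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return recv  # tokens now live on the ranks that own their experts
```

The combine step after expert computation is the same exchange run in reverse.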

Day 3 - DeepGEMM

Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference. A reference sketch of the FP8 math follows the feature list below.

🔗 DeepGEMM GitHub Repo
⚡ 1350+ FP8 TFLOPS on Hopper GPUs
✅ No heavy dependency, as clean as a tutorial
✅ Fully Just-In-Time compiled
✅ Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes
✅ Supports dense layout and two MoE layouts
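
As promised above, here is a numerics-only reference for what an FP8 GEMM computes: scale operands into the FP8 range, multiply, and rescale the result. It mirrors only the arithmetic, with simple per-tensor scaling and float32 accumulation; DeepGEMM itself uses finer-grained scaling and fused Hopper tensor-core kernels. Assumes PyTorch 2.1+ for the float8 dtype.

```python
# Dequantize-and-matmul reference for FP8 GEMM semantics (not DeepGEMM's kernels).
import torch

FP8_MAX = 448.0  # max representable magnitude of torch.float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    """Scale x into the FP8 range, cast, and return (fp8 tensor, inverse scale)."""
    scale = FP8_MAX / x.abs().amax().clamp(min=1e-12)
    return (x * scale).to(torch.float8_e4m3fn), 1.0 / scale

def fp8_gemm_reference(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a_q, a_inv = quantize_fp8(a)
    b_q, b_inv = quantize_fp8(b)
    # Accumulate in float32, as FP8 tensor cores do internally.
    out = a_q.to(torch.float32) @ b_q.to(torch.float32)
    return out * (a_inv * b_inv)

# Quantization error stays small relative to an exact float32 matmul.
a, b = torch.randn(128, 256), torch.randn(256, 64)
err = (fp8_gemm_reference(a, b) - a @ b).abs().max()
```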

Day 4 - Optimized Parallelism Strategies

DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
🔗 GitHub Repo

EPLB - an expert-parallel load balancer for V3/R1 (a greedy sketch of the core idea follows this section).
🔗 GitHub Repo

📊 Analysis of computation-communication overlap in V3/R1.
🔗 GitHub Repo
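
The problem EPLB addresses is that experts receive unequal traffic, so hot experts get extra replicas and the replicas are packed onto GPUs to even out the load. Below is a greedy sketch of that core idea under simplified assumptions; the real EPLB also implements a hierarchical, group-aware policy, and `balance_experts` is an illustrative name.

```python
# Greedy expert-replication and placement sketch (simplified, not EPLB itself).
import heapq

def balance_experts(loads: list, num_gpus: int, num_slots: int):
    """loads[i] = measured load of expert i; num_slots = total replicas to place.

    Returns a dict: gpu -> list of (expert, that replica's share of its load).
    """
    num_experts = len(loads)
    assert num_slots >= num_experts and num_slots % num_gpus == 0

    # 1) Give every expert one replica, then grant extra replicas to whichever
    #    expert currently carries the highest load per replica.
    replicas = [1] * num_experts
    heap = [(-loads[i], i) for i in range(num_experts)]
    heapq.heapify(heap)
    for _ in range(num_slots - num_experts):
        _, i = heapq.heappop(heap)
        replicas[i] += 1
        heapq.heappush(heap, (-loads[i] / replicas[i], i))

    # 2) Pack replicas onto GPUs: heaviest replica first, onto the GPU with
    #    the least accumulated load (classic LPT scheduling), capped by slots.
    shares = [(loads[i] / replicas[i], i)
              for i in range(num_experts) for _ in range(replicas[i])]
    shares.sort(reverse=True)
    per_gpu = num_slots // num_gpus
    gpu_load = [(0.0, g) for g in range(num_gpus)]
    heapq.heapify(gpu_load)
    placement = {g: [] for g in range(num_gpus)}
    for share, i in shares:
        load, g = heapq.heappop(gpu_load)
        while len(placement[g]) >= per_gpu:   # drop GPUs whose slots are full
            load, g = heapq.heappop(gpu_load)
        placement[g].append((i, share))
        heapq.heappush(gpu_load, (load + share, g))
    return placement
```

For example, with loads [8, 1, 1, 1, 1], two GPUs, and six slots, the sketch gives the hot expert a second replica and splits its traffic across both GPUs, leaving each GPU with an equal load of 6.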

Day 5 - Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

📄 Paper Link
📄 arXiv Paper Link
