UCCL-EP: DeepEP-style expert parallelism on any NIC, no GPU-initiated comms

Original link: https://fergusfinn.com/blog/uccl-ep-without-owning-the-nic/

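UCCL-EP is a library for high-performance expert parallelism on heterogeneous hardware; it sidesteps the strict hardware requirements of the original DeepEP library. DeepEP depends on GPU-initiated communication (IBGDA), in which the GPU drives the NIC directly. That works well on NVIDIA-specific hardware but is unavailable on interconnects that lack GPU-controllable queues, such as AWS EFA or HPE Slingshot. UCCL-EP fills this gap by decoupling the communication logic from the hardware. It keeps DeepEP's "contract" (one-sided writes, ordered signals, and quiet/fence operations) but replaces the direct hardware trigger with a CPU proxy. In this mode the GPU writes command descriptors into ring buffers in host memory, and a dedicated proxy thread on the CPU monitors those queues and dispatches the corresponding commands to the NIC. Crucially, the data path stays optimized: the NIC still performs direct memory access (DMA) into GPU high-bandwidth memory (HBM), so the CPU never touches the actual model activations. This design lets UCCL-EP deliver high-performance expert parallelism on any combination of accelerator and NIC, bringing up to a 2x throughput improvement on non-NVIDIA infrastructure.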

Original text

In the last post we looked at how expert parallel communications kernels work. This was a story of how the original DeepEP library from DeepSeek was organized. That library relies on GPU-initiated communication: the GPU has to be able to tell the NIC directly what to transfer and when.

The primitives that library introduces are sufficiently general and powerful that others have built on them to expand support across NICs and across GPU types. This is the story of UCCL, specifically UCCL-EP, which takes DeepEP-style communication patterns and makes them work for arbitrary NIC-accelerator pairs.

We’re interested in heterogeneous hardware here at Doubleword. We want the most tokens for the lowest price, regardless of what makes them.

Isambard-AI is a great facility. But its chips are connected with the HPE Slingshot interconnect. No GPU-initiated communication, no DeepEP.

This post is about how UCCL-EP gets expert parallelism to work across arbitrary interconnects.

The DeepEP contract

The fast parallel structure that DeepEP builds on relies on the existence of a few simple remote communication primitives:

  1. A one-sided write: put these bytes at that address on that rank. The receiver doesn’t post a matching receive or run any code to accept the data.
  2. An ordered signal: an atomic add into a known slot, telling the receiver the data has arrived. We need to be able to do this and know that it will land after the data has landed, so there’s a strict ordering requirement.
  3. A quiet: confirmation on the sender-side that all of its writes have completed, needed before it can reuse a source buffer or signal completion to anyone else.
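The three primitives above can be sketched as a toy, in-process model. This is only an illustration of the semantics, not NVSHMEM or UCCL code: the names `Rank`, `ToyFabric`, `put`, `signal`, and `quiet` are hypothetical, and "delivery" here is just a Python loop that applies operations in the order they were posted.

```python
from dataclasses import dataclass, field


@dataclass
class Rank:
    """Toy stand-in for one rank's registered memory (hypothetical model)."""
    memory: bytearray
    signals: dict = field(default_factory=dict)  # slot -> counter value


class ToyFabric:
    """In-process model of the three primitives; no real NIC is involved."""

    def __init__(self, ranks):
        self.ranks = ranks
        self.pending = []  # operations posted but not yet delivered, in post order

    def put(self, dst_rank, dst_offset, data):
        # One-sided write: the receiver runs no code to accept the bytes.
        self.pending.append(("put", dst_rank, dst_offset, bytes(data)))

    def signal(self, dst_rank, slot, value=1):
        # Ordered signal: an add that must land after the puts posted before it.
        self.pending.append(("add", dst_rank, slot, value))

    def quiet(self):
        # Quiet: returns only once every posted operation has been delivered.
        # Applying the list front to back is what gives signal-after-data here.
        for kind, rank, where, payload in self.pending:
            target = self.ranks[rank]
            if kind == "put":
                target.memory[where:where + len(payload)] = payload
            else:
                target.signals[where] = target.signals.get(where, 0) + payload
        self.pending.clear()
```

A sender would call `put`, then `signal`, then `quiet`; after `quiet` returns it can reuse its source buffer, and a receiver that observes the signal slot change can safely read the bytes.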

This contract is NVSHMEM’s device API: put_nbi for the write, amo_nonfetch_add for the signal, quiet for the fence. DeepEP calls these functions from NVSHMEM. The problem is that:

  1. On the accelerator side, NVSHMEM is NVIDIA only.
  2. On the NIC side, IBGDA

UCCL bridges the gaps by implementing the exact same contract

How DeepEP does it

An RDMA NIC is driven through queues in memory. To send something, a process writes a work queue entry, a small descriptor carrying the opcode, source address, destination address, and length, into a queue pair on the NIC

IBGDA moves the whole arrangement onto the GPU. The queue pair and completion queue are allocated in GPU memory, and the NIC’s doorbell register is mapped into the GPU’s address space. A warp inside the dispatch kernel builds the work queue entry itself, issues a memory fence, and writes the doorbell over PCIe. The NIC then pulls the payload directly out of HBM.

This satisfies the contract above trivially. The one-sided write is a write descriptor. The signal is an atomic-add descriptor posted to the same queue pair, and because a queue pair executes its descriptors in order, the signal-after-data guarantee comes for free. The quiet is the GPU polling the completion queue until everything it posted has completed.
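As a rough illustration of why an in-order queue pair gives signal-after-data for free, here is a single-threaded toy model. `WorkQueueEntry` and `ToyQueuePair` are hypothetical names, the opcode strings are not real verbs constants, and the "NIC" is a loop that drains the queue in posting order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkQueueEntry:
    opcode: str   # "WRITE" or "ATOMIC_ADD"; illustrative names only
    src: int      # source address (unused by the atomic in this toy)
    dst: int      # destination address or signal slot
    length: int   # payload length in bytes


class ToyQueuePair:
    """Single-threaded model of one queue pair plus its completion queue."""

    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()
        self.posted = 0

    def post(self, wqe):
        # The warp builds the entry and places it in the queue.
        self.send_queue.append(wqe)
        self.posted += 1

    def ring_doorbell(self):
        # The toy NIC executes entries strictly in posting order and
        # reports one completion per entry.
        while self.send_queue:
            self.completion_queue.append(self.send_queue.popleft())

    def quiet(self):
        # Poll the completion queue until everything posted has completed.
        order = []
        while len(order) < self.posted:
            if not self.completion_queue:
                raise RuntimeError("doorbell not rung: nothing will complete")
            order.append(self.completion_queue.popleft().opcode)
        self.posted = 0
        return order
```

Posting a write followed by an atomic add and then calling `quiet` returns the opcodes in the same order they were posted, which is the property the signal relies on.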

This is the fastest implementation you can reasonably build: a token’s send costs one descriptor and one doorbell write from the warp that owns it. But it all depends on the NIC cooperating with the GPU. The NIC has to work with its queues living in GPU memory, its doorbell being written by a GPU, and its DMA engine reaching into HBM.

The requirement that it does so creates an MxN problem: every GPU and NIC pairing has to be engineered to work together, vendor by vendor. Hyperscalers, as much as NVIDIA might want them to, won’t just buy NVIDIA GPUs and NVIDIA NICs and call it a day.

On AWS’s EFA, Broadcom’s NICs, or HPE’s Slingshot, there is no GPU-ownable queue to build any of this on, and the contract can’t be satisfied in this way.

Keep the contract, swap the transport

UCCL-EP’s starting observation is that nothing in the DeepEP dispatch and combine kernels depends on how the contract is implemented. The queues, the layouts, the formula addressing all live above it. So UCCL keeps DeepEP’s kernels nearly as they are and reimplements the three functions underneath them: nvshmemi_ibgda_put_nbi_warp, nvshmemi_ibgda_amo_nonfetch_add, and nvshmemi_ibgda_quiet.

The challenge is how to make those functions do what they’re supposed to. If we can’t get the GPU to drive the NIC directly, what can we do? UCCL manages it by pointing the GPU at a queue it can always write: ordinary host memory.

Since the CPU can drive the NIC (otherwise, what would the NIC be designed to interface with?), UCCL runs a constantly spinning CPU thread, the proxy, that monitors that queue and picks up and dispatches commands from the GPU.

To run a put, the warp packs a 16-byte command

The command carries addresses, not data: the activations stay in HBM, and the CPU never touches them. When the proxy posts the real descriptor, the NIC still pulls the payload directly out of GPU memory, exactly as in the IBGDA picture.
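To make "addresses, not data" concrete, here is a sketch of what a 16-byte command could look like. The field layout below is an assumption made purely for illustration (opcode, destination rank, a reserved field, length, and two offsets into registered buffers); the source does not give UCCL-EP's actual layout, and `pack_put`, `unpack`, and the opcode values are hypothetical.

```python
import struct
from typing import NamedTuple

# Hypothetical layout, little-endian, no padding:
# u8 opcode | u8 dst_rank | u16 reserved | u32 length | u32 src_offset | u32 dst_offset
_LAYOUT = struct.Struct("<BBHIII")
assert _LAYOUT.size == 16

# Illustrative opcode values, not taken from UCCL-EP.
OP_WRITE, OP_ATOMIC_ADD, OP_QUIET = 1, 2, 3


class Command(NamedTuple):
    opcode: int
    dst_rank: int
    reserved: int
    length: int
    src_offset: int
    dst_offset: int


def pack_put(dst_rank, src_offset, dst_offset, length):
    """Describe a write by where the bytes live; the payload is not copied."""
    return _LAYOUT.pack(OP_WRITE, dst_rank, 0, length, src_offset, dst_offset)


def unpack(raw):
    """Decode a 16-byte command, as a proxy thread would before posting it."""
    return Command(*_LAYOUT.unpack(raw))
```

Whatever the real layout is, the point is the same: a multi-kilobyte token is described by a fixed-size record, and only that record crosses to host memory.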

There are many warps and few rings, so a warp picks its ring by hashing its expert index across the proxy threads and their channels, and if the ring is full it spins until the CPU catches up. The structure is the same descriptor-queue arrangement the NIC offered, with the doorbell replaced by polling: the consumer on the other end is software, watching the ring’s head pointer instead of waiting for a register write.
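The ring described above can be sketched as a fixed-capacity queue with monotonically increasing head and tail counters. This is a single-threaded model: a real implementation would need atomics and memory fences between the GPU producer and the CPU consumer. `CommandRing` and `pick_ring` are hypothetical names, and the modulo in `pick_ring` is an assumed stand-in for whatever hash UCCL-EP actually uses.

```python
class CommandRing:
    """Toy ring: the producer advances head, the consumer advances tail."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0  # total commands ever written by the producer
        self.tail = 0  # total commands ever consumed by the proxy

    def try_push(self, command):
        # Full ring: the caller (the warp) would spin and retry.
        if self.head - self.tail == self.capacity:
            return False
        self.slots[self.head % self.capacity] = command
        self.head += 1  # publishing the new head is what the proxy polls for
        return True

    def try_pop(self):
        # Empty ring: the proxy keeps polling the head pointer.
        if self.tail == self.head:
            return None
        command = self.slots[self.tail % self.capacity]
        self.tail += 1
        return command


def pick_ring(expert_index, num_proxies, rings_per_proxy):
    """Map an expert index to (proxy, ring); plain modulo is an assumption."""
    ring = expert_index % (num_proxies * rings_per_proxy)
    return divmod(ring, rings_per_proxy)
```

With four proxies of eight rings each, `pick_ring(13, 4, 8)` selects ring 5 of proxy 1 under this assumed mapping.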

The result is that the control path now requires nothing from the NIC at all: a write to pinned host memory over PCIe is something every GPU can do. The data path still needs one thing from the hardware, the NIC reaching into GPU memory, which we’ll come back to. But everything else that was hardware-specific has moved to the CPU.

The proxy: GPU-initiated, CPU-executed

The proxy threads on the other end of the rings are started when the buffer is initialized: four by default, each owning eight rings.

The contract’s guarantees now belong to the proxy. The signal-after-data ordering holds because the proxy posts a ring’s commands to the network in ring order, and a queue pair executes its descriptors in that order, so a signal posted after its data completes after it.

Structurally, the proxy has two halves: a generic front that drains rings and tracks completions, and a backend that turns commands into calls on the NIC’s own API. The backend is the only code in the stack that knows which NIC is present.
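The two-halves split can be sketched as a small interface plus a generic loop. This is an assumed shape, not UCCL-EP's code: `Backend`, `RecordingBackend`, and `ProxyFront` are hypothetical names, rings are modelled as plain deques, and the recording backend simply pretends every posted command completes.

```python
from abc import ABC, abstractmethod
from collections import deque


class Backend(ABC):
    """The only part that would know which NIC is present (hypothetical API)."""

    @abstractmethod
    def post(self, command) -> None:
        """Turn one command into a call on the NIC's own API."""

    @abstractmethod
    def poll_completions(self) -> int:
        """Return how many posted commands have completed since the last poll."""


class RecordingBackend(Backend):
    """Stand-in backend that records commands and completes them immediately."""

    def __init__(self):
        self.posted = []
        self._unreported = 0

    def post(self, command):
        self.posted.append(command)
        self._unreported += 1

    def poll_completions(self):
        done, self._unreported = self._unreported, 0
        return done


class ProxyFront:
    """Generic half: drains rings in ring order and tracks completions."""

    def __init__(self, rings, backend):
        self.rings = rings
        self.backend = backend
        self.posted = 0
        self.completed = 0

    def step(self):
        for ring in self.rings:
            while ring:
                self.backend.post(ring.popleft())  # preserves per-ring order
                self.posted += 1
        self.completed += self.backend.poll_completions()
```

Under this shape, supporting a new NIC means supplying another `Backend` subclass while `ProxyFront` stays untouched.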

Any NIC a CPU can drive

Porting UCCL-EP means writing a new backend for the proxy: the code that turns a 16-byte command into network operations. The kernels, the shim, and the rings don’t change.

What does a backend actually need from its NIC? Happily, much less than in IBGDA. Four things, in decreasing order of necessity:

  1. DMA access to GPU memory, in both directions. On send the NIC reads the payload straight out of HBM; on receive it writes arriving tokens straight into HBM. The command that crossed to the host carried addresses, not data, and this is what keeps the data path off the CPU. The NIC has to accept GPU memory registrations.
  2. A reliable one-sided write. Bytes delivered to a remote registered address, exactly once, with no remote software involved.
  3. Completions that imply remote delivery. The quiet never reaches the wire: the proxy synthesizes it by counting completions. For that to be sound, a completion has to imply that “this write has landed and is readable in remote memory”.
  4. Ordering and atomics, if you can get them. If the NIC offers ordered connections, the signal-after-data guarantee comes free, as it did on the verbs path. If it offers remote atomics, the signal maps onto one directly. Neither is required: both can be rebuilt in the proxy.
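One way the last two points could be rebuilt in software is sketched below: the proxy counts write completions to answer the quiet, and on a NIC without ordering it holds a signal back until every earlier write has completed. This is an assumed strategy for illustration, not a description of what UCCL-EP's backends do; `UnorderedNic` and `OrderingProxy` are hypothetical.

```python
class UnorderedNic:
    """Toy NIC with no ordering guarantee: the caller decides completion order."""

    def __init__(self):
        self.wire_log = []   # everything handed to the NIC, in post order
        self.in_flight = []  # posted but not yet completed

    def post(self, op):
        self.wire_log.append(op)
        self.in_flight.append(op)

    def complete(self, op):
        self.in_flight.remove(op)


class OrderingProxy:
    """Rebuilds signal-after-data and quiet from completions alone."""

    def __init__(self, nic):
        self.nic = nic
        self.outstanding_writes = 0
        self.deferred_signals = []

    def submit_write(self, op):
        self.nic.post(op)
        self.outstanding_writes += 1

    def submit_signal(self, op):
        # Without NIC ordering, a signal may only go out once all earlier
        # writes are known to have landed.
        if self.outstanding_writes == 0:
            self.nic.post(op)
        else:
            self.deferred_signals.append(op)

    def on_write_completion(self, op):
        # Sound only if a completion implies remote delivery (requirement 3).
        self.nic.complete(op)
        self.outstanding_writes -= 1
        if self.outstanding_writes == 0:
            for signal in self.deferred_signals:
                self.nic.post(signal)
            self.deferred_signals.clear()

    def quiet_done(self):
        # The quiet never reaches the wire: it is answered by counting.
        return self.outstanding_writes == 0 and not self.deferred_signals
```

The cost of this approach is latency: the signal waits for a completion round trip that an ordered connection would not need.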

UCCL-EP ships a backend for AWS’s EFA and a generic RDMA verbs backend covering Broadcom’s Thor, AMD’s Pollara, and NVIDIA’s own ConnectX. And because the device side of the contract is nothing but writes to host memory, the GPU doesn’t need to be NVIDIA’s either: the same shim compiles under ROCm, which is how the whole stack runs on AMD GPUs.

Performance

The proxy-thread design has some costs. Each message’s control information crosses PCIe to host memory and waits for a proxy thread to notice it, where IBGDA paid a single doorbell write. And the proxies are four pinned CPU threads per GPU, spinning.

But the work the proxy adds is per-command, not per-byte: a fixed cost for the host-thread pickup, amortized over however much data the descriptor moves. The published numbers
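The per-command versus per-byte distinction can be put into a simple cost model: total time is a fixed pickup cost plus payload size divided by link bandwidth. The sketch below assumes exactly that model and nothing more; the function names are hypothetical, and any numbers plugged in are placeholders to be replaced with measurements.

```python
def transfer_time_s(payload_bytes, fixed_cost_s, bandwidth_bytes_per_s):
    """Fixed per-command cost plus time on the wire (assumed linear model)."""
    return fixed_cost_s + payload_bytes / bandwidth_bytes_per_s


def wire_efficiency(payload_bytes, fixed_cost_s, bandwidth_bytes_per_s):
    """Fraction of the total time spent actually moving bytes."""
    wire_time = payload_bytes / bandwidth_bytes_per_s
    return wire_time / transfer_time_s(
        payload_bytes, fixed_cost_s, bandwidth_bytes_per_s
    )
```

Under this model the efficiency rises toward 1 as the payload per command grows, so the proxy's overhead matters most for small messages and least for large batched transfers.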

Conclusion

So that’s UCCL-EP: a clean way to do m+n
