Weight Transfer for RL Post-Training in under 2 seconds

Original link: https://research.perplexity.ai/articles/weight-transfer-for-rl-post-training-in-under-2-seconds

## Fast weight transfer for trillion-parameter models

The researchers achieved 1.3-second cross-machine parameter updates for the 1T-parameter Kimi-K2 model, transferring weights from 256 training GPUs to 128 inference GPUs. This speed is critical for asynchronous reinforcement learning, which requires frequent weight updates.

The key innovation is the use of **RDMA WRITE**, a one-sided communication primitive that lets training GPUs write directly into inference GPU memory *without* any involvement from the inference engine. This avoids the bottlenecks of traditional approaches, which funnel data through a single GPU or rely on RPCs.

The system uses a **static, precomputed schedule** for weight transfer, minimizing control-plane overhead. Transfers are further optimized through **pipelining**, overlapping host-device memory copies, GPU computation (such as quantization), RDMA transfers, and Ethernet synchronization. A **mesh group** strategy enables parallel transfers of disjoint parameter sets.

The approach prioritizes simplicity and maintainability through a clean separation of concerns: each step (metadata collection, transfer, and so on) is an independent component. The result is a fast, reliable, and easily extensible solution for updating large models at scale.


Original article

We recently achieved 1.3-second cross-machine parameter updates for Kimi-K2 (1T parameters), transferring weights from 256 training GPUs (BF16) to 128 inference GPUs (FP8).

In asynchronous reinforcement learning fine-tuning, training and inference run on separate GPUs. After each training step, new weights must be pushed to inference nodes. Many existing frameworks take several seconds—or even minutes—for trillion-parameter models.

By leveraging RDMA point-to-point communication, we make the weight transfer blazing fast without modifying the inference engine, and keep the code easy to write and maintain.

RDMA WRITE: one-sided transfers

Our solution is built on RDMA WRITE, a one-sided primitive where the source directly writes into the destination’s GPU memory.

def rdma_write(src_ptr, dst_ptr, size, src_mr, dst_mr):
    # One-sided RDMA WRITE: the local NIC writes `size` bytes from the buffer
    # at src_ptr (registered as memory region src_mr) directly into the remote
    # buffer at dst_ptr (registered as dst_mr) on the destination GPU.
    # The destination side takes no part in the transfer and is not notified.
    ...

The destination side is not even notified of the transfer. This gives us low-latency, high-throughput, zero-copy transfers driven entirely by the training nodes, with no control logic on the inference nodes.

High-level workflow

  1. Metadata collection – Controller gathers parameter metadata from all training and inference GPUs.

  2. Schedule computation – Controller computes a static weight transfer schedule, mapping which training GPU sends which parameter to which inference GPU, and in what order.

  3. Schedule distribution – Controller sends the schedule to all training GPUs.

  4. Execution – After each training step, the controller signals training GPUs to start transfers, replaying the precomputed schedule (see the sketch below).
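As a rough illustration, here is what such a static schedule and its replay could look like. The TransferTask fields and the replay helper below are assumptions for the sketch, not the actual API used in this system.

    from dataclasses import dataclass

    @dataclass
    class TransferTask:
        param_name: str       # e.g. "layers.0.mlp.gate_proj.weight"
        src_train_rank: int   # training GPU that reconstructs and sends this parameter
        dst_infer_rank: int   # inference GPU whose memory is written via RDMA
        dst_ptr: int          # remote device address learned during metadata collection
        nbytes: int           # size of the (possibly quantized) tensor

    def replay_schedule(schedule: list[TransferTask], my_rank: int, issue_rdma_write) -> None:
        """Replay the static schedule after a training step.

        The schedule is computed once at initialization (step 2) and never changes;
        each training GPU just walks its pre-assigned portion in order (step 4).
        """
        for task in schedule:
            if task.src_train_rank == my_rank:
                issue_rdma_write(task)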

Weight transfer execution

With the high-level workflow defined, the key challenge is how to execute weight transfers efficiently at trillion-parameter scale. Here we describe the details of the execution path.

DeviceMesh and Mesh Groups

Parameters in training are distributed according to FSDP placements. Using full_tensor(), every GPU in a DeviceMesh can reconstruct the full parameter, so any of them can serve as the source for a weight transfer.
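As a minimal illustration using PyTorch's DTensor API (assuming a recent PyTorch where torch.distributed.tensor is public), every rank in the mesh can materialize the full parameter:

    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import distribute_tensor, Shard

    # Sketch: one DeviceMesh of 8 GPUs holding an FSDP-style sharded parameter.
    mesh = init_device_mesh("cuda", (8,))
    param = distribute_tensor(torch.randn(4096, 4096), mesh, placements=[Shard(0)])

    # full_tensor() gathers the shards, so every rank in the mesh now holds the
    # complete parameter and can act as the RDMA source for it.
    full = param.full_tensor()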

Multiple disjoint DeviceMeshes form a mesh group. Because DeviceMeshes in the same group are disjoint, their transfers don’t interfere and can run fully in parallel. Between mesh groups, we insert a global barrier to enforce ordering.
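A small sketch of that ordering, with hypothetical launch_transfers / wait_transfers helpers: meshes within a group run in parallel, and a GLOO barrier separates consecutive groups.

    import torch.distributed as dist

    def transfer_all(mesh_groups, launch_transfers, wait_transfers, gloo_group):
        # mesh_groups: list of groups; each group is a list of disjoint DeviceMeshes.
        for group in mesh_groups:
            for device_mesh in group:
                launch_transfers(device_mesh)   # disjoint meshes -> fully parallel
            wait_transfers()                    # drain this group's RDMA writes
            dist.barrier(group=gloo_group)      # global barrier (GLOO over Ethernet)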

Task pipeline

We treat the transfer of each parameter tensor as a task. The weight transfer process uses multiple types of hardware resources, so we split each weight transfer task into pipeline stages that overlap in time:

  1. Host-to-device memcpy — if FSDP offloads weights to CPU

  2. Parameter preparation — Reconstruct full weight with full_tensor(), apply projection fusion, quantize if needed.

  3. RDMA transfer — Zero-copy write to remote inference GPU memory

  4. Global barrier — After all full_tensor() calls are done, synchronize across mesh groups using GLOO via Ethernet.

In the implementation, we maintain a FIFO queue of tasks for each pipeline stage. Whenever the task at the head of a queue completes its stage, it is moved to the tail of the next stage's queue.
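A minimal sketch of this per-stage FIFO bookkeeping; the stage names follow the list above, and the Task methods (is_done, start) are assumptions for illustration.

    from collections import deque

    STAGES = ["h2d_copy", "prepare", "rdma_write", "barrier"]

    def make_queues(tasks):
        queues = {stage: deque() for stage in STAGES}
        queues[STAGES[0]].extend(tasks)          # all tasks start at the first stage
        return queues

    def advance(queues):
        # Move each queue's head task to the next stage as soon as it finishes,
        # so different tasks occupy different hardware resources at the same time.
        for i, stage in enumerate(STAGES):
            q = queues[stage]
            if q and q[0].is_done(stage):        # assumed Task.is_done(stage) -> bool
                task = q.popleft()
                if i + 1 < len(STAGES):
                    queues[STAGES[i + 1]].append(task)
                    task.start(STAGES[i + 1])    # assumed Task.start(stage)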

GPU memory usage control

full_tensor() and other GPU operations introduce extra GPU memory usage. To avoid out-of-memory errors, we start executing a task only if the temporary GPU memory occupied by in-flight tasks is below a configurable watermark.
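A sketch of that admission check, assuming each task can report its temporary memory footprint (temp_bytes is a hypothetical attribute):

    class MemoryGovernor:
        """Admit a task only while in-flight temporary GPU memory stays under a watermark."""

        def __init__(self, watermark_bytes: int):
            self.watermark = watermark_bytes
            self.in_flight = 0

        def try_start(self, task) -> bool:
            if self.in_flight + task.temp_bytes > self.watermark:
                return False                 # defer until earlier tasks free their scratch memory
            self.in_flight += task.temp_bytes
            task.start()
            return True

        def on_finish(self, task) -> None:
            self.in_flight -= task.temp_bytes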

Why it’s fast and simple

Several design choices make our system significantly faster to run and easier to maintain than common open-source solutions.

Point-to-point communication

A common pattern is to funnel all parameters through rank-0 GPUs: gather on training rank-0, send to inference rank-0, then scatter again. This quickly becomes a choke point, limited by a single GPU’s PCIe bandwidth and NIC (e.g., 400 Gbps ≈ 50 GB/s).

In contrast, our point-to-point setup allows every training GPU to send directly to every inference GPU, saturating the full network fabric rather than a single link.
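A back-of-the-envelope comparison (illustrative numbers, not from the article: 1T parameters delivered as FP8, one 400 Gbps NIC per GPU, all other overheads ignored):

    payload = 1e12 * 1                      # ~1 TB of FP8 weights to deliver
    nic_bw = 400e9 / 8                      # 400 Gbps NIC ≈ 50 GB/s

    funnel = payload / nic_bw               # everything through one rank-0 NIC: ~20 s
    p2p = payload / (nic_bw * 256)          # ideally spread over 256 training NICs: ~0.08 s
    print(f"rank-0 funnel ≈ {funnel:.0f} s, ideal point-to-point ≈ {p2p:.2f} s")

The point-to-point figure is only an idealized lower bound; the measured 1.3 s also covers reconstruction, quantization, and synchronization.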

One-sided data transfer

Some systems rely on calling into the inference engine’s update_weight() method for each tensor. That means intrusive changes to the inference code, plus overhead from RPCs, serialization, and control-plane coordination.

With the RDMA WRITE primitive, we update weights silently in inference GPU memory, without extra copies. No control-plane messages and no CPU control logic are involved, and no modification to the inference engine is required.

Pipelining

The weight transfer process can leverage four types of hardware resources: (1) host-device data movement, (2) GPU computation for projection fusion and quantization, (3) the RDMA network for the data plane, and (4) Ethernet for the control plane.

Our design splits weight transfer tasks into pipeline stages, allowing easy overlap across these different hardware resources.

Static Schedule

Some implementations recompute a transfer schedule at every training step, repeatedly collecting metadata and distributing instructions. This adds unnecessary control-plane latency.

Our schedule is computed once at initialization. Each training iteration simply replays the plan: the controller issues a “go” signal, and GPUs follow their pre-assigned routes. Execution is predictable and lightweight.

Clean separation

It’s tempting to entangle the whole weight update process in one monolithic function: metadata collection, name matching, intra-node gathering, projection fusion, quantization, sub-slicing the communication world, and inter-node network transfer. Such a function is hard to program correctly, and even harder to optimize.

In our implementation, we separate these steps into individual components. Each component can be unit tested, reasoned about, and optimized in isolation.

Conclusion

Fast, reliable weight transfer is a critical building block for large-scale RL fine-tuning. By combining the RDMA WRITE primitive, a static transfer schedule, and pipelined execution, we reduced trillion-parameter updates to just 1.3 seconds on Kimi-K2. The approach is simple to reason about, easy to maintain, and avoids the bottlenecks of traditional designs.
