LPLB: An early research stage MoE load balancer based on linear programming

Original link: https://github.com/deepseek-ai/LPLB

LPLB is a research-stage parallel load balancer that aims to dynamically resolve load imbalance in Mixture-of-Experts (MoE) models, going beyond the static balancing of its predecessor EPLB. It uses linear programming (LP) to optimize the workload distribution across experts within each batch, and leverages NVIDIA's cuSolverDx and cuBLASDx libraries for speed. LPLB uses EPLB to reorder experts, then strategically replicates the heaviest ones according to a defined topology (cube, hypercube, torus, or custom). It minimizes imbalance by redistributing tokens along the "edges" that connect original experts to their redundant replicas, while respecting capacity limits. Real-time workload synchronization is optimized over NVLINK/NVSHMEM, and DeepEP is needed for best performance. Currently, LPLB balances load based only on token counts, which may ignore non-linear computation costs. The intra-node optimization takes roughly 100 µs, an overhead that can matter for small batches. Under extreme global imbalance it may also perform worse than EPLB. Prerequisites include CUDA Toolkit >= 12.6.3, and DeepEP is strongly recommended.


Original Article

LPLB is a parallel load balancer that leverages linear programming to optimize expert parallel workload distribution for MoE (Mixture-of-Experts) models. It dynamically reorders experts based on workload statistics, constructs replicas considering static topology, and solves optimal token assignments for each batch to achieve dynamic load balancing. The reordering process is facilitated by EPLB, and real-time workload statistics can be provided by the user, collected via torch.distributed, or obtained through the internal communicators of a Deep-EP buffer. Its embedded LP solver implements single-SM Interior Point Method (IPM) and leverages NVIDIA's cuSolverDx and cuBLASDx libraries for efficient linear algebra operations.

LPLB is currently in the early research stage, and performance improvements are still under evaluation.

Prerequisites:

  • CUDA Toolkit >= 12.6.3 (with cuSolverDx dependencies).
  • DeepEP is optional but strongly recommended for practical use.
  • EPLB is embedded.

Installation:

./download-mathdx.sh
# export NVSHMEM_DIR=...  # Optional
pip install --no-build-isolation .

For testing, an editable installation is recommended:

pip install --no-build-isolation --editable .
pytest tests

Example usage:

import torch

# Import path is assumed here; adjust to however the installed package exposes Planner
from lplb import Planner

# Global success counter
avail_counter = torch.zeros(1, dtype=torch.int64, device="cuda")
# Define topology of redundant experts
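# r2o ("redundant-to-original") maps each redundant expert slot to the logical
# expert it replicates; after the transpose below, row i lists the replicas
# hosted on rank i (8 ranks with 2 redundant experts each in this example)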
r2o = torch.tensor(
    [
        [3, 0, 1, 2, 7, 4, 5, 6],
        [6, 7, 4, 5, 0, 1, 2, 3],
    ]
).T.int().cuda()

planner = Planner(
    r2o,
    n_logical_experts + n_redundants_per_rank * ep_size,
    n_logical_experts,
    group=ep_group,
)
# Initialize from a DeepEP `buffer` (optional)
# planner.init_from_deep_ep(buffer)

N_SMS = 100
# Logical expert indices selected by the model
indices = ...
# Planner returns physical expert indices
redirected_indices = planner.run(indices, avail_counter, N_SMS)

LPLB extends EPLB (Expert Parallelism Load Balancer) to address dynamic load imbalance in Mixture-of-Experts (MoE) training. While EPLB handles static imbalances (e.g., consistently overloaded experts due to data distribution), LPLB targets per-batch fluctuations caused by small-batch randomness during training.

How it works:

  1. Redundant Experts: Each redundant expert is linked to an original expert, forming edges between GPUs.
  2. Edge Capacity: The capacity of an edge is the number of tokens assigned to its redundant expert in the current batch, defining the maximum token flow for balancing.
  3. LP Optimization: LPLB solves a linear programming (LP) problem to redistribute tokens along these edges, minimizing load imbalance within an expert-parallel (EP) group while respecting edge capacities.
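
As a rough illustration of step 3, the sketch below formulates a simplified per-batch LP with scipy.optimize.linprog rather than LPLB's embedded single-SM IPM solver: given per-GPU token loads and the edge capacities from step 2, it picks how many tokens to shift along each edge so that the maximum per-GPU load is minimized. All names are illustrative and not part of LPLB's API.

import numpy as np
from scipy.optimize import linprog

def balance_loads(loads, edges, capacities):
    # Toy LP: minimize the maximum per-GPU load by shifting tokens along edges.
    #   loads[j]      -- tokens initially assigned to GPU j in this batch
    #   edges[e]      -- (src, dst) GPU pair connected through a redundant expert
    #   capacities[e] -- maximum number of tokens edge e may carry
    n_gpus, n_edges = len(loads), len(edges)
    # Variables: one flow per edge plus t, the maximum load being minimized.
    c = np.zeros(n_edges + 1)
    c[-1] = 1.0
    # For every GPU j: loads[j] - (flow out) + (flow in) <= t
    A_ub = np.zeros((n_gpus, n_edges + 1))
    b_ub = -np.asarray(loads, dtype=float)
    for e, (src, dst) in enumerate(edges):
        A_ub[src, e] -= 1.0  # tokens leaving src along edge e
        A_ub[dst, e] += 1.0  # tokens arriving at dst along edge e
    A_ub[:, -1] = -1.0       # move t to the left-hand side of each constraint
    bounds = [(0.0, cap) for cap in capacities] + [(0.0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n_edges]   # tokens to move along each edge

# Example: 4 GPUs in a ring, GPU 0 overloaded.
flows = balance_loads(
    loads=[900, 300, 300, 500],
    edges=[(0, 1), (1, 2), (2, 3), (3, 0)],
    capacities=[400, 400, 400, 400],
)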

Experts to be replicated are selected via EPLB (reordering only, no replication). The heaviest experts are then replicated based on the chosen LPLB topology. Real-time workload synchronization is optimized using NVLINK and NVSHMEM instead of torch.distributed.allreduce, reducing communication overhead. This requires DeepEP as a prerequisite.
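
For reference, the slower fallback mentioned in the overview, collecting per-expert workload statistics with torch.distributed instead of NVLINK/NVSHMEM, can be sketched roughly as follows; the function name and argument layout are illustrative rather than LPLB's internal API.

import torch
import torch.distributed as dist

def gather_expert_loads(indices, n_logical_experts, group=None):
    # Count how many tokens the router sent to each logical expert on this rank,
    # then sum the counts across the expert-parallel group with an allreduce.
    counts = torch.bincount(indices.flatten(), minlength=n_logical_experts)
    dist.all_reduce(counts, op=dist.ReduceOp.SUM, group=group)
    return counts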

Known limitations:

  • The current planner balances only total token count, not accounting for non-linearity in grouped matrix multiplication time costs, which may lead to suboptimal performance.
  • The solver takes ~100 µs for intra-node optimization (longer for inter-node), which may be non-negligible for small batches.
  • Under extreme global load imbalance, LPLB may perform worse than EPLB due to differences in assigning redundant experts (LPLB avoids assigning multiple replicas to the same original expert).

Supported topologies:

  • Cube: Replicates experts on a subset of GPUs, forming a cube graph with diagonal edges. Requires at least 2 experts per GPU. Ideal for balancing within an 8-GPU EP subgroup without sacrificing inter-node communication.
  • Hypercube: Similar to Cube but excludes diagonal edges and requires 16 GPUs. Suitable for expert parallelism across 16 GPUs.
  • Torus: Replicates one expert on a neighbor GPU in the same node and another on a neighbor node, forming a torus graph. Requires at least 2 experts per GPU. Effective for global balancing, but less efficient than Cube because it involves inter-node communication.

Custom topologies can be explored by modifying the r2o matrix.
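
For example, assuming the same shape conventions as the cube example above (one row per rank, one column per redundant expert slot) and, purely for illustration, that logical expert i resides on rank i, a ring-style layout in which every rank replicates its two neighbors' experts could be built like this:

import torch

ep_size = 8
left = [(i - 1) % ep_size for i in range(ep_size)]   # replicate the left neighbor's expert
right = [(i + 1) % ep_size for i in range(ep_size)]  # replicate the right neighbor's expert
r2o_ring = torch.tensor([left, right]).T.int().cuda()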
