thunderbolt-ibverbs: We have InfiniBand at home

Original link: https://blog.hellas.ai/blog/thunderbolt-ibverbs/

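To make high-performance computing more accessible, I developed a Linux kernel module that lets the ordinary USB4/Thunderbolt ports on consumer AMD mini PCs act as InfiniBand devices. This experimental RDMA-over-USB4 implementation lets home users run distributed AI workloads, such as tensor-parallel inference and FSDP training, without expensive enterprise networking hardware. By bypassing the standard network stack, it achieves notable performance on Strix Halo mini PCs:

* **Throughput:** about 95 Gb/s bidirectional raw RDMA throughput (far above the roughly 2.3 Gb/s ceiling of standard 2.5 GbE).
* **Latency:** about 7 µs one-way, significantly better than traditional software approaches.
* **Efficiency:** cuts the time of a Gemma 3 27B LoRA FSDP step from 1,359 s over Ethernet to 126 s.

The project demonstrates multi-node AI training on consumer hardware. Note, however, that this is an experimental research project containing AI-generated code. It is for testing only, comes with no warranty, and involves unstable kernel modules, so use it with caution.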


Original text

I spent the past few weeks building a Linux kernel module that makes ordinary USB4/Thunderbolt ports on AMD mini PCs pretend to be InfiniBand devices. The goal is simple: let existing AI runtimes like vLLM/RCCL split inference or training across multiple boxes at home, without buying enterprise networking gear.
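
As a rough illustration of what "pretending to be an InfiniBand device" means from userspace: the Linux RDMA subsystem lists registered devices under `/sys/class/infiniband`, so a quick check is to read that directory. The sketch below assumes this module registers its devices there like other RDMA drivers; the helper name `list_rdma_devices` is hypothetical and not part of the project.

```python
import os


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the sorted names of RDMA devices the kernel has registered.

    Returns an empty list when the directory is missing, which is the
    usual state on a machine with no RDMA driver loaded.
    """
    try:
        return sorted(os.listdir(sysfs_root))
    except (FileNotFoundError, NotADirectoryError):
        return []


if __name__ == "__main__":
    devices = list_rdma_devices()
    if devices:
        for name in devices:
            print(name)
    else:
        print("no RDMA devices registered")
```

Runtimes built on libibverbs discover devices through the same kernel registration, which is why a driver that shows up here can be picked up without changes to the runtime itself.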

TL;DR. We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two consumer boxes talk fast enough to run tensor-parallel inference and FSDP workloads across both machines: ~95 Gb/s bidirectional raw RDMA, ~7 µs one-way latency, a MiniMax-M2.7 TP=2 inference run that does not fit on one box, and a Gemma 3 27B LoRA FSDP step falling from 1359 s over Ethernet to 126 s over 4-HCA USB4 RDMA.

Two Strix Halo mini-PCs (strix-1, strix-2) connected by USB4

  • ~48 Gb/s per direction (~95 Gb/s bidi total) sustained ib_write_bw, 4-HCA aggregate at 1 MiB / 8 QPs with IOMMU off — vs ~2.3 Gb/s over the onboard 2.5 GbE and ~9 Gb/s for soft-RoCE on top of thunderbolt-net at the per-rail level.
  • ~7 µs one-way ib_write_lat at 64 B, single QP — vs ~28 µs over RXE/2.5 GbE and ~65 µs over RXE/TBnet.
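
To put the quoted figures side by side, the following sketch computes the ratios they imply. All inputs are the approximate numbers from this post, and the variable names are illustrative. The soft-RoCE bandwidth is quoted per rail while the USB4 RDMA figure is a 4-HCA aggregate, so that particular ratio is only a rough guide.

```python
def ratio(numerator, denominator):
    """Plain ratio of two measurements in the same unit."""
    return numerator / denominator


# Bandwidth per direction, in Gb/s (approximate values from the post).
BW_USB4_RDMA = 48.0   # 4-HCA aggregate, 1 MiB messages, 8 QPs, IOMMU off
BW_ETH_2_5G = 2.3     # onboard 2.5 GbE
BW_RXE_TBNET = 9.0    # soft-RoCE over thunderbolt-net, per rail

# One-way latency at 64 B with a single QP, in microseconds.
LAT_USB4_RDMA = 7.0
LAT_RXE_ETH = 28.0
LAT_RXE_TBNET = 65.0

# Gemma 3 27B LoRA FSDP step time, in seconds.
STEP_ETH = 1359.0
STEP_USB4_RDMA = 126.0

if __name__ == "__main__":
    print(f"bandwidth vs 2.5 GbE:   {ratio(BW_USB4_RDMA, BW_ETH_2_5G):.1f}x")
    print(f"bandwidth vs RXE/TBnet: {ratio(BW_USB4_RDMA, BW_RXE_TBNET):.1f}x")
    print(f"latency vs RXE/2.5 GbE: {ratio(LAT_RXE_ETH, LAT_USB4_RDMA):.1f}x lower")
    print(f"latency vs RXE/TBnet:   {ratio(LAT_RXE_TBNET, LAT_USB4_RDMA):.1f}x lower")
    print(f"FSDP step speedup:      {ratio(STEP_ETH, STEP_USB4_RDMA):.1f}x")
```

The FSDP step speedup works out to a little under 11x, about half the raw bandwidth ratio against 2.5 GbE, as one would expect if only part of each step is spent on communication.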

ib_write_bw between strix-1 and strix-2, by transport and QPs

DISCLAIMER: this is research code, most of it AI-generated, and it loads experimental kernel modules on machines I was willing to crash repeatedly. I made an effort to understand enough of it to keep it on track, but there are almost certainly false assumptions and sharp edges throughout. No warranty, no support promise, not production software.
