AI from The Basement: My latest side project, a dedicated LLM server powered by 8x RTX 3090 graphics cards, boasting a total of 192GB of VRAM. I built this with running Meta’s Llama 3.1 405B in mind.
This blogpost was originally posted on my LinkedIn profile in July 2024.
Backstory: Sometime in March I found myself struggling to keep up with the mere 48GB of VRAM I had been relying on for almost a year of LLM experimentation. So, in a geeky-yet-stylish way, I decided to spend my money to build this thing of beauty. Questions swirled:
- Which CPU/platform to buy? Does memory speed really matter?
- Why is it that the more PCIe lanes, the better?
- Why does a power-of-two (2^n) number of GPUs matter in a multi-GPU setup (Tensor Parallelism, anyone?)
- How many GPUs, and how can I get all the VRAM in the world?
- Why are Nvidia cards so expensive, and why didn’t I invest in their stock earlier?
- Which inference engine to use (hint: it’s not just llama.cpp, and not always the most well-documented option)?
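Since Tensor Parallelism comes up here, a toy sketch of the idea (my own illustration with numpy arrays standing in for GPUs, not code from the build): a linear layer’s weight matrix is split column-wise across N devices, each computes a partial result independently, and the outputs are concatenated. Power-of-two device counts keep those splits even, which is one reason 2^n GPUs are convenient.

```python
import numpy as np

# Toy model of column-parallel tensor parallelism. Each "shard" of the
# weight matrix plays the role of one GPU's slice of the layer.
rng = np.random.default_rng(0)
n_shards = 8                               # stand-in for 8 GPUs
x = rng.standard_normal((4, 1024))         # activations (batch of 4)
W = rng.standard_normal((1024, 4096))      # full weight matrix

# Split the 4096 output columns evenly: 4096 / 8 = 512 columns per shard.
shards = np.split(W, n_shards, axis=1)
partials = [x @ w for w in shards]         # each "GPU" works independently
y_parallel = np.concatenate(partials, axis=1)

# Reference: the unsharded computation gives the same answer.
y_full = x @ W
print(np.allclose(y_parallel, y_full))
```

In a real engine the concatenation is an all-gather across NVLink/PCIe, which is where interconnect bandwidth starts to matter.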
After so many hours of research, I decided on the following platform:
- Asrock Rack ROMED8-2T motherboard with 7x PCIe 4.0 x16 slots and 128 PCIe lanes
- AMD EPYC Milan 7713 CPU (2.0 GHz base / 3.675 GHz boost, 64 cores / 128 threads)
- 512GB DDR4-3200 3DS RDIMM memory
- A mere trio of 1600-watt power supplies to keep everything running smoothly
- 8x RTX 3090 GPUs with 4x NVLink bridges, enabling a blistering 112GB/s of bandwidth within each linked pair
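As a back-of-envelope check on why 405B was the design target (my own arithmetic, ignoring KV cache, activations, and framework overhead), weight memory scales linearly with parameter count and bytes per parameter:

```python
# Rough weight-only VRAM estimates for a 405B-parameter model at
# different precisions. These are my assumptions for illustration,
# not measurements from this machine.
PARAMS = 405e9
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for fmt, b in BYTES_PER_PARAM.items():
    gb = PARAMS * b / 1e9
    print(f"{fmt}: ~{gb:.0f} GB of weights")
```

Even 4-bit weights alone land around 200GB, so serving a 405B model on 192GB of VRAM means aggressive quantization plus some offloading — part of why the build pushes for every last gigabyte.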
Now that I kinda have everything in order, I’m working on a series of blog posts that will cover the entire journey, from building this behemoth to avoiding costly pitfalls. Topics will include:
- The challenges of assembling this system: from drilling holes in metal frames and adding 30-amp, 240-volt breakers, to bending CPU socket pins (don’t try this at home, kids!).
- Why PCIe risers suck, and why SAS device adapters, redrivers, and retimers matter for error-free PCIe connections.
- NVLink speeds, PCIe lane bandwidth, VRAM transfer speeds, and Nvidia’s decision to block native PCIe P2P at the driver level.
- Benchmarking inference engines like TensorRT-LLM, vLLM, and Aphrodite Engine, all of which support Tensor Parallelism.
- Training and fine-tuning your own LLM.
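To give a flavor of the benchmarking topic above: a hypothetical launch of vLLM’s OpenAI-compatible server across all eight GPUs with tensor parallelism. The model ID and flag values here are my assumptions for illustration, not the actual benchmark configuration.

```shell
# Sketch of serving a large model with tensor parallelism in vLLM,
# assuming vLLM is installed and the model weights are available.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --max-model-len 8192
```

With `--tensor-parallel-size 8`, each layer is sharded across all eight cards, which is exactly where the 2^n GPU count and NVLink bandwidth pay off.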
Stay tuned.
P.S. I’m sitting here staring at those GPUs, and I just can’t help but think how wild tech progress has been. I remember being so excited to get a 60GB HDD back in 2004. I mean, all the movies and games I could store?! Fast forward 20 years, and now I’ve got more than triple that capacity in just one machine’s graphics cards… It makes me think, what will we be doing in another 20 years?!
Anyway, that’s why I’m doing this project. I wanna help create some of the cool stuff that’ll be around in the future. And who knows, maybe someone will look back on my work and be like “haha, remember when we thought 192GB of VRAM was a lot?”