Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

Original link: https://github.com/xaskasdf/ntransformer

This project presents a high-efficiency C++/CUDA LLM inference engine designed to run large language models such as Llama 70B on consumer hardware (for example, an RTX 3090 with 24 GB of VRAM). It achieves this with a layer-streaming technique that leverages PCIe bandwidth and optional NVMe direct I/O to bypass the CPU. The engine uses a three-tier adaptive caching system (VRAM-resident layers, pinned RAM, and an NVMe/mmap fallback) whose sizes are derived automatically from the available hardware, making the 70B model 33x faster than the baseline (mmap) approach. A key feature is the `gpu-nvme-direct` backend, which streams NVMe reads directly into GPU memory, removing the CPU from the data path.

The engine supports the GGUF model format with various quantization levels (Q4_0, Q8_0, etc.) and uses custom CUDA kernels. It targets Linux (tested on Ubuntu with kernel 6.17+), requires CUDA Toolkit 13.1, and is released under the BSD-2-Clause license. Future development focuses on advanced quantization, novel architectures, and further optimization.

A developer shared a project on Hacker News demonstrating Llama 3.1 70B running on a single RTX 3090 GPU, **bypassing the CPU and RAM entirely** by reading data directly from NVMe storage. The original inspiration came from a question about retro gaming and efficient model loading.

While feasible, current performance is limited to roughly 0.2 tokens per second. Commenters noted that this speed is unsuitable for interactive use and suggested that smaller, quantized models may offer a better balance of speed and quality. The discussion focused on whether the approach is truly optimal (some questioned the reported compute bottleneck) and on its main benefit being systems with limited RAM. DirectX's direct-to-GPU resource loading capability was also mentioned as a potential alternative. The project's code is available on GitHub.

Original article

High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.

| Model | Mode | Decode | VRAM | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q8_0 | Resident | 48.9 tok/s | 10.0 GB | All layers in VRAM |
| Llama 3.1 8B Q8_0 | Tiered (auto) | 48.8 tok/s | 10.3 GB | 32/32 layers auto-promoted to VRAM |
| Llama 3.1 70B Q6_K | Streaming (mmap) | 0.006 tok/s | 7.3 GB | Page cache thrashing (53 GB > 48 GB RAM) |
| Llama 3.1 70B Q6_K | Tiered (auto) | 0.2 tok/s | 23.1 GB | 29 VRAM + 51 RAM + 0 NVMe |

3-tier adaptive caching auto-sizes from hardware: VRAM-resident layers (zero I/O) + pinned RAM (H2D only) + NVMe/mmap fallback. Achieves 33x speedup over mmap baseline for 70B on consumer hardware (RTX 3090 + 48 GB RAM).

The bottleneck is PCIe H2D bandwidth at Gen3 x8 (~6.5 GB/s). With Gen4 x16 (B550/X570), Tier B layers would be compute-bound, yielding ~0.5 tok/s.
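
As a rough back-of-envelope check using the figures in this README: the 51 Tier B layers at ~670 MB each amount to roughly 34 GB of H2D traffic per decoded token, which at ~6.5 GB/s takes about 5.3 s, i.e. ~0.19 tok/s, consistent with the measured 0.2 tok/s. A Gen4 x16 link (roughly 25 GB/s in practice) would cut the transfer to about 1.4 s per token, at which point GPU compute rather than PCIe becomes the limiting factor.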

  • Zero external dependencies beyond CUDA Toolkit (no PyTorch, no cuBLAS)
  • GGUF model format with Q4_0, Q8_0, Q4_K_M, Q6_K, F16, F32 quantization
  • 3-Tier Adaptive Caching: auto-sized VRAM resident + pinned RAM + NVMe/mmap tiers
  • SLEP streaming: double-buffered layer pipeline overlaps NVMe reads, PCIe DMA, and GPU compute
  • gpu-nvme-direct backend: userspace NVMe driver reads model weights directly to pinned GPU-accessible memory
  • Four data paths (auto-selected): VRAM resident > pinned RAM H2D > mmap pinned > CPU worker memcpy
  • Llama architecture: RoPE, GQA, SwiGLU, RMSNorm, KV cache
  • Linux (tested on Ubuntu, kernel 6.17+)
  • CUDA Toolkit 13.1
  • gcc-14 / g++-14
  • NVIDIA GPU with Compute Capability 8.0+ (RTX 3090 tested)
  • CMake 3.24+
  • (Optional) NVMe SSD on separate PCIe slot + gpu-nvme-direct library
# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=gcc-14 \
  -DCMAKE_CXX_COMPILER=g++-14 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j

# Run (resident mode — model fits in VRAM)
./ntransformer -m /path/to/llama-8b-q8_0.gguf -p "Hello" -n 128

# Run (streaming mode — model larger than VRAM)
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --streaming

# Chat mode
./ntransformer -m /path/to/model.gguf --chat

# Benchmark
./ntransformer -m /path/to/model.gguf --benchmark -n 64

For models that don't fit in VRAM, the NVMe backend eliminates the CPU from the data path:

NVMe SSD → (DMA) → Pinned Staging → (PCIe H2D) → GPU Buffers → Compute
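
A minimal sketch of the staging and H2D step above, assuming CUDA pinned host memory as the DMA target (buffer names and the layer size are illustrative, not the project's actual code):

#include <cuda_runtime.h>
#include <cstddef>

// Illustrative layer size for 70B Q6_K (~670 MB, per the notes further down).
static const size_t kLayerBytes = 670ull * 1024 * 1024;

int main() {
    void* staging = nullptr;      // pinned host memory the NVMe controller can DMA into
    void* gpu_weights = nullptr;  // GPU compute buffer
    cudaStream_t stream;

    cudaHostAlloc(&staging, kLayerBytes, cudaHostAllocDefault);
    cudaMalloc(&gpu_weights, kLayerBytes);
    cudaStreamCreate(&stream);

    // 1) NVMe reads land in `staging` via DMA (done by the gpu-nvme-direct backend).
    // 2) Asynchronous PCIe H2D copy into the GPU buffer; no CPU memcpy involved.
    cudaMemcpyAsync(gpu_weights, staging, kLayerBytes,
                    cudaMemcpyHostToDevice, stream);

    // 3) This layer's kernels are launched on the same stream, so they start
    //    only after the weights have arrived.
    cudaStreamSynchronize(stream);

    cudaFree(gpu_weights);
    cudaFreeHost(staging);
    cudaStreamDestroy(stream);
    return 0;
}

Because the staging buffer is pinned, both the NVMe controller's DMA and the GPU's PCIe copy can address it directly, with no intermediate CPU copy.
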
# Build with NVMe support (requires gpu-nvme-direct library)
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_GPUNVME=ON \
  -DCMAKE_C_COMPILER=gcc-14 -DCMAKE_CXX_COMPILER=g++-14 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j

# Write GGUF model to NVMe raw device
sudo ./scripts/restore_nvme.sh           # ensure kernel driver is bound
sudo dd if=model.gguf of=/dev/nvme0n1 bs=1M oflag=direct status=progress

# Bind NVMe to VFIO for userspace access
sudo ./scripts/setup_nvme.sh             # loads VFIO, forces D0, enables BusMaster

# Run with NVMe backend
sudo GPUNVME_PCI_BDF=0000:01:00.0 GPUNVME_GGUF_LBA=0 \
  ./build/ntransformer -m /path/to/model.gguf -p "Hello" -n 32 --streaming

# Restore NVMe to kernel driver when done
sudo ./scripts/restore_nvme.sh
  1. The GGUF model file is written to raw NVMe blocks via dd
  2. setup_nvme.sh binds the NVMe to VFIO, forces PCIe D0 power state, enables BusMaster
  3. gpu-nvme-direct initializes the NVMe controller from userspace (admin queues, I/O queues)
  4. During inference, each layer (~670 MB for 70B Q6_K) is read via 670 NVMe commands in ~202 ms
  5. Data lands in CUDA pinned staging memory, then async DMA to GPU compute buffers
  6. Pipeline overlaps NVMe reads, H2D DMA, and GPU compute across double buffers
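
A simplified sketch of the double-buffered overlap in step 6; nvme_read_layer() and compute_layer() are placeholders for the engine's internals, not its real API:

#include <cuda_runtime.h>
#include <cstddef>

void nvme_read_layer(int layer, void* dst, size_t bytes);           // blocking NVMe -> pinned staging
void compute_layer(int layer, const void* weights, cudaStream_t s); // launches this layer's kernels

void stream_layers(int first, int last, size_t layer_bytes) {
    void* staging[2];
    void* gpu_buf[2];
    cudaStream_t copy_stream, compute_stream;
    cudaEvent_t copied[2];    // H2D copy of this buffer finished
    cudaEvent_t consumed[2];  // compute that reads this buffer finished

    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc(&staging[i], layer_bytes, cudaHostAllocDefault);
        cudaMalloc(&gpu_buf[i], layer_bytes);
        cudaEventCreate(&copied[i]);
        cudaEventCreate(&consumed[i]);
    }
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    for (int layer = first; layer <= last; ++layer) {
        int buf = layer & 1;  // ping-pong between the two buffers

        // Host: don't overwrite this staging buffer until its previous copy finished.
        cudaEventSynchronize(copied[buf]);
        // NVMe -> pinned staging (overlaps with the previous layer's GPU compute).
        nvme_read_layer(layer, staging[buf], layer_bytes);

        // GPU: don't overwrite gpu_buf[buf] while an earlier layer still reads it.
        cudaStreamWaitEvent(copy_stream, consumed[buf], 0);
        cudaMemcpyAsync(gpu_buf[buf], staging[buf], layer_bytes,
                        cudaMemcpyHostToDevice, copy_stream);
        cudaEventRecord(copied[buf], copy_stream);

        // Compute waits only for this buffer's copy, then runs on its own stream.
        cudaStreamWaitEvent(compute_stream, copied[buf], 0);
        compute_layer(layer, gpu_buf[buf], compute_stream);
        cudaEventRecord(consumed[buf], compute_stream);
    }
    cudaStreamSynchronize(compute_stream);
    // Cleanup omitted for brevity.
}

The per-buffer events let the copy of layer N+1 proceed while layer N is still computing, which is the overlap the pipeline relies on.
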
src/
├── core/           # Tensor, allocator, GPU device management
├── cuda/           # CUDA kernels: GEMV, RMSNorm, RoPE, SwiGLU, softmax
├── memory/         # SLEP layer streaming engine (NVMe + mmap backends)
├── model/          # Transformer: config, GGUF loader, attention, FFN, norms
├── inference/      # Tokenizer, sampler, engine
├── utils/          # Timer, logger
├── main.cpp        # CLI entry point
scripts/
├── setup_nvme.sh   # Bind NVMe to VFIO, configure for gpu-nvme-direct
├── restore_nvme.sh # Restore NVMe to kernel driver
tests/              # Unit tests (tensor, GEMM kernels, NVMe layer loader)
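
For a sense of what the kernels under cuda/ look like, here is a minimal fp32 RMSNorm kernel for the batch-1 decode path (illustrative only; the project's kernels and launch configuration will differ):

#include <cuda_runtime.h>
#include <math.h>

// One block normalizes one row: y_i = x_i * rsqrt(mean(x^2) + eps) * w_i
__global__ void rmsnorm_kernel(float* out, const float* x, const float* w,
                               int dim, float eps) {
    extern __shared__ float partial[];  // one partial sum per thread
    int tid = threadIdx.x;

    // Each thread accumulates part of sum(x^2).
    float sum = 0.0f;
    for (int i = tid; i < dim; i += blockDim.x)
        sum += x[i] * x[i];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    float scale = rsqrtf(partial[0] / dim + eps);
    for (int i = tid; i < dim; i += blockDim.x)
        out[i] = x[i] * scale * w[i];
}

// Launch example: rmsnorm_kernel<<<1, 256, 256 * sizeof(float)>>>(out, x, w, 4096, 1e-5f);
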
forward_tiered() — hybrid pipeline:

Tier A (VRAM resident, layers 0..28):
  GPU Compute:  [layer 0][layer 1]...[layer 28]     (zero I/O, weights permanent)

Tier B (pinned RAM, layers 29..79, double-buffered):
  H2D DMA:     [L29→gpu0][L30→gpu1][L31→gpu0]...   (async from pinned RAM)
  GPU Compute: [         ][layer 29][layer 30]...    (overlapped with H2D)

Tier C (NVMe/mmap fallback, if needed):
  NVMe/memcpy: [read L→stg0][read L→stg1]...
  H2D DMA:     [            ][stg0→gpu0  ]...
  GPU Compute: [            ][            ][layer]...

Tier sizes auto-computed from cudaMemGetInfo() + /proc/meminfo MemAvailable.
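
A rough sketch of that auto-sizing logic (the reserve margins, layer size, and helper names are illustrative assumptions, not the project's actual values):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstddef>

// Parse MemAvailable (reported in kB) from /proc/meminfo.
static size_t mem_available_bytes() {
    FILE* f = fopen("/proc/meminfo", "r");
    if (!f) return 0;
    char line[256];
    size_t kb = 0;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "MemAvailable: %zu kB", &kb) == 1) break;
    }
    fclose(f);
    return kb * 1024;
}

int main() {
    size_t free_vram = 0, total_vram = 0;
    cudaMemGetInfo(&free_vram, &total_vram);

    const size_t layer_bytes  = 670ull << 20;  // ~670 MB per 70B Q6_K layer
    const size_t vram_reserve = 3ull << 30;    // illustrative: KV cache + activations
    const size_t ram_reserve  = 4ull << 30;    // illustrative: headroom for the OS

    size_t vram_budget = free_vram > vram_reserve ? free_vram - vram_reserve : 0;
    size_t ram_budget  = mem_available_bytes();
    ram_budget = ram_budget > ram_reserve ? ram_budget - ram_reserve : 0;

    int n_layers = 80;                                 // Llama 70B
    int tier_a = (int)(vram_budget / layer_bytes);     // VRAM-resident
    if (tier_a > n_layers) tier_a = n_layers;
    int tier_b = (int)(ram_budget / layer_bytes);      // pinned RAM
    if (tier_a + tier_b > n_layers) tier_b = n_layers - tier_a;
    int tier_c = n_layers - tier_a - tier_b;           // NVMe/mmap fallback

    printf("Tier A (VRAM): %d  Tier B (RAM): %d  Tier C (NVMe): %d\n",
           tier_a, tier_b, tier_c);
    return 0;
}

With a 24 GB GPU and 48 GB of system RAM, budgets in this ballpark land near the 29 VRAM / 51 RAM / 0 NVMe split reported in the benchmark table.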

| Format | Bits/Weight | Block Size | Supported |
|---|---|---|---|
| Q4_0 | 4.5 | 32 | Yes |
| Q8_0 | 8.5 | 32 | Yes |
| Q4_K_M | 4.5 | 256 | Yes |
| Q6_K | 6.6 | 256 | Yes |
| F16 | 16 | 1 | Yes |
| F32 | 32 | 1 | Yes |
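
The fractional bits-per-weight values come from per-block metadata. As a point of reference, the standard ggml-style block layouts for Q4_0 and Q8_0 look roughly like this (simplified):

#include <cstdint>

typedef uint16_t half_t;   // fp16 scale stored as raw bits

struct block_q4_0 {        // 2 + 16 = 18 bytes for 32 weights -> 4.5 bits/weight
    half_t  d;             // per-block scale
    uint8_t qs[16];        // 32 x 4-bit quantized values, two per byte
};

struct block_q8_0 {        // 2 + 32 = 34 bytes for 32 weights -> 8.5 bits/weight
    half_t  d;             // per-block scale
    int8_t  qs[32];        // 32 x 8-bit quantized values
};

static_assert(sizeof(block_q4_0) == 18, "4.5 bits/weight");
static_assert(sizeof(block_q8_0) == 34, "8.5 bits/weight");

Each block stores an fp16 scale alongside the quantized values; the K-quants (Q4_K_M, Q6_K) use 256-weight super-blocks with additional per-sub-block scales, which is where their figures in the table come from.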
  • Phase 1 - Foundation (complete): Llama 8B Q8_0, custom CUDA kernels, 48.9 tok/s
  • Phase 2 - SLEP Streaming (complete): 70B on single GPU, 3-tier caching, 33x speedup
  • Phase 3 - Advanced Quantization: RotateKV (INT2 KV-cache), adaptive per-layer precision
  • Phase 4 - Novel Architectures: MLA, Mamba/SSM, speculative decoding
  • Phase 5 - Polish: optimization, benchmarks, public C API

BSD-2-Clause
