High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24 GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
| Model | Mode | Decode | VRAM | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q8_0 | Resident | 48.9 tok/s | 10.0 GB | All layers in VRAM |
| Llama 3.1 8B Q8_0 | Tiered (auto) | 48.8 tok/s | 10.3 GB | 32/32 layers auto-promoted to VRAM |
| Llama 3.1 70B Q6_K | Streaming (mmap) | 0.006 tok/s | 7.3 GB | Page cache thrashing (53 GB > 48 GB RAM) |
| Llama 3.1 70B Q6_K | Tiered (auto) | 0.2 tok/s | 23.1 GB | 29 layers in VRAM + 51 in pinned RAM + 0 on NVMe |
3-tier adaptive caching auto-sizes from the hardware: VRAM-resident layers (zero I/O), pinned RAM (H2D only), and an NVMe/mmap fallback. This yields a 33x speedup over the mmap baseline for 70B on consumer hardware (RTX 3090 + 48 GB RAM).
The bottleneck is PCIe host-to-device (H2D) bandwidth at Gen3 x8 (~6.5 GB/s). With Gen4 x16 (B550/X570), Tier B layers would become compute-bound, yielding roughly 0.5 tok/s.
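As a sanity check, the Tier B split from the table (51 layers in pinned RAM) and the ~670 MB/layer figure from the NVMe notes below reproduce the observed rate; this is a back-of-envelope estimate, not a measured breakdown:

```cpp
#include <cstdio>

int main() {
    // Per decoded token, every Tier B layer crosses PCIe once.
    constexpr double tier_b_layers   = 51;      // 70B Q6_K: 29 resident, 51 in pinned RAM
    constexpr double bytes_per_layer = 0.67e9;  // ~670 MB per layer
    constexpr double h2d_bytes_per_s = 6.5e9;   // effective Gen3 x8 H2D bandwidth

    const double s_per_token = tier_b_layers * bytes_per_layer / h2d_bytes_per_s;
    std::printf("~%.1f s/token  ->  ~%.2f tok/s\n", s_per_token, 1.0 / s_per_token);
    // Prints ~5.3 s/token -> ~0.19 tok/s, matching the 0.2 tok/s in the table.
    return 0;
}
```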
- Zero external dependencies beyond CUDA Toolkit (no PyTorch, no cuBLAS)
- GGUF model format with Q4_0, Q8_0, Q4_K_M, Q6_K, F16, F32 quantization
- 3-Tier Adaptive Caching: auto-sized VRAM resident + pinned RAM + NVMe/mmap tiers
- SLEP streaming: double-buffered layer pipeline overlaps NVMe reads, PCIe DMA, and GPU compute (a sketch follows this list)
- gpu-nvme-direct backend: userspace NVMe driver reads model weights directly to pinned GPU-accessible memory
- Four data paths (auto-selected): VRAM resident > pinned RAM H2D > mmap pinned > CPU worker memcpy
- Llama architecture: RoPE, GQA, SwiGLU, RMSNorm, KV cache
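A minimal sketch of that double-buffered overlap, assuming two device buffers and a copy/exec stream pair; `compute_layer`, `forward_streamed`, and the toy sizes in `main` are illustrative stand-ins, not the engine's real interfaces:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Placeholder for the engine's per-layer kernels (GEMV, RMSNorm, RoPE, SwiGLU, ...).
static void compute_layer(int /*layer*/, const void* /*weights_dev*/, cudaStream_t /*stream*/) {
    // Intentionally empty: this sketch only shows the copy/compute overlap.
}

// Stream layers whose weights sit in pinned host RAM through two device buffers,
// overlapping the H2D upload of layer i with the compute of layer i-1.
static void forward_streamed(void* const* pinned_layers, size_t layer_bytes, int num_layers) {
    void* dev_buf[2];
    cudaEvent_t uploaded[2], consumed[2];
    cudaStream_t copy_stream, exec_stream;
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&dev_buf[b], layer_bytes);
        cudaEventCreateWithFlags(&uploaded[b], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&consumed[b], cudaEventDisableTiming);
    }
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&exec_stream);

    for (int i = 0; i < num_layers; ++i) {
        const int b = i & 1;                          // ping-pong buffer index
        if (i >= 2)                                   // don't overwrite weights still in use
            cudaStreamWaitEvent(copy_stream, consumed[b], 0);
        cudaMemcpyAsync(dev_buf[b], pinned_layers[i], layer_bytes,
                        cudaMemcpyHostToDevice, copy_stream);
        cudaEventRecord(uploaded[b], copy_stream);

        cudaStreamWaitEvent(exec_stream, uploaded[b], 0);  // compute waits only for its own upload
        compute_layer(i, dev_buf[b], exec_stream);
        cudaEventRecord(consumed[b], exec_stream);
    }
    cudaStreamSynchronize(exec_stream);

    for (int b = 0; b < 2; ++b) {
        cudaFree(dev_buf[b]);
        cudaEventDestroy(uploaded[b]);
        cudaEventDestroy(consumed[b]);
    }
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(exec_stream);
}

int main() {
    const size_t layer_bytes = 8u << 20;   // toy 8 MB "layers" instead of ~670 MB
    const int num_layers = 8;
    std::vector<void*> layers(num_layers);
    for (auto& p : layers) cudaHostAlloc(&p, layer_bytes, cudaHostAllocDefault);
    forward_streamed(layers.data(), layer_bytes, num_layers);
    for (auto p : layers) cudaFreeHost(p);
    std::puts("streamed all layers");
}
```

The per-buffer "uploaded"/"consumed" events are what let the copy of layer i overlap the compute of layer i-1 without ever overwriting weights the GPU is still reading.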
- Linux (tested on Ubuntu, kernel 6.17+)
- CUDA Toolkit 13.1
- gcc-14 / g++-14
- NVIDIA GPU with Compute Capability 8.0+ (RTX 3090 tested)
- CMake 3.24+
- (Optional) NVMe SSD on a separate PCIe slot + the gpu-nvme-direct library
```bash
# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc-14 \
-DCMAKE_CXX_COMPILER=g++-14 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j
# Run (resident mode — model fits in VRAM)
./ntransformer -m /path/to/llama-8b-q8_0.gguf -p "Hello" -n 128
# Run (streaming mode — model larger than VRAM)
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --streaming
# Chat mode
./ntransformer -m /path/to/model.gguf --chat
# Benchmark
./ntransformer -m /path/to/model.gguf --benchmark -n 64
```

For models that don't fit in VRAM, the NVMe backend eliminates the CPU from the data path:
NVMe SSD → (DMA) → Pinned Staging → (PCIe H2D) → GPU Buffers → Compute
```bash
# Build with NVMe support (requires gpu-nvme-direct library)
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_GPUNVME=ON \
-DCMAKE_C_COMPILER=gcc-14 -DCMAKE_CXX_COMPILER=g++-14 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j
# Write GGUF model to NVMe raw device
sudo ./scripts/restore_nvme.sh # ensure kernel driver is bound
sudo dd if=model.gguf of=/dev/nvme0n1 bs=1M oflag=direct status=progress
# Bind NVMe to VFIO for userspace access
sudo ./scripts/setup_nvme.sh # loads VFIO, forces D0, enables BusMaster
# Run with NVMe backend
sudo GPUNVME_PCI_BDF=0000:01:00.0 GPUNVME_GGUF_LBA=0 \
./build/ntransformer -m /path/to/model.gguf -p "Hello" -n 32 --streaming
# Restore NVMe to kernel driver when done
sudo ./scripts/restore_nvme.sh
```

How it works:

- The GGUF model file is written to raw NVMe blocks via `dd`
- `setup_nvme.sh` binds the NVMe to VFIO, forces the PCIe D0 power state, and enables BusMaster
- gpu-nvme-direct initializes the NVMe controller from userspace (admin queues, I/O queues)
- During inference, each layer (~670 MB for 70B Q6_K) is read via 670 NVMe commands in ~202 ms
- Data lands in CUDA pinned staging memory, then async DMA to GPU compute buffers
- Pipeline overlaps NVMe reads, H2D DMA, and GPU compute across double buffers
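A rough sketch of one layer's trip along this path; `gpunvme_read` is a hypothetical placeholder for whatever read primitive gpu-nvme-direct actually exposes, and `compute_layer` stands in for the engine's per-layer kernels:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

// Hypothetical stand-ins; the real gpu-nvme-direct API may look quite different.
static void gpunvme_read(uint64_t, size_t, void*) { /* NVMe DMA into the destination */ }
static void compute_layer(int, const void*, cudaStream_t) { /* layer kernels */ }

// One streamed layer on the NVMe path: raw blocks -> pinned staging -> device buffer.
void run_layer_from_nvme(int layer, uint64_t layer_lba, size_t layer_bytes,
                         void* pinned_staging, void* dev_buf, cudaStream_t stream) {
    // 1. The NVMe controller DMAs the layer's blocks into CUDA pinned host memory
    //    (~670 MB for 70B Q6_K, split across many read commands).
    gpunvme_read(layer_lba, layer_bytes, pinned_staging);

    // 2. Async PCIe H2D copy from the staging area into the layer's device buffer.
    cudaMemcpyAsync(dev_buf, pinned_staging, layer_bytes,
                    cudaMemcpyHostToDevice, stream);

    // 3. This layer's kernels go on the same stream, so they run only after the copy;
    //    the other double buffer's NVMe read and copy proceed in parallel.
    compute_layer(layer, dev_buf, stream);
}
```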
```
src/
├── core/            # Tensor, allocator, GPU device management
├── cuda/            # CUDA kernels: GEMV, RMSNorm, RoPE, SwiGLU, softmax
├── memory/          # SLEP layer streaming engine (NVMe + mmap backends)
├── model/           # Transformer: config, GGUF loader, attention, FFN, norms
├── inference/       # Tokenizer, sampler, engine
├── utils/           # Timer, logger
└── main.cpp         # CLI entry point
scripts/
├── setup_nvme.sh    # Bind NVMe to VFIO, configure for gpu-nvme-direct
└── restore_nvme.sh  # Restore NVMe to kernel driver
tests/               # Unit tests (tensor, GEMM kernels, NVMe layer loader)
```
`forward_tiered()` runs the hybrid pipeline:

```
Tier A (VRAM resident, layers 0..28):
GPU Compute: [layer 0][layer 1]...[layer 28] (zero I/O, weights permanent)
Tier B (pinned RAM, layers 29..79, double-buffered):
H2D DMA: [L29→gpu0][L30→gpu1][L31→gpu0]... (async from pinned RAM)
GPU Compute: [ ][layer 29][layer 30]... (overlapped with H2D)
Tier C (NVMe/mmap fallback, if needed):
NVMe/memcpy: [read L→stg0][read L→stg1]...
H2D DMA: [ ][stg0→gpu0 ]...
GPU Compute: [ ][ ][layer]...
```

Tier sizes are auto-computed from `cudaMemGetInfo()` and `MemAvailable` in `/proc/meminfo`.
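A simplified version of that sizing logic, assuming ~670 MB layers and guessed reserve margins for the KV cache, activations, and the OS; the real policy may differ:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <fstream>
#include <string>

// Parse MemAvailable (reported in kB) out of /proc/meminfo.
static size_t mem_available_bytes() {
    std::ifstream f("/proc/meminfo");
    std::string key;
    size_t kb = 0;
    while (f >> key >> kb) {
        if (key == "MemAvailable:") return kb * 1024;
        f.ignore(256, '\n');  // skip the "kB" suffix and the rest of the line
    }
    return 0;
}

int main() {
    size_t free_vram = 0, total_vram = 0;
    cudaMemGetInfo(&free_vram, &total_vram);
    const size_t avail_ram = mem_available_bytes();

    const size_t layer_bytes  = 670ull << 20;  // ~670 MB per 70B Q6_K layer
    const size_t vram_reserve = 4ull << 30;    // guess: KV cache + activations + workspace
    const size_t ram_reserve  = 8ull << 30;    // guess: OS + staging buffers
    const int    num_layers   = 80;            // Llama 3.1 70B

    int tier_a = (int)((free_vram > vram_reserve ? free_vram - vram_reserve : 0) / layer_bytes);
    int tier_b = (int)((avail_ram > ram_reserve ? avail_ram - ram_reserve : 0) / layer_bytes);
    if (tier_a > num_layers) tier_a = num_layers;
    if (tier_b > num_layers - tier_a) tier_b = num_layers - tier_a;
    const int tier_c = num_layers - tier_a - tier_b;  // whatever is left falls back to NVMe/mmap

    std::printf("Tier A (VRAM): %d   Tier B (pinned RAM): %d   Tier C (NVMe/mmap): %d\n",
                tier_a, tier_b, tier_c);
    return 0;
}
```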
| Format | Bits/Weight | Block Size | Supported |
|---|---|---|---|
| Q4_0 | 4.5 | 32 | Yes |
| Q8_0 | 8.5 | 32 | Yes |
| Q4_K_M | 4.5 | 256 | Yes |
| Q6_K | 6.6 | 256 | Yes |
| F16 | 16 | 1 | Yes |
| F32 | 32 | 1 | Yes |
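The bits/weight column follows directly from the GGUF block layouts. A sketch for the two 32-weight formats, with struct layouts mirroring ggml's `block_q4_0`/`block_q8_0` (the fp16 scale is stored as raw bits here):

```cpp
#include <cstdint>
#include <cstdio>

// One Q4_0 block: an fp16 scale plus 32 weights packed as 4-bit nibbles.
struct BlockQ4_0 {
    uint16_t d;       // fp16 scale (raw bits)
    uint8_t  qs[16];  // 32 x 4-bit quants, two per byte
};                    // 18 bytes per 32 weights -> 4.5 bits/weight

// One Q8_0 block: an fp16 scale plus 32 signed 8-bit weights.
struct BlockQ8_0 {
    uint16_t d;       // fp16 scale (raw bits)
    int8_t   qs[32];  // 32 x 8-bit quants
};                    // 34 bytes per 32 weights -> 8.5 bits/weight

int main() {
    std::printf("Q4_0: %.1f bits/weight\n", sizeof(BlockQ4_0) * 8.0 / 32);  // 4.5
    std::printf("Q8_0: %.1f bits/weight\n", sizeof(BlockQ8_0) * 8.0 / 32);  // 8.5
    // Dequantization is w[i] = d * q[i], with Q4_0 nibbles re-centered by -8.
    return 0;
}
```

The K-quant formats (Q4_K_M, Q6_K) use 256-weight super-blocks with extra per-sub-block scales, which is where their fractional bits/weight come from.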
- Phase 1 - Foundation (complete): Llama 8B Q8_0, custom CUDA kernels, 48.9 tok/s
- Phase 2 - SLEP Streaming (complete): 70B on single GPU, 3-tier caching, 33x speedup
- Phase 3 - Advanced Quantization: RotateKV (INT2 KV-cache), adaptive per-layer precision
- Phase 4 - Novel Architectures: MLA, Mamba/SSM, speculative decoding
- Phase 5 - Polish: optimization, benchmarks, public C API
BSD-2-Clause