Running a 397B Parameter Model on a Mac with 48GB of RAM
Flash-MoE: Running a 397B Parameter Model on a Laptop

Original link: https://github.com/danveloper/flash-moe

## Qwen3.5-397B-A17B on a MacBook Pro: Results After 24 Hours

Researchers successfully ran the large Qwen3.5-397B-A17B (397 billion parameter) Mixture-of-Experts model on a MacBook Pro with 48GB of RAM, reaching 4.4+ tokens/second with high-quality output, including tool calling. This was accomplished in 24 hours with a pure C/Metal inference engine, bypassing Python and conventional frameworks. The 209GB model streams from SSD through a custom Metal compute pipeline, with caching left to the operating system ("trust the OS"), which reaches a 71% hit rate. Key optimizations include an FMA-optimized dequantization kernel (+12% speed) and hand-tuned Metal shaders. Although 2-bit quantization is faster, it undermines JSON/tool-calling reliability, so 4-bit became the preferred configuration. The project prioritizes efficient memory management, using only about 6GB of RAM and streaming expert weights on demand. Notably, the team found that overlapping SSD DMA with GPU compute *reduced* performance due to memory-controller contention, so they chose a serial pipeline. The code, roughly 8,200 lines in total, is publicly available and demonstrates a remarkable piece of low-level optimization.

## Flash-MoE: Running a 397B Parameter Model on a Laptop (Discussion Summary)

This Hacker News discussion centers on a project that successfully ran the Qwen 3.5 397B-parameter language model on a laptop using a technique called Flash-MoE. The core achievement is streaming the model from SSD and processing it with custom Metal compute shaders. While impressive, the roughly 4.4 tokens/second throughput struck some users as too slow for practical use. The discussion highlights that the feat requires substantial hardware (a high-end MacBook Pro) and involves trade-offs, such as 4-bit quantization, which can affect output quality. Alternatives like lower-bit (2-bit) quantization or reducing the number of activated experts were discussed, along with concerns about severe drops in performance and output coherence. Many commenters shared their experiences running Qwen 3.5 at different quantization levels and on different hardware configurations, noting that 4-bit quantization generally strikes a better balance between speed and quality. The conversation also touched on potential SSD wear, the benefits of larger RAM capacities, and the broader implications of running large models on consumer hardware versus dedicated server infrastructure.

Read the paper — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.

The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

Progress

| Configuration | tok/s | Quality | Notes |
|---|---|---|---|
| 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB on disk. |
| 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. |
| 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. *Breaks JSON/tool calling. |
| 2-bit peak single token | 7.05 | Good* | Warm cache burst. *Not suitable for tool use. |

*2-bit quantization produces `\name\` instead of `"name"` in JSON output, making tool calling unreliable. 4-bit is the production configuration.

  • Machine: MacBook Pro, Apple M3 Max
  • Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
  • Memory: 48 GB unified (~400 GB/s bandwidth)
  • SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured)
  • macOS: 26.2 (Darwin 25.2.0)

The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.
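
For concreteness, here is a minimal C sketch of what per-token top-K routing over 512 experts implies. The function name, the insertion-scan selection, and the flat-array interface are illustrative assumptions, not the engine's actual code:

```c
#include <float.h>

#define NUM_EXPERTS 512   /* experts per layer */
#define TOP_K       4     /* active experts per token */

/* Hypothetical routing step: pick the TOP_K highest router logits.
 * A simple insertion scan is O(NUM_EXPERTS * TOP_K), negligible next
 * to the SSD reads it triggers. */
static void route_topk(const float logits[NUM_EXPERTS],
                       int idx[TOP_K], float score[TOP_K]) {
    for (int k = 0; k < TOP_K; k++) { idx[k] = -1; score[k] = -FLT_MAX; }
    for (int e = 0; e < NUM_EXPERTS; e++) {
        for (int k = 0; k < TOP_K; k++) {
            if (logits[e] > score[k]) {
                /* shift lower-ranked entries down, insert expert e */
                for (int j = TOP_K - 1; j > k; j--) {
                    score[j] = score[j - 1];
                    idx[j]   = idx[j - 1];
                }
                score[k] = logits[e];
                idx[k]   = e;
                break;
            }
        }
    }
}
```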

  1. SSD Expert Streaming — Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed ("Trust the OS" principle). Inspired by Apple's "LLM in a Flash" paper.
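
A sketch of the "parallel pread() with GCD dispatch groups" idea, assuming a single packed expert file and caller-supplied byte ranges; the names and single-fd layout are illustrative assumptions, not the engine's actual interface:

```c
#include <dispatch/dispatch.h>
#include <unistd.h>

#define TOP_K 4

/* Fan out one pread() per active expert on a concurrent GCD queue,
 * then wait for all of them. The OS page cache transparently serves
 * any expert that is still warm ("trust the OS"). */
static void load_active_experts(int expert_fd,
                                const off_t offsets[TOP_K],
                                const size_t sizes[TOP_K],
                                void *dst[TOP_K]) {
    dispatch_group_t group = dispatch_group_create();
    dispatch_queue_t q =
        dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
    for (int k = 0; k < TOP_K; k++) {
        dispatch_group_async(group, q, ^{
            /* each block reads a disjoint range; pread() is positional,
             * so no shared file offset is mutated */
            ssize_t n = pread(expert_fd, dst[k], sizes[k], offsets[k]);
            (void)n; /* production code would check n == sizes[k] */
        });
    }
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
}
```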

  2. FMA-Optimized Dequant Kernel — The inner loop of the 4-bit dequantizing matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU's fused multiply-add unit perform dequantization and multiplication in a single instruction, 12% faster than the naive formulation.
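
The rewrite is easiest to see in scalar C; fmaf() below stands in for the GPU's fused multiply-add instruction (the real kernel is a Metal shader):

```c
#include <math.h>

/* Naive form: dequantize each 4-bit value, then multiply by x. */
static inline float dequant_mul_naive(float nibble, float scale,
                                      float bias, float x) {
    return (nibble * scale + bias) * x;
}

/* FMA form: scale*x and bias*x are hoisted out and computed once per
 * input element, so the hot loop is a single fused multiply-add:
 *   fmaf(nibble, sx, bx) == nibble*scale*x + bias*x
 *                        == (nibble*scale + bias) * x                */
static inline float dequant_mul_fma(float nibble, float sx, float bx) {
    return fmaf(nibble, sx, bx);
}
```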

  3. Metal Compute Shaders — Hand-written Metal kernels for:

    • 4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized)
    • Fused SwiGLU activation
    • RMS normalization (two-pass: sum-of-squares reduction + apply; a scalar reference sketch follows this list)
    • Batched GPU attention (Q@K^T, softmax, scores@V) for full attention layers
    • GPU RoPE (fused with Q deinterleave and K normalization)
    • MoE combine + residual + sigmoid gate (fused kernel)
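
As a reference for one of these kernels, here is a scalar C version of the two-pass RMS norm; the epsilon is an assumed value, and the production kernel runs pass 1 as a SIMD reduction on the GPU:

```c
#include <math.h>

/* Two-pass RMS norm: (1) reduce sum of squares, (2) normalize and
 * apply the per-channel weight. dim would be 4096 here. */
static void rmsnorm(float *out, const float *x, const float *weight,
                    int dim) {
    const float eps = 1e-6f;             /* assumed epsilon */
    float ss = 0.0f;
    for (int i = 0; i < dim; i++)        /* pass 1: reduction */
        ss += x[i] * x[i];
    const float inv_rms = 1.0f / sqrtf(ss / (float)dim + eps);
    for (int i = 0; i < dim; i++)        /* pass 2: apply */
        out[i] = x[i] * inv_rms * weight[i];
}
```
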
  4. Deferred GPU Expert Compute — CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer. The combine + residual + norm are also on GPU, feeding directly into the next layer's attention projections.
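
In Metal terms the deferral looks roughly like the Objective-C sketch below; the pipeline state, buffers, and dispatch sizes are illustrative placeholders:

```objc
#import <Metal/Metal.h>

/* Encode and commit the expert pass (CMD3) without waiting on it.
 * The GPU works through it while the CPU routes and preads the next
 * layer's experts. */
static void submit_expert_pass_deferred(id<MTLCommandQueue> queue,
                                        id<MTLComputePipelineState> pso,
                                        id<MTLBuffer> experts,
                                        id<MTLBuffer> hidden) {
    id<MTLCommandBuffer> cmd3 = [queue commandBuffer];
    id<MTLComputeCommandEncoder> enc = [cmd3 computeCommandEncoder];
    [enc setComputePipelineState:pso];
    [enc setBuffer:experts offset:0 atIndex:0];
    [enc setBuffer:hidden  offset:0 atIndex:1];
    [enc dispatchThreadgroups:MTLSizeMake(64, 1, 1)   /* illustrative */
        threadsPerThreadgroup:MTLSizeMake(256, 1, 1)];
    [enc endEncoding];
    [cmd3 commit];   /* no waitUntilCompleted */
    /* Command buffers on one MTLCommandQueue execute in commit order,
     * so the next layer's CMD1 can safely consume CMD3's output. */
}
```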

  5. Accelerate BLAS for Linear Attention — The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger for the 64-head × 128×128 state matrix update. 64% faster than scalar code.
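
To illustrate how those three calls can map onto a per-head state update, here is a hedged C sketch of a decayed delta-rule step; the actual GatedDeltaNet recurrence and gating are the engine's, and the names and exact formula below are assumptions:

```c
#include <Accelerate/Accelerate.h>

#define HEAD_DIM 128

/* One head's update, shaped like S <- decay*S + beta*(v - S k) k^T.
 * S is a 128x128 row-major state matrix; tmp is 128 floats of scratch. */
static void delta_state_update(float *S, const float *k, const float *v,
                               float decay, float beta, float *tmp) {
    /* decay the whole state matrix in place */
    cblas_sscal(HEAD_DIM * HEAD_DIM, decay, S, 1);
    /* tmp = S @ k (what the current state predicts for key k) */
    cblas_sgemv(CblasRowMajor, CblasNoTrans, HEAD_DIM, HEAD_DIM,
                1.0f, S, HEAD_DIM, k, 1, 0.0f, tmp, 1);
    /* tmp = beta * (v - tmp): the delta-rule correction */
    for (int i = 0; i < HEAD_DIM; i++)
        tmp[i] = beta * (v[i] - tmp[i]);
    /* S += tmp @ k^T: rank-1 write-back of the correction */
    cblas_sger(CblasRowMajor, HEAD_DIM, HEAD_DIM,
               1.0f, tmp, 1, k, 1, S, HEAD_DIM);
}
```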

  6. Trust the OS — No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. Every custom caching approach we tested (Metal LRU, malloc cache, LZ4 compressed cache) was slower due to GPU memory pressure or overhead. The page cache achieves ~71% hit rate naturally.

Pipeline Per Layer (4.28ms average at 4-bit)

```
CMD3(prev) → CMD1: attention projections + delta-net  [1.22ms GPU]
           → CPU: flush results                       [0.01ms CPU]
           → CMD2: o_proj + norm + routing + shared    [0.55ms GPU]
           → CPU: softmax + topK routing               [0.003ms CPU]
           → I/O: parallel pread K=4 experts           [2.41ms SSD]
           → CMD3: expert forward + combine + norm     [0.04ms encode, DEFERRED]
```

Unified Memory Constraint

On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.

```sh
cd metal_infer
make
# 4-bit inference (needs packed_experts/ directory)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster but breaks tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Interactive chat with tool calling
./chat

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
metal_infer/
  infer.m              # Complete inference engine (~7000 lines)
  shaders.metal        # Metal compute kernels (~1200 lines)
  chat.m               # Interactive chat TUI with tool calling
  tokenizer.h          # C BPE tokenizer (single-header, 449 lines)
  main.m               # MoE-only benchmark
  Makefile             # Build system
  extract_weights.py   # Creates model_weights.bin from safetensors
  repack_experts_2bit.py  # 4-bit → 2-bit expert requantization
  train_predictor.py   # Expert routing prediction analysis
  model_weights.bin    # Non-expert weights (5.5GB, mmap'd)
  model_weights.json   # Tensor manifest
  vocab.bin            # Vocabulary for token decoding
  tokenizer.bin        # Pre-exported BPE tokenizer data

repack_experts.py      # 4-bit expert packing from safetensors
progress.py            # Results visualization (Q2/Q4 tracks)
results.tsv            # Experiment log (58 experiments)
```

What We Tried (and What Worked)

| Approach | Result | Impact |
|---|---|---|
| FMA dequant kernel | GPU compute -12% | +12% tok/s |
| Trust OS page cache | Deleted Metal LRU → +38% | Foundational |
| GPU combine+norm in CMD3 | Eliminates CPU round-trip | Pipeline |
| BLAS delta-net (Accelerate) | cpu_attn 0.78→0.28ms | +64% attn |
| F_NOCACHE for 2-bit | +3% from avoiding page thrash | 2-bit only |
| GPU fused attention (RoPE) | +2% for full-attn layers | Small |
| C BPE tokenizer | 180ms vs 3500ms startup | 20× startup |
| Deferred CMD3 execution | GPU/CPU overlap | Pipeline |

Discarded (58 experiments, highlights)

| Approach | Result | Why |
|---|---|---|
| LZ4 expert compression | -13% | Decompress overhead > warm cache savings |
| F_RDADVISE prefetch | net 0% | Unified memory: SSD DMA slows GPU -73% |
| Temporal expert prediction | -18% | 25% hit rate, SSD bandwidth waste |
| MLP routing predictor | 31% accuracy | Worse than temporal baseline |
| GPU LUT dequant kernel | -2% | Indirect register access serializes |
| GPU private buffer compression | -20% pipeline | Blit cost 4×7MB > matvec savings |
| Spin-poll GPU wait | -23% | CPU thermal competes with GPU |
| Expert file clustering | 0% | NVMe ignores scatter at 7MB granularity |
| dispatch_io | -70% | dispatch_data management overhead |
| mmap expert files | -5× | Per-page fault overhead on cold data |
| Speculative early routing | -38% | Cache pollution + overhead |
| MTP speculative decoding | break-even | MoE I/O scales per-token (unlike dense) |

This is a primary development machine. The engine explicitly controls memory:

  • Non-expert weights: 5.5GB (mmap'd, read-only)
  • Metal scratch buffers: ~200MB
  • Total: ~6GB, leaving 42GB for OS + page cache
  • No OOM risk. Expert data streams from SSD on demand.
  • No custom caches. Trust the OS.