TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS

Original link: https://github.com/SharpAI/SwiftLM

## SwiftLM: A Fast, Native MLX Inference Server

SwiftLM is a high-performance, native Swift inference server designed for Apple Silicon. It runs MLX models and exposes an OpenAI-compatible API, eliminating the overhead of Python and the Global Interpreter Lock (GIL) for maximum speed.

Key features include direct loading of HuggingFace models with native Safetensors parsing, and integrated TurboQuantization, a hybrid quantization technique that sharply compresses memory (the KV cache by up to ~4.3×) with near-zero accuracy loss. An experimental SSD streaming feature lets large models (122B+ parameters) run by swapping MoE layers directly from the NVMe SSD to the GPU.

SwiftLM offers granular memory control and includes a native iOS app for on-device inference. It has been benchmarked on Apple M-series chips and prioritizes stability, avoiding the system crashes that oversized models can trigger. Pre-built binaries are provided for quick setup, and standard OpenAI API endpoints are supported.

## TurboQuant & SSD Streaming for Large Models

SharpAI has released a project (github.com/sharpai) that runs large 100B+ parameter Mixture-of-Experts (MoE) models on Apple Silicon, specifically the M5 Pro and the iPhone 13 Pro. It does so with two key techniques: **TurboQuant KV compression**, which shrinks the KV cache ~4.3× via optimized C++ and Metal shaders, and **SSD expert streaming**, which streams only the active expert weights from NVMe storage to the GPU at 9 GB/s. Together these enable inference on a 122B-parameter model using just 2.69 GB of GPU VRAM.

While the project demonstrates technical feasibility, commenters noted the lack of comprehensive benchmarks, particularly tokens-per-second figures. A recurring discussion point was the rise of "vibe coding", using tools like Claude to rapidly generate implementations from papers, and whether contributions that lack thorough testing and benchmarking are valuable. Despite that debate, others stressed the growing importance of optimization under today's resource constraints.

## Original Text

A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API.

No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.

SwiftLM Chat iOS demo

  • 🍎 100% Native Apple Silicon: Powered natively by Metal and Swift.
  • 🔌 OpenAI-compatible: Drop-in replacement for OpenAI SDKs (/v1/chat/completions, streaming, etc).
  • 🧠 Smart Model Routing: Loads HuggingFace format models directly, with native Safetensors parsing.
  • ⚡️ TurboQuantization Integrated: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out-of-the-box.
  • 💾 SSD Expert Streaming: Experimental zero-copy streaming that swaps Mixture of Experts (MoE) layers directly from the NVMe SSD to the GPU command buffer without trashing macOS Unified Memory (prevents Watchdog OS kernel panics on 122B+ models).
  • 🎛️ Granular Memory Control: Integrated Layer Partitioning (--gpu-layers) and Wisdom Auto-Calibration for squeezing massive models into RAM.
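The SSD expert streaming idea can be illustrated with a toy sketch. The real implementation does zero-copy NVMe reads into Metal command buffers from Swift; the Python/NumPy version below (shapes, file layout, and router are invented for illustration) shows only the core trick: expert weights live in a file on disk, and per token only the router-selected experts are paged into memory.

```python
import os
import tempfile
import numpy as np

# Hypothetical MoE shape for illustration (real models are far larger).
n_experts, d_in, d_out, top_k = 16, 64, 64, 2

# Write all expert matrices to one flat file, simulating weights on SSD.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
np.random.default_rng(0).standard_normal(
    (n_experts, d_in, d_out), dtype=np.float32
).tofile(path)

# Memory-map the file: nothing is loaded until a slice is touched.
experts = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(n_experts, d_in, d_out))

def moe_forward(x, router_logits):
    """Compute the MoE output, reading only the top-k experts from disk."""
    idx = np.argsort(router_logits)[-top_k:]          # router picks top-k experts
    w = np.exp(router_logits[idx] - router_logits[idx].max())
    w /= w.sum()                                      # softmax over selected logits
    out = np.zeros(d_out, dtype=np.float32)
    for weight, e in zip(w, idx):
        # Only this expert's weight matrix is paged in from the file.
        out += weight * (x @ np.asarray(experts[e]))
    return out
```

Because inactive experts are never touched, resident memory stays proportional to `top_k`, not `n_experts`, which is what keeps a 122B MoE inside a small VRAM budget.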

⚡️ TurboQuantization: KV Cache Compression

SwiftLM implements a hybrid V2+V3 TurboQuant architecture for on-the-fly KV cache compression. At roughly 3.7 bits per coordinate overall (4.25 bits for K, 3.125 bits for V), the KV cache is compressed ~4.3× vs FP16 with near-zero accuracy loss.

Recent reproductions of the TurboQuant algorithm (e.g., turboquant-mlx) revealed two distinct paths:

  1. V2 (Hardware-Accelerated): Fast, but uses linear affine quantization which degrades quality at 3-bit.
  2. V3 (Paper-Correct): Excellent quality using non-linear Lloyd-Max codebooks, but painfully slow due to software dequantization.

We built the "Holy Grail" hybrid that combines V2 speed with V3 quality: the V3 non-linear Lloyd-Max codebooks are ported directly into the native C++ encoding path, and dequantization runs in fused Metal (ggml-metal) shaders. This achieves V3 quality at V2 speed, with zero Python overhead.

K-Cache (3-bit PolarQuant + 1-bit QJL) = 4.25 bits/dim

  1. Extract L2 norm and normalize: x̂ = x / ‖x‖
  2. Apply Fast Walsh-Hadamard Transform (WHT) rotation to distribute outliers evenly.
  3. Quantize each coordinate using 3-bit non-linear Lloyd-Max centroids.
  4. Compute the residual error between the original vector and the quantized approximation.
  5. Project the residual via a random Johnson-Lindenstrauss (QJL) matrix and store the 1-bit signs. (Why QJL? QJL acts as an additional regularizer that prevents centroid resolution loss from degrading the attention dot-product.)
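The five K-cache steps can be sketched in a few lines of NumPy. This mirrors the structure of the Python reference implementation (turboquant_plus) rather than the native C++/Metal path; the 8-level codebook below is an evenly spaced placeholder, not the fitted Lloyd-Max centroids, and the QJL matrix is a plain Gaussian projection.

```python
import numpy as np

D = 128                                           # head dimension (power of two)
rng = np.random.default_rng(42)
SIGNS = rng.choice([-1.0, 1.0], size=D)           # random sign flips before the WHT
QJL = rng.standard_normal((D, D))                 # JL projection for the residual

def fwht(v):
    """Orthonormal Fast Walsh-Hadamard Transform; len(v) must be a power of two."""
    v = v.copy()
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v / np.sqrt(n)

# Placeholder 3-bit codebook; the real kernels use Lloyd-Max centroids for N(0, 1/d).
CENTROIDS = np.linspace(-2.5, 2.5, 8) / np.sqrt(D)

def encode_k(x):
    """K-cache encode: 3 bits (codes) + 1 bit (QJL signs) + norm ≈ 4.25 bits/dim."""
    norm = np.linalg.norm(x)                      # 1. extract the L2 norm
    rot = fwht((x / norm) * SIGNS)                # 2. WHT rotation spreads outliers
    codes = np.abs(rot[:, None] - CENTROIDS[None, :]).argmin(axis=1)  # 3. quantize
    residual = rot - CENTROIDS[codes]             # 4. quantization error
    signs = QJL @ residual >= 0                   # 5. keep only the 1-bit signs
    return norm, codes.astype(np.uint8), signs
```

The V-cache path is the same sketch with steps 4-5 dropped, since QJL correction only pays off where the vector feeds an attention dot-product.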

V-Cache (3-bit PolarQuant) = 3.125 bits/dim

Because the V-cache matrix is not used for inner-product attention scoring, the QJL error correction provides no benefit. We cleanly disable QJL for the V-cache, extracting an additional ~25% memory savings without sacrificing quality.
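The per-dimension rates for the K- and V-caches pin down the overall compression vs FP16. A quick check of the arithmetic (the 48-layer, 8-KV-head, 32K-token model shape is hypothetical, chosen only to make the sizes concrete):

```python
k_bits, v_bits, fp16_bits = 4.25, 3.125, 16.0

# K and V caches hold the same number of coordinates, so average the two rates.
avg_bits = (k_bits + v_bits) / 2      # 3.6875 bits per coordinate overall
ratio = fp16_bits / avg_bits          # compression factor vs FP16

# Concrete footprint: 48 layers x 8 KV heads x head_dim 128 x 32768 tokens,
# times 2 for K and V.
coords = 2 * 48 * 8 * 128 * 32768
fp16_gb = coords * fp16_bits / 8 / 2**30
tq_gb = coords * avg_bits / 8 / 2**30
```

At these rates a 6 GB FP16 KV cache shrinks to under 1.4 GB.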

Reference implementations: turboquant-mlx | turboquant_plus | Paper: TurboQuant (Google, arXiv:2504.19874)


💻 Tested Hardware & Benchmarks

To reliably run massive 122B parameter MoE models over SSD streaming, SwiftLM was designed and benchmarked natively on the following hardware:

  • Machine: MacBook Pro, Apple M5 Pro
  • Memory: 64 GB Unified Memory
  • Model: Qwen3.5-122B-A10B-4bit
  • SSD: Internal Apple NVMe (Zero-Copy Streaming)

⚠️ Quantization Disclaimer: While heavier quantization shrinks the required memory footprint, 4-bit quantization remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like \name\ instead of "name"—which systematically breaks OpenAI-compatible tool calling.



📱 SwiftLM Chat — iOS App

A native iPhone & iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.

  • Tab UI: Chat · Models · Settings
  • Live download progress with speed indicator and circular progress ring
  • Model catalog: Qwen3, Phi-3.5, Mistral, Llama — with on-device RAM fit indicators
  • HuggingFace search — find any mlx-community model by name
  • Context-aware empty states — downloading ring, loading spinner, idle prompt
  • iOS lifecycle hardened — model unload only fires on true background (not notification banners); 30-second grace period on app-switch
Build the iOS app from source:

```shell
cd SwiftLMChat
python3 generate_xcodeproj.py       # Generates SwiftLMChat.xcodeproj
open SwiftLMChat.xcodeproj
```

Then in Xcode:

  1. Select the SwiftLMChat target → Signing & Capabilities
  2. Set your Team (your Apple Developer account)
  3. Select your iPhone as the run destination
  4. ⌘R to build and run

Note for contributors: The .xcodeproj is git-ignored (it contains your personal Team ID). Run generate_xcodeproj.py after cloning to regenerate it locally. Your Team ID is never committed.


🛠️ Quick Start (macOS Server)

Fastest: Download Pre-built Binary

The absolute fastest way to get started is to download the latest pre-compiled macOS binary directly from the Releases page. Just extract it and run!

Run (Downloads model natively on first launch)

```shell
.build/release/SwiftLM \
  --model Qwen3.5-122B-A10B-4bit \
  --stream-experts true \
  --port 5413
```

(Note: the `--stream-experts true` flag is required when running oversized MoE models like Qwen3.5 122B; it bypasses macOS virtual memory swapping.)


API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Server health + loaded model capabilities |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (LLM and VLM support, multi-turn, system prompts) |

Chat Completion (Streaming)

Drop-in compatible with standard OpenAI HTTP consumers:

```shell
curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-122B-A10B-4bit",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are Aegis-AI, a local home security agent. Output strictly in JSON format."},
      {"role": "user", "content": "Clip 1: Delivery person drops package at 14:02. Clip 2: Delivery person walks away down driveway at 14:03. Do these clips represent the same security event? Output a JSON object with a `duplicate` boolean and a `reason` string."}
    ]
  }'
```
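The same request can be issued from any OpenAI-style HTTP client. A stdlib-only Python sketch (non-streaming for brevity; the endpoint, port, and model name match the curl example, and the helper name is our own):

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, stream=False):
    """Construct an HTTP request for the /v1/chat/completions endpoint."""
    body = json.dumps({"model": model, "stream": stream,
                       "messages": messages}).encode()
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:5413",
    "Qwen3.5-122B-A10B-4bit",
    [{"role": "user", "content": "Say hello in one word."}],
)

# With the server running, send it and read the reply:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```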

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | (required) | HuggingFace model ID or local path |
| `--port` | `5413` | Port to listen on |
| `--host` | `127.0.0.1` | Host to bind |
| `--max-tokens` | `2048` | Max token limit per generation |
| `--gpu-layers` | model default | Restrict the number of layers allocated to the GPU |
| `--stream-experts` | `false` | Enable experimental SSD streaming for MoE model expert matrices |

Requirements:

  • macOS 14.0+
  • Apple Silicon (M1/M2/M3/M4/M5)
  • Xcode Command Line Tools
  • Metal Toolchain (xcodebuild -downloadComponent MetalToolchain)

📄 Dependencies & License

Built entirely on the hard work of the Apple MLX community.

The TurboQuant KV cache compression implemented in SwiftLM is directly based on the following open-source work and research:

  • TheTom/llama-cpp-turboquant — The primary reference for the C and Metal GPU implementation. The turbo-wht.h Fast Walsh-Hadamard kernel, WHT sign arrays (seed=42), Lloyd-Max centroid tables, and the ggml-turbo-quant.c quantize/dequantize logic were ported directly from this repository into our MLX C++ and Metal backend.

  • TheTom/turboquant_plus — Python reference implementation used to validate the algorithm math, codebook construction (Lloyd's algorithm for N(0, 1/d)), and KV cache integration design.

  • TurboQuant Paper — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", Zandieh et al., AISTATS/ICLR 2026. The two-stage PolarQuant + QJL algorithm described in Section 3 and Appendix A is the mathematical foundation of this implementation.

  • amirzandieh/QJL — Original Quantized Johnson-Lindenstrauss (QJL) 1-bit residual correction implementation by the paper authors.

MIT License
