KVarN: Native vLLM backend for KV-cache quantization by Huawei

Original link: https://github.com/huawei-csl/KVarN

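**KVarN** is a high-performance KV-cache quantization solution built for vLLM, designed to overcome the traditional trade-off among capacity, speed, and accuracy. Unlike existing methods that give up throughput or accuracy in exchange for cache capacity, KVarN delivers roughly 3-5x the capacity of FP16 at FP16-level accuracy, with comparable or higher throughput. It achieves this through a novel "variance normalization" step, which uses Hadamard rotation and iterative normalization to minimize quantization error before low-bit rounding.

**Key features:**

* **Plug-and-play:** Runs as a native vLLM attention backend. No model changes or calibration are needed; a single flag (`--kv-cache-dtype kvarn_k4v2_g128`) enables it.
* **High performance:** Consistently sits in the "upper-right corner" of the performance metrics, offering longer contexts together with more concurrent requests and improved throughput.
* **Easy deployment:** Installs as a fork of vLLM and supports the standard serving workflow.

KVarN is released under the Apache 2.0 license and is the official implementation of the paper "KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks". It targets production workloads that need long-context inference without compromising quality or speed.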

Hacker News discussion: "KVarN: Native vLLM backend for KV-cache quantization by Huawei" (github.com/huawei-csl), 12 points, posted by theanonymousone 42 minutes ago, 3 comments.

throwa356262 (5 minutes ago): Better performance than TQ and better quality than FP16? Am I reading this right??

v3ss0n (7 minutes ago): Why isn't this a vLLM pull request (PR)?

esafak (0 minutes ago, in reply): This is the product of a research paper; the authors are not aiming to build vLLM.

Original text


KVarN

⚡️ Built for agentic and long-context workloads.

💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.

🔌 Calibration-free, plug-and-play with vLLM. A native vLLM attention backend: add one flag, no model changes, no calibration.

🥊 Up to ~2.4× TurboQuant throughput, same capacity, higher accuracy.


Why KVarN (Variance Normalized KV-Cache)?

kvarn /kvɑːɳ/  ·  noun (Swedish)

  1. A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.

KV-cache quantization usually comes with a catch. As the vLLM TurboQuant blog shows, existing methods buy extra KV-cache capacity but give up throughput (TurboQuant reports 40 to 52% lower throughput for 2.3-3.7x capacity), and aggressive low-bit quantization also tends to cost accuracy. Losing both speed and quality is the main reason KV-cache quantization is rarely turned on in production.

KVarN is built to keep both. On Qwen3-32B (AIME25, 16K-context burst, TP=2) it matches FP16 accuracy and beats its throughput while delivering ~4× the KV-cache capacity:

Figure: KVarN vs FP16 vs TurboQuant: accuracy, throughput and capacity

KVarN stays in the upper-right corner the blog's methods can't reach: FP16-level accuracy, FP16-or-better throughput, and several times the context.


KVarN ships as a vLLM fork. Install it like vLLM, then select the KVarN KV-cache dtype.

# 1. Clone
git clone https://github.com/huawei-csl/KVarN.git
cd KVarN

# 2. Install (uses the upstream precompiled wheel; KVarN kernels are Triton, JIT-compiled at runtime)
VLLM_USE_PRECOMPILED=1 pip install -e .

Then select the KVarN KV-cache dtype from Python:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",
    dtype="float16",                    # KVarN runs in float16
    kv_cache_dtype="kvarn_k4v2_g128",   # enable KVarN
    block_size=128,                     # KVarN tile size
)
print(llm.generate("Explain KV-cache quantization in one sentence.",
                    SamplingParams(max_tokens=64))[0].outputs[0].text)

Serving works the same way:

vllm serve Qwen/Qwen3-32B --dtype float16 --kv-cache-dtype kvarn_k4v2_g128 --block-size 128
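
Once the server is up, it can be queried like any other vLLM deployment. The sketch below assumes the KVarN fork keeps vLLM's standard OpenAI-compatible endpoint (`/v1/completions`) on the default port 8000; the helper name `build_completion_request` is hypothetical and only builds the HTTP request, so it can be inspected without a running server.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="Qwen/Qwen3-32B", max_tokens=64):
    """Build a POST request for vLLM's OpenAI-compatible completions endpoint.

    base_url assumes the default `vllm serve` host and port; adjust as needed.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("Explain KV-cache quantization in one sentence.")

# With the server running, send the request and print the completion:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Because quantization happens inside the attention backend, the client side should not need any KVarN-specific parameters.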

Note: KVarN runs in float16 compute. The tile / page size is currently fixed at 128 (one vLLM block = one KVarN tile); other page sizes are coming soon.

Tip (capacity): KVarN realizes its full KV-cache capacity when there is room to amortize a small fixed decode workspace. On multi-GPU or generous --gpu-memory-utilization setups this is automatic. On a tight single-GPU budget, vLLM's CUDA-graph memory profiler can over-reserve and shrink the KV pool; set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 (and/or raise --gpu-memory-utilization) to recover the full capacity.


Figure: KVarN pipeline: Cache, Rotated Cache, Normalized Cache, Quantized Cache

KVarN quantizes the KV cache one fixed-size token tile at a time, walking each tile through the four stages above:

  1. Cache: the raw fp16 KV tile (channels × tokens), straight from attention.

  2. Rotated Cache: a Hadamard rotation along the channel dimension mixes channels so that per-channel outliers are spread out, making the tile easier to quantize. The rotation is orthonormal, so attention scores are preserved.

  3. Normalized Cache: iterative variance normalization (Sinkhorn-like) alternates column- and row-wise standard-deviation normalization in log space, equalizing variance across the tile and shrinking quantization error before any rounding happens.

  4. Quantized Cache: asymmetric round-to-nearest at low bit-width, with the scales folded back in at read time (keys per channel, values per token).
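
The four stages can be sketched in NumPy for a single tile. This is an illustrative reference under stated assumptions, not the Triton kernels shipped in the repository: the function names, the iteration count, the `eps` value, and the choice to quantize every row (channel) with its own min/max range are all assumptions made for the example.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def variance_normalize(x, iters=4, eps=1e-6):
    """Alternate column- and row-wise std normalization (Sinkhorn-like).

    Scales are accumulated in log space, so that
    x == y * row_scale[:, None] * col_scale[None, :] up to rounding.
    """
    y = x.copy()
    log_r = np.zeros(x.shape[0])
    log_c = np.zeros(x.shape[1])
    for _ in range(iters):
        c = np.log(y.std(axis=0) + eps)   # per-column (token) log std
        y = y / np.exp(c)[None, :]
        log_c += c
        r = np.log(y.std(axis=1) + eps)   # per-row (channel) log std
        y = y / np.exp(r)[:, None]
        log_r += r
    return y, np.exp(log_r), np.exp(log_c)


def quantize_asym(y, bits):
    """Asymmetric round-to-nearest with one (scale, offset) pair per row."""
    levels = 2 ** bits - 1
    lo = y.min(axis=1, keepdims=True)
    hi = y.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.rint((y - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo


rng = np.random.default_rng(0)
channels, tokens = 128, 128

# 1. Cache: a synthetic tile (channels x tokens) with one outlier channel.
tile = rng.standard_normal((channels, tokens))
tile[3] *= 20.0

# 2. Rotated Cache: mix channels with an orthonormal Hadamard matrix.
H = hadamard(channels)
rotated = H @ tile

# Rotating the query with the same matrix leaves the scores unchanged.
q_vec = rng.standard_normal(channels)
scores_raw = q_vec @ tile
scores_rot = (H @ q_vec) @ rotated

# 3. Normalized Cache: equalize variance across rows and columns.
normed, r, c = variance_normalize(rotated)

# 4. Quantized Cache: 4-bit codes; scales are folded back in at read time.
q, scale, lo = quantize_asym(normed, bits=4)
dequant = (q * scale + lo) * r[:, None] * c[None, :]
recon = H.T @ dequant

rel_err = np.linalg.norm(recon - tile) / np.linalg.norm(tile)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The `scores_raw` and `scores_rot` check illustrates why the rotation is free in terms of accuracy: for an orthonormal `H`, `(H q)ᵀ(H K) = qᵀK`, so only the rounding in stage 4 introduces error.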

The shipped preset spends more bits on keys than values (kvarn_k4v2_g128: 4-bit keys, 2-bit values). We chose to release this configuration because it meets the strictest accuracy bar, matching FP16, that the most demanding production deployments and vLLM require, while still delivering throughput above FP16.
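
A back-of-the-envelope calculation shows how the preset's bit-widths relate to the quoted capacity range. It assumes that `g128` denotes groups of 128 elements and that each group stores one fp16 scale and one fp16 offset (32 bits of metadata); neither detail is confirmed by the text above, and the helper name is hypothetical. The measured figures in this README remain the authoritative ones, since real deployments also pay for workspace and other overheads.

```python
def kv_bits_per_element(key_bits, value_bits, group_size, meta_bits_per_group):
    """Average stored bits per KV element, including per-group metadata."""
    payload = (key_bits + value_bits) / 2          # keys and values are equal in count
    overhead = meta_bits_per_group / group_size    # metadata amortized over the group
    return payload + overhead


FP16_BITS = 16

# 4-bit keys, 2-bit values, ignoring metadata entirely.
ideal = FP16_BITS / kv_bits_per_element(4, 2, 128, 0)

# Same preset with an assumed fp16 scale + fp16 offset per 128-element group.
with_meta = FP16_BITS / kv_bits_per_element(4, 2, 128, 32)

print(f"ideal: {ideal:.2f}x, with assumed metadata: {with_meta:.2f}x")
# prints "ideal: 5.33x, with assumed metadata: 4.92x"
```

Under these assumptions the upper bound lands near the top of the 3-5x range reported above.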


KVarN is the official vLLM implementation of our paper:

📄 KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks (arXiv:2606.03458)

If you use KVarN, please cite:

@misc{muller2026kvarn,
      title={KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks}, 
      author={Lorenz K. Muller and Philippe Bich and Chiara Boretti and Hyun-Min Chang and Jiawei Zhuang and Lukas Cavigelli},
      year={2026},
      eprint={2606.03458},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2606.03458}
}

KVarN is built on vLLM (v0.22.0) and is released under the Apache 2.0 License. The original vLLM README is preserved as README_vLLM.md.
