Speculative KV coding: losslessly compressing KV cache by up to ~4×

Original link: https://fergusfinn.com/blog/kv-entropy-coder/

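As large language models (LLMs) keep growing, the memory demands of the KV cache have become a bottleneck. Lossy compression (for example, reducing bit-width) can ease the problem, but it risks degrading model quality. Speculative KV coding addresses this with a lossless compression method that achieves roughly 4× additional compression on top of existing FP8 approaches (about 8× in total). The core idea is to use a faster "predictor" model to estimate the target model's cache. The encoder and decoder each run this predictor in parallel, producing a statistical model (mean $\mu$ and variance $\sigma^2$) for every cache value. An arithmetic coder then uses these predictions to compress the actual cache values into a bitstream. Because both sides can deterministically reconstruct the same $(\mu, \sigma)$ from the prompt, the original cache can be recovered exactly. Preliminary results show a significant bitrate improvement, especially when the method is combined with a pre-quantized FP8 cache. The approach looks promising for bandwidth-constrained settings, such as disaggregated LLM serving across data centers or extending host-RAM prefix caches. The next stage of development will focus on more sophisticated residual modeling and on cross-model predictors, to further improve the trade-off between compute overhead and memory savings.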

The size of LLM context grows by the day. KV caching is what makes running those long contexts affordable: it trades compute for memory so the model doesn’t re-prefill work it has already done. But as agentic workflows push contexts ever longer, storing and moving the cache starts to dominate everything. To get to the next order of magnitude of LLM capability, we need it to be smaller.

You can make it smaller lossily. TurboQuant is a somewhat controversial recent example (see the performance exploration and the accusations of academic misconduct), dropping the bit-width of K and V and absorbing the resulting quality loss. The cost of that route is that the loss isn’t something you specify in advance: you find out what degraded by running evals and hoping they catch whatever you killed. Lossless compression sidesteps the question entirely, by reconstructing the cache exactly.

We explored lossless compression of LLM weights in a recent post. The numbers for KV cache look much the same: empirically, the bytewise entropy of a bf16 cache is about 11 bits per scalar, around 30% smaller than the raw representation. Worse, as we showed then, this slack collapses as the bitwidth comes down. (That post was about weight compression, but KV cache is reasonably similar. See here for a nice paper.) FP4 weights are within 5-7% of saturating their format, and the same goes for caches stored at lower precision. Given that low bitwidths have such obvious benefits for performance, we ought to treat them as the baseline.
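
The "bytewise entropy" figure above can be measured with a few lines of NumPy. The sketch below is one plausible reading of that measurement, not the post's own script: it assumes bf16 values are obtained by truncating float32 to its top 16 bits, that the two bytes of each scalar are treated as independent streams whose entropies are added, and it uses synthetic Gaussian data as a stand-in for a real cache. The helper names are hypothetical.

```python
import numpy as np

def to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate float32 to bf16 by keeping the top 16 bits (assumption: no rounding)."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def entropy_bits(symbols: np.ndarray, alphabet: int = 256) -> float:
    """Empirical Shannon entropy of a symbol stream, in bits per symbol."""
    counts = np.bincount(symbols.ravel(), minlength=alphabet)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def bytewise_entropy_bf16(x: np.ndarray) -> float:
    """Sum of the per-byte entropies of a bf16 tensor, in bits per scalar."""
    bits = to_bf16_bits(x)
    hi = (bits >> 8).astype(np.uint8)    # sign bit + top 7 exponent bits
    lo = (bits & 0xFF).astype(np.uint8)  # lowest exponent bit + 7 mantissa bits
    return entropy_bits(hi) + entropy_bits(lo)

# Synthetic stand-in for a cache tensor; a real measurement would load K/V activations.
rng = np.random.default_rng(0)
fake_cache = rng.standard_normal(1_000_000).astype(np.float32)
print(f"{bytewise_entropy_bf16(fake_cache):.2f} bits per bf16 scalar (16 raw)")
```

The number printed for synthetic data is not the post's 11-bit figure; only the method carries over. Most of the savings come from the high byte, where a handful of exponent values dominate, while the mantissa byte is close to incompressible on its own.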

In this post we introduce Speculative KV coding, a method for losslessly compressing the KV cache of a large target model by up to ~ $4\times$

A KV cache isn’t really random

Quick refresher on entropy coding. (I’ve written about the mechanics of arithmetic coding before; see the rANS and tANS posts. For this post all you need is the bitrate formula; the coder is a black box that achieves it.) You have a stream of symbols drawn from some distribution $p$

The KV cache isn’t really a list of samples from a random source. The whole cache is the deterministic output of a forward pass through known weights on a known prompt. There’s exactly one tensor it can be. So the “true” distribution $p$

What we need, then, is a calibrated model of the specific forward pass that produced one, in a form an arithmetic coder can consume.
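
To make "a form an arithmetic coder can consume" concrete: the coder needs, for each position, a probability for every possible symbol, and the ideal cost of the stream is the sum of $-\log_2 q$ over the symbols that actually occur. The sketch below computes that ideal cost for a toy four-symbol alphabet. The function name and the two toy models are illustrative assumptions, and no actual coder is implemented.

```python
import numpy as np

def code_length_bits(symbols: np.ndarray, q: np.ndarray) -> float:
    """Ideal total cost, in bits, of coding `symbols` under model `q`.

    q[i, s] is the probability the model assigns to symbol s at position i.
    A well-implemented arithmetic coder lands close to this sum.
    """
    picked = q[np.arange(len(symbols)), symbols]
    return float(-np.log2(picked).sum())

symbols = np.array([3, 3, 1, 0, 3, 2, 3, 3])

# Model 1: knows nothing, so every symbol costs log2(4) = 2 bits.
uniform = np.full((len(symbols), 4), 0.25)

# Model 2: a per-position model that is nearly certain of the true symbol,
# which is the regime a deterministic forward pass makes possible.
sharp = np.full((len(symbols), 4), 0.01)
sharp[np.arange(len(symbols)), symbols] = 0.97

print(code_length_bits(symbols, uniform))  # 2 bits per symbol
print(code_length_bits(symbols, sharp))    # a small fraction of a bit per symbol
```

The stream is the same in both cases; only the model changes. That is the sense in which the achievable size depends on how well the model tracks the specific forward pass rather than on any intrinsic randomness in the cache.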

What should $q$

Suppose for the moment that we had access to something that, given the prompt, produced a reasonable per-scalar prediction $\mu$

$$
-\ln q(\text{KV}_\text{full}) \;=\; \underbrace{\tfrac{1}{2}\ln(2\pi\sigma^2)}_{\text{spread cost}} \;+\; \underbrace{\frac{(\text{KV}_\text{full} - \mu)^2}{2\sigma^2}}_{\text{miss cost}}.
$$
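
The two terms pull in opposite directions: a small $\sigma$ makes the spread cost cheap but punishes every miss, while a large $\sigma$ forgives misses but pays for the width everywhere. The sketch below evaluates both terms on synthetic data, under the simplifying assumption of a single shared $\sigma$ and a stand-in predictor whose error is Gaussian noise. It works with the continuous density in nats, so totals can be negative; a real coder would first turn the density into probabilities over the discrete stored values.

```python
import numpy as np

def gaussian_cost_nats(x, mu, sigma):
    """Return the (spread, miss) terms of -ln N(x; mu, sigma^2), elementwise."""
    spread = 0.5 * np.log(2 * np.pi * sigma**2)
    miss = (x - mu) ** 2 / (2 * sigma**2)
    return spread, miss

rng = np.random.default_rng(0)
kv_full = rng.standard_normal(100_000)                    # stand-in for target values
mu = kv_full + 0.1 * rng.standard_normal(kv_full.shape)   # stand-in predictor output

rms = float(np.sqrt(np.mean((kv_full - mu) ** 2)))        # RMS prediction error

def mean_cost(sigma):
    """Average (spread, miss) cost per scalar for one shared sigma."""
    spread, miss = gaussian_cost_nats(kv_full, mu, sigma)
    return float(np.mean(spread)), float(np.mean(miss))

for scale in (0.25, 1.0, 4.0):
    spread, miss = mean_cost(scale * rms)
    print(f"sigma = {scale:>4} x rms: spread {spread:+.3f}, miss {miss:.3f}, "
          f"total {spread + miss:+.3f} nats")
```

With one shared $\sigma$, setting the derivative of the average cost to zero gives $\sigma^2$ equal to the mean squared residual, which is why the middle row has the lowest total and a miss cost of exactly one half.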

target	$\mathcal{N}(\mu,\sigma)$ (b/bf16-elem)	mixture (b/bf16-elem)	vs raw bf16 (16 b)
0.6B	6.8651	6.7419	2.37×
1.7B	6.6385	6.5303	2.45×
4B	6.4188	6.3290	2.53×
8B	6.2576	6.1793	2.59×
14B	6.0760	6.0097	2.66×
32B	5.9785	5.9185	2.70×

target	b/FP8-elem	vs raw FP8 (8 b)
0.6B	2.594	3.08×
1.7B	2.454	3.26×
4B	2.323	3.44×
8B	2.220	3.60×
14B	2.109	3.79×
32B	2.053	3.90×
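
The ratio columns in both tables are simply the raw bit-width divided by the coded bitrate. The check below reproduces them from the reported bitrates. It assumes the first table's ratio is taken against 16-bit bf16 and uses the mixture column, which is what the reported numbers match; the dictionary and function names are illustrative.

```python
# Coded bitrates copied from the two tables above.
bf16_mixture_bits = {"0.6B": 6.7419, "1.7B": 6.5303, "4B": 6.3290,
                     "8B": 6.1793, "14B": 6.0097, "32B": 5.9185}
fp8_bits = {"0.6B": 2.594, "1.7B": 2.454, "4B": 2.323,
            "8B": 2.220, "14B": 2.109, "32B": 2.053}

def ratio(raw_bits: float, coded_bits: float) -> float:
    """Compression ratio of a coded stream against its raw storage format."""
    return raw_bits / coded_bits

for target in fp8_bits:
    r16 = ratio(16, bf16_mixture_bits[target])  # coded bf16 vs raw bf16
    r8 = ratio(8, fp8_bits[target])             # coded FP8 vs raw FP8
    print(f"{target:>5}: bf16 {r16:.2f}x, FP8 {r8:.2f}x")
```

Measured against a 16-bit baseline instead, the 32B FP8 row works out to 16 / 2.053, a little under 8×, with half of that factor coming from the FP8 quantization itself rather than from the coder.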