Speculative KV coding: losslessly compressing KV cache by up to ~4×

Original link: https://fergusfinn.com/blog/kv-entropy-coder/

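As large language models (LLMs) keep growing, the memory demands of the KV cache have become a bottleneck. Lossy compression (such as reducing bit width) can ease the problem, but it risks degrading model quality. "Speculative KV coding" addresses this with a lossless compression method, achieving roughly 4× additional compression on top of existing FP8 methods (about 8× in total).

The core idea is to use a faster "predictor" model to estimate the target model's cache. The encoder and the decoder each run this predictor, producing a statistical model (mean $\mu$ and variance $\sigma^2$) for every cache value. An arithmetic coder then uses these predictions to compress the actual cache values into a bitstream. Because both sides can deterministically reconstruct the same $(\mu, \sigma)$ from the prompt, the original cache can be recovered exactly.

Preliminary results show a significant improvement in bitrate, especially when the method is combined with a pre-quantized FP8 cache. The approach looks promising for bandwidth-constrained settings, such as disaggregated LLM serving across data centers or extending a host-RAM prefix cache. The next stage of development will focus on more sophisticated residual modeling and on cross-model predictors, to further improve the trade-off between compute overhead and memory savings.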

Hacker News discussion: Speculative KV coding: losslessly compressing KV cache by up to ~4× (fergusfinn.com), 10 points, submitted by kkm, 1 comment.

hypfer: If this works, and if I understand it correctly, it would mean a 24GB RTX 4090 could run Qwen3.6-27B (IQ4_NL quantization) while also holding a 256k q8 context. That would be a huge breakthrough, especially given how much spare compute this card has.

Original article

The size of LLM context grows by the day. KV caching is what makes running those long contexts affordable: it trades compute for memory so the model doesn’t re-prefill work it has already done. But as agentic workflows push contexts ever longer, storing and moving the cache starts to dominate everything. To get to the next order of magnitude of LLM capability, we need it to be smaller.

You can make it smaller lossily. TurboQuant is a (somewhat controversial

We explored lossless compression of LLM weights in a recent post. The numbers for KV cache look much the same: empirically, the bytewise entropy of a bf16 cache is about 11 bits per scalar, around 30% smaller than the raw representation. Worse, as we showed then

In this post we introduce Speculative KV coding, a method for losslessly compressing the KV cache of a large target model by up to $\sim 4\times$

A KV cache isn’t really random

Quick refresher on entropy coding

The KV cache isn’t really a list of samples from a random source. The whole cache is the deterministic output of a forward pass through known weights on a known prompt. There’s exactly one tensor it can be. So the “true” distribution $p$ is a delta on that tensor, and a delta has zero entropy. Every bit the coder spends is pure KL: $H(p, q) = -\ln q(KV_\mathrm{true})$

What we need, then, is a calibrated model of the specific forward pass that produced one, in a form an arithmetic coder can consume.
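To make the point about coding cost concrete, here is a minimal sketch of the ideal code length an entropy coder pays for the one outcome that actually occurs. The function name `code_length_bits` and the toy 256-symbol alphabet are assumptions made for illustration and do not come from the original post. The post works in nats ($-\ln q$); the sketch reports bits ($-\log_2 q$), which differs only by a factor of $\ln 2$.

```python
import math


def code_length_bits(q_of_true: float) -> float:
    """Ideal cost in bits of coding the true outcome under model q.

    q_of_true is the probability the model q assigned to the value that
    actually occurred. Because the true distribution is a delta, this is
    the whole cross-entropy: there is no irreducible entropy term.
    """
    if not 0.0 < q_of_true <= 1.0:
        raise ValueError("q_of_true must be in (0, 1]")
    return -math.log2(q_of_true)


# Hypothetical toy alphabet: one cache entry stored as a single byte.
uniform_q = 1.0 / 256  # a model that knows nothing about the entry
sharp_q = 0.5          # a model that puts half its mass on the true value

print(code_length_bits(uniform_q))  # 8.0
print(code_length_bits(sharp_q))    # 1.0
```

The only way to spend fewer bits is for $q$ to place more mass on the tensor the forward pass actually produces, which is why the quality of the predictor determines the compression ratio.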

What should $q$ look like?

Suppose for the moment that we had access to something that, given the prompt, produced a reasonable per-scalar prediction $\mu$ of the KV cache and a calibrated sense $\sigma^2$

$$-\ln q(\text{KV}_\text{full}) \;=\; \underbrace{\tfrac{1}{2}\ln(2\pi\sigma^2)}_{\text{spread cost}} \;+\; \underbrace{\frac{(\text{KV}_\text{full} - \mu)^2}{2\sigma^2}}_{\text{miss cost}}.$$
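The two terms of the Gaussian negative log-density can be evaluated directly for a single scalar. The sketch below is illustrative only: the helper name `gaussian_cost_nats` and the example numbers are assumptions, not values from the post. It works with a continuous density, so the totals are differential costs in nats and can be negative; the actual bit count for a quantized value also depends on the width of the quantization grid, which is not modeled here.

```python
import math


def gaussian_cost_nats(x: float, mu: float, sigma: float) -> tuple[float, float]:
    """Return (spread_cost, miss_cost) of -ln q(x) for q = N(mu, sigma^2)."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    spread = 0.5 * math.log(2.0 * math.pi * sigma**2)  # paid for being uncertain
    miss = (x - mu) ** 2 / (2.0 * sigma**2)            # paid for being wrong
    return spread, miss


# Hypothetical scalar: the predictor is off by 0.1.
x_true, mu = 1.3, 1.2
error = abs(x_true - mu)

# Try an overconfident, a matched, and an underconfident sigma.
for sigma in (error / 2, error, error * 2):
    spread, miss = gaussian_cost_nats(x_true, mu, sigma)
    print(f"sigma={sigma:.3f} spread={spread:+.3f} miss={miss:.3f} "
          f"total={spread + miss:+.3f}")
```

For a fixed error the total is smallest when $\sigma$ equals the size of the error: a smaller $\sigma$ inflates the miss cost, and a larger one inflates the spread cost. This is one way to read the requirement that $\sigma^2$ be calibrated.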