Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

Original link: https://pythongiant.github.io/KVBoost/

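```python
from kvboost import KVBoost

# Load the model
engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")

# Warm the shared prefix (only needs to run once)
engine.warm("You are a helpful assistant...")

# Placeholder (not in the original snippet): supply your own prompt text
prompt = "..."

# All subsequent calls can reuse the cache
result = engine.generate(prompt)

# Print the KV reuse ratio
print(result.kv_reuse_ratio)  # ✓ 80%+
```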
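To make the idea of chunk-level reuse and a reuse ratio concrete, here is a toy sketch in plain Python. It is not KVBoost's implementation: `ChunkCache`, `chunk_key`, `CHUNK_SIZE`, and the content-only hashing scheme are all hypothetical names and choices made for illustration. It also ignores the fact that real KV entries depend on token position and on the preceding context, which a real engine has to handle to keep outputs correct.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical chunk length in tokens, kept tiny for the demo


def chunk_key(tokens: List[int]) -> str:
    """Hash a chunk's token ids so identical chunks map to the same key."""
    data = ",".join(map(str, tokens)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class ChunkCache:
    """Toy stand-in for a KV store: maps chunk hash -> placeholder payload."""

    def __init__(self, chunk_size: int = CHUNK_SIZE) -> None:
        self.chunk_size = chunk_size
        self.store: Dict[str, str] = {}

    def _full_chunks(self, tokens: List[int]) -> List[List[int]]:
        # Only complete chunks are cacheable; a trailing partial chunk is skipped.
        n = len(tokens) - len(tokens) % self.chunk_size
        return [tokens[i:i + self.chunk_size] for i in range(0, n, self.chunk_size)]

    def warm(self, tokens: List[int]) -> None:
        """Record every full chunk of a shared prefix as cached."""
        for chunk in self._full_chunks(tokens):
            self.store[chunk_key(chunk)] = "kv-placeholder"

    def lookup(self, tokens: List[int]) -> Tuple[int, float]:
        """Return (reused_tokens, reuse_ratio) for a prompt."""
        if not tokens:
            return 0, 0.0
        reused = sum(
            len(chunk)
            for chunk in self._full_chunks(tokens)
            if chunk_key(chunk) in self.store
        )
        return reused, reused / len(tokens)


cache = ChunkCache()
system_tokens = list(range(8))            # 8 tokens = 2 full chunks
cache.warm(system_tokens)
prompt_tokens = system_tokens + [100, 101]  # shared prefix + 2 new tokens
reused, ratio = cache.lookup(prompt_tokens)
print(reused, ratio)  # 8 0.8
```

Only the two new tokens would need fresh computation in this toy setup, which is the intuition behind a high reuse ratio when many calls share the same prefix.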

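**KVBoost** is a new Python library that speeds up HuggingFace model inference through chunk-level KV cache reuse. By optimizing how shared context is processed, it delivers large gains: in an 8-turn conversation, for example, time to first token (TTFT) is up to 47.9x faster than the baseline. Key features:

* **Efficiency:** Supports int8/int4 KV quantization, cutting memory use by 2–4x, and offers disk-backed storage.
* **Compatibility:** Works with 11 mainstream model architectures, including Llama, Qwen, Gemma, and Mistral.
* **Performance:** By reusing interior chunks, it outperforms existing tools such as vLLM and MLX, which often fail to hit the cache effectively.

Under greedy decoding, KVBoost's output matches the baseline model's word for word. The tool is best suited to models with 3B or more parameters and at least 500 tokens of shared context. The project is available via `pip install kvboost` and is hosted on GitHub.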
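The 2–4x memory figure quoted for int8/int4 quantization is consistent with simple arithmetic if the baseline cache is stored in 16-bit floats, which is an assumption here since the source does not state the baseline precision. The sketch below uses a hypothetical helper, `kv_cache_bytes`, and a made-up model shape; it ignores the small overhead of quantization scales and zero points, so real savings would be slightly lower.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bits_per_value: int,
) -> int:
    """Approximate KV cache size for one sequence (keys + values, all layers)."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = K and V
    return values * bits_per_value // 8


# Hypothetical model shape, chosen only for illustration
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
int8 = kv_cache_bytes(**shape, bits_per_value=8)
int4 = kv_cache_bytes(**shape, bits_per_value=4)

print(fp16 // 2**20, int8 // 2**20, int4 // 2**20)  # 512 256 128 (MiB)
print(fp16 / int8, fp16 / int4)                     # 2.0 4.0
```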

