Do Transformers Need Three Projections? Systematic Study of QKV Variants

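This paper asks whether the standard three-projection (query, key, value) attention mechanism in Transformers is strictly necessary. By systematically testing three projection-sharing constraints (Q-K=V, Q=K-V, and Q=K=V), the authors show that reducing the number of projections yields performance on par with, and occasionally better than, the standard QKV Transformer. They find the **Q-K=V (shared key-value)** variant particularly effective: in language modeling it cuts the KV cache by 50% at a cost of only 3.1% in perplexity. The approach is also complementary to existing head-sharing techniques such as grouped-query attention (GQA) and multi-query attention (MQA). Combined with them, it reduces KV cache memory by up to 96.9%, which substantially lowers the memory cost of on-device inference. The study concludes that keys and values can occupy similar representational spaces, so their weights can be tied with only a small loss in quality, whereas sharing the query and key projections (Q=K-V) breaks attention directionality. By showing that strong models can run with fewer projections, the work offers a practical path to deploying memory-efficient Transformers on edge devices.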
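
The cache figures quoted for this paper (50%, 87.5% with GQA-4, and 96.9% with MQA) can be reproduced with simple arithmetic. The sketch below is an illustration, not the authors' code. It assumes a baseline of 16 attention heads and reads "GQA-4" as 4 cached key-value heads; neither detail is stated in the abstract, and both were chosen because they make the reported numbers come out. The function name `kv_cache_fraction` is hypothetical.

```python
def kv_cache_fraction(num_heads: int, num_kv_heads: int, shared_kv: bool) -> float:
    """Return the fraction of the baseline KV cache that remains.

    Baseline: every head caches a separate key and value tensor per token.
    """
    # Head sharing (GQA/MQA) caches fewer heads than the model has query heads.
    head_factor = num_kv_heads / num_heads
    # Projection sharing (Q-K=V) caches one tensor per token instead of two.
    tensor_factor = 0.5 if shared_kv else 1.0
    return head_factor * tensor_factor


NUM_HEADS = 16  # assumption: not stated in the abstract

# (label, number of cached KV heads); GQA-4 is read as 4 KV heads (assumption).
configs = [("Q-K=V", 16), ("Q-K=V + GQA-4", 4), ("Q-K=V + MQA", 1)]

for label, kv_heads in configs:
    remaining = kv_cache_fraction(NUM_HEADS, kv_heads, shared_kv=True)
    print(f"{label}: {100 * (1 - remaining):.3f}% cache reduction")
```

Under these assumptions the two savings multiply: head sharing shrinks the number of cached heads, and projection sharing halves what each cached head stores. With 16 heads, MQA plus Q-K=V leaves 1/32 of the baseline cache, which is a 96.875% reduction and rounds to the reported 96.9%.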

Hacker News: Do Transformers Need Three Projections? Systematic Study of QKV Variants (arxiv.org), 16 points, posted by Anon84 17 minutes ago, 1 comment

xiaoyu2006, 0 minutes ago: It would be wonderful, and funny, if it turned out that we had been overcomplicating Transformers all along. The code repository hasn't been released yet, though...

Original

[Submitted on 1 Jun 2026]

Do Transformers Need Three Projections? Systematic Study of QKV Variants, by Ali Kayyam and 2 other authors

Abstract: Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at this https URL
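
To make the three sharing constraints concrete, here is a minimal single-head NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the weights are random, there is no causal mask, no multi-head split, and no positional encoding, and the helper names `softmax` and `attention` are hypothetical. It checks one claim from the abstract, namely that tying the query and key projections makes the pre-softmax score matrix symmetric.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention; returns (scores, output)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # pre-softmax attention scores
    return scores, softmax(scores) @ v


rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8  # toy sizes, chosen arbitrarily
x = rng.standard_normal((n_tokens, d_model))
w_a, w_b, w_c = (rng.standard_normal((d_model, d_model)) for _ in range(3))

# Each variant is expressed by reusing the same weight matrix for tied roles.
variants = {
    "QKV":   (w_a, w_b, w_c),  # three independent projections
    "Q-K=V": (w_a, w_b, w_b),  # key and value share one projection
    "Q=K-V": (w_a, w_a, w_c),  # query and key share one projection
    "Q=K=V": (w_a, w_a, w_a),  # a single projection for all three
}

symmetric = {}
for name, (w_q, w_k, w_v) in variants.items():
    scores, out = attention(x, w_q, w_k, w_v)
    symmetric[name] = bool(np.allclose(scores, scores.T))
    print(f"{name}: symmetric scores = {symmetric[name]}, output shape = {out.shape}")
```

When Q and K come from the same projection, the score between token i and token j equals the score between j and i, so the score matrix is symmetric for Q=K-V and Q=K=V. Q-K=V keeps separate query and key projections, so its scores stay asymmetric, while the key tensor can double as the value tensor and only one of the two needs to be cached.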

From: Anusha Madan Gopal
[v1] Mon, 1 Jun 2026 20:59:05 UTC (2,017 KB)
