Attention Residuals

Original link: https://github.com/MoonshotAI/Attention-Residuals

## Attention Residuals (AttnRes): Summary

This work introduces Attention Residuals (AttnRes), a replacement for the standard residual connection in Transformers that addresses signal dilution in deep networks. Standard residual connections accumulate layer outputs uniformly, which dilutes each layer's contribution and lets hidden-state magnitudes grow without bound, a problem especially pronounced in PreNorm architectures.

AttnRes uses a learned, input-dependent attention mechanism to selectively aggregate earlier layer representations: each layer computes attention weights over all previous outputs, letting it focus on the most relevant information. A more compute-efficient variant, Block Attention Residuals (Block AttnRes), groups layers into blocks and applies attention only at the block level, reducing memory from O(Ld) to O(Nd).

Experiments show that AttnRes consistently outperforms baseline Transformers across compute budgets. In particular, Block AttnRes matches the performance of a baseline trained with 25% more compute, with notable gains on multi-step reasoning and code generation (e.g., GPQA-Diamond +7.5, HumanEval +3.1). AttnRes also effectively mitigates the PreNorm dilution problem, keeping output magnitudes bounded and gradient norms more uniform across layers.

## Attention Residuals: Discussion Summary

A new technique called Attention Residuals (AttnRes), developed with the participation of a high-school student, aims to improve the efficiency of large language models (LLMs). Traditional LLMs accumulate layer outputs with fixed weights, diluting the contributions of early layers. AttnRes replaces this with a softmax attention mechanism, allowing layers to selectively aggregate past representations with learned weights.

A variant, Block Attention Residuals (Block AttnRes), further reduces memory use by applying attention to block-level representations, capturing most of the gains at minimal overhead. This yields roughly 20% less training compute and substantially lower bandwidth requirements at inference, potentially enabling better performance on consumer hardware.

Discussion centered on the accuracy of the inference-speed and compute-savings claims, with some cautioning against over-interpreting the results. The core idea was praised as intuitive, and its potential to accelerate iteration on model architectures and broaden accessibility was highlighted. Concerns were also raised about the role of privilege and access in enabling such achievements, with comparisons drawn to similar developments in China's AI field.

Paper  |  arXiv  |  Overview  |  Results  |  Citation

(a) Standard residuals with uniform additive accumulation. (b) Full AttnRes: each layer attends over all previous outputs. (c) Block AttnRes: layers are grouped into blocks, reducing memory from O(Ld) to O(Nd).


This is the official repository for Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.

Standard residual connections accumulate all layer outputs with fixed unit weights. As depth grows, this uniform aggregation dilutes each layer's contribution and causes hidden-state magnitudes to grow unboundedly — a well-known problem with PreNorm.
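A toy numeric sketch (not from the paper) illustrates this dilution: summing near-orthogonal, roughly unit-scale layer outputs with fixed unit weights makes the hidden-state norm grow like the square root of depth, so each new layer's relative contribution shrinks as the network deepens.

```python
# Toy illustration of PreNorm dilution: h_l = h_{l-1} + f(norm(h_{l-1})),
# with each layer output modeled as a random near-unit-norm vector.
import torch

torch.manual_seed(0)
d, L = 512, 48
h = torch.randn(d) / d**0.5          # token embedding, ~unit norm
norms = []
for _ in range(L):
    # stand-in for a layer output with roughly unit RMS after normalization
    layer_out = torch.randn(d) / d**0.5
    h = h + layer_out                # fixed unit-weight accumulation
    norms.append(h.norm().item())

# norms grow ~ sqrt(depth), so layer L's output is a shrinking
# fraction of the residual stream it is added to
print(norms[0], norms[-1])
```

With 48 layers the final norm is about `sqrt(49) ≈ 7` times the scale of a single layer output, which is the unbounded-growth behavior AttnRes is designed to avoid.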

AttnRes replaces this fixed accumulation with softmax attention over preceding layer outputs:

$$\mathbf{h}_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot \mathbf{v}_i$$

where the weights $\alpha_{i \to l}$ are computed via a single learned pseudo-query $\mathbf{w}_l \in \mathbb{R}^d$ per layer. This gives every layer selective, content-aware access to all earlier representations.
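A minimal sketch of this computation (function name and key normalization are assumptions, not the repo's API): each score is a dot product between the pseudo-query $\mathbf{w}_l$ and a normalized earlier output, and a softmax over depth produces the weights $\alpha_{i \to l}$.

```python
import torch

def full_attn_res(prev_outputs: list[torch.Tensor], w_l: torch.Tensor) -> torch.Tensor:
    """prev_outputs: l tensors of shape [B, T, D]; w_l: [D] learned pseudo-query."""
    V = torch.stack(prev_outputs)                   # [l, B, T, D] values v_0..v_{l-1}
    # RMS-normalize keys per position (an assumption, mirroring the RMSNorm
    # applied to keys in the block-level pseudocode)
    K = V * torch.rsqrt(V.pow(2).mean(-1, keepdim=True) + 1e-6)
    logits = torch.einsum('d,lbtd->lbt', w_l, K)    # one score per earlier layer
    alpha = logits.softmax(dim=0)                   # alpha_{i->l}: convex weights over depth
    return torch.einsum('lbt,lbtd->btd', alpha, V)  # h_l = sum_i alpha_{i->l} * v_i
```

Because the weights form a convex combination over depth, the output stays on the scale of the individual $\mathbf{v}_i$ rather than growing with $l$, unlike uniform additive accumulation.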

Full AttnRes is straightforward but requires O(Ld) memory at scale. Block AttnRes partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.

PyTorch-style pseudocode:

```python
def block_attn_res(blocks: list[Tensor], partial_block: Tensor, proj: Linear, norm: RMSNorm) -> Tensor:
    """
    Inter-block attention: attend over completed block reps plus the current partial sum.
    blocks:        N tensors of shape [B, T, D], one per completed previous block
    partial_block: [B, T, D], intra-block partial sum (b_n^i)
    """
    V = torch.stack(blocks + [partial_block])  # [N+1, B, T, D]
    K = norm(V)                                # normalized keys
    # one scalar score per (block, batch, token) from the learned pseudo-query
    logits = torch.einsum('d, n b t d -> n b t', proj.weight.squeeze(), K)
    # softmax over the block axis, then mix the unnormalized values
    h = torch.einsum('n b t, n b t d -> b t d', logits.softmax(0), V)
    return h

def forward(self, blocks: list[Tensor], hidden_states: Tensor) -> tuple[list[Tensor], Tensor]:
    partial_block = hidden_states
    # apply block AttnRes before self-attention
    # `blocks` already includes the token embedding
    h = block_attn_res(blocks, partial_block, self.attn_res_proj, self.attn_res_norm)

    # on reaching a block boundary, freeze the current block and start a new one
    # block_size counts ATTN + MLP sublayers; each transformer layer has 2
    if self.layer_number % (self.block_size // 2) == 0:
        blocks.append(partial_block)
        partial_block = None

    # self-attention sublayer
    attn_out = self.attn(self.attn_norm(h))
    partial_block = partial_block + attn_out if partial_block is not None else attn_out

    # apply block AttnRes before the MLP
    h = block_attn_res(blocks, partial_block, self.mlp_res_proj, self.mlp_res_norm)

    # MLP sublayer
    mlp_out = self.mlp(self.mlp_norm(h))
    partial_block = partial_block + mlp_out

    return blocks, partial_block
```

AttnRes consistently outperforms the baseline across all compute budgets. Block AttnRes matches the loss of a baseline trained with 1.25x more compute.

Downstream Performance (Kimi Linear 48B / 3B activated, 1.4T tokens)

| Category    | Benchmark    | Baseline | AttnRes |
|-------------|--------------|----------|---------|
| General     | MMLU         | 73.5     | 74.6    |
| General     | GPQA-Diamond | 36.9     | 44.4    |
| General     | BBH          | 76.3     | 78.0    |
| General     | TriviaQA     | 69.9     | 71.8    |
| Math & Code | Math         | 53.5     | 57.1    |
| Math & Code | HumanEval    | 59.1     | 62.2    |
| Math & Code | MBPP         | 72.0     | 73.9    |
| Chinese     | CMMLU        | 82.0     | 82.9    |
| Chinese     | C-Eval       | 79.6     | 82.5    |

AttnRes improves across the board, with the largest gains on multi-step reasoning (+7.5 on GPQA-Diamond) and code generation (+3.1 on HumanEval).

AttnRes mitigates PreNorm dilution: output magnitudes remain bounded across depth and gradient norms distribute more uniformly across layers.

If you found our work useful, please cite:

```bibtex
@misc{chen2026attnres,
  title         = {Attention Residuals},
  author        = {Kimi Team and Chen, Guangyu and Zhang, Yu and Su, Jianlin and Xu, Weixin and Pan, Siyuan and Wang, Yaoyu and Wang, Yucheng and Chen, Guanduo and Yin, Bohong and Chen, Yutian and Yan, Junjie and Wei, Ming and Zhang, Y. and Meng, Fanqing and Hong, Chao and Xie, Xiaotong and Liu, Shaowei and Lu, Enzhe and Tai, Yunpeng and Chen, Yanru and Men, Xin and Guo, Haiqing and Charles, Y. and Lu, Haoyu and Sui, Lin and Zhu, Jinguo and Zhou, Zaida and He, Weiran and Huang, Weixiao and Xu, Xinran and Wang, Yuzhi and Lai, Guokun and Du, Yulun and Wu, Yuxin and Yang, Zhilin and Zhou, Xinyu},
  year          = {2026},
  archiveprefix = {arXiv},
  eprint        = {2603.15031},
  primaryclass  = {cs.CL}
}
```