Every transformer you’ve ever used has the same residual connection design from 2016.
GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: one stream of information flowing through the network, with each layer adding to it.
DeepSeek asked: what if it were wider?
The Setup
Standard residual connections are the backbone of every modern transformer. The idea is simple:
x_out = x + F(x). The input flows through unchanged, plus the layer’s output. One stream of information. What goes in comes out, plus a learned update. This is why transformers can be hundreds of layers deep: the gradient has a clean path backward. Simple. Stable. Unchanged since 2016.
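As a reference point, here is that one-stream update as a minimal PyTorch sketch. It’s a generic pre-norm residual block, not the implementation of any particular model; `layer` stands in for an attention or MLP sublayer.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual update: output = x + F(x)."""

    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # pre-norm, as in most modern transformers
        self.layer = layer             # attention or MLP sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input passes through untouched; the layer only contributes an additive update.
        return x + self.layer(self.norm(x))
```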
Hyper-Connections take a different approach. Instead of one stream, the residual is expanded into n parallel streams with learnable mixing matrices.
Compared to a standard residual connection, three matrices control how information flows:
- H_res: How streams mix in the residual path
- H_pre: How streams combine before entering the layer
- H_post: How the layer’s output distributes back to streams
More expressive. More parameters with negligible computational overhead. Better performance, in theory.
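To make the three matrices concrete, here is a schematic HC block in PyTorch. This is my own simplified, static sketch rather than the paper’s code: the streams tensor is assumed to have shape [batch, n_streams, seq, dim], and the three mixers are plain learned parameters.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Schematic, *unconstrained* hyper-connection over n parallel residual streams."""

    def __init__(self, layer: nn.Module, n_streams: int):
        super().__init__()
        self.layer = layer
        # Raw, unconstrained mixing weights -- this is the version that can amplify.
        self.H_res = nn.Parameter(torch.eye(n_streams))                      # stream-to-stream mixing
        self.H_pre = nn.Parameter(torch.full((n_streams,), 1 / n_streams))   # combine streams -> layer input
        self.H_post = nn.Parameter(torch.ones(n_streams))                    # distribute layer output -> streams

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: [batch, n_streams, seq, dim]
        layer_in = torch.einsum("n,bnsd->bsd", self.H_pre, streams)           # H_pre: weighted combine
        layer_out = self.layer(layer_in)                                       # ordinary transformer sublayer
        mixed = torch.einsum("mn,bnsd->bmsd", self.H_res, streams)            # H_res: re-route the streams
        return mixed + torch.einsum("m,bsd->bmsd", self.H_post, layer_out)    # H_post: scatter back out
```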
The problem? Those mixing matrices are unconstrained. They can amplify signals, not just route them.
The Explosion
Under aggressive learning rates, Hyper-Connection (HC) signal amplification in my reproduction hit 7x before eventually collapsing. The metric here is Amax, the maximum of a matrix’s row and column absolute sums: it measures how much a mixing matrix can amplify signals passing through it.
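Based on that definition, here is a small sketch of how the Amax gain of a mixing matrix can be computed (the function name is mine):

```python
import torch

def amax_gain(H: torch.Tensor) -> float:
    """Amax: the maximum of a matrix's row and column absolute sums.

    Row/column absolute sums bound how much a single matrix multiply can
    amplify a signal (they are the induced inf-norm and 1-norm).
    """
    row_sums = H.abs().sum(dim=1)  # worst-case amplification per output stream
    col_sums = H.abs().sum(dim=0)  # worst-case total contribution of each input stream
    return torch.max(torch.cat([row_sums, col_sums])).item()

print(amax_gain(torch.full((4, 4), 0.25)))  # 1.0 -- a doubly stochastic mixer
print(amax_gain(3.0 * torch.eye(4)))        # 3.0 -- an amplifying mixer
```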
At my 10M parameter scale, this is survivable. But DeepSeek saw this at 27B:
“The Amax Gain Magnitude yields extreme values with peaks of 3000”
That’s not a typo. Three thousand times amplification. At 27B parameters, unconstrained HC didn’t just drift. It exploded. The 9.2x peak in my 10M-parameter depth sweep is the early warning sign of the same exponential failure.
This is why unconstrained mixing matrices break at scale: small amplifications compound exponentially. A gain of just 1.2x per layer blows past 3000x after roughly 44 layers.
The Fix: Constrain the Manifold
DeepSeek’s fix is clean: constrain the mixing matrices to be doubly stochastic.
A doubly stochastic matrix has:
- All non-negative entries
- Rows sum to 1
- Columns sum to 1
This means the mixing operation can only take weighted averages of streams. It can route information, shuffle it, blend it. But it cannot amplify.
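A quick way to convince yourself of the “cannot amplify” claim: a doubly stochastic mix takes convex combinations of the streams, so the largest magnitude can never grow. A toy check (my own example, not from the paper):

```python
import torch

# Toy doubly stochastic mixer over 4 streams: non-negative, rows and columns sum to 1.
H = torch.tensor([[0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.5, 0.5]])

streams = torch.randn(4, 8)   # 4 streams, 8 features each
mixed = H @ streams           # each output stream is a weighted average of the inputs

# Weighted averages cannot exceed the largest input magnitude (tiny tolerance for float rounding).
assert mixed.abs().max() <= streams.abs().max() + 1e-6
```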
How? The Sinkhorn-Knopp algorithm.
The algorithm is dead simple:
1. Start with any matrix (the raw learned weights)
2. Exponentiate to make all entries positive (elementwise exp of the raw weights)
3. Normalize rows so each row sums to 1
4. Normalize columns so each column sums to 1
5. Repeat steps 3-4 until convergence
That’s it. Alternate row and column normalization. Twenty iterations is enough.
This procedure is differentiable. Gradients flow back through all twenty iterations. The network learns the raw, unconstrained weights, and Sinkhorn ensures the actual mixing matrix is always doubly stochastic.
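Here is a minimal, differentiable Sinkhorn projection in PyTorch that follows the recipe above. The function name and the default of 20 iterations are my choices for this sketch, not necessarily the paper’s exact implementation.

```python
import torch

def sinkhorn(raw_weights: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Project raw learned weights onto an (approximately) doubly stochastic matrix.

    Every operation is differentiable, so gradients flow back to raw_weights.
    """
    M = torch.exp(raw_weights)                  # step 2: make all entries positive
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)      # step 3: rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)      # step 4: columns sum to 1
    return M

H_res = sinkhorn(torch.randn(4, 4, requires_grad=True))
print(H_res.sum(dim=0), H_res.sum(dim=1))       # both ~[1, 1, 1, 1]
```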
When I first saw this, it felt like cheating. You’re not learning stability. You’re forcing it. But some properties shouldn’t be learned; they should be guaranteed.
Technical note: Strictly speaking, only the recursive matrix H_res needs the full Sinkhorn doubly-stochastic treatment. It’s the one compounding errors layer-over-layer. The input/output mixers (H_pre, H_post) are just bounded via sigmoid. The Sinkhorn compute cost is paid only where it matters most.
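In code, my reading of that split looks like this, reusing the `sinkhorn` helper from the sketch above (names are mine, not the paper’s):

```python
import torch

def constrained_mixers(raw_res, raw_pre, raw_post):
    # H_res compounds layer over layer, so it gets the full doubly stochastic projection.
    H_res = sinkhorn(raw_res)           # sinkhorn() from the previous sketch
    # H_pre and H_post touch the signal only once per layer; bounding them to (0, 1) is enough.
    H_pre = torch.sigmoid(raw_pre)
    H_post = torch.sigmoid(raw_post)
    return H_res, H_pre, H_post
```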
The Results
Seed Variation Results (Depth 24, 3 seeds)
| Model | Val Loss (mean ± std) | Max Amax (mean ± std) |
|---|---|---|
| HC | 0.884 ± 0.033 | 6.77 ± 0.60 |
| mHC | 1.116 ± 0.012 | 1.00 ± 0.00 |
HC wins on raw performance: 0.88 vs 1.12 validation loss. At 10M parameters, the mHC constraint acts like a stability tax; you pay in expressivity. But at 27B parameters, that tax is the only thing preventing your model from exploding to NaN.
But look at the variance. HC’s loss varies nearly 3x more across seeds (±0.033 vs ±0.012). And Amax? HC swings from 6.1 to 7.6 depending on the seed. mHC is 1.00. Every seed. Every run. Zero variance.
At 10M parameters, the instability is survivable. HC still wins. But at 27B parameters, that 6-7x amplification becomes 3000x. You can’t gamble at that scale.
Depth Scaling
I also swept depths from 6 to 24 layers (constant ~11M parameter budget):
- Loss improves with depth, until it doesn’t. Depth 20 hit the sweet spot (0.85 val loss). Depth 24 regressed slightly (0.93) due to the width bottleneck from shrinking dim to 192.
- Amax is unpredictable. Depth 20 spiked to 9.2x. Depth 12 hit 6.6x. Depth 8 stayed at 4.3x. There’s no clean relationship; HC is chaotic.
Experiment Details
- Dataset: TinyShakespeare (~1M chars, character-level)
- Model: GPT-2 architecture, ~10M parameters
- Training: 5000 steps, AdamW (β1=0.9, β2=0.95), weight decay 0.1, cosine LR decay
- Hardware: Apple M-series (MPS)
- Depth sweep: 8 configurations (6-24 layers), width adjusted to maintain ~11M params
- Seed variation: 3 seeds (42, 123, 456) at depth 24
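For completeness, a hedged sketch of the optimizer setup those numbers describe; the peak learning rate is a placeholder, since it isn’t listed above, and `model` stands in for the actual network.

```python
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the ~10M-param GPT-2-style model
total_steps = 5000              # as in the runs above

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                    # placeholder value, not stated in the post
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```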
Why This Matters
Residual connections are more than a trick to help gradients flow. They’re a conservation law.
In physics, conservation laws constrain what’s possible but enable prediction. You can’t build a perpetual motion machine, but you can calculate exactly where a ball will land.
The identity mapping in residual connections is similar. It constrains the network by preventing arbitrary transformations, but it guarantees stability. Signal magnitude is preserved.
HC breaks conservation; mHC restores it, not by returning to identity, but by finding a richer manifold that still conserves signal.
In 2016, He et al. introduced ResNets to solve the vanishing gradient problem, ensuring signals didn’t die. Ten years later, the opposite problem emerged: exploding signals from hyper-connectivity. The identity mapping solved the first by being passive. mHC solves the second by enforcing conservation.
Every residual connection is a conservation law. mHC enforces it.
Not a hack, not a trick. A principled constraint that makes the architecture work at scale.
Takeaways
- The stream persistence bug humbled me. My first implementation looked right. The equations matched the paper. The code ran. But I was projecting the output back to a single stream and re-expanding it at each layer, killing the parallel architecture. The “hyper” part of Hyper-Connections wasn’t actually doing anything. Three separate audits said “looks correct.” The bug was architectural, not mathematical. I only caught it by asking: “Wait, what shape is actually flowing between layers?” (See the shape-check sketch after this list.)
- Constraints aren’t limitations; they’re guarantees. The doubly stochastic projection forces stability. You’re not learning good behavior. You’re making bad behavior impossible. My first reaction: “That’s not elegant. That’s a straitjacket.” Then I saw HC hit 7x amplification. Oh. That’s the point.
- The boring choice scales. Standard residual connections have survived since 2016 not because they’re optimal, but because they’re stable. HC is more expressive but fragile. mHC finds a middle ground: more expressive than standard residuals, with stability guarantees.
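The shape check referenced in the first takeaway, as a sketch. The stream count of 4 is an assumption for illustration; the point is only that the tensor handed to the next layer must still carry the stream dimension.

```python
import torch

N_STREAMS = 4   # assumed expansion rate, for illustration only

def check_stream_persistence(x: torch.Tensor) -> torch.Tensor:
    """The tensor flowing *between* layers must keep all parallel streams.

    My buggy version collapsed to [batch, seq, dim] after every layer and
    re-expanded on entry, so the streams never actually persisted.
    """
    assert x.dim() == 4 and x.shape[1] == N_STREAMS, (
        f"expected [batch, {N_STREAMS}, seq, dim], got {tuple(x.shape)}"
    )
    return x

check_stream_persistence(torch.randn(2, N_STREAMS, 128, 192))  # correct inter-layer shape
```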
What’s Next
This is Part 1 of a two-part series.
Part 1 (this post): Reproduce mHC at small scale to understand the mechanics.
- 10M parameters, TinyShakespeare dataset
- Constant parameter budget across depths
- Goal: Validate the core claim: HC explodes, mHC doesn’t
Part 2 (Thursday): Scale up to see real instability.
- 1B parameters on A100s
- C4 dataset, fixed width (no bottleneck)
- Goal: Push toward the 3000x Amax regime
At 10M params, HC peaked at 9.2x amplification, chaotic but survivable. The paper saw 3000x at 27B. Part 2 will show where things break.
Resources
Paper: Manifold-Constrained Hyper-Connections (arXiv 2512.24880)
Related: Deep Residual Learning (He et al., 2016)
Code: Coming with Part 2.
Part 2 comes Thursday. Follow @TayKolasinski to catch it.