Every transformer you’ve ever used has the same residual connection design from 2016.
GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: one stream of information flowing through the network, with each layer adding to it.
DeepSeek asked: what if it were wider?
The Setup
Standard residual connections are the backbone of every modern transformer. The idea is simple:
x_out = x + F(x). The input flows through unchanged, plus the layer’s output. One stream of information. What goes in comes out, plus a learned update. This is why transformers can be hundreds of layers deep: the gradient has a clean path backward. Simple. Stable. Unchanged since 2016.
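As a reference point, here is that one-stream update as a minimal PyTorch sketch. It’s a generic pre-norm residual block, not the implementation of any particular model; `layer` stands in for an attention or MLP sublayer.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual update: output = x + F(x)."""

    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # pre-norm, as in most modern transformers
        self.layer = layer             # attention or MLP sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input passes through untouched; the layer only contributes an additive update.
        return x + self.layer(self.norm(x))
```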
Hyper-Connections take a different approach. Instead of one stream, the residual is expanded into n parallel streams with learnable mixing matrices.
Compared to a standard residual connection, three matrices control how information flows:
- H_res: How streams mix in the residual path
- H_pre: How streams combine before entering the layer
- H_post: How the layer’s output distributes back to streams
More expressive. More parameters with negligible computational overhead. Better performance, in theory.
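To make the three matrices concrete, here is a schematic HC block in PyTorch. This is my own simplified, static sketch rather than the paper’s code: the streams tensor is assumed to have shape [batch, n_streams, seq, dim], and the three mixers are plain learned parameters.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Schematic, *unconstrained* hyper-connection over n parallel residual streams."""

    def __init__(self, layer: nn.Module, n_streams: int):
        super().__init__()
        self.layer = layer
        # Raw, unconstrained mixing weights -- this is the version that can amplify.
        self.H_res = nn.Parameter(torch.eye(n_streams))                      # stream-to-stream mixing
        self.H_pre = nn.Parameter(torch.full((n_streams,), 1 / n_streams))   # combine streams -> layer input
        self.H_post = nn.Parameter(torch.ones(n_streams))                    # distribute layer output -> streams

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: [batch, n_streams, seq, dim]
        layer_in = torch.einsum("n,bnsd->bsd", self.H_pre, streams)           # H_pre: weighted combine
        layer_out = self.layer(layer_in)                                       # ordinary transformer sublayer
        mixed = torch.einsum("mn,bnsd->bmsd", self.H_res, streams)            # H_res: re-route the streams
        return mixed + torch.einsum("m,bsd->bmsd", self.H_post, layer_out)    # H_post: scatter back out
```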
The problem? Those mixing matrices are unconstrained. They can amplify signals, not just route them.
The Explosion
Under aggressive learning rates, Hyper-Connection (HC) signal amplification in my reproduction hit 7x before eventually collapsing. The metric here is Amax, the maximum of a matrix’s row and column absolute sums: it measures how much a mixing matrix can amplify signals passing through it.
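Based on that definition, here is a small sketch of how the Amax gain of a mixing matrix can be computed (the function name is mine):

```python
import torch

def amax_gain(H: torch.Tensor) -> float:
    """Amax: the maximum of a matrix's row and column absolute sums.

    Row/column absolute sums bound how much a single matrix multiply can
    amplify a signal (they are the induced inf-norm and 1-norm).
    """
    row_sums = H.abs().sum(dim=1)  # worst-case amplification per output stream
    col_sums = H.abs().sum(dim=0)  # worst-case total contribution of each input stream
    return torch.max(torch.cat([row_sums, col_sums])).item()

print(amax_gain(torch.full((4, 4), 0.25)))  # 1.0 -- a doubly stochastic mixer
print(amax_gain(3.0 * torch.eye(4)))        # 3.0 -- an amplifying mixer
```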
At my 10M parameter scale, this is survivable. But DeepSeek saw this at 27B:
“The Amax Gain Magnitude yields extreme values with peaks of 3000”
That’s not a typo. Three thousand times amplification. At 27B parameters, unconstrained HC didn’t just drift. It exploded. The 9.2x peak in my 10M-parameter depth sweep is the early warning sign of the same exponential failure.
This is why unconstrained mixing matrices break at scale: small amplifications compound exponentially. A gain of just 1.2x per layer blows past 3000x after roughly 44 layers.
The Fix: Constrain the Manifold
DeepSeek’s fix is clean: constrain the mixing matrices to be doubly stochastic.
A doubly stochastic matrix has:
- All non-negative entries
- Rows sum to 1
- Columns sum to 1
This means the mixing operation can only take weighted averages of streams. It can route information, shuffle it, blend it. But it cannot amplify.
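A quick way to convince yourself of the “cannot amplify” claim: a doubly stochastic mix takes convex combinations of the streams, so the largest magnitude can never grow. A toy check (my own example, not from the paper):

```python
import torch

# Toy doubly stochastic mixer over 4 streams: non-negative, rows and columns sum to 1.
H = torch.tensor([[0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.5, 0.5]])

streams = torch.randn(4, 8)   # 4 streams, 8 features each
mixed = H @ streams           # each output stream is a weighted average of the inputs

# Weighted averages cannot exceed the largest input magnitude (tiny tolerance for float rounding).
assert mixed.abs().max() <= streams.abs().max() + 1e-6
```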
How? The Sinkhorn-Knopp algorithm.
The algorithm is dead simple:
1. Start with any matrix (the raw learned weights)
2. Exponentiate to make all entries positive (elementwise exp of the raw weights)
3. Normalize rows so each row sums to 1
4. Normalize columns so each column sums to 1
5. Repeat steps 3-4 until convergence
That’s it. Alternate row and column normalization. Twenty iterations is enough.
This procedure is differentiable. Gradients flow back through all twenty iterations. The network learns the raw, unconstrained weights, and Sinkhorn ensures the actual mixing matrix is always doubly stochastic.
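Here is a minimal, differentiable Sinkhorn projection in PyTorch that follows the recipe above. The function name and the default of 20 iterations are my choices for this sketch, not necessarily the paper’s exact implementation.

```python
import torch

def sinkhorn(raw_weights: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Project raw learned weights onto an (approximately) doubly stochastic matrix.

    Every operation is differentiable, so gradients flow back to raw_weights.
    """
    M = torch.exp(raw_weights)                  # step 2: make all entries positive
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)      # step 3: rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)      # step 4: columns sum to 1
    return M

H_res = sinkhorn(torch.randn(4, 4, requires_grad=True))
print(H_res.sum(dim=0), H_res.sum(dim=1))       # both ~[1, 1, 1, 1]
```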
When I first saw this, it felt like cheating. You’re not learning stability. You’re forcing it. But some properties shouldn’t be learned; they should be guaranteed.
Technical note: Strictly speaking, only the recursive matrix H_res needs the full Sinkhorn doubly-stochastic treatment. It’s the one compounding errors layer-over-layer. The input/output mixers (H_pre, H_post) are just bounded via sigmoid. The Sinkhorn compute cost is paid only where it matters most.
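In code, my reading of that split looks like this, reusing the `sinkhorn` helper from the sketch above (names are mine, not the paper’s):

```python
import torch

def constrained_mixers(raw_res, raw_pre, raw_post):
    # H_res compounds layer over layer, so it gets the full doubly stochastic projection.
    H_res = sinkhorn(raw_res)           # sinkhorn() from the previous sketch
    # H_pre and H_post touch the signal only once per layer; bounding them to (0, 1) is enough.
    H_pre = torch.sigmoid(raw_pre)
    H_post = torch.sigmoid(raw_post)
    return H_res, H_pre, H_post
```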
The Results
Seed Variation Results (Depth 24, 3 seeds)
| Model | Val Loss (mean ± std) | Max Amax (mean ± std) |
|---|---|---|
| HC | 0.884 ± 0.033 | 6.77 ± 0.60 |
| mHC | 1.116 ± 0.012 | 1.00 ± 0.00 |
HC wins on raw performance: 0.88 vs 1.12 validation loss. At 10M parameters, the mHC constraint acts like a stability tax; you pay in expressivity. But at 27B parameters, that tax is the only thing preventing your model from exploding to NaN.
But look at the variance. HC’s loss varies nearly 3x more across seeds (±0.033 vs ±0.012). And Amax? HC swings from 6.1 to 7.6 depending on the seed. mHC is 1.00. Every seed. Every run. Zero variance.
At 10M parameters, the instability is survivable. HC still wins. But at 27B parameters, that 6-7x amplification becomes 3000x. You can’t gamble at that scale.
Depth Scaling
I also swept depths from 6 to 24 layers (constant ~11M parameter budget):
- Loss improves with depth, until it doesn’t. Depth 20 hit the sweet spot (0.85 val loss). Depth 24 regressed slightly (0.93) due to the width bottleneck from shrinking dim to 192.
- Amax is unpredictable. Depth 20 spiked to 9.2x. Depth 12 hit 6.6x. Depth 8 stayed at 4.3x. There’s no clean relationship; HC is chaotic.
Experiment Details
- Dataset: TinyShakespeare (~1M chars, character-level)
- Model: GPT-2 architecture, ~10M parameters
- Training: 5000 steps, AdamW (β1=0.9, β2=0.95), weight decay 0.1, cosine LR decay
- Hardware: Apple M-series (MPS)
- Depth sweep: 8 configurations (6-24 layers), width adjusted to maintain ~11M params
- Seed variation: 3 seeds (42, 123, 456) at depth 24
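For completeness, a hedged sketch of the optimizer setup those numbers describe; the peak learning rate is a placeholder, since it isn’t listed above, and `model` stands in for the actual network.

```python
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the ~10M-param GPT-2-style model
total_steps = 5000              # as in the runs above

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                    # placeholder value, not stated in the post
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```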
Why This Matters
Residual connections are more than a trick to help gradients flow. They’re a conservation law.
In physics, conservation laws constrain what’s possible but enable prediction. You can’t build a perpetual motion machine, but you can calculate exactly where a ball will land.
The identity mapping in residual connections is similar. It constrains the network by preventing arbitrary transformations, but it guarantees stability. Signal magnitude is preserved.
HC breaks conservation; mHC restores it, not by returning to identity, but by finding a richer manifold that still conserves signal.
In 2016, He et al. introduced ResNets to solve the vanishing gradient problem, ensuring signals didn’t die. Ten years later, the opposite problem emerged: exploding signals from hyper-connectivity. The identity mapping solved the first by being passive. mHC solves the second by enforcing conservation.
Every residual connection is a conservation law. mHC enforces it.
Not a hack, not a trick. A principled constraint that makes the architecture work at scale.
Takeaways
- The stream persistence bug humbled me. My first implementation looked right. The equations matched the paper. The code ran. But I was projecting the output back to a single stream and re-expanding it at each layer, killing the parallel architecture. The “hyper” part of Hyper-Connections wasn’t actually doing anything. Three separate audits said “looks correct.” The bug was architectural, not mathematical. I only caught it by asking: “Wait, what shape is actually flowing between layers?” (See the shape-check sketch after this list.)
- Constraints aren’t limitations; they’re guarantees. The doubly stochastic projection forces stability. You’re not learning good behavior. You’re making bad behavior impossible. My first reaction: “That’s not elegant. That’s a straitjacket.” Then I saw HC hit 7x amplification. Oh. That’s the point.
- The boring choice scales. Standard residual connections have survived since 2016 not because they’re optimal, but because they’re stable. HC is more expressive but fragile. mHC finds a middle ground: more expressive than standard residuals, with stability guarantees.
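The shape check referenced in the first takeaway, as a sketch. The stream count of 4 is an assumption for illustration; the point is only that the tensor handed to the next layer must still carry the stream dimension.

```python
import torch

N_STREAMS = 4   # assumed expansion rate, for illustration only

def check_stream_persistence(x: torch.Tensor) -> torch.Tensor:
    """The tensor flowing *between* layers must keep all parallel streams.

    My buggy version collapsed to [batch, seq, dim] after every layer and
    re-expanded on entry, so the streams never actually persisted.
    """
    assert x.dim() == 4 and x.shape[1] == N_STREAMS, (
        f"expected [batch, {N_STREAMS}, seq, dim], got {tuple(x.shape)}"
    )
    return x

check_stream_persistence(torch.randn(2, N_STREAMS, 128, 192))  # correct inter-layer shape
```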
What’s Next
This is Part 1 of a two-part series.
Part 1 (this post): Reproduce mHC at small scale to understand the mechanics.
- 10M parameters, TinyShakespeare dataset
- Constant parameter budget across depths
- Goal: Validate the core claim: HC explodes, mHC doesn’t
Part 2 (Thursday): Scale up to see real instability.
- 1B parameters on A100s
- C4 dataset, fixed width (no bottleneck)
- Goal: Push toward the 3000x Amax regime
At 10M params, HC peaked at 9.2x amplification, chaotic but survivable. The paper saw 3000x at 27B. Part 2 will show where things break.
Resources
Paper: Manifold-Constrained Hyper-Connections (arXiv 2512.24880)
Related: Deep Residual Learning (He et al., 2016)
Code: Coming with Part 2.
Part 2 comes Thursday. Follow @TayKolasinski to catch it.