Bias Compounds, Variance Washes Out

Original link: https://convergentthinking.sh/posts/bias-compounds-variance-washes-out/

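Standard round-to-nearest (RNE) in floating-point arithmetic introduces a persistent bias that accumulates over time. When many small updates are applied, as in neural network training, RNE rounding error makes values stall, because each update is rounded back to the same representable number. By contrast, the errors from stochastic rounding (SR) are unbiased, with zero mean. The individual updates are noisier, but over long runs the errors cancel, so the sum grows as expected.

Mathematically, biased error grows linearly ($O(n)$), while unbiased error grows at random-walk speed ($O(\sqrt{n})$). This difference is critical for training stability. Experiments show that BF16 with SR for optimizer state matches FP32 performance, whereas BF16 with RNE makes the training loss plateau prematurely. Replacing RNE with SR inside the optimizer kernel requires no extra memory or bandwidth, so practitioners can get FP32-level accuracy while cutting memory per parameter from 10 bytes to 6. In short, removing rounding bias is the key to keeping low-precision training converging.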


Original text

Round-to-nearest makes the same rounding error every time. Stochastic rounding makes a different error each time, centered on zero. When the same error repeats, it compounds. When errors are zero-mean, they partly cancel.

Add 0.001 to 1.0 a thousand times in BF16 and the round-to-nearest sum never moves. BF16 values just above 1.0 are spaced $2^{-7} \approx 0.0078$ apart, so every update falls closer to 1.0 than to the next representable value and rounds back to 1.0. Under stochastic rounding the sum reaches about 2.0. Each update rounds up with probability equal to its fractional position in the rounding interval, so in expectation the sum is exact.
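The accumulation experiment can be approximated in plain Python. This is a minimal sketch, not the code behind the figure: it emulates BF16 by keeping the top 16 bits of a float32 bit pattern, and the helper names (`bf16_rne`, `bf16_sr`, `accumulate`) are made up for illustration. It ignores overflow, infinities, and NaN.

```python
import random
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after conversion to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_rne(x: float) -> float:
    """Round to the nearest bf16 value, ties to even."""
    bits = f32_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)  # tie-breaking depends on the kept LSB
    return bits_to_float(bits & 0xFFFF0000)


def bf16_sr(x: float, rng: random.Random) -> float:
    """Stochastic rounding: add uniform noise to the 16 discarded bits,
    then truncate. The magnitude rounds up with probability equal to the
    discarded fraction."""
    bits = f32_bits(x) + rng.getrandbits(16)
    return bits_to_float(bits & 0xFFFF0000)


def accumulate(round_fn, n: int = 1000, step: float = 0.001) -> float:
    """Add `step` to 1.0 n times, rounding the running sum to bf16 each time."""
    total = 1.0
    step = bf16_rne(step)  # the increment itself is stored in bf16
    for _ in range(n):
        total = round_fn(total + step)
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
rne_total = accumulate(bf16_rne)
sr_total = accumulate(lambda x: bf16_sr(x, rng))
print("round-to-nearest:", rne_total)
print("stochastic:      ", sr_total)
```

The round-to-nearest sum stays at 1.0, while the stochastic sum lands near 2.0 with some seed-dependent scatter.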

accumulation.png

Over $n$ steps, biased errors grow as $O(n)$, but unbiased errors grow as $O(\sqrt{n})$.

The variance diffuses like a random walk, growing with every step. But it grows slower than bias, and over long runs of small updates, that difference is everything.
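The two growth rates follow from a short calculation. The symbols here are introduced for illustration and the setup is idealized: $e_i$ is the rounding error at step $i$, $E_n = \sum_{i=1}^{n} e_i$ is the accumulated error, the biased case assumes a constant error $b$, and the unbiased case assumes independent zero-mean errors with variance at most $\sigma^2$.

$$
\text{biased: } e_i = b \;\Rightarrow\; |E_n| = n\,|b| = O(n)
$$

$$
\text{unbiased: } \mathbb{E}[e_i] = 0 \;\Rightarrow\; \mathbb{E}[E_n] = 0, \quad \operatorname{Var}(E_n) = \sum_{i=1}^{n} \operatorname{Var}(e_i) \le n\,\sigma^2, \quad \sqrt{\operatorname{Var}(E_n)} \le \sigma\sqrt{n} = O(\sqrt{n})
$$

Real rounding errors are neither perfectly constant nor perfectly independent, so this is a model of the scaling rather than an exact bound.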

The experiment trains a small MLP on a teacher-student regression task using HeavyBall’s AdamW. All configs store parameters in bf16. Only the precision and rounding mode of the optimizer state vary.

precision_toy_param.png

BF16 + SR (6 bytes per parameter: 2 each for the parameter, first moment, and second moment) matches FP32 state (10 bytes: 2 for the bf16 parameter plus 4 each for the two moments). The per-step updates are noisier, but the noise is unbiased and washes out over training. BF16 + RNE (also 6 bytes) plateaus at a loss orders of magnitude higher. The same errors repeat, and the loss stalls.

Stochastic rounding replaces round-to-nearest inside the optimizer kernel, adding no memory and no bandwidth.
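To illustrate why the swap costs nothing extra, here is a hypothetical sketch of a first-moment (EMA) update whose state lives in BF16. It is not HeavyBall's kernel: the function names, the constant gradient, and `beta = 0.999` are assumptions chosen to make the effect visible. The update is computed in higher precision, and the only difference between the two runs is the rounding applied when the state is written back.

```python
import random
import struct


def _bits(x: float) -> int:
    """Bit pattern of x after conversion to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _from_bits(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def store_rne(x: float) -> float:
    """Write-back with round-to-nearest-even to bf16."""
    bits = _bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)
    return _from_bits(bits & 0xFFFF0000)


def store_sr(x: float, rng: random.Random) -> float:
    """Write-back with stochastic rounding to bf16."""
    return _from_bits((_bits(x) + rng.getrandbits(16)) & 0xFFFF0000)


def run_ema(store, steps: int = 10_000, beta: float = 0.999, grad: float = 1.0) -> float:
    """Repeatedly apply m <- beta * m + (1 - beta) * grad with bf16 storage."""
    m = 0.0  # optimizer state, held in bf16 between steps
    for _ in range(steps):
        m_new = beta * m + (1.0 - beta) * grad  # computed in higher precision
        m = store(m_new)  # the rounding on write is the only thing that changes
    return m


rng = random.Random(0)  # fixed seed so the run is repeatable
ema_rne = run_ema(store_rne)
ema_sr = run_ema(lambda x: store_sr(x, rng))
ema_exact = 1.0 - 0.999 ** 10_000  # closed form for a constant gradient of 1.0

print("exact:", ema_exact)
print("RNE:  ", ema_rne)
print("SR:   ", ema_sr)
```

In this toy setting the round-to-nearest state stalls well short of the exact value once the per-step change drops below half the spacing between BF16 values, while the stochastic state tracks the exact value with some noise. Both versions store the same 16 bits per state element.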

Remove the bias and six bytes match ten. Leave it, and six bytes hit a wall.


Code

Corrected 2026-03-16. Results rerun after fixing a torch.compile fusion that eliminated bf16 round-trips in the original experiment.
