Polar Factor Beyond Newton-Schulz – Fast Matrix Inverse Square Root

原始链接: https://jiha-kim.github.io/posts/polar-factor-beyond-newton-schulz-fast-matrix-inverse-square-root/

## Muon Optimizer: Fast Polar Factor Computation for Machine Learning

The Muon optimizer performs remarkably well in machine learning by efficiently approximating the polar factor of a matrix — a key operation, similar in spirit to signSGD or Lion. The goal is to compute **polar(G) = G(GᵀG)⁻¹/²** for a tall matrix G, with emphasis on speed, numerical stability (especially in bf16), and online accuracy verification.

This is achieved by avoiding a direct SVD and instead refining the approximation iteratively, using only rectangular matrix multiplications (GEMMs) plus operations on small square matrices. The core idea is to compute the inverse square root of the Gram matrix (GᵀG) and then multiply by G once.

Key features include a **Gram-side inverse square root** method, efficient iterations built from **minimax polynomials**, and an **online certificate** based on the Gram residual to verify the accuracy of the result. Jacobi scaling improves spectral conditioning without introducing bias. Stability is reinforced by symmetrization, ridging, and restart blocks — techniques borrowed from Polar Express.

Precomputed polynomial coefficient tables allow fast online selection based on the current residual, balancing aggressive iteration against controlled convergence, which is particularly important in low-precision arithmetic. The result is a fast, stable, and certifiable approximation suited to large-scale machine learning.


Original Article

The Muon optimizer has found huge empirical success in machine learning. It’s essentially signSGD (or Lion by including momentum) for matrices. For the update, we need to approximate the sign function on the singular values of the momentum matrix to compute the polar factor.

Goal: Given \(G\in \mathbb{R}^{m \times n}\) tall (\(m \ge n\)), compute the (column-)orthonormal polar factor

\[ \mathrm{polar}(G):=G(G^\top G)^{-1/2} \]

For the compact SVD \(G=U\Sigma V^\top\), \(\mathrm{polar}(G)=UV^\top\). This is the “directional” component in the polar decomposition \(G=\mathrm{polar}(G) \vert G\vert \), similar to the polar coordinates of a complex number \(z=e^{i\theta}\cdot r\):

\[ \vert G\vert := \sqrt{ G^\top G } \quad \text{("stretch" part: modulus of matrix)} \]

\[ \mathrm{polar}(G)=G\vert G\vert ^{-1} \quad\text{("direction" part: unitary polar factor)} \]

In Muon, we typically do not need high accuracy, but we do want:

  1. a fast GPU path (mostly GEMMs),
  2. numerical stability in bf16,
  3. a way to certify that the singular values \(\sigma_i\) of the returned factor are close to \(1\).

Newton-Schulz/Polar Express iterations normalize the singular values into the unit interval \([0,1]\), then iterate directly with rectangular GEMMs.

Potential opportunity for \(m \gg n\): compute \((G^\top G)^{-1/2}\) on the small side and multiply once; the result can then be refined with full polar steps. This opens up some nicer theoretical properties, e.g. (precomputed) online coefficient scheduling, compared to Polar Express's offline coefficients.

Goal

Given \(G \in \mathbb{R}^{m \times n}\) tall (\(m \ge n\)), compute the orthonormal polar factor

\[ \mathrm{polar}(G) := G(G^\top G)^{-1/2}. \]

We want a fast ML-friendly approximation that:

  • uses only 2 rectangular GEMMs (form \(B=G^\top G\), final multiply \(G\widetilde Z\)),
  • does the iterative work on small \(n \times n\) matrices,
  • is stable in bf16 (fp32 accumulate where needed),
  • provides an online certificate that singular values of the returned factor are close to \(1\).

Key idea. Gram-side inverse square root

Let

\[ B := G^\top G \in \mathbb{R}^{n \times n}. \]

Compute \(\widetilde Z \approx B^{-1/2}\) using only \(n\times n\) work, then output

\[ \widetilde U := G \widetilde Z. \]

This is the same structural win that Polar Express exploits for rectangular matrices: form a Gram matrix once, iterate on the small side, then do one final rectangular multiply. Polar Express formalizes this as “Fast Polynomial Iteration for Rectangular Matrices” (Algorithm 4) (Amsel et al., 2025).

What we can certify online (stronger than rectangular direct iterations)

Define the Gram residual

\[ E := \widetilde U^\top \widetilde U - I = \widetilde Z^\top B \widetilde Z - I. \]

If \(\Vert E\Vert _2 \le \eta\), then

\[ \sqrt{1-\eta} \le \sigma_i(\widetilde U) \le \sqrt{1+\eta}. \]

Since \(\Vert E\Vert _2 \le \Vert E\Vert _F\), we can use the cheap sufficient check \(\Vert E\Vert _F \le \eta\) (all on \(n \times n\)). This gives a reliable online proxy for “how safe/aggressive can we be”.
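A minimal sketch of this check in NumPy (the function name is illustrative); the Frobenius norm serves as the cheap sufficient bound:

```python
import numpy as np

def gram_certificate(Z, B, eta):
    """Online certificate on the small side: E = Z^T B Z - I equals
    U^T U - I for U = G Z, since B = G^T G.  If ||E||_F <= eta then
    ||E||_2 <= eta, so every sigma_i(U) lies in
    [sqrt(1 - eta), sqrt(1 + eta)]."""
    E = Z.T @ B @ Z - np.eye(B.shape[0])
    err = float(np.linalg.norm(E, "fro"))
    return err, err <= eta
```

All the work stays on \(n \times n\) matrices, so the certificate never touches the tall side of \(G\).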

Why we do NOT use AOL here (replace with unbiased Jacobi on \(B\))

Turbo-Muon’s AOL is a column scaling applied to \(G\) (so it changes the target to \(\mathrm{polar}(G S)\) and introduces bias) (Boissin et al., 2025). Since we are already working on the square SPD Gram matrix \(B\), we can get the spectrum-improving benefits without bias using an SPD congruence scaling:

\[ \widetilde B := D B D, \qquad B^{-1/2} = D \, \widetilde B^{-1/2} \, D. \]

This changes conditioning but not the mathematical target (up to numerical error).

Empirically, Jacobi scaling (unit-diagonal) is often the best simple choice:

\[ D := \mathrm{diag}(d), \qquad d_i = (B_{ii}+\epsilon)^{-1/2}. \]
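A minimal sketch of the unit-diagonal congruence scaling (NumPy; `jacobi_scale` is a hypothetical helper name):

```python
import numpy as np

def jacobi_scale(B, eps=1e-12):
    """Jacobi (unit-diagonal) congruence scaling of an SPD matrix:
    returns B_tilde = D B D with D = diag((B_ii + eps)^{-1/2}),
    so that diag(B_tilde) ~ 1, along with the scaling vector d."""
    d = 1.0 / np.sqrt(np.diag(B) + eps)
    B_tilde = (d[:, None] * B) * d[None, :]   # elementwise: d_i * B_ij * d_j
    return B_tilde, d
```

The scaling is applied as an elementwise outer-product multiply, so it costs \(O(n^2)\) rather than two extra GEMMs.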

Stability rules. bf16-safe iterations

Polar Express identifies low-precision issues when iterating via Gram-side polynomial compositions (their Algorithm 4) and suggests:

  • add a ridge early to avoid spurious indefiniteness from roundoff,
  • restart compositions to avoid ill-conditioned intermediate factors (Amsel et al., 2025).

We adopt the same philosophy:

  • always symmetrize \(B\) and ridge it,
  • use restart blocks when composing aggressive polynomials,
  • do all small-side iteration in fp32 (or at least fp32 accumulate and residual checks).

Core iteration: minimax-polynomial inverse square root for SPD matrices

Template (“drive the Gram to \(I\)”)

We compute an inverse square root of an SPD matrix \(A\) by maintaining \(Z_k \approx A^{-1/2}\) and driving

\[ S_k := Z_k^\top A Z_k \to I. \]

Update:

\[ Z_{k+1} = Z_k\,q_k(S_k), \]

so eigenvalues evolve as

\[ \lambda \mapsto \lambda' = \lambda\,q_k(\lambda)^2. \]

This matches the standard Newton-style “matrix-multiplication only” inverse-root framework (no factorizations), e.g. in analyses of inverse \(p\)th-root iterations (Guo and Higham, 2006).
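For intuition, the classical Newton-Schulz inverse-square-root step is the special case \(q(\lambda) = (3-\lambda)/2\); a scalar trace of the eigenvalue map shows the contraction toward \(1\):

```python
def eig_map(lam, q):
    """One step of the eigenvalue map: lambda -> lambda * q(lambda)^2."""
    return lam * q(lam) ** 2

newton_schulz_q = lambda lam: (3.0 - lam) / 2.0  # classical choice of q

lam = 0.5
for _ in range(5):
    lam = eig_map(lam, newton_schulz_q)
# lam is now within ~1e-9 of 1 (quadratic convergence near the fixed point)
```

Near the fixed point, an eigenvalue \(1-\varepsilon\) maps to roughly \(1 - \tfrac{3}{4}\varepsilon^2\), which is the quadratic contraction the minimax polynomials are designed to beat on wide intervals.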

Why minimax (Polar Express port)

Polar Express selects per-step polynomials using minimax optimization on an interval to get strong worst-case contraction (Amsel et al., 2025). We port that idea to the SPD eigenvalue map.

For a spectral interval \([\ell,u]\), choose degree-\(d\) polynomial \(q\) by

\[ q^\ast \in \arg\min_{q\in\mathcal{P}_d}\;\max_{\lambda\in[\ell,u]} \left\vert \sqrt{\lambda}\,q(\lambda) - 1\right\vert . \]

If \(\left\vert \sqrt{\lambda}\,q(\lambda)-1\right\vert \le\varepsilon\) on \([\ell,u]\), then

\[ \lambda' = (\sqrt{\lambda}\,q(\lambda))^2 \in [(1-\varepsilon)^2,(1+\varepsilon)^2], \]

giving a clean contraction/interval propagation rule.

We do not solve minimax online; instead we precompute a dense coefficient table offline and select online based on the measured residual.

Offline

Precompute two families:

Phase 1 (global) polynomials:

  • intervals \([\ell,1]\) with \(\ell\) log-spaced (e.g. \(\ell\in\{10^{-4},10^{-3},\dots ,0.5\}\)),
  • minimax \(q_{\ell}\) for each interval.

Phase 2 (local, symmetric-around-1) polynomials:

  • represent \(S = I + R\) and approximate \((I+R)^{-1/2}\),
  • intervals \(r\in[-\rho,\rho]\) with \(\rho\) on a grid (e.g. \(\rho\in\{0.02,0.05,0.1,0.2,0.35,0.5,0.7,0.9\}\)),
  • minimax \(p_{\rho}\) approximating \((1+r)^{-1/2}\) on \([-\rho,\rho]\).
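The Phase-2 family can be sketched offline as below. A Chebyshev least-squares fit stands in for the true minimax (Remez) solve — it is near-minimax on these small symmetric intervals — and the grid shown is an illustrative subset:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def fit_phase2_poly(rho, degree=3, samples=400):
    """Near-minimax polynomial p(r) ~ (1 + r)^{-1/2} on [-rho, rho],
    returned as Chebyshev coefficients (evaluate with C.chebval)."""
    r = np.linspace(-rho, rho, samples)
    return C.chebfit(r, (1.0 + r) ** -0.5, degree)

# dense radius grid, as in the post
phase2_table = {rho: fit_phase2_poly(rho)
                for rho in (0.02, 0.05, 0.1, 0.2, 0.35, 0.5)}
```

The coefficients can be converted to the monomial basis with `C.cheb2poly` before building the matrix polynomial \(p(S - I)\).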

Optionally impose stability constraints in the offline solve (recommended for bf16):

  • \(q(\lambda) > 0\) on the interval (SPD preservation),
  • cap overshoot: ensure \(\lambda q(\lambda)^2\) stays in a controlled range,
  • limit slope near \(1\) to avoid local amplification.

Online selection

At each step compute

\[ S = Z^\top A Z,\qquad \delta_S := \Vert S-I\Vert _F. \]

Then \(\Vert S-I\Vert _2 \le \delta_S\), so

\[ \lambda(S)\subset[1-\delta_S,\,1+\delta_S]. \]

Pick a slightly inflated design radius

\[ \rho_{\text{design}} := \gamma\,\delta_S,\qquad \gamma\in[1.1,1.5], \]

and choose the nearest polynomial \(p_{\rho_{\text{design}}}\) (Phase 2) or, in Phase 1, choose a conservative \(\ell\) schedule.
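A sketch of the online lookup, assuming a precomputed table keyed by design radius (names are illustrative):

```python
import numpy as np

def select_poly(S, table, gamma=1.25):
    """Measure the Frobenius residual of S around I, inflate it by gamma,
    and return the smallest tabulated radius that still covers it."""
    delta = float(np.linalg.norm(S - np.eye(S.shape[0]), "fro"))
    rho_design = gamma * delta
    for rho in sorted(table):
        if rho >= rho_design:
            return table[rho], delta
    return table[max(table)], delta  # fall back to the widest interval
```

The inflation by \(\gamma\) buys slack against the gap between the measured Frobenius residual and the true spectral radius of \(S - I\).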


Two-phase scheme (safe globalization, aggressive local polish)

Phase 0: Form \(B\) and apply unbiased preconditioning

  1. \(B \leftarrow G^\top G\) (fp32 accumulate)
  2. \(B \leftarrow \tfrac12(B+B^\top)\)
  3. Ridge: \(B \leftarrow B + \delta I\)
  4. Jacobi: \(D_{ii} \leftarrow (B_{ii}+\epsilon)^{-1/2}\)
  5. \(\widetilde B \leftarrow DBD\) (elementwise scaling: \(\widetilde B_{ij}=d_i B_{ij} d_j\))

Phase 1: Safe scaling to \((0,1]\) and global minimax steps

  1. Upper bound \(\Lambda \ge \lambda_{\max}(\widetilde B)\) (Gershgorin \(\Vert \widetilde B\Vert _\infty\))
  2. Scale:

    \[ \alpha := \Lambda^{-1/2},\qquad A := \alpha^2 \widetilde B \]

    so \(\lambda(A)\subset(0,1]\)

  3. Initialize \(Z \leftarrow I\)
  4. Repeat in restart blocks (\(T_{\text{block}}\in\{2,3\}\)):
    • \(S \leftarrow Z^\top A Z\)
    • if \(\Vert S-I\Vert _F \le \rho_{\text{switch}}\) (e.g. \(0.5\)): break
    • choose \(q_\ell\) (table lookup for a conservative \(\ell\)) and apply:

      \[ Z \leftarrow Z\,q_\ell(S) \]

    • restart: recompute \(S\) in fp32 and reselect coefficients

Phase 2: Local symmetric-around-1 steps (aggressive but certified)

Now \(\Vert S-I\Vert _F\) is small enough that we can safely use symmetric intervals around \(1\).

Repeat for \(t=1,2\) (often 1 is enough):

  • \(S \leftarrow Z^\top A Z\)
  • \(\delta_S \leftarrow \Vert S-I\Vert _F\)
  • if \(\delta_S \le \eta\): stop
  • \(\rho_{\text{design}} \leftarrow \gamma\delta_S\)
  • lookup \(p_{\rho_{\text{design}}}\) and apply:

    \[ Z \leftarrow Z\,p_{\rho_{\text{design}}}(S-I) \]

Finish: map back to \(B^{-1/2}\) and form \(\widetilde U\)

  1. \(\widetilde B^{-1/2} \approx \alpha Z\)
  2. Map back:

    \[ \widetilde Z := B^{-1/2} \approx D(\alpha Z)D \]

  3. Output:

    \[ \widetilde U = G\widetilde Z \]

Certification and optional polish

Compute

\[ E = \widetilde Z^\top B \widetilde Z - I \]

and check \(\Vert E\Vert _F \le \eta\).


Restarts (important for bf16)

Use short composition blocks (\(T_{\text{block}}\in\{2,3\}\)), then recompute \(S\) and reselect coefficients. This mirrors Polar Express’s practical stabilization for Gram-side rectangular acceleration (Amsel et al., 2025).


Full algorithm: unbiased Jacobi scaling, minimax steps, online selection

Input: \(G\), ridge \(\delta\), Jacobi eps \(\epsilon\), tol \(\eta\), switch \(\rho_{\text{switch}}\), inflate \(\gamma\), coefficient tables

  1. \(B \leftarrow G^\top G\) (fp32 accumulate)
  2. \(B \leftarrow \tfrac12(B+B^\top) + \delta I\)
  3. \(d_i \leftarrow (B_{ii}+\epsilon)^{-1/2}\), \(D=\mathrm{diag}(d)\)
  4. \(\widetilde B \leftarrow DBD\)
  5. \(\Lambda \leftarrow\) upper bound on \(\lambda_{\max}(\widetilde B)\)
  6. \(\alpha \leftarrow \Lambda^{-1/2}\), \(A \leftarrow \alpha^2 \widetilde B\)
  7. \(Z \leftarrow I\)

Phase 1:

  1. repeat (restart blocks):
    a. \(S \leftarrow Z^\top A Z\)
    b. if \(\Vert S-I\Vert _F \le \rho_{\text{switch}}\): break
    c. select minimax \(q_\ell\) for a conservative \([\ell,1]\)
    d. \(Z \leftarrow Z\,q_\ell(S)\)

Phase 2:

  1. for \(t=1,2\):
    1. \(S \leftarrow Z^\top A Z\)
    2. \(\delta_S \leftarrow \Vert S-I\Vert _F\)
    3. if \(\delta_S \le \eta\): break
    4. \(\rho_{\text{design}} \leftarrow \gamma\delta_S\)
    5. select minimax \(p_{\rho_{\text{design}}}\)
    6. \(Z \leftarrow Z\,p_{\rho_{\text{design}}}(S-I)\)

Finish:

  1. \(Z_{\widetilde B} \leftarrow \alpha Z\) (approx \(\widetilde B^{-1/2}\))
  2. \(\widetilde Z \leftarrow D Z_{\widetilde B} D\) (approx \(B^{-1/2}\))
  3. \(\widetilde U \leftarrow G\widetilde Z\)
  4. \(E \leftarrow \widetilde Z^\top B \widetilde Z - I\); if \(\Vert E\Vert _F > \eta\), do one more Phase-2 step

Return: \(\widetilde U\)
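Putting the pieces together, here is a self-contained NumPy sketch of the Gram-side scheme. For brevity it omits the Jacobi congruence step and substitutes plain Newton-Schulz updates \(q(S) = (3I - S)/2\) for the table-driven minimax polynomials, so it demonstrates the structure (two rectangular GEMMs, small-side iteration, online certificate) rather than the tuned coefficients:

```python
import numpy as np

def polar_gram_invsqrt(G, ridge=1e-6, eta=1e-3, max_steps=40):
    """Approximate polar(G) = G (G^T G)^{-1/2} for tall G.
    Structure: form B once, iterate on the n x n side, multiply once."""
    m, n = G.shape
    I = np.eye(n)

    # Phase 0: Gram matrix, symmetrize, ridge
    B = G.T @ G                        # rectangular GEMM 1
    B = 0.5 * (B + B.T) + ridge * I

    # Phase 1 scaling: Gershgorin row-sum bound puts the spectrum in (0, 1]
    Lam = np.max(np.sum(np.abs(B), axis=1))
    alpha = Lam ** -0.5
    A = alpha ** 2 * B

    # Drive S = Z^T A Z -> I with Newton-Schulz steps Z <- Z (3I - S)/2
    Z = I.copy()
    for _ in range(max_steps):
        S = Z.T @ A @ Z
        if np.linalg.norm(S - I, "fro") <= eta:
            break
        Z = Z @ (0.5 * (3.0 * I - S))

    # Finish: map back to B^{-1/2} and form the factor
    Z_B = alpha * Z                    # approx B^{-1/2}
    U = G @ Z_B                        # rectangular GEMM 2
    cert = np.linalg.norm(Z_B.T @ B @ Z_B - I, "fro")  # online certificate
    return U, float(cert)
```

When the returned certificate is below \(\eta\), the singular values of the returned factor lie in \([\sqrt{1-\eta},\sqrt{1+\eta}]\), matching the guarantee above.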


What “dense coefficients” buys you

A dense coefficient grid lets you select a nearly optimal minimax polynomial for the actual measured residual each step (interval-driven updates), matching the spirit of Polar Express (Amsel et al., 2025), but with a stronger online interval proxy because \(S\) is small SPD.

It improves:

  • early contraction when the spectrum is wide,
  • iteration count when the spectrum is already tight,
  • stability: you can inflate the interval by \(\gamma\) and still stay close to minimax-optimal.

This is the clean way to be “more aggressive” while controlling effective convergence radius in bf16.
