Show HN: The Hessian of tall-skinny networks is easy to invert

Original link: https://github.com/a-rahimi/hessian

This package efficiently computes the product of the inverse of a deep network's Hessian with a vector, a computationally demanding task. Inverting the Hessian directly scales poorly with network size, making it impractical. This work uses a clever augmentation-and-pivoting technique to transform the system of equations into a solvable block-tridiagonal form. Solving the augmented system effectively mimics a forward pass through a modified network. The core algorithm is described in the linked paper and implemented in `hessian_inverse_product`, which relies on operations over hierarchically structured block matrices supported by the `block_partitioned_matrices.py` library. The eventual goal, building on Pearlmutter's work on Hessian-vector products, is to use this as a preconditioner to speed up stochastic gradient descent. A demo is available in `demo_hessian.ipynb`.

## Inverting Deep-Network Hessians: Summary

A new method for efficiently inverting the Hessian of a deep neural network has been developed, potentially offering a faster alternative to traditional optimization techniques. While naively inverting the Hessian is computationally expensive (growing cubically with the number of parameters), this method achieves linear scaling by exploiting the matrix-polynomial structure within the Hessian. The core idea, reminiscent of Pearlmutter's earlier work, applies a "Hessian-inverse product" algorithm that resembles backpropagation on a dual network. The researchers suggest this could serve as a preconditioner for stochastic gradient descent (SGD), potentially accelerating training. Discussion centered on whether to use it as a preconditioner or to apply it directly in Newton's method (weighing mini-batch versus full-dataset computation). The conversation also touched on related techniques such as stochastic Newton methods and Jacobian-free Newton-Krylov methods, and on the benefits of using Jacobians for optimization. Resources and explanations of Hessians, gradients, and Jacobians were shared to help readers less familiar with the underlying mathematics.

Original text

This package shows how to multiply the inverse of the Hessian of a deep network with a vector. Whereas a Hessian-vector product computes $Hv$ for a fixed vector $v$, here we are interested in solving $Hx = v$ for $x$. The hope is to soon use this as a preconditioner to speed up stochastic gradient descent.
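To make the problem concrete, here is a minimal sketch (not code from this repo) that materializes the full Hessian of a toy loss and solves $Hx = v$ densely; this dense solve is the naive cubic-cost baseline the package avoids:

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 5)

def loss_fn(p):
    # A toy quadratic loss; its Hessian is the constant matrix 2 * A.T @ A.
    return (A @ p).pow(2).sum()

p = torch.randn(5)
H = torch.autograd.functional.hessian(loss_fn, p)  # materialize the full Hessian

v = torch.randn(5)
x = torch.linalg.solve(H, v)  # the naive solve: cubic in the parameter count
assert torch.allclose(H @ x, v, atol=1e-4)
```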

Pearlmutter showed a clever way to compute the Hessian-vector product for a deep net. By contrast, the paper and code in this repo show how to compute the Hessian-inverse product: the product of the inverse of the Hessian of a deep net with a vector. Solving this system naively requires a number of operations that scales cubically with the number of parameters in the deep net, which is impractical for most modern networks. The trick is to augment the system of equations $Hx = v$ with auxiliary variables, pivot the system into a block-tridiagonal form, factor that system, and then solve it. These steps, in effect, end up looking like running propagation through an augmented version of the original network.
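For contrast, the Pearlmutter-style Hessian-vector product never forms $H$ at all. A common double-backpropagation formulation (a generic sketch, independent of this repo) looks like:

```python
import torch

torch.manual_seed(0)
p = torch.randn(5, requires_grad=True)
v = torch.randn(5)
loss = (p.sin() ** 2).sum()

# First backward pass: the gradient, kept differentiable via create_graph.
(grad,) = torch.autograd.grad(loss, p, create_graph=True)

# Differentiating grad . v a second time yields H v without materializing H.
(hvp,) = torch.autograd.grad(grad @ v, p)

# Sanity check against the explicit Hessian (affordable at this toy size).
H = torch.autograd.functional.hessian(lambda q: (q.sin() ** 2).sum(), p)
assert torch.allclose(hvp, H @ v, atol=1e-5)
```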

The full idea is described in this paper. For a demo, see demo_hessian.ipynb. For a look at how the algorithm is implemented, see the hessian_inverse_product function.

The algorithm relies heavily on operations on hierarchically nested, structured block matrices. For example, it makes use of partitioned matrices whose blocks are block-diagonal, and tridiagonal matrices whose blocks are partitioned matrices. The code includes a library to manipulate such matrices in block_partitioned_matrices.py. See its tutorial.
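To give a flavor of why the block-tridiagonal form pays off, here is a generic block Thomas (block LU) solver in NumPy. This is an illustrative sketch, not the block_partitioned_matrices.py implementation: its cost is linear in the number of blocks, versus cubic for a dense solve over the whole system.

```python
import numpy as np

def solve_block_tridiagonal(D, L, U, v):
    """Solve a block-tridiagonal system by block LU elimination.

    D: n diagonal blocks (each b x b); L, U: n-1 sub/super-diagonal blocks;
    v: right-hand side of length n*b. Cost is linear in n.
    """
    n, b = len(D), D[0].shape[0]
    D = [Di.copy() for Di in D]
    v = v.reshape(n, b).copy()
    # Forward elimination: fold each sub-diagonal block into the pivot below.
    for i in range(1, n):
        W = L[i - 1] @ np.linalg.inv(D[i - 1])
        D[i] = D[i] - W @ U[i - 1]
        v[i] -= W @ v[i - 1]
    # Back substitution, one block row at a time.
    x = np.empty_like(v)
    x[-1] = np.linalg.solve(D[-1], v[-1])
    for i in range(n - 2, -1, -1):
        x[i] = np.linalg.solve(D[i], v[i] - U[i] @ x[i + 1])
    return x.reshape(-1)

# Check on a random 3-block system assembled densely.
rng = np.random.default_rng(0)
D = [rng.standard_normal((2, 2)) + 4 * np.eye(2) for _ in range(3)]
L = [rng.standard_normal((2, 2)) for _ in range(2)]
U = [rng.standard_normal((2, 2)) for _ in range(2)]
v = rng.standard_normal(6)
x = solve_block_tridiagonal(D, L, U, v)

A = np.zeros((6, 6))
for i in range(3):
    A[2 * i:2 * i + 2, 2 * i:2 * i + 2] = D[i]
for i in range(2):
    A[2 * i + 2:2 * i + 4, 2 * i:2 * i + 2] = L[i]
    A[2 * i:2 * i + 2, 2 * i + 2:2 * i + 4] = U[i]
assert np.allclose(A @ x, v)
```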
