(comments)

Original link: https://news.ycombinator.com/item?id=40466826

This paper proposes using natural gradient descent (NGD), a second-order optimization method, for machine learning tasks. Unlike ordinary gradient descent, NGD requires solving a linear system, which makes it more expensive. To address this, the authors propose running NGD alongside a conventional GPU: the GPU computes the gradient and the Hessian (curvature) matrix, while the linear-system solve is offloaded to a thermodynamic computer. Although no practical thermodynamic computer exists yet, the authors aim to build a hybrid digital-analog training loop around this concept, which could offer significant advantages for larger optimization problems. The discussion also covers extending the framework to optimization problems beyond deep learning; however, no information about an API or software stack for accessing this capability outside of neural networks has been released publicly yet.


Original


The main point of this is that natural gradient descent (NGD) is a second-order method. The natural gradient is:

∇̃L(θ) = F⁻¹∇L(θ)

which requires solving a linear system (here F is the Fisher information matrix). For this, you can use the methods from the author's previous paper [Thermodynamic Linear Algebra](https://arxiv.org/abs/2308.05660).
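
To make the role of that solve concrete, here is a minimal, purely digital stand-in (a sketch under toy assumptions, not the paper's implementation): the gradient and a Fisher-vector product come from autodiff, and the system F v = ∇L(θ) is solved matrix-free with conjugate gradients, which is exactly the step the thermodynamic solver would replace.

```python
# Sketch only: a digital natural-gradient step on a toy least-squares model.
# CG stands in for the linear solve the thermodynamic hardware would perform;
# the model, damping value, and names are illustrative.
import jax
import jax.numpy as jnp

def loss(theta, x, y):
    return jnp.mean((x @ theta - y) ** 2)

def fisher_vec(theta, x, v, damping=1e-3):
    # Gauss-Newton/Fisher-vector product for the linear model, matrix-free
    return x.T @ (x @ v) / x.shape[0] + damping * v

def ngd_step(theta, x, y, lr=0.5):
    g = jax.grad(loss)(theta, x, y)
    # Solve F v = grad(L) without ever forming F explicitly
    v, _ = jax.scipy.sparse.linalg.cg(lambda u: fisher_vec(theta, x, u), g)
    return theta - lr * v

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 8))
theta_star = jnp.arange(8.0)
y = x @ theta_star

theta = jnp.zeros(8)
for _ in range(30):
    theta = ngd_step(theta, x, y)
print(float(jnp.linalg.norm(theta - theta_star)))  # close to 0
```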

Since it's hard to implement a full neural network on a thermodynamic computer, the paper suggests running one in parallel to a normal GPU. The GPU computes F and ∇L(θ), but offloads the linear system to the thermo computer, which runs in parallel to the digital system (Figure 1).
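
As a rough picture of how that split could look in code (my own reading of Figure 1, not an API from the paper), the digital side produces ∇L(θ) and F and hands the solve to a separate worker; `thermo_solve` and `grad_and_fisher` below are placeholders, with a plain NumPy solve standing in for the analog device.

```python
# Rough skeleton of the hybrid loop: digital gradient/Fisher computation,
# with the linear solve offloaded to a "device" worker running in parallel.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def thermo_solve(F, g):
    # placeholder: the real device would physically relax to the solution of F v = g
    return np.linalg.solve(F, g)

def hybrid_training_loop(theta, batches, grad_and_fisher, lr=0.1):
    # grad_and_fisher(theta, batch) -> (gradient, Fisher matrix), computed digitally
    with ThreadPoolExecutor(max_workers=1) as analog:
        for batch in batches:
            g, F = grad_and_fisher(theta, batch)       # digital (GPU) side
            solve = analog.submit(thermo_solve, F, g)  # offload to the device
            # ...the digital side could prepare the next batch here while
            # the analog solve settles in parallel...
            theta = theta - lr * solve.result()
    return theta
```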

It is important to note that the "Runtime vs Accuracy" plot in Figure 3 uses a "timing model" for the TNGD algorithm, since the computer necessary to run the algorithm still doesn't exist.



Cool and interesting. The authors propose a hybrid digital-analog training loop that takes into account the curvature of the loss landscape (i.e., it uses second-order derivatives), and show with numerical simulations that if their method is implemented in a hybrid digital-analog physical system, each iteration in the training loop would incur computational cost that is linear in the number of parameters. I'm all for figuring out ways to let the Laws of Thermodynamics do the work of training AI models, if doing so enables us to overcome the scaling limitations and challenges of existing digital hardware and training methods.



I know they mainly present results on deep learning/neural network training and optimization, but I wonder how easy it would be to use the same optimization framework for other classes of hard or large optimization problems. I was also curious about this when I saw posts about Extropic (https://www.extropic.ai/) stuff for the first time.

I tried looking into any public info on their website about APIs or software stack to see what's possible beyond NN stuff to model other optimization problems. It looks like that's not shared publicly yet.

There are certainly many NP-hard and large combinatorial or analytical optimization problems still out there that are worth being able to tackle with new technology. Personally, I care about problems in EDA and semiconductor design. Adiabatic quantum computing was one technology with the promise of solving optimization problems (and quantum computing is still playing out with only small-scale solutions at the moment). Hoping that these new "thermodynamic computing" startups also might provide some cool technology to explore these problems with.



Hey, thanks for the reply! I'm assuming you're the first author on the paper; if so, the signup button on the Normal Computing website is not working at the moment (at least for me, even with my ad blocker turned off).



Leveraging thermodynamics to more efficiently compute second-order updates is certainly cool and worth exploring; however, specifically in the context of deep learning, I remain skeptical of its usefulness.

We already have very efficient second-order methods running on classical hardware [1] but they are basically not being used at all in practice, as they are outperformed by ADAM and other 1st-order methods. This is because optimizing highly nonlinear loss functions, such as the ones in deep learning models, only really works with very low learning rates, regardless of whether a 1st or a 2nd order method is used. So, comparatively speaking, a 2nd order method might give you a slightly better parameter update per step but at a more-than-slightly-higher cost, so most of the time it's simply not worth doing.

[1] https://andrew.gibiansky.com/blog/machine-learning/hessian-f...
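
For context, the method in [1] never forms the Hessian; it only needs Hessian-vector products, each of which costs roughly one extra gradient evaluation. A minimal sketch of that product (toy loss and names are mine):

```python
# Minimal sketch of the Hessian-vector product at the heart of Hessian-free
# optimization [1]: forward-over-reverse autodiff gives H @ v without ever
# materializing H. The toy loss below is illustrative.
import jax
import jax.numpy as jnp

def loss(theta):
    return jnp.sum(jnp.sin(theta) + 0.5 * theta ** 2)

def hvp(theta, v):
    # differentiate the gradient along direction v -> Hessian times v
    return jax.jvp(jax.grad(loss), (theta,), (v,))[1]

theta = jnp.array([0.3, -1.2, 2.0])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(theta, v))  # first column of the Hessian at theta
# Running CG on `hvp` then yields the curvature-corrected update direction.
```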



Agreed that it's very cool, and also about how hard it is to make second order methods worthwhile. We're just using such huge datasets that sometimes it's hard to even get a decent estimate of the gradient for a minibatch. Getting a useful estimate of second order information over the dataset is even harder, especially when the whole point of using minibatches is computational feasibility.



Those are valid points! Hessian-free (HF) optimization is a really nice method, but as you say remains costly so people don't use it. The key idea in this paper is that if you are able to solve linear systems faster by using an analog device, the cost of a HF-like method is brought down, so the method can become competitive.

About the noise, it is true that the second-order information will be noisier than the gradient for a given batch size (and a lot of results out there for HF optimization are with impractically large batch sizes). In the paper we use relatively small batch sizes (eg 32 for the fine-tuning example) and show that you can still get an advantage from second-order information. Of course it would be interesting to study in more detail how noisy 2nd order information can be, and on more datasets.



Not having read the paper carefully, could someone tell me what the draw is? It looks like it is going to have the same asymptotic complexity as SGD in terms of sample size, per Table 1. Given that today's large, over-specified models have numerous, comparable extrema, is there even a need for this? I wouldn't get out of bed unless it were sublinear.



I don't get it. Gradient descent computation is super frequent and the state/input changes all the time, so you'd have to reset the heat landscape very frequently. What's the point? No way there's any potential speedup opportunity there, right?

If anything you could probably do something with electromagnetic fields, their interference, possibly in 3d.



Sounds great until

> requires an analog thermodynamic computer

Wait. What?

Perhaps a trained physicist can comment on that. Thanks.



The whole point is to leverage the laws of nature to train AI models, overcoming the limitations and scaling challenges of digital hardware and existing training methods.



This could be attractive if they can build a product along those lines: tens, if not hundreds, of billions of dollars are spent yearly on numerical optimization worldwide, and if this can significantly accelerate it, it could be very profitable.



The paper describes it pretty well in appendix C. A matrix of integrators is constructed with a bunch of opamps, RC time constants (using digital potentiometers, presumably) and a multichannel ADC/DAC interface to the PC. Essentially a dedicated differential-equation solver.

So it's a combination of old-school analog computation and modern GPU-based code. Takes longer in practice due to the overhead of interfacing with the hardware and waiting for the integrators to settle, but the authors are claiming that an optimized implementation could outperform a purely-digital solution, as I understand it, by accelerating convergence.

The core idea is that conventional gradient descent is a linear operation at heart, while the loss surfaces actually being traversed are curved and have to be approximated with multiple unnecessary steps if everything is done in the digital domain.

The trouble, as everybody from Seymour Cray onward has learned the hard way, is that CMOS always wins in the end, simply because the financial power of an entire industry goes into optimizing it.
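
A numerical cartoon of what the integrator matrix described above settles to (not the Appendix C circuit, just the underlying idea): integrating dv/dt = -(Av - b)/τ drives v toward the solution of Av = b when A is positive definite, so "waiting for the integrators to settle" is itself the linear solve.

```python
# Toy simulation of analog settling as a linear solver: the steady state of
# dv/dt = -(A v - b)/tau is the solution of A v = b for positive definite A.
# Values and sizes are arbitrary; this is only meant to illustrate the idea.
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)      # positive definite "Fisher-like" matrix
b = rng.normal(size=n)

v = np.zeros(n)
tau, dt = 1.0, 1e-3
for _ in range(20000):           # crude explicit-Euler integration
    v += dt * (-(A @ v - b) / tau)

print(np.max(np.abs(v - np.linalg.solve(A, b))))  # ~0 after settling
```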



First author of the paper here. That's it indeed! One thing is that this is entirely CMOS-compatible. You could also do something similar with optics or other platforms, but we chose electronic circuits for this reason specifically.



By that remark I meant "digital CMOS," in the sense of elements that store state information discretely in flip-flops or gate insulators rather than continuously with analog integrators.

Very cool work in any event, though! Best of luck with the ongoing R&D.



I didn't realize they included details about the hardware. Like you said, these just look like analog computers (compute-in-memory, analog arrays), which have also made a resurgence with deep learning.



I believe one example would be quantum annealers. Where "programming" involves setting the right initial conditions and allowing thermodynamics to bring you to an optimum via relaxation.



Analog computers have a lot of history. You can Google analog with neural network or differential equations to get many results. They are fast with low power, can have precision issues, and require custom chip design.

https://en.m.wikipedia.org/wiki/Analog_computer

Mixed-signal ASICs often use a mix of digital and analog blocks to get the benefits of analog. It's especially helpful for anything that eats lots of power, or to prevent that (e.g., mobile).



Hard to beat the string algorithm for finding shortest paths on a positive weights network (e.g. build the network out of string where topologies match and link lengths are link weights, find the origin and destination nodes/knots of interest, grab the two nodes and pull until taut).

Or the spaghetti approach to finding the largest value from a list of positive values (e.g. cut dry spaghetti noodles to length for each value, bundle them together and tap the bundle on a table, the one visually sticking out the most is the largest valued element).



Of course, we already need to be working in spaghetti ints, or prepping them will be as complex as a linear scan.

Can't wait for spaghetti arithmetic.

Do we have a better algo than log(n) for locating the min?

I'm thinking spaghetti max: align them, lay them across an arm at the midway point, sweep away the short ones that fell, and repeat until all remaining are the same length.
