True. Samsung's Dr. Jung Bae Lee was also talking about that recently.

> "rapid growth of AI models is being constrained by a growing disparity between compute performance and memory bandwidth. While next-generation models like GPT-5 are expected to reach an unprecedented scale of 3-5 trillion parameters, the technical bottleneck of memory bandwidth is becoming a critical obstacle to realizing their full potential."

https://www.lycee.ai/blog/2024-09-04-samsung-memory-bottlene...

> Or even an architecture akin to an astonishingly large number of RP2050's.

Groq and Cerebras are probably that kind of architecture.

I'd love to watch an LLM run in WebGL where everything is textures. It would be neat to see the differences between architectures visually.

Memory movement is the bottleneck these days, hence the expensive HBM. Nvidia's designs are also memory-optimized, since memory is the true bottleneck both chip-wise and system-wise.

Normally it's FPGA + memory first; once it hits a sweet spot in the market with enough volume, you turn the FPGA design into an ASIC for performance and cost savings. Big companies will go ASIC directly.

Claims 90% memory reduction, with OSS code for replication on standard GPUs: https://github.com/ridgerchu/matmulfreellm

> ..avoid using matrix multiplication using two main techniques. The first is a method to force all the numbers within the matrices to be ternary, meaning they can take one of three values: negative one, zero, or positive one. This allows the computation to be reduced to summing numbers rather than multiplying.. Instead of multiplying every single number in one matrix with every single number in the other matrix.. the matrices are overlaid and only the most important operations are performed.. researchers were able to maintain the performance of the neural network by introducing time-based computation in the training of the model. This enables the network to have a “memory” of the important information it processes, enhancing performance.

> ... On standard GPUs.. network achieved about 10 times less memory consumption and operated about 25 percent faster.. could provide a path forward to enabling the algorithms to run at full capacity on devices with smaller memory like smartphones.. Over three weeks, the team created a [FPGA] prototype of their hardware.. surpasses human-readable throughput.. on just 13 watts of power. Using GPUs would require about 700 watts of power, meaning that the custom hardware achieved more than 50 times the efficiency of GPUs.

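As a rough illustration of the ternary trick described in that quote, here is a minimal NumPy sketch (my own toy example, not code from the linked repo): once the weights are restricted to {-1, 0, +1}, each output element reduces to adding and subtracting selected activations, with no multiplications.

```python
import numpy as np

def ternary_quantize(w, threshold=0.05):
    """Toy quantizer: map weights to {-1, 0, +1} by sign, zeroing small values."""
    q = np.sign(w)
    q[np.abs(w) < threshold] = 0
    return q.astype(np.int8)

def ternary_matvec(W_ternary, x):
    """Matrix-vector product where W holds only -1/0/+1.

    No multiplications needed: for each row, add the activations where the
    weight is +1 and subtract those where it is -1.
    """
    out = np.empty(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Sanity check against an ordinary dot product using the same ternary weights.
rng = np.random.default_rng(0)
W = ternary_quantize(rng.normal(size=(8, 16)))
x = rng.normal(size=16)
assert np.allclose(ternary_matvec(W, x), W @ x)
```

A real kernel would pack the ternary weights (about 1.6 bits of information each) instead of storing them as full-width floats, which is where the large memory reduction comes from.
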
I'd expect it to be MAC hardware embedded on the DRAM die (or, in the case of stacked HBM, possibly on the substrate die).

To quote an old article about such acceleration, which sees 19x improvements over DRAM + GPU:

https://arxiv.org/pdf/2105.03736

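To make the data-movement argument concrete, here is a back-of-the-envelope sketch (my own toy model and numbers, not taken from the paper above): in a conventional GEMV every weight byte has to cross the memory bus, whereas with MAC units sitting next to the DRAM banks only small per-bank partial sums do.

```python
import numpy as np

def bytes_over_bus(rows, cols, n_banks, bytes_per_weight=2, bytes_per_partial=4):
    """Toy bus-traffic model for one matrix-vector product.

    Conventional: the full weight matrix streams to the compute die.
    PIM-style:    weights stay in the banks; each bank returns one partial
                  sum per output row, which the host then reduces.
    (Activation traffic is ignored in both cases for simplicity.)
    """
    conventional = rows * cols * bytes_per_weight
    pim = rows * n_banks * bytes_per_partial
    return conventional, pim

# Functional check: splitting the columns across "banks" and reducing the
# per-bank partial sums gives the same result as the full GEMV.
rng = np.random.default_rng(0)
rows, cols, n_banks = 4096, 4096, 16
W = rng.normal(size=(rows, cols))
x = rng.normal(size=cols)

bank_slices = np.array_split(np.arange(cols), n_banks)
partials = np.stack([W[:, s] @ x[s] for s in bank_slices])  # computed "in memory"
y = partials.sum(axis=0)                                    # cheap host-side reduction
assert np.allclose(y, W @ x)

conv, pim = bytes_over_bus(rows, cols, n_banks)
print(f"bus traffic per GEMV: conventional ~{conv/1e6:.0f} MB, PIM-style ~{pim/1e6:.2f} MB")
```
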
+1. Personal opinion: accelerators are useful today, but they have kept us in a local minimum which is certainly not ideal. There are interesting approaches such as near-linear low-rank approximation of attention gradients [1]. Would we rather have that, or somewhat better constant factors?

[1] https://arxiv.org/html/2408.13233v1

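For intuition on why low-rank approaches are plausible at all, here is a small sketch (it only illustrates the approximate low-rank structure of softmax attention matrices that such methods exploit; it is not the gradient algorithm from [1]): build an attention matrix from random queries and keys and measure how much of it a rank-r truncated SVD captures.

```python
import numpy as np

def attention_matrix(Q, K):
    """Row-wise softmax of Q K^T / sqrt(d), as in standard attention."""
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 512, 32
A = attention_matrix(rng.normal(size=(n, d)), rng.normal(size=(n, d)))

# Best rank-r approximations via truncated SVD, with relative Frobenius error.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for r in (4, 16, 64):
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]
    err = np.linalg.norm(A - A_r) / np.linalg.norm(A)
    print(f"rank {r:3d}: relative error {err:.3f}")
```
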
Oh, absolutely. Never switching to digital would be the way, and that's not hard for low bit counts like 4. I'm very interested in the methodology if they do this with 64-bit.

Is there a "nice" way to read content on Arxiv?

Every time I land on that site I'm so confused / lost in its interface (or lack thereof) that I usually end up leaving without getting to the content.

Same here -- I visited this link earlier today and thought "Oh, it's just an abstract, I'm out". I've read Arxiv papers before but the UI just doesn't look like it offers any content.

This explains the success of Groq's ASIC-powered LPUs. LLM inference on Groq Cloud is blazingly fast. Also, the reduction in energy consumption is nice.

As early as the 90s it was observed that CPU speed (FLOPS) was improving faster than memory bandwidth. In 1995, William Wulf and Sally McKee predicted this divergence would lead to a “memory wall”, where most computations would be bottlenecked by data access rather than arithmetic operations.
Over the past 20 years peak server hardware FLOPS has been scaling at 3x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively.
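Those growth rates compound quickly. A quick calculation with the figures above (my arithmetic, not from the paper) shows how wide the gap has become over 20 years:

```python
# Compounding the quoted growth rates over 20 years = ten 2-year periods.
periods = 20 / 2
flops_growth = 3.0 ** periods         # ~59,000x
dram_bw_growth = 1.6 ** periods       # ~110x
interconnect_growth = 1.4 ** periods  # ~29x

print(f"FLOPS:        {flops_growth:,.0f}x")
print(f"DRAM BW:      {dram_bw_growth:,.0f}x")
print(f"Interconnect: {interconnect_growth:,.0f}x")
print(f"FLOPS outgrew DRAM bandwidth by ~{flops_growth / dram_bw_growth:,.0f}x")
```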
Thus for training and inference of LLMs, the performance bottleneck is increasingly shifting toward memory bandwidth. Particularly for autoregressive Transformer decoder models, it can be the dominant bottleneck.
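A roofline-style back-of-the-envelope calculation shows why single-stream autoregressive decoding in particular is memory-bound: generating one token reads every weight once but performs only ~2 FLOPs per weight. The hardware and model numbers below are illustrative assumptions, not figures from the paper.

```python
# Roofline-style estimate for batch-1 autoregressive decoding (toy numbers).
peak_flops = 1000e12       # assumed accelerator peak: 1000 TFLOP/s
mem_bandwidth = 3.35e12    # assumed HBM bandwidth: 3.35 TB/s

params = 70e9                  # assumed 70B-parameter model
bytes_per_token = params * 2   # fp16 weights, each read once per token
flops_per_token = params * 2   # ~2 FLOPs (multiply + add) per weight per token

arithmetic_intensity = flops_per_token / bytes_per_token  # 1 FLOP/byte
machine_balance = peak_flops / mem_bandwidth              # ~300 FLOP/byte

t_compute = flops_per_token / peak_flops
t_memory = bytes_per_token / mem_bandwidth
print(f"arithmetic intensity {arithmetic_intensity:.1f} FLOP/byte "
      f"vs machine balance {machine_balance:.0f} FLOP/byte")
print(f"compute-limited: {t_compute*1e3:.2f} ms/token, "
      f"bandwidth-limited: {t_memory*1e3:.1f} ms/token")
# When arithmetic intensity << machine balance, bandwidth sets the token rate.
```

With these assumed numbers the bandwidth-limited time per token is roughly 300x the compute-limited time, which is why the weights' trip from DRAM, not the math, dominates decoding latency.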
This is driving the need for new tech like compute-in-memory (CIM), also known as processing-in-memory (PIM): hardware in which operations are performed directly on the data in memory, rather than first transferring it to CPU registers, thereby improving latency and power consumption and possibly sidestepping the great “memory wall”.
Notably, to compare ASIC and FPGA hardware across varying semiconductor process sizes, the paper uses a fitted polynomial to extrapolate to a common denominator of 16nm:
> Based on the article by Aaron Stillmaker and B. Baas titled "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm," we extrapolated the performance and the energy efficiency on a 16nm technology to make a fair comparison
But extrapolation for CIM/PIM is not done because they claim:
> As the in-memory accelerators the performance is not based only on the process technology, the extrapolation is performed only on the FPGA and ASIC accelerators where the process technology affects significantly the performance of the systems.
Which strikes me as an odd claim at face value, but perhaps others here could offer further insight on that decision.
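For a sense of what that normalization does mechanically, here is a hedged sketch; the scaling factors below are made-up placeholders, not the fitted polynomials from Stillmaker and Baas. Each design's reported throughput and energy are rescaled by node-dependent delay and energy factors so that accelerators built on different nodes can be compared at a nominal 16nm.

```python
# Illustrative node normalization. These factors are placeholders, NOT the
# fitted scaling equations from Stillmaker & Baas.
DELAY_VS_16NM = {45: 2.5, 28: 1.6, 16: 1.0, 7: 0.6}    # relative gate delay (assumed)
ENERGY_VS_16NM = {45: 4.0, 28: 2.2, 16: 1.0, 7: 0.5}   # relative energy per op (assumed)

def normalize_to_16nm(node_nm, throughput_tops, energy_pj_per_op):
    """Rescale a design's reported numbers as if it were built on 16nm."""
    return (throughput_tops * DELAY_VS_16NM[node_nm],    # faster gates -> higher throughput
            energy_pj_per_op / ENERGY_VS_16NM[node_nm])  # cheaper switching -> lower energy

# Example: a hypothetical 28nm ASIC reporting 2 TOPS at 1.5 pJ/op.
print(normalize_to_16nm(28, throughput_tops=2.0, energy_pj_per_op=1.5))
```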
Links below for further reading.
https://arxiv.org/abs/2403.14123
https://en.m.wikipedia.org/wiki/In-memory_processing
http://vcl.ece.ucdavis.edu/pubs/2017.02.VLSIintegration.Tech...