True. Samsung's Dr. Jung Bae Lee was also talking about that recently.

> "rapid growth of AI models is being constrained by a growing disparity between compute performance and memory bandwidth. While next-generation models like GPT-5 are expected to reach an unprecedented scale of 3-5 trillion parameters, the technical bottleneck of memory bandwidth is becoming a critical obstacle to realizing their full potential."

https://www.lycee.ai/blog/2024-09-04-samsung-memory-bottlene...

> Or even an architecture akin to an astonishingly large number of RP2050's.

Groq and Cerebras are probably that kind of architecture.

I'd love to watch an LLM run in WebGL where everything is textures. It would be neat to see the differences between architectures visually.

Memory movement is the bottleneck these days, hence the expensive HBM. Nvidia's designs are also memory-optimized, since memory is the true bottleneck both chip-wise and system-wise.

Normally it's FPGA + memory first; once it hits a sweet spot in the market with enough volume, you turn the FPGA design into an ASIC for performance and cost savings. Big companies will go ASIC directly.

Claims 90% memory reduction, with OSS code for replication on standard GPUs: https://github.com/ridgerchu/matmulfreellm

> ..avoid using matrix multiplication using two main techniques. The first is a method to force all the numbers within the matrices to be ternary, meaning they can take one of three values: negative one, zero, or positive one. This allows the computation to be reduced to summing numbers rather than multiplying.. Instead of multiplying every single number in one matrix with every single number in the other matrix.. the matrices are overlaid and only the most important operations are performed.. researchers were able to maintain the performance of the neural network by introducing time-based computation in the training of the model. This enables the network to have a “memory” of the important information it processes, enhancing performance.

> ... On standard GPUs.. network achieved about 10 times less memory consumption and operated about 25 percent faster.. could provide a path forward to enabling the algorithms to run at full capacity on devices with smaller memory like smartphones.. Over three weeks, the team created a [FPGA] prototype of their hardware.. surpasses human-readable throughput.. on just 13 watts of power. Using GPUs would require about 700 watts of power, meaning that the custom hardware achieved more than 50 times the efficiency of GPUs.

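As a rough illustration of the ternary trick described in that quote, here is a minimal NumPy sketch (my own toy example, not code from the linked repo): once the weights are restricted to {-1, 0, +1}, each output element reduces to adding and subtracting selected activations, with no multiplications.

```python
import numpy as np

def ternary_quantize(w, threshold=0.05):
    """Toy quantizer: map weights to {-1, 0, +1} by sign, zeroing small values."""
    q = np.sign(w)
    q[np.abs(w) < threshold] = 0
    return q.astype(np.int8)

def ternary_matvec(W_ternary, x):
    """Matrix-vector product where W holds only -1/0/+1.

    No multiplications needed: for each row, add the activations where the
    weight is +1 and subtract those where it is -1.
    """
    out = np.empty(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Sanity check against an ordinary dot product using the same ternary weights.
rng = np.random.default_rng(0)
W = ternary_quantize(rng.normal(size=(8, 16)))
x = rng.normal(size=16)
assert np.allclose(ternary_matvec(W, x), W @ x)
```

A real kernel would pack the ternary weights (about 1.6 bits of information each) instead of storing them as full-width floats, which is where the large memory reduction comes from.
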
I'd expect it to be MAC hardware embedded on the DRAM die (or, in the case of stacked HBM, possibly on the substrate die).

To quote an old article about such acceleration, which sees 19x improvements over DRAM + GPU:

https://arxiv.org/pdf/2105.03736

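To make the data-movement argument concrete, here is a back-of-the-envelope sketch (my own toy model and numbers, not taken from the paper above): in a conventional GEMV every weight byte has to cross the memory bus, whereas with MAC units sitting next to the DRAM banks only small per-bank partial sums do.

```python
import numpy as np

def bytes_over_bus(rows, cols, n_banks, bytes_per_weight=2, bytes_per_partial=4):
    """Toy bus-traffic model for one matrix-vector product.

    Conventional: the full weight matrix streams to the compute die.
    PIM-style:    weights stay in the banks; each bank returns one partial
                  sum per output row, which the host then reduces.
    (Activation traffic is ignored in both cases for simplicity.)
    """
    conventional = rows * cols * bytes_per_weight
    pim = rows * n_banks * bytes_per_partial
    return conventional, pim

# Functional check: splitting the columns across "banks" and reducing the
# per-bank partial sums gives the same result as the full GEMV.
rng = np.random.default_rng(0)
rows, cols, n_banks = 4096, 4096, 16
W = rng.normal(size=(rows, cols))
x = rng.normal(size=cols)

bank_slices = np.array_split(np.arange(cols), n_banks)
partials = np.stack([W[:, s] @ x[s] for s in bank_slices])  # computed "in memory"
y = partials.sum(axis=0)                                    # cheap host-side reduction
assert np.allclose(y, W @ x)

conv, pim = bytes_over_bus(rows, cols, n_banks)
print(f"bus traffic per GEMV: conventional ~{conv/1e6:.0f} MB, PIM-style ~{pim/1e6:.2f} MB")
```
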
+1. Personal opinion: accelerators are useful today, but they have kept us in a local minimum which is certainly not ideal. There are interesting approaches such as near-linear low-rank approximation of attention gradients [1]. Would we rather have that, or somewhat better constant factors?

[1] https://arxiv.org/html/2408.13233v1

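For intuition on why low-rank approaches are plausible at all, here is a small sketch (it only illustrates the approximate low-rank structure of softmax attention matrices that such methods exploit; it is not the gradient algorithm from [1]): build an attention matrix from random queries and keys and measure how much of it a rank-r truncated SVD captures.

```python
import numpy as np

def attention_matrix(Q, K):
    """Row-wise softmax of Q K^T / sqrt(d), as in standard attention."""
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 512, 32
A = attention_matrix(rng.normal(size=(n, d)), rng.normal(size=(n, d)))

# Best rank-r approximations via truncated SVD, with relative Frobenius error.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for r in (4, 16, 64):
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]
    err = np.linalg.norm(A - A_r) / np.linalg.norm(A)
    print(f"rank {r:3d}: relative error {err:.3f}")
```
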
Oh, absolutely. Never switching to digital would be the way, and that's not hard for low bit counts like 4. I'm very interested in the methodology if they do this with 64-bit.

Is there a "nice" way to read content on Arxiv?

Every time I land on that site I'm so confused / lost in its interface (or lack thereof) that I usually end up leaving without getting to the content.

Same here -- I visited this link earlier today and thought "Oh, it's just an abstract, I'm out". I've read Arxiv papers before but the UI just doesn't look like it offers any content.

This explains the success of Groq's ASIC-powered LPUs. LLM inference on Groq Cloud is blazingly fast. Also, the reduction in energy consumption is nice.

As early as the 90s it was observed that CPU speed (FLOPS) was improving faster than memory bandwidth. In 1995, William Wulf and Sally McKee predicted this divergence would lead to a “memory wall”, where most computations would be bottlenecked by data access rather than arithmetic operations.
Over the past 20 years peak server hardware FLOPS has been scaling at 3x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively.
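Those growth rates compound quickly. A quick calculation with the figures above (my arithmetic, not from the paper) shows how wide the gap has become over 20 years:

```python
# Compounding the quoted growth rates over 20 years = ten 2-year periods.
periods = 20 / 2
flops_growth = 3.0 ** periods         # ~59,000x
dram_bw_growth = 1.6 ** periods       # ~110x
interconnect_growth = 1.4 ** periods  # ~29x

print(f"FLOPS:        {flops_growth:,.0f}x")
print(f"DRAM BW:      {dram_bw_growth:,.0f}x")
print(f"Interconnect: {interconnect_growth:,.0f}x")
print(f"FLOPS outgrew DRAM bandwidth by ~{flops_growth / dram_bw_growth:,.0f}x")
```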
Thus for training and inference of LLMs, the performance bottleneck is increasingly shifting toward memory bandwidth. Particularly for autoregressive Transformer decoder models, it can be the dominant bottleneck.
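A roofline-style back-of-the-envelope calculation shows why single-stream autoregressive decoding in particular is memory-bound: generating one token reads every weight once but performs only ~2 FLOPs per weight. The hardware and model numbers below are illustrative assumptions, not figures from the paper.

```python
# Roofline-style estimate for batch-1 autoregressive decoding (toy numbers).
peak_flops = 1000e12       # assumed accelerator peak: 1000 TFLOP/s
mem_bandwidth = 3.35e12    # assumed HBM bandwidth: 3.35 TB/s

params = 70e9                  # assumed 70B-parameter model
bytes_per_token = params * 2   # fp16 weights, each read once per token
flops_per_token = params * 2   # ~2 FLOPs (multiply + add) per weight per token

arithmetic_intensity = flops_per_token / bytes_per_token  # 1 FLOP/byte
machine_balance = peak_flops / mem_bandwidth              # ~300 FLOP/byte

t_compute = flops_per_token / peak_flops
t_memory = bytes_per_token / mem_bandwidth
print(f"arithmetic intensity {arithmetic_intensity:.1f} FLOP/byte "
      f"vs machine balance {machine_balance:.0f} FLOP/byte")
print(f"compute-limited: {t_compute*1e3:.2f} ms/token, "
      f"bandwidth-limited: {t_memory*1e3:.1f} ms/token")
# When arithmetic intensity << machine balance, bandwidth sets the token rate.
```

With these assumed numbers the bandwidth-limited time per token is roughly 300x the compute-limited time, which is why the weights' trip from DRAM, not the math, dominates decoding latency.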
This is driving the need for new tech like compute-in-memory (CIM), also known as processing-in-memory (PIM): hardware in which operations are performed directly on the data in memory, rather than first transferring it to CPU registers, thereby improving latency and power consumption and possibly sidestepping the great “memory wall”.
Notably, to compare ASIC and FPGA hardware across varying semiconductor process sizes, the paper uses a fitted polynomial to extrapolate to a common denominator of 16nm:
> Based on the article by Aaron Stillmaker and B. Baas titled "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm," we extrapolated the performance and the energy efficiency on a 16nm technology to make a fair comparison
But extrapolation for CIM/PIM is not done because they claim:
> As the in-memory accelerators the performance is not based only on the process technology, the extrapolation is performed only on the FPGA and ASIC accelerators where the process technology affects significantly the performance of the systems.
Which strikes me as an odd claim at face value, but perhaps others here could offer further insight on that decision.
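For a sense of what that normalization does mechanically, here is a hedged sketch; the scaling factors below are made-up placeholders, not the fitted polynomials from Stillmaker and Baas. Each design's reported throughput and energy are rescaled by node-dependent delay and energy factors so that accelerators built on different nodes can be compared at a nominal 16nm.

```python
# Illustrative node normalization. These factors are placeholders, NOT the
# fitted scaling equations from Stillmaker & Baas.
DELAY_VS_16NM = {45: 2.5, 28: 1.6, 16: 1.0, 7: 0.6}    # relative gate delay (assumed)
ENERGY_VS_16NM = {45: 4.0, 28: 2.2, 16: 1.0, 7: 0.5}   # relative energy per op (assumed)

def normalize_to_16nm(node_nm, throughput_tops, energy_pj_per_op):
    """Rescale a design's reported numbers as if it were built on 16nm."""
    return (throughput_tops * DELAY_VS_16NM[node_nm],    # faster gates -> higher throughput
            energy_pj_per_op / ENERGY_VS_16NM[node_nm])  # cheaper switching -> lower energy

# Example: a hypothetical 28nm ASIC reporting 2 TOPS at 1.5 pJ/op.
print(normalize_to_16nm(28, throughput_tops=2.0, energy_pj_per_op=1.5))
```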
Links below for further reading.
https://arxiv.org/abs/2403.14123
https://en.m.wikipedia.org/wiki/In-memory_processing
http://vcl.ece.ucdavis.edu/pubs/2017.02.VLSIintegration.Tech...