The YouTube channel “Welch Labs” recently did a good history of the logarithm and how it revolutionized the speed at which people could do math.
ThinkGeek sold a cheap plastic slide rule once. Its motion was too sticky for me to, well, stick with learning to use it. I agree there's a good product idea here if you can do it better.
The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from looking at FLOPs alone.
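A back-of-envelope sketch of why (the flop and byte counts here are simplified assumptions, not measurements): a naive softmax does only a handful of flops per element it moves, so its arithmetic intensity is constant, while a matmul's flops-per-byte ratio grows with the matrix size.

```c
#include <stdio.h>

/* Rough arithmetic intensity (flops per byte moved).
   Softmax over n fp32 values: ~4 flops/element (max, subtract, exp,
   divide) against ~8 bytes/element (one read + one write).
   Matmul (n x n) * (n x n): 2n^3 flops against ~12n^2 bytes
   (read A and B, write C) -- intensity grows linearly with n. */
int main(void) {
    double n = 4096.0;
    double softmax = (4.0 * n) / (8.0 * n);
    double matmul  = (2.0 * n * n * n) / (12.0 * n * n);
    printf("softmax: %.2f flop/byte, matmul at n=%.0f: %.0f flop/byte\n",
           softmax, n, matmul);
    return 0;
}
```

At half a flop per byte, softmax sits far below the compute/bandwidth balance point of any modern accelerator, so it runs at memory speed no matter how fast the ALUs are.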
In transformers the attention matrix is N×N, so there are a lot of values to go over, which typically makes it memory-bandwidth bound rather than compute bound.
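To put a number on it, a tiny sketch (the sequence length, head count, and dtype are illustrative assumptions):

```c
#include <stdio.h>

/* Memory traffic for the attention score matrix alone: softmax must
   read and write all n*n scores, per head, per layer. */
int main(void) {
    double n     = 8192.0;   /* sequence length (illustrative) */
    double heads = 32.0;     /* attention heads (illustrative) */
    double bytes = 2.0;      /* fp16 scores */
    double traffic = n * n * heads * bytes * 2.0;   /* read + write */
    printf("~%.1f GiB moved per layer just for softmax over the scores\n",
           traffic / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```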
> replaces short[65536] look up table

Is that not quite dim to begin with (having a LUT the size of the whole L1 cache), or does it work surprisingly well because of some probabilistic fudging?
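For context, this is presumably the pattern in question (a sketch, not the actual ggml code: the table is indexed by the raw fp16 bit pattern so every possible input is precomputed, and `fp16_to_float` here leans on the compiler's `_Float16` extension):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Sketch of an fp16-indexed exp() table: 2^16 entries, one per possible
   half-precision bit pattern, so evaluation is a single load. At 4 bytes
   per entry the table is 256 KiB -- far larger than a typical L1 cache,
   which is exactly the concern raised above. */
static float exp_table[1 << 16];

/* Reinterpret a raw fp16 bit pattern as a float; requires _Float16
   support (a GCC/Clang extension on most modern targets). */
static float fp16_to_float(uint16_t bits) {
    _Float16 h;
    memcpy(&h, &bits, sizeof h);
    return (float)h;
}

void init_exp_table(void) {
    for (uint32_t i = 0; i < (1u << 16); i++)
        exp_table[i] = expf(fp16_to_float((uint16_t)i));
}

static inline float exp_f16(uint16_t bits) {
    return exp_table[bits];   /* one (hopefully cached) load */
}
```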
Perhaps somewhat off topic, but does anyone know how something like ggml compares to runtimes such as TensorFlow Lite, ONNX Runtime, etc.?
At this point, is gguf/llama.cpp the more performant solution for unbatched inference on CUDA devices, or does exllamav2 + FlashAttention still reign supreme?
Great work, but what's their goal? Are they trying to make that GeLU approximation go faster? Things would probably go a lot faster by just going back to erff().
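For reference, the two variants being compared (a sketch: the tanh form is the common approximation from the GELU paper, the erff() form is the exact definition):

```c
#include <math.h>

/* Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))) */
static inline float gelu_erf(float x) {
    return 0.5f * x * (1.0f + erff(x * 0.70710678f));   /* 1/sqrt(2) */
}

/* Common tanh approximation:
   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))) */
static inline float gelu_tanh(float x) {
    return 0.5f * x
         * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}
```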
Yes, but a direct `exp` implementation is only like 10-20 FMAs depending on how much accuracy you want. No gathering or permuting will really compete with straight math.
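Something along these lines (a sketch with plain Taylor coefficients; a production version would use tuned minimax coefficients and a two-constant Cody-Waite reduction):

```c
#include <math.h>
#include <stdint.h>

/* Direct expf: range-reduce x = k*ln2 + r, evaluate a short polynomial
   for e^r on [-ln2/2, ln2/2], then scale by 2^k via the float exponent.
   Roughly a half dozen FMAs plus the reduction -- all straight math. */
static inline float fast_expf(float x) {
    if (x >  88.0f) return INFINITY;   /* avoid exponent overflow  */
    if (x < -87.0f) return 0.0f;       /* avoid exponent underflow */

    float k = rintf(x * 1.44269504f);  /* round(x / ln2) */
    float r = x - k * 0.69314718f;     /* r = x - k*ln2  */

    /* Degree-5 Taylor polynomial for e^r (Horner form). */
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (0.16666667f
             + r * (0.041666668f + r * 0.0083333338f))));

    /* 2^k by building the float's exponent field directly. */
    union { uint32_t u; float f; } s = { ((uint32_t)((int32_t)k + 127)) << 23 };
    return p * s.f;
}
```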