The YouTube channel “Welch Labs” recently did a good history of the logarithm and how it revolutionized the speed at which people could do math.
ThinkGeek sold a cheap plastic slide rule once. Its motion was too sticky for me to, well, stick with learning to use it. I agree there's a good product idea here if you can do it better.
The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from looking at FLOPs alone.
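A back-of-envelope sketch of why (the flop and byte counts here are simplified assumptions, not measurements): a naive softmax does only a handful of flops per element it moves, so its arithmetic intensity is constant, while a matmul's flops-per-byte ratio grows with the matrix size.

```c
#include <stdio.h>

/* Rough arithmetic intensity (flops per byte moved).
   Softmax over n fp32 values: ~4 flops/element (max, subtract, exp,
   divide) against ~8 bytes/element (one read + one write).
   Matmul (n x n) * (n x n): 2n^3 flops against ~12n^2 bytes
   (read A and B, write C) -- intensity grows linearly with n. */
int main(void) {
    double n = 4096.0;
    double softmax = (4.0 * n) / (8.0 * n);
    double matmul  = (2.0 * n * n * n) / (12.0 * n * n);
    printf("softmax: %.2f flop/byte, matmul at n=%.0f: %.0f flop/byte\n",
           softmax, n, matmul);
    return 0;
}
```

At half a flop per byte, softmax sits far below the compute/bandwidth balance point of any modern accelerator, so it runs at memory speed no matter how fast the ALUs are.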
In transformers the attention matrix is N×N, so there are a lot of values to go over, which typically makes it memory-bandwidth bound rather than compute bound.
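To put a number on it, a tiny sketch (the sequence length, head count, and dtype are illustrative assumptions):

```c
#include <stdio.h>

/* Memory traffic for the attention score matrix alone: softmax must
   read and write all n*n scores, per head, per layer. */
int main(void) {
    double n     = 8192.0;   /* sequence length (illustrative) */
    double heads = 32.0;     /* attention heads (illustrative) */
    double bytes = 2.0;      /* fp16 scores */
    double traffic = n * n * heads * bytes * 2.0;   /* read + write */
    printf("~%.1f GiB moved per layer just for softmax over the scores\n",
           traffic / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```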
> replaces short[65536] look up table

Is that not quite dim to begin with (having a LUT the size of the whole L1 cache), or does it work surprisingly well because of some probabilistic fudging?
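For context, this is presumably the pattern in question (a sketch, not the actual ggml code: the table is indexed by the raw fp16 bit pattern so every possible input is precomputed, and `fp16_to_float` here leans on the compiler's `_Float16` extension):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Sketch of an fp16-indexed exp() table: 2^16 entries, one per possible
   half-precision bit pattern, so evaluation is a single load. At 4 bytes
   per entry the table is 256 KiB -- far larger than a typical L1 cache,
   which is exactly the concern raised above. */
static float exp_table[1 << 16];

/* Reinterpret a raw fp16 bit pattern as a float; requires _Float16
   support (a GCC/Clang extension on most modern targets). */
static float fp16_to_float(uint16_t bits) {
    _Float16 h;
    memcpy(&h, &bits, sizeof h);
    return (float)h;
}

void init_exp_table(void) {
    for (uint32_t i = 0; i < (1u << 16); i++)
        exp_table[i] = expf(fp16_to_float((uint16_t)i));
}

static inline float exp_f16(uint16_t bits) {
    return exp_table[bits];   /* one (hopefully cached) load */
}
```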
Perhaps somewhat off topic, but does anyone know how something like ggml compares to runtimes such as TensorFlow Lite, ONNX Runtime, etc.?
At this point, is gguf/llama.cpp the more performant solution for unbatched inference on CUDA devices, or does exllamav2 + FlashAttention still reign supreme?
Great work, but what's their goal? Are they trying to make that GeLU approximation go faster? Things would probably go a lot faster by just going back to erff().
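For reference, the two variants being compared (a sketch: the tanh form is the common approximation from the GELU paper, the erff() form is the exact definition):

```c
#include <math.h>

/* Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))) */
static inline float gelu_erf(float x) {
    return 0.5f * x * (1.0f + erff(x * 0.70710678f));   /* 1/sqrt(2) */
}

/* Common tanh approximation:
   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))) */
static inline float gelu_tanh(float x) {
    return 0.5f * x
         * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}
```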
Yes, but a direct `exp` implementation is only like 10-20 FMAs depending on how much accuracy you want. No gathering or permuting will really compete with straight math.
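Something along these lines (a sketch with plain Taylor coefficients; a production version would use tuned minimax coefficients and a two-constant Cody-Waite reduction):

```c
#include <math.h>
#include <stdint.h>

/* Direct expf: range-reduce x = k*ln2 + r, evaluate a short polynomial
   for e^r on [-ln2/2, ln2/2], then scale by 2^k via the float exponent.
   Roughly a half dozen FMAs plus the reduction -- all straight math. */
static inline float fast_expf(float x) {
    if (x >  88.0f) return INFINITY;   /* avoid exponent overflow  */
    if (x < -87.0f) return 0.0f;       /* avoid exponent underflow */

    float k = rintf(x * 1.44269504f);  /* round(x / ln2) */
    float r = x - k * 0.69314718f;     /* r = x - k*ln2  */

    /* Degree-5 Taylor polynomial for e^r (Horner form). */
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (0.16666667f
             + r * (0.041666668f + r * 0.0083333338f))));

    /* 2^k by building the float's exponent field directly. */
    union { uint32_t u; float f; } s = { ((uint32_t)((int32_t)k + 127)) << 23 };
    return p * s.f;
}
```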