This is a great way to put it. That said, it was not obvious to me that the attention space (how it is structured in LLMs) is a frequency domain.
Mostly because of computational efficiency, IIRC. The non-linearity doesn't seem to have much impact, so picking one that's fast is a more efficient use of limited computational resources.
> In other words, work in the domain that is natural to your data.

Why would multiplication be more "natural" to a domain than convolution, as opposed to just simpler to calculate?
| One way to think about things is in terms of diagonalization. A generic linear operator is a fairly complicated thing that mixes information from different dimensions. When you diagonalize an operator, it's the same operator, but you're choosing a coordinate system where it becomes clear that all it was really doing was stretching along each coordinate axis, so you've broken it down into something that acts independently in a simple way on each dimension. The Fourier transform is unitary, so the intuition is that you're basically doing something like a rigid transformation of space (e.g. a rotation), and when you look from the correct angle, you see that your original operator wasn't some complicated "add a stretched version of this dimension to this other dimension minus a stretched version of a third dimension", but just "stretch this by this, this by this, etc."
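The diagonalization picture can be checked numerically: a circular-convolution (circulant) matrix looks complicated in the standard basis, but conjugating by the DFT matrix reveals it was only ever a per-frequency stretch. A small NumPy sketch (the vector `c` here is an arbitrary example):

```python
import numpy as np

# A circulant matrix implements circular convolution with its first column.
c = np.array([2.0, 1.0, 0.0, 1.0])
n = len(c)
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

# Conjugating by the DFT diagonalizes C: the eigenvalues are the DFT of c.
F = np.fft.fft(np.eye(n))             # the n-by-n DFT matrix
D = F @ C @ np.linalg.inv(F)          # numerically diagonal
off_diag = D - np.diag(np.diag(D))
print(np.allclose(off_diag, 0))                 # True
print(np.allclose(np.diag(D), np.fft.fft(c)))   # True
```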
On the other hand, convolution itself is already "just" multiplication. E.g. multiplying polynomials is convolution of their coefficients (to get the x^n coefficient, you need to add up all the combinations of a_i a_j x^i x^j where i+j=n), and this point of view also applies to e.g. linear time-invariant systems [0] by thinking of your function as the weights of an infinite weighted sum (so sort of an infinite polynomial) of time-shift operators (and this point of view works for other groups, not just time shifts). So f(t) is then the "coefficient" for the t-shift, and multiplying two such weighted sums again has you convolve the coefficients (so your original functions). The jargon way to say this is that your space of G-invariant functions is secretly the group algebra generated by G (G being a group). From that point of view, convolution is the "natural" multiplication on G-invariant functions.

One can then ask whether there's a Fourier transform for other groups, which leads to abstract harmonic analysis. E.g. the Mellin transform is the Fourier transform for scale/stretch-invariant functions as opposed to shift-invariant ones.

[0] The typical systems that one studies in signal processing contexts where convolution and Fourier transforms are your bread and butter: https://en.wikipedia.org/wiki/Linear_time-invariant_system
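The polynomial picture can be sketched directly: multiplying coefficient lists is convolution, and the convolution theorem lets you compute the same product by pointwise multiplication in the Fourier domain. A small NumPy illustration (the two polynomials are arbitrary examples):

```python
import numpy as np

# Multiplying polynomials convolves their coefficient lists:
# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
a = [1, 2]          # 1 + 2x
b = [3, 4]          # 3 + 4x
print(np.convolve(a, b))                     # [ 3 10  8]

# The same product via the Fourier domain: pad to the full output length,
# transform, multiply pointwise, transform back (the convolution theorem).
n = len(a) + len(b) - 1
fa = np.fft.fft(a, n)
fb = np.fft.fft(b, n)
print(np.fft.ifft(fa * fb).real.round(6))    # [ 3. 10.  8.]
```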
> just simpler to calculate

That's all that "natural" means in this context. It's like "elegant": if it just takes less effort to get the result you need, then why wouldn't you take the easier route?
Unfortunately that's very vague, because there are many notions of "dual space" within applied mathematics, even just considering the parts relevant to engineering applications.
The GPU saw a 10% improvement over the TPU.

> The TPU is so inefficient at FTs that the researchers did not use the FFT algorithm on sequences < 4096 elements, instead opting for a quadratic-scaling FT implementation using a pre-computed DFT matrix.

> On an Nvidia Quadro P6000 GPU, the FT was responsible for up to 30% of the inference time on the FNet architecture [0]

This company [0] claimed in 2021 they could squash inference time by 40% if Google would use their light chips on the TPU. Perhaps more if FFTNet does more heavy lifting.

[0]: https://scribe.rip/optalysys/attention-fourier-transforms-a-...
I would guess that the FFT scales better as you increase the number of tokens in the context window. Interestingly, Google's models outperform their competitors on context size.
> faster than FFT

Not only that, but FFT support on TPU has always been best effort. Last I tried this, there were serious precision issues.
FFT is only O(N log N) for a vector of length N. For an N-by-N matrix it would be more like O(N^2 log N): you perform an FFT for each row or column.
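That row-then-column decomposition is exactly how a 2-D FFT is built, which a quick NumPy check confirms (the 8-by-8 size is an arbitrary example):

```python
import numpy as np

# For an N-by-N matrix, a 2-D FFT is N row FFTs followed by N column FFTs,
# i.e. 2N one-dimensional O(N log N) transforms: O(N^2 log N) overall.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))

rows = np.fft.fft(A, axis=1)         # FFT of each row
full = np.fft.fft(rows, axis=0)      # then FFT of each column
print(np.allclose(full, np.fft.fft2(A)))   # True
```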
The Fourier transform is taken along the "token" dimension. However, in many applications, this dimension is not meaningful. That's why transformers are a great option for consuming data which is permutation-invariant.

I would like to see additional experiments using the lesser-known Fourier transform over finite groups [1], which is permutation-invariant but shares many properties with the standard Fourier transform.

I also wonder, if this becomes the next big thing for LLMs, how easy will it be for inference engines (e.g. vLLM, llama.cpp) to integrate it?

[1] https://en.wikipedia.org/wiki/Fourier_transform_on_finite_gr...
You mean for the group operation to be standard modular addition? In that case (as a sibling comment says) you'll recover the classic discrete Fourier transform.
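That recovery is easy to verify: stacking the characters of the cyclic group Z/n as rows gives exactly the classic DFT matrix. A small NumPy sketch (n = 8 is an arbitrary choice):

```python
import numpy as np

# The characters of Z/n are k -> exp(-2*pi*i*j*k/n) for j = 0..n-1.
# Stacking them as rows reproduces the classic DFT matrix.
n = 8
idx = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(idx, idx) / n)

x = np.arange(n, dtype=float)
print(np.allclose(F @ x, np.fft.fft(x)))   # True
```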
Not an expert in this space.

Aren't tokens transformed with position-dependent information in most models? I believe llama applies a rotation to the vector based on the position in the input.
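For the curious, here is a minimal sketch of that rotation idea (rotary position embeddings, roughly as used in llama). The function name and frequency base are illustrative, not any particular library's API:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent
    angles: a minimal sketch of rotary position embeddings."""
    d = x.shape[-1]
    # One frequency per pair of dimensions, decaying geometrically.
    freqs = base ** (-np.arange(0, d, 2) / d)
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
k = np.arange(8.0)
# Position 0 leaves the vector unchanged.
print(np.allclose(rope(q, 0), q))   # True
# Dot products depend only on the relative position (7-5 == 2-0).
print(np.isclose(rope(q, 5) @ rope(k, 7), rope(q, 0) @ rope(k, 2)))  # True
```

The second check is the property that makes the rotation attractive for attention: query-key scores see only relative offsets.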
They do show their model as winning every category in the Long Range Arena (LRA) benchmark. Hopefully they have not excluded losing categories or better models.
| Having a decent intuitive notion of Fourier transforms is an incredibly useful tool in your toolbox, even if you can't derive a Fourier transform by hand or write a fast Fourier transform (FFT) algorithm.
The basic idea is this: (almost) any (useful) signal can be represented as a sum of sine waves with different frequencies and phases. For example, an electrical signal or a sound wave is a one-dimensional signal where the x-axis is time. This might look like a really complex squiggly line that's hard to work with. Using a Fourier transform, you can separate the individual frequencies of that time-based signal. Then, you can modify the specific frequencies however you want.

For example, if you have a lot of random, spiky noise in the signal, that will show up as high frequencies. To clean it up, just do a Fourier transform, throw out any data with a frequency above a certain threshold, and then run an inverse Fourier transform on the remaining data to get back a smoother version of the original signal. This is called a low-pass filter, and it's more or less equivalent to taking a moving average of the original signal.

Where it gets really fun is that you can extend this, in a pretty straightforward way, to higher dimensions. A two-dimensional signal, where both the x- and y-axes are space, is just an image. JPEG compression is based on this concept: it removes the high-frequency signal in the image in order to store the data in a more compact form, at the expense of losing some fine detail (or creating those ring-like artifacts, if you throw out too much data). Add a third dimension for time, and now you have video. And so on.

The nice thing about all this is that it's very visual, so you can get a good intuition for it without having to know all the math inside and out. Here's a good page with lots of visualizations and interactive examples: https://www.jezzamon.com/fourier/index.html

And this 3Blue1Brown video does a good job of explaining it: https://youtu.be/spUNpyF58BY?si=dz0z-s8NftW3Htun
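The low-pass filter described above is only a few lines of NumPy; the signal, noise level, and cutoff below are arbitrary choices for illustration:

```python
import numpy as np

# A noisy signal: a slow sine wave plus random spikes.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)
clean = np.sin(2 * np.pi * 3 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)

# Low-pass filter: transform, zero frequencies above a cutoff, invert.
spectrum = np.fft.rfft(noisy)
spectrum[10:] = 0                     # keep only the 10 lowest-frequency bins
smoothed = np.fft.irfft(spectrum, n=t.size)

# The filtered signal is much closer to the clean one than the noisy input.
print(np.abs(smoothed - clean).mean() < np.abs(noisy - clean).mean())  # True
```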
How does this compare with Mamba? Intuitively it feels like this should have similar performance to Mamba, but the math is a lot simpler and easier to understand.
Working in the Fourier domain has been a staple of scientific and engineering applications. Learning those interactions, rather than just hardcoding them, has been fairly widely explored as well; the term to look for is Fourier Neural Operators [1][2]. It turns out you can prove universality even if the nonlinearity remains in the real domain [3].

The concept is fairly mainstream nowadays, to the degree that Jensen talked about it in his GTC keynote in 2021 [4] and there's even a mainstage TED talk about its applications [5]. A nice property of doing things this way is that your model ends up being resolution-invariant, which is particularly interesting for engineering domains. Scaling these methods has sparked the "let's do a fully deep-learning-based weather model" race [6][7].

As for using this on text data: my intuition is that it is not going to work as well, because of a fairly unique property of text. For image, video, and scientific data, each individual element is of approximately equal importance, whereas in text you can have discrete tokens like a "not" somewhere in there that change the meaning of everything around them fairly significantly, and you'd want the all-to-all interaction to capture that. Any kind of mixing that smoothes things out is going to be inherently at a disadvantage; that is probably true to some degree for most of these efficiency-saving methods, and why we're seeing more limited adoption on text.

[1] https://arxiv.org/abs/2010.08895
[2] https://www.nature.com/articles/s42254-024-00712-5
[3] https://jmlr.org/papers/v22/21-0806.html
[4] https://www.youtube.com/watch?v=jhDiaUL_RaM&t=2472s
[5] https://www.ted.com/talks/anima_anandkumar_ai_that_connects_...
[6] https://arxiv.org/abs/2202.11214 (Feb 2022)
[7] https://www.wired.com/story/ai-hurricane-predictions-are-sto...
Complex numbers work just fine on a GPU. You just represent the data as a real-part matrix and an imaginary-part matrix and do the (ac-bd)+(ad+bc)i stuff in matrix land instead of on complex scalars.
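A quick NumPy sanity check of that identity, keeping the real and imaginary parts as separate real matrices:

```python
import numpy as np

def complex_matmul(Ar, Ai, Br, Bi):
    """(Ar + i*Ai) @ (Br + i*Bi) using only real matrix products:
    real part = Ar@Br - Ai@Bi, imaginary part = Ar@Bi + Ai@Br."""
    return Ar @ Br - Ai @ Bi, Ar @ Bi + Ai @ Br

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

Cr, Ci = complex_matmul(A.real, A.imag, B.real, B.imag)
print(np.allclose(Cr + 1j * Ci, A @ B))   # True
```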
I am very out of my depth here with the math, haha. Appreciate those below taking the time to baby-walk me along with the understanding! Great links too!
In image processing, I thought there was a whole host of specialized algorithms, such as edge detection, SCC, etc., that were run before the data was even fed into the ANN.
Yes, in Mamba accuracy seems to go down, and it has trouble with exact token recall. But I would say it might be good for very power-efficient edge deployment and ultra-long contexts.
TL;DR:

1. Take FNet (https://arxiv.org/abs/2105.03824).
2. Replace the fixed (frequency-domain) convolution filter with one that is dynamically computed from the data.
3. Apply non-linear functions to both real and imaginary components, before mapping the convolved data back to the time domain.
Our bodies have parts that do processing in the frequency domain. Who knows, maybe it'll turn out we are thinking in the frequency domain, too.
Yes, the cochlea for one basically performs an FT on the incoming input signal, at least with respect to magnitude. The phase portion is still in the time domain.
Wherever you have a convolution operation on your data, transform them to the conjugate domain to turn it into multiplication [0]. In other words, work in the domain that is natural to your data.

[0] https://en.wikipedia.org/wiki/Convolution_theorem