Conceptually, just a bit; practically (in terms of implementation), a lot. The standard Python implementation internally compiles a kernel for your specific hardware.
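As a rough illustration of that compile-for-your-hardware step, here is a minimal sketch of the dispatch pattern, assuming a CUDA-capable PyTorch install. The kernel names and the capability thresholds are made up for illustration and are not the actual flash-attn internals:

```python
# Hypothetical sketch of per-device kernel selection; NOT the actual flash-attn internals.
import torch

def pick_attention_kernel() -> str:
    """Pick a kernel variant for the local GPU (kernel names are illustrative only)."""
    if not torch.cuda.is_available():
        return "cpu_reference_attention"
    major, _minor = torch.cuda.get_device_capability()
    if major >= 9:    # Hopper (e.g. H100): an FA3-style kernel could apply
        return "fa3_hopper_kernel"
    if major >= 8:    # Ampere/Ada (e.g. A100, 3090, 4090): an FA2-style kernel
        return "fa2_ampere_kernel"
    return "reference_sdpa"  # older GPUs: fall back to plain PyTorch attention

print(pick_attention_kernel())
```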
RVSDG (and the like) is needed to cope with shared state like globally addressable memory; plain e-graphs are only suited for pure code.

Happy to chat more btw, feel free to hit me up on discord.
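As a rough illustration of the purity point above, here is a toy sketch of my own (not from RVSDG or any e-graph library): rewriting a pure expression tree is always sound because nothing depends on evaluation order, but as soon as loads and stores enter the picture a rewrite can silently reorder effects, which is the gap RVSDG-style IRs are designed to close.

```python
# Toy illustration (not a real e-graph): rewriting pure expression trees.
# Expressions are nested tuples, e.g. ("mul", ("add", "x", 0), 2).

def rewrite(expr):
    """Apply simple algebraic rewrites bottom-up. Sound only because the
    expressions are pure: no loads, stores, or other effects to reorder."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [rewrite(a) for a in args]
    if op == "add" and args[1] == 0:   # x + 0  ->  x
        return args[0]
    if op == "mul" and args[1] == 2:   # x * 2  ->  x << 1
        return ("shl", args[0], 1)
    return (op, *args)

print(rewrite(("mul", ("add", "x", 0), 2)))   # ('shl', 'x', 1)

# With shared mutable state the same trick breaks down: you cannot freely move
# ("load", p) past ("store", p, v) without tracking ordering, which is the kind
# of structure RVSDG-style IRs make explicit.
```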
If anyone wants to port this over to ROCm / AMD MI300x, reach out to me: [email protected] (we won't ever spam you).

Happy to donate the compute time for this work.
Not trying to be rude, but what is the thinking behind this offer? Why would someone do this port for… free, save for access to the hardware? What's the upside for them?
Perhaps I phrased my question wrong; I think you answered what you are getting out of this. My question is what the person writing code for you is getting out of it.
> FlashAttention-3 is optimized for Hopper GPUs (e.g. H100).

How does FA3 fare for consumer GPUs such as the 3090 and 4090?
This is one of the most important improvements in all of AI, because it benefits most AI users: they get more out of the same hardware, faster, with little to no tradeoffs.
I am wondering why flash attention is something like 5x slower with variable masking than without it. Lack of good masking support almost cancels out the optimizations.
Tri's publication history has been leaning toward SSM and Mamba style architectures recently. Unlike Flash Attention, which has quadratic time complexity with respect to sequence length, these latest algorithms are subquadratic. Thus they do much less computation, instead of just doing it more efficiently a la Flash Attention.
Dao and Gu published a really long paper this year which demonstrated (among other things) how Mamba/SSM can be formulated such that it’s amenable to acceleration using the same hardware primitives that Transformers benefit from.
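As a rough, self-contained illustration of the quadratic-vs-subquadratic contrast (my own sketch, not the actual FlashAttention or Mamba kernels): exact attention materializes work proportional to an L x L score matrix, so cost grows roughly as O(L² d), while an SSM-style recurrence carries a fixed-size state through the sequence, so cost grows as O(L d N), linear in L. The dimensions below are example values.

```python
# Illustrative complexity comparison; not the real FlashAttention or Mamba kernels.
import torch

L, d, N = 1024, 64, 16   # sequence length, model dim, SSM state size (example values)

q = k = v = torch.randn(L, d)

# Exact attention: the L x L score matrix makes cost scale as O(L^2 * d).
scores = (q @ k.T) / d**0.5                 # (L, L)
attn_out = torch.softmax(scores, dim=-1) @ v

# Toy SSM-style recurrence: a fixed-size state h is updated once per step,
# so cost scales as O(L * d * N) -- linear in L.
A = torch.rand(d, N) * 0.9                  # per-(channel, state) decay
B = torch.randn(d, N)
C = torch.randn(d, N)
x = torch.randn(L, d)

h = torch.zeros(d, N)
ssm_out = []
for t in range(L):                          # real implementations use a parallel scan
    h = A * h + B * x[t].unsqueeze(-1)
    ssm_out.append((h * C).sum(-1))
ssm_out = torch.stack(ssm_out)              # (L, d)

print(attn_out.shape, ssm_out.shape)
```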