(comments)

Original link: https://news.ycombinator.com/item?id=39726781

This discussion centers on implementing efficient machine learning kernels, in particular the Transformer and FlashAttention (fast, memory-efficient exact attention with IO-awareness). Several commenters prefer writing custom kernels in Triton because it integrates easily into existing toolchains and, in their view, offers abstractions with no performance cost; they note that even the official FlashAttention project has been moving many implementations from CUTLASS to Triton. Triton's AMD GPU backend is still a work in progress, but the DSL's block-level programming and automatic tuning over hyperparameters make it simpler to write fast kernels. A side thread debates whether zero-cost abstractions really exist, and another clarifies that "Triton" here means the DSL, not the Triton Inference Server.

Related Articles

Original
Show HN: Flash Attention in ~100 lines of CUDA (github.com/tspeterkim)
224 points by tspeterkim 1 day ago | 40 comments


Pretty neat implementation. In general, for these sorts of exercises (and even if the intention is to go to prod with custom kernels) I lean towards Triton to write the kernels themselves. It is much easier to integrate into the toolchain, and allows a level of abstraction that doesn't affect performance even a little bit while providing useful constructs.


yeah even the official flashattention is moving many implementations from cutlass to triton except for the main mha backward/forward pass


It was written with cutlass? No wonder Peter Kim found it valuable and worthwhile to de-obfuscate. Adopting a new programming language invented by OpenAI doesn't sound like a much better alternative. I'd be shocked if either of them were able to build code for AMD GPUs, where it's easy to adapt CUDA code, but not if it's buried in tens of thousands of lines of frameworks. I like open source code to have clarity so I can optimize it for my own production environment myself. When people distribute code they've productionized for themselves, it squeezes out all the alpha and informational value. Just because something's open source doesn't mean it's open source. I think people mostly do it to lick the cookie without giving much away.


Triton has an AMD backend, although work is still ongoing.


You will also be able to use Triton to target Ryzen AI.


As a person who finds CUDA extremely easy to write and integrate, what does Triton have to offer?


block-level rather than thread-level programming, plus automatic optimization across hyperparameters, which makes it much easier to write fast kernels


You mean triton the inference server or triton the DSL for cuda?




they mean the dsl (not just necessarily for cuda)


triton the DSL.


> allows a level of abstraction that doesn't affect performance even a little bit

The second part of this sentence is true because the first part is false.



zero cost abstractions exist. doesn't mean all abstractions are zero-cost. or being zero-cost somehow invalidates their abstractness/genericness. but maybe we differ on the definition of abstractions.


> zero cost abstractions exist

So does perpetual motion :shrug: but my point is Triton is not an abstraction in the least. Source: 1) I spent 6 months investigating targeting other backends 2) Phil himself said he doesn't care to support other backends https://github.com/openai/triton/pull/1797#issuecomment-1730...



It's amazing how heavily moderated HN is. I have a response here that's been deleted that is like 15 words, including a link to a source that corroborates my claim, but that response contains a transcribed emoji and so it's been deleted by dang or whomever. Lol super rich environment for discourse we've got going here.


For those who have no idea what's being discussed, quick background.

Discussing: Transformer [1] memory issues and approximate attention [2] in machine learning training.

Specifically: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. [3]

As a side comment, this entire industry is sorely in need of at least some intros. The entire space has moved so fast in the last year that I need an entirely new dictionary and thesaurus for all the terms they've created. Notably, because of this, I found out Google has a glossary of machine learning terms. Actually somewhat handy.

[1] Google Machine Learning Glossary (Transformer): https://developers.google.com/machine-learning/glossary/#tra...

[2] Same (Attention): https://developers.google.com/machine-learning/glossary/#att...

[3] arXiv: https://arxiv.org/abs/2205.14135



Regarding your comment about how fast the research and industry is moving, would HN readers be interested in relevant one- or two-paragraph summaries that are basically "explain it like I am a machine learning engineer from 2020", but one who also knows the power of these models from the perspective of using ChatGPT or MS Copilot? That is, assume a fair amount of technical knowledge about the fundamentals, but don't assume that the reader has been paying enough attention to have whitebox knowledge of the current state of the art.


I personally have been looking for "explain it like I'm a CS PhD with lots of experience and the ability to look stuff up". But I suspect your summary would be pretty handy as well.


I reckon you need tacit knowledge. Experience. Luckily on the order of 100 hours, not 10,000.

Build a GPT using Python and Pytorch. For a good course: Andrej Karpathy is your keyword. At $1000 his course is great value. But actually it is free which is even better ;-)

It won't take you to flash attention but will ramp you up to the point where you could probably read papers about it. I almost got that far, then life lifed me. But I was able to implement changes to the architecture of GPT and do some “hey mum I am doing SOTA (2021) machine learning”.



That sounds at least somewhat helpful. Honestly, a gradient for some of this stuff would be nice. Explain it to me like I'm: "five", "a high schooler", "a college grad (not CS/ML/Eng)", "a CS/Eng not ML".

Although in a couple of years, kids in restaurants will probably be telling me how they're leveling up attention on their neuro-pet. The singularity is steep.



singularity implies AI increases exponentially, not human intelligence. Kids will not talk about neural nets any time soon.


I would love an explanation for software engineers / CS majors who aren't familiar with ML.

Last I studied ML was 2016 and that was stuff like decision trees, k nearest neighbors...



frankly i don’t really feel like all that much has changed since 2020 except for scale


Zero shot is wrong, but that definition is commonly used.

Zero-shot is testing out of distribution, not just on "a task" not trained on. The latter is ill-defined.

The original definition comes from a few papers. But the classic example is a classifier recognizing zebras despite having never been trained on zebras (though it may have been trained on horses). Those are out of distribution. But importantly, out of the implicit distribution, not the target distribution.

The common improper usage usually confuses these two. A simple example might be training on 256x256 images and testing on 1024x1024. That's still in the implicit distribution (as long as the classes are identical). A very common example is training on a large dataset like LAION and then testing on COCO or ImageNet-1k. This is not zero-shot because the classes in ImageNet are in LAION (and in COCO). Basically, this is a useless definition, because then any validation or test set would be zero-shot, since those were never seen in the training data and are thus out of the training distribution. But remember that datasets are proxies for larger distributions.

Where it can sometimes get tricky is tasks (emergence has entered the chat). For example, you may not intend to train a generative model to do classification, but you probably did (it's very clear -- in the math -- if you're training density models (KLD, score, etc.)). This can get hairy because it's very easy to train a model to do things you aren't realizing you are, and only find out later. Some people can get upset about this, but it's the nature of frameworks that have low interpretability. There's still a lot of mathematics we need to learn, and it tends not to be an explicit focus in ML, but there are plenty in the community focused on this.



What does __syncthreads() do here exactly? I'm new to CUDA, could get the overall idea of the FlashAttention paper but not the details.


Causes every thread in the block to wait until they have all reached this point. Worth reading a CUDA primer for more details on blocks/warps.

Since the threads rely on each other to fill the SRAM with all the needed data, if you didn't wait then values would be missing.
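
For anyone new to CUDA, here is a minimal, self-contained sketch of that pattern (my own toy example, not code from the linked repo): every thread in the block loads one element into a shared-memory tile, and __syncthreads() is the barrier that guarantees the whole tile has been written before any thread reads slots filled by its neighbors.

    #include <cstdio>

    // Toy example: a three-point average within each block's tile.
    __global__ void tile_average(const float* in, float* out, int n) {
        extern __shared__ float tile[];          // the block's slice of SRAM
        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;

        tile[tid] = (idx < n) ? in[idx] : 0.0f;  // cooperative load: one element per thread
        __syncthreads();                         // wait until the whole tile is filled

        // Each thread now reads slots written by OTHER threads; without the
        // barrier above, those slots might not have been written yet.
        float left  = (tid > 0)              ? tile[tid - 1] : tile[tid];
        float right = (tid < blockDim.x - 1) ? tile[tid + 1] : tile[tid];
        if (idx < n) out[idx] = (left + tile[tid] + right) / 3.0f;
    }

    int main() {
        const int n = 1024, threads = 256, blocks = (n + threads - 1) / threads;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = (float)i;

        tile_average<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[1] = %.1f (expected 1.0)\n", out[1]);

        cudaFree(in);
        cudaFree(out);
        return 0;
    }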



Any CUDA primer you recommend in particular? I had this same question.


Here's an article on syncing in CUDA via cooperative groups: https://developer.nvidia.com/blog/cooperative-groups/

There's also explicit warp synchronization, i.e. __syncwarp(). More on warp primitives here: https://developer.nvidia.com/blog/using-cuda-warp-level-prim...
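To make the warp-primitive point concrete, here is a minimal sketch (again a toy example, not from the repo) of a warp-level reduction with __shfl_down_sync, which sums the 32 lanes of a warp through registers, with no shared memory and no block-wide barrier:

    #include <cstdio>

    // Each warp sums its 32 lanes using shuffle intrinsics.
    __global__ void warp_sum(const float* in, float* out) {
        float v = in[threadIdx.x];
        // Tree reduction across the warp's registers.
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (threadIdx.x % 32 == 0)               // lane 0 of each warp writes the result
            out[threadIdx.x / 32] = v;
    }

    int main() {
        const int n = 64;                        // two warps in a single block
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, (n / 32) * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;

        warp_sum<<<1, n>>>(in, out);
        cudaDeviceSynchronize();
        printf("warp sums: %.1f %.1f (expected 32 each)\n", out[0], out[1]);

        cudaFree(in);
        cudaFree(out);
        return 0;
    }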



Probably https://www.youtube.com/watch?v=nOxKexn3iBo (or just skimming the attached colab).


This is terrific, thanks!


If CPU/GPU execution speed is the goal while simultaneously code golfing the source size, https://halide-lang.org/ might have come in handy.


Fantastic work! Extremely neat and clear implementation! Interesting note on the backward pass - what do you think are the main blockers for a backward pass?


Thanks Daniel. The main blocker is me not being able to fully grasp the backward pass (trying to understand Appendix B.2 in the original paper).

I need to get more comfortable with matrix derivatives before I can confidently reimplement it in the same minimal way as I did with the forward pass.



Oh ok! Ye the backwards passes are always much more difficult due to the derivatives!


My GPU work is not in ML (deep or otherwise); but ...

1. "100 lines of CUDA" + PyTorch; maybe this is useful and maybe it isn't, but counting lines of code on top of a huge codebase is not very meaningful.

2. Launching separate kernels, synchronously, on the default stream, for various operations, is typically not the right way to utilize a GPU.



> maybe this is useful and maybe it isn't, but counting lines of code on top of a huge codebase is not very meaningful.

In this case it's pretty reasonable imo, since the kernel itself is fairly independent - the usage of torch is just for some bindings for the data structures.

> Launching separate kernels, synchronously, on the default stream, for various operations, is typically not the right way to utilize a GPU.

This is actually the standard way to do things in ML. Assuming you're from an HPC background (where this may seem quite strange), the biggest change is that "more or less everything in ML runs on the GPU", so there are very rarely any device-to-host synchronizations. In addition, each individual kernel typically runs on fairly large chunks of data (a million elements would be on the smaller side), so maximizing occupancy with streams is not as necessary as in HPC.
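
For readers coming from outside ML who are wondering what the stream point refers to, here is a minimal sketch (the kernel and sizes are made up for illustration, not taken from the repo) of issuing two independent kernels on separate streams so the hardware is free to overlap them, instead of serializing them on the default stream:

    #include <cstdio>

    // Hypothetical toy kernel, just to have two independent pieces of work.
    __global__ void scale(float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMemset(a, 0, n * sizeof(float));
        cudaMemset(b, 0, n * sizeof(float));

        // Independent kernels issued on separate streams may overlap on the GPU;
        // issuing both on the default stream would run them back to back.
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        scale<<<(n + 255) / 256, 256, 0, s1>>>(a, 2.0f, n);
        scale<<<(n + 255) / 256, 256, 0, s2>>>(b, 3.0f, n);

        cudaDeviceSynchronize();                 // wait for both streams before tearing down
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        printf("done\n");
        return 0;
    }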



This is fantastic. I am just starting in the ML space (compile from compilers) and I love short kernels that I can use to understand things better with.


> compile from compilers

What does that mean?



Typo, meant to write “coming from compilers”





