![]() |
| hm, fair enough. IIRC JPEG XL was a few hundred KB of SIMD code for the four or so different targets/ISAs, including the generic fallback, but I can believe video codecs are larger. |
![]() |
| Perhaps the use cases are different (heavily data-parallel), but FWIW I do not remember many cases where we were frontend bound, so icache hasn't been a concern. |
![]() |
| There indeed have been bugs caused by amd64 assembly code assuming unix calling convention being used for Windows builds and causing data corruption. You have to be careful. |
![]() |
| As a counterpoint, I regularly run into trivial cases that compilers are not able to autovectorize well:
https://gcc.godbolt.org/z/rjEqzf1hh This is an unsigned byte saturating add. It is directly supported as a single instruction in both x86-64 and ARM64 as PADDUSB and UQADD.16B. But all compilers make a mess of it from a straightforward description, either failing to vectorize it or generating vectorized code that is much larger and slower than necessary. This is with a basic, simple vectorization primitive. It's difficult to impossible to get compilers to use some of the more complex ones, like a rounded narrowing saturated right shift (UQRSHRN). |
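For readers who don't want to open the link, here is a minimal sketch of the kind of code being discussed. It is not the code behind the godbolt link; the function names `sat_add_u8_scalar` and `sat_add_u8` are made up for illustration. The scalar loop is the "straightforward description", and the guarded path uses the SSE2 intrinsic `_mm_adds_epu8`, which maps to PADDUSB. On targets without SSE2 the sketch just falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Straightforward scalar description: add, then clamp to 255.
   (Hypothetical helper name, for illustration only.) */
static void sat_add_u8_scalar(uint8_t *dst, const uint8_t *a,
                              const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = sum > 255 ? 255 : (uint8_t)sum;
    }
}

/* Same operation, 16 bytes at a time where SSE2 is available. */
static void sat_add_u8(uint8_t *dst, const uint8_t *a,
                       const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* One PADDUSB per 16 bytes: unsigned saturating byte add. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Leftover elements (or everything, without SSE2). */
    sat_add_u8_scalar(dst + i, a + i, b + i, n - i);
}
```

Comparing the compiler output for the two functions is an easy way to see the gap the comment describes on a given compiler and flag set.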
![]() |
| Don't modern (or even just not-ancient) CPUs use branch prediction to speculate past a check, knowing that the vast majority of the time the check yields the same result? |
![]() |
| How does FFmpeg generate SEH tables for assembly functions on Windows? Is this something that x86asm.inc handles, or do you guys just not worry about it? |
![]() |
| Collaborators have actually superoptimized some of the more complicated Highway ops on RISC-V, with interesting gains, but I think the approach would struggle with largish tasks/algorithms? |
![]() |
| I have tried with Grok3 and Claude. They both seem to have an understanding of the algorithms and data patterns which is more than I expected but then just guess a solution that's often nonsensical. |
![]() |
| I did the first 27 chapters of this tutorial just because I was interested in learning more and it was thoroughly enjoyable: https://mariokartwii.com/armv8/
I actually quite like coding in assembly now (though I haven’t done much more than the tutorial, just made an array library that I could call from C). I think it’s so fun because at that level there’s very little magic left - you’re really saying exactly what should happen. What you see is mostly what you get. It also helped me understand linking a lot better and other things that I understood at a high level but still felt fuzzy on some details. Am now interested to check out this ffmpeg tutorial bc it’s x86 and not ARM :) |
![]() |
| This looks very cool, will check it out. Wild to see it on a Mario Kart Wii site, but I guess modders/hackers are one of the groups of people who still need to work with assembly frequently. |
![]() |
| One “fun” thing about it is that it’s higher level than you think, because the actual chip may do things with branch prediction and pipelining that you can only barely control.
I remember a university course where we competed on who could have the most performant assembly program for a specific task; everyone tried various variants of loop unrolling to eke out the best performance and guide the processor away from bad branch predictions. I may or may not have hit Ballmer Peak the night before the due date and tried a setup that most others missed, and won the competition by a hair!
There’s also the incredible joy of seeing https://github.com/chrislgarry/Apollo-11 and quipping “this is a Unix system; I know this!” Knowing how to read the language of how we made it to the moon will never fade in wonder.
Short answer: yes! |
![]() |
| Yes, it is definitely worth it. You get a much better understanding of CPU architectures. Also, most of your knowledge will be applicable to any platform. |
![]() |
| I learned 8086 (not x86) assembly in a university course during my bachelor's degree and won a contest to create the first correct implementation that would play "Jingle Bells" on the PC speaker[0] attached to the custom-built computer.
That was very fun and I kept playing around with assembly a bit afterwards, but never got around to learning any of the extensions made in x86 assembly and beyond.
In my master's degree, there was another course where you built your own computer PCB in Eagle, got it fabbed, and then had to make a game for the 8052 CPU on it. 8052 assembly is very fun! The processor has a few bytes of RAM where every bit is individually addressable and testable. I built the game Tetris on three attached persistence-of-vision LED matrices[1]. Unfortunately, the repository isn't very clean, but I used expressive variable names, so it should be readable. I did create my own calling convention for performance reasons and calculated how many CPU cycles were available for game logic between screen refreshes. Those were all very fun things to think about :)
Reading assembly now has me look up instruction names here and there, but mostly I can understand what's going on.
[0] https://github.com/AnyTimeTraveler/HardwareNaheProgrammierun...
[1] https://github.com/AnyTimeTraveler/HardwarenaheSystementwick... |
![]() |
| I personally don't think there's much value in writing assembly (vs using intrinsics), but it's been really helpful to read it. I have often used Compiler Explorer (https://godbolt.org/) to look at the assembly generated and understand optimizations that compilers perform when optimizing for performance. |
![]() |
| no lol you're just missing the question I am asking. Obviously sizeof won't return a pointer. I'm just saying, wouldn't it be `sizeof(usize)` essentially... or `sizeof(ptr_size_on_platform)` |
![]() |
| I don’t care about the split, just wanted to say that this guide is so good. I wish I had this back when I was interested in low-low-level. |
![]() |
| Asm is 10x faster than C? That was definitely true at some point but is it still true today? Have compilers really stagnated so badly they can't come close to hand coded asm? |
![]() |
| Intrinsics have the disadvantages of asm (non-portable) but don't reliably have its advantages (compilers are pretty unpredictable about optimizing with them), and they're ugly (especially x86 with its weird Hungarian stuff).
There is just a little bit of intrinsics code in ffmpeg, which I wrote, that does memory copies. https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/i... It's like this because we didn't want to hide the memory accesses from the compiler, because that hurts optimization, as well as memory tools like ASan. |
![]() |
| C compilers are still pretty bad at auto vectorization. For problems where SIMD is applicable, you can reasonably expect a 2x-16x speed up over the naive scalar implementation. |
![]() |
| Also, if you write code with intrinsics, autovectorization can make it _worse_. E.g., a common pattern is to write a SIMD main loop and then a scalar tail, but the compiler can autovectorize that tail and mess it up. |
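A rough sketch of that pattern, with one possible way to tell the compiler to leave the tail alone. The function name `avg_u8` is invented for this example. `#pragma clang loop vectorize(disable)` is a Clang loop hint and `#pragma GCC novector` is available from GCC 14; whether either is needed, or enough, depends on the compiler version and flags, so treat this as an assumption to verify against the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Rounded average of two byte arrays: dst[i] = (a[i] + b[i] + 1) >> 1.
   (Hypothetical function name, for illustration only.) */
static void avg_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* SIMD main loop: PAVGB handles 16 bytes per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
#endif
    /* Scalar tail: at most 15 iterations when the SIMD loop ran, so a
       vectorized version with its own prologue and epilogue is pure
       overhead. The pragmas ask the compiler not to vectorize it. */
#if defined(__clang__)
#pragma clang loop vectorize(disable)
#elif defined(__GNUC__) && __GNUC__ >= 14
#pragma GCC novector
#endif
    for (; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```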
![]() |
| Oh I definitely agree that in the vast majority of cases the compiler will probably win.
But I suspect there are cases where the super experts exist who can do things better. |
![]() |
| That's for random "I know asm so it must be faster".
If you know it really well, have already optimized everything on an algorithmic level and have code that can benefit from simd, 10x is real. |
![]() |
| That's assembly by people who learned it in 1990. Intel very much does want you writing assembly for their processors and in many ways the only way to push them hard is by doing so. |
![]() |
| I suspect that it is not worth using AVX2 vector gathers on any CPU. But certainly you could end up with the best implementation varying between microarchitectures for other reasons. |
![]() |
| MSVC doesn't support inline assembly on x64 or ARM targets (only on 32-bit x86), so to be portable across the big three compilers you have to use either intrinsics or standalone assembly. |
![]() |
| You can use https://github.com/simd-everywhere/simde if you like. In general portable SIMD libraries are of limited utility because having different primitives available on different architectures often means that you should approach problems differently. That is to say, in many cases using any portable SIMD API to solve your problem means leaving 200% speedups on the table on at least one of your top 3 targets.
The thing that is present in Zig and not yet stable in Rust does not include any dynamic shuffles, so these end up requiring intrinsics or asm for all sorts of things. It's a significant weakness compared to e.g. highway, eve, or simde. |
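To make the "different primitives on different architectures" point concrete, here is a scalar model of a dynamic byte shuffle as two ISAs define it. This is only a sketch of the documented semantics of x86 PSHUFB (128-bit form) and AArch64 TBL with a single table register; the function names are invented, and neither is how any of the libraries mentioned implements shuffles.

```c
#include <stdint.h>

/* Scalar model of x86 PSHUFB (128-bit): if bit 7 of the index is set the
   output byte is zero, otherwise only the low 4 bits select the source.
   dst must not alias src. (Hypothetical name, for illustration only.) */
static void shuffle_x86_pshufb(uint8_t dst[16], const uint8_t src[16],
                               const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
}

/* Scalar model of AArch64 TBL with one 16-byte table register: any index
   outside 0..15 yields zero. dst must not alias src. */
static void shuffle_arm_tbl(uint8_t dst[16], const uint8_t src[16],
                            const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = idx[i] < 16 ? src[idx[i]] : 0;
}
```

The two agree for indices 0 to 15 and for indices with the top bit set, but an index like 0x13 selects byte 3 on x86 and produces zero on ARM. A portable API has to either pick one behavior and pay for fixups on the other target, or leave out-of-range indices unspecified.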
![]() |
| SIMD was introduced in the 80s but became ubiquitous when Intel got in on it in the 90s. It's interesting that (for x86), PLT is still stuck at hand-writing assembly 40 years later. |
![]() |
| Uhmmm...Lots of praise but these are just three small lessons covering basics. Exercises not uploaded yet. Looks like a work in progress or in the beginning? |
![]() |
| Practically every computer device manufactured in the last 15 years has some sort of accelerator/specific instructions designed primarily for optimizing the decoding of video. |
![]() |
| ffmpeg does more than hardware decoding. For example scaling, cropping, changing colors, effects. All this stuff can benefit from vectorized operations (on CPU or GPU). |
![]() |
| I'm kind of stunned we haven't gotten something better / more Rust-based than ffmpeg?
Especially curious given the advent of Apple Metal etc. Does anyone have recommendations? |
![]() |
| Why? It's a Herculean effort. It was 28 years between the creation of C and FFmpeg, so if there is still no replacement by 2038, then your complaint is justified. |
![]() |
| The only thing I don't like about this is the focus on x86 assembly, which is a sinking ship because RISC-V is coming to eat its lunch, FAST. |
![]() |
| I could understand if you wrote ARM, because that's an architecture with actual market share, arguably more market share than x86-64 at this point, but you had to choose RISC-V for the lols. |
As I'm seeing in the comments here, the usefulness of handwritten SIMD ranges from "totally unclear" to "mission critical". I'm seeing a lot on the "totally unclear" side, but not as much on the "mission critical", so I'll talk a bit about that.
FFmpeg is a pretty clear use case because of how often it is used, but I think it is easier to quantify the impact of handwriting SIMD with something like dav1d, the universal production AV1 video decoder.
dav1d is used pretty much everywhere, from major browsers to the Android operating system (superseding libgav1). A massive element of dav1d's success is its incredible speed, which is largely due to how much of the codebase is handwritten SIMD.
While I think it is a good thing that languages like Zig have built-in SIMD support, there are some use cases where it becomes necessary to do things by hand because even a potential performance delta is important to investigate. There are lines of code in dav1d that will be run trillions of times in a single day, and they need to be as fast as possible. The difference between handwritten & compiler-generated SIMD can be up to 50% in some cases, so it is important.
I happen to be somewhat involved in similar use cases, where things I write will run a lot of times. To make sure these skills stay alive, resources like the FFmpeg school of assembly language are pretty important, in my opinion.