(comments)

Original link: https://news.ycombinator.com/item?id=38645021

Overall, although AMD has tried to develop ROCm for its discrete professional GPUs, it still faces challenges competing with CUDA because of limited support for older hardware, which may keep large institutions with smaller budgets from contributing the fixes that are needed. While AMD's continued investment in HPC (for example, its exascale collaboration with Argonne National Laboratory) may raise interest in its AI efforts, it does not necessarily solve the availability and pricing problems faced by students, hobbyists, and smaller institutions. Adapting to a rapidly changing technology landscape also remains critical: affordable rental time on high-end GPUs, or prioritizing new-generation cards over older hardware, can serve as alternatives to buying expensive GPUs. However, fragmentation and a lack of commitment among stakeholders still make it hard to establish a leader able to challenge CUDA's dominance. Even so, some successes show what sustained investment can yield, as demonstrated by the recent addition of RX 7900 XT support to PyTorch under ROCm. Ultimately, clear communication, a commitment to continued investment in innovation, and collaboration with academic institutions and major industry players could drive further progress in challenging CUDA's influence and its standing as the centralized stack.

Related articles

Original article
Intel CEO: 'The entire industry is motivated to eliminate the CUDA market' (tomshardware.com)
308 points by rbanffy 1 day ago | 351 comments










As another commenter said, it's CUDA. Intel and AMD and whoever can turn out chips reasonably fast, but nobody gets that it's the software and ecosystem. You have to out-compete the ecosystem. You can pick up a used MI100 that performs almost like an A100 for 5x less money on eBay, for example. Why is it 5x less? Because the software incompatibilities mean you'll spend a ton of time getting it to work compared to an Nvidia GPU.

Google is barely limping along with its XLA interface to PyTorch providing researchers a decent compatibility path. Same with Intel.

Any company in this space should basically set up a giant test suite of, IDK, every model on Hugging Face and just start brute-force fixing the issues. Then maybe they can sell some chips!

Intel is basically doing the same shit they always do here, announcing some open initiative and then doing literally the bare minimum to support it. 99% chance OpenVINO goes nowhere. OpenAI's Triton already seems more popular; at least I've heard it referenced a lot more than OpenVINO.



The funny thing to me is that so much of the "AI software ecosystem" is just PyTorch. You don't need to develop some new framework and make it popular. You don't need to support a zillion end libraries. Just literally support PyTorch.

If PyTorch worked fine on Intel GPUs, a lot of people would be happy to switch.



But you can't support PyTorch without a proper foundation in place. They don't need to support a zillion _end_ libraries, sure, but they do need to have at least a very good set of standard libraries, equivalents of cuBLAS, cuRAND, etc.

And they don't. My work recently had me working with rocRAND (ROCm's answer to cuRAND). It was frankly pretty bad: the design, the performance (50% slower in places that don't make any sense, because generating random numbers is not exactly that complicated), and the documentation (God, it was awful).

Now, that's a small slice of the larger pie. But imagine if this trend continues for other libraries.



Generating random numbers is a bit complicated! I wrote some of the samplers in Pytorch (probably replaced by now) and some of the underlying pseudo-random algorithms that work correctly in parallel are not exactly easy... running the same PRNG with the same seed on all your cores will produce the same result, which is probably NOT what you want from your API.

But, to be honest, it's not that hard either. I'm surprised their API is 2x slower, Philox is 10 years old now and I don't think there's a licensing fee?



> Generating random numbers is a bit complicated!

I know! I just wrote a whole paper and published a library on this!

But really, perhaps not as much as many from outside might think. The core of a Philox implementation can be around 50 lines of C++ [1]; with all the bells and whistles, maybe around 300-400. That implementation's performance equals cuRAND's, sometimes even surpasses it! (The API is designed to avoid maintaining any RNG states in device memory, something cuRAND forces you to do.)

> running the same PRNG with the same seed on all your cores will produce the same result

You're right. The solution here is to use multiple generator objects, one per thread, ensuring each produces a statistically independent random stream. Some good algorithms (Philox, for example) allow you to use any set of unique values as seeds for your threads (e.g. the thread id).

[1] https://github.com/msu-sparta/OpenRAND/blob/main/include/ope...
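Not from the linked library, but a minimal sketch of the seeding scheme described above, using NumPy's built-in Philox bit generator (a counter-based PRNG). Each "thread" gets its own key (here just the thread id), which yields statistically independent streams instead of every core replaying the same sequence; the thread structure is illustrative only.

    # Minimal sketch: one independent Philox stream per thread, keyed by thread id.
    # Assumes NumPy >= 1.17 (numpy.random.Philox); the "threads" are simulated here.
    import numpy as np

    def make_stream(thread_id: int) -> np.random.Generator:
        # Each unique key gives a statistically independent keystream,
        # so seeding by thread id avoids all cores producing identical numbers.
        return np.random.Generator(np.random.Philox(key=thread_id))

    streams = [make_stream(tid) for tid in range(4)]
    for tid, g in enumerate(streams):
        print(f"thread {tid}: {g.random(3)}")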



Cool! I’ll have a look-see. I’ve got my own experiments in this space.


I wonder if the next generation chips are going to just have a dedicated hardware RNG per-core if that's an issue?


Why bother?

It's not the generation that matters so much, it's the gathering of entropy, which comes from peripherals and isn't possible to generate on-die.

If you don't need cryptographically secure randomness, you still want the entropy for generating the seeds per thread/die/chip.



It absolutely is possible to generate entropy on-die, assuming you actually want entropy and not just a unique value that gets XORed with the seed, so you can still have repeatable seeds.

Pretty much every chip has an RNG which can be as simple as just a single free running oscillator you sample



> Pretty much every chip has an RNG which can be as simple as just a single free running oscillator you sample

Every chip may have some sort of noise to sample, but they are nowhere near good sources of entropy.

Entropy is not a binary thing (you either have it or don't), it's a spectrum and entropy gathered on-die is poor entropy.

Look, I concede that my knowledge on this subject is a bit dated, but the last time I checked there were no good sources of entropy on-die for any chip in wide use. All cryptographically secure RNGs depend on a peripheral to grab noise from the environment to mix into the entropy pool.

A free-running oscillator is a very poor source of entropy.



For non-cryptographic applications, a PRNG like xorshift reseeded by a few bits from an oscillator might be enough.
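For illustration, a rough sketch of that idea: a plain xorshift64 step reseeded from a few external entropy bits. os.urandom stands in for the on-chip oscillator here, and the shift constants are the standard Marsaglia xorshift64 ones; this is deliberately non-cryptographic.

    # Non-cryptographic xorshift64 reseeded from a few "oscillator" bits
    # (os.urandom is a stand-in for the hardware noise source).
    import os

    MASK64 = (1 << 64) - 1

    def xorshift64(state: int) -> int:
        state ^= (state << 13) & MASK64
        state ^= state >> 7
        state ^= (state << 17) & MASK64
        return state & MASK64

    # Reseed; guard against the all-zero state, which xorshift never leaves.
    state = int.from_bytes(os.urandom(8), "little") or 1
    for _ in range(5):
        state = xorshift64(state)
        print(state)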

As I understand it, the reason they don't use on-chip RNGs by themselves isn't due to lack of entropy, it's because people don't trust them not to put a backdoor on the chips or to have some kind of bug.

Intel has https://en.m.wikipedia.org/wiki/RDRAND but almost all chips seem to now have some kind of RNG.



for GPGPU, the better approach is a CBRNG (counter-based RNG) like random123.

https://github.com/DEShawResearch/random123

if you accept the principles of encryption, then the bits of the output of crypt(key, message) should be totally uncorrelated to the output of crypt(key, message+1). and this requires no state other than knowing the key and the position in the sequence.

the direct-port analogy is that you have an array of CuRand generators, generator index G is equivalent to key G, and you have a fixed start offset for the particular simulation.

moreover, you can then define the key in relation to your actual data. the mental shift from what you're talking about is that in this model, a PRNG isn't something that belongs to the executing thread. every element can get its own PRNG and keystream. And if you use a contextually-meaningful value for the element key, then you already "know" the key from your existing data. And this significantly improves determinism of the simulation etc because PRNG output is tied to the simulation state, not which thread it happens to be scheduled on.

(note that the property of cryptographic non-correlation is NOT guaranteed across keystreams - (key, counter) is NOT guaranteed to be uncorrelated to (key+1, counter), because that's not how encryption usually is used. with a decent crypto, it should still be very good, but, it's not guaranteed to be attack-resistant/etc. so notionally if you use a different key index for every element, element N isn't guaranteed to be uncorrelated to element N+1 at the same place in the keystream. If this is really important then maybe you want to pass your array indexes through a key-spreading function etc.)

there are several benefits to doing it like this. first off obviously you get a keystream for each element of interest. but also there is no real state per-thread either - the key can be determined by looking at the element, but generating a new value doesn't change the key/keystream. so there is nothing to store and update, and you can have arbitrary numbers of generators used at any given time. Also, since this computation is purely mathematical/"pure function", it doesn't really consume any memory-bandwidth to speak of, and since computation time is usually not the limiting element in GPGPU simulations this effectively makes RNG usage "free". my experience is that this increases performance vs CuRand, even while using less VRAM, even just directly porting the "1 execution thread = 1 generator" idiom.

Also, by storing "epoch numbers" (each iteration of the sim, etc), or calculating this based on predictions of PRNG consumption ("each iteration uses at most 16 random numbers"), you can fast-forward or rewind the PRNG to arbitrary times, and you can use this to lookahead or lookback on previous events from the keystream, meaning it serves as a massively potent form of compression as well. Why store data in memory and use up your precious VRAM, when you could simply recompute it on-demand from the original part of the original keystream used to generate it in the first place? (assuming proper "object ownership" of events ofc!) And this actually is pretty much free in performance terms, since it's a "pure function" based on the function parameters, and the GPGPU almost certainly has an excess of computation available.
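A hedged sketch of that crypt(key, counter) pattern, using NumPy's Philox as the stateless counter-based generator: the element id plays the role of the key and the keystream position plays the role of the counter, so nothing is stored per element and any past value can be recomputed (rewound) on demand.

    # Counter-based RNG sketch: output depends only on (key, counter),
    # so no per-element state lives in memory and rewind/fast-forward is free.
    import numpy as np

    def stream_value(element_id: int, position: int) -> float:
        bg = np.random.Philox(key=element_id, counter=position)
        return np.random.Generator(bg).random()

    # Same (element, position) always reproduces the same draw...
    assert stream_value(42, 1000) == stream_value(42, 1000)
    # ...and different elements get their own streams without storing anything.
    print(stream_value(7, 0), stream_value(8, 0))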

--

In the extreme case, you should be able to theoretically "walk" huge parts of the keystream and find specific events you need, even if there is no other reference to what happened at that particular time in the past. Like why not just walk through parts of the keystream until you find the event that matches your target criteria? Remember since this is basically pure math, it's generated on-demand by mathing it out, it's pretty much free, and computation is cheap compared to cache/memory or notarizing.

(ie this is a weird form of "inverted-index searching", analogous to Elastic/Solr's transformers and how this allows a large number of individual transformers (which do their own searching/indexing for each query, which will be generally unindexable operations like fulltext etc) to listen to a single IO stream as blocks are broadcast from the disk in big sequential streaming batches. Instead of SSD batch reads you'd be aiming for computation batch reads from a long range within a keystream. (And this is supposition but I think you can also trade back and forth between generator space and index hitrate by pinning certain bits in the output right?)

--

Anyway I don't know how much that maps to your particular use-case but that's the best advice I can give. Procedural generation using a rewindable, element-specific keystream is a very potent form of compression, and very cheap. But, even if all you are doing is just avoiding having to store a bunch of CuRand instances in VRAM... that's still an enormous win even if you directly port your existing application to simply use the globalThreadIdx like it was a CuRand stateful instance being loaded/saved back to VRAM. Like I said, my experience is that because you're changing mutation to computation, this runs faster and also uses less VRAM, it is both smaller and better and probably also statistically better randomness (especially if you choose the "hard" algorithms instead of the "optimized" versions like threefish instead of threefry etc). The bit distribution patterns of cryptographic algorithms is something that a lot of people pay very very close attention to, you are turning a science toy implementation into a gatling gun there simply by modeling your task and the RNG slightly differently.

That is the reason why you shouldn't do the "just download random numbers", as a sibling comment mentions (probably a joke) - that consumes VRAM, or at least system memory (and pcie bandwidth). and you know what's usually way more available as a resource in most GPGPU applications than VRAM or PCIe bandwidth? pure ALU/FPU computation time.

buddy, everyone has random numbers, they come with the fucking xbox. ;)



thinking this through a little bit, you are launching a series of gradient-descent work tasks, right? taskId is your counter value, weightIdx is your key value (RNG stream). That's how I'd port that. Ideally you want to define some maximum PRNG usage for each stage of the program, which allows you to establish fixed offsets from the epoch value for a given event. Divide your keystream in whatever advantageous way, based on (highly-compressible) epoch counters and event offsets from that value.

in practice, assuming a gradient-descent event needs a lot of random numbers, having one keystream for a single GD event might be too much and that's where key-spreading comes in. if you take the "weightIdx W at GradientDescentIdx G" as the key, you can have a whole global keystream-space for that descent stage. And the key-spreading-function lets you go between your composite key and a practical one.

https://en.wikipedia.org/wiki/Key_derivation_function

(again, like threefry, there is notionally no need for this to be cryptographically secure in most cases, as long as it spreads in ways that your CBRNG crypto algorithm can tolerate without bit-correlation. there is no need to do 2 million rounds here either etc. You should actually pick reasonable parameters here for fast performance, but good enough keyspreading for your needs.)
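As a toy illustration of that key-spreading idea (not a real KDF): fold a composite (weightIdx, gradient-descent step) key into one well-mixed 64-bit value using the splitmix64 finalizer, a common non-cryptographic mixer. The function and parameter names here are made up for the example.

    # Toy key-spreading function: mix a composite key into one 64-bit value
    # so that nearby (weight_idx, gd_step) pairs land far apart in key space.
    MASK64 = (1 << 64) - 1

    def splitmix64(x: int) -> int:
        x = (x + 0x9E3779B97F4A7C15) & MASK64
        x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
        return (x ^ (x >> 31)) & MASK64

    def spread_key(weight_idx: int, gd_step: int) -> int:
        return splitmix64((gd_step << 32) ^ weight_idx)

    print(hex(spread_key(weight_idx=3, gd_step=7)))
    print(hex(spread_key(weight_idx=4, gd_step=7)))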

I've been out of this for a long time, I've been told I'm out of date before and GPGPUs might not behave exactly this way anymore, so please just take it in the spirit it's offered, can't guarantee this is right but I've specifically gazed into the abyss of the CuRand situation a decade ago and this was what I managed to come up with. I do feel your pain on the stateful RNG situation, managing state per-execution-thread is awful and destroys simulation reproducibility, and managing a PRNG context for each possible element is often infeasible. What a waste of VRAM and bandwidth and mutation/cache etc.

And I think that cryptographic/pseudo-cryptographic PRNG models are frankly just a much better horse to hook your wagon to than scientific/academic ones, even apart from all the other advantages. Like there's just not any way mersenne twister or w/e is better than threefish, sorry academia

--

edit: Real-world sim programs are usually very low-intensity and have effectively unlimited amounts of compute to spare, they just ride on bandwidth (sort/search or sort/prefix-scan/search algorithms with global scope building blocks often work well).

And tbh that's why tensor is so amazing, it's super effective at math intensity and computational focus, and that's what GPUs do well, augmented by things like sparse models etc. Make your random not-math task into dense or sparse (but optimized) GPGPU math, plus you get a solution (reasonable optimum) to an intractable problem in realtime. The experienced salesman usually finds a reasonable optimum, but we pay him in GEMM/BLAS/Tensor compute time instead of dollars.

Sort/search or sort/prefix-sum/search often works really well in deterministic programs too. Do you ever have a "myGroup[groupIdx].addObj(objIdx)" stage? That's a sort and a prefix-sum operation right there, and both of those ops run super well on GPGPU.



Folks also underestimate how complex these libraries are. There are dozens of projects to make BLAS alternatives which give up after ~3-6 months when they realize that this project will take years to be successful.


How does that work? Why not pick up where the previous team left off instead of everyone starting new ones? Or are they all targeting different backends and hardware?


It's a compiler problem and there is no money in compilers [1]. If someone made an intermediate representation for AI graphs and then wrote a compiler from that intermediate format into whatever backend was the deployment target then they might be able to charge money for support and bug fixes but that would be it. It's not the kind of business anyone wants to be in so there is no good intermediate format and compiler that is platform agnostic.

1: https://tinygrad.org/



JAX is a compiler.


So are TensorFlow and PyTorch. All AI/ML frameworks have to translate high-level tensor programs into executable artifacts for the given hardware and they're all given away for free because there is no way to make money with them. It's all open source and free. So the big tech companies subsidize the compilers because they want hardware to be the moat. It's why the running joke is that I need $80B to build AGI. The software is cheap/free, the hardware costs money.


> generating random numbers

You can't bench implementations of random numbers against each other purely on execution speed.

A better algorithm (better statistical properties) will be slower.



I have the fastest random number generator in the world. And it works in parallel too!

https://i.stack.imgur.com/gFZCK.jpg



Yeah. In this instance, I was talking about the same algorithm (Philox); the difference is purely in implementation.


If you haven't already, please consider filing issues on the rocrand GitHub repo for the problems you encountered. The rocrand library is being actively developed and your feedback would be valuable for guiding improvements.


Appreciate it, will do.


I honestly don't see why it's so hard. On my project we wrote our own GEMM kernels from scratch so llama.cpp didn't need to depend on cuBLAS anymore. Only took a few days and a few hundred lines of code. We had to trade away 5% performance.
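For readers wondering what such a kernel computes: a reference (naive, unblocked) GEMM, C = alpha*A*B + beta*C, sketched in plain Python below. This only illustrates the operation being replaced; it is not the project's actual code, which would be written as optimized C/C++ or GPU kernels.

    # Reference GEMM: C = alpha * (A @ B) + beta * C, triple loop, no blocking.
    import numpy as np

    def gemm(alpha, A, B, beta, C):
        M, K = A.shape
        K2, N = B.shape
        assert K == K2 and C.shape == (M, N)
        for i in range(M):
            for j in range(N):
                acc = 0.0
                for k in range(K):
                    acc += A[i, k] * B[k, j]
                C[i, j] = alpha * acc + beta * C[i, j]
        return C

    A, B = np.random.rand(4, 5), np.random.rand(5, 3)
    C = np.zeros((4, 3))
    assert np.allclose(gemm(1.0, A, B, 0.0, C), A @ B)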


For a given set of kernels, and a limited set of architectures, the problem is relatively easy.

But covering all the important kernels across all the crazy architectures out there, with relatively good performance and numerical accuracy... much harder.



Instead of generating pseudorandom numbers you can just download files of them.

https://archive.random.org/



Or you could just re-use the same number; no one can prove it is not random.

https://xkcd.com/221/



This is a big reason why AMD did this deal with PyTorch...

https://pytorch.org/blog/experience-power-pytorch-2.0/



Just to point out it does, kind of: https://github.com/intel/intel-extension-for-pytorch

I've asked before if they'll merge it back into PyTorch main and include it in the CI, not sure if they've done that yet.

In this case I think the biggest bottleneck is just that they don't have a fast enough card that can compete with having a 3090 or an A100. And Gaudi is stuck on a different software platform which doesn't seem as flexible as an A100.



They could compete on RAM, if the software was there. Just having a low-cost alternative to the 4060 Ti would allow them to break into the student/hobbyist/open-source market.

I tried the A770, but returned it. Half the stuff does not work. They have the CPU-side and GPU development on different branches (GPU seems to be ~6 months behind CPU) and often you have to compile it yourself (if you want torchvision or torchaudio). It's also currently on PyTorch 2.0.1, so somewhat lagging, and does not have most of the performance analysis software available. You also do need to modify your PyTorch code, often more than just replacing cuda with xpu as the device. They are also doing all development internally, then pushing intermittently to public. A lot of this would not be as bad if there was a better idea of the feature timeline, or if they made their CI public. (Trying to build it myself involved an extremely hacky bash script that inevitably failed halfway through.)
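For context, a hedged sketch of the kind of device juggling described above: pick cuda, xpu (Intel, via intel-extension-for-pytorch), or cpu at runtime. It assumes the IPEX package is installed and registers torch.xpu on import; as the comment notes, real code usually needs more changes than swapping the device string.

    # Device selection sketch: prefer CUDA, then Intel's xpu backend, then CPU.
    import torch

    def pick_device() -> torch.device:
        if torch.cuda.is_available():
            return torch.device("cuda")
        try:
            import intel_extension_for_pytorch  # noqa: F401  (registers the xpu backend)
            if hasattr(torch, "xpu") and torch.xpu.is_available():
                return torch.device("xpu")
        except ImportError:
            pass
        return torch.device("cpu")

    device = pick_device()
    model = torch.nn.Linear(16, 4).to(device)
    x = torch.randn(8, 16, device=device)
    print(device, model(x).shape)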



The amount of VRAM is the absolute killer USP for the current large AI model hobbyist segment. Something that had just as much VRAM as a 3090 but at half the speed and half the price would sell like hot cakes.


You are describing the eBay market for used Nvidia Tesla cards. The K80, P40, or M40 are widely available and sell for ~$100 with 24GB of VRAM. The M10 even has 32GB! The problem for AI hobbyists is it won't take long to realize how many APIs use the "optical flow" pathways and so on; on Nvidia they'll only run at acceptable speeds on RTX hardware, assuming they run at all. CUDA versions are pinned to hardware to some extent.


Yep. I have a fleet of P40s that are good at what they do (Whisper ASR primarily) but anything even remotely new... nah. fp16 support is missing so you need P100 cards, and usually that means you are accepting 16GB of VRAM rather than 24GB.

Still some cool hardware.



For us hobbyists used 3090 or new 7900xtx seem to be the way. But even then you still need to build a machine with 3 or 4 of these GPUs to get enough VRAM to play with big models.


For sure - our prod machine has 6x RTX 3090s on some old cirrascale hardware. But P40s are still good for last-gen models. Just nothing new unfortunately.


Out of these three, only P40 is worth the effort to get running vs the capabilities they offer. That's also before considering that other than software hacks or configuration tweaks, those cards require specialised cooling shrouds for adequate cooling in tower-style cases.

If your time or personal energy is worth >$0, these cards work out to much more than $100. And you can't even file the time burnt on getting them to run as any kind of transferable experience.

That's not to say I don't recommend getting them - I have a P4 family card and will get at least one more, but I'm not kidding myself that the use isn't very limited.



The K80 is barely worth it. P40s are definitely the best value along with P100s. I still think there is a good amount of value to be extracted from both of those cards, especially if you are interested in using Whisper ASR, video transcoding, or CUDA models that were relevant before LLMs (a time many people have forgotten, apparently).


This is pretty disappointing to hear. I’m really surprised they can’t even get a clean build script for users, let alone integrate into the regular Pytorch releases.


oh no, I just bought a refurbed A770 16GB for tinkering with GPGPU lol. It was $220, return?


PyTorch includes some Vulkan compat already (though mostly tested on Android, not on desktop/server platforms), and they're sort of planning to work on OpenCL 3.0 compat, which would in turn lead to broad-based hardware support via Mesa's RustiCL driver.

(They don't advertise this as "support" because they have higher standards for what that term means. PyTorch includes a zillion different "operators" and some of them might be unimplemented still. Besides performance is still lacking compared to CUDA, Rocm or Metal on leading hardware - so only useful for toy models.)



OneAPI isn't bad for PyTorch, the performance isn't there yet but you can tell it's an extremely top priority for Intel.


But this is the thing. Speaking as someone who dabbles in this area rather than any kind of expert, it’s baffling to me that people like Intel are making press releases and public statements rather than (I don’t know) putting in the frikkin work to make performance of the one library that people actually use decent.

You have a massive organization full of gazillions of engineers many of whom are really excellent. Before you open your mouth in public and say something is a priority, deploy a lot of them against this and manifest that priority by actually doing the thing that is necessary so people can use your stuff.

It’s really hard to take them seriously when they haven’t (yet) done that.



You know how it works. The same busybodies who are putting out these useless noise releases are the ones who squandered Intel's lead, and now are patting themselves on the back for figuring out that with this they'll again be on top for sure!

There was a post on HN a few months ago about how Nvidia's CEO still has meetings with engineers in the trenches. Contrast that with what we know of Intel, which is not much good, and a lot of bad. (That they are notoriously not-well-paying, because they were riding on their name recognition.)



Intel has to do it by themselves. NVIDIA just lets Meta/OpenAI/Google engineers do it for them. Such a handicapped fight.


It wasn’t always like this. Nvidia did the initial heavy lifting to get cuda off the ground to a point where other people could use it.


That's because CUDA is a clear, well-functioning library and Intel has no equivalent. It makes any "you just have to get Pytorch working" a little less plausible.


It's not just Intel. Open initiatives and consortiums (phase two of the same) are always the losers ganging up, hoping that it will give them the leg up they don't have. If you're older you'll have seen this play out over and over in the industry - the history of Unix vs. Windows NT from the 1990s was full of actions like this, networking is going through it again for the nth time (this time with Ultra Ethernet), and so on. OpenGL was probably the most successful approach, barely worked, and didn't help any of the players who were not on the road to victory already. Unix 95 didn't work, Unix 98 didn't work, etc.


You're just listing the ones that didn't knock it out of the park.

TCP/IP completely displaced IPX to the point that most people don't even remember what it was. Nobody uses WINS anymore, even Microsoft uses DNS. It's rare to find an operating system that doesn't implement the POSIX API.

The past is littered with the corpses of proprietary technologies displaced by open standards. Because customers don't actually want vendor-locked technology. They tolerate it when it's the only viable alternative, but make the open option good and the proprietary one will be on its way out.



Open source generally wins once the state of the art has stopped moving. When a field is still experiencing rapid change closed source solutions generally do better than open source ones. Until we somehow figure out a relatively static set of requirements for running and training LLMs I wouldn’t expect any open source solution to win.


That doesn't really make sense. Pretty much all LLMs are trained in PyTorch, which is open source. LLMs only reached the state they're in now because many academic conferences insisted that paper submissions have open source code attached. So much of the ML/AI ecosystem is open source. Pretty much only CUDA is not open source.


>Pretty much only CUDA is not open source.

What stops Intel from making their own CUDA and plugging it into PyTorch?



CUDA is huge and nvidia spent a ton in a lot of "dead end" use cases optimizing it. There have been experiments with CUDA translation layers with decent performance[1]. There are two things that most projects hit:

1. The CUDA API is huge; I'm sure Intel/AMD will focus on what they need to implement PyTorch and ignore every other use case, ensuring that CUDA always has the leg up in any new frontier.

2. Nvidia actually cares about developer experience. The most prominent example is geohot with tinygrad - where AMD examples didn't even work or had glaring compiler bugs. You will find Nvidia engineers in GitHub issues for CUDA projects. Intel/AMD haven't made that level of investment, and that's important because GPUs tend to be more fickle than CPUs.

[1] https://github.com/vosen/ZLUDA



The same shit as always, patents and copyright.


You didn't have a choice when it came to protocols for the Internet; it's TCP/IP and DNS or you don't get to play. Everyone was running dual stack to support their LAN and Internet and you had no choice with one of them. So, everything went TCP/IP and reduced overall complexity.


> It's rare to find an operating system that doesn't implement the POSIX API.

Except that it has barely improved beyond CLI and daemons, still thinks terminals are the only hardware; everything else that matters isn't part of it, not even more modern networking protocols that aren't exposed in socket configurations.



Its purpose was to create compatibility between Unix vendors so developers could write software compatible with different flavors. The primary market for Unix vendors is servers, which to this day are still about CLI and daemons, and POSIX systems continue to have dominant market share in that market.


Nah, POSIX on servers is only relevant enough for language runtimes and compilers, which then use their own package managers and cloud APIs for everything else.

Alongside a cloud shell, which yeah, we now have a VT100 running on a browser window.

There is a reason why there are USENIX papers on the loss of POSIX relevance.



> Nah, POSIX on servers is only relevant enough for language runtimes and compilers, which then use their own package managers and cloud APIs for everything else.

Those things are a different level of abstraction. The cloud API is making POSIX system calls under the hood, which would allow the implementation of the cloud API to be ported to different POSIX-compatible systems (if anybody cared to).

> There is a reason why there are USENIX papers on the loss of POSIX relevance.

The main reason POSIX is less relevant is that everybody is using Linux and the point of POSIX was to create compatibility between all the different versions of proprietary Unix that have since fallen out of use.



Maybe it's POSIX that is holding new developments back because people think it's good enough. It's not the '60s anymore. I would have expected to have totally new paradigms in 2023 if you asked me 23 years ago. Even the NT kernel seems more modern.

While POSIX was state of the art when it was invented, it shouldn't be today.

Lots of research was thrown in the recycle bin because "Hey, we have POSIX, why reinvent the wheel?", up to the point that nobody wants to do operating systems research today, because they don't want their hard work to get thrown into the same recycle bin.

I think the people who invented POSIX were innovators, and had they lived today, they would come up with a totally new paradigm, more fit to today's needs and knowledge.



Arguably the dominant APIs in the server space are the cloud APIs not POSIX.


Many of which are also open, like OpenStack or K8s, or have third party implementations, like Ceph implementing the Amazon S3 API.


Also all reimplementations of proprietary technology.

The S3 API is a really good example of the “OSS only becomes dominant when development slows down” principle. As a friend of mine who has had to support a lot of local blob storage says, “On the gates of hell are emblazoned — S3 compatible.”



> Also all reimplementations of proprietary technology.

That's generally where open standards come from. You document an existing technology and then get independent implementations.

Unix was proprietary technology. POSIX is an open standard.

Even when the standard comes at the same time as the first implementation, it's usually because the first implementer wrote the standard -- there has to be one implementation before there are two.



I think OP's point was less that the tech didn't work (e.g. OpenGL was fantastically successful) and that it didn't produce a good outcome for the "loser" companies that supported it.


The point of the open standard is to untether your prospective customers from the incumbent. That means nobody is going to monopolize that technology anymore, but that works out fine when it's not the thing you're trying to sell -- AMD and Intel aren't trying to sell software libraries, they're trying to sell GPUs.

And this strategy regularly works out for companies. It's Commoditize Your Complement.

If you're Intel you support Linux and other open source software so you can sell hardware that competes with vertically integrated vendors like DEC. This has gone very well for Intel -- proprietary RISC server architectures are basically dead, and Linux dominates much of the server market in which case they don't have to share their margins with Microsoft. The main survivor is IBM, which is another company that has embraced open standards. It might have also worked out for Sun but they failed to make competitive hardware, which is not optional.

We see this all over the place. Google's most successful "messaging service" is Gmail, using standard SMTP. It's rare to the point of notability for a modern internet service to use all proprietary networking protocols instead of standard HTTP and TCP and DNS, but many of them are extremely successful.

And some others are barely scraping by, but they exist, which they wouldn't if there wasn't a standard they could use instead of a proprietary system they were locked out of.



> The point of the open standard is to untether your prospective customers from the incumbent.

That is the point.

But FWIW the incumbent adopts it and dominates anyway. (Though you now are technically "untethered.")



> But FWIW the incumbent adopts it and dominates anyway. (Though you now are technically "untethered.")

That's assuming the incumbent's advantage isn't rooted in the lock-in.

If ML was suddenly untethered from CUDA, now you're competing on hardware. Intel would still have mediocre GPUs, but AMD's are competitive, and Intel's could be in the near future if they execute competently.

The open standard doesn't automatically give you the win, but it puts you in the ring.

And either of them have the potential to gain an advantage over Nvidia by integrating GPUs with their x86_64 CPUs, e.g. so the CPU and GPU can share memory, avoiding copying over PCIe and giving the CPU direct access to HBM. They could even put a cut down but compatible version of the technology in every commodity PC by default, giving them a huge installed base of hardware that encourages developers to target it.



If the software side no longer mattered, I would expect all three vendors would magically start competing on available RAM. A slower card with double today's RAM would absolutely sell.


> A slower card with double today's RAM would absolutely sell

Absolutely, SQL analytics people (like me) have been itching for a viable GPU for analytics for years now. The price/performance just isn't there yet because there's such a bias towards high compute and low memory.



Windows still uses WINS and NetBIOS when DNS is unavailable.


In a business-class network, even one running Windows, if DNS breaks "everything" is going to break.


Nvidia is probably ten times more scared of this guy https://github.com/ggerganov than Intel or AMD.


Can you expand on this? This is my first time seeing this guy’s work


He’s the main developer of Llama.cpp, which allows you to run a wide range of open-weights models on a wide range of non-NVIDIA processors.


but it's all inference, and most of Nvidia's moat is in training afaik.


There is an example of training https://github.com/ggerganov/llama.cpp/tree/1f0bccb27929e261...

But that's absolutely false about the Nvidia moat being only training. Llama.cpp makes it far more practical to run inference on a variety of devices, including ones with or without Nvidia hardware.



People have really bizarre overdramatic misunderstandings of llama.cpp because they used it a few times to cook their laptop. This one really got me giggling though.


I am integrating llama.cpp into my application. I just went through one of their text generation examples line-by-line and converted it into my own class.

This is a leading-edge software library that provides a huge boost for non-Nvidia hardware in terms of inference capability with quantized models.

If you don't understand that, then you have missed an important development in the space of machine learning.



Jeez. Lol.

At length:

- yes, local inference is good. I can't say this strongly enough: llama.cpp is a fraction of a fraction of local inference.

- avoid talking down to people and histrionics. It's a hot field, you're in it, but like all of us always, you're still learning. When faced with a contradiction, check your premises, then share them.



This right here. Until Intel (and/or AMD) get serious about the software side and actually invest the money CUDA isn't going anywhere. Intel will make noises about various initiatives in that direction and then a quarter or two later they'll make big cuts in those divisions. They need to make a multi-year commitment and do some serious hiring (and they'll need to raise their salaries to market rates to do this) if they want to play in the CUDA space.


Literally, every single announcement (and action) coming out of AMD these days is that they are serious about the software. I don't see any reason at this point to doubt them.

The larger issue is that they need to fix the access to their high end GPUs. You can't rent a MI250... or even a MI300x (yet, I'm working on that myself!). But that said, you can't rent an H100 either... there are none available.



> Literally, every single announcement (and action) coming out of AMD these days is that they are serious about the software. I don't see any reason at this point to doubt them.

They're having to announce it so much because people are rightly sceptical. Talk is cheap, and their software has sucked for years. Have they given concrete proof of their commitment, e.g. they've spent X dollars or hired Y people to work on it (or big names Z and W)?



Agreed. Time will tell.

MI300x and ROCm 6 and their support of projects like Pytorch, are all good steps in the right direction. HuggingFace now supports ROCm.



>Literally, every single announcement (and action) coming out of AMD these days is that they are serious about the software. I don't see any reason at this point to doubt them.

I have a bridge to sell.



It is way, way better in the last year or so. Perfectly reasonable cards for inference if you actually understand the stack and know how to use it. Is nVidia faster? Sure, but at twice the price for 20-30% gains. If that makes sense for you, keep paying the tax.


Not just tax, but fighting with centralization on a single provider and subsequent unavailability.






What _7il4 removed were these two comments:

"AMD is not serious, and neither is Intel for that matter. Their software are piles of proprietary garbage fires. They may say they are serious but literally nothing indicates they are."

"Yes, and ROCm also doesn't work on anything non-AMD. In fact it doesn't even work on all recent AMD gpus. T"



It's not too polite to repost what people removed. Errors on the internet shouldn't haunt people forever.

However, my experience is that the comments about AMD are spot-on, with the exception of the word "proprietary."

Intel hasn't gotten serious yet, and has a good track record in other domains (compilers, numerical libraries, etc.). They've been flailing for a while, but I'm curious if they'll come up with something okay.



As a long time user of Intel's scientific compiler/accelerator stack, I'm not sure if I'd call it a "good track record". Once you get all their libs working, they tend to be fairly well optimized, but they're always a huge hassle to install, configure, and distribute. And when I say a hassle to install, I'm talking about hours to run their installers.

They have a track record of taking open projects, adding proprietary extensions, and then requiring those extensions to work with other tools. This sounds fine, but they are very slow/never update the base libs. From version to version they'll muck with deep dependencies, sometimes they'll even ship different rules on different platforms (I dare you to try to static link openmp in a recent version of oneapi targeting windows). If you ship a few tools (let's say A and B) that use the same dynamic lib, it's a royal pain to make sure they don't conflict with each other if you update software A but not B. Ranting about consumer junk aside, their cluster focused tooling on Linux tends to be quite good, especially compared to amd.



Normally, I wouldn't do that, but this time I felt like both comments were intentionally inflammatory and then the context for my response was lost.

"literally nothing" is also wrong given that they just had a large press announcement on Dec 6th (yt as part of my response below), where they spent 2 hours saying (and showing) they are serious.

The second comment was made and then immediately deleted, in a way that was to send me a message directly. It is what irked me enough to post their comments back.



I deleted it because I was in an extremely bad mood and later realized it was simply wrong of me to post it and vent my unrelated frustration in those comments. I think it's in extremely bad taste to repost what I wrote when I made the clear choice to delete it.


(Posting here because I don't have an email address for you.)

I reassigned your comments in this thread to a random user ID, so it's as if you had used a throwaway account to post them and there's no link to your main account. I also updated the reference to your username in another comment. Does that work for you?



You're right. I apologize. I've emailed dang to ask him to remove this whole thread.


If you didn't say it, the rocks would cry out. Personally, I'll believe AMD is maybe serious about Rocm if they make it a year without breaking their Debian repository.


>It's not too polite to repost what people removed. Errors on the internet shouldn't haunt people forever.

IMO the comment deletion system handles deleting your own comment wrong - it should grey out the comment and strikethrough it, and label it "comment disavowed" or something with the username removed, but it shouldn't actually delete the comment.

Deleting the comment damages the history, and makes the comment chain hard to follow.



I'm not sure history should be sacred. In the world of science, this is important, but otherwise, we didn't used to live in a universe where every embarrassing thing we did in middle school would haunt us in our old age.

I feel bad for the younger generation. Privacy is important, and more so than "the history."



Long ago, under a different account, I emailed dang about removing some comment content that became too identifying only years after the comments in question. He did so and was very gracious about it. dang, you're cool af and you make HN a great place!


You're being very obsequious about having to personally get approval from a moderator to do on your behalf something every other forum lets you do on your own by default.


Plus, dang does a good job.

Not a perfect job -- everyone screws up once in a while -- but this forum works better than most.



How’s SYCL proprietary exactly?


AMD made a presentation on their AI software strategy at Microsoft Ignite two weeks ago. Worth a watch for the slides and live demo

https://youtu.be/7jqZBTduhAQ?t=61



Seriously, why don't they just dedicate a group to creating the best pytorch backend possible? Proving it there will gain researcher traction and prove that their hardware is worth porting the other stuff over to.


You can't "just" do stuff like this. You need the right guy and big corps have no clue who is capable.


I agree. That’s what MS didn’t understand with cloud and Linux at first.

There is more than just a hardware layer to adoption.

CUDA is a platform is an ecosystem is also some sort of attitude. It won’t go away. Companies invested a lot into it.



Guys need to do ye olde embrace, extend maneuver. What wine did, what Javas of the world did. CUDA driver API and CUDA runtime API either translation or implementation layer that offers compatibility and speed. I see no way around it at this point, for now.


They could do that. That would eliminate Nvidia's monopoly. AMD has made gestures in that direction with HIP. But they ultimately don't want to do that - HIP support is half-assed and inconsistent. AMD creates and abandons a variety of APIs. So the conclusion is the other chip makers whine about Nvidia's monopoly but don't want to end it - they just want to maneuver to get their own smaller monopolies of some sort or other.


You're right. At this stage CUDA is de facto what the standard is around. Just like in ISA wars x86 was. Doesn't matter if you have POWER whatever when everything's on the other thing. I get why not though, it would drag the battle onto Nvidia's home turf. At least it would be a battle though.


Wasn't Intel the biggest supporter of OpenCV? I don't know of any open source project heavily supported by Nvidia.


> but nobody gets that it's the software and ecosystem.

Really? Because I’m confident Intel knows exactly what it’s about. Have you looked at their contributions to Linux and open source in general? They employ thousands of software developers.



Which is why intel left nvidia and everyone else in the dust with CUDA, which they developed.

Oh, wait…



What exactly do you think Intel was going to write CUDA for? Their GPU products have been on the market less than 2 years, and they're still trying to get their arms wrapped around drivers.

Them understanding what was coming doesn't mean they have a magic wand to instantly have a fully competitive product. You can't write a CUDA competitor until you've gotten the framework laid. The fact they invested so heavily in their GPUs makes it pretty obvious they weren't caught off-guard, but catching up takes time. Sometimes you can't just throw more bodies at the problem...



That’s my point though - they’re catching up, not leading, which rather implies that they absolutely missed a beat, and don’t therefore understand where the market is going before it goes there.


Is Mojo trying to solve this?

https://www.youtube.com/watch?v=SEwTjZvy8vw



Both AMD and Intel (and Qualcomm to some degree) just don't seem to get how you beat NVIDIA.

If they want to grab a piece of NVIDIA's pie, they do NOT need to build something better than an H100 right away. There are a million consumers who are happy with a 4090 or 4080 or even 3080 and would love for something that's equally capable at half price, and moreover, actually available for purchase, from Amazon/NewEgg/wherever, and without a "call for pricing" button. AMD and Intel are much better at making their chips available for purchase than NVIDIA. But that's not enough.

What they DO need to do to take a piece of NVIDIA's pie is to build "intelcc", "amdcc", and "qualcommcc" that accept the EXACT SAME code that people feed to "nvcc" so that it compiles as-is, with not a single function prototype being different, no questions asked, and works on the target hardware. It needs to just be a drop-in replacement for CUDA.

When that is done, recompiling PyTorch and everything else to use other chips will be trivial.



That's not going to work because each GPU has a different internal architecture and aligning the way data is fed with how stream processors operate is different for each architecture (stuff like how memory buffers are organized/aligned/paged etc.). AMD is very incompatible to Nvidia at the lowest level so things/approaches that are fast on Nvidia can be 10x slower on AMD and vice versa.


> things/approaches that are fast on Nvidia can be 10x slower on AMD and vice versa.

That's fine but the job of software is to abstract that out. The code should at least compile, even if it is 10x less efficient and if multiple awkward sets of instructions need to be used instead of one.

If Pytorch can be recompiled for AMD overnight with zero effort (only `ln -s amdcc nvcc`, `ln -s /usr/local/cuda /usr/local/amda`) they will gain some footing against Nvidia.



>Both AMD and Intel (and Qualcomm to some degree) just don't seem to get how you beat NVIDIA.

That's simple, just do what Nvidia did. Be better than the competition.



That's what HIP is though: a recompiler for CUDA code. It's not good enough for translating PTX assembly to AMDGPU yet.


They still don't get that those who are serious about hardware, must make their own software.


huh, I didn't even know openvino supported anything but CPUs! TIL


> the software incompatibilities mean you'll spend a ton of time getting it to work compared to an Nvidia GPU.

Leverage LLMs to port the SW



Fun fact: More than half of all engineers at NVIDIA are software engineers. Jensen has deliberately and strategically built a powerful software stack on top of his GPUs, and he's spent decades doing it.

Until Intel finds a CEO who is as technical and strategic, as opposed to the bean-counters, I doubt that they will manage to organize a successful counterattack on CUDA.



>finds a CEO who is as technical and strategic, as opposed to the bean-counters

Did you just call Gelsinger a "non-technical"? wow, how out of touch with reality

>Gelsinger first joined Intel at 18 years old in 1979 just after earning an associate degree from Lincoln Tech.[9] He spent much of his career with the company in Oregon,[12] where he maintains a home.[13] In 1987, he co-authored his first book about programming the 80386 microprocessor.[14][1] Gelsinger was the lead architect of the 4th generation 80486 processor[1] introduced in 1989.[9] At age 32, he was named the youngest vice president in Intel's history.[7] Mentored by Intel CEO Andrew Grove, Gelsinger became the company's CTO in 2001, leading key technology developments, including Wi-Fi, USB, Intel Core and Intel Xeon processors, and 14 chip projects.[2][15] He launched the Intel Developer Forum conference as a counterpart to Microsoft's WinHEC.



Gelsinger is a typical hardware engineer out of his depth competing against what is effectively a software play. This is a recurring theme in the industry, where you have successful hardware companies with strong hardware-focused leadership fail over time because they don't get software.

I used to work at Nokia Research. The problem was on full display during the period when Apple made its entry into mobile. We had plenty of great software people throughout the company. But the leadership had grown up in a world where Nokia was basically making and selling hardware - radio engineers and hardware engineers, basically. They did not get software all that well. And of course what Apple did was execute really well on software for what was initially a nice but not particularly impressive bit of hardware. It's the software that made the difference. The hardware excellence came later. And the software only got better over time. Nokia never recovered from that. And they tried really hard to fix the software. It failed. They couldn't do it. Symbian was a train wreck and flopped hard in the market.

Intel is facing the same issue here. Their hardware is only useful if there's great software to do something with it. The whole point of hardware is running software. And Intel is not in the software business so they need others to do that for them. Similar to Nokia, Apple came along and showed the world that you don't need Intel hardware to deliver a great software experience. Now their competitor NVidia is basically stealing their thunder in the AI and 3D graphics market. Intel wants in but just like they failed to get into the mobile market (they tried, with Nokia even), their efforts to enter this market are also crippled by their software ineptness.

This is a lesson that many IOT companies struggle with as well. Great hardware but they typically struggle with their software ecosystems and unlocking the value of the hardware. So much so that one Finnish software company in this space (Wirepas), has been running an absolute genius marketing campaign with the beautiful slogan "Most IOT is shit". Check out their website. Some very nice Finnish humor on display there. Their blunt message is that most hardware focused IOT companies are hopelessly clumsy on the software front and they of course provide a solution.



apple did initially want to build on what they saw as the best fab - Intel. unlock the power Intel would bring for their phone. But they had some design objectives focused on user experience (power/cost) and Intel didn't see the value. Intel then scrambled to try and build what apple had asked for but without the software.

Nokia kept doing crazy hardware to show off on the hardware side. But these old companies can't stop nickel-and-diming - so you'd get stuff with crazy DRM etc. And the software wasn't there or invested in fully.



My brother in Christ he was literally the CEO of VMware


Intel spent more than a decade under Otellini, Krzanich and Swan. Bean counters. Gelsinger was appointed out of desperation, but the problem runs much deeper. I doubt that culture is gone. It has already cost Intel many opportunities.


Otellini wasn't an engineer but still he made the historical x86-mac deal, pushed like crazy for x86-android and owned the top500 with xeon phi.

The downfall began with Krzanich who had no goal besides raising the stock price and no strategy other than cutting long-term projects and other costs that got in the way. What a shame.



>owned the top500 with xeon phi.

This is interesting - because what I heard (within Intel at the time, circa 2015) was Xeon Phi was a disaster. The programming model was bad and they couldn't sell them.



Otellini also made the historical decision to pass on the iPhone chip...


Krzanich started out as an engineer


"Optimize for Wall Street" is a disease to which even the seemingly-best minds can succumb.


>Intel spent more than a decade under Otellini, Krzanich and Swan. Bean counters.

It still doesn't change mistake in your original message.

>Gelsinger was appointed out of desperation, but the problem runs much deeper.

How much "much deeper"? VPs? middle level managers? engineers?

The example goes from the top, so if he can change the culture at the top, it will eventually get deeper.



> as technical and strategic

It seems that this is an AND not an OR.



>It still doesn't change mistake in your original message.

Precisely. The problem I found on HN is that it is hard to have any meaningful discussion on anything hardware, especially when it is mixed with business or economic models.

I was greedy and was hoping Intel could fall closer to $20 in early 2023 before I load up more of their stock. Otherwise I would put more money where my mouth is.



>Gelsinger was appointed out of desperation,

You will need to observe Intel more closely. It was not out of desperation. And Gelsinger is more technical and strategic than you imply.



If you read carefully you’ll see the comment is “as technical and strategic.”

That’s very different from “non-technical.”

He was clearly capable of leading a team to develop a new processor but that’s not the issue here.



1989 is 34 years ago.


Gelsinger is saying "the entire industry" and that seems likely to be a simple fact. Every single player, other than Nvidia, has an incentive to minimise the importance of CUDA as a proprietary technology. That is a lot more programmers than Nvidia can afford to employ.

Even if Intel falls over its own feet, the incentives to bring in more chip manufacturers are huge. It'll happen, the only question is whether the timeframe is months, years or a decade. My guess is shorter timeframes, this seems to mostly be matrix multiplication and there is suddenly a lot of money and attention on the matter. And AMD's APU play [0] is starting to reach the high end of the market with the MI300A which is an interesting development.

[0] EDIT: For anyone not following that story, they've been unifying system and GPU memory; so if I've understood this correctly there isn't any need to "copy data to the GPU" any more on those chips. Basically the CPU will now have big extensions for doing matrix math. Seems likely to catch on. Historically they've been adding that tech to low-end CPU so it isn't useful for AI work, now they're adding it to the big ones.



> That is a lot more programmers than Nvidia can afford to employ.

How many programmers one can employ is determined by profits, and Nvidia has monopoly profits thanks to CUDA, while "the entire industry" can at best hope to create some commoditized alternative to CUDA. Companies with real market power can beat entire industries of commodity manufacturers; Apple is the prime example.



AMD and Intel together have more revenue than Nvidia, even without considering any other player in the industry or any community contributions they get from being open source.


it's not about revenue, it's about investment. It is closely related to future profit. Not so much to current revenue ...


Profit is revenue minus costs. Investment is costs. If you're reinvesting everything you take in your current-year profit would be zero because you're making large investments in the future.


How many programmers do you really need though to catch up to what CUDA has already? The path has been laid. There's no need for experimentation. Just copy what NVIDIA did. No?




> Gelsinger is saying "the entire industry" and that seems likely to be a simple fact. Every single player, other than Nvidia, has an incentive to minimise the importance of CUDA as a proprietary technology. That is a lot more programmers than Nvidia can afford to employ.

I mean, this statement is technically true, but it's true for any proprietary technology. If things worked like this, we wouldn't have any industries where proprietary tech/formats are prevalent.



I suppose, but it is a practical matter here. CUDA is a library for memory management and matrix math targeted at researchers, hyper-productive devs and enthusiasts. It looks like it'll be highly capital intensive, requiring hardware that runs in some of the biggest, nastiest, OSS-friendliest data-centres in the world who all design their own silicon. The generations of AMD GPU that matter - the ones out and on people's machines - aren't supported for high quality GPGPU compute right now. Alright, that means CUDA is a massive edge right now. But that doesn't look like a defensible moat.

I was interested in being part of this AI thing; what stopped me wasn't lack of CUDA, it was that my AMD card reliably crashes under load doing compute workloads. Then when I saw George Hotz having a go, the problem wasn't lack of CUDA; it was that his AMD card crashed under compute workloads (technically I think it was running the demo suite). That is only anecdata, but 2 for 2 is almost a significant sample, given how few players and how little big money there has historically been in AI.

Lacking CUDA specifically might be a problem here, but I've never seen AMD fall down at that point. I've only ever seen them fall down at basic driver bugs. And I don't see how CUDA would matter all that much because I can implement most of what I need math-wise in code. If I see a specific list of common complaints maybe I'll change my mind, but I'm just not detecting where the huge complexity is. I can see CUDA maintaining an edge for years because it is convenient, but I really don't see how it can stay essential. The card can already do the workload in theory and in practice, assuming the code path doesn't bug out. I really don't need CUDA; all I want is for rocBLAS not to crash. I suspect that'd go a long way in practice.



AMD could use testers (cough, clients, I mean) like you. Jokes aside, please report bugs to the ROCm GitHub.


Unless their hardware is on the official support list, I wouldn't be too hopeful for a quick resolution. Still, it's even less likely to get fixed if it's not reported.

If nothing else, I would be curious to know more about the issue. Personally, I want to know how well ROCm functions on every AMD GPU.



I'm not an expert here, but with:

> That is a lot more programmers than Nvidia can afford to employ

How do you account for the increased complexity those developers have to deal with in an environment where there are multiple companies with conflicting incentives working on the standard?

My gut reaction is to worry if this is one of those problems like "9 people working together can't have a baby in one month".



I actually find that a really interesting question with a really interesting answer - the scaling properties of large groups of people are unintuitive. In this case, my guess would be high market complexity, with the entire userbase ignoring that complexity in favour of 1-2 vendors with simple and cheap options. So the market overall will just settle on de facto standards.

Of course, based on what we see right now that standard would be Nvidia's CUDA; but while CUDA is impressive I don't think running neural nets requires that level of complexity. We're not talking about GUIs which are one of the stickiest and most complicated blocks of software we know about, or complex platform-specific operations. I'd expect that the need for specialist libraries to do inference to go away in time and CUDA to be mainly useful for researching GPU applications to new problems. Training will likely just come down to raw ops/second in hardware rather than software.

It isn't like this stuff can't already run on other cards. AMD cards can run stable diffusion or LLMs. The issue is just that AMD drivers tend to crash. That is simultaneously a huge and a tiny problem - if they focus on it it won't be around for long. CUDA is an advantage, but not a moat.



I find the hero-worship of Pat Gelsinger — examples in sibling comments — really weird. My impression of him at VMware was very beancounter-y, not especially technical, and too caught up in personal vendettas and status games to make good technical leadership decisions.

Granted, I may have just gotten off on the wrong foot. The first thing he said to Pivotal during the acquisition announcement was, “You were our cousins, but you’re now more like children.” So the whole tone was just weird.



This was true for Intel for at least 10 years, and I'm pretty sure for much longer than that. It was probably true for Nvidia for about as long as they've existed.

Hardware without software is just expensive sand. Every semiconductor company knows this. Intel was the one to perfect the whole package with x86 in the first place…

In the GPU compute space CUDA is x86. It’s ubiquitous, de facto standard and will be disrupted. Question is if it takes a year or a decade.



The stereotype is hardware engineers all think software is easy. So while semiconductor firms know software is important, they're often optimistic about the ease of creating it.

Cuda is enormous, very complicated and fits together relatively well. All the semiconductor startups have a business plan about being transparent drop in replacements for cuda systems, built by some tens of software engineers in a year or two. That only really makes sense if you've totally misjudged the problem difficulty.



You don't know Pat Gelsinger do you?


We're in 2023, he's been in the CEO seat for 2 years already. He's had plenty of time to show the world his intent and where they are going. All that has happened is they launched a very mid GPU and have yielded more ground to AMD. Meanwhile AMD continue to eat away at Intel's talent pool, market share, and still managed to push into the AI space.

He should be sweating.



> for 2 years already ... All that has happened is they launched a very mid GPU

Hardware development cycles are closer to 5 years. So while he might have gotten some adjustments done on the designs so far, if he turned the ship around it'll take a while longer to materialize.

The software side is more agile, so any tea-leaf reading to discern what Gelsinger's strategy looks like is best done over there.



Not only that, for a “first” (not sure how much of Larrabee was salvaged) discrete GPU attempt, Intel Arc is fantastic. Look at the first GPUs Nvidia and ATI launched.

It’s only when you put them up against Nvidia and AMD’s comes-with-decades-of-experience offerings that Intel’s GPUs seem less than stellar.



Yeah, Arc is incredible in how much it accomplished as a first attempt, and as long as they keep at it without chopping it up into a bunch of artificially limited market segments, it'll probably be incredibly competitive in a few generations.


His intent is “5 nodes in 4 years” - [0]. The goal is to reclaim the node leadership from TSMC by 2025.

They announced the first chips based on Intel 4 today, which is more or less equivalent to TSMC’s 5nm.

They may fail, but the goal is clear and ambitious.

[0] - https://www.xda-developers.com/intel-roadmap-2025-explainer/



>All that has happened is they launched a very mid GPU and have yielded more ground to AMD.

And you don't blame that on Raja Koduri but on Gelsinger?



or Lisa Su


Ultimately it comes down to Intel and AMD being penny wise, pound foolish. They're unwilling to hire a lot of quality software engineers, because they're expensive, so they just continue to fall behind NVidia.


Nvidia GPU moat has always been their software. Game ready drivers are a big deal for each AAA game launch and they always help to push their fps numbers on reviewers charts. I feel like for 20 years I've been reading people online complain about ATI/AMD drivers and how they want to go back to an Nvidia card the next chance they get.


This hasn't been true for more than a decade at this point, and in fact AMD tends to have the better driver support, especially long term.


Well, you said a decade so I'll take an easy one from 4 years ago.

"Yes, you know all those r/amd and r/nvidia posts about people ditching their RX 5700 XT and switching over to an Nvidia RTX 2070 Super… AMD is reading them and has obviously been jamming them down the throats of its software engineers until they could release a driver patch which addresses the issues Navi users have been experiencing."

https://www.pcgamesn.com/amd/radeon-rx-5700-drivers-black-sc...



People used the same argument when saying AMD would never beat Intel in CPUs. Intel has a lot of software engineers. Also these days AMD has a good number of software folks, thanks to the Xilinx acquisition and the organic investments in this area.


Intel has over 15,000 software engineers, per their website. I couldn't find a number for NVIDIA, but it looks like they have a bit above 26k total employees.

So, it's very likely Intel has more software engineers than NVIDIA. Intel has far more products than NVIDIA, though, so NVIDIA almost certainly has more software engineers working on GPUs.



I would say that over a certain number of devs the output decreases.

I think it was 500 people working on Windows XP? A hundred for Windows 95. Etc.



Intel's/NVidia's/AMD's GPU drivers alone are probably more LOC at this point than the whole of XP...


Are you saying they're able to develop without a PM for every 5 engineers? Insane.


Yes and, for different reasons:

> ...as opposed to the bean-counters

Intel whining about the CHIPS Act, while doing huge layoffs, while continuing massive stock buybacks, while continuing to pay dividends...

I'm not impressed.

I'd expect a corporation facing an existential crisis would make some tough decisions and accept the consequences. I know that Wall St (investors) is the tail wagging the dog. But I expect a leader to hold off the vultures (ghouls) long enough to do the necessary pivots.

At least Intel's doom spiral isn't as grim as the conundrum facing the automobile manufacturers. They need to prop up their declining business units, pay off their loan sharks (investors), somehow resolve their death embrace with the unions, transform culture (bust up their organizational silos), transform relationships with their suppliers, AND somehow get capital (cheap enough) to fund their pivots.

I'm sure Intel has the technical chops to do great things. But financially their half-measures strategy doesn't instill confidence.

Source: Just a noob reading the headlines. Am probably totally off base.



Yeah, they should get someone who actually architected a successful chip, like the 486. Maybe a boomerang who used to be CTO. Get rid of this beancounter and hire someone like that!!

/S



If they create a better tool chain, ecosystem, and programming experience than CUDA and compatible with all computational platforms at their peak performance - awesome! Everyone wins!

Until then, it's a bit of a funny claim, especially considering what a failure OpenCL was (in programmer experience and fading support), or trying to do GPGPU with compute shaders in DX/GL/Vulkan. Are they really "motivated"? They had so many years and the results are miserable... And I don't think they invested even a fraction of what got invested into CUDA. Put your money where your mouth is.



I wish AMD or Intel would just ship a giant honking CPU with 1000s of cores that doesn't need any special purpose programming languages to utilize. Screw co-processors. Screw trying to make yet another fucked up special purpose language -- whether that's C/C++-with-quirks or a half-assed Python clone or whatever. Nuts to that. Just ship more cores and let me use real threads in regular programming languages.


It doesn't work if you're going against GPUs. All the nice goodies we are accustomed to on large desktop x86 machines with gigantic caches and huge branch predictor area and OOO execution engines -- the features that yield the performance profile we expect -- simply do not translate or scale up to thousands of cores per die. To scale that up, you need to redesign the microarchitecture in a fundamental way to allow more compute-per-mm^2 of area, but at that point none of the original software will work in any meaningful capacity because the pipeline is so radically different, it might as well be a different architecture entirely. That means you might as well just write an entirely different software stack, too, and if you're rewriting the software, well, a different ISA is actually the easy part. And no, shoving sockets on the mobo does not change this; it doesn't matter if it's a single die or multi socket. The same dynamics apply.


While the first >1000 core x86 processor is probably a little ways out, Intel is releasing a 288-core x86 processor in the first half of 2024 (Sierra Forest). I assume AMD will have something similarly high core in 2024-25 as well.


To be clear, you can probably make a 1000 core x86 machine, and those 1000 cores can probably even be pretty powerful. I don't doubt that. I think Azure even has crazy 8-socket multi-sled systems doing hundreds of cores, today. But this thread is about CUDA. Sierra Forest will get absolutely obliterated by a single A100 in basically any workload where you could reasonably choose between the two as options. I'm not saying they can't exist. Just that they will be (very) bad in this specific competition. I made an edit to my comment to reflect that.

But what you mention is important, and also a reason for the ultimate demise of e.g. Xeon Phi. Intel surely realized they could just scale their existing Xeon designs up-and-out further than expected. Like from a product/SKU standpoint, what is the point of having a 300 core Phi where every core is slow as shit, when you have a 100 core 4-socket Xeon design on the horizon, using an existing battle-tested design that you ship billions of dollars worth every year? Especially when the 300 core Xeon fails completely against the competition. By the time Phi died, they were already doing 100-cores-per-socket systems. They essentially realized any market they could have had would be served better by the existing Xeon line and by playing to their existing strengths.



> Intel is releasing a 288-core x86

This made me wonder a couple of things-

What kind of workloads and problems is that best suited for? It’s a lot of cores for a CPU, but for pure math/compute, like with AI training and inference and with graphics, 288 cores is like ~1.5% of the number of threads of a modern GPU, right? Doesn’t it take particular kinds of problems to make a 288 core CPU attractive?

I also wondered if the ratio of the highest core count CPU to GPU has been relatively flat for a while? Which way is it trending- which of CPUs or GPUs are getting more cores faster?



You could do sparse deep learning with much, much larger models with these CPUs. As paradoxical as it might sound, sparse deep learning gets more compute bound as you add more cores.


I'd be curious to learn more about how it's compute bound and what specifically is compute bound. On modern H100s you need ~600 fp8 operations per byte loaded from memory in order to be compute bound, and that's with full 128-byte loads each time. Even integer/fp32 vector operations need quite a few operations to be compute bound (~20 for vector fp32).


I think you misunderstood what I mean. Sparse ML is inherently memory latency bound since you have a completely unpredictable access pattern prone to cache misses. The amount of compute you perform is a tiny blip compared to the hash map operations you perform. What I mean is that as you add more cores, there are sharing effects because multiple cores are accessing the same memory location at the same time. The compute bound sections of your code become a much greater percentage of the overall runtime as you add cores, which is surprising, since adding more compute is the easy part. Pay attention to my words "_more_ compute bound".

Here is a relevant article: https://www.kdnuggets.com/2020/03/deep-learning-breakthrough...



288 cores or threads? Because to my knowledge AMD already has a 128-core, 256-thread processor with the Epyc 9754.


Apple might be sort-of trying to build the honking CPU, but it still requires different language extensions and a mix of different programming models.

And what you suggest could be done, but it would likely flop commercially if you made it today, which is why they aren’t doing it. SIMD machines are faster on homogenous workloads, by a lot. It would be a bummer to develop a CPU with thousands of cores that is still tens or hundreds of times slower than a comparably priced GPU.

SIMD isn’t going away anytime soon, or maybe ever. When the workload is embarrassingly parallel, it’s cheaper and more efficient to use SIMD over general purpose cores. Specialized chiplets and co-processors are on the rise too, co-inciding with the wane of Moore’s law; specialization is often the lowest hanging fruit for improving efficiency now.

There’s going to be plenty of demand for general programmers but maybe worth keeping in mind the kinds of opportunities that are opening up for people who can learn and develop special purpose hardware and software.



Well, that is what a GPU is. CUDA / OpenMP etc. are attempts at conveniently programming a mixed CPU/GPU system.

If you don't want that, program the GPU directly in assembly or C++ or whatever. A kernel is a thread - program counter, register file, independent execution from the other threads.

There isn't a Linux kernel equivalent sitting between you and the hardware so it's very like bare metal x64 programming, but you could put a kernel abstraction on it if you wanted.

Core isn't very well defined, but if we go with "number of independent program counters live at the same time" it's a few thousand.

x64 cores are vaguely equivalent to GCN compute units - 100 or so of either in a 300W envelope. An x64 core has two threads and a load of branch prediction / speculation hardware; a GCN compute unit has 80 threads and swaps between them each cycle. Same sort of idea, different allocation of silicon.



The closest Intel got to this was Xeon Phi / Knights Landing https://en.wikipedia.org/wiki/Xeon_Phi with 60+ cores per chip, each able to run 4 threads simultaneously - each of which could run arbitrary x86 code. Discontinued due to low demand in 2020 though.

In practice, people weren’t afraid to roll up their sleeves and write CUDA code. If you wanted good performance you had to think about data parallelism anyways, and at that point you’re not benefiting from x86 backwards compatibility. It was a fascinating dream while it lasted though.



It was called Larrabee and XeonPhi, they botched it, and the only thing left from that effort is AVX.


I used to play with these toys 7-8 years ago. We tried everything, and it was bad at it all.

Traditional compute? The cores were too weak.

Number crunching? Okay-ish but gpus were better.

Useless stuff.



Hence why "they botched it".


They seemed exceedingly hard to use well but interestingly capable & full of promise. And they were made in a much more primitive software age.

I'd love to hear about what didn't work. OpenMP support seemed OK, maybe, but OpenMP is just a platform; figuring out software architectures that are mechanistically sympathetic to the system is hard. It would be so interesting to see what Xeon Phi might have been if we had Calcite or Velox or OpenXLA or other execution engines/optimizers that can orchestrate usage. The possibility of something like Phi seems so much higher now.

There's such a consensus around Phi tanking, and yes, some people came and tried and failed. But most of those lessons - of why it wasn't working (or was!) - never survived the era, were never turned into stories & research that illuminate what Phi really was. My feeling is that most people were staying the course on GPU stuff, and that there weren't that many people trying Phi. I'd like more than the hearsay heaped at Phi's feet to judge by.



Well... Back then in my shop they would just assign programmers to things, together with a couple of mathematicians.

Math guys came up with a list of algorithms to try for a search engine backend.

What we needed was matrix multiplication and maybe some decision tree walking (that was some time ago, trees were still big back then, NNs were seen as too compute-intensive for no clear benefits). So we thought that it might be cool to have a tool that would support both. Phi sounded just right for both.

And things written to AVX-512 did work. The software was surprisingly easy to port.

But then comes the usual SIMD/CPU trouble: every SIMD generation wants a little software rewrite. So for both Phi generations we had to update our code. For things not compatible with the SIMD approach (think tree-walking) it is just a weak x86.
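
For anyone who hasn't lived through it, here is a minimal sketch of that "little rewrite": the same elementwise add written against AVX2 and then against AVX-512 intrinsics (remainder handling omitted; the names and vector widths are the only things that change, but they change everywhere).

    #include <immintrin.h>

    // AVX2: 8 floats per iteration
    void add_avx2(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i + 8 <= n; i += 8) {
            _mm256_storeu_ps(out + i,
                _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
        }
    }

    // AVX-512: same idea, 16 floats per iteration, every intrinsic renamed
    void add_avx512(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            _mm512_storeu_ps(out + i,
                _mm512_add_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i)));
        }
    }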

In theory Phis were universal; in practice what we got was: okay number crunching, bad generic compute.

GPUs were somewhat similar: the software stack was unstable, and CUDA hadn't yet materialized as a standard. But every generation introduced a massive increase in available compute. And boy did NVIDIA move fast...

So the GPU situation was: amazing number crunching, no generic compute.

And then there were a few ML breakthrough results which rendered everything that did not look like a matrix multiplication obsolete.

PS I wouldn't take this story too seriously, details may vary.



Some observations:

- Very bad performance at existing x86 workloads, so a major selling point was basically not there in practice, because extracting any meaningful performance required a software rewrite anyway. This was an important adoption criterion; if they outright said "All your existing workloads are compatible, but will perform like complete dogshit", why would anyone bother? Compatibility was a big selling point that ended up meaning little in practice, unfortunately.

- Not actually what x86 users wanted. This was at the height of "Intel stagnation" and while I think they were experimenting with lots of stuff, well, in this case, they were serving a market that didn't really want what they had (or at least wasn't convinced they wanted it).

- GPU creators weren't sitting idle and twiddling their thumbs. Nvidia was continuously improving performance and programmability of their GPUs across all segments (gaming, HPC, datacenters, scientific workloads) while this was all happening. They improved their compilers, programming models, and microarchitecture. They did not sit by on any of these fronts.

Ironically the main living legacy of Phi is AVX-512, which people did and still do want. But that kind of gives it all away, doesn't it? People didn't want a new massively multicore microarchitecture. They wanted new vector instructions that were flexible and easier to program than what they had -- and AVX-512 is really much better. They wanted the things they were already doing to get better, not things that were like, effectively a different market.

Anyway, the most important point is probably the last one, honestly. Like we could talk a lot about compiler optimizations or autovectorization. But really, the market that Phi was trying to occupy just wasn't actually that big, and in the end, GPUs got better at things they were bad at, quicker than Phi got better at things it was bad at. It's not dissimilar to Optane. Technically interesting, and I mourn its death, but the competition simply improved faster than the adoption rate of the new thing, and so flash is what we have.

Once you factor in that you have to rewrite software to get meaningful performance uplift, the rest sort of falls into place. Keep in mind that if you have a $10,000 chip and you can only extract 50% of the performance, you have more or less just set $5,000 on fire for nothing in return. You might as well go all the way and use a GPU because at least then you're getting more ops/mm^2 of silicon.



I don't disagree anywhere, but I don't think any of these statements actually condemn Xeon Phi outright. It didn't work at the time, and doing it with so little software support to tile out workloads well was a big & possibly bad gambit, but I'm very unsure we can condemn the architecture. There seem to be so few folks who made good attempts and succeeded or failed & wrote about it.

I tend to think there was tons of untapped potential still on the table. And that a failure to adopt potential isn't purely Intel alone's fault. The story we are commenting on is about the rest-of-industry trying to figure out enduring joint strategies, and much of this is chipmaker provided, but it is also informed and helped by plenty of consumers also pouring energy in to figure out what's working and not, trying to push the bounds.

Agreed that anyone going in thinking Xeon Phi would be viable for running a boring everyday x86 workload was going to be sad. To me the promise seemed clear that existing toolchains & code would work, but it was always clear to me there were a bunch of little punycores & massive SIMD units and that doing anything not SIMD intensive wasn't going to go well at all. But what's the current trend? Intel and AMD are both actively building not punycores but smaller cores, with Sierra Forest and Bergamo. E-cores are the grown up Atom we saw here.

Yes the GPGPU folks were winning. They had a huge head start, were the default option. And Intel was having trouble delivering nodes. So yes, Xeon Phi was getting trounced for real reasons. But they weren't architectural issues! It just means the Xeon Phi premise was becoming increasingly handicapped.

As I said, I broadly agree everywhere. Your core point about giving the market more of what it already does is well taken - a river of wisdom we see again and again. But I do think conservative thinking, iterating along, is dangerous thinking that obstructs us from seeing real value & possibility before us. Maybe Intel could have made a better ML chip than the GPGPU market has gotten for years, had things gone differently; I think the industry could perhaps have been glad it had veered onto a new course, but the barriers to that happening & the slowdown in Intel delivery & the difficulty bootstrapping new software were all horrible encumbrances which were rightly more than was worth bearing together.



I don't think anybody seriously considered Phis for generic compute or something.

Most experimenters saw it as a way to have something GPU-like in terms of raw power but without the limitations characteristic of SIMT. Like, slightly different code paths for threads doing number crunching or something.

But it turns out that it's easier to force everything into a matrix. Or a very big matrix. Or a very-very-very big matrix.

And then see what sticks.



Why are we not also talking about memory bandwidth? Personal opinion: this is the key. The latest Phi had about 100 GB/s in 2017. The contemporary Nvidia GTX 1080: 320 GB/s.

When CPUs actually come with bandwidth and a decent vector unit, such as the A64FX, lo and behold, they lead the Top500 supercomputer list, also beating out GPUs of the day.

Why have we not been getting bandwidth in CPUs? Is it because SPECint benchmarks do not use much? Or because there is too much branch-heavy code, so we think hundreds of cores are helpful?

Existing machines are ridiculously imbalanced, hundreds of times more compute vs bandwidth than the 1:1 still seen in the 90s. Hence matmul as a way of using/wasting the extra compute.

The AMD MI300a looks like a very interesting development: >5 TB/s shared by 24 cores plus GPUs.



AVX might be going in the right direction, even if AVX-512 was a stretch too far. I was impressed by llama.cpp's performance boost when AVX1 support was added.

There's no intrinsic reason why multiplying matrices requires massive parallelism; in principle it could be done on a few cores plus good management of ALUs/memory bandwidth/caches.



What's wrong with compute shaders ?


I shipped a dozen products with them (mostly video games), so there's nothing "wrong" that would make them unusable. But programming them and setting up the graphics pipe (and all the passes, structured buffers, compiling, binding, weird errors, and synchronization) is a huge PITA as compared to the convenience of CUDA. Compilers are way less mature, especially on some platforms cough. Some GPU capabilities are not exposed. No real composability or libraries. No proper debugging.


These days, some game engines have done pretty well at making compute shaders easy to use (such as Bevy [1] -- disclaimer, I contribute to that engine). But telling the scientific/financial/etc. community that they need to run their code inside a game engine to get a decent experience is a hard sell. It's not a great situation compared to how easy it is on NVIDIA's stack.

[1]: https://github.com/bevyengine/bevy/blob/main/examples/shader...



I have recently published an AI-related open-source project entirely based on compute shaders https://github.com/Const-me/Cgml and I’m super happy with the workflow. Possible to implement very complicated things without compiling a single line of C++, the software is mostly in C#.

> setting up the graphics pipe

I’ve picked D3D11, as opposed to D3D12 or Vulkan. The 11 is significantly higher level, and much easier to use.

> compiling, binding

The compiler is design-time, I ship them compiled, and integrated into the IDE. I solved the bindings with a simple code generation tool, which parses HLSL and generates C#.

> No proper debugging

I partially agree but still, we have renderdoc.



I understand why you've picked D3D11, but people have to understand that comes with serious limitations. There are no subgroups, which also means no cooperative matrix multiplication ("tensor cores"). For throughput in machine learning inference in particular, there's no way D3D11 can compete with either CUDA or a more modern compute shader stack, such as one based on Vulkan 1.3.


> no subgroups

Indeed, in D3D they are called “wave intrinsics” and require D3D12. But that’s IMO a reasonable price to pay for hardware compatibility.

> no cooperative matrix multiplication

Matrix multiplication compute shader which uses group shared memory for cooperative loads: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

> tensor cores

When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is a small number often just 1, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless.
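
A back-of-envelope example (my numbers, purely illustrative): a 7B-parameter model quantized to 4 bits is roughly 3.5 GB of weights, and generating one token at batch size 1 has to stream essentially all of them. On a card with ~500 GB/s of memory bandwidth that caps generation at roughly 500 / 3.5 ≈ 140 tokens per second, no matter how many tensor-core TOPS are sitting idle next to the memory controller.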

> there's no way D3D11 can compete with either CUDA

My D3D11 port of Whisper outperformed original CUDA-based implementation running on the same GPU: https://github.com/Const-me/Whisper/



Sure. It's a tradeoff space. Gain portability and ergonomics, lose throughput. For applications that are throttled by TOPS at low precisions (ie most ML inferencing) then the performance drop from not being able to use tensor cores is going to be unacceptable. Glad you found something that works for you, but it certainly doesn't spell the end of CUDA.


> ie most ML inferencing

Most ML inferencing is throttled by memory, not compute. This certainly applies to both the Whisper and Mistral models.

> it certainly doesn't spell the end of CUDA

No, because traditional HPC. Some people in the industry spent many man-years developing very complicated compute kernels, which are very expensive to port.

AI is another story. Not too hard to port from CUDA to compute shaders, because the GPU-running code is rather simple.

Moreover, it can help with performance just by removing abstraction layers. I think the reason the compute-shader-based Whisper outperformed the CUDA-based version on the same GPU is that these implementations do slightly different things. Unlike Python and Torch, compute shaders actually program GPUs as opposed to calling libraries with tons of abstraction layers inside them. This saves the memory bandwidth spent storing and then loading temporary tensors.



This. It's crazy how primitive the GPU development process still is in the year 2023. Yeah it's gotten better, but there's still a massive gap with traditional development.


It's kinda like building Legos vs building actual Skyscrapers. The gap between compute shaders and CUDA is massive. At least it feels massive because CUDA has some key features that compute shaders lack, and which make it so much easier to build complex, powerful and fast applications.

One of the features that would get compute shaders far ahead of where they are now would be pointers and pointer casting - just let me have a byte buffer and easily cast the bytes to whatever I want. Another would be function pointers. These two are pretty much the main reason I had to stop doing a project in OpenGL/Vulkan and start using CUDA. There are so many more, however, that make life easier, like cooperative groups with device-wide sync, being able to allocate a single buffer with all the GPU memory, recursion, etc.
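
For contrast, a minimal hedged sketch of what that looks like in CUDA today: one raw device allocation, reinterpreted into whatever layout the kernel needs (the Node struct and the offsets are made up for illustration).

    #include <cuda_runtime.h>
    #include <cstdint>

    struct Node { float value; int next; };

    __global__ void touch(uint8_t* arena) {
        // Carve the raw byte buffer up by hand: 256 bytes of floats,
        // then a table of Nodes immediately after.
        float* weights = reinterpret_cast<float*>(arena);
        Node*  nodes   = reinterpret_cast<Node*>(arena + 256);
        weights[threadIdx.x]    = 1.0f;
        nodes[threadIdx.x].next = -1;
    }

    int main() {
        void* mem = nullptr;
        cudaMalloc(&mem, 1 << 20);     // one big buffer for everything (error checks omitted)
        uint8_t* arena = static_cast<uint8_t*>(mem);
        touch<<<1, 64>>>(arena);
        cudaDeviceSynchronize();
        cudaFree(arena);
    }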

Khronos should start supporting C++20 for shaders (basically what CUDA is) and stop the glsl or spirv nonsense.



You might argue for forking off from GLSL and SPIR-V for complex compute workloads, but lightweight, fast compilers for a simple language like GLSL do solve issues for graphics. Some graphics use cases can't get around shipping a shader compiler to the user. The number of possible shader configurations is often either insanely large or just impossible to enumerate, so on-the-fly compilation is really the only thing you can do.


Ironically, most people use HLSL with Vulkan, because Khronos doesn't have a budget nor the people to improve GLSL.

So yet another thing where Khronos APIs are dependent on DirectX evolution.

It used to be that AMD and NVidia would first implement new stuff on DirectX in collaboration with Microsoft, have them as extensions in OpenGL, and eventually as standard features.

Now even the shading language is part of it.



For GPGPU tasks, they lack a lot of useful features that CUDA has like the ability to allocate memory and launch kernels from the GPU. They also generally require you to write your GPU and CPU portions of an algorithm in different languages, while CUDA allows you to intermix your code and share data structures and simple functions between the two.
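
A hedged sketch of those two CUDA-only conveniences (device-side malloc and launching a kernel from inside a kernel, i.e. dynamic parallelism). It needs nvcc with relocatable device code (-rdc=true); error handling is omitted.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void child(int* buf, int n) {
        int i = threadIdx.x;
        if (i < n) buf[i] = i;
        __syncthreads();
        if (i == 0) {
            printf("child wrote %d values\n", n);
            free(buf);                 // device-heap memory can be freed on the device
        }
    }

    __global__ void parent() {
        const int n = 64;
        int* buf = static_cast<int*>(malloc(n * sizeof(int)));  // allocated on the GPU heap
        child<<<1, n>>>(buf, n);       // kernel launched from the GPU, no host round-trip
    }

    int main() {
        parent<<<1, 1>>>();
        cudaDeviceSynchronize();       // host waits for parent and its child
        return 0;
    }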


CUDA = C++ on GPUs. A compute shader = a subset of C with weird quirks.


There are existing efforts to compile SYCL to Vulkan compute shaders. Plenty of "weird quirks" involved since they're based on different underlying varieties of SPIR-V ("kernels" vs. "shaders") and seem to have evolved independently in other ways (Vulkan does not have the amount of support for numerical computation that OpenCL/SYCL has) - but nothing too terrible or anything that couldn't be addressed by future Vulkan extensions.


A subset that lacks pointers, which makes compute shaders a toy language next to CUDA.


Vulkan 1.3 has pointers, thanks to buffer device address[1]. It took a while to get there, and earlier pointer support was flawed. I also don't know of any major applications that use this.

Modern Vulkan is looking pretty good now. Cooperative matrix multiplication has also landed (as a widely supported extension), and I think it's fair to say it's gone past OpenCL.

Whether we get significant adoption of all this I think is too early to say, but I think it's a plausible foundation for real stuff. It's no longer just a toy.

[1] https://community.arm.com/arm-community-blogs/b/graphics-gam...



> Vulkan 1.3 has pointers, thanks to buffer device address[1].

> [1] https://community.arm.com/arm-community-blogs/b/graphics-gam...

"Using a pointer in a shader - In Vulkan GLSL, there is the GL_EXT_buffer_reference extension "

That extension is utter garbage. I tried it. It was the last thing I tried before giving up on GLSL/Vulkan and switching to CUDA. It was the nail in the coffin that made me go "okay, if that's the best Vulkan can do, then I need to switch to CUDA". It's incredibly cumbersome, confusing and verbose.

What's needed are regular, simple, C-like pointers.



Is IREE the main runtime doing Vulkan or are there others? Who should we be listening to (oh wise @raphlinus)?

It's been awesome seeing folks like Keras 3.0 kicking out broad intercompatibility across JAX, TF, and PyTorch, powered by flexible execution engines. Looking forward to seeing more Vulkan-based runs getting socialized, benchmarked & compared. https://news.ycombinator.com/item?id=38446353



The two I know of are IREE and Kompute[1]. I'm not sure how much momentum the latter has, I don't see it referenced much. There's also a growing body of work that uses Vulkan indirectly through WebGPU. This is currently lagging in performance due to lack of subgroups and cooperative matrix mult, but I see that gap closing. There I think wonnx[2] has the most momentum, but I am aware of other efforts.

[1]: https://kompute.cc/

[2]: https://github.com/webonnx/wonnx



How feasible would it be to target Vulkan 1.3 or such from standard SYCL (as first seen in Sylkan, for earlier Vulkan Compute)? Is it still lacking the numerical properties for some math functions that OpenCL and SYCL seem to expect?


That's a really good question. I don't know enough about SYCL to be able to tell you the answer, but I've heard rumblings that it may be the thing to watch. I think there may be some other limitations, for example SYCL 2020 depends on unified shared memory, and that is definitely not something you can depend on in compute shader land (in some cases you can get some of it, for example with resizable BAR, but it depends).

In researching this answer, I came across a really interesting thread[1] on diagnosing performance problems with USM in SYCL (running on AMD HIP in this case). It's a good tour of why this is hard, and why for the vast majority of users it's far better to just use CUDA and not have to deal with any of this bullshit - things pretty much just work.

When targeting compute shaders, you pretty much have to manage buffers manually, and also do copying between host and device memory explicitly (when needed - on hardware such as Apple Silicon, you prefer to not copy). I personally don't have a problem with this, as I like things being explicit, but it is definitely one of the ergonomic advantages of modern CUDA, and one of the reasons why fully automated conversion to other runtimes is not going to work well.

[1]: https://stackoverflow.com/questions/76700305/4000-performanc...



https://enccs.github.io/sycl-workshop/unified-shared-memory/ seems to suggest that USM is still a hardware-specific feature in SYCL 2020, so compatibility with hardware that requires a buffer copying approach is still maintained. Is this incorrect?


Good call. So this doesn't look like a blocker to SYCL compatibility. I'm interested in learning more about this.


Unified shared memory is an Intel-specific extension of OpenCL.

SYCL builds on top of OpenCL so you need to know the history of OpenCL. OpenCL 2.0 introduced shared virtual memory, which is basically the most insane way of doing it. Even with coarse grained shared virtual memory, memory pages can transparently migrate from host to device on access. This is difficult to implement in hardware. The only good implementations were on iGPUs simply because the memory is already shared. No vendor, not even AMD could implement this demanding feature. You would need full cache coherence from the processor to the GPU, something that is only possible with something like CXL and that one isn't ready even to this day.

So OpenCL 2.x was basically dead. It has unimplementable mandatory features so nobody wrote software for OpenCL 2.x.

Khronos then decided to make OpenCL 3.0, which gets rid of all these difficult to implement features so vendors can finally move on.

So, Intel is building their Arc GPUs and they decided to create a variant of shared virtual memory that is actually implementable called unified shared memory.

The idea is the following: All USM buffers are accessible by CPU and GPU, but the location is defined by the developer. Host memory stays on the host and the GPU must access it over PCIe. Device memory stays on the GPU and the host must access it over PCIe. These types of memory already cover the vast majority of use cases and can be implemented by anyone. Then finally, there is "shared" memory, which can migrate between CPU and GPU in a coarse-grained manner. This isn't page-level; the entire buffer gets moved, as far as I am aware. This allows you to do CPU work, then GPU work, and then CPU work. What doesn't exist is a fully cache-coherent form of shared memory.

https://registry.khronos.org/OpenCL/extensions/intel/cl_inte...
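
To make the three flavours concrete, here is a minimal sketch of how they surface in SYCL 2020's USM API, using the standard sycl::malloc_* calls (how "shared" actually migrates is implementation-defined, as described above):

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;
        constexpr int n = 1024;

        float* host   = sycl::malloc_host<float>(n, q);    // pinned in host RAM, device reads it over the bus
        float* device = sycl::malloc_device<float>(n, q);  // lives in device memory, host needs explicit copies
        float* shared = sycl::malloc_shared<float>(n, q);  // may migrate between host and device

        for (int i = 0; i < n; ++i) shared[i] = float(i);  // touched on the CPU

        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            shared[i] *= 2.0f;                              // touched on the device
        }).wait();

        q.memcpy(device, shared, n * sizeof(float)).wait(); // explicit copy into device-only memory
        q.memcpy(host,   device, n * sizeof(float)).wait(); // and back out to host-resident memory

        sycl::free(host, q);
        sycl::free(device, q);
        sycl::free(shared, q);
    }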



Compute shaders are not capable of using modern GPU features like tensor cores or many of the other features needed to feed tensor cores data fast enough (e.g. TMA/cp.async.shared)


Can anybody with a deep knowledge of the AI space, explain to me what's the real moat of CUDA ?

It's clear to everybody that it's not the hardware but the software - which is the CUDA ecosystem.

I've played a bit in the past with ML, but at the level of understanding I had - training some models, tweaking things - I was using higher-level libraries, and as far as I know, it's pretty much an if statement in those libraries to decide which backend to use.

So let's suppose Intel and others do manage to implement a viable competitor - am I wrong in thinking that the transition for many users would be seamless? That's probably not the case for researchers and people pushing the boundaries, but for most companies, my understanding is there would not be a lot of migration cost involved?



Your understanding is correct, but the predicates are not easy at all. The amount of work going into CUDA is enormous, and NVIDIA is not standing still waiting for their competitors to catch up.


You need performance in the high level libraries to match, on a flops/$ basis. That’s “it”. That’s easier said than done, though. Even google’s TPUs still struggle to match H100s at flops/$, and they’re really annoying to use unless you’re using Jax.


I see the situation as a lot like the original IBM PC wars. Originally you had the IBM PC and a bunch of "compatibles" that weren't drop-in compatible but half-assed compatible - many programs needed to be re-compiled to run on them. And other large American companies made these - they didn't expect to commodify the PC, they just wanted a small piece of a big market.

The actual PC clones, pure drop-in compatibles, were made in Taiwan and they took over the market. Which is to say that large companies don't want a commodified market where prices are low and everyone competes on a level playing field - which is what "seamless transition" gets you. So that's why none of these companies are working to create that.



I would summarize it as follows: Nvidia has taken the bottom-up approach, from (parallel processing) hardware to the development environment built upon it. The competition (Intel) appears to be attempting to break into the market using a top-down approach, hoping to get some share of the inference market (using their sequential processing hardware) and basically leveraging the innovations happening at Nvidia. CUDA will always remain one step ahead.


It is incorrect to say that the moat is the software. The moat is primarily the compute hardware, which is still incredibly good for the price, plus really good networking equipment. CUDA is not a significant moat for massive LLM training runs, as you can see from Anthropic moving from CUDA to Trainium (and so presumably rewriting all of their kernels to Trainium).


Intel and AMD have had years to provide similar capabilities on top of OpenCL.

Maybe they should look into their own failures first.



SYCL is a better analog to Cuda than OpenCL, and Intel have their own implementation of that. Don't really see anyone writing anything in SYCL though, and when I looked into trying it out it was a bit of a mess with different implementations, each supporting their own subset of OSs and hardware.

https://www.intel.com/content/www/us/en/developer/tools/onea...



> Don't really see anyone writing anything in SYCL though

My work involves writing software that runs on many GPU platforms at once. So far we have been going down the Kokkos route, but SYCL is looking pretty good to me these days. There is some consolidation happening in this space (Codeplay gave up working on their own implementation and merged with Intel). It was pretty easy to set up on my Linux machine for an Nvidia card. The documentation is very good and professional, unlike AMD's, which can be frankly horrible at times. And Intel has a good track record with software.

I genuinely believe if someone is going to dethrone CUDA, at this point SYCL (oneAPI) is a far more likely candidate than Rocm/HIP.



I am unfamiliar with the implementation, but why would it be difficult to implement a Cuda-compatible software layer on top of other platforms?

This would be the first step. Then, if we want to move away from Cuda into hardware that's as ubiquitous and performant as Nvidia's (or better), someone would need to write an abstraction layer that's more convenient to use than Cuda. I did play a little bit with Cuda and OpenCL, but not enough to hate either.



Because that is a herculean task, given the hardware semantics, the number of languages that target PTX, the graphical debugging tools that expose every little detail of the cards as if you were debugging on the CPU, and the library ecosystem.

Any CUDA-compatible software layer has only two options: be a second-class CUDA implementation by being compatible with a subset, like AMD's ROCm and HIP efforts, or try to be compatible with everything while always playing catch-up.

The only way is to use middleware that, just like in 3D APIs, abstracts the actual compute API being used, as many language bindings are doing nowadays.



Worth noting that CUDA code seems prone to using inline PTX assembly, and the latter is really obnoxious to deal with on other platforms.

Implementing programming languages and runtimes is pretty difficult in general. Note that cuda doesn't have the same semantics as c++ despite looking kind of similar. Wherever you differ from expected behaviour people consider it a bug, and implementing based on cuda's docs wouldn't get you the behaviour people expect.

Pretty horrendous task overall. It would be much better for people to stop developing programs that only run on a gnarly proprietary language.



CUDA was redesigned to follow C++'s memory model introduced in C++11, yet another compatibility pain point with other hardware vendors.


SYCL is gaining traction, especially in the HPC community, since it can target AMD, Nvidia and Intel hardware with one codebase. A fun fact is that GROMACS (a major application for molecular dynamics, and a big consumer of HPC time) recommends SYCL for running on AMD hardware!


SYCL builds on top of OpenCL; it is basically the reboot of OpenCL C++ after the OpenCL 2.0 SPIR failure.

Yet another example of how Intel and AMD failed to take on CUDA.



SYCL isn't based on OpenCL.

SYCL (SYCL-2020 spec) supports multiple backends, including Nvidia's CUDA, AMD's HIP, OpenCL, Intel's Level-zero, and also running on the host CPU. This can either be done with Intel's DPC++ w/ Codeplay's plugins, or using AdaptiveCpp (aka. hipSYCL, aka openSYCL). OpenCL is just another backend.

It is also a very long way from OpenCL C++. The code is a single C++ file, and you don't need to write any special kernel language. The vast majority of SYCL is just C++, so -if you avoid a couple of features- you can use SYCL in library-only form without even any special compiler! This is possible for instance with AdaptiveCpp.
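
A minimal illustration of the "just C++" point (hedged: which backends are actually available depends on the SYCL implementation and plugins you have installed):

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        std::vector<int> data(256, 1);
        {
            sycl::queue q;   // picks a backend: CUDA, HIP, Level Zero, OpenCL or the host CPU
            sycl::buffer<int> buf(data.data(), sycl::range<1>(data.size()));
            q.submit([&](sycl::handler& h) {
                sycl::accessor acc(buf, h, sycl::read_write);
                h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
                    acc[i] += 41;        // an ordinary C++ lambda, run on the device
                });
            });
        }   // buffer goes out of scope here and copies the results back into the vector
        return data[0] == 42 ? 0 : 1;
    }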



It's pretty much based on the same underlying featureset. Which is why trying to target Vulkan Compute from it is messy enough, whereas OpenCL is a natural target.


Now, but that isn't how it started after the OpenCL 3.0 reboot, where 3.0 == 1.0.

Also, some of that work we have to thank Codeplay for, from before their acquisition by Intel.



Similar capabilities and similar - actually, better - performance: they don't have any of these at this time. People don't want to buy slower and maybe cheaper computation on Nvidia hardware - why would they want to do this on Intel hardware? Your app will have to change; it just seems like an obvious non-starter.

I'm not an expert here - am I missing something? If the x86 industry is motivated to move away from what Nvidia provides, Intel needs to tick some of these 'better somehow' boxes.



They've both implemented opencl. Nvidia has too. The industry could have built on that common language. Instead, it built on cuda, and complains that the other hardware vendors don't have cuda.

I attribute this to opencl being the common subset a bunch of companies could agree could be implemented. I wrote some code that compiles as cuda, opencl, C++ and openmp, and the entire exercise was repeatedly "what, opencl can't do that either? damn it".



Intel tried with OneAPI years ago. Turns out they were decades behind, so catching up takes a while…


AMD definitely has and is doubling down on ROCm.


Perhaps they're doubling down, but even doubling down is not enough to say that they're serious about it, since they've been so neglectful for so many years - for example, right now they explicitly say that many AMD GPUs are not supported by ROCm; if they're not willing to put their money where their mouth is and do the legwork to ensure support for powerful cards they sold just a few years ago, how can they say you should rely on their platform?

Unless a random gamer with a random AMD GPU can go to amd.com and download pre-packaged, officially supported tools that work out of the box on their machine and after a few clicks have working GPU-accelerated pytorch (which IMHO isn't the case, but admittedly I haven't tried this year) then their "doubling down" isn't even meeting table stakes.



People argue for ROCm to support older cards because that is all they have accessible to them. AMD has lagged on getting expensive cards into the hands of end users because they've focused only on building super computers.

I predict that access to the newer cards is a more likely scenario. Right now, you can't rent a MI250 or even MI300x, but that is going to change quickly. Azure is going to have them, as well as others (I know this, cause that's what I'm building now).



The way I see it, the whole point of ROCm support is being able to service the many users who have pre-existing AMD cards and nothing else available. If someone is going to rent a GPU, I don't need to bother with adding extra features for them, because they can just rent a CUDA-capable GPU instead.

If I'm considering adding ROCm support to some ML-enabled tool - no matter whether it's a commercial product or an open-source library - the thing I need from AMD is to ensure that the ROCm support I build will work without hassle for those end users with random old AMD gaming cards (because they are the only users who need the tool to have ROCm support). And if ROCm upstream explicitly drops support for some cards because AMD no longer regularly tests them, well, the ML tool developers aren't going to do that testing for them either; that's simply AMD intentionally refusing to do even the bare minimum (lots of testing across a wide variety of hardware to fix compatibility issues) that I'd expect to be table stakes for saying that "AMD is doubling down on ROCm".



> The way I see it, the whole point of ROCm support is being able to service the many users who have pre-existing AMD cards and nothing else available.

ROCm is a stack of a whole lot of stuff. I don't see a stack of software being "the whole point".

> the thing I need from AMD is to ensure that the ROCm support I make will work without hassle for these end-users with random old AMD gaming cards (because these are the only users who need the tool to have ROCm support)

From wikipedia:

"ROCm is primarily targeted at discrete professional GPUs"

They are supporting Vega onwards and are clear about the "whole point" of ROCm.



I'm talking about what would be the point for someone to add ROCm support to various pieces of software which currently require CUDA, as IMHO this is the core context not only of this thread but of the whole discussion of this article - about ROCm becoming a widely used replacement or alternative for CUDA.

> "ROCm is primarily targeted at discrete professional GPUs"

That's kind of true, and that is a big part of the problem - while AMD has this stance, ROCm won't threaten to replace or even meet CUDA, which has a much broader target; if you and/or AMD want to go in this direction, that's completely fine, that is a valuable niche - but limiting the application to that niche clearly is not "doubling down on ROCm" as a competitor for CUDA, and that disproves the TFA claim by Intel that "the entire industry is motivated to eliminate CUDA", because ROCm isn't even trying to compete with CUDA at the core niches which grant CUDA its staying power unless it goes way beyond merely targeting discrete professional GPUs.



> what would be the point for someone to add ROCm support to various pieces of software which currently require CUDA

It isn't just old cards though; CUDA is a point of centralization on a single provider, during a time when access to that provider's higher-end cards isn't even available, and that is causing people to look elsewhere.

ROCm supports CUDA through the included HIP projects...

https://github.com/ROCm/HIP

https://github.com/ROCm/HIPCC

https://github.com/ROCm/HIPIFY

The latter will regex-replace your CUDA methods with HIP methods. If it is as easy as running hipify on your codebase (or just coding to the HIP APIs), it certainly makes sense to do so.
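
Roughly what that mechanical translation looks like (a hedged before/after sketch, not the tool's literal output):

    // Before (CUDA)
    #include <cuda_runtime.h>
    void upload(const float* h, float** d, size_t bytes) {
        cudaMalloc(reinterpret_cast<void**>(d), bytes);
        cudaMemcpy(*d, h, bytes, cudaMemcpyHostToDevice);
    }

    // After hipify (HIP)
    #include <hip/hip_runtime.h>
    void upload(const float* h, float** d, size_t bytes) {
        hipMalloc(reinterpret_cast<void**>(d), bytes);
        hipMemcpy(*d, h, bytes, hipMemcpyHostToDevice);
    }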



Well yeah. Before I go renting a super GPU in the cloud, I'd like to get my feet wet with the 5-year-old but reasonably well-specced AMD GPU (Vega 48) in my iMac... but I can't. It's more rational for me to get a fancy 2021 GPU or a Jetson and stick it in an enclosure or build a Linux box around it. At least I know CUDA is a mature ecosystem and is going to be around for a while, so whatever time I invest in it is likely to pay for itself.

I get your point about AMD not wanting to spend money on supporting old hardware, but how do they expect to build a market without a fan base?



> I get your point about AMD not wanting to spend money on supporting old hardware, but how do they expect to build a market without a fan base?

Look, I get it. You're right. They do need to work on building their market and they really screwed the pooch on the AI boat. The developer flywheel is hugely important and they missed out on that. That said, we can't expect them to go back in time, but we can keep moving forward. Having enough people making noise about wanting to play with their hardware is certainly a step in the right direction.



> People argue for ROCm to support older cards because that is all they have accessible to them.

What they really need is to support the less expensive cards, of which the older cards are a large subset. There are a lot of people who will make contributions and fix bugs if they can actually use the thing. Some CS student at the university has to pay tuition and therefore only has an old RX570, and that isn't going to change in the next couple years, but that kind of student could fix some of the software bugs currently preventing the company from selling more expensive GPUs to large institutions. If the stack supported their hardware.



ROCm works on RX570 ("gfx803", including RX470+RX580 too).

Support was dropped upstream, but only because AMD no longer regularly test it. The code is still there and downstream distributors (like if you just apt-get install libamdhip64-5 && pip3 install torch) usually flip it enabled again.

https://salsa.debian.org/rocm-team/community/team-project/-/...

There is a scary red square in the table, but in my experience it worked completely fine for Stable Diffusion.
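For anyone trying this on a gfx803 card, a quick sanity check (just a sketch, assuming a HIP runtime is installed) is to ask the runtime what it can actually see before debugging anything higher up the stack:

```cpp
// Minimal HIP device query: confirms whether the ROCm runtime sees the card at all.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::printf("no HIP devices visible\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t props{};
        hipGetDeviceProperties(&props, i);
        // An RX 470/480/570/580 should report an architecture string like "gfx803".
        std::printf("device %d: %s (%s)\n", i, props.name, props.gcnArchName);
    }
    return 0;
}
```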



I ran 130,000 RX 470-580 cards, so I know them quite well. Those cards aren't going to do anything useful with AI/ML. The technology is just too old and things are moving too quickly. And it isn't just the card - it's the mobo, disks, RAM, networking...


That's fine for corporate customers, but how do you expect kids and hobbyists to learn the basics without spending thousands on an A6000 or something?


I believe strongly in "where there is a will, there is a way."

Those kids and hobbyists can't even rent time on high end AMD hardware today. I see that as one piece of the puzzle that I'm personally dedicating my time/resources to resolving.



An RX 570 is going to do ML faster than a typical desktop CPU. That's all it takes for the person who has one to want to use it for Llama or Stable Diffusion, and then to want to improve the software for the thing they're now using.

Not everything is a huge datacenter.



What is nonsense is the idea that AMD should dedicate its limited resources to supporting a six-year-old card with only 4-8 GB of RAM (the ones I ran had 8).

I didn't say they are bad cards... they are just outdated at this point.

If you really want to put your words into action... let me know. I'll put you in touch with someone so you can buy 130,000 of these cards and sell them to every college kid out there... Until then, I wouldn't rake AMD over the coals for not wanting to put effort into something like that when they are already lagging behind on their AI efforts as it is. I'd personally rather see them catch up a bit first.



8 GB is enough for Stable Diffusion or Llama 13B at q4. Yes, the cards are outdated, but non-outdated GPUs are still expensive, so older cards are all many people can afford.
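(Rough back-of-the-envelope check, my own numbers: 13 billion parameters at roughly 4 bits each is about 6.5 GB of weights, which leaves something like 1.5 GB of an 8 GB card for activations and the KV cache - tight, but workable for inference.)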

> I'll put you in touch with someone to buy 130,000 of these cards, and you can sell them to every college kid out there...

Just sell them on eBay? They still go for $50-$100 each, so you're sitting on several million dollars' worth of GPUs.

> I'd personally rather see them catch up a bit first.

Growing the community is how you catch up. That doesn't happen if people can't afford the only GPUs you support.



> Growing the community is how you catch up.

Agreed 100%.

> That doesn't happen if people can't afford the only GPUs you support.

On this part, we are going to have to agree to disagree. I feel that being able to affordably rent time on high-end GPUs is at least an alternative to buying them. As I mentioned above, that is something I'm actively working on.



> I feel like being able to at least affordably rent time on the high end GPUs is another alternative to buying them.

There are two problems with this.

The first is high demand. GPU time on a lot of cloud providers is sold out.

The second is that it costs money at all, versus using the GPU you already have. Needing a credit card is a barrier to hobbyists, and you want hobbyists, because they become contributors, or they get introduced to the technology and then go on to buy one of your more expensive GPUs.

You want the barrier to adoption to be level with the ground.



> The first is high demand. GPU time on a lot of cloud providers is sold out.

Something I'm trying to help with. =) Of course, I'm sure I'll be sold out too, or at least I hope so, cause that means buying more GPUs! But at least I'm actively putting my own time/energy toward this goal.

> The second is that this costs money at all, vs. using the GPU you already have.

As much as I'd love to believe in a utopia where every single GPU can be used for science, I don't think we are ever going to get there. AMD, while large, isn't a company with infinite resources. We're also talking about a specialty level of engineering.

> You want the barrier to adoption to be level with the ground.

100% agreed, it is a good goal, but that's a much larger problem than just AMD supporting their 6-7 year old cards.



The older cards are going to be supported by Mesa and RustiCL for the foreseeable future. ROCm is not the only game in town - far from it.


Thing is, for there to be a realistic alternative to CUDA, something needs to become "the only game in town", because people definitely won't add support for ROCm and Mesa and RustiCL and something else besides. Getting support for even one non-CUDA backend is already proving difficult enough, so if the alternatives are fragmented, the situation only gets worse.


RustiCL is just an OpenCL implementation. You can't have one-size-fits-all, because hardware-specific details vary a lot across generations. (That is also why newer versions of e.g. ROCm drop support for older hardware.) The best you can do is baseline support plus extensions, which is the Vulkan approach.
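As a rough illustration of that baseline-plus-extensions model (a sketch, not tied to any particular vendor's driver), a Vulkan application simply asks each physical device which optional extensions it exposes and adapts at runtime, rather than requiring a separate stack per hardware generation:

```cpp
// Enumerate devices and their optional extensions; feature-detect instead of
// hard-coding per-generation support.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    VkInstanceCreateInfo ici{};
    ici.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t devCount = 0;
    vkEnumeratePhysicalDevices(instance, &devCount, nullptr);
    std::vector<VkPhysicalDevice> devices(devCount);
    vkEnumeratePhysicalDevices(instance, &devCount, devices.data());

    for (VkPhysicalDevice dev : devices) {
        uint32_t extCount = 0;
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &extCount, nullptr);
        std::vector<VkExtensionProperties> exts(extCount);
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &extCount, exts.data());

        // Older hardware simply advertises fewer optional extensions; the
        // application falls back to the baseline when one is missing.
        bool hasFp16 = false;
        for (const auto &e : exts)
            if (std::strcmp(e.extensionName, "VK_KHR_shader_float16_int8") == 0)
                hasFp16 = true;
        std::printf("%u extensions, fp16/int8 shader ext: %s\n",
                    extCount, hasFp16 ? "yes" : "no");
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```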




Apple as well. Everyone's failure to commit in this situation let a highly integrated competitor clean up. It's funny how much clearer OpenCL's value proposition looks in hindsight...


Apple created OpenCL, and after disagreements with Khronos on how to move OpenCL forward, they stopped caring.

Apple also doesn't care about HPC, and has their own stuff on top of Metal Compute Shaders, just like Microsoft has DirectML.


