(comments)

Original link: https://news.ycombinator.com/item?id=38645021

Overall, although AMD has tried to develop ROCm for its discrete professional GPUs, it still faces challenges competing with CUDA because of limited support for older hardware, which may keep large institutions with smaller budgets from contributing the fixes that are needed. While AMD's continued investment in HPC (for example, its exascale collaboration with Argonne National Laboratory) may raise interest in its AI efforts, it does not necessarily solve the availability and pricing problems faced by students, hobbyists, and smaller institutions. Adapting to a rapidly changing technology landscape also remains critical: affordable rental time on high-end GPUs, or prioritizing new-generation cards over older hardware, can serve as alternatives to buying expensive GPUs. However, fragmentation and a lack of commitment among stakeholders still make it hard to establish a leader able to challenge CUDA's dominance. Even so, some successes show what sustained investment can yield, as demonstrated by the recent addition of RX 7900 XT support to PyTorch under ROCm. Ultimately, clear communication, a commitment to continued investment in innovation, and collaboration with academic institutions and major industry players could drive further progress in challenging CUDA's influence and its standing as the centralized stack.

Related articles

Original article
Intel CEO: 'The entire industry is motivated to eliminate the CUDA market' (tomshardware.com)
308 points by rbanffy 1 day ago | 351 comments










As another commenter said, it's CUDA. Intel and AMD and whoever can turn out chips reasonably fast, but nobody gets that it's the software and ecosystem. You have to out-compete the ecosystem. You can pick up a used MI100 that performs almost like an A100 for 5x less money on eBay, for example. Why is it 5x less? Because the software incompatibilities mean you'll spend a ton of time getting it to work compared to an Nvidia GPU.

Google is barely limping along with its XLA interface to PyTorch providing researchers a decent compatibility path. Same with Intel.

Any company in this space should basically set up a giant test suite of, IDK, every model on Hugging Face and just start brute-force fixing the issues. Then maybe they can sell some chips!

Intel is basically doing the same shit they always do here, announcing some open initiative and then doing literally the bare minimum to support it. 99% chance OpenVINO goes nowhere. OpenAI's Triton already seems more popular; at least I've heard it referenced a lot more than OpenVINO.



The funny thing to me is that so much of the "AI software ecosystem" is just PyTorch. You don't need to develop some new framework and make it popular. You don't need to support a zillion end libraries. Just literally support PyTorch.

If PyTorch worked fine on Intel GPUs, a lot of people would be happy to switch.



But you can't support PyTorch without a proper foundation in place. They don't need to support a zillion _end_ libraries, sure, but they do need to have at least a very good set of standard libraries, equivalents of cuBLAS, cuRAND, etc.

And they don't. My work recently had me working with rocRAND (ROCm's answer to cuRAND). It was frankly pretty bad: the design, the performance (50% slower in places that don't make any sense, because generating random numbers is not exactly that complicated), and the documentation (God, it was awful).

Now, that's a small slice of the larger pie. But imagine if this trend continues for other libraries.



Generating random numbers is a bit complicated! I wrote some of the samplers in Pytorch (probably replaced by now) and some of the underlying pseudo-random algorithms that work correctly in parallel are not exactly easy... running the same PRNG with the same seed on all your cores will produce the same result, which is probably NOT what you want from your API.

But, to be honest, it's not that hard either. I'm surprised their API is 2x slower, Philox is 10 years old now and I don't think there's a licensing fee?



> Generating random numbers is a bit complicated!

I know! I just wrote a whole paper and published a library on this!

But really, perhaps not as much as many from outside might think. The core of a Philox implementation can be around 50 lines of C++ [1]; with all the bells and whistles, maybe around 300-400. That implementation's performance equals cuRAND's, sometimes even surpasses it! (The API is designed to avoid maintaining any RNG states in device memory, something cuRAND forces you to do.)

> running the same PRNG with the same seed on all your cores will produce the same result

You're right. The solution here is to use multiple generator objects, one per thread, ensuring each produces a statistically independent random stream. Some good algorithms (Philox, for example) allow you to use any set of unique values as seeds for your threads (e.g. the thread id).

[1] https://github.com/msu-sparta/OpenRAND/blob/main/include/ope...
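Not from the linked library, but a minimal sketch of the seeding scheme described above, using NumPy's built-in Philox bit generator (a counter-based PRNG). Each "thread" gets its own key (here just the thread id), which yields statistically independent streams instead of every core replaying the same sequence; the thread structure is illustrative only.

    # Minimal sketch: one independent Philox stream per thread, keyed by thread id.
    # Assumes NumPy >= 1.17 (numpy.random.Philox); the "threads" are simulated here.
    import numpy as np

    def make_stream(thread_id: int) -> np.random.Generator:
        # Each unique key gives a statistically independent keystream,
        # so seeding by thread id avoids all cores producing identical numbers.
        return np.random.Generator(np.random.Philox(key=thread_id))

    streams = [make_stream(tid) for tid in range(4)]
    for tid, g in enumerate(streams):
        print(f"thread {tid}: {g.random(3)}")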



Cool! I’ll have a look-see. I’ve got my own experiments in this space.


I wonder if the next generation chips are going to just have a dedicated hardware RNG per-core if that's an issue?


Why bother?

It's not the generation that matters so much, it's the gathering of entropy, which comes from peripherals and isn't possible to generate on-die.

If you don't need cryptographically secure randomness, you still want the entropy for generating the seeds per thread/die/chip.



It absolutely is possible to generate entropy on-die, assuming you actually want entropy and not just a unique value that gets XORed with the seed, so you can still have repeatable seeds.

Pretty much every chip has an RNG which can be as simple as just a single free running oscillator you sample



> Pretty much every chip has an RNG which can be as simple as just a single free running oscillator you sample

Every chip may have some sort of noise to sample, but they are nowhere near good sources of entropy.

Entropy is not a binary thing (you either have it or don't), it's a spectrum and entropy gathered on-die is poor entropy.

Look, I concede that my knowledge on this subject is a bit dated, but the last time I checked there were no good sources of entropy on-die for any chip in wide use. All cryptographically secure RNGs depend on a peripheral to grab noise from the environment to mix into the entropy pool.

A free-running oscillator is a very poor source of entropy.



For non-cryptographic applications, a PRNG like xorshift reseeded by a few bits from an oscillator might be enough.
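For illustration, a rough sketch of that idea: a plain xorshift64 step reseeded from a few external entropy bits. os.urandom stands in for the on-chip oscillator here, and the shift constants are the standard Marsaglia xorshift64 ones; this is deliberately non-cryptographic.

    # Non-cryptographic xorshift64 reseeded from a few "oscillator" bits
    # (os.urandom is a stand-in for the hardware noise source).
    import os

    MASK64 = (1 << 64) - 1

    def xorshift64(state: int) -> int:
        state ^= (state << 13) & MASK64
        state ^= state >> 7
        state ^= (state << 17) & MASK64
        return state & MASK64

    # Reseed; guard against the all-zero state, which xorshift never leaves.
    state = int.from_bytes(os.urandom(8), "little") or 1
    for _ in range(5):
        state = xorshift64(state)
        print(state)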

As I understand it, the reason they don't use on-chip RNGs by themselves isn't due to lack of entropy, it's because people don't trust them not to put a backdoor on the chips or to have some kind of bug.

Intel has https://en.m.wikipedia.org/wiki/RDRAND but almost all chips seem to now have some kind of RNG.



for GPGPU, the better approach is a CBRNG (counter-based RNG) like random123.

https://github.com/DEShawResearch/random123

if you accept the principles of encryption, then the bits of the output of crypt(key, message) should be totally uncorrelated to the output of crypt(key, message+1). and this requires no state other than knowing the key and the position in the sequence.

the direct-port analogy is that you have an array of CuRand generators, generator index G is equivalent to key G, and you have a fixed start offset for the particular simulation.

moreover, you can then define the key in relation to your actual data. the mental shift from what you're talking about is that in this model, a PRNG isn't something that belongs to the executing thread. every element can get its own PRNG and keystream. And if you use a contextually-meaningful value for the element key, then you already "know" the key from your existing data. And this significantly improves determinism of the simulation etc because PRNG output is tied to the simulation state, not which thread it happens to be scheduled on.

(note that the property of cryptographic non-correlation is NOT guaranteed across keystreams - (key, counter) is NOT guaranteed to be uncorrelated to (key+1, counter), because that's not how encryption usually is used. with a decent crypto, it should still be very good, but, it's not guaranteed to be attack-resistant/etc. so notionally if you use a different key index for every element, element N isn't guaranteed to be uncorrelated to element N+1 at the same place in the keystream. If this is really important then maybe you want to pass your array indexes through a key-spreading function etc.)

there are several benefits to doing it like this. first off obviously you get a keystream for each element of interest. but also there is no real state per-thread either - the key can be determined by looking at the element, but generating a new value doesn't change the key/keystream. so there is nothing to store and update, and you can have arbitrary numbers of generators used at any given time. Also, since this computation is purely mathematical/"pure function", it doesn't really consume any memory-bandwidth to speak of, and since computation time is usually not the limiting element in GPGPU simulations this effectively makes RNG usage "free". my experience is that this increases performance vs CuRand, even while using less VRAM, even just directly porting the "1 execution thread = 1 generator" idiom.

Also, by storing "epoch numbers" (each iteration of the sim, etc), or calculating this based on predictions of PRNG consumption ("each iteration uses at most 16 random numbers"), you can fast-forward or rewind the PRNG to arbitrary times, and you can use this to lookahead or lookback on previous events from the keystream, meaning it serves as a massively potent form of compression as well. Why store data in memory and use up your precious VRAM, when you could simply recompute it on-demand from the original part of the original keystream used to generate it in the first place? (assuming proper "object ownership" of events ofc!) And this actually is pretty much free in performance terms, since it's a "pure function" based on the function parameters, and the GPGPU almost certainly has an excess of computation available.
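A hedged sketch of that crypt(key, counter) pattern, using NumPy's Philox as the stateless counter-based generator: the element id plays the role of the key and the keystream position plays the role of the counter, so nothing is stored per element and any past value can be recomputed (rewound) on demand.

    # Counter-based RNG sketch: output depends only on (key, counter),
    # so no per-element state lives in memory and rewind/fast-forward is free.
    import numpy as np

    def stream_value(element_id: int, position: int) -> float:
        bg = np.random.Philox(key=element_id, counter=position)
        return np.random.Generator(bg).random()

    # Same (element, position) always reproduces the same draw...
    assert stream_value(42, 1000) == stream_value(42, 1000)
    # ...and different elements get their own streams without storing anything.
    print(stream_value(7, 0), stream_value(8, 0))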

--

In the extreme case, you should be able to theoretically "walk" huge parts of the keystream and find specific events you need, even if there is no other reference to what happened at that particular time in the past. Like why not just walk through parts of the keystream until you find the event that matches your target criteria? Remember since this is basically pure math, it's generated on-demand by mathing it out, it's pretty much free, and computation is cheap compared to cache/memory or notarizing.

(ie this is a weird form of "inverted-index searching", analogous to Elastic/Solr's transformers and how this allows a large number of individual transformers (which do their own searching/indexing for each query, which will be generally unindexable operations like fulltext etc) to listen to a single IO stream as blocks are broadcast from the disk in big sequential streaming batches. Instead of SSD batch reads you'd be aiming for computation batch reads from a long range within a keystream. (And this is supposition but I think you can also trade back and forth between generator space and index hitrate by pinning certain bits in the output right?)

--

Anyway I don't know how much that maps to your particular use-case but that's the best advice I can give. Procedural generation using a rewindable, element-specific keystream is a very potent form of compression, and very cheap. But, even if all you are doing is just avoiding having to store a bunch of CuRand instances in VRAM... that's still an enormous win even if you directly port your existing application to simply use the globalThreadIdx like it was a CuRand stateful instance being loaded/saved back to VRAM. Like I said, my experience is that because you're changing mutation to computation, this runs faster and also uses less VRAM, it is both smaller and better and probably also statistically better randomness (especially if you choose the "hard" algorithms instead of the "optimized" versions like threefish instead of threefry etc). The bit distribution patterns of cryptographic algorithms is something that a lot of people pay very very close attention to, you are turning a science toy implementation into a gatling gun there simply by modeling your task and the RNG slightly differently.

That is the reason why you shouldn't do the "just download random numbers", as a sibling comment mentions (probably a joke) - that consumes VRAM, or at least system memory (and pcie bandwidth). and you know what's usually way more available as a resource in most GPGPU applications than VRAM or PCIe bandwidth? pure ALU/FPU computation time.

buddy, everyone has random numbers, they come with the fucking xbox. ;)



thinking this through a little bit, you are launching a series of gradient-descent work tasks, right? taskId is your counter value, weightIdx is your key value (RNG stream). That's how I'd port that. Ideally you want to define some maximum PRNG usage for each stage of the program, which allows you to establish fixed offsets from the epoch value for a given event. Divide your keystream in whatever advantageous way, based on (highly-compressible) epoch counters and event offsets from that value.

in practice, assuming a gradient-descent event needs a lot of random numbers, having one keystream for a single GD event might be too much and that's where key-spreading comes in. if you take the "weightIdx W at GradientDescentIdx G" as the key, you can have a whole global keystream-space for that descent stage. And the key-spreading-function lets you go between your composite key and a practical one.

https://en.wikipedia.org/wiki/Key_derivation_function

(again, like threefry, there is notionally no need for this to be cryptographically secure in most cases, as long as it spreads in ways that your CBRNG crypto algorithm can tolerate without bit-correlation. there is no need to do 2 million rounds here either etc. You should actually pick reasonable parameters here for fast performance, but good enough keyspreading for your needs.)
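As a toy illustration of that key-spreading idea (not a real KDF): fold a composite (weightIdx, gradient-descent step) key into one well-mixed 64-bit value using the splitmix64 finalizer, a common non-cryptographic mixer. The function and parameter names here are made up for the example.

    # Toy key-spreading function: mix a composite key into one 64-bit value
    # so that nearby (weight_idx, gd_step) pairs land far apart in key space.
    MASK64 = (1 << 64) - 1

    def splitmix64(x: int) -> int:
        x = (x + 0x9E3779B97F4A7C15) & MASK64
        x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
        return (x ^ (x >> 31)) & MASK64

    def spread_key(weight_idx: int, gd_step: int) -> int:
        return splitmix64((gd_step << 32) ^ weight_idx)

    print(hex(spread_key(weight_idx=3, gd_step=7)))
    print(hex(spread_key(weight_idx=4, gd_step=7)))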

I've been out of this for a long time, I've been told I'm out of date before and GPGPUs might not behave exactly this way anymore, so please just take it in the spirit it's offered, can't guarantee this is right but I've specifically gazed into the abyss of the CuRand situation a decade ago and this was what I managed to come up with. I do feel your pain on the stateful RNG situation, managing state per-execution-thread is awful and destroys simulation reproducibility, and managing a PRNG context for each possible element is often infeasible. What a waste of VRAM and bandwidth and mutation/cache etc.

And I think that cryptographic/pseudo-cryptographic PRNG models are frankly just a much better horse to hook your wagon to than scientific/academic ones, even apart from all the other advantages. Like there's just not any way mersenne twister or w/e is better than threefish, sorry academia

--

edit: Real-world sim programs are usually very low-intensity and have effectively unlimited amounts of compute to spare, they just ride on bandwidth (sort/search or sort/prefix-scan/search algorithms with global scope building blocks often work well).

And tbh that's why tensor is so amazing, it's super effective at math intensity and computational focus, and that's what GPUs do well, augmented by things like sparse models etc. Make your random not-math task into dense or sparse (but optimized) GPGPU math, plus you get a solution (reasonable optimum) to an intractable problem in realtime. The experienced salesman usually finds a reasonable optimum, but we pay him in GEMM/BLAS/Tensor compute time instead of dollars.

Sort/search or sort/prefix-sum/search often works really well in deterministic programs too. Do you ever have a "myGroup[groupIdx].addObj(objIdx)" stage? That's a sort and a prefix-sum operation right there, and both of those ops run super well on GPGPU.



Folks also underestimate how complex these libraries are. There are dozens of projects to make BLAS alternatives which give up after ~3-6 months when they realize that this project will take years to be successful.


How does that work? Why not pick up where the previous team left off instead of everyone starting new ones? Or are they all targeting different backends and hardware?


It's a compiler problem and there is no money in compilers [1]. If someone made an intermediate representation for AI graphs and then wrote a compiler from that intermediate format into whatever backend was the deployment target then they might be able to charge money for support and bug fixes but that would be it. It's not the kind of business anyone wants to be in so there is no good intermediate format and compiler that is platform agnostic.

1: https://tinygrad.org/



JAX is a compiler.


So are TensorFlow and PyTorch. All AI/ML frameworks have to translate high-level tensor programs into executable artifacts for the given hardware and they're all given away for free because there is no way to make money with them. It's all open source and free. So the big tech companies subsidize the compilers because they want hardware to be the moat. It's why the running joke is that I need $80B to build AGI. The software is cheap/free, the hardware costs money.


> generating random numbers

You can't bench implementations of random numbers against each other purely on execution speed.

A better algorithm (better statistical properties) will be slower.



I have the fastest random number generator in the world. And it works in parallel too!

https://i.stack.imgur.com/gFZCK.jpg



Yeah. In this instance, I was talking about the same algorithm (Philox); the difference is purely in implementation.


If you haven't already, please consider filing issues on the rocrand GitHub repo for the problems you encountered. The rocrand library is being actively developed and your feedback would be valuable for guiding improvements.


Appreciate it, will do.


I honestly don't see why it's so hard. On my project we wrote our own GEMM kernels from scratch so llama.cpp didn't need to depend on cuBLAS anymore. Only took a few days and a few hundred lines of code. We had to trade away 5% performance.
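For readers wondering what such a kernel computes: a reference (naive, unblocked) GEMM, C = alpha*A*B + beta*C, sketched in plain Python below. This only illustrates the operation being replaced; it is not the project's actual code, which would be written as optimized C/C++ or GPU kernels.

    # Reference GEMM: C = alpha * (A @ B) + beta * C, triple loop, no blocking.
    import numpy as np

    def gemm(alpha, A, B, beta, C):
        M, K = A.shape
        K2, N = B.shape
        assert K == K2 and C.shape == (M, N)
        for i in range(M):
            for j in range(N):
                acc = 0.0
                for k in range(K):
                    acc += A[i, k] * B[k, j]
                C[i, j] = alpha * acc + beta * C[i, j]
        return C

    A, B = np.random.rand(4, 5), np.random.rand(5, 3)
    C = np.zeros((4, 3))
    assert np.allclose(gemm(1.0, A, B, 0.0, C), A @ B)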


For a given set of kernels, and a limited set of architectures, the problem is relatively easy.

But covering all the important kernels across all the crazy architectures out there, with relatively good performance and numerical accuracy... much harder.



Instead of generating pseudorandom numbers you can just download files of them.

https://archive.random.org/



Or you could just re-use the same number; no one can prove it is not random.

https://xkcd.com/221/



This is a big reason why AMD did this deal with PyTorch...

https://pytorch.org/blog/experience-power-pytorch-2.0/



Just to point out it does, kind of: https://github.com/intel/intel-extension-for-pytorch

I've asked before if they'll merge it back into PyTorch main and include it in the CI, not sure if they've done that yet.

In this case I think the biggest bottleneck is just that they don't have a fast enough card that can compete with having a 3090 or an A100. And Gaudi is stuck on a different software platform which doesn't seem as flexible as an A100.



They could compete on RAM, if the software was there. Just having a low-cost alternative to the 4060 Ti would allow them to break into the student/hobbyist/open-source market.

I tried the A770, but returned it. Half the stuff does not work. They have the CPU-side and GPU development on different branches (GPU seems to be ~6 months behind CPU) and often you have to compile it yourself (if you want torchvision or torchaudio). It's also currently on PyTorch 2.0.1, so somewhat lagging, and does not have most of the performance analysis software available. You also do need to modify your PyTorch code, often more than just replacing cuda with xpu as the device. They are also doing all development internally, then pushing intermittently to public. A lot of this would not be as bad if there was a better idea of the feature timeline, or if they made their CI public. (Trying to build it myself involved an extremely hacky bash script that inevitably failed halfway through.)
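For context, a hedged sketch of the kind of device juggling described above: pick cuda, xpu (Intel, via intel-extension-for-pytorch), or cpu at runtime. It assumes the IPEX package is installed and registers torch.xpu on import; as the comment notes, real code usually needs more changes than swapping the device string.

    # Device selection sketch: prefer CUDA, then Intel's xpu backend, then CPU.
    import torch

    def pick_device() -> torch.device:
        if torch.cuda.is_available():
            return torch.device("cuda")
        try:
            import intel_extension_for_pytorch  # noqa: F401  (registers the xpu backend)
            if hasattr(torch, "xpu") and torch.xpu.is_available():
                return torch.device("xpu")
        except ImportError:
            pass
        return torch.device("cpu")

    device = pick_device()
    model = torch.nn.Linear(16, 4).to(device)
    x = torch.randn(8, 16, device=device)
    print(device, model(x).shape)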



The amount of VRAM is the absolute killer USP for the current large AI model hobbyist segment. Something that had just as much VRAM as a 3090 but at half the speed and half the price would sell like hot cakes.


You are describing the eBay market for used Nvidia Tesla cards. The K80, P40, or M40 are widely available and sell for ~$100 with 24GB of VRAM. The M10 even has 32GB! The problem for AI hobbyists is it won't take long to realize how many APIs use the "optical flow" pathways and so on; on Nvidia they'll only run at acceptable speeds on RTX hardware, assuming they run at all. CUDA versions are pinned to hardware to some extent.


Yep. I have a fleet of P40s that are good at what they do (Whisper ASR primarily) but anything even remotely new... nah. fp16 support is missing so you need P100 cards, and usually that means you are accepting 16GB of VRAM rather than 24GB.

Still some cool hardware.



For us hobbyists used 3090 or new 7900xtx seem to be the way. But even then you still need to build a machine with 3 or 4 of these GPUs to get enough VRAM to play with big models.


For sure - our prod machine has 6x RTX 3090s on some old cirrascale hardware. But P40s are still good for last-gen models. Just nothing new unfortunately.


Out of these three, only P40 is worth the effort to get running vs the capabilities they offer. That's also before considering that other than software hacks or configuration tweaks, those cards require specialised cooling shrouds for adequate cooling in tower-style cases.

If your time or personal energy is worth >$0, these cards work out to much more than $100. And you can't even file the time burnt on getting them to run as any kind of transferable experience.

That's not to say I don't recommend getting them - I have a P4 family card and will get at least one more, but I'm not kidding myself that the use isn't very limited.



The K80 is barely worth it. P40s are definitely the best value along with P100s. I still think there is a good amount of value to be extracted from both of those cards, especially if you are interested in using Whisper ASR, video transcoding, or CUDA models that were relevant before LLMs (a time many people have forgotten, apparently).


This is pretty disappointing to hear. I’m really surprised they can’t even get a clean build script for users, let alone integrate into the regular Pytorch releases.


oh no, I just bought a refurbed A770 16GB for tinkering with GPGPU lol. It was $220, return?


PyTorch includes some Vulkan compat already (though mostly tested on Android, not on desktop/server platforms), and they're sort of planning to work on OpenCL 3.0 compat, which would in turn lead to broad-based hardware support via Mesa's RustiCL driver.

(They don't advertise this as "support" because they have higher standards for what that term means. PyTorch includes a zillion different "operators" and some of them might be unimplemented still. Besides performance is still lacking compared to CUDA, Rocm or Metal on leading hardware - so only useful for toy models.)



OneAPI isn't bad for PyTorch, the performance isn't there yet but you can tell it's an extremely top priority for Intel.


But this is the thing. Speaking as someone who dabbles in this area rather than any kind of expert, it’s baffling to me that people like Intel are making press releases and public statements rather than (I don’t know) putting in the frikkin work to make performance of the one library that people actually use decent.

You have a massive organization full of gazillions of engineers many of whom are really excellent. Before you open your mouth in public and say something is a priority, deploy a lot of them against this and manifest that priority by actually doing the thing that is necessary so people can use your stuff.

It’s really hard to take them seriously when they haven’t (yet) done that.



You know how it works. The same busybodies who are putting out these useless noise releases are the ones who squandered Intel's lead, and now are patting themselves on the back for figuring out that with this they'll again be on top for sure!

There was a post on HN a few months ago about how Nvidia's CEO still has meetings with engineers in the trenches. Contrast that with what we know of Intel, which is not much good, and a lot of bad. (That they are notoriously not-well-paying, because they were riding on their name recognition.)



Intel has to do it by themselves. NVIDIA just lets Meta/OpenAI/Google engineers do it for them. Such a handicapped fight.


It wasn’t always like this. Nvidia did the initial heavy lifting to get cuda off the ground to a point where other people could use it.


That's because CUDA is a clear, well-functioning library and Intel has no equivalent. It makes any "you just have to get Pytorch working" a little less plausible.


It's not just Intel. Open initiatives and consortiums (phase two of the same) are always the losers ganging up, hoping that it will give them the leg up they don't have. If you're older you'll have seen this play out over and over in the industry - the history of Unix vs. Windows NT from the 1990s was full of actions like this, networking is going through it again for the nth time (this time with Ultra Ethernet), and so on. OpenGL was probably the most successful approach, barely worked, and didn't help any of the players who were not on the road to victory already. Unix 95 didn't work, Unix 98 didn't work, etc.


You're just listing the ones that didn't knock it out of the park.

TCP/IP completely displaced IPX to the point that most people don't even remember what it was. Nobody uses WINS anymore, even Microsoft uses DNS. It's rare to find an operating system that doesn't implement the POSIX API.

The past is littered with the corpses of proprietary technologies displaced by open standards. Because customers don't actually want vendor-locked technology. They tolerate it when it's the only viable alternative, but make the open option good and the proprietary one will be on its way out.



Open source generally wins once the state of the art has stopped moving. When a field is still experiencing rapid change closed source solutions generally do better than open source ones. Until we somehow figure out a relatively static set of requirements for running and training LLMs I wouldn’t expect any open source solution to win.


That doesn't really make sense. Pretty much all LLMs are trained in PyTorch, which is open source. LLMs only reached the state they're in now because many academic conferences insisted that paper submissions have open source code attached. So much of the ML/AI ecosystem is open source. Pretty much only CUDA is not open source.


>Pretty much only CUDA is not open source.

What stops Intel from making their own CUDA and plugging it into PyTorch?



CUDA is huge and nvidia spent a ton in a lot of "dead end" use cases optimizing it. There have been experiments with CUDA translation layers with decent performance[1]. There are two things that most projects hit:

1. The CUDA API is huge; I'm sure Intel/AMD will focus on what they need to implement PyTorch and ignore every other use case, ensuring that CUDA always has the leg up in any new frontier.

2. Nvidia actually cares about developer experience. The most prominent example is geohot with tinygrad - where AMD examples didn't even work or had glaring compiler bugs. You will find Nvidia engineers in GitHub issues for CUDA projects. Intel/AMD haven't made that level of investment, and that's important because GPUs tend to be more fickle than CPUs.

[1] https://github.com/vosen/ZLUDA



The same shit as always, patents and copyright.


You didn't have a choice when it came to protocols for the Internet; it's TCP/IP and DNS or you don't get to play. Everyone was running dual stack to support their LAN and Internet and you had no choice with one of them. So, everything went TCP/IP and reduced overall complexity.


> It's rare to find an operating system that doesn't implement the POSIX API.

Except that it has barely improved beyond CLI and daemons, still thinks terminals are the only hardware; everything else that matters isn't part of it, not even more modern networking protocols that aren't exposed in socket configurations.



Its purpose was to create compatibility between Unix vendors so developers could write software compatible with different flavors. The primary market for Unix vendors is servers, which to this day are still about CLI and daemons, and POSIX systems continue to have dominant market share in that market.


Nah, POSIX on servers is only relevant enough for language runtimes and compilers, which then use their own package managers and cloud APIs for everything else.

Alongside a cloud shell, which yeah, we now have a VT100 running on a browser window.

There is a reason why there are USENIX papers on the loss of POSIX relevance.



> Nah, POSIX on servers is only relevant enough for language runtimes and compilers, which then use their own package managers and cloud APIs for everything else.

Those things are a different level of abstraction. The cloud API is making POSIX system calls under the hood, which would allow the implementation of the cloud API to be ported to different POSIX-compatible systems (if anybody cared to).

> There is a reason why there are USENIX papers on the loss of POSIX relevance.

The main reason POSIX is less relevant is that everybody is using Linux and the point of POSIX was to create compatibility between all the different versions of proprietary Unix that have since fallen out of use.



Maybe it's POSIX that is holding new developments back because people think it's good enough. It's not the '60s anymore. I would have expected to have totally new paradigms in 2023 if you asked me 23 years ago. Even the NT kernel seems more modern.

While POSIX was state of the art when it was invented, it shouldn't be today.

Lots of research was thrown in the recycle bin because "Hey, we have POSIX, why reinvent the wheel?", up to the point that nobody wants to do operating systems research today, because they don't want their hard work to get thrown into the same recycle bin.

I think the people who invented POSIX were innovators, and had they lived today, they would come up with a totally new paradigm, more fit to today's needs and knowledge.



Arguably the dominant APIs in the server space are the cloud APIs not POSIX.


Many of which are also open, like OpenStack or K8s, or have third party implementations, like Ceph implementing the Amazon S3 API.


Also all reimplementations of proprietary technology.

The S3 API is a really good example of the “OSS only becomes dominant when development slows down” principle. As a friend of mine who has had to support a lot of local blob storage says, “On the gates of hell are emblazoned — S3 compatible.”



> Also all reimplementations of proprietary technology.

That's generally where open standards come from. You document an existing technology and then get independent implementations.

Unix was proprietary technology. POSIX is an open standard.

Even when the standard comes at the same time as the first implementation, it's usually because the first implementer wrote the standard -- there has to be one implementation before there are two.



I think OP's point was less that the tech didn't work (e.g. OpenGL was fantastically successful) and that it didn't produce a good outcome for the "loser" companies that supported it.


The point of the open standard is to untether your prospective customers from the incumbent. That means nobody is going to monopolize that technology anymore, but that works out fine when it's not the thing you're trying to sell -- AMD and Intel aren't trying to sell software libraries, they're trying to sell GPUs.

And this strategy regularly works out for companies. It's Commoditize Your Complement.

If you're Intel you support Linux and other open source software so you can sell hardware that competes with vertically integrated vendors like DEC. This has gone very well for Intel -- proprietary RISC server architectures are basically dead, and Linux dominates much of the server market in which case they don't have to share their margins with Microsoft. The main survivor is IBM, which is another company that has embraced open standards. It might have also worked out for Sun but they failed to make competitive hardware, which is not optional.

We see this all over the place. Google's most successful "messaging service" is Gmail, using standard SMTP. It's rare to the point of notability for a modern internet service to use all proprietary networking protocols instead of standard HTTP and TCP and DNS, but many of them are extremely successful.

And some others are barely scraping by, but they exist, which they wouldn't if there wasn't a standard they could use instead of a proprietary system they were locked out of.



> The point of the open standard is to untether your prospective customers from the incumbent.

That is the point.

But FWIW the incumbent adopts it and dominates anyway. (Though you now are technically "untethered.")



> But FWIW the incumbent adopts it and dominates anyway. (Though you now are technically "untethered.")

That's assuming the incumbent's advantage isn't rooted in the lock-in.

If ML was suddenly untethered from CUDA, now you're competing on hardware. Intel would still have mediocre GPUs, but AMD's are competitive, and Intel's could be in the near future if they execute competently.

The open standard doesn't automatically give you the win, but it puts you in the ring.

And either of them have the potential to gain an advantage over Nvidia by integrating GPUs with their x86_64 CPUs, e.g. so the CPU and GPU can share memory, avoiding copying over PCIe and giving the CPU direct access to HBM. They could even put a cut down but compatible version of the technology in every commodity PC by default, giving them a huge installed base of hardware that encourages developers to target it.



If the software side no longer mattered, I would expect all three vendors would magically start competing on available RAM. A slower card with double today's RAM would absolutely sell.


> A slower card with double today's RAM would absolutely sell

Absolutely, SQL analytics people (like me) have been itching for a viable GPU for analytics for years now. The price/performance just isn't there yet because there's such a bias towards high compute and low memory.



Windows still uses WINS and NetBIOS when DNS is unavailable.


In a business-class network, even one running Windows, if DNS breaks "everything" is going to break.


Nvidia is probably ten times more scared of this guy https://github.com/ggerganov than Intel or AMD.


Can you expand on this? This is my first time seeing this guy’s work


He’s the main developer of Llama.cpp, which allows you to run a wide range of open-weights models on a wide range of non-NVIDIA processors.


but it's all inference, and most of Nvidia's moat is in training afaik.


There is an example of training https://github.com/ggerganov/llama.cpp/tree/1f0bccb27929e261...

But that's absolutely false about the Nvidia moat being only training. Llama.cpp makes it far more practical to run inference on a variety of devices, including ones with or without Nvidia hardware.



People have really bizarre overdramatic misunderstandings of llama.cpp because they used it a few times to cook their laptop. This one really got me giggling though.


I am integrating llama.cpp into my application. I just went through one of their text generation examples line-by-line and converted it into my own class.

This is a leading-edge software library that provides a huge boost for non-Nvidia hardware in terms of inference capability with quantized models.

If you don't understand that, then you have missed an important development in the space of machine learning.



Jeez. Lol.

At length:

- yes, local inference is good. I can't say this strongly enough: llama.cpp is a fraction of a fraction of local inference.

- avoid talking down to people and histrionics. It's a hot field, you're in it, but like all of us always, you're still learning. When faced with a contradiction, check your premises, then share them.



This right here. Until Intel (and/or AMD) get serious about the software side and actually invest the money CUDA isn't going anywhere. Intel will make noises about various initiatives in that direction and then a quarter or two later they'll make big cuts in those divisions. They need to make a multi-year commitment and do some serious hiring (and they'll need to raise their salaries to market rates to do this) if they want to play in the CUDA space.


Literally, every single announcement (and action) coming out of AMD these days is that they are serious about the software. I don't see any reason at this point to doubt them.

The larger issue is that they need to fix the access to their high end GPUs. You can't rent a MI250... or even a MI300x (yet, I'm working on that myself!). But that said, you can't rent an H100 either... there are none available.



> Literally, every single announcement (and action) coming out of AMD these days is that they are serious about the software. I don't see any reason at this point to doubt them.

They're having to announce it so much because people are rightly sceptical. Talk is cheap, and their software has sucked for years. Have they given concrete proof of their commitment, e.g. they've spent X dollars or hired Y people to work on it (or big names Z and W)?



Agreed. Time will tell.

MI300x and ROCm 6 and their support of projects like Pytorch, are all good steps in the right direction. HuggingFace now supports ROCm.



>Literally, every single announcement (and action) coming out of AMD these days is that they are serious about the software. I don't see any reason at this point to doubt them.

I have a bridge to sell.



It is way, way better in the last year or so. Perfectly reasonable cards for inference if you actually understand the stack and know how to use it. Is nVidia faster? Sure, but at twice the price for 20-30% gains. If that makes sense for you, keep paying the tax.


Not just tax, but fighting with centralization on a single provider and subsequent unavailability.






What _7il4 removed were these two comments:

"AMD is not serious, and neither is Intel for that matter. Their software are piles of proprietary garbage fires. They may say they are serious but literally nothing indicates they are."

"Yes, and ROCm also doesn't work on anything non-AMD. In fact it doesn't even work on all recent AMD gpus. T"



It's not too polite to repost what people removed. Errors on the internet shouldn't haunt people forever.

However, my experience is that the comments about AMD are spot-on, with the exception of the word "proprietary."

Intel hasn't gotten serious yet, and has a good track record in other domains (compilers, numerical libraries, etc.). They've been flailing for a while, but I'm curious if they'll come up with something okay.



As a long time user of Intel's scientific compiler/accelerator stack, I'm not sure if I'd call it a "good track record". Once you get all their libs working, they tend to be fairly well optimized, but they're always a huge hassle to install, configure, and distribute. And when I say a hassle to install, I'm talking about hours to run their installers.

They have a track record of taking open projects, adding proprietary extensions, and then requiring those extensions to work with other tools. This sounds fine, but they are very slow/never update the base libs. From version to version they'll muck with deep dependencies, sometimes they'll even ship different rules on different platforms (I dare you to try to static link openmp in a recent version of oneapi targeting windows). If you ship a few tools (let's say A and B) that use the same dynamic lib, it's a royal pain to make sure they don't conflict with each other if you update software A but not B. Ranting about consumer junk aside, their cluster focused tooling on Linux tends to be quite good, especially compared to amd.



Normally, I wouldn't do that, but this time I felt like both comments were intentionally inflammatory and then the context for my response was lost.

"literally nothing" is also wrong given that they just had a large press announcement on Dec 6th (yt as part of my response below), where they spent 2 hours saying (and showing) they are serious.

The second comment was made and then immediately deleted, in a way that was to send me a message directly. It is what irked me enough to post their comments back.



I deleted it because I was in an extremely bad mood and later realized it was simply wrong of me to post it and vent my unrelated frustration in those comments. I think it's in extremely bad taste to repost what I wrote when I made the clear choice to delete it.


(Posting here because I don't have an email address for you.)

I reassigned your comments in this thread to a random user ID, so it's as if you had used a throwaway account to post them and there's no link to your main account. I also updated the reference to your username in another comment. Does that work for you?



You're right. I apologize. I've emailed dang to ask him to remove this whole thread.


If you didn't say it, the rocks would cry out. Personally, I'll believe AMD is maybe serious about Rocm if they make it a year without breaking their Debian repository.


>It's not too polite to repost what people removed. Errors on the internet shouldn't haunt people forever.

IMO the comment deletion system handles deleting your own comment wrong - it should grey out the comment and strikethrough it, and label it "comment disavowed" or something with the username removed, but it shouldn't actually delete the comment.

Deleting the comment damages the history, and makes the comment chain hard to follow.



I'm not sure history should be sacred. In the world of science, this is important, but otherwise, we didn't used to live in a universe where every embarrassing thing we did in middle school would haunt us in our old age.

I feel bad for the younger generation. Privacy is important, and more so than "the history."



Long ago, under a different account, I emailed dang about removing some comment content that became too identifying only years after the comments in question. He did so and was very gracious about it. dang, you're cool af and you make HN a great place!


You're being very obsequious about having to personally get approval from a moderator to do on your behalf something every other forum lets you do on your own by default.


Plus, dang does a good job.

Not a perfect job -- everyone screws up once in a while -- but this forum works better than most.



How’s SYCL proprietary exactly?


AMD made a presentation on their AI software strategy at Microsoft Ignite two weeks ago. Worth a watch for the slides and live demo

https://youtu.be/7jqZBTduhAQ?t=61



Seriously, why don't they just dedicate a group to creating the best pytorch backend possible? Proving it there will gain researcher traction and prove that their hardware is worth porting the other stuff over to.


You can't "just" do stuff like this. You need the right guy and big corps have no clue who is capable.


I agree. That’s what MS didn’t understand with cloud and Linux at first.

There is more than just a hardware layer to adoption.

CUDA is a platform is an ecosystem is also some sort of attitude. It won’t go away. Companies invested a lot into it.



Guys need to do ye olde embrace, extend maneuver. What wine did, what Javas of the world did. CUDA driver API and CUDA runtime API either translation or implementation layer that offers compatibility and speed. I see no way around it at this point, for now.


They could do that. That would eliminate Nvidia's monopoly. AMD has made gestures in that direction with HIP. But they ultimately don't want to do that - HIP support is half-assed and inconsistent. AMD creates and abandons a variety of APIs. So the conclusion is the other chip makers whine about Nvidia's monopoly but don't want to end it - they just want to maneuver to get their own smaller monopolies of some sort or other.


You're right. At this stage CUDA is de facto what the standard is around. Just like in ISA wars x86 was. Doesn't matter if you have POWER whatever when everything's on the other thing. I get why not though, it would drag the battle onto Nvidia's home turf. At least it would be a battle though.


Wasn't Intel the biggest supporter of OpenCV? I don't know of any open source project heavily supported by Nvidia.


> but nobody gets that it's the software and ecosystem.

Really? Because I’m confident Intel knows exactly what it’s about. Have you looked at their contributions to Linux and open source in general? They employ thousands of software developers.



Which is why intel left nvidia and everyone else in the dust with CUDA, which they developed.

Oh, wait…



What exactly do you think Intel was going to write CUDA for? Their GPU products have been on the market less than 2 years, and they're still trying to get their arms wrapped around drivers.

Them understanding what was coming doesn't mean they have a magic wand to instantly have a fully competitive product. You can't write a CUDA competitor until you've gotten the framework laid. The fact they invested so heavily in their GPUs makes it pretty obvious they weren't caught off-guard, but catching up takes time. Sometimes you can't just throw more bodies at the problem...



That’s my point though - they’re catching up, not leading, which rather implies that they absolutely missed a beat, and don’t therefore understand where the market is going before it goes there.


Is Mojo trying to solve this?

https://www.youtube.com/watch?v=SEwTjZvy8vw



Both AMD and Intel (and Qualcomm to some degree) just don't seem to get how you beat NVIDIA.

If they want to grab a piece of NVIDIA's pie, they do NOT need to build something better than an H100 right away. There are a million consumers who are happy with a 4090 or 4080 or even 3080 and would love for something that's equally capable at half price, and moreover, actually available for purchase, from Amazon/NewEgg/wherever, and without a "call for pricing" button. AMD and Intel are much better at making their chips available for purchase than NVIDIA. But that's not enough.

What they DO need to do to take a piece of NVIDIA's pie is to build "intelcc", "amdcc", and "qualcommcc" that accept the EXACT SAME code that people feed to "nvcc" so that it compiles as-is, with not a single function prototype being different, no questions asked, and works on the target hardware. It needs to just be a drop-in replacement for CUDA.

When that is done, recompiling PyTorch and everything else to use other chips will be trivial.



That's not going to work because each GPU has a different internal architecture and aligning the way data is fed with how stream processors operate is different for each architecture (stuff like how memory buffers are organized/aligned/paged etc.). AMD is very incompatible to Nvidia at the lowest level so things/approaches that are fast on Nvidia can be 10x slower on AMD and vice versa.


> things/approaches that are fast on Nvidia can be 10x slower on AMD and vice versa.

That's fine but the job of software is to abstract that out. The code should at least compile, even if it is 10x less efficient and if multiple awkward sets of instructions need to be used instead of one.

If Pytorch can be recompiled for AMD overnight with zero effort (only `ln -s amdcc nvcc`, `ln -s /usr/local/cuda /usr/local/amda`) they will gain some footing against Nvidia.



>Both AMD and Intel (and Qualcomm to some degree) just don't seem to get how you beat NVIDIA.

That's simple, just do what Nvidia did. Be better than the competition.



That's what HIP is though: a recompiler for CUDA code. It's not good enough for translating PTX assembly to AMDGPU yet.


They still don't get that those who are serious about hardware, must make their own software.


huh, I didn't even know openvino supported anything but CPUs! TIL


> the software incompatibilities mean you'll spend a ton of time getting it to work compared to an Nvidia GPU.

Leverage LLMs to port the SW



Fun fact: More than half of all engineers at NVIDIA are software engineers. Jensen has deliberately and strategically built a powerful software stack on top of his GPUs, and he's spent decades doing it.

Until Intel finds a CEO who is as technical and strategic, as opposed to the bean-counters, I doubt that they will manage to organize a successful counterattack on CUDA.



>finds a CEO who is as technical and strategic, as opposed to the bean-counters

Did you just call Gelsinger a "non-technical"? wow, how out of touch with reality

>Gelsinger first joined Intel at 18 years old in 1979 just after earning an associate degree from Lincoln Tech.[9] He spent much of his career with the company in Oregon,[12] where he maintains a home.[13] In 1987, he co-authored his first book about programming the 80386 microprocessor.[14][1] Gelsinger was the lead architect of the 4th generation 80486 processor[1] introduced in 1989.[9] At age 32, he was named the youngest vice president in Intel's history.[7] Mentored by Intel CEO Andrew Grove, Gelsinger became the company's CTO in 2001, leading key technology developments, including Wi-Fi, USB, Intel Core and Intel Xeon processors, and 14 chip projects.[2][15] He launched the Intel Developer Forum conference as a counterpart to Microsoft's WinHEC.



Gelsinger is a typical hardware engineer out of his depth competing against what is effectively a software play. This is a recurring theme in the industry, where you have successful hardware companies with strong hardware-focused leadership fail over time because they don't get software.

I used to work at Nokia Research. The problem was on full display during the period when Apple made its entry into mobile. We had plenty of great software people throughout the company. But the leadership had grown up in a world where Nokia was basically making and selling hardware - radio engineers and hardware engineers, basically. They did not get software all that well. And of course what Apple did was execute really well on software for what was initially a nice but not particularly impressive bit of hardware. It's the software that made the difference. The hardware excellence came later. And the software only got better over time. Nokia never recovered from that. And they tried really hard to fix the software. It failed. They couldn't do it. Symbian was a train wreck and flopped hard in the market.

Intel is facing the same issue here. Their hardware is only useful if there's great software to do something with it. The whole point of hardware is running software. And Intel is not in the software business so they need others to do that for them. Similar to Nokia, Apple came along and showed the world that you don't need Intel hardware to deliver a great software experience. Now their competitor NVidia is basically stealing their thunder in the AI and 3D graphics market. Intel wants in but just like they failed to get into the mobile market (they tried, with Nokia even), their efforts to enter this market are also crippled by their software ineptness.

This is a lesson that many IOT companies struggle with as well. Great hardware but they typically struggle with their software ecosystems and unlocking the value of the hardware. So much so that one Finnish software company in this space (Wirepas), has been running an absolute genius marketing campaign with the beautiful slogan "Most IOT is shit". Check out their website. Some very nice Finnish humor on display there. Their blunt message is that most hardware focused IOT companies are hopelessly clumsy on the software front and they of course provide a solution.



apple did initially want to build on what they saw as the best fab - Intel. unlock the power Intel would bring for their phone. But they had some design objectives focused on user experience (power/cost) and Intel didn't see the value. Intel then scrambled to try and build what apple had asked for but without the software.

Nokia kept doing crazy hardware to show off on the hardware side. But these old companies can't stop nickel-and-diming - so you'd get stuff with crazy DRM etc. And the software wasn't there or invested in fully.



My brother in Christ he was literally the CEO of VMware


Intel spent more than a decade under Otellini, Krzanich and Swan. Bean counters. Gelsinger was appointed out of desperation, but the problem runs much deeper. I doubt that culture is gone. It has already cost Intel many opportunities.


Otellini wasn't an engineer but still he made the historical x86-mac deal, pushed like crazy for x86-android and owned the top500 with xeon phi.

The downfall began with Krzanich who had no goal besides raising the stock price and no strategy other than cutting long-term projects and other costs that got in the way. What a shame.



>owned the top500 with xeon phi.

This is interesting - because what I heard (within Intel at the time, circa 2015) was Xeon Phi was a disaster. The programming model was bad and they couldn't sell them.



Otellini also made the historical decision to pass on the iPhone chip...


Krzanich started out as an engineer


"Optimize for Wall Street" is a disease to which even the seemingly-best minds can succumb.


>Intel spent more than a decade under Otellini, Krzanich and Swan. Bean counters.

It still doesn't change mistake in your original message.

>Gelsinger was appointed out of desperation, but the problem runs much deeper.

How much "much deeper"? VPs? middle level managers? engineers?

The example goes from the top, so if he can change the culture at the top, it will eventually get deeper.



> as technical and strategic

It seems that this is an AND not an OR.



>It still doesn't change mistake in your original message.

Precisely. The problem I found on HN is that it is hard to have any meaningful discussion on anything hardware, especially when it is mixed with business or economic models.

I was greedy and was hoping Intel could fall closer to $20 in early 2023 before I load up more of their stock. Otherwise I would put more money where my mouth is.



>Gelsinger was appointed out of desperation,

You will need to observe Intel more closely. It was not out of desperation. And Gelsinger is more technical and strategic than you imply.



If you read carefully you’ll see the comment is “as technical and strategic.”

That’s very different from “non-technical.”

He was clearly capable of leading a team to develop a new processor but that’s not the issue here.



1989 is 34 years ago.


Gelsinger is saying "the entire industry" and that seems likely to be a simple fact. Every single player, other than Nvidia, has an incentive to minimise the importance of CUDA as a proprietary technology. That is a lot more programmers than Nvidia can afford to employ.

Even if Intel falls over its own feet, the incentives to bring in more chip manufacturers are huge. It'll happen, the only question is whether the timeframe is months, years or a decade. My guess is shorter timeframes, this seems to mostly be matrix multiplication and there is suddenly a lot of money and attention on the matter. And AMD's APU play [0] is starting to reach the high end of the market with the MI300A which is an interesting development.

[0] EDIT: For anyone not following that story, they've been unifying system and GPU memory; so if I've understood this correctly there isn't any need to "copy data to the GPU" any more on those chips. Basically the CPU will now have big extensions for doing matrix math. Seems likely to catch on. Historically they've been adding that tech to low-end CPU so it isn't useful for AI work, now they're adding it to the big ones.



> That is a lot more programmers than Nvidia can afford to employ.

How many programmers one can employ is determined by profits, and Nvidia has monopoly profits thanks to CUDA, while "the entire industry" can at best hope to create some commoditized alternative to CUDA. Companies with real market power can beat entire industries of commodity manufacturers; Apple is the prime example.



AMD and Intel together have more revenue than Nvidia, even without considering any other player in the industry or any community contributions they get from being open source.


it's not about revenue, it's about investment. It is closely related to future profit. Not so much to current revenue ...


Profit is revenue minus costs. Investment is costs. If you're reinvesting everything you take in your current-year profit would be zero because you're making large investments in the future.


How many programmers do you really need though to catch up to what CUDA has already? The path has been laid. There's no need for experimentation. Just copy what NVIDIA did. No?




> Gelsinger is saying "the entire industry" and that seems likely to be a simple fact. Every single player, other than Nvidia, has an incentive to minimise the importance of CUDA as a proprietary technology. That is a lot more programmers than Nvidia can afford to employ.

I mean, this statement is technically true, but it's true for any proprietary technology. If things worked like this, we wouldn't have any industries where proprietary tech/formats are prevalent.



I suppose, but it is a practical matter here. CUDA is a library for memory management and matrix math targeted at researchers, hyper-productive devs and enthusiasts. It looks like it'll be highly capital intensive, requiring hardware that runs in some of the biggest, nastiest, OSS-friendliest data-centres in the world who all design their own silicon. The generations of AMD GPU that matter - the ones out and on people's machines - aren't supported for high quality GPGPU compute right now. Alright, that means CUDA is a massive edge right now. But that doesn't look like a defensible moat.

I was interested in being part of this AI thing; what stopped me wasn't lack of CUDA, it was that my AMD card reliably crashes under load doing compute workloads. Then when I saw George Hotz having a go, the problem wasn't lack of CUDA; it was that his AMD card crashed under compute workloads (technically I think it was running the demo suite). That is only anecdata, but 2 for 2 is almost a significant sample, given how few players and how little big money there has historically been in AI.

Lacking CUDA specifically might be a problem here, but I've never seen AMD fall down at that point. I've only ever seen them fall down at basic driver bugs. And I don't see how CUDA would matter all that much because I can implement most of what I need math-wise in code. If I see a specific list of common complaints maybe I'll change my mind, but I'm just not detecting where the huge complexity is. I can see CUDA maintaining an edge for years because it is convenient, but I really don't see how it can stay essential. The card can already do the workload in theory and in practice, assuming the code path doesn't bug out. I really don't need CUDA; all I want is for rocBLAS not to crash. I suspect that'd go a long way in practice.



AMD could use testers (cough, clients, I mean) like you. Jokes aside, please report bugs to the ROCm GitHub.


Unless their hardware is on the official support list, I wouldn't be too hopeful for a quick resolution. Still, it's even less likely to get fixed if it's not reported.

If nothing else, I would be curious to know more about the issue. Personally, I want to know how well ROCm functions on every AMD GPU.



I'm not an expert here, but with:

> That is a lot more programmers than Nvidia can afford to employ

How do you account for the increased complexity those developers have to deal with in an environment where there are multiple companies with conflicting incentives working on the standard?

My gut reaction is to worry if this is one of those problems like "9 people working together can't have a baby in one month".



I actually find that a really interesting question with a really interesting answer - the scaling properties of large groups of people are unintuitive. In this case, my guess would be high market complexity, with the entire userbase ignoring that complexity in favour of 1-2 vendors with simple and cheap options. So the market overall will just settle on de facto standards.

Of course, based on what we see right now that standard would be Nvidia's CUDA; but while CUDA is impressive I don't think running neural nets requires that level of complexity. We're not talking about GUIs which are one of the stickiest and most complicated blocks of software we know about, or complex platform-specific operations. I'd expect that the need for specialist libraries to do inference to go away in time and CUDA to be mainly useful for researching GPU applications to new problems. Training will likely just come down to raw ops/second in hardware rather than software.

It isn't like this stuff can't already run on other cards. AMD cards can run stable diffusion or LLMs. The issue is just that AMD drivers tend to crash. That is simultaneously a huge and a tiny problem - if they focus on it it won't be around for long. CUDA is an advantage, but not a moat.



I find the hero-worship of Pat Gelsinger — examples in sibling comments — really weird. My impression of him at VMware was very beancounter-y, not especially technical, and too caught up in personal vendettas and status games to make good technical leadership decisions.

Granted, I may have just gotten off on the wrong foot. The first thing he said to Pivotal during the acquisition announcement was, “You were our cousins, but you’re now more like children.” So the whole tone was just weird.



This was true for Intel for at least 10 years, and I'm pretty sure for much longer than that. It was probably true for Nvidia for about as long as they've existed.

Hardware without software is just expensive sand. Every semiconductor company knows this. Intel was the one to perfect the whole package with x86 in the first place…

In the GPU compute space CUDA is x86. It’s ubiquitous, de facto standard and will be disrupted. Question is if it takes a year or a decade.



The stereotype is hardware engineers all think software is easy. So while semiconductor firms know software is important, they're often optimistic about the ease of creating it.

Cuda is enormous, very complicated and fits together relatively well. All the semiconductor startups have a business plan about being transparent drop in replacements for cuda systems, built by some tens of software engineers in a year or two. That only really makes sense if you've totally misjudged the problem difficulty.



You don't know Pat Gelsinger do you?


We're in 2023, he's been in the CEO seat for 2 years already. He's had plenty of time to show the world his intent and where they are going. All that has happened is they launched a very mid GPU and have yielded more ground to AMD. Meanwhile AMD continue to eat away at Intel's talent pool, market share, and still managed to push into the AI space.

He should be sweating.



> for 2 years already ... All that has happened is they launched a very mid GPU

Hardware development cycles are closer to 5 years. So while he might have gotten some adjustments done on the designs so far, if he turned the ship around it'll take a while longer to materialize.

The software side is more agile, so any tea-leaf reading to discern what Gelsinger's strategy looks like is best done over there.



Not only that, for a “first” (not sure how much of Larrabee was salvaged) discrete GPU attempt, Intel Arc is fantastic. Look at the first GPUs Nvidia and ATI launched.

It’s only when you put them up against Nvidia and AMD’s comes-with-decades-of-experience offerings that Intel’s GPUs seem less than stellar.



Yeah, Arc is incredible in how much it accomplished as a first attempt, and as long as they keep at it without chopping it up into a bunch of artificially limited market segments, it'll probably be incredibly competitive in a few generations.


His intent is “5 nodes in 4 years” - [0]. The goal is to reclaim the node leadership from TSMC by 2025.

They announced the first chips based on Intel 4 today, which is more or less equivalent to TSMC’s 5nm.

They may fail, but the goal is clear and ambitious.

[0] - https://www.xda-developers.com/intel-roadmap-2025-explainer/



>All that has happened is they launched a very mid GPU and have yielded more ground to AMD.

And you don't blame that on Raja Koduri but on Gelsinger?



or Lisa Su


Ultimately it comes down to Intel and AMD being penny wise, pound foolish. They're unwilling to hire a lot of quality software engineers, because they're expensive, so they just continue to fall behind NVidia.


Nvidia GPU moat has always been their software. Game ready drivers are a big deal for each AAA game launch and they always help to push their fps numbers on reviewers charts. I feel like for 20 years I've been reading people online complain about ATI/AMD drivers and how they want to go back to an Nvidia card the next chance they get.


This hasn't been true for more than a decade at this point, and in fact AMD tends to have the better driver support, especially long term.


Well, you said a decade so I'll take an easy one from 4 years ago.

"Yes, you know all those r/amd and r/nvidia posts about people ditching their RX 5700 XT and switching over to an Nvidia RTX 2070 Super… AMD is reading them and has obviously been jamming them down the throats of its software engineers until they could release a driver patch which addresses the issues Navi users have been experiencing."

https://www.pcgamesn.com/amd/radeon-rx-5700-drivers-black-sc...



People used the same argument when saying AMD would never beat Intel in CPUs. Intel has a lot of software engineers. Also these days AMD has a good number of software folks, thanks to the Xilinx acquisition and the organic investments in this area.


Intel has over 15,000 software engineers, per their website. I couldn't find a number for NVIDIA, but it looks like they have a bit above 26k total employees.

So, it's very likely Intel has more software engineers than NVIDIA. Intel has far more products than NVIDIA, though, so NVIDIA almost certainly has more software engineers working on GPUs.



I would say that over a certain number of devs the output decreases.

I think it was 500 people working on Windows XP? A hundred for Windows 95. Etc.



Intel's/NVidia's/AMD's GPU drivers alone are probably more LOC at this point than the whole of XP...


Are you saying they're able to develop without a PM for every 5 engineers? Insane.


Yes and, for different reasons:

> ...as opposed to the bean-counters

Intel whining about the CHIPS Act, while doing huge layoffs, while continuing massive stock buybacks, while continuing to pay dividends...

I'm not impressed.

I'd expect a corporation facing an existential crisis would make some tough decisions and accept the consequences. I know that Wall St (investors) is the tail wagging the dog. But I expect a leader to hold off the vultures (ghouls) long enough to do the necessary pivots.

At least Intel's doom spiral isn't as grim as the conundrum facing the automobile manufacturers. They need to prop up their declining business units, pay off their loan sharks (investors), somehow resolve their death embrace with the unions, transform culture (bust up their organizational silos), transform relationships with their suppliers, AND somehow get capital (cheap enough) to fund their pivots.

I'm sure Intel has the technical chops to do great things. But financially their half-measures strategy doesn't instill confidence.

Source: Just a noob reading the headlines. Am probably totally off base.



Yeah, they should get someone who actually architected a successful chip, like the 486. Maybe a boomerang who used to be CTO. Get rid of this beancounter and hire someone like that!!

/S



If they create a better tool chain, ecosystem, and programming experience than CUDA and compatible with all computational platforms at their peak performance - awesome! Everyone wins!

Until then, it's a bit of a funny claim, especially considering what a failure OpenCL was (in programmer experience and fading support), or trying to do GPGPU with compute shaders in DX/GL/Vulkan. Are they really "motivated"? They had so many years and the results are miserable... And I don't think they invested even a fraction of what got invested into CUDA. Put your money where your mouth is.



I wish AMD or Intel would just ship a giant honking CPU with 1000s of cores that doesn't need any special purpose programming languages to utilize. Screw co-processors. Screw trying to make yet another fucked up special purpose language -- whether that's C/C++-with-quirks or a half-assed Python clone or whatever. Nuts to that. Just ship more cores and let me use real threads in regular programming languages.


It doesn't work if you're going against GPUs. All the nice goodies we are accustomed to on large desktop x86 machines with gigantic caches and huge branch predictor area and OOO execution engines -- the features that yield the performance profile we expect -- simply do not translate or scale up to thousands of cores per die. To scale that up, you need to redesign the microarchitecture in a fundamental way to allow more compute-per-mm^2 of area, but at that point none of the original software will work in any meaningful capacity because the pipeline is so radically different, it might as well be a different architecture entirely. That means you might as well just write an entirely different software stack, too, and if you're rewriting the software, well, a different ISA is actually the easy part. And no, shoving sockets on the mobo does not change this; it doesn't matter if it's a single die or multi socket. The same dynamics apply.


While the first >1000 core x86 processor is probably a little ways out, Intel is releasing a 288-core x86 processor in the first half of 2024 (Sierra Forest). I assume AMD will have something similarly high core in 2024-25 as well.


To be clear, you can probably make a 1000 core x86 machine, and those 1000 cores can probably even be pretty powerful. I don't doubt that. I think Azure even has crazy 8-socket multi-sled systems doing hundreds of cores, today. But this thread is about CUDA. Sierra Forest will get absolutely obliterated by a single A100 in basically any workload where you could reasonably choose between the two as options. I'm not saying they can't exist. Just that they will be (very) bad in this specific competition. I made an edit to my comment to reflect that.

But what you mention is important, and also a reason for the ultimate demise of e.g. Xeon Phi. Intel surely realized they could just scale their existing Xeon designs up-and-out further than expected. Like from a product/SKU standpoint, what is the point of having a 300 core Phi where every core is slow as shit, when you have a 100 core 4-socket Xeon design on the horizon, using an existing battle-tested design that you ship billions of dollars worth every year? Especially when the 300 core Xeon fails completely against the competition. By the time Phi died, they were already doing 100-cores-per-socket systems. They essentially realized any market they could have had would be served better by the existing Xeon line and by playing to their existing strengths.



> Intel is releasing a 288-core x86

This made me wonder a couple of things-

What kind of workloads and problems is that best suited for? It’s a lot of cores for a CPU, but for pure math/compute, like with AI training and inference and with graphics, 288 cores is like ~1.5% of the number of threads of a modern GPU, right? Doesn’t it take particular kinds of problems to make a 288 core CPU attractive?

I also wondered if the ratio of the highest core count CPU to GPU has been relatively flat for a while? Which way is it trending- which of CPUs or GPUs are getting more cores faster?



You could do sparse deep learning with much, much larger models with these CPUs. As paradoxical as it might sound, sparse deep learning gets more compute bound as you add more cores.


I'd be curious to learn more about how it's compute bound and what specifically is compute bound. On modern H100s you need ~600 fp8 operations per byte loaded from memory in order to be compute bound, and that's with full 128-byte loads each time. Even integer/fp32 vector operations need quite a few operations to be compute bound (~20 for vector fp32).


I think you misunderstood what I mean. Sparse ML is inherently memory latency bound since you have a completely unpredictable access pattern prone to cache misses. The amount of compute you perform is a tiny blip compared to the hash map operations you perform. What I mean is that as you add more cores, there are sharing effects because multiple cores are accessing the same memory location at the same time. The compute bound sections of your code become a much greater percentage of the overall runtime as you add cores, which is surprising, since adding more compute is the easy part. Pay attention to my words "_more_ compute bound".

Here is a relevant article: https://www.kdnuggets.com/2020/03/deep-learning-breakthrough...



288 cores or threads? Because to my knowledge AMD already has a 128-core, 256-thread processor with the Epyc 9754.


Apple might be sort-of trying to build the honking CPU, but it still requires different language extensions and a mix of different programming models.

And what you suggest could be done, but it would likely flop commercially if you made it today, which is why they aren’t doing it. SIMD machines are faster on homogenous workloads, by a lot. It would be a bummer to develop a CPU with thousands of cores that is still tens or hundreds of times slower than a comparably priced GPU.

SIMD isn’t going away anytime soon, or maybe ever. When the workload is embarrassingly parallel, it’s cheaper and more efficient to use SIMD over general purpose cores. Specialized chiplets and co-processors are on the rise too, co-inciding with the wane of Moore’s law; specialization is often the lowest hanging fruit for improving efficiency now.

There’s going to be plenty of demand for general programmers but maybe worth keeping in mind the kinds of opportunities that are opening up for people who can learn and develop special purpose hardware and software.



Well, that is what a GPU is. CUDA / OpenMP etc. are attempts at conveniently programming a mixed CPU/GPU system.

If you don't want that, program the GPU directly in assembly or C++ or whatever. A kernel is a thread - program counter, register file, independent execution from the other threads.

There isn't a Linux kernel equivalent sitting between you and the hardware so it's very like bare metal x64 programming, but you could put a kernel abstraction on it if you wanted.

Core isn't very well defined, but if we go with "number of independent program counters live at the same time" it's a few thousand.

x64 cores are vaguely equivalent to GCN compute units - 100 or so of either in a 300W envelope. An x64 core has two threads and a load of branch prediction / speculation hardware; a GCN compute unit has 80 threads and swaps between them each cycle. Same sort of idea, different allocation of silicon.



The closest Intel got to this was Xeon Phi / Knights Landing https://en.wikipedia.org/wiki/Xeon_Phi with 60+ cores per chip, each able to run 4 threads simultaneously - each of which could run arbitrary x86 code. Discontinued due to low demand in 2020 though.

In practice, people weren’t afraid to roll up their sleeves and write CUDA code. If you wanted good performance you had to think about data parallelism anyways, and at that point you’re not benefiting from x86 backwards compatibility. It was a fascinating dream while it lasted though.



It was called Larrabee and XeonPhi, they botched it, and the only thing left from that effort is AVX.


I used to play with these toys 7-8 years ago. We tried everything, and it was bad at it all.

Traditional compute? The cores were too weak.

Number crunching? Okay-ish but gpus were better.

Useless stuff.



Hence why "they botched it".


They seemed exceedingly hard to use well but interestingly capable & full of promise. And they were made in a much more primitive software age.

I'd love to hear about what didn't work. OpenMP support seemed OK, maybe, but OpenMP is just a platform; figuring out software architectures that are mechanistically sympathetic to the system is hard. It would be so interesting to see what Xeon Phi might have been if we had Calcite or Velox or OpenXLA or other execution engines/optimizers that can orchestrate usage. The possibility of something like Phi seems so much higher now.

There's such a consensus around Phi tanking, and yes, some people came and tried and failed. But most of those lessons - of why it wasn't working (or was!) - never survived the era, were never turned into stories & research that illuminate what Phi really was. My feeling is that most people were staying the course on GPU stuff, and that there weren't that many people trying Phi. I'd like more than the hearsay heaped at Phi's feet to judge by.



Well... Back then in my shop they would just assign programmers to things, together with a couple of mathematicians.

Math guys came up with a list of algorithms to try for a search engine backend.

What we needed was matrix multiplication and maybe some decision tree walking (that was some time ago, trees were still big back then, NNs were seen as too compute-intensive for no clear benefits). So we thought that it might be cool to have a tool that would support both. Phi sounded just right for both.

And things written to AVX-512 did work. The software was surprisingly easy to port.

But then comes the usual SIMD/CPU trouble: every SIMD generation wants a little software rewrite. So for both Phi generations we had to update our code. For things not compatible with the SIMD approach (think tree-walking) it is just a weak x86.
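
For anyone who hasn't lived through it, here is a minimal sketch of that "little rewrite": the same elementwise add written against AVX2 and then against AVX-512 intrinsics (remainder handling omitted; the names and vector widths are the only things that change, but they change everywhere).

    #include <immintrin.h>

    // AVX2: 8 floats per iteration
    void add_avx2(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i + 8 <= n; i += 8) {
            _mm256_storeu_ps(out + i,
                _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
        }
    }

    // AVX-512: same idea, 16 floats per iteration, every intrinsic renamed
    void add_avx512(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            _mm512_storeu_ps(out + i,
                _mm512_add_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i)));
        }
    }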

In theory Phis were universal; in practice what we got was: okay number crunching, bad generic compute.

GPUs were somewhat similar: the software stack was unstable, and CUDA hadn't yet materialized as a standard. But every generation introduced a massive increase in available compute. And boy did NVIDIA move fast...

So the GPU situation was: amazing number crunching, no generic compute.

And then there were a few ML breakthrough results which rendered everything that did not look like a matrix multiplication obsolete.

PS I wouldn't take this story too seriously, details may vary.



Some observations:

- Very bad performance at existing x86 workloads, so a major selling point was basically not there in practice, because extracting any meaningful performance required a software rewrite anyway. This was an important adoption criterion; if they outright said "All your existing workloads are compatible, but will perform like complete dogshit", why would anyone bother? Compatibility was a big selling point that ended up meaning little in practice, unfortunately.

- Not actually what x86 users wanted. This was at the height of "Intel stagnation" and while I think they were experimenting with lots of stuff, well, in this case, they were serving a market that didn't really want what they had (or at least wasn't convinced they wanted it).

- GPU creators weren't sitting idle and twiddling their thumbs. Nvidia was continuously improving performance and programmability of their GPUs across all segments (gaming, HPC, datacenters, scientific workloads) while this was all happening. They improved their compilers, programming models, and microarchitecture. They did not sit by on any of these fronts.

Ironically the main living legacy of Phi is AVX-512, which people did and still do want. But that kind of gives it all away, doesn't it? People didn't want a new massively multicore microarchitecture. They wanted new vector instructions that were flexible and easier to program than what they had -- and AVX-512 is really much better. They wanted the things they were already doing to get better, not things that were like, effectively a different market.

Anyway, the most important point is probably the last one, honestly. Like we could talk a lot about compiler optimizations or autovectorization. But really, the market that Phi was trying to occupy just wasn't actually that big, and in the end, GPUs got better at things they were bad at, quicker than Phi got better at things it was bad at. It's not dissimilar to Optane. Technically interesting, and I mourn its death, but the competition simply improved faster than the adoption rate of the new thing, and so flash is what we have.

Once you factor in that you have to rewrite software to get meaningful performance uplift, the rest sort of falls into place. Keep in mind that if you have a $10,000 chip and you can only extract 50% of the performance, you have more or less just set $5,000 on fire for nothing in return. You might as well go all the way and use a GPU because at least then you're getting more ops/mm^2 of silicon.



I don't disagree anywhere, but I don't think any of these statements actually condemn Xeon Phi outright. It didn't work at the time, and doing it with so little software support to tile out workloads well was a big & possibly bad gambit, but I'm very unsure we can condemn the architecture. There seem to be so few folks who made good attempts and succeeded or failed & wrote about it.

I tend to think there was tons of untapped potential still on the table. And that a failure to adopt potential isn't purely Intel alone's fault. The story we are commenting on is about the rest-of-industry trying to figure out enduring joint strategies, and much of this is chipmaker provided, but it is also informed and helped by plenty of consumers also pouring energy in to figure out what's working and not, trying to push the bounds.

Agreed that anyone going in thinking Xeon Phi would be viable for running a boring everyday x86 workload was going to be sad. To me the promise seemed clear that existing toolchains & code would work, but it was always clear to me there were a bunch of little punycores & massive SIMD units and that doing anything not SIMD intensive wasn't going to go well at all. But what's the current trend? Intel and AMD are both actively building not punycores but smaller cores, with Sierra Forest and Bergamo. E-cores are the grown up Atom we saw here.

Yes the GPGPU folks were winning. They had a huge head start, were the default option. And Intel was having trouble delivering nodes. So yes, Xeon Phi was getting trounced for real reasons. But they weren't architectural issues! It just means the Xeon Phi premise was becoming increasingly handicapped.

As I said, I broadly agree everywhere. Your core point about giving the market more of what it already does is well taken - a river of wisdom we see again and again. But I do think conservative thinking, iterating along, is dangerous thinking that obstructs us from seeing real value & possibility before us. Maybe Intel could have made a better ML chip than the GPGPU market has gotten for years, had things gone differently; I think the industry could perhaps have been glad it had veered onto a new course, but the barriers to that happening & the slowdown in Intel delivery & the difficulty bootstrapping new software were all horrible encumbrances which were rightly more than was worth bearing together.



I don't think anybody seriously considered Phis for generic compute or something.

Most experimenters saw it as a way to have something GPU-like in terms of raw power but without the limitations characteristic of SIMT. Like, slightly different code paths for threads doing number crunching or something.

But it turns out that it's easier to force everything into a matrix. Or a very big matrix. Or a very-very-very big matrix.

And then see what sticks.



Why are we not also talking about memory bandwidth? Personal opinion: this is the key. The latest Phi had about 100 GB/s in 2017. The contemporary Nvidia GTX 1080: 320 GB/s.

When CPUs actually come with bandwidth and a decent vector unit, such as the A64FX, lo and behold, they lead the Top500 supercomputer list, also beating out GPUs of the day.

Why have we not been getting bandwidth in CPUs? Is it because SPECint benchmarks do not use much? Or because there is too much branch-heavy code, so we think hundreds of cores are helpful?

Existing machines are ridiculously imbalanced, hundreds of times more compute vs bandwidth than the 1:1 still seen in the 90s. Hence matmul as a way of using/wasting the extra compute.

The AMD MI300a looks like a very interesting development: >5 TB/s shared by 24 cores plus GPUs.



AVX might be going in the right direction, even if AVX-512 was a stretch too far. I was impressed by llama.cpp's performance boost when AVX1 support was added.

There's no intrinsic reason why multiplying matrices requires massive parallelism; in principle it could be done on a few cores plus good management of ALUs/memory bandwidth/caches.



What's wrong with compute shaders ?


I shipped a dozen products with them (mostly video games), so there's nothing "wrong" that would make them unusable. But programming them and setting up the graphics pipe (and all the passes, structured buffers, compiling, binding, weird errors, and synchronization) is a huge PITA as compared to the convenience of CUDA. Compilers are way less mature, especially on some platforms cough. Some GPU capabilities are not exposed. No real composability or libraries. No proper debugging.


These days, some game engines have done pretty well at making compute shaders easy to use (such as Bevy [1] -- disclaimer, I contribute to that engine). But telling the scientific/financial/etc. community that they need to run their code inside a game engine to get a decent experience is a hard sell. It's not a great situation compared to how easy it is on NVIDIA's stack.

[1]: https://github.com/bevyengine/bevy/blob/main/examples/shader...



I have recently published an AI-related open-source project entirely based on compute shaders https://github.com/Const-me/Cgml and I’m super happy with the workflow. Possible to implement very complicated things without compiling a single line of C++, the software is mostly in C#.

> setting up the graphics pipe

I’ve picked D3D11, as opposed to D3D12 or Vulkan. The 11 is significantly higher level, and much easier to use.

> compiling, binding

The compiler is design-time, I ship them compiled, and integrated into the IDE. I solved the bindings with a simple code generation tool, which parses HLSL and generates C#.

> No proper debugging

I partially agree but still, we have renderdoc.



I understand why you've picked D3D11, but people have to understand that comes with serious limitations. There are no subgroups, which also means no cooperative matrix multiplication ("tensor cores"). For throughput in machine learning inference in particular, there's no way D3D11 can compete with either CUDA or a more modern compute shader stack, such as one based on Vulkan 1.3.


> no subgroups

Indeed, in D3D they are called “wave intrinsics” and require D3D12. But that’s IMO a reasonable price to pay for hardware compatibility.

> no cooperative matrix multiplication

Matrix multiplication compute shader which uses group shared memory for cooperative loads: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

> tensor cores

When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is a small number often just 1, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless.
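
A back-of-envelope example (my numbers, purely illustrative): a 7B-parameter model quantized to 4 bits is roughly 3.5 GB of weights, and generating one token at batch size 1 has to stream essentially all of them. On a card with ~500 GB/s of memory bandwidth that caps generation at roughly 500 / 3.5 ≈ 140 tokens per second, no matter how many tensor-core TOPS are sitting idle next to the memory controller.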

> there's no way D3D11 can compete with either CUDA

My D3D11 port of Whisper outperformed original CUDA-based implementation running on the same GPU: https://github.com/Const-me/Whisper/



Sure. It's a tradeoff space. Gain portability and ergonomics, lose throughput. For applications that are throttled by TOPS at low precisions (ie most ML inferencing) then the performance drop from not being able to use tensor cores is going to be unacceptable. Glad you found something that works for you, but it certainly doesn't spell the end of CUDA.


> ie most ML inferencing

Most ML inferencing is throttled by memory, not compute. This certainly applies to both the Whisper and Mistral models.

> it certainly doesn't spell the end of CUDA

No, because traditional HPC. Some people in the industry spent many man-years developing very complicated compute kernels, which are very expensive to port.

AI is another story. Not too hard to port from CUDA to compute shaders, because the GPU-running code is rather simple.

Moreover, it can help with performance just by removing abstraction layers. I think the reason the compute-shader-based Whisper outperformed the CUDA-based version on the same GPU is that these implementations do slightly different things. Unlike Python and Torch, compute shaders actually program GPUs as opposed to calling libraries with tons of abstraction layers inside them. This saves the memory bandwidth spent storing and then loading temporary tensors.



This. It's crazy how primitive the GPU development process still is in the year 2023. Yeah it's gotten better, but there's still a massive gap with traditional development.


It's kinda like building Legos vs building actual Skyscrapers. The gap between compute shaders and CUDA is massive. At least it feels massive because CUDA has some key features that compute shaders lack, and which make it so much easier to build complex, powerful and fast applications.

One of the features that would get compute shaders far ahead of where they are now would be pointers and pointer casting - just let me have a byte buffer and easily cast the bytes to whatever I want. Another would be function pointers. These two are pretty much the main reason I had to stop doing a project in OpenGL/Vulkan and start using CUDA. There are so many more, however, that make life easier, like cooperative groups with device-wide sync, being able to allocate a single buffer with all the GPU memory, recursion, etc.
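
For contrast, a minimal hedged sketch of what that looks like in CUDA today: one raw device allocation, reinterpreted into whatever layout the kernel needs (the Node struct and the offsets are made up for illustration).

    #include <cuda_runtime.h>
    #include <cstdint>

    struct Node { float value; int next; };

    __global__ void touch(uint8_t* arena) {
        // Carve the raw byte buffer up by hand: 256 bytes of floats,
        // then a table of Nodes immediately after.
        float* weights = reinterpret_cast<float*>(arena);
        Node*  nodes   = reinterpret_cast<Node*>(arena + 256);
        weights[threadIdx.x]    = 1.0f;
        nodes[threadIdx.x].next = -1;
    }

    int main() {
        void* mem = nullptr;
        cudaMalloc(&mem, 1 << 20);     // one big buffer for everything (error checks omitted)
        uint8_t* arena = static_cast<uint8_t*>(mem);
        touch<<<1, 64>>>(arena);
        cudaDeviceSynchronize();
        cudaFree(arena);
    }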

Khronos should start supporting C++20 for shaders (basically what CUDA is) and stop the glsl or spirv nonsense.



You might argue for forking off from GLSL and SPIR-V for complex compute workloads, but lightweight, fast compilers for a simple language like GLSL do solve issues for graphics. Some graphics use cases can't get around shipping a shader compiler to the user. The number of possible shader configurations is often either insanely large or just impossible to enumerate, so on-the-fly compilation is really the only thing you can do.


Ironically, most people use HLSL with Vulkan, because Khronos doesn't have a budget nor the people to improve GLSL.

So yet another thing where Khronos APIs are dependent on DirectX evolution.

It used to be that AMD and NVidia would first implement new stuff on DirectX in collaboration with Microsoft, have them as extensions in OpenGL, and eventually as standard features.

Now even the shading language is part of it.



For GPGPU tasks, they lack a lot of useful features that CUDA has like the ability to allocate memory and launch kernels from the GPU. They also generally require you to write your GPU and CPU portions of an algorithm in different languages, while CUDA allows you to intermix your code and share data structures and simple functions between the two.
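
A hedged sketch of those two CUDA-only conveniences (device-side malloc and launching a kernel from inside a kernel, i.e. dynamic parallelism). It needs nvcc with relocatable device code (-rdc=true); error handling is omitted.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void child(int* buf, int n) {
        int i = threadIdx.x;
        if (i < n) buf[i] = i;
        __syncthreads();
        if (i == 0) {
            printf("child wrote %d values\n", n);
            free(buf);                 // device-heap memory can be freed on the device
        }
    }

    __global__ void parent() {
        const int n = 64;
        int* buf = static_cast<int*>(malloc(n * sizeof(int)));  // allocated on the GPU heap
        child<<<1, n>>>(buf, n);       // kernel launched from the GPU, no host round-trip
    }

    int main() {
        parent<<<1, 1>>>();
        cudaDeviceSynchronize();       // host waits for parent and its child
        return 0;
    }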


CUDA = C++ on GPUs. A compute shader = a subset of C with weird quirks.


There are existing efforts to compile SYCL to Vulkan compute shaders. Plenty of "weird quirks" involved since they're based on different underlying varieties of SPIR-V ("kernels" vs. "shaders") and seem to have evolved independently in other ways (Vulkan does not have the amount of support for numerical computation that OpenCL/SYCL has) - but nothing too terrible or anything that couldn't be addressed by future Vulkan extensions.


A subset that lacks pointers, which makes compute shaders a toy language next to CUDA.


Vulkan 1.3 has pointers, thanks to buffer device address[1]. It took a while to get there, and earlier pointer support was flawed. I also don't know of any major applications that use this.

Modern Vulkan is looking pretty good now. Cooperative matrix multiplication has also landed (as a widely supported extension), and I think it's fair to say it's gone past OpenCL.

Whether we get significant adoption of all this I think is too early to say, but I think it's a plausible foundation for real stuff. It's no longer just a toy.

[1] https://community.arm.com/arm-community-blogs/b/graphics-gam...



> Vulkan 1.3 has pointers, thanks to buffer device address[1].

> [1] https://community.arm.com/arm-community-blogs/b/graphics-gam...

"Using a pointer in a shader - In Vulkan GLSL, there is the GL_EXT_buffer_reference extension "

That extension is utter garbage. I tried it. It was the last thing I tried before giving up on GLSL/Vulkan and switching to CUDA. It was the nail in the coffin that made me go "okay, if that's the best Vulkan can do, then I need to switch to CUDA". It's incredibly cumbersome, confusing and verbose.

What's needed are regular, simple, C-like pointers.



Is IREE the main runtime doing Vulkan or are there others? Who should we be listening to (oh wise @raphlinus)?

It's been awesome seeing folks like Keras 3.0 kicking out broad intercompatibility across JAX, TF, and PyTorch, powered by flexible execution engines. Looking forward to seeing more Vulkan-based runs getting socialized, benchmarked & compared. https://news.ycombinator.com/item?id=38446353



The two I know of are IREE and Kompute[1]. I'm not sure how much momentum the latter has, I don't see it referenced much. There's also a growing body of work that uses Vulkan indirectly through WebGPU. This is currently lagging in performance due to lack of subgroups and cooperative matrix mult, but I see that gap closing. There I think wonnx[2] has the most momentum, but I am aware of other efforts.

[1]: https://kompute.cc/

[2]: https://github.com/webonnx/wonnx



How feasible would it be to target Vulkan 1.3 or such from standard SYCL (as first seen in Sylkan, for earlier Vulkan Compute)? Is it still lacking the numerical properties for some math functions that OpenCL and SYCL seem to expect?


That's a really good question. I don't know enough about SYCL to be able to tell you the answer, but I've heard rumblings that it may be the thing to watch. I think there may be some other limitations, for example SYCL 2020 depends on unified shared memory, and that is definitely not something you can depend on in compute shader land (in some cases you can get some of it, for example with resizable BAR, but it depends).

In researching this answer, I came across a really interesting thread[1] on diagnosing performance problems with USM in SYCL (running on AMD HIP in this case). It's a good tour of why this is hard, and why for the vast majority of users it's far better to just use CUDA and not have to deal with any of this bullshit - things pretty much just work.

When targeting compute shaders, you pretty much have to manage buffers manually, and also do copying between host and device memory explicitly (when needed - on hardware such as Apple Silicon, you prefer to not copy). I personally don't have a problem with this, as I like things being explicit, but it is definitely one of the ergonomic advantages of modern CUDA, and one of the reasons why fully automated conversion to other runtimes is not going to work well.

[1]: https://stackoverflow.com/questions/76700305/4000-performanc...



https://enccs.github.io/sycl-workshop/unified-shared-memory/ seems to suggest that USM is still a hardware-specific feature in SYCL 2020, so compatibility with hardware that requires a buffer copying approach is still maintained. Is this incorrect?


Good call. So this doesn't look like a blocker to SYCL compatibility. I'm interested in learning more about this.


Unified shared memory is an Intel-specific extension of OpenCL.

SYCL builds on top of OpenCL so you need to know the history of OpenCL. OpenCL 2.0 introduced shared virtual memory, which is basically the most insane way of doing it. Even with coarse grained shared virtual memory, memory pages can transparently migrate from host to device on access. This is difficult to implement in hardware. The only good implementations were on iGPUs simply because the memory is already shared. No vendor, not even AMD could implement this demanding feature. You would need full cache coherence from the processor to the GPU, something that is only possible with something like CXL and that one isn't ready even to this day.

So OpenCL 2.x was basically dead. It has unimplementable mandatory features so nobody wrote software for OpenCL 2.x.

Khronos then decided to make OpenCL 3.0, which gets rid of all these difficult to implement features so vendors can finally move on.

So, Intel is building their Arc GPUs and they decided to create a variant of shared virtual memory that is actually implementable called unified shared memory.

The idea is the following: All USM buffers are accessible by CPU and GPU, but the location is defined by the developer. Host memory stays on the host and the GPU must access it over PCIe. Device memory stays on the GPU and the host must access it over PCIe. These types of memory already cover the vast majority of use cases and can be implemented by anyone. Then finally, there is "shared" memory, which can migrate between CPU and GPU in a coarse-grained manner. This isn't page-level; the entire buffer gets moved, as far as I am aware. This allows you to do CPU work, then GPU work, and then CPU work. What doesn't exist is a fully cache-coherent form of shared memory.

https://registry.khronos.org/OpenCL/extensions/intel/cl_inte...
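
To make the three flavours concrete, here is a minimal sketch of how they surface in SYCL 2020's USM API, using the standard sycl::malloc_* calls (how "shared" actually migrates is implementation-defined, as described above):

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;
        constexpr int n = 1024;

        float* host   = sycl::malloc_host<float>(n, q);    // pinned in host RAM, device reads it over the bus
        float* device = sycl::malloc_device<float>(n, q);  // lives in device memory, host needs explicit copies
        float* shared = sycl::malloc_shared<float>(n, q);  // may migrate between host and device

        for (int i = 0; i < n; ++i) shared[i] = float(i);  // touched on the CPU

        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            shared[i] *= 2.0f;                              // touched on the device
        }).wait();

        q.memcpy(device, shared, n * sizeof(float)).wait(); // explicit copy into device-only memory
        q.memcpy(host,   device, n * sizeof(float)).wait(); // and back out to host-resident memory

        sycl::free(host, q);
        sycl::free(device, q);
        sycl::free(shared, q);
    }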



Compute shaders are not capable of using modern GPU features like tensor cores or many of the other features needed to feed tensor cores data fast enough (e.g. TMA/cp.async.shared)


Can anybody with a deep knowledge of the AI space, explain to me what's the real moat of CUDA ?

It's clear to everybody that it's not the hardware but the software - which is the CUDA ecosystem.

I've played a bit in the past with ML, but at the level of understanding I had - training some models, tweaking things - I was using higher-level libraries, and as far as I know, it's pretty much an if statement in those libraries to decide which backend to use.

So let's suppose Intel and others do manage to implement a viable competitor - am I wrong in thinking that the transition for many users would be seamless? That's probably not the case for researchers and people pushing the boundaries, but for most companies, my understanding is there would not be a lot of migration cost involved?



Your understanding is correct, but the predicates are not easy at all. The amount of work going into CUDA is enormous, and NVIDIA is not standing still waiting for their competitors to catch up.


You need performance in the high level libraries to match, on a flops/$ basis. That’s “it”. That’s easier said than done, though. Even google’s TPUs still struggle to match H100s at flops/$, and they’re really annoying to use unless you’re using Jax.


I see the situation as a lot like the original IBM PC wars. Originally you had the IBM PC and a bunch of "compatibles" that weren't drop-in compatible but half-assed compatible - many programs needed to be re-compiled to run on them. And other large American companies made these - they didn't expect to commodify the PC, they just wanted a small piece of a big market.

The actual PC clones, pure drop-in compatibles, were made in Taiwan and they took over the market. Which is to say that large companies don't want a commodified market where prices are low and everyone competes on a level playing field - which is what "seamless transition" gets you. So that's why none of these companies are working to create that.



I would summarize it as follows: Nvidia has taken the bottom-up approach, from (parallel processing) hardware to the development environment built upon it. The competition (Intel) appears to be attempting to break into the market using a top-down approach, hoping to get some share of the inference market (using their sequential processing hardware) and basically leveraging the innovations happening at Nvidia. CUDA will always remain one step ahead.


It is incorrect to say that the moat is the software. The moat is primarily the compute hardware, which is still incredibly good for the price, plus really good networking equipment. CUDA is not a significant moat for massive LLM training runs, as you can see from Anthropic moving from CUDA to Trainium (and so presumably rewriting all of their kernels to Trainium).


Intel and AMD have had years to provide similar capabilities on top of OpenCL.

Maybe they should look into their own failures first.



SYCL is a better analog to Cuda than OpenCL, and Intel have their own implementation of that. Don't really see anyone writing anything in SYCL though, and when I looked into trying it out it was a bit of a mess with different implementations, each supporting their own subset of OSs and hardware.

https://www.intel.com/content/www/us/en/developer/tools/onea...



> Don't really see anyone writing anything in SYCL though

My work involves writing software that runs on many GPU platforms at once. So far we have been going down the Kokkos route, but SYCL is looking pretty good to me these days. There is some consolidation happening in this space (Codeplay gave up working on their own implementation and merged with Intel). It was pretty easy to set up on my Linux machine for an Nvidia card. The documentation is very good and professional, unlike AMD's, which can be frankly horrible at times. And Intel has a good track record with software.

I genuinely believe if someone is going to dethrone CUDA, at this point SYCL (oneAPI) is a far more likely candidate than Rocm/HIP.



I am unfamiliar with the implementation, but why would it be difficult to implement a Cuda-compatible software layer on top of other platforms?

This would be the first step. Then, if we want to move away from Cuda into hardware that's as ubiquitous and performant as Nvidia's (or better), someone would need to write an abstraction layer that's more convenient to use than Cuda. I did play a little bit with Cuda and OpenCL, but not enough to hate either.



Because that is a herculean task, given the hardware semantics, the number of languages that target PTX, the graphical debugging tools that expose every little detail of the cards as if you were debugging on the CPU, and the library ecosystem.

Any CUDA-compatible software layer has only two options: be a second-class CUDA implementation by being compatible with a subset, like AMD's ROCm and HIP efforts, or try to be compatible with everything while always playing catch-up.

The only way is to use middleware that, just like in 3D APIs, abstracts the actual compute API being used, as many language bindings are doing nowadays.



Worth noting that CUDA code seems prone to using inline PTX assembly, and the latter is really obnoxious to deal with on other platforms.

Implementing programming languages and runtimes is pretty difficult in general. Note that cuda doesn't have the same semantics as c++ despite looking kind of similar. Wherever you differ from expected behaviour people consider it a bug, and implementing based on cuda's docs wouldn't get you the behaviour people expect.

Pretty horrendous task overall. It would be much better for people to stop developing programs that only run on a gnarly proprietary language.



CUDA was redesigned to follow C++'s memory model introduced in C++11, yet another compatibility pain point with other hardware vendors.


SYCL is gaining traction, especially in the HPC community, since it can target AMD, Nvidia and Intel hardware with one codebase. A fun fact is that GROMACS (a major application for molecular dynamics, and a big consumer of HPC time) recommends SYCL for running on AMD hardware!


SYCL builds on top of OpenCL; it is basically the reboot of OpenCL C++ after the OpenCL 2.0 SPIR failure.

Yet another example of how Intel and AMD failed to take on CUDA.



SYCL isn't based on OpenCL.

SYCL (SYCL-2020 spec) supports multiple backends, including Nvidia's CUDA, AMD's HIP, OpenCL, Intel's Level-zero, and also running on the host CPU. This can either be done with Intel's DPC++ w/ Codeplay's plugins, or using AdaptiveCpp (aka. hipSYCL, aka openSYCL). OpenCL is just another backend.

It is also a very long way from OpenCL C++. The code is a single C++ file, and you don't need to write any special kernel language. The vast majority of SYCL is just C++, so -if you avoid a couple of features- you can use SYCL in library-only form without even any special compiler! This is possible for instance with AdaptiveCpp.
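
A minimal illustration of the "just C++" point (hedged: which backends are actually available depends on the SYCL implementation and plugins you have installed):

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        std::vector<int> data(256, 1);
        {
            sycl::queue q;   // picks a backend: CUDA, HIP, Level Zero, OpenCL or the host CPU
            sycl::buffer<int> buf(data.data(), sycl::range<1>(data.size()));
            q.submit([&](sycl::handler& h) {
                sycl::accessor acc(buf, h, sycl::read_write);
                h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
                    acc[i] += 41;        // an ordinary C++ lambda, run on the device
                });
            });
        }   // buffer goes out of scope here and copies the results back into the vector
        return data[0] == 42 ? 0 : 1;
    }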



It's pretty much based on the same underlying featureset. Which is why trying to target Vulkan Compute from it is messy enough, whereas OpenCL is a natural target.


Now, but that isn't how it started after the OpenCL 3.0 reboot, where 3.0 == 1.0.

Also, some of that work we have to thank Codeplay for, from before their acquisition by Intel.



Similar capabilities and similar - actually, better - performance: they don't have any of these at this time. People don't want to buy slower and maybe cheaper computation on Nvidia hardware - why would they want to do this on Intel hardware? Your app will have to change; it just seems like an obvious non-starter.

I'm not an expert here - am I missing something? If the x86 industry is motivated to move away from what Nvidia provides, Intel needs to tick some of these 'better somehow' boxes.



They've both implemented opencl. Nvidia has too. The industry could have built on that common language. Instead, it built on cuda, and complains that the other hardware vendors don't have cuda.

I attribute this to opencl being the common subset a bunch of companies could agree could be implemented. I wrote some code that compiles as cuda, opencl, C++ and openmp, and the entire exercise was repeatedly "what, opencl can't do that either? damn it".



Intel tried with OneAPI years ago. Turns out they were decades behind, so catching up takes a while…


AMD definitely has and is doubling down on ROCm.


Perhaps they're doubling down, but even doubling down is not enough to say that they're serious about it, since they've been so neglectful for so many years - for example, right now they explicitly say that many AMD GPUs are not supported by ROCm; if they're not willing to put their money where their mouth is and do the legwork to ensure support for powerful cards they sold just a few years ago, how can they say you should rely on their platform?

Unless a random gamer with a random AMD GPU can go to amd.com and download pre-packaged, officially supported tools that work out of the box on their machine and after a few clicks have working GPU-accelerated pytorch (which IMHO isn't the case, but admittedly I haven't tried this year) then their "doubling down" isn't even meeting table stakes.



People argue for ROCm to support older cards because that is all they have accessible to them. AMD has lagged on getting expensive cards into the hands of end users because they've focused only on building super computers.

I predict that access to the newer cards is a more likely scenario. Right now, you can't rent a MI250 or even MI300x, but that is going to change quickly. Azure is going to have them, as well as others (I know this, cause that's what I'm building now).



The way I see it, the whole point of ROCm support is being able to service the many users who have pre-existing AMD cards and nothing else available. If someone is going to rent a GPU, I don't need to bother with adding extra features for them, because they can just rent a CUDA-capable GPU instead.

If I'm considering adding ROCm support to some ML-enabled tool - no matter whether it's a commercial product or an open-source library - the thing I need from AMD is to ensure that the ROCm support I build will work without hassle for those end users with random old AMD gaming cards (because they are the only users who need the tool to have ROCm support). And if ROCm upstream explicitly drops support for some cards because AMD no longer regularly tests them, well, the ML tool developers aren't going to do that testing for them either; that's simply AMD intentionally refusing to do even the bare minimum (lots of testing across a wide variety of hardware to fix compatibility issues) that I'd expect to be table stakes for saying that "AMD is doubling down on ROCm".



> The way I see it, the whole point of ROCm support is being able to service the many users who have pre-existing AMD cards and nothing else available.

ROCm is a stack of a whole lot of stuff. I don't see a stack of software being "the whole point".

> the thing I need from AMD is to ensure that the ROCm support I make will work without hassle for these end-users with random old AMD gaming cards (because these are the only users who need the tool to have ROCm support)

From wikipedia:

"ROCm is primarily targeted at discrete professional GPUs"

They are supporting Vega onwards and are clear about the "whole point" of ROCm.



I'm talking about what would be the point for someone to add ROCm support to various pieces of software which currently require CUDA, as IMHO this is the core context not only of this thread but of the whole discussion of this article - about ROCm becoming a widely used replacement or alternative for CUDA.

> "ROCm is primarily targeted at discrete professional GPUs"

That's kind of true, and that is a big part of the problem - while AMD has this stance, ROCm won't threaten to replace or even meet CUDA, which has a much broader target; if you and/or AMD want to go in this direction, that's completely fine, that is a valuable niche - but limiting the application to that niche clearly is not "doubling down on ROCm" as a competitor for CUDA, and that disproves the TFA claim by Intel that "the entire industry is motivated to eliminate CUDA", because ROCm isn't even trying to compete with CUDA at the core niches which grant CUDA its staying power unless it goes way beyond merely targeting discrete professional GPUs.



> what would be the point for someone to add ROCm support to various pieces of software which currently require CUDA

It isn't just old cards though; CUDA is a point of centralization on a single provider, during a time when access to that provider's higher-end cards isn't even available, and that is causing people to look elsewhere.

ROCm supports CUDA through the included HIP projects...

https://github.com/ROCm/HIP

https://github.com/ROCm/HIPCC

https://github.com/ROCm/HIPIFY

The latter will regex-replace your CUDA methods with HIP methods. If it is as easy as running hipify on your codebase (or just coding to the HIP APIs), it certainly makes sense to do so.
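
Roughly what that mechanical translation looks like (a hedged before/after sketch, not the tool's literal output):

    // Before (CUDA)
    #include <cuda_runtime.h>
    void upload(const float* h, float** d, size_t bytes) {
        cudaMalloc(reinterpret_cast<void**>(d), bytes);
        cudaMemcpy(*d, h, bytes, cudaMemcpyHostToDevice);
    }

    // After hipify (HIP)
    #include <hip/hip_runtime.h>
    void upload(const float* h, float** d, size_t bytes) {
        hipMalloc(reinterpret_cast<void**>(d), bytes);
        hipMemcpy(*d, h, bytes, hipMemcpyHostToDevice);
    }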



Well yeah. Before I go renting a super GPU in the cloud, I'd like to get my feet wet with the 5-year-old but reasonably well-specced AMD GPU (Vega 48) in my iMac... but I can't. It's more rational for me to get a fancy 2021 GPU or a Jetson and stick it in an enclosure or build a Linux box around it. At least I know CUDA is a mature ecosystem and is going to be around for a while, so whatever time I invest in it is likely to pay for itself.

I get your point about AMD not wanting to spend money on supporting old hardware, but how do they expect to build a market without a fan base?



> I get your point about AMD not wanting to spend money on supporting old hardware, but how do they expect to build a market without a fan base?

Look, I get it. You're right. They do need to work on building their market and they really screwed the pooch on the AI boat. The developer flywheel is hugely important and they missed out on that. That said, we can't expect them to go back in time, but we can keep moving forward. Having enough people making noise about wanting to play with their hardware is certainly a step in the right direction.



> People argue for ROCm to support older cards because that is all they have accessible to them.

What they really need is to support the less expensive cards, of which the older cards are a large subset. There are a lot of people who will make contributions and fix bugs if they can actually use the thing. Some CS student at the university has to pay tuition and therefore only has an old RX570, and that isn't going to change in the next couple years, but that kind of student could fix some of the software bugs currently preventing the company from selling more expensive GPUs to large institutions. If the stack supported their hardware.



ROCm works on RX570 ("gfx803", including RX470+RX580 too).

Support was dropped upstream, but only because AMD no longer regularly test it. The code is still there and downstream distributors (like if you just apt-get install libamdhip64-5 && pip3 install torch) usually flip it enabled again.

https://salsa.debian.org/rocm-team/community/team-project/-/...

There is a scary red square in the table, but in my experience it worked completely fine for Stable Diffusion.
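For anyone trying this on a gfx803 card, a quick sanity check (just a sketch, assuming a HIP runtime is installed) is to ask the runtime what it can actually see before debugging anything higher up the stack:

```cpp
// Minimal HIP device query: confirms whether the ROCm runtime sees the card at all.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::printf("no HIP devices visible\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t props{};
        hipGetDeviceProperties(&props, i);
        // An RX 470/480/570/580 should report an architecture string like "gfx803".
        std::printf("device %d: %s (%s)\n", i, props.name, props.gcnArchName);
    }
    return 0;
}
```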



I ran 130,000 RX 470-580 cards, so I know them quite well. Those cards aren't going to do anything useful with AI/ML. The technology is just too old and things are moving too quickly. And it isn't just the card - it's the mobo, disks, RAM, networking...


That's fine for corporate customers, but how do you expect kids and hobbyists to learn the basics without spending thousands on an A6000 or something?


I believe strongly in "where there is a will, there is a way."

Those kids and hobbyists can't even rent time on high end AMD hardware today. I see that as one piece of the puzzle that I'm personally dedicating my time/resources to resolving.



An RX 570 is going to do ML faster than a typical desktop CPU. That's all it takes for the person who has one to want to use it for Llama or Stable Diffusion, and then to want to improve the software for the thing they're now using.

Not everything is a huge datacenter.



What is nonsense is the idea that AMD should dedicate its limited resources to supporting a six-year-old card with only 4-8 GB of RAM (the ones I ran had 8).

I didn't say they are bad cards... they are just outdated at this point.

If you really want to put your words into action... let me know. I'll put you in touch with someone so you can buy 130,000 of these cards and sell them to every college kid out there... Until then, I wouldn't rake AMD over the coals for not wanting to put effort into something like that when they are already lagging behind on their AI efforts as it is. I'd personally rather see them catch up a bit first.



8 GB is enough for Stable Diffusion or Llama 13B at q4. Yes, the cards are outdated, but non-outdated GPUs are still expensive, so older cards are all many people can afford.
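(Rough back-of-the-envelope check, my own numbers: 13 billion parameters at roughly 4 bits each is about 6.5 GB of weights, which leaves something like 1.5 GB of an 8 GB card for activations and the KV cache - tight, but workable for inference.)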

> I'll put you in touch with someone to buy 130,000 of these cards, and you can sell them to every college kid out there...

Just sell them on eBay? They still go for $50-$100 each, so you're sitting on several million dollars' worth of GPUs.

> I'd personally rather see them catch up a bit first.

Growing the community is how you catch up. That doesn't happen if people can't afford the only GPUs you support.



> Growing the community is how you catch up.

Agreed 100%.

> That doesn't happen if people can't afford the only GPUs you support.

On this part, we are going to have to agree to disagree. I feel that being able to affordably rent time on high-end GPUs is at least an alternative to buying them. As I mentioned above, that is something I'm actively working on.



> I feel like being able to at least affordably rent time on the high end GPUs is another alternative to buying them.

There are two problems with this.

The first is high demand. GPU time on a lot of cloud providers is sold out.

The second is that it costs money at all, versus using the GPU you already have. Needing a credit card is a barrier to hobbyists, and you want hobbyists, because they become contributors, or they get introduced to the technology and then go on to buy one of your more expensive GPUs.

You want the barrier to adoption to be level with the ground.



> The first is high demand. GPU time on a lot of cloud providers is sold out.

Something I'm trying to help with. =) Of course, I'm sure I'll be sold out too, or at least I hope so, cause that means buying more GPUs! But at least I'm actively putting my own time/energy toward this goal.

> The second is that this costs money at all, vs. using the GPU you already have.

As much as I'd love to believe in a utopia where every single GPU can be used for science, I don't think we are ever going to get there. AMD, while large, isn't a company with infinite resources. We're also talking about a specialty level of engineering.

> You want the barrier to adoption to be level with the ground.

100% agreed, it is a good goal, but that's a much larger problem than just AMD supporting their 6-7 year old cards.



The older cards are going to be supported by Mesa and RustiCL for the foreseeable future. ROCm is not the only game in town - far from it.


Thing is, for there to be a realistic alternative to CUDA, something needs to become "the only game in town", because people definitely won't add support for ROCm and Mesa and RustiCL and something else besides. Getting support for even one non-CUDA backend is already proving difficult enough, so if the alternatives are fragmented, the situation only gets worse.


RustiCL is just an OpenCL implementation. You can't have one-size-fits-all, because hardware-specific details vary a lot across generations. (That is also why newer versions of e.g. ROCm drop support for older hardware.) The best you can do is baseline support plus extensions, which is the Vulkan approach.
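As a rough illustration of that baseline-plus-extensions model (a sketch, not tied to any particular vendor's driver), a Vulkan application simply asks each physical device which optional extensions it exposes and adapts at runtime, rather than requiring a separate stack per hardware generation:

```cpp
// Enumerate devices and their optional extensions; feature-detect instead of
// hard-coding per-generation support.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    VkInstanceCreateInfo ici{};
    ici.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t devCount = 0;
    vkEnumeratePhysicalDevices(instance, &devCount, nullptr);
    std::vector<VkPhysicalDevice> devices(devCount);
    vkEnumeratePhysicalDevices(instance, &devCount, devices.data());

    for (VkPhysicalDevice dev : devices) {
        uint32_t extCount = 0;
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &extCount, nullptr);
        std::vector<VkExtensionProperties> exts(extCount);
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &extCount, exts.data());

        // Older hardware simply advertises fewer optional extensions; the
        // application falls back to the baseline when one is missing.
        bool hasFp16 = false;
        for (const auto &e : exts)
            if (std::strcmp(e.extensionName, "VK_KHR_shader_float16_int8") == 0)
                hasFp16 = true;
        std::printf("%u extensions, fp16/int8 shader ext: %s\n",
                    extCount, hasFp16 ? "yes" : "no");
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```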




Apple as well. Everyone's failure to commit in this situation let a highly integrated competitor clean up. It's funny how much clearer OpenCL's value proposition looks in hindsight...


Apple created OpenCL, and after disagreements with Khronos on how to move OpenCL forward, they stopped caring.

Apple also doesn't care about HPC, and has their own stuff on top of Metal Compute Shaders, just like Microsoft has DirectML.


