6 GPUs because they also want fast storage, which uses PCIe lanes. Besides, the goal was to run a 70B FP16 model (requiring roughly 140 GB of VRAM), and 6 × 24 GB = 144 GB.
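The VRAM figure above is easy to sanity-check with back-of-the-envelope arithmetic (a sketch; the only inputs are the 70B parameter count and 2 bytes per FP16 weight, ignoring KV cache and activations):

```python
# Rough VRAM needed just to hold a 70B-parameter model in FP16.
params = 70e9          # 70 billion parameters
bytes_per_param = 2    # FP16 = 2 bytes per weight

weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: {weights_gb:.0f} GB")    # 140 GB

# Six 24 GB cards give just enough headroom for the weights themselves.
total_vram_gb = 6 * 24
print(f"6x 24 GB cards: {total_vram_gb} GB")    # 144 GB
```

Note the margin is only 4 GB, which is why the KV cache and runtime overhead still matter in practice.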
I’ve got a few 4090s that I’m planning on doing this with. Would appreciate even the smallest directional tip you can provide on splitting the model that you believe is likely to work.
|
The design probably dates from just before running LLMs with tensor parallelism became interesting. There are plenty of other workloads that divide across 6 GPUs nicely; it's not an end-all thing.
|
6 seems reasonable. The 128 lanes from Threadripper need to keep a few free for networking and NVMe (4× NVMe would be 16 lanes, and a 10G NIC another 4).
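The lane budget works out with room to spare; a quick sketch using the (illustrative, not a real BOM) numbers from the comment above:

```python
# Hypothetical PCIe lane budget for a 6-GPU Threadripper Pro build.
total_lanes = 128

gpus = 6 * 16        # six GPUs at x16 each
nvme = 4 * 4         # four NVMe drives at x4 each
nic  = 4             # one 10G NIC at x4

used = gpus + nvme + nic
print(f"used: {used}, spare: {total_lanes - used}")   # used: 116, spare: 12
```

An 8-GPU build at full x16 would already need 128 lanes for the GPUs alone, leaving nothing for storage or networking without a PCIe switch.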
|
I don't think P2P is very relevant for inference; it's important for training. Inference can just be sharded across GPUs without them sharing memory directly.
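The sharding described here can be sketched in plain Python: each hypothetical "GPU" owns a disjoint slice of the layers, and only the small activation vector travels between devices, so no device ever reads another's memory (a toy illustration, not a real multi-GPU setup):

```python
import random

random.seed(0)

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# A toy 8-layer "model": each layer is a small weight matrix.
n = 8
layers = [[[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
          for _ in range(8)]

# Pipeline-shard the layers across 4 hypothetical GPUs: each device
# holds two consecutive layers and never needs the others' weights --
# only the activation vector hops from device to device.
shards = [layers[i:i + 2] for i in range(0, 8, 2)]

x = [1.0] * n
for shard in shards:                 # one hop per "device"
    for w in shard:
        x = [max(0.0, v) for v in matvec(w, x)]   # ReLU

print(len(x))   # 8
```

Tensor parallelism, by contrast, splits each layer across devices and needs an all-reduce per layer, which is where fast GPU-to-GPU links (and P2P) start to matter.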
|
> It doesn't beat RTX 4090 when it comes to actual LLM inference speed

Sure, whisper.cpp is not an LLM. The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower. I wonder if with https://github.com/tinygrad/open-gpu-kernel-modules (the 4090 P2P patches) it might become a lot faster to split a too-large model across multiple 4090s and still outperform ASi (at least until someone at Apple does an MLX LLM).
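A rough way to compare the two setups: single-batch decoding is usually memory-bandwidth bound, since every generated token streams all the weights through memory once. A sketch with assumed bandwidth figures (rough public specs, and the aggregate assumes ideal tensor-parallel scaling, which real setups won't reach):

```python
# Memory-bandwidth-bound decode estimate: tokens/s ~= bandwidth / model size.
model_gb = 140                # 70B model in FP16

asi_bw = 800                  # GB/s, M2 Ultra-class unified memory (assumed)
multi_4090_bw = 4 * 1008      # GB/s aggregate across 4x RTX 4090 (assumed)

asi_tps = asi_bw / model_gb
gpu_tps = multi_4090_bw / model_gb

print(f"ASi:      ~{asi_tps:.1f} tok/s")
print(f"4x 4090:  ~{gpu_tps:.1f} tok/s")
```

Under these assumptions the 4090 cluster wins on raw speed, but only if the model can actually be split efficiently, which is where the P2P patches come in.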
Sure, it's also at least an order of magnitude slower in practice, compared to 4x 4090 running at full speed. We're looking at 10 times the memory bandwidth and much greater compute.
|
If you mean training Llama from scratch, you aren't going to do that on any single box. But even with a single 3090 you can do quite a lot with LLMs (through QLoRA and similar).
What does P2P mean in this context? I Googled it and it sounds like it means "peer to peer", but what does that mean in the context of a graphics card?
|
Is this really efficient or practical? My understanding is that the latency of copying memory from CPU RAM to the GPU negates any performance benefits (much less running over a network!)
|
PCIe P2P traffic still has to go up to a central switch and back, because PCIe is not a bus. Those switches are made by very few players (most famously PLX Technologies) and they cost a lot.
|
PCIe isn't a bus and it doesn't really have a concept of mastering. All PCI DMA was based on bus mastering, but P2P DMA is trickier than normal DMA.
|
Taking off the user's panel on the side of their house and flipping it to 'lots of power' when that option had previously been covered up by the panel interface.
|
Glad to see that geohot is back being geohot, first by dropping a local DoS for AMD cards, then this. Much more interesting :p
|
He has a very checkered history with "hacking" things. He tends to build heavily on the work of others, then use it to shamelessly self-promote, often to the massive detriment of the original authors.

His PS3 work was based almost completely on a presentation given by fail0verflow at CCC. His subsequent self-promotion grandstanding world tour led to Sony suing both him and fail0verflow, an outcome they were specifically trying to avoid: https://news.ycombinator.com/item?id=25679907

In iPhone land, he decided to parade around a variety of leaked documentation, endangering the original sources and leading to a fragmentation in the early iPhone hacking scene, which he then again exploited to build on the work of others for his own self-promotion: https://news.ycombinator.com/item?id=39667273

There's no denying that geohot is a skilled reverse engineer, but it's always bothersome to see him put onto a pedestal in this way.
I actually lost about $5k on cheapETH running servers. Nobody was "defrauded", I think these people don't understand how forks work. It's a precursor to the modern L2 stuff, I did this while writing the first version of Optimism's fraud prover: https://github.com/ethereum-optimism/cannon

I suspect most of the people who bring this up don't like me for other reasons, but with this they think they have something to latch on to. Doesn't matter that it isn't true and there wasn't a scam, they aren't going to look into it since it agrees with their narrative.
Sure, but that was something that was always going to happen. So it's better to have it for at least one generation instead of no generation.
> You may need to uninstall the driver from DKMS. Your system needs large BAR support and IOMMU off.

Can someone point me to the correct tutorial on how to do these things?
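For anyone else stuck on the same question, a rough sketch of the usual steps (distro- and motherboard-dependent; the driver version below is a placeholder, and BIOS option names vary, so treat this as hints rather than a tested recipe). "Large BAR support" is the BIOS/UEFI Resizable BAR toggle, usually paired with "Above 4G Decoding"; the IOMMU can be disabled with a kernel boot parameter:

```shell
# /etc/default/grub -- add the IOMMU-off parameter for your CPU vendor
# (intel_iommu=off or amd_iommu=off), then update grub and reboot.
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=off"

# Remove the stock NVIDIA module from DKMS before installing the patched
# one (replace the version with whatever `dkms status` shows).
sudo dkms remove nvidia/550.54.14 --all

# Verify large BAR: the GPU (vendor ID 10de) should show a large
# prefetchable memory region instead of a 256M one.
sudo lspci -v -d 10de: | grep -i "memory at"
```

Resizable BAR itself is enabled in firmware setup, not from Linux, so check the BIOS first if the memory regions still look small.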
Can someone ELI5 what this may make possible that wasn't possible before? Does this mean I can buy a handful of 4090s and use it in lieu of an h100? Just adding the memory together?
|
Wow, that was a ride. Really pushing the Overton window. "Regulating access to compute rather than data" - they're really spelling out their defection in the war on access to general computation.
I find it baffling that ideas like "govern compute" are even taken seriously. What the hell has happened to the ideals of freedom?! Does the government own us or something?
|
Taxes, conscription and even pedestrian traffic rules make sense at least to some degree. Restricting "AI" because of what some uninformed politician imagines it to be is in a whole different league.
|
Voldemort is fictional and so are bumbling wizard apprentices. Toy-level, not-yet-harmful AIs on the other hand are real. And so are efforts to make them more powerful. So the proposition that more powerful AIs will exist in the future is far more likely than an evil super wizard coming into existence.

And I don't think literal 5-word-magic-incantation mind control is essential for an AI to be dangerous. More subtle or elaborate manipulation will be sufficient. Employees already have been duped into financial transactions by faked video calls with what they assumed to be their CEOs[0], and this didn't require superhuman general intelligence, only one single superhuman capability (realtime video manipulation).

[0] https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-scam-ho...
You say you have not seen any arguments that convince you. Is that just not having seen many arguments or having seen a lot of arguments where each chain contained some fatal flaw? Or something else?
I still don't understand why they went with 6 GPUs for the tinybox. Many things will only function well with 4 or 8 GPUs. It seems like the worst of both worlds now (use 4 GPUs but pay for 6 GPUs, don't have 8 GPUs).
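One concrete version of the 4-vs-6 complaint: tensor parallelism typically splits attention heads evenly across GPUs, and common head counts are powers of two, so 6 often doesn't divide them. A quick sketch (the 64 is Llama-2-70B's query-head count, used here as an assumed example):

```python
# Tensor parallelism usually requires the attention head count to be
# divisible by the GPU count so each device gets whole heads.
heads = 64   # e.g. Llama-2-70B query heads (assumed example)

for gpus in (2, 4, 6, 8):
    ok = heads % gpus == 0
    print(f"{gpus} GPUs: {'splits evenly' if ok else 'does NOT split evenly'}")
```

By this measure 2, 4, and 8 GPUs split the heads cleanly while 6 does not, which is why frameworks tend to default to power-of-two tensor-parallel sizes.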