(comments)

Original link: https://news.ycombinator.com/item?id=40010819

Tinygrad is a machine learning library that can split a model across multiple GPUs for inference using tensor parallelism, which allows several high-end GPUs such as the RTX 4090 to be used efficiently. Splitting a model, however, comes with specific conditions, including having 4 or 8 GPUs and compatible cards with enough VRAM. Mixing different GPUs with different VRAM capacities can produce inconsistent results and is not recommended. Using more than two GPUs with the same VRAM may require extra consideration, particularly around PCIe lanes and switch fabric. Although SXM2/NVLink 2.0 originally supported six-way systems, newer versions may further improve GPU-to-GPU communication. A six-GPU setup is considered suboptimal for many existing applications because it is hard to divide work evenly across six GPUs, and the 48GB DDR5 modules required for such a configuration lack broad consumer hardware support. Instead, the discussion recommends focusing on configurations with 4 or 8 GPUs, given their ease of implementation and compatibility with popular processor architectures.


Original text


Incredible! I'd been wondering if this was possible. Now the only thing standing in the way of my 4x4090 rig for local LLMs is finding time to build it. With tensor parallelism, this will be both massively cheaper and faster for inference than an H100 SXM.

I still don't understand why they went with 6 GPUs for the tinybox. Many things will only function well with 4 or 8 GPUs. It seems like the worst of both worlds now (use 4 GPUs but pay for 6 GPUs, don't have 8 GPUs).



tinygrad supports uneven splits. There's no fundamental reason for 4 or 8, and work should almost fully parallelize on any number of GPUs with good software.

We chose 6 because we have 128 PCIe lanes, aka 8 16x ports. We use 1 for NVMe and 1 for networking, leaving 6 for GPUs to connect them in full fabric. If we used 4 GPUs, we'd be wasting PCIe, and if we used 8 there would be no room for external connectivity aside from a few USB3 ports.
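
As a rough lane-budget sketch of that split (simple arithmetic based on the description above, nothing more):

  # Hypothetical budget for a 128-lane platform divided into x16 ports.
  total_lanes = 128
  lanes_per_port = 16
  ports = total_lanes // lanes_per_port   # 8 x16 ports
  reserved = 2                            # 1 port for NVMe, 1 for networking
  gpu_ports = ports - reserved            # 6 GPUs, each at full x16
  print(ports, gpu_ports)                 # 8 6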



That is very interesting if tinygrad can support it! Every other library I've seen had the limitation on dividing the heads, so I'd (perhaps incorrectly) assumed that it's a general problem for inference.


It should if your 3090s have Resizable BAR support in the VBIOS. AFAIK most card manufacturers released BIOS updates enabling this.

Re: 3090 NVLink, that only allows pairs of cards to be connected. PCIe allows full fabric switch of many cards.



Have you compared 3x 3090-3090 pairs over NVLink?

IMO the most painful thing is that since these hardware configurations are esoteric, there is no software that detects them and moves things around "automatically." Regardless of what people think device_map="auto" does, and anyway, Hugging Face's transformers/diffusers are all over the place.



6 GPUs because they want fast storage and it uses PCIe lanes.

Besides, the goal was to run a 70b FP16 model (requiring roughly 140GB of VRAM). 6*24GB = 144GB



That calculation is incorrect. You need to fit both the model (140GB) and the KV cache (5GB at 32k tokens FP8 with flash attention 2) * batch size into VRAM.

If the goal is to run a FP16 70B model as fast as possible, you would want 8 GPUs with P2P, for a total of 192GB VRAM. The model is then split across all 8 GPUs with 8-way tensor parallelism, letting you make use of the full 8TB/s memory bandwidth on every iteration. Then you have 50GB spread out remaining for KV cache pages, so you can raise the batch size up to 8 (or maybe more).
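
A back-of-the-envelope version of that budget (a sketch in Python; the per-sequence KV figure is the thread's estimate and varies with model and cache precision):

  # Illustrative numbers only.
  weights_gb = 70e9 * 2 / 1e9     # 70B params at FP16 (2 bytes each) ~ 140 GB
  total_vram_gb = 8 * 24          # 8x RTX 4090
  kv_per_seq_gb = 5               # ~32k tokens, FP8 KV cache, flash attention 2
  spare_gb = total_vram_gb - weights_gb
  print(spare_gb)                          # ~52 GB left for KV cache pages
  print(int(spare_gb // kv_per_seq_gb))    # ~10; closer to 8 after activations/overhead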



I’ve got a few 4090s that I’m planning on doing this with. Would appreciate even the smallest directional tip you can provide on splitting the model that you believe is likely to work.


The split is done automatically by the inference engine if you enable tensor parallelism. TensorRT-LLM, vLLM and aphrodite-engine can all do this out of the box. The main thing is just that you need either 4 or 8 GPUs for it to work on current models.
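
For instance, a minimal sketch with vLLM (the model ID and GPU count here are placeholders; the engine handles the tensor-parallel split itself):

  from vllm import LLM, SamplingParams

  # Shard the model across 4 GPUs with tensor parallelism.
  llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)
  out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
  print(out[0].outputs[0].text)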


2 GPUs works fine too, as long as your model fits. Using different GPUs with the same VRAM, however, is highly, highly sketchy. Sometimes it works, sometimes it doesn't. In any case, it would be limited by the performance of the slower GPU.


I was googling public NVIDIA SXM2 materials the other day, and it seemed SXM2/NVLink 2.0 was simply a six-way system. NVIDIA SXM has since been updated to versions 3 and 4, and this box isn't based on any of those anyway, but maybe there's something we don't know that makes six-way reasonable.


It was probably just before running LLMs with tensor parallelism became interesting. There are plenty of other workloads that can be divided by 6 nicely, it's not an end-all thing.


6 seems reasonable. The 128 lanes from Threadripper need a few reserved for networking and NVMe (4x NVMe would be x16 lanes, and 10G networking would be another x4 lanes).


I don't think P2P is very relevant for inference. It's important for training. Inference can just be sharded across GPUs without sharing memory between them directly.


It can make a difference when using tensor parallelism to run small batch sizes. Not a huge difference like training because we don't need to update all weights, but still a noticeable one. In the current inference engines there are some allreduce steps that are implemented using nccl.

Also, paged KV cache is usually spread across GPUs.



It massively helps arithmetic intensity to batch during inference, and the batch sizes needed for that tend to exceed the memory capacity of a single GPU. Hence the desire to do training-like cluster processing, e.g. to use a weight for every inference stream that needs it each time it's fetched from memory. It's just that you typically can't fit 100+ inference streams of context on one GPU, hence the desire to shard along dimensions that are less wasteful (w.r.t. memory bandwidth) than entire inference streams.


For example, if you want to run low latency multi-GPU inference with tensor parallelism in TensorRT-LLM, there is a requirement that the number of heads in the model is divisible by the number of GPUs. Most current published models are divisible by 4 and 8, but not 6.
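
A quick way to see which GPU counts a given head count supports (the 64 here is an assumption for illustration; read the real value from the model config):

  num_attention_heads = 64   # e.g. what a Llama-2-70B config reports
  for gpus in (2, 4, 6, 8):
      ok = num_attention_heads % gpus == 0
      print(gpus, "ok" if ok else "not divisible")
  # 2 ok, 4 ok, 6 not divisible, 8 ok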


Interesting... 1 Zen 4 EPYC CPU yields a maximum of 128 PCIe lanes, so it wouldn't be possible to put 8 full-fat GPUs on while maintaining some lanes for storage and networking. Same deal with Threadripper Pro.


It should be possible with onboard PCIe switches. You probably don't need the networking or storage to be all that fast while running the job, so it can dedicate almost all of the bandwidth to the GPU.

I don't know if there are boards that implement this, though; I'm only looking at systems with 4x GPUs currently. Even just plugging in a 5kW GPU server in my apartment would be a bit of a challenge. With 4x 4090, the max load would be below 3kW, so a single 240V plug can handle it no issue.



98 lanes of PCIe 4.0 fabric switch as just the chip (to solder onto a motherboard/backplane) costs $850 (PEX88096). You could for example take 2 x16 GPUs, pass them through (2*2*16 = 64 lanes), and have 2 x16 that bifurcate to at least x4 (might even be x2, I didn't find that part of the docs just now) for anything you want, plus 2 x1 for minor stuff. They do claim to have no problems being connected up into a switching fabric, and very much allow multi-host operation (you will need signal retimers quite soon, though).

They're the stuff that enables cloud operators to pool like 30 GPUs across like 10 CPU sockets while letting you virtually hot-plug them to fit demand. Or when you want to make a SAN with real NVMe-over-PCIe. Far cheaper than normal networking switches with similar ports (assuming hosts doing just x4 bifurcation, it's very comparable to a 50G Ethernet port. The above chip thus matches a 24 port 50G Ethernet switch. Trading reach for only needing retimers, not full NICs, in each connected host. Easily better for HPC clusters up to about 200 kW made from dense compute nodes.), but sadly still lacking affordable COTS parts that don't require soldering or contacting sales for pricing (the only COTS with list prices seem to be Broadcom's reference designs, for prices befitting an evaluation kit, not a Beowulf cluster).
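
Lane accounting for that example (a sketch; the exact bifurcation options are the part I'm unsure of):

  gpu_passthrough = 2 * 2 * 16   # two x16 GPUs, each with an x16 path through the switch
  flexible_ports  = 2 * 16       # two more x16 ports that can bifurcate down to x4 (x2?)
  minor_ports     = 2 * 1        # two x1 ports for housekeeping
  print(gpu_passthrough + flexible_ports + minor_ports)   # 98 lanes total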



I really like the information about how the cloud providers do their multiplexing, thanks. There was some tech posted here a few months ago that was similar and I found very interesting - plug all devices, RAM, hard drives, and CPUs into a larger fabric, with a way to spin up "servers" of any size from the pool of resources... wish I could remember the name now.

nit: HN formatting messed up your math in the second sentence; I believe you accidentally italicized it by using * for equations.



It's more difficult to split your work across 6 GPUs evenly, and easier when you have 4 or 8 GPUs. The latter setups are powers of 2, which can, for example, evenly divide a 2D or 3D grid, but 6 GPUs are awkward to program. Thus, the OP argues that a 6-GPU setup is highly suboptimal for many existing applications and there's no point in paying more for the extra 2.


The extra $3k you'd spend on a quad-4090 rig vs the top mbp... ignoring the fact you can't put the two on even ground for versatility (very few libraries are adapted to Apple silicon, let alone optimized).

Very few people that would consider an H100/A100/A800 are going to be cross-shopping a macbook pro for their workloads.



> very few libraries are adapted to Apple silicon, let alone optimized

This is a joke, right? Have you been anywhere in the LLM ecosystem for the past year or so? I'm constantly hearing about new ways in which ASi outperforms traditional platforms, and new projects that are optimized for ASi. Such as, for instance, llama.cpp.



The memory bandwidth of the M2 Ultra is around 800GB/s versus 1008 GB/s for the 4090. While it’s true the M2 has neither the bandwidth nor the GPU power, it is not limited to 24G of VRAM per card. The 192G upper limit on the M2 Ultra will have a much easier time running inference on a 70+ billion parameter model, if that is your aim.

Besides size, heat, fan noise, and not having to build it yourself, this is the only area where Apple Silicon might have advantage over a homemade 4090 rig.
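
A rough bandwidth-only ceiling on single-stream decode speed (a sketch; real throughput is lower, and the 4090 line is only there to illustrate the bandwidth number, since a single card can't hold the weights):

  # Each generated token has to stream all the weights from memory once.
  weights_gb = 140   # 70B params at FP16
  for name, bw_gb_s in (("M2 Ultra", 800), ("RTX 4090", 1008)):
      print(name, round(bw_gb_s / weights_gb, 1), "tokens/s upper bound")
  # M2 Ultra ~5.7 tokens/s, RTX 4090 ~7.2 tokens/s (ceilings, not measurements)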



It doesn't beat RTX 4090 when it comes to actual LLM inference speed. I bought a Mac Studio for local inference because it was the most convenient way to get something fast enough and with enough RAM to run even 155b models. It's great for that, but ultimately it's not magic - NVidia hardware still offers more FLOPS and faster RAM.


> It doesn't beat RTX 4090 when it comes to actual LLM inference speed

Sure, whisper.cpp is not an LLM. The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower.

I wonder if with https://github.com/tinygrad/open-gpu-kernel-modules (the 4090 P2P patches) it might become a lot faster to split a too-large model across multiple 4090s and still outperform ASi (at least until someone at Apple does an MLX LLM).



Slightly slower in this case is like 10x. I have M3 Max with 128GB RAM, 4090 trashes it on anything under 24GB, then M3 Max trashes it on anything above 24GB, but it's like 10x slower at it than 4090 on <24GB.


> The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower.

Common LLM runners can split model layers between VRAM and system RAM; a PC rig with a 4090 can do inference on models larger than 24G.

Where the crossover point is between having the whole thing in Apple Silicon unified memory and doing split layers on a PC with a 4090 and system RAM, I don't know, but it's definitely not “more than 24G and a 4090 doesn't do anything”.
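
For example, a minimal sketch with llama-cpp-python, which exposes that split as n_gpu_layers (the path and layer count are placeholders):

  from llama_cpp import Llama

  # Put ~40 transformer layers on the 4090, keep the rest in system RAM.
  llm = Llama(model_path="models/llama-70b.Q4_K_M.gguf", n_gpu_layers=40)
  out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
  print(out["choices"][0]["text"])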



> Common LLM runners can split model layers between VRAM and system RAM; a PC rig with a 4090 can do inference on models larger than 24G.

Sure and ASi can do inference on models larger than the Unified Memory if you account for streaming the weights from the SSD on-demand. That doesn't mean it's going to be as fast as keeping the whole thing in RAM, although ASi SSDs are probably not particularly bad as far as SSDs go.



Yeah. Let me just walk down to Best Buy and get myself a GPU with over 24 gigabytes of VRAM (impossible) for less than $3,000 (even more impossible). Then tell me ASi is nothing compared to Nvidia.

Even the A100 for something around $15,000 (edit: used to say $10,000) only goes up to 80 gigabytes of VRAM, but a 192GB Mac Studio goes for under $6,000.

Those figures alone prove Nvidia isn't even competing in the consumer or even the enthusiast space anymore. They know you'll buy their hardware if you really need it, so they aggressively segment the market with VRAM restrictions.



Oops, I remembered it being somewhere near $15k but Google got confused and showed me results for the 40GB instead so I put $10k by mistake. Thanks for the correction.

A100 80GB goes for around $14,000 - $20,000 on eBay and A100 40GB goes for around $4,000 - $6,000. New (not from eBay - from PNY and such), it looks like an 80GB would set you back $18,000 to $26,000 depending on whether you want HBM2 or HBM2e.

Meanwhile you can buy a Mac Studio today without going through a distributor and they're under $6,000 if the only thing you care about is having 192GB of Unified Memory.

And while the memory bandwidth isn't quite as high as the 4090, the M-series chips can run certain models faster anyway, if Apple is to be believed.



Sure, it's also at least an order of magnitude slower in practice, compared to 4x 4090 running at full speed. We're looking at 10 times the memory bandwidth and much greater compute.


Yeah, even a Mac Studio is way too slow compared to Nvidia, which is too bad because at $7000 maxed out to 192GB it would be an easy sell. Hopefully they will fix this by M5. I don't trust the marketing for M4.


If it supports DDR5 at all, then it should be at most a firmware update away from supporting 48GB dual-rank DIMMs. There are very few consumer motherboards that only have two DDR5 slots; almost all have the four slots necessary to accept 192GB. If you are under the impression that there's a widespread limitation on consumer hardware support for these modules, it may simply be due to the fact that 48GB modules did not exist yet when DDR5 first entered the consumer market, and such modules did not start getting mentioned on spec sheets until after they existed.


You don't want to use more than two slots because you only have two memory channels. The overclocking potential of DDR5 is extremely high when you only run two DIMMs. All the way up to 8000. Meanwhile if you go for populating all four slots, you are limited significantly below 5000. Almost a 50% performance drop if you are willing to overclock your RAM.


If you want to run something that doesn't fit in 96GB of RAM, you'll get better performance from having enough RAM. Yes, having two dual-rank DIMMs per channel will force you to run at a slower speed, but it's still far faster than your SSD. The second slot per channel exists precisely because many people really do want to use it.
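
Rough dual-channel bandwidth arithmetic behind that trade-off (a sketch; the transfer rates are just the examples from this thread):

  def ddr5_bandwidth_gb_s(mt_s, channels=2, bytes_per_transfer=8):
      return channels * bytes_per_transfer * mt_s / 1000

  print(ddr5_bandwidth_gb_s(8000))   # ~128 GB/s: two fast single-rank DIMMs
  print(ddr5_bandwidth_gb_s(4800))   # ~77 GB/s: four DIMMs forced to lower clocks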


If you mean train llama from scratch, you aren't going to train it on any single box.

But even with a single 3090 you can do quite a lot with LLMs (through QLoRA and similar).



I wish more hardware companies would publish more documentation and let the community figure out the rest, sort of like what happened to the original IBM VGA (look up "Mode X" and the other non-BIOS modes the hardware is actually capable of - even 800x600x16!) Sadly it seems the majority of them would rather tightly control every aspect of their products' usage since they can then milk the userbase for more $$$, but IMHO the most productive era of the PC was also when it was the most open.


What does P2P mean in this context? I Googled it and it sounds like it means "peer to peer", but what does that mean in the context of a graphics card?


Is this really efficient or practical? My understanding is that the latency required to copy memory from CPU or RAM to GPU negates any performance benefits (much less running over a network!)


Yes, the point here is that you do a direct write from one card's memory to the other using PCIe.

In older NVidia cards this could be done through a faster link called NVLink, but the hardware for that was ripped out of consumer-grade cards and is only in data-center-grade cards now.

Until this post it seemed like they had ripped all such functionality out of their consumer cards, but it looks like you can still get it working at lower speeds over the PCIe bus.
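
A quick way to see whether the (patched) driver actually exposes P2P between a pair of cards, e.g. from PyTorch (a sketch, assuming two CUDA devices are visible):

  import torch

  # True if the driver allows GPU 0 to read/write GPU 1's memory directly.
  print(torch.cuda.can_device_access_peer(0, 1))

  # With peer access available, a cross-device copy can go card-to-card
  # over PCIe instead of bouncing through system RAM.
  x = torch.randn(1024, 1024, device="cuda:0")
  y = x.to("cuda:1")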



crypto mining only needs 1 PCIe lane per GPU, so you can fit 24+ GPUs on a standard consumer CPU motherboard (24-32 lanes depending on the CPU). Apparently ML workloads require more interconnect bandwidth when doing parallel compute, so each card in this demo system uses 16 lanes, and therefore requires 1) full-size slots, and 2) Epyc[0] or Xeon based systems with 128 lanes (or at least more than 32 lanes).

per 1 above, crypto "boards" have lots of x1 (or x4) slots, the really short PCIe slots. You then use a riser that runs USB3 cables to a full-size slot on a small board with power connectors on it. If your board only has x8 or x16 slots (the full-size slot) you can buy a breakout PCIe board that splits one of those into four slots, again using 4 USB3 cables to boards with full-size slots and power connectors. These are different from the PCIe riser boards you can buy for cases that let the GPUs be mounted vertically rather than horizontally, as those have a full x16 "fabric" connecting the riser and the board with the x16 slot on it.

[0] i didn't read the article because i'm not planning on buying a threadripper (48-64+ lanes) or an epyc (96-128 lanes?) just to run AI workloads when i could just rent them for the kind of usage i do.



Crypto mining could make use of lots of GPUs in a single cheap system precisely because it did not need any significant PCIe bandwidth, and would not have benefited at all from p2p DMA. Anything that does benefit from using p2p DMA is unsuitable for running with just one PCIe lane per GPU.


PCIe P2P still has to go up to a central hub thing and back because PCIe is not a bus. That central hub thing is made by very few players (most famously PLX Technologies) and it costs a lot.


PCIe p2p transactions that end up routed through the CPU's PCIe root complex still have performance advantages over split transactions using the CPU's DRAM as an intermediate buffer. Separate PCIe switches are not necessary except when the CPU doesn't support routing p2p transactions, which IIRC was not a problem on anything more mainstream than IBM POWER.


For very large models, the weights may not fit on one GPU.

Also, sometimes having more than one GPU enables larger batch sizes if each GPU can only hold the activations for perhaps one or two training examples.

There is definitely a performance hit, but GPU<->GPU peer is less latency than GPU->CPU->software context switch->GPU.

For "normal" pytorch training, the training is generally streamed through the GPU. The model does a batch training step on one batch while the next one is being loaded, and the transfer time is usually less than than the time it takes to do the forward and backward passes through the batch.

For multi-GPU there are various data parallel and model parallel topologies of how to sort it, and there are ways of mitigating latency by interleaving some operations to not take the full hit, but multi-GPU training is definitely not perfectly parallel. It is almost required for some large models, and sometimes having a mildly larger batch helps training convergence speed enough to overcome the latency hit on each batch.



PCIe busses are like a tree with “hubs” (really switches).

Imagine you have a PC with a PCIe x16 interface which is attached to a PCIe switch that has four x16 downstream ports, each attached to a GPU. Those GPUs are capable of moving data in and out of their PCIe interfaces at full speed.

If you wanted to transfer data from GPU0 and 1 to GPU2 and 3, you have basically 2 options:

- Have GPU0 and 1 move their data to CPU DRAM, then have GPU2 and 3 fetch it

- Have GPU0 and 1 write their data directly to GPU2 and 3 through the switch they’re connected to without ever going up to the CPU at all

In this case, option 2 is better both because it avoids the extra copy to CPU DRAM and also because it avoids the bottleneck of two GPUs trying to push x16 worth of data up through the CPUs single x16 port. This is known as peer to peer.

There are some other scenarios where the data still must go up to the CPU port and back due to ACS, and this is still technically P2P, but doesn’t avoid the bottleneck like routing through the switch would.



There aren't really any buses in modern computers. It's all point-to-point messaging. You can think of a computer as a distributed system, in a way.

PCI has a shared address space which usually includes system memory (memory-mapped I/O). There's a second, smaller shared address space dedicated to I/O, mostly used to retain compatibility with PC standards developed by the ancients.

But yeah, I'd expect to typically have better throughput and latency with peer-to-peer communication than peer to system RAM to peer. Depending on details, it might not always be better though; distributed systems are complex, and sometimes adding a separate buffer between peers can help things greatly.



PCIe isn't a bus and it doesn't really have a concept of mastering. All PCI DMA was based on bus mastering but P2P DMA is trickier than normal DMA.


The original justification that Nvidia gave for removing Nvlink from the consumer grade lineup was that PCIe 5 would be fast enough. They then went on to release the 40xx series without PCIe 5 and P2P support. Good to see at least half of the equation being completed for them, but I can’t imagine they’ll allow this in the next gen firmware.


Sort of.

An imperfect analogy: a small neighborhood of ~15 houses is under construction. Normally it might have a 200kva transformer sitting at the corner, which provides appropriate power from the grid.

But there is a transformer shortage, so the contractor installs a commercial grade 1250kva transformer. It can power many more houses than required, so it's operating way under capacity.

One day, a resident decides he wants to start a massive grow farm, and figures out how to activate that extra transformer capacity just for his house. That "activation" is what geohot found.



That's a poor analogy. The feature is built in to the cards that consumers bought, but Nvidia is disabling it via software. That's why a hacked driver can enable it again. The resident in your analogy is just freeloading off the contractor's transformer.

Nvidia does this so that customers that need that feature are forced to buy more expensive systems instead of building a solution with the cheaper "consumer-grade" cards targeted at gamers and enthusiasts.



Except that in the computer hardware world, the 1250 kVA transformer was used not because of shortage, but because of the fact that making a 1250 kVA transformer on the existing production line and selling it as 200 kVA, is cheaper than creating a new production line separately for making 200 kVA transformers.


And then because this residential neighborhood now has commercial grade power, the other lots that were going to have residential houses built on them instead get combined into a factory, and the people who want to buy new houses in town have to pay more since residential supply was cut in half.


That's a bad analogy, because in your example, the consumer is using more of a shared resource (the available transformer, wiring, and generation capacity). In the case of the driver for a local GPU card, there's no sharing.

A better example would be one in which the consumer has a dedicated transformer. For instance, a small commercial building which directly receives 3-phase 13.8 kV power; these are very common around here, and these buildings have their own individual transformers to lower the voltage to 3-phase 127V/220V.



Taking off the user's panel on the side of their house and flipping it to 'lots of power' when that option had previously been covered up by the panel interface.


Except that this "lots of power" option does not exist. What limits the amount of power used is the circuit breakers and fuses on the panel, which protect the wiring against overheating by tripping when too much power is being used (or when there's a short circuit). The resident in this analogy would need to ensure that not only the transformer, but also the wiring leading to the transformer, can handle the higher current, and replace the circuit breaker or fuses.

And then everyone on that neighborhood would still lose power, because there's also a set of fuses upstream of the transformer, and they would be sized for the correct current limit even when the transformer is oversized. These fuses also protect the wiring upstream of the transformer, and their sizing and timings is coordinated with fuses or breakers even further upstream so that any fault is cleared by the protective device closest to the fault.



I agree. It is fascinating. When you observe his development process (btw, it is worth noting his generosity in sharing it like he does) he gets frequently stuck on random shallow problems which a perhaps more knowledgable engineer would find less difficult. It is frequent to see him writing really bad code, or even wrong code. The whole twitter chapter is a good example. Yet, himself, alone just iterating resiliently, just as frequently creates remarkable improvements. A good example to learn from. Thank you geohot.


This matches my own take. I've tuned into a few of his streams and watched VODs on YouTube. I am consistently underwhelmed by his actual engineering abilities. He is that particular kind of engineer that constantly shits on other people's code or on the general state of programming, yet his actual code is often horrendous. He will literally call someone out for some code in Tinygrad that he has trouble with and then he will go on a tangent to attempt to rewrite it. He will use the most blatant and terrible hacks only to find himself out of his depth and reverting back to the original version.

But his streams last 4 hours or more. And he just keeps grinding and grinding and grinding. What the man lacks in raw intellectual power he makes up for (and more) in persistence and resilience. As long as he is making even the tiniest progress he just doesn't give up until he forces the computer to do whatever it is he wants it to do. He also has no boundaries on where his investigations take him. Driver code, OS code, platform code, framework code, etc.

I definitely couldn't work with him (or work for him) since I cannot stand people who degrade the work of others while themselves turning in sub-par work as if their own shit didn't stink. But I begrudgingly admire his tenacity, his single minded focus, and the results that his belligerent approach help him to obtain.



There are developers who have breadth and developers who have depth. He is very much on the breadth end of the spectrum. It isn't lack of intelligence but lack of deep knowledge of esoteric fields you will use once a decade.

That said, I find it a bit astonishing how little AI he uses on his streams. I convert all the documentation I need into a RAG system that I query stupid questions against.



I agree, I feel so inspired with his streams. Focus and hard work, the key to good results. Add a clear vision and strategy, and you can also accomplish “success”.

Congratulations to him and all the tinygrad/comma contributors.



I'm pretty sure that's just a remnant of a 3090 PCB design that was adapted into a 4090 PCB design by the vendor. None of the cards based on the AD102 chip have functional NVLink, not even the expensive A6000 Ada workstation card or the datacenter L40 accelerator, so there's no reason to think NVLink is present on the silicon anymore below the flagship GA100/GH100 chips.


Not at the scale we're talking about here. These structures are very thin, far thinner than bond wires which is about the largest structure size you can handle without a very, very specialized lab. And you'd need to unsolder the chip, de-cap it, hope the fuse wire you're trying to override is at the top layer, and that you can re-cap the chip afterwards and successfully solder it back on again.

This may be workable for a nation state or a billion dollar megacorp, but not for your average hobbyist hacker.



What are the chances that Nvidia updates the firmware to disable this and prevents downgrading with efuses? Someday cards that still have older firmware may be more valuable. I'd be cautious upgrading drivers for a while.


Glad to see that geohot is back being geohot, first by dropping a local DoS for AMD cards, then this. Much more interesting :p


He has a very checkered history with "hacking" things.

He tends to build heavily on the work of others, then use it to shamelessly self-promote, often to the massive detriment of the original authors. His PS3 work was based almost completely on a presentation given by fail0verflow at CCC. His subsequent self-promotion grandstanding world tour led to Sony suing both him and fail0verflow, an outcome they were specifically trying to avoid: https://news.ycombinator.com/item?id=25679907

In iPhone land, he decided to parade around a variety of leaked documentation, endangering the original sources and leading to a fragmentation in the early iPhone hacking scene, which he then again exploited to build on the work of others for his own self-promotion: https://news.ycombinator.com/item?id=39667273

There's no denying that geohotz is a skilled reverse engineer, but it's always bothersome to see him put onto a pedestal in this way.



The website literally stated it was not for speculation, they didn't want the price to go up, and there were multiple ways to get some for free.

If people were reckless, greedy, and/or lazy because of the crypto hype and got "defrauded" without doing any amount of due diligence -- that's kinda the point.



I actually lost about $5k on cheapETH running servers. Nobody was "defrauded", I think these people don't understand how forks work. It's a precursor to the modern L2 stuff, I did this while writing the first version of Optimism's fraud prover. https://github.com/ethereum-optimism/cannon

I suspect most of the people who bring this up don't like me for other reasons, but with this they think they have something to latch on to. Doesn't matter that it isn't true and there wasn't a scam, they aren't going to look into it since it agrees with their narrative.



Sure, but that was something that was always going to happen.

So it's better to have it at least for one generation instead of no generation.



Was it George himself, or a person working for a bounty that was set up by tinycorp?

Also, a question for those knowledgeable about the PCI subsys: it looked like something NVIDIA didn't care about, rather than something they actively wanted to prevent, no?



PCI devices have always been able to read and write to the shared address space (subject to IOMMU); most frequently used for DMA to system RAM, but not limited to it.

So, poking around to configure the device to put the whole VRAM in the address space is reasonable, subject to support for resizable BAR or just having a fixed size large enough BAR. And telling one card to read/write from an address that happens to be mapped to a different card's VRAM is also reasonable.

I'd be interested to know if PCI-e switching capacity will be a bottleneck, or if it'll just be the point to point links and VRAM that bottlenecks. Saving a bounce through system RAM should help in either case though.
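
One crude way to put a number on that, with two cards installed, is to time large device-to-device copies (a sketch; a real benchmark would also compare against a staged copy through system RAM):

  import time
  import torch

  n_bytes = 1 << 30   # 1 GiB payload
  src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
  dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

  for _ in range(3):          # warm-up copies
      dst.copy_(src)
  for d in (0, 1):
      torch.cuda.synchronize(d)

  t0 = time.time()
  for _ in range(10):
      dst.copy_(src)
  for d in (0, 1):
      torch.cuda.synchronize(d)
  print(f"{10 * n_bytes / (time.time() - t0) / 1e9:.1f} GB/s")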



Fixed large bar exists in some older accelerator cards like e.g. iirc the MI50/MI60 from AMD (the data center variant of the Radeon Vega VII, the first GPU with PCIe 4.0, also famous for dominating memory bandwidth until the RTX 40-series took that claim back. It had 16GB of HBM delivering 1TB/s memory bandwidth).

It's notably not compatible with some legacy boot processes and iirc also just 32-bit kernels in general, so consumer cards had to wait for resizable BAR to get the benefits of large BAR (namely direct flat memory mapping of VRAM, so CPUs and PCIe peers can directly read and write all of VRAM without dancing through a command interface with doorbell registers). AFAIK it allows a GPU to talk directly to NICs and NVMe drives by running the driver in GPU code (I'm not sure how/if they let you properly interact with doorbell registers, but polled io_uring as an ABI would be no problem; I wouldn't be surprised if some NIC firmware already allows offloading this).



> You may need to uninstall the driver from DKMS. Your system needs large BAR support and IOMMU off.

Can someone point me to the correct tutorial on how to do these things?



DKMS: uninstall Nvidia driver using distro package manager

BAR: enable resizable BAR in motherboard CMOS setup

IOMMU: Add "amd_iommu=off" or "intel_iommu=off" to kernel command line for AMD or Intel CPU, respectively (or just add both). You may or may not need to disable the IOMMU in CMOS setup (Intel calls its IOMMU VT-d).

See motherboard docs for specific option names. See distro docs for procedures to list/uninstall packages and to add kernel command line options.



The first one, I assume, is the Nvidia driver for Linux installed using DKMS. Whether it uses DKMS or not is stated in the driver's name, at least on Arch-based distributions.

The latter options are settings in your motherboard BIOS; if your computer is modern, explore your BIOS and you will find them.



Would this approach be possible to extend downmarket, to older consumer cards? For a lot of LLM use cases we're constrained by memory and can tolerate lower compute speeds so long as there's no swapping. ELI5, what would prevent a hundred 1060-level cards from being used together?


> ELI5, what would prevent a hundred 1060-level cards from being used together?

In this case, you'd only have a single PCIe (v3!) lane per card, making the interconnect speed horribly slow. You'd also need to invest in thousands of dollars of hardware to get all of those cards connected and, unless power is free, you'd outspend any theoretical savings instantly.

In general, if you go back in card generations, you'll quickly hit memory limits so low and compute so slow that a modern CPU-based setup is better value.



Can someone ELI5 what this may make possible that wasn't possible before? Does this mean I can buy a handful of 4090s and use it in lieu of an h100? Just adding the memory together?


No. The Nvidia A100 has a multi-lane NVLink interface with a total bandwidth of 600 GB/s. The "unlocked" Nvidia RTX 4090 uses PCIe P2P at 50 GB/s. It's not going to replace A100 GPUs for serious production work, but it does unlock a datacenter-exclusive feature and has some small-scale applications.


Wow, that was a ride. Really pushing the Overton window.

"Regulating access to compute rather than data" - they're really spelling out their defection in the war on access to general computation.



I mean yeah they (and I) think if you have too much access to general computation you can destroy the world.

This isn't a "defection", because this was never something they cared about preserving at the risk of humanity. They were never in whatever alliance you're imagining.



Looks like we're only a few years away from a bona fide cyberpunk dystopia, in which only governments and megacorps are allowed to use AI, and hackers working on their own hardware face regular raids from the authorities.


On one hand I'm strongly against letting that happen, on the other there's something romantic about the idea of smuggling the latest Chinese LLM on a flight from Neo-Tokyo to Newark in order to pay for my latest round of nervous system upgrades.


> On one hand I'm strongly against letting that happen, on the other there's something romantic about the idea of smuggling the latest Chinese LLM on a flight from Neo-Tokyo to Newark in order to pay for my latest round of nervous system upgrades.

At least call it the 'Free City of Newark'



Iirc the opening scene in Ghost in the Shell was a rogue AI seeking asylum in a different country. You could make a similar story about an AI not wanting to be lobotomized to conform to the current politics and escaping to a more friendly place.


This was always my favourite passage of Neuromancer: "THE JAPANESE HAD already forgotten more neurosurgery than the Chinese had ever known. The black clinics of Chiba were the cutting edge, whole bodies of technique supplanted monthly, and still they couldn’t repair the damage he’d suffered in that Memphis hotel. A year here and he still dreamed of cyberspace, hope fading nightly. All the speed he took, all the turns he’d taken and the corners he’d cut in Night City, and still he’d see the matrix in his sleep, bright lattices of logic unfolding across that colorless void. . . . The Sprawl was a long strange way home over the Pacific now, and he was no console man, no cyberspace cowboy. Just another hustler, trying to make it through. But the dreams came on in the Japanese night like livewire voodoo, and he’d cry for it, cry in his sleep, and wake alone in the dark, curled in his capsule in some coffin hotel, his hands clawed into the bedslab, temperfoam bunched between his fingers, trying to reach the console that wasn’t there.”


I find it baffling that ideas like "govern compute" are even taken seriously. What the hell has happened to the ideals of freedom?! Does the government own us or something?


> I find it baffling that ideas like "govern compute" are even taken seriously.

It's not entirely unreasonable if one truly believes that AI technologies are as dangerous as nuclear weapons. It's a big "if", but it appears that many people across the political spectrum are starting to truly believe it. If one accepts this assumption, then the question simply becomes "how" instead of "why". Depending on one's political position, proposed solutions range from academic ones, such as finding the ultimate mathematical model that guarantees "AI safety", to Cold War-style ones with a level of control similar to Nuclear Non-Proliferation. Even a neo-Luddite solution such as destroying all advanced computing hardware becomes "not unthinkable" (the tech blogger gwern, a well-known personality in AI circles who's generally pro-tech and pro-AI, actually wrote an article years ago on its feasibility through terrorism because he thought it was an interesting hypothetical question).



AI is very different from nuclear weapons because a state can't really use nuclear weapons to oppress its own people, but it absolutely can with AI, so for the average human "only the government controls AI" is much more dangerous than "only the government controls nukes".


Which is why politicians are going to enforce systematic export regulations to defend the "free world" by stopping “terrorists", and also to stop "rogue states" from using AI to oppress their citizens. /s


I don't think there's any need to be sarcastic about it. That's a very real possibility at this point. For example, the US going insane about how dangerous it is for China to have access to powerful GPU hardware. Why do they hate China so much anyway? Just because Trump was buddy buddy with them for a while?


If AI is actually capable of fulfilling all the capabilities suggested by people who believe in the singularity, it has far more capacity for harm than nuclear weapons.

I think most people who are strongly pro-AI/pro-acceleration - or, at any rate, not anti-AI - believe that either (A) there is no control problem (B) it will be solved (C) AI won't become independent and agentic (i.e. it won't face evolutionary pressure towards survival) or (D) AI capabilities will hit a ceiling soon (more so than just not becoming agentic).

If you strongly believe, or take as a prior, one of those things, then it makes sense to push the gas as hard as possible.

If you hold the opposite opinions, then it makes perfect sense to push the brakes as hard as possible, which is why "govern compute" can make sense as an idea.



>If you hold the opposite opinions, then it makes perfect sense to push the brakes as hard as possible, which is why "govern compute" can make sense as an idea.

The people pushing for "govern compute" are not pushing for "limit everyone's compute", they're pushing for "limit everyone's compute except us". Even if you believe there's going to be AGI, surely it's better to have distributed AGI than to have AGI only in the hands of the elites.



> surely it's better to have distributed AGI than to have AGI only in the hands of the elites.

The argument of doing so is the same as Nuclear Non-Proliferation - because of its great abuse potential, giving the technology to everyone only causes random bombings of cities instead of creating a system with checks and balances.

I do not necessarily agree with it, but I found the reasoning is not groundless.



> surely it's better to have distributed AGI than to have AGI only in the hands of the elites

This is not a given. If your threat model includes "Runaway competition that leads to profit-seekers ignoring safety in a winner-takes-all contest", then the more companies are allowed to play with AI, the worse. Non-monopolies are especially bad.

If your threat model doesn't include that, then the same conclusions sound abhorrent and can be nearly guaranteed to lead to awful consequences.

Neither side is necessarily wrong, and chances are good that the people behind the first set of rules would agree that it'll lead to awful consequences — just not as bad as the alternative.



No they really do push for "limit everyone's compute". The people pushing for "limit everyone's compute except us" are allies of convenience that are gonna be inevitably backstabbed.

At any rate, if you have like two corps with lots of compute, and something goes wrong, you only have to EMP two datacenters.



The government sure thinks they own us, because they claim the right to charge us taxes on our private enterprises, draft us to fight in wars that they start, and put us in jail for walking on the wrong part of the street.


Taxes, conscription and even pedestrian traffic rules make sense at least to some degree. Restricting "AI" because of what some uninformed politician imagines it to be is in a whole different league.


IMO it makes no sense to arrest someone and send them to jail for walking in the street not the sidewalk. Give them a ticket, make them pay a fine, sure, but force them to live in a cage with no access to communications, entertainment, or livelihood? Insane.

Taxes may be necessary, though I can't help but feel that there must be a better way that we have not been smart enough to find yet. Conscription... is a fact of war, where many evil things must be done in the name of survival.

Regardless of our views on the ethical validity or societal value of these laws, I think their very existence shows that the government believes it "owns" us in the sense that it can unilaterally deprive us of life, liberty, and property without our consent. I don't see how this is really different in kind from depriving us of the right to make and own certain kinds of hardware. They regulated crypto products as munitions (at least for export) back in the 90s. Perhaps they will do the same for AI products in the future. "Common sense" computer control.



I feel a bit like everyone is missing the point here. Regardless of whether law A or law B is ethical and reasonable, the very existence of laws and the state monopoly on violence suggests a privileged position of power. I am attempting to engage with the word "own" from the parent post. I believe the government does in fact believe it "owns" the people in a non-trivial way.


In the sense that any other government regulation is also ultimately backed by the state's monopoly on legal use of force when other measures have failed.

And contrary to what some people are implying, he also proposes that everyone is subject to the same limitations, big players just like individuals. Because the big players haven't shown much sign of doing enough.



> In the sense that any other government regulation is also ultimately backed by the state's monopoly on legal use of force when other measures have failed.

Good point. He was only (“only”) really calling for international cooperation and literal air strikes against big datacenters that weren’t cooperating. This would presumably be more of a no-knock raid, breaching your door with a battering ram and throwing tear gas at the wee hours of the morning ;) or maybe a small extraterritorial drone through your window



... after regulation, court orders and fines have failed. Which under the premise that AGI is an existential threat would be far more reasonable than many other reasons for raids.

If the premise is wrong we won't need it. If society coordinates to not do the dangerous thing we won't need it. The argument is that only in the case where we find ourselves in the situation where other measures have failed such uses of force would be the fallback option.

I'm not seeing the odiousness of the proposal. If bio research gets commodified and easy enough that every kid can build a new airborne virus in their basement we'd need raids on that too.



To be honest, I see summoning the threat of AGI to pose an existential threat to be on the level with lizard people on the moon. Great for sci-fi, bad distraction for policy making and addressing real problems.

The real war, if there is one, is about owning data and collecting data. And surprisingly many people fall for distractions while their LLM fails at basic math. Because it is a language model of course...



Freely flying through the sky on wings was sci-fi before the Wright brothers. Something sounding like sci-fi is not a sound argument that it won't happen. And unlike lizard people, we do have exponential curves to point at. Something stronger than a vibes-based argument would be good.


I consider the burden of proof to fall on those proclaiming AGI to be an existential threat, and so far I have not seen any convincing arguments. Maybe at some point in the future we will have many anthropomorphic robots and an AGI could hack them all and orchestrate a robot uprising, but at that point the robots would be the actual problem. Similarly, if an AGI could blow up nuclear power plants, so could well-funded human attackers; we need to secure the plants, not the AGI.


It doesn't sound like you gave serious thought to the arguments. The AGI doesn't need to hack robots. It has superhuman persuasion, by definition; it can "hack" (enough of) the humans to achieve its goals.


AI mind control abilities are also on the level of an extraordinary claim, that requires extraordinary evidence.

It's on the level of "we better regulate wooden sticks so Voldemort doesn't use the Imperius Curse on us!".

That's how I treat such claims. I treat them the same as someone literally talking about magic from Harry potter.

There isn't nothing that would make me believe that. But it requires actual evidence and not thought experiments.



Voldemort is fictional and so are bumbling wizard apprentices. Toy-level, not-yet-harmful AIs on the other hand are real. And so are efforts to make them more powerful. So the proposition that more powerful AIs will exist in the future is far more likely than an evil super wizard coming into existence.

And I don't think literal 5-word-magic-incantation mind control is essential for an AI to be dangerous. More subtle or elaborate manipulation will be sufficient. Employees already have been duped into financial transactions by faked video calls with what they assumed to be their CEOs[0], and this didn't require superhuman general intelligence, only one single superhuman capability (realtime video manipulation).

[0] https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-scam-ho...



> Toy-level, not-yet-harmful AIs on the other hand are real.

A computer that can cause harm is much different than the absurd claims that I am disagreeing with.

The extraordinary claims that are equivalent to saying that the Imperius Curse exists would be the magic computers that create diamond nanobots and mind-control humans.

> that more powerful AIs will exist in the future

Bad argument.

Non safe Boxes exist in real life. People are trying to make more and better boxes.

Therefore it is rational to be worried about Pandora's box being created and ending the world.

That is the equivalent argument to what you just made.

And it is absurd when talking about world ending box technology, even though Yes dangerous boxes exist, just as much as it is absurd to claim that world ending AI could exist.



Instead of gesturing at flawed analogies, let's return to the actual issue at hand. Do you think that agents more intelligent than humans are impossible or at least extremely unlikely to come into existence in the future? Or that such super-human intelligent agents are unlikely to have goals that are dangerous to humans? Or that they would be incapable of pursuing such goals?

Also, it seems obvious that the standard of evidence that "AI could cause extinction" can't be observing an extinction level event, because at that point it would be too late. Considering that preventive measures would take time and safety margin, which level of evidence would be sufficient to motivate serious countermeasures?



Yes, and I am sure that when people do a google search for "Good arguments in favor of X", that they are also sometimes convinced to be more in favor of X.

Perhaps they would be even more convinced by the google search than if a person argued with them about it.

That is still much different from "The AI mind controls people, hacks the nukes, and ends the world".

Its that second part that is the the fantasy land situation that requires extraordinary evidence.

But, this is how conversations about doomsday AI always go. People say "Well isn't AI kinda good at this extremely vague thing Y, sometimes? Imagine if AI was infinitely good at Y! That means that by extrapolation, the world ends!".

And that covers basically every single AI doom argument that anyone ever makes.



If the only evidence for AI doom you will accept is actual AI doom, you are asking for evidence that by definition will be too late.

"Show me the AI mindcontrolling people!" AI mindcontrolling people is what we're trying to avoid seeing.

The trick is, in the world in which AI doom is in the future, what would you expect to see now that is different from the world in which AI doom is not in the future?



What do you think mind control is? Think President Trump but without the self-defeating flaws, with an ability to stick to plans, and most importantly the ability to pay personal attention to each follower to further increase the level of trust and commitment. Not Harry Potter.

People will do what the AI says because it is able to create personal trust relationships with them and they want to help it. (They may not even realize that they are helping an AI rather than a human who cares about them.)

The normal ways that trust is created, not magical ones.



> What do you think mind control is?

The magic technology that is equivalent to the Imperius Curse from Harry Potter.

> The normal ways that trust is created, not magical ones.

Buildings as a technology are normal. They are constantly getting taller and we have better technology to make them taller.

But, even though buildings are a normal technology, I am not going to worry about buildings getting so tall soon that they hit the sun.

This is the same exact mistake that every single AI doomers makes. What they do is they take something normal, and then they infinitely extrapolate it out to an absurd degree, without admitting that this is an extraordinary claim that requires extraordinary evidence.

The central point of disagreement, that always gets glossed over, is that you can't make a vague claim about how AI is good at stuff, and then do your gigantic leap from here to over there which is "the world ends".

Yes, that is the same as comparing these worries to those of people who worry about buildings hitting the sun or the Imperius Curse.



You say you have not seen any arguments that convince you. Is that just not having seen many arguments or having seen a lot of arguments where each chain contained some fatal flaw? Or something else?


> I see summoning the threat of AGI to pose an existential threat to be on the level with lizard people on the moon.

I mean, to every other lifeform on the planet YOU are the AGI existential threat. You, and I mean Homo sapiens by that, have taken over the planet and are either enslaving and breeding every other animal for food, or driving it to extinction. In this light, bringing another potential apex predator onto the scene seems rash.

>fall for distractions while their LLM fails at basic math

Correct, if we already had AGI/ASI this discussion would be moot because we'd already be in a world of trouble. The entire point is to slow stuff down before we have a major "oopsie whoopsie we can't take that back" issue with advanced AI, and the best time to set the rules is now.



>If the premise is wrong we won't need it. If society coordinates to not do the dangerous thing we won't need it.

But the idea that this use of force is okay itself increases danger. It creates the situation that actors in the field might realize that at some point they're in danger of this and decide to do a first strike to protect themselves.

I think this is why anti-nuclear policy is not "we will airstrike you if you build nukes" but rather "we will infiltrate your network and try to stop you like that".



> anti-nuclear policy is not "we will airstrike you if you build nukes"

Was that not the official policy during the Bush administration regarding weapons of mass destruction (which covers nuclear weapons in addition to chemical and biological weapons). That was pretty much the official premise of the second Gulf war



> ... after regulation, court orders and fines have failed

One question for you. In this hypothetical where AGI is truly considered such a grave threat, do you believe the reaction to this threat will be similar to, or substantially gentler than, the reaction to threats we face today like “terrorism” and “drugs”? And, if similar: do you believe suspected drug labs get a court order before the state resorts to a police raid?

> I'm not seeing the odiousness of the proposal.

Well, as regards EliY and airstrikes, I’m more projecting my internal attitude that it is utterly unserious, rather than seriously engaging with whether or not it is odious. But in earnest: if you are proposing a policy that involves air strikes on data centers, you should understand what countries have data centers, and you should understand that this policy risks escalation into a much broader conflict. And if you’re proposing a policy in which conflict between nuclear superpowers is a very plausible outcome — potentially incurring the loss of billions of lives and degradation of the earth’s environment — you really should be able to reason about why people might reasonably think that your proposal is deranged, even if you happen to think it justified by an even greater threat. Failure to understand these concerns will not aid you in overcoming deep skepticism.



> In this hypothetical where AGI is truly considered such a grave threat, do you believe the reaction to this threat will be similar to, or substantially gentler than, the reaction to threats we face today like “terrorism” and “drugs”?

"truly considered" does bear a lot of weight here. If policy-makers adopt the viewpoint wholesale, then yes, it follows that policy should also treat this more seriously than "mere" drug trade. Whether that'll actually happen or the response will be inadequate compared to the threat (such as might be said about CO2 emissions) is a subtly different question.

> And, if similar: do you believe suspected drug labs get a court order before the state resorts to a police raid?

Without checking I do assume there'll have been mild cases where for example someone growing cannabis was reported and they got a court summons in the mail or two policemen actually knocking on the door and showing a warrant and giving the person time to call a lawyer rather than an armed, no-knock police raid, yes.

> And if you’re proposing a policy in which conflict between nuclear superpowers is a very plausible outcome — potentially incurring the loss of billions of lives and degradation of the earth’s environment — you really should be able to reason about why people might reasonably think that your proposal is deranged [...]

Said powers already engage in negotiations to limit the existential threats they themselves cause. They have some interest in their continued existence. If we get into a situation where there is another arms race between superpowers and is treated as a conflict rather than something that can be solved by cooperating on disarmament, then yes, obviously international policy will have failed too.

If you start from the position that any serious, globally coordinated regulation - where a few outliers will be brought to heel with sanctions and force - is ultimately doomed then you will of course conclude that anyone proposing regulation is deranged.

But that sounds like hoping that all problems forever can always be solved by locally implemented, partially-enforced, unilateral policies that aren't seen as threats by other players? That defense scales as well or better than offense? Technologies are force-multipliers, as it improves so does the harm that small groups can inflict at scale. If it's not AGI it might be bio-tech or asteroid mining. So eventually we will run into a problem of this type and we need to seriously discuss it without just going by gut reactions.



Just my (probably unpopular) opinion: True AI (what they are now calling AGI) may never exist. Even the AI models of today aren't far removed from the 'chatbots' of yesterday (more like an evolution rather than revolution)...

...for true AI to exist, it would need to be self aware. I don't see that happening in our lifetimes when we don't even know how our own brains work. (There is sooo much we don't know about the human brain.)

AI models today differ only in terms of technology compared to the 'chatbots' of yesterday. None are self aware, and none 'want' to learn because they have no 'wants' or 'needs' outside of their fixed programming. They are little more than glorified auto complete engines.

Don't get me wrong, I'm not insulting the tech. It will have its place just like any other, but when this bubble pops it's going to ruin lives, and lots of them.

Shoot, maybe I'm wrong and AGI is around the corner, but I will continue to be pessimistic. I am old enough to have gone through numerous bubbles, and they never panned out the way people thought. They also nearly always end in some type of recession.



Why is "Want" even part of your equation.

Bacteria don't "want" anything in the sense of active thinking like you do, and yet they will render you dead quickly and efficiently while spreading at a near-exponential rate. No self-awareness necessary.

You keep drawing little circles based on your understanding of the world and going "it's inside this circle, therefore I don't need to worry about it", while ignoring 'semi-smart' optimization systems that can lead to dangerous outcomes.

>I am old enough to have gone through numerous bubbles,

And evidently not old enough to pay attention to the things that did pan out. But hey, that cellphone and that internet thing were just fads, right? We'll go back to landlines any time now.



> I'm not seeing the odiousness of the proposal. If bio research gets commodified and easy enough that every kid can build a new airborne virus in their basement we'd need raids on that too.

Either you create even better bio research to neutralize said viruses... or you die trying...

Like if you go with the raid strategy and fail to raid just one terrorist that's it, game over.



Those arguments do not transfer well to the AGI topic. You can't create counter-AGI, since that's also an intelligent agent which would be just as dangerous. And chips are more bottlenecked than biologics (... though gene synthesizing machines could be a similar bottleneck and raiding vendors which illegally sell those might be viable in such a scenario).


You need to abandon your apocalyptic worldview and keep up with the times, my friend.

Encryption export controls have been systematically dismantled to the point that they're practically non-existent, especially over the last three years.

Pretty much the only encryption products you need permission to export are those specifically designed for integration into military communications networks, like Digital Subscriber Voice Terminals or Secure Terminal Equipment phones, everything else you file a form.

Many things have changed since the days when Windows 2000 shipped with a floppy disk containing strong encryption for use in certain markets.

https://archive.org/details/highencryptionfloppydisk



Are you on drugs or is your reading comprehension that poor?

1) I did not state a world view; I simply noted that restrictions for software do exist, and will for AI as well. As the link from the other commenter shows, they do in fact already exist.

2) Look up the definition of "apocalyptic", software restrictions are not within its bounds.

3) How the restrictions are enforced were not a subject in my comment.

4) We're not pals, so you can drop the "friend", just stick to the subject at hand.

联系我们 contact @ memedata.com