Big GPUs don't need big PCs

Original link: https://www.jeffgeerling.com/blog/2025/big-gpus-dont-need-big-pcs

## Raspberry Pi and External GPUs: Surprisingly Practical?

This experiment explores whether a Raspberry Pi 5 with an external GPU (eGPU), or even several of them, can handle tasks normally left to a desktop PC. Despite the Pi's limited PCIe bandwidth (a single Gen 3 lane versus a desktop's 16 lanes of Gen 5), the results are surprisingly competitive. The tests cover Jellyfin media transcoding, GPU rendering (GravityMark), and LLM/AI performance (inference and prefill), using AMD and Nvidia cards and even a configuration with *four* Nvidia RTX A5000s. The Pi often comes close to desktop performance and sometimes wins on efficiency, giving up only 2-5% of peak speed. Key findings: transcoding is viable for typical use, raw rendering speed approaches the desktop, and AI performance, especially with multiple GPUs sharing memory over a PCIe switch, can land within 2% of a dedicated server. The Pi setup costs $350-400 versus $1500-2000 for the desktop, and it idles at far lower power (4-5W versus 30W). Ultimately, the desktop still wins on raw performance, but the Pi is a compelling, efficient, and affordable option for many GPU-heavy workloads, well beyond what it was designed for.

## Big GPUs, Small PCs: The Hacker News Discussion

A recent Hacker News thread centered on the idea that, when running large language models (LLMs) locally, the GPU is the component that matters most, more than the need for a powerful PC around it. Users are building or repurposing extremely minimal systems, such as $300 mini PCs or even refurbished old laptops, solely to host and connect a high-end GPU. The core idea is that these systems only need to move data to and from the GPU efficiently, and can easily handle basics like browsing and video playback. Some commenters shared success running workloads on low-power virtual machines with external GPUs, reinforcing the view that modern hardware marketing often overstates the specs needed for everyday productivity. The trend points toward prioritizing GPU performance while shrinking the supporting computer's size and power draw as much as possible.

Original article

Raspberry Pi eGPU vs PC GPU

Ever since I got AMD, Intel, and Nvidia graphics cards to run on a Raspberry Pi, I had a nagging question:

What's the point?

The Raspberry Pi only has 1 lane of PCIe Gen 3 bandwidth available for a connection to an eGPU. That's not much. Especially considering a modern desktop has at least one slot with 16 lanes of PCIe Gen 5 bandwidth. That's 8 GT/s versus 512 GT/s. Not a fair fight.
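If you translate those link rates into usable throughput, the gap is still enormous. Here's a quick back-of-envelope sketch in Python, counting only the 128b/130b line-encoding overhead (real-world numbers land a bit lower once protocol overhead is included):

```python
# Rough usable bandwidth for a PCIe link, ignoring everything except the
# 128b/130b line encoding shared by Gen 3 through Gen 5.
def pcie_gb_per_s(gt_per_s: float, lanes: int) -> float:
    encoding = 128 / 130                      # 128b/130b line code
    return gt_per_s * encoding / 8 * lanes    # 8 bits per byte

pi_egpu = pcie_gb_per_s(gt_per_s=8, lanes=1)     # Pi 5: Gen 3 x1
desktop = pcie_gb_per_s(gt_per_s=32, lanes=16)   # desktop: Gen 5 x16

print(f"Pi 5 (Gen 3 x1):     ~{pi_egpu:.2f} GB/s")   # ~0.98 GB/s
print(f"Desktop (Gen 5 x16): ~{desktop:.0f} GB/s")   # ~63 GB/s
print(f"Ratio:               ~{desktop / pi_egpu:.0f}x")
```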

But I wondered if bandwidth isn't everything, all the time.

I wanted to put the question of utility to rest, by testing four things on a variety of GPUs, comparing performance on a Raspberry Pi 5 to a modern desktop PC:

  • Jellyfin and media transcoding
  • Graphics performance for purely GPU-bound rendering (via GravityMark)
  • LLM/AI performance (both prefill and inference)
  • Multi-GPU applications (specifically, LLMs, since they're the easiest to run)

Yes, that's right, we're going beyond just one graphics card today. Thanks to Dolphin ICS, who I met at Supercomputing 25, I have a PCIe Gen 4 external switch and 3-slot backplane, so I can easily run two cards at the same time:

Two GPUs in Dolphin PCIe Interconnect board - Nvidia RTX A400 and A4000

The tl;dr: The Pi can hold its own in many cases—it even wins on efficiency (often by a large margin) if you're okay with sacrificing just 2-5% of peak performance!

Four GPUs, One Pi

The craziest part: while I was finishing up this testing, GitHub user mpsparrow plugged four Nvidia RTX A5000 GPUs into a single Raspberry Pi. Running Llama 3 70B, his setup performed within 2% of his reference Intel server:

Raspberry Pi 5 with 4x Nvidia RTX A5000 GPUs - LLM benchmark

On the Pi, it was generating responses at 11.83 tokens per second. On a modern server, using the exact same GPU setup, he got 12. That's less than a 2 percent difference.

How is this possible? Because—at least when using multiple Nvidia GPUs that are able to share memory access over the PCIe bus—the Pi doesn't have to be in the way. An external PCIe switch may allow cards to share memory over the bus at Gen 4 or Gen 5 speeds, instead of requiring trips 'north-south' through the Pi's PCIe Gen 3 lane.
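If you want to check whether your own cards can do that kind of peer-to-peer access, here's a minimal sketch using PyTorch's CUDA helpers (it assumes a CUDA-enabled PyTorch install; `nvidia-smi topo -m` gives a similar view from the command line):

```python
# Check whether each pair of Nvidia GPUs can access the other's memory
# directly (peer-to-peer) instead of bouncing traffic through the host.
import torch

count = torch.cuda.device_count()
for src in range(count):
    for dst in range(count):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            status = "available" if ok else "NOT available"
            print(f"GPU {src} -> GPU {dst}: peer access {status}")
```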

But even without multiple GPUs and PCIe tricks, a Pi can still match—and in a few cases, beat—the performance of a modern PC.

Cost and efficiency

Besides efficiency, cost might factor in (neither total below includes a graphics card):

| Raspberry Pi eGPU setup | Intel PC |
| --- | --- |
| 16GB Pi CM5 + IO Board | Intel Core Ultra 265K |
| Minisforum eGPU Dock | ASUS ProArt Motherboard |
| M.2 to Oculink adapter | Noctua Redux cooler |
| USB SSD | 850W PSU |
| 850W PSU | Benchtable/case |
| | M.2 NVMe SSD |
| | 64GB of DDR5 RAM |
| **TOTAL: $350-400** | **TOTAL: $1500-2000** |

If peak efficiency and performance aren't your thing, consider that the Pi alone burns 4-5W at idle. The PC? 30W. That's without running a graphics card, and in both cases the only thing plugged in during my measurements was a cheap Raspberry Pi USB keyboard and mouse.
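Those idle numbers add up over a year. A quick sketch of the gap, assuming the box sits idle around the clock and using a placeholder electricity rate of $0.15/kWh:

```python
# Annual cost of idle power draw; the rate is a placeholder, so plug in
# your own local electricity price.
HOURS_PER_YEAR = 24 * 365
RATE_USD_PER_KWH = 0.15

def annual_idle_cost(watts: float) -> float:
    return watts * HOURS_PER_YEAR / 1000 * RATE_USD_PER_KWH

pi = annual_idle_cost(5)    # ~$6.57/year at 5W
pc = annual_idle_cost(30)   # ~$39.42/year at 30W
print(f"Pi: ${pi:.2f}/yr, PC: ${pc:.2f}/yr, difference: ${pc - pi:.2f}/yr")
```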

So let's get to the hardware.

Video

For more background, and to see more details on the actual hardware setups used, check out the video that accompanies this blog post (otherwise, keep on scrolling!):

Single GPU Comparisons - Pi vs Intel Core Ultra

Before we get to the benchmarks I did run, I'll mention the one I didn't: gaming. On previous iterations of my Pi + GPU setups, I was able to get Steam and Proton running Windows games on Arm through box64.

This time around, with Pi OS 13 (Debian Trixie), I had trouble installing Steam with both FEX and box64, so I've put that on hold for now. I'll be testing more gaming on Arm in a video next year around Valve's development of the Steam Frame.

The focus today is raw GPU power.

And I have three tests to stress each system: Jellyfin, GravityMark, and LLMs.

Benchmark results - Jellyfin

Let's start with the most practical thing: using a Pi as a media transcoding server.

Since Nvidia's encoder's more polished, I tested it first. Even an older budget card should be adequate for a stream or two, but I had a 4070 Ti available, so I tested it.

Using encoder-benchmark, the PC kinda slaughters the Pi:

Encoder benchmark - Pi vs Intel PC

Observing nvtop, I found the culprit: the Raspberry Pi's anemic IO.

Encoder Benchmark uses raw video streams, which are fairly large (e.g. over 10 GB for a 4K file). This means you have to load all that data in from disk, then copy it over the PCIe bus to the GPU, then the GPU spits back a compressed video stream, which is written back to the disk.

On the Pi, the single PCIe lane tops out at around 850 MB/sec, and because I was using a USB 3.0 SSD as my boot disk, sustained throughput really only reached 300 MB/sec or so.

The PC could pump through 2 GB/sec from the PCIe Gen 4 x4 SSD I had installed.
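A crude way to model it: the transcode pipeline only moves data as fast as its slowest stage. Here's a toy sketch using the rough sustained figures above (the x16 number is a ballpark, and in reality the read, transfer, and write stages overlap):

```python
# Toy bottleneck model: end-to-end throughput is capped by the slowest stage.
def bottleneck(**stages_mb_s: float) -> tuple[str, float]:
    """Return the name and throughput (MB/s) of the slowest stage."""
    slowest = min(stages_mb_s, key=stages_mb_s.get)
    return slowest, stages_mb_s[slowest]

pi = bottleneck(usb3_boot_ssd=300, pcie_gen3_x1=850)
pc = bottleneck(nvme_gen4_x4_ssd=2000, pcie_x16_to_gpu=30000)

print("Pi limit:", pi)   # ('usb3_boot_ssd', 300) -- disk-bound
print("PC limit:", pc)   # ('nvme_gen4_x4_ssd', 2000)
```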

In terms of raw throughput, the PC is hands-down the winner.

But the way Jellyfin works, at least for my own media library, where I store lightly-compressed H.264 and H.265 files, transcoding doesn't need quite that much bandwidth.

I installed Jellyfin and set it to use NVENC hardware encoding. That worked out of the box.
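Under the hood, Jellyfin hands the actual transcode off to its bundled ffmpeg build. If you want to sanity-check NVENC outside of Jellyfin, a minimal sketch like this works (paths and bitrate are placeholders, and Jellyfin's real invocation has many more flags):

```python
# Minimal NVENC smoke test: decode on the GPU where possible, re-encode
# with h264_nvenc, and pass the audio through untouched.
import subprocess

subprocess.run([
    "ffmpeg",
    "-hwaccel", "cuda",       # GPU-assisted decode
    "-i", "input.mkv",        # placeholder source file
    "-c:v", "h264_nvenc",     # NVENC H.264 encoder
    "-b:v", "8M",             # placeholder target bitrate
    "-c:a", "copy",           # don't touch the audio
    "output.mp4",
], check=True)
```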

I could skip around in my 1080p encode of Sneakers without any lag during transcoding. I could switch bitrates on Galaxy Quest to emulate playback over my home VPN, and Jellyfin didn't miss a beat. A 4K H.265 file of Apollo 11 was also perfectly smooth at every bitrate on this setup.

Jellyfin transcoding two videos on the fly with nvtop showing Pi 5 in foreground

Even with two transcodes going at the same time, like here with Dune in 4K and Sneakers in 1080p, it runs just as smoothly. It does seem to max out the decode engine at that point, but that didn't cause any stuttering.

The Intel PC wins in raw throughput, which is great if you're building a full-blown video transcoding server, but the Pi is fine for most other transcoding use-cases (like OBS, Plex, or Jellyfin).

Historically, AMD isn't quite as good at transcoding, but their cards are adequate. Transcoding worked on the AI Pro I tested, but I had some stability issues.

Benchmark results - GravityMark

I wanted to see how raw 3D rendering performed, so I ran the GravityMark benchmark—for now, only on AMD cards, because I haven't gotten a display to run on Nvidia's driver on the Pi yet.

GravityMark Pi vs PC - AMD Ryzen AI Pro R9700

No surprise, the Intel PC was faster... but only by a little.

The rendering is all done on the GPU side, and it doesn't really rely on the Pi's CPU or PCIe lane, so it can go pretty fast.

What did surprise me was what happened when I ran it again on an older AMD card, my RX 460.

GravityMark RX460 - Pi vs PC

This GPU is ancient in computer years, but I think that gives the Pi a leg up. The RX 460 runs at PCIe Gen 3, which is exactly as fast as the Pi will go, and the Pi actually edged out the PC. But the thing that gave me a bigger shock was the score per watt.

GravityMark performance per watt RX 460 - Pi vs PC

This is measuring the overall efficiency of the system, and while Intel's not amazing for efficiency right now, it's not like the Pi's the best Arm has to offer.

I got benchmark results for an Nvidia 3060, 3080 Ti, and A4000, but only on the PC side. I still don't have a desktop environment or display output working on the Pi yet.

Benchmark results - AI

The AMD Radeon AI Pro R9700 has 32 gigs of VRAM, and should be perfect for a large array of LLMs, up to Qwen3's 30 billion parameter model that takes up 20 GB of VRAM.
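As a rough rule of thumb, a quantized model's footprint is its parameter count times bits per weight, plus some headroom for the KV cache and compute buffers. A quick sketch (the 4.5 bits/weight and 15% overhead figures are ballpark assumptions, not exact numbers for any particular GGUF file):

```python
# Back-of-envelope VRAM estimate for a quantized LLM.
def vram_estimate_gb(params_billions: float,
                     bits_per_weight: float = 4.5,   # typical ~Q4 average
                     overhead: float = 0.15) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead)

print(f"Qwen3 30B @ ~Q4: ~{vram_estimate_gb(30):.0f} GB")   # ~19 GB
```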

And here are the results, Pi versus PC:

AMD Ryzen AI Pro R9700 Pi vs PC LLMs

Ouch. This is not what I was expecting to see.

I was a little discouraged, so I went back to my trusty RX 460:

AMD RX 460 GPU LLM performance

That's more in line with what I was expecting. Maybe that R9700's lackluster performance is from driver quirks or the lack of a large BAR?

I can't blame AMD's engineers for not testing their AI GPUs on Raspberry Pis.

That made me wonder if Nvidia's any better, since they've been optimizing their Arm drivers for years.

Here's the RTX 3060 12GB; it's a popular card for cheap at-home inference, since it has just enough VRAM to be useful:

Nvidia RTX 3060 AI LLM Performance Pi vs PC

The Pi's holding its own. A few models, like tinyllama and Llama 3.2 3B, do a little better on the PC, but for medium-size models the Pi's within spitting distance. The Pi even beat the PC at Llama 2 13B.

What really surprised me was this next graph:

Nvidia RTX 3060 AI LLM Efficiency Pi vs PC

This measures how efficient each system is. Accounting for the power supply, CPU, RAM, GPU, and everything, the Pi is actually pumping through tokens more efficiently than the PC, while nearly matching its performance.
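For reference, the efficiency metric here is essentially tokens per joule: generation speed divided by wall power for the whole system. The numbers below are placeholders, not the measured values behind the chart, but they show how the comparison is computed:

```python
# Tokens-per-joule efficiency: generation speed divided by wall power.
def tokens_per_joule(tokens_per_sec: float, wall_watts: float) -> float:
    return tokens_per_sec / wall_watts

pi = tokens_per_joule(tokens_per_sec=30.0, wall_watts=220.0)   # hypothetical
pc = tokens_per_joule(tokens_per_sec=32.0, wall_watts=330.0)   # hypothetical
print(f"Pi: {pi:.3f} tok/J, PC: {pc:.3f} tok/J")
```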

Okay... well, that's just the 3060. That card's also five years old. Maybe bigger and newer cards won't fare so well?

I ran my AI gauntlet against all the Nvidia cards I could get my hands on, including an RTX 3080 Ti, RTX 4070 Ti, RTX A4000, and RTX 4090. I'll let you watch the video to see all the results, but here I'll skip right to the end, showing the RTX 4090's performance on the Pi (it's the fastest GPU I own currently):

Nvidia RTX 4090 on Raspberry Pi CM5

It's comical how much more volume the graphics card takes up, compared to the Pi—it even dwarfs many of the PC builds into which it may be installed!

Nvidia RTX 4090 AI LLM Performance Pi vs PC

And on tinyllama, the PC just completely nukes the Pi from orbit. But surprisingly, the Pi still holds its own for most of the models; Qwen3 30B is less than 5% slower on the Pi.

With a card that can eat up hundreds of watts of power on its own, how's the efficiency?

Nvidia RTX 4090 AI LLM Efficiency Pi vs PC

I thought that since the rest of the system would be a smaller percentage of power draw, the PC would fare better—and it does, actually, for a few models. But the Pi's still edging out the bigger PC in the majority of these tests (the larger models).

Which... is weird.

I honestly didn't expect that. I was expecting the Pi just to get trounced, and maybe pull off one or two little miracle wins.

One caveat: I'm using llama.cpp's Vulkan backend to keep things consistent across vendors. CUDA could change the absolute numbers a little, but not by much, based on some CUDA testing with the RTX 2080 Ti. And CUDA works fine on the Pi, surprisingly.
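For anyone who wants to reproduce the setup: the Vulkan backend is a CMake option in llama.cpp, and llama-bench drives tests like these. Here's a rough sketch of the invocations, wrapped in Python (the flags are from memory and can vary between llama.cpp versions, and the model path is a placeholder):

```python
# Build llama.cpp with the Vulkan backend, then benchmark a GGUF model.
import subprocess

subprocess.run(["cmake", "-B", "build", "-DGGML_VULKAN=1"],
               cwd="llama.cpp", check=True)
subprocess.run(["cmake", "--build", "build", "--config", "Release", "-j"],
               cwd="llama.cpp", check=True)

subprocess.run([
    "llama.cpp/build/bin/llama-bench",
    "-m", "models/qwen3-30b-q4_k_m.gguf",  # placeholder model path
    "-ngl", "99",                          # offload all layers to the GPU
    "-p", "512",                           # prefill (prompt) length
    "-n", "128",                           # tokens to generate
], check=True)
```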

And for the r/LocalLLaMA folks already drafting comments about prompt processing speeds and context: I linked each result above to a GitHub issue with all the test data, so please check those issues before you post.

Dual GPUs

So far we've just been running on one GPU. What if we try two?

I used a Dolphin PCIe interconnect board, with Dolphin's MXH932 PCIe HBA, and connected that to my Pi CM5 using an M.2 to SFF-8643 adapter, along with an SFF-8643 to SFF-8644 cable, which goes from the Pi to the Dolphin card.

Before I ran any LLMs, I wanted to see if I could share memory between the two cards. PCI Express has a feature that lets devices talk straight to each other, instead of having to go 'north-south' through the CPU. That would remove the Pi's Gen 3 x1 bottleneck, and give the cards a full Gen 4 x16 link for tons more bandwidth.

For that to work, you have to disable ACS (Access Control Services). Dolphin had apparently already set that up for me on their switch card.
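If you're piecing together a setup like this yourself, you can at least see whether ACS is active on the switch's downstream ports with lspci. A small sketch (run it as root so the capability registers are readable; the bus address is a placeholder for your own switch port):

```python
# Print the ACS capability lines for one PCIe port. If the P2P redirect
# bits in ACSCtl are enabled, peer traffic gets forced up through the
# root complex instead of crossing the switch directly.
import subprocess

out = subprocess.run(["lspci", "-s", "0000:02:00.0", "-vvv"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if "Access Control Services" in line or "ACSCtl" in line:
        print(line.strip())
```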

mpsparrow was running four identical Nvidia RTX A5000 cards; I only had mismatched models on hand. It looks like the Nvidia driver doesn't support VRAM pooling the same way when the cards are different, like my 4070 and A4000, or my A4000 and A400.

But that's okay; there are still things I can do with llama.cpp and multiple GPUs, going 'north-south' through the CPU.

Nvidia Dual GPU setup on Pi 5

This is the performance of the 4070 and A4000, compared to just running the same models on the A4000 directly.

I'm guessing because of that extra traffic between the Pi and the cards, there are tons of little delays while llama.cpp is pushing data through the two GPUs. It's not better for performance, but this setup does let you scale up to larger models that won't fit on one GPU.

Qwen 3 30B, for example, is around 18 GB, which is too big for either of these cards alone. It would be faster and more efficient to use a single GPU with enough VRAM to fit the entire model, but if you already have two graphics cards and want to run them together, at least it's possible.
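llama.cpp handles that split with its tensor-split options, assigning a share of the layers to each card. Here's a sketch of roughly the kind of invocation involved (flags from memory and may differ by version; the ratio roughly matches the 12 GB 4070 Ti plus 16 GB A4000 here):

```python
# Split a model that's too big for either card alone across two GPUs.
import subprocess

subprocess.run([
    "llama.cpp/build/bin/llama-cli",
    "-m", "models/qwen3-30b-q4_k_m.gguf",  # placeholder (~18 GB quantized)
    "-ngl", "99",                          # offload everything to the GPUs
    "--split-mode", "layer",               # whole layers assigned per GPU
    "--tensor-split", "12,16",             # proportions per card (by VRAM)
    "-p", "Summarize PCIe peer-to-peer transfers in one paragraph.",
], check=True)
```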

I also ran the two biggest AMD cards I have, the RX 7900 XT (20 GB) and Radeon AI Pro R9700 (32GB), and that gives me a whopping 52 GB of VRAM. But there, again, maybe due to AMD's drivers, I couldn't get some models to finish a run—and when they did, they were not very fast:

AMD RX 7900 XT and Radeon AI Pro R9700 Dual GPU on Pi 5

To close out my dual GPU testing, I also ran all the tests on the Intel PC, and, not too surprisingly, it was faster:

Dual GPU setup performance for AI LLMs Pi vs PC

But at least with Qwen3 30B, the Pi holds its own again.

Takeaway: If you have multiple identical cards, and maybe try other tools like vLLM (which I couldn't get to install on the Pi), you might get better numbers than I did here.

But the main lesson still applies: more GPUs can give you more capacity, but they'll definitely be slower than one bigger GPU—and less efficient.

Conclusion

After all that, which one is the winner?

Raspberry Pi vs PC power usage measured by Home Assistant ThirdReality Zigbee Smart Outlets

Well, the PC obviously, if you care about raw performance and easy setup. But for a very specific niche of users, the Pi is better, like if you're not maxed out all the time and have almost entirely GPU-driven workloads. The idle power on the Pi setup was always 20-30W lower. And other Arm SBCs using Rockchip or Qualcomm SoCs are even more efficient, and often have more I/O bandwidth.

Ultimately, I didn't do this because it made sense, I did it because it's fun to learn about the Pi's limitations, GPU computing, and PCI Express. And that goal was achieved.

I'd like to thank Micro Center for sponsoring the video that led to writing this blog post (they provided the AMD Ryzen AI Pro R9700 for testing, as well as one of the 850W power supplies), and Dolphin ICS for letting me borrow two of their PCI Express boards after I spoke with them at their booth at Supercomputing 25.
