(comments)

Original link: https://news.ycombinator.com/item?id=40559368

Intel plans to release laptop and desktop CPUs in late 2025 that support a 256-bit subset of the AVX-512 instruction set. However, these CPUs lack the full AVX-512 capability of earlier Intel models, limiting their effectiveness for demanding computations that need 512-bit operations; instead, they prioritize energy efficiency and compatibility with mainstream software. By contrast, AMD's Zen 4 and Zen 5 CPUs handle 512-bit instructions natively. While Zen 4 initially lacked a dedicated FP multiplier in every execution unit, Zen 5 addresses this. Both generations have improved clock management, eliminating the clock-frequency drops during 512-bit instruction execution. Despite having fewer FP multipliers, AMD's CPUs maintain high performance thanks to architectural enhancements. The competition between Intel and AMD continues and keeps driving innovation in CPU technology. AMD and Intel keep evolving, while Apple's integration and volume let products such as the M1 series deliver impressive performance. The debate centers on whether Apple's M1 series leads in raw peak performance, or whether Intel and AMD are catching up in single-threaded and multi-threaded performance. The market wants competitive pricing and the best desktop multi-core performance, and discrete graphics cards still excel in areas that neither Intel nor AMD SoCs can fully match. Ultimately, Intel and AMD each cater to different markets according to their respective strengths.


Original article


AVX512 in a single cycle vs 2 cycles is big if the clock speed can be maintained anywhere near 5GHz. Also the doubling of L1 cache bandwidth is interesting! Possibly needed to actually feed an AVX512-rich instruction stream, I guess.



For most instructions, both Intel and AMD CPUs with AVX-512 support are able to do two 512-bit instructions per clock cycle. There is no difference between Intel and AMD Zen 4 for most 512-bit AVX-512 instructions.

I expect that this will remain true for Zen 5 and the next Intel CPUs.

The only important differences in throughput between Intel and AMD were for the 512-bit load and store instructions from the L1 cache and for the 512-bit fused multiply-add instructions, where Intel had double throughput in its more expensive models of server CPUs.

I interpret AMD's announcement as saying that Zen 5 now has double the transfer throughput between the 512-bit registers and the L1 cache and also a doubled 512-bit FP multiplier, so it now matches the Intel AVX-512 throughput per clock cycle in all important instructions.
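
To make the ratio concrete, here is a minimal sketch (C with AVX-512F intrinsics; the names are just illustrative, and a real kernel would use several accumulators to hide FMA latency): every 512-bit FMA here consumes two 512-bit loads, which is why the wider register-to-L1 path and the extra multiplier have to arrive together.

    #include <immintrin.h>
    #include <stddef.h>

    /* Illustrative dot product: one 512-bit FMA per two 512-bit loads,
       so extra FMA throughput only pays off if L1 load bandwidth keeps up.
       Compile with e.g. gcc -O2 -mavx512f. */
    double dot_avx512(const double *a, const double *b, size_t n)
    {
        __m512d acc = _mm512_setzero_pd();
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m512d va = _mm512_loadu_pd(a + i);   /* 512-bit load #1 */
            __m512d vb = _mm512_loadu_pd(b + i);   /* 512-bit load #2 */
            acc = _mm512_fmadd_pd(va, vb, acc);    /* one 512-bit FMA */
        }
        double sum = _mm512_reduce_add_pd(acc);
        for (; i < n; i++)                         /* scalar tail */
            sum += a[i] * b[i];
        return sum;
    }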



> There is no difference between Intel and AMD Zen 4 for most 512-bit AVX-512 instructions.

Except for the fact that Intel hasn't had any AVX-512 for years already in consumer CPUs, so there's nothing to compare against really in this target market



The comparison done by the AMD announcement and by everyone else compares the Zen 5 cores, which will be used both in their laptop/desktop products and in their Turin server CPUs, with the Intel Emerald Rapids and the future Granite Rapids server CPUs.

As you say, Intel has abandoned the use of the full AVX-512 instruction set in their laptop/desktop products and in some of their server products.

At the end of 2025, Intel is expected to introduce laptop/desktop CPUs that will implement a 256-bit subset of the AVX-512 instruction set.

While that will bring many advantages of AVX-512 that are not related to register and instruction widths, it will lose the simplification of the high-performance programs that is possible in 512-bit AVX-512 due to the equality between register size and cache line size, so the consumer Intel CPUs will remain a worse target for the implementation of high-performance algorithms.
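
As a small sketch of what that simplification looks like (C with AVX-512F intrinsics; the names and the stated alignment/size assumptions are mine, not from the comment): with 64-byte zmm registers one aligned load or store covers exactly one cache line, so code that works line by line needs no splitting inside a line.

    #include <immintrin.h>
    #include <stddef.h>

    /* Sketch: touch memory one full 64-byte cache line at a time.
       Assumes dst and src are 64-byte aligned and n is a multiple of 64.
       Compile with e.g. gcc -O2 -mavx512f. */
    void copy_lines(void *dst, const void *src, size_t n)
    {
        char *d = (char *)dst;
        const char *s = (const char *)src;
        for (size_t off = 0; off < n; off += 64) {
            __m512i line = _mm512_load_si512((const void *)(s + off)); /* one load = one line */
            _mm512_stream_si512((__m512i *)(d + off), line);           /* non-temporal store of that line */
        }
        _mm_sfence();  /* order the streaming stores */
    }

With 256-bit registers the same loop needs two loads and two stores per line, and anything that reasons about "one cache line" (non-temporal writes, per-line scans or checksums) has to be stitched together from halves.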



> of the high-performance programs that is possible in 512-bit AVX-512 due to the equality between register size and cache line size, so the consumer Intel CPUs will remain a worse target for the implementation of high-performance algorithms.

Can you elaborate here? I love full-width AVX-512 as much as the next SIMD nerd, but I rarely considered the alignment of the cache line and vector width one of the particularly useful features. If anything, it was a sign that AVX-512 was probably the end of the road for full-throughput full-width loads and stores at full AVX register width, since double-cache line memory operations are likely to be half-throughput at best and a doubling of the cache line width seems unlikely.



I read that comment as "the wider, the sweeter" (which I agree with), but that we're now (as you say) at the end of the road, and thus the sweetest point.

But an increase in cacheline size would be nice if it can get us larger vectors, or otherwise significantly improve memory bandwidth.



The difference is Intel chips that support AVX-512 run $1,300 - $11,000 with MUCH higher total system costs, whereas AMD actually DOES support AVX-512 on all its chips and you can get AVX-512 for dirt cheap. The whole Intel instruction support story feels like garbage here. Weren't they the ones to introduce this whole 512 thing in the first place?



That's being very generous to Intel. It's much more believable that the strategy of mixing performance and efficiency cores in mainstream consumer processors starting with Alder Lake was a relatively last-minute decision, too late to re-work the Atom core design to add AVX512, so they sacrificed it from the performance cores instead.

The Intel consumer processors due out later this year are expected to include the first major update to their efficiency cores since they introduced Alder Lake. Only after that ships without AVX512 will it be plausible to attribute product segmentation as a major cause rather than just engineering factors.

(Intel's AVX10 proposal indicates their long-term plan may be to never support 512-bit SIMD on the efficiency cores, but to eventually make it possible for full 512-bit SIMD to make a return to their consumer processors.)



No question - but these kind of games lead to poor adoption.

They made a big to-do about how much better than AMD they are with things like AVX512.

Then they play these games for market purposes.

Then they have stupid clock speed pauses, so when you try it your stuff goes slower.

Meanwhile - AMD is putting it on all their chips and it works reasonably there.

So I've just found the whole Intel style here kind of annoying. I really remember them doing bogus comparisons to non-AVX-512 AMD parts (and projecting when their chips would be out). The reality is, if you are writing software that depends on AVX512, tell clients to buy AMD to run it. Does AVX512 even work on efficiency cores and things like that? It's a mess.



Or Intel's repeated failures to roll out a working <14nm processes forced them to cripple even the P-core AVX-512 implementation (one less ALU port compared to Xeon Gold), rip out whatever compatibility feature they wanted to put in the E-cores and Microsoft refused to implement workarounds in Windows (e.g. make AVX-512 opt-in and restrict threads that enabled it to P-cores, reschedule threads faulting on AVX-512 instructions from E-core to P-cores).



> The only important differences in throughput between Intel and AMD

Not exactly related, but AMD also has a much better track record when it comes to speculative execution attacks.



A clock speed drop must always occur whenever a CPU does more work per clock cycle, e.g. when more CPU cores are active or when wider SIMD instructions are used.

However, the early generations of Intel CPUs that implemented AVX-512 had bad clock management, which was not agile enough to quickly lower the clock frequency, i.e. the power consumption, when the temperature rose too high due to the higher power draw, in order to protect the CPU. Because of that, and because there are no instructions that programmers could use to announce their intention of intensively using wide SIMD instructions in a sequence of code, the Intel CPUs lowered the clock frequency preemptively and by a lot whenever they feared that the future instruction stream might contain 512-bit instructions that could lead to an overtemperature. The clock frequency was restored only after delays not much shorter than a second. When AVX-512 instructions were executed sporadically, that could slow down any application very much.

The AMD CPUs and the newer Intel CPUs have better clock management, which reacts more quickly, so the lower clock frequency during AVX-512 instruction execution is no longer a problem. A few AVX-512 instructions will not measurably lower the clock frequency, while the lower clock frequency when AVX-512 instructions are executed frequently is compensated for by the greater work done per clock cycle.



AVX512 was never over 2 cycles. In Zen 4 it used the 256-bit-wide execution units of AVX2 (except for shuffle), but there is more than one 256-bit-wide execution unit, so you still got your one-cycle throughput.



More importantly for the "2 cycles" question, Zen 4 can get one cycle latency for double-pumped 512-bit ops (for the ops where that's reasonable, i.e. basic integer/bitwise arith).

Having all 512-bit pipes would still be a massive throughput improvement over Zen 4 (as long as pipe count is less than halved), if that is what Zen 5 actually does; things don't stop at 1 op/cycle. Though a rather important question with that would be where that leaves AVX2 code.



Almost certainly Zen 5 won't have single-cycle FP latency (I haven't heard of anything doing such even for scalar at modern clock rates (though maybe that does exist somewhere); AMD, Intel, and Apple all currently have 3- or 4-cycle latency). And Zen 4 already has a throughput of 2 FP ops/cycle for up to 256-bit arguments.

The thing discussed is that Zen 4 does 512-bit SIMD ops via splitting them into two 256-bit ones, whereas Zen 5 supposedly will have hardware doing all 512 bits at a time.



Even if Lisa Su said this at the Zen 4 launch, it is not likely that 512-bit operations are split into a pair of 256-bit operations that are executed sequentially in the same 256-bit execution unit.

Both Zen 3 and Zen 4 have four 256-bit execution units.

Two 512-bit instructions can be initiated per clock cycle. It is likely that the four corresponding 256-bit micro-operations are executed simultaneously in all the 4 execution units, because otherwise there would be an increased likelihood that the dispatcher would not be able to find enough micro-operations ready for execution so that no execution unit remains idle, resulting in reduced performance.

The main limitation of the Zen 4 execution units is that only 2 of them include FP multipliers, so the maximum 512-bit throughput is one fused multiply-add plus one FP addition per clock cycle, while the Intel CPUs have an extra 512-bit FMA unit, which stays idle and useless when AVX-512 instructions are not used, but which allows two 512-bit FMA per cycle.

Without also doubling the transfer path between the L1 cache and the registers, a double FMA throughput would not have been beneficial for Zen 4, because many algorithms would have become limited by the memory transfer throughput.

Zen 5 doubles the width of the transfer path to the L1 and L2 cache memories and it presumably now includes FP multipliers in all the 4 execution units, thus matching the performance of Intel for 512-bit FMA operations, while also doubling the throughput of the 256-bit FMA operations, where in Intel CPUs the second FMA unit stays unused, halving the throughput.
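
Putting rough numbers on that (counting an FMA as two FLOPs, straight from the unit counts described above):

    \text{Zen 4: } (1\,\text{FMA} + 1\,\text{ADD}) \times 8\ \text{DP lanes} = 16 + 8 = 24\ \text{FLOP/cycle}
    \text{Two full 512-bit FMA pipes: } 2 \times 8 \times 2 = 32\ \text{FLOP/cycle}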

No well-designed CPU has a FP addition or multiplication latency of 1. All modern CPUs are designed for the maximum clock frequency which ensures that the latency of operations similar in complexity to 64-bit integer additions between registers is 1. (CPUs with a higher clock frequency than this are called "superpipelined", but they went out of fashion a few decades ago.)

For such a clock frequency, the latency of floating-point execution units of acceptable complexity is between 3 and 5, while the latency of loads from the L1 cache memory is about the same.

The next class of operations with a longer latency includes division, square root and loads from the L2 cache memory, which usually have latencies between 10 and 20. The longest latencies are for loads from the L3 cache memory or from the main memory.



Yeah, it's certainly possible that it's not double-pumping. Should be roughly possible to test via comparing latency upon inserting a vandpd between two vpermd's (though then there are questions about bypass networks; and of course if we can't measure which method is used it doesn't matter for us anyway); don't have a Zen 4 to test on though.
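
A rough, untested sketch of that measurement (C with intrinsics rather than raw asm; needs something like gcc -O2 -march=znver4, a pinned and fixed core clock, and TSC ticks converted to core cycles before the number means anything):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc */

    /* Dependency chain vpermd -> vandpd -> vpermd. Compare ticks/iter
       against the same loop with the and-step deleted: the difference
       approximates the latency the 512-bit vandpd adds to the chain. */
    int main(int argc, char **argv)
    {
        (void)argv;
        const long iters = 100000000;
        const __m512i idx = _mm512_set1_epi32(0);     /* constant permute index */
        const __m512d mask = _mm512_set1_pd(1.0);     /* constant AND operand */
        __m512d x = _mm512_set1_pd((double)argc);     /* runtime value, not constant-foldable */

        uint64_t t0 = __rdtsc();
        for (long i = 0; i < iters; i++) {
            x = _mm512_castsi512_pd(_mm512_permutexvar_epi32(idx, _mm512_castpd_si512(x)));
            x = _mm512_and_pd(x, mask);               /* the op under test */
            x = _mm512_castsi512_pd(_mm512_permutexvar_epi32(idx, _mm512_castpd_si512(x)));
        }
        uint64_t t1 = __rdtsc();

        /* Print the result so the chain is not optimized away. */
        printf("sink=%f  TSC ticks/iter: %.2f\n",
               _mm512_reduce_add_pd(x), (double)(t1 - t0) / iters);
        return 0;
    }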

But of note is that, at least in uops.info's data[0], there's one perf counter increment per instruction, and all four pipes get non-zero equally-distributed totals, which seems to me much simpler to achieve with double-pumping (though not impossible with splitting across ports; something like incrementing a random one. I'd expect biased results though).

Then again, Agner says "512-bit vector instructions are executed with a single μop using two 256-bit pipes simultaneously".

[0]: https://uops.info/html-tp/ZEN4/VPADDB_ZMM_ZMM_ZMM-Measuremen...



I’ve been running some LLMs on my 5600x and 5700g cpus, and the performance is… ok but not great. Token generation is about “reading out loud” pace for the 7&13 B models. I also encounter occasional system crashes that I haven’t diagnosed yet, possibly due to high RAM utilization, but also possibly just power/thermal management issues.

A 50% speed boost would probably make the CPU option a lot more viable for home chatbot, just due to how easy it is to make a system with 128gb RAM vs 128gb VRAM.
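
Back-of-the-envelope numbers for that (my assumptions, not from the comment: a 13B model quantized to 8 bits, so roughly 13 GB of weights streamed once per generated token, and dual-channel DDR4-3200 at about 51 GB/s theoretical peak):

    \frac{51\ \text{GB/s}}{13\ \text{GB/token}} \approx 4\ \text{tokens/s}

That is roughly reading-aloud pace, and it hints that single-stream token generation tends to be limited by memory bandwidth rather than SIMD throughput, so faster RAM may matter as much as wider vector units.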

I personally am going to experiment with the 48gb modules in the not too distant future.



You could put an 8700G in the same socket. The CPU isn't much faster but it has the new NPU for AI. I'm thinking about this upgrade to my 2400G but might want to wait for the new socket and DDR5.



I looked at upgrading my existing AMD based system's ram for this purpose, but found out my mobo/cpu only supports 128gb of ram. Lots, but not as much as I had hoped I could shove in there.



How are the 24x PCIe 5.0 lanes (~90GB/s) of the 9950X allocated?

The article makes it appear as:

* 16x PCIe 5.0 lanes for "graphics use" connected directly to the 9950X (~63GB/s).

* 1x PCIe 5.0 lane for an M.2 port connected directly to the 9950X (~4GB/s). Motherboard manufacturers seemingly could repurpose "graphics use" PCIe 5.0 lanes for additional M.2 ports.

* 7x PCIe 5.0 lanes connected to the X870E chipset (~28GB/s). Used as follows:

  * 4x USB 4.0 ports connected to the X870E chipset (~8GB/s).

  * 4x PCIe 4.0 ports connected to the X870E chipset (~8GB/s).

  * 4x PCIe 3.0 ports connected to the X870E chipset (~4GB/s).

  * 8x SATA 3.0 ports connected to the X870E chipset (some >~2.4GB/s part of ~8GB/s shared with WiFi 7).

  * WiFi 7 connected to the X870E chipset (some >~1GB/s part of ~8GB/s shared with 8x SATA 3.0 ports).
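
For reference, the per-lane arithmetic behind the figures above (PCIe 5.0 signals at 32 GT/s with 128b/130b encoding; how the lanes are actually allocated is discussed in the replies below):

    32\ \text{GT/s} \times \tfrac{128}{130} \div 8 \approx 3.94\ \text{GB/s per lane}
    16 \times 3.94 \approx 63\ \text{GB/s}, \quad 4 \times 3.94 \approx 15.8\ \text{GB/s}, \quad 24 \times 3.94 \approx 94\ \text{GB/s}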


These processors will use the existing AM5 socket, so they fundamentally cannot make major changes to lane counts and allocations, only per-lane speeds. They're also re-using CPU's IO die from last generation and re-using the same chipset silicon, which further constrains them to only minor tweaks.

Typical use cases and motherboards give an x16 slot for graphics, x4 each to at least one or two M.2 slots for SSDs, and x4 to the chipset. Last generation and this generation, AMD's high-end chipset is actually two chipsets daisy-chained, since they're really not much more than PCIe fan-out switches plus USB and SATA HBAs.

Nobody allocates a single PCIe lane to an SSD slot, and the link between the CPU and chipset must have a lane width that is a power of two; a seven-lane link is not possible with standard PCIe.

Also, keep in mind that PCIe is packet-switched, so even though on paper the chipset is over-subscribed with downstream ports that add up to more bandwidth than the uplink to the CPU provides, it won't be a bottleneck unless you have an unusual hardware configuration and workload that actually tries to use too much IO bandwidth with the wrong set of peripherals simultaneously.



Cheap small computers with Intel Alder Lake N CPUs, like the Intel N100, frequently allocate a single PCIe lane to each SSD slot.

However you are right that such a choice is very unlikely for computers using AMD CPUs or Intel Core CPUs.



AMD's previous socket was usually x16/x4/x4 for GPU/M.2/chipset. For AM5, they added another four lanes, so it's usually x16/x4+x4/x4 for GPU/(2x)M.2/chipset, unless the board is doing something odd with providing lanes for Thunderbolt ports or something like that.



Oh you're right. I didn't read the diagram correctly and the specs say 28 total lanes, 24 usable - probably because 4 go to the chipset.



Siena would make a very practical HEDT socket - it's basically half of a bergamo, 6ch DDR5/96x pcie 5.0. It's sort of an unfortunate artifact of the way server platforms have gone that HEDT has fizzled out, they're mostly just too big and it isn't that practical to fit into commodity form-factors anymore, etc. a bigass socket sp3 and 8ch was already quite big, now it's 12ch for SP5 and you have a slightly smaller one at SP6. But still, doing 1DPC in a commodity form factor is difficult, you really need an EEB sort of thing for things like GENOAD8X etc let alone 2 dimms per channel etc, which if you do like a 24-stick board and a socket you don't fit much else.

https://www.anandtech.com/show/20057/amd-releases-epyc-8004-...

2011/2011-3/2066 were actually a reasonable size. Like LGA3678 or whatever as a hobbyist thing doesn't seem practical (the W-3175X stuff) and that was also 6ch, and Epyc/TR are pretty big too etc. There used to exist this size-class of socket that really no longer gets used, there aren't tons of commercial 3-4-6 channel products made anymore, and enthusiast form-factors are stuck in 1980 and don't permit the larger sockets to work that well.

The C266 being able to tap off IOs as SAS3/12gbps or pcie 4.0 slimsas is actually brilliant imo, you can run SAS drives in your homelab without a controller card etc. The Asrock Rack ones look sick, EC266D4U2-2L2Q/E810 lets you basically pull all of the chipset IO off as 4x pcie 4.0x4 slimsas if you want. And actually you can technically use MCIO retimers to pull the pcie slots off, they had a weird topology where you got a physical slot off the m.2 lanes, to allow 4x bifurcated pcie 5.0x4 from the cpu. 8x nvme in a consumer board, half in a fast pcie 5.0 tier and half shared off the chipset.

https://www.asrockrack.com/general/productdetail.asp?Model=E...

Wish they'd do something similar with AMD and mcio preferably, like they did with the GENOAD8X. But beyond the adapter "it speaks SAS" part is super useful for homelab stuff imo. AMD also really doesn't make that much use of the chipset, like, where are the x670E boards that use 2 chipsets and just sling it all off as oculink or w/e. Or mining-style board weird shit. Or forced-bifurcation lanes slung off the chipset into a x4x4x4x4 etc.

https://www.asrockrack.com/general/productdetail.asp?Model=G...

All-flash is here, all-nvme is here, you just frustratingly can't address that much of it per system, without stepping up to server class products etc. And that's supposed to be the whole point of the E series chipset, very frustrating. I can't think of many boards that feel like they justify the second chipset, and the ones that "try" feel like they're just there to say they're there. Oh wow you put 14 usb 3.0 10gbps ports on it, ok. How about some thunderbolt instead etc (it's because that's actually expensive). Like tap those ports off in some way that's useful to people in 2024 and not just "16 sata" or "14 usb 3.0" or whatever. M.2 NVMe is "the consumer interface" and it's unfortunately just about the most inconvenient choice for bulk storage etc.

Give me the AMD version of that board where it's just "oops all mcio" with x670e (we don't need usb4 on a server if it drives up cost). Or a miner-style board with infinite x4 slots linked to actual x4s. Or the supercarrier m.2 board with a ton of M.2 sticks standing vertically etc. Nobody does weird shit with what is, on paper, a shit ton of pcie lanes coming off the pair of chipsets. C'mon.

Super glad USB4 is a requirement for X870/X870E, thunderbolt shit is expensive but it'll come down with volume/multisourcing/etc, and it truly is like living in the future. I have done thunderbolt networking and moved data ssd to ssd at 1.5 GB/s. Enclosures are super useful for tinkering too now that bifurcation support on PEG lanes has gotten shitty and gpus keep getting bigger etc. An enclosure is also great for janitoring M.2 cards with a simple $8 adapter off amazon etc (they all work, it's a simple physical adapter).



Very well said. It feels like we have all this amazing tech that could open the door to so many creative possibilities but no one is interested in exploring it.

Before, we had so little but it was all available to utilize to the fullest extent. Now we live in a world of excess but it’s almost a walled garden.



I'll probably wait one or two years before getting into anything with DDR5. I blew some money on an AMD laptop in 2021. At the time it was a monster with decent expansion options: RX 6800M, Ryzen 9 5900HX. I maxed it out with 64GB of DDR4 and 2x 4TB PCIe 3.0 NVMe. Runs Linux very well.

But now I'm seeing lots of things I'm locked out of: faster Ethernet standards, the fun that comes with tons of GPU memory (no USB4, can't add 10GbE either), faster and larger memory options, AV1 encoding. It's just sad that I bought a laptop right before those things were released.

Should have gone with a proper PC. Not making that mistake again.



It sounds like you need a desktop workstation with replaceable extension cards, and not a mostly immutable laptop, which has different strengths.



(disclosure I own a 13in one)

Yea closest I see to being better about it is Frame.work laptops, and even then it's not as good a story as desktops, just the best story for upgrading a laptop right now. Other than that buying one and making sure you have at least two thunderbolt (or compatible) ports on separate busses is probably the best you can do since that'd mean two 40Gb/s links for expansion even if it's not portable, but would let you get things like 10GbE adapters or fast external storage and such without compromising too much on capability.



The idea of Framework sounds so good but their actual implementation has really been lacking so far, especially when you consider the total price of the laptop + upgrades vs just buying cheaper devices whole but more often. E.g. wired Ethernet is a USB-C 2.5 Gbps adapter that sticks out of the chassis because it doesn't fit.



AMD seem to be playing it safe with this desktop generation. Same node, similar frequencies, same core counts, same IOD, X3D chips only arriving later... IPC seems like the only noteworthy improvement here. 15% overall is good but nothing earth shattering.

The mobile APUs are way more interesting.



> 15% overall is good but nothing earth shattering

Interestingly though the 9700X seems to be rated at 65W TDP (compared to a 105 TDP for the 7700X). I run my 7700X in "eco mode" where I lowered the TDP to max 95 W (IIRC, maybe it was 85 W: I should check in the BIOS).

So it looks like it's 15% overall more power with less power consumption.



AMD used to give decently accurate TDP, but Intel started giving unrealistically optimistic TDP ratings, so AMD joined the game. Their TDP is more of an aspiration than a reality (ironically, Intel seems to have gotten more accurate with their new base and turbo power ratings).

9700x runs 100MHz higher on the same process as the 7700x. If they are actually running at full speed, I don't see how 9700x could possibly be using less power with more transistors at a higher frequency. They could get lower power for the same performance level though if they were being more aggressive about ramping down the frequency (but it's a desktop chip, so why would they?).



For the same node, you aren't cutting power by 40% while increasing clockspeeds. At best, they could be aggressively cutting clockspeeds, but that means lower overall performance.

Marketing seems a more likely reason (and I don't believe it would be the first time either).



Geekbench Ryzen 9 7950X is 2930 max, so if we're generous and give the 9950X a 15% uplift we'll be at 3380, which is still 400 points or so behind Apple Silicon at a much higher clock speed and several times the power draw. Also the max memory bandwidth at 70GB/s or so is basically pathetic, trounced by ASi.



You're comparing apples and oranges. Ryzen has never had the lead with 1T performance, but emphasises nT and core counts instead. Memory bandwidth is largely meaningless for desktop CPUs but matters a lot for a SoC with a big GPU.

Strix Halo appears to be AMD's competitor to Apple SoCs which will feature a much bigger iGP and much greater memory bandwidth. When we hear more about that, comparisons will be apt.



Ryzen doesn't even lead in MT on laptops.

With M4, they're likely to fall even farther behind. M4 Pro/Max is likely to arrive in Fall. AMD's Strix Point doesn't seem to have a release date.



My point is that Strix Point and M4 Pro/Max will likely release at around the same time based on reports and historical trends.

M4 is already out. We know Zen5's performance. Therefore, we can conclude that the gap between M4 vs Zen5 is higher than M3 vs Zen4.

You seem to be very sensitive when it comes with AMD.



You seem to be very sensitive when it comes to Apple.

I've used Apple for 20+ years, from Motorola 680X0 CPUs to Motorola PowerPC to Intel CPUs. SGI MIPS, Sun Sparc, DEC Alpha etc.

I don't care what name is written on the CPU I use.

I care how fast my development machine is.



> I've used Apple for 20+ years, from Motorola 680X0 CPUs to Motorola PowerPC to Intel CPUs. SGI MIPS, Sun Sparc, DEC Alpha etc.

I'm not sure why this is relevant to M4 vs Zen5 or M3 vs Zen4.

> I care how fast my development machine is.

Great. I hope you get the fastest development machine for yourself. But this conversation isn't about your development machine.


I consider it relevant to

"You seem to be very sensitive when it comes with AMD."

"But this conversation isn't about your development machine."

It is when your point is

"You seem to be very sensitive when it comes with AMD."



AMD has been making SoCs forever. Why does it take a kick in the teeth from Apple for both Intel and AMD suddenly to wake up and give us performance SoCs, 5 years later? They've been coasting on the stone-age motherboard+cards arch because essentially Nvidia gave you a big, fat, modern coprocessor in the form of an [g/r]tx card that hid the underlying problems.

They could have done what AAPL did ages ago but they have no ability to innovate properly. They've been leaning on their x86 duopoly and if it's now on its last legs, it's their fault.



This question is almost rhetorical. Yes, competition drives innovation. When Ryzen APUs only had Intel to worry about, they rested on their laurels with stagnating but still superior performance. Intel did the same thing with 4c/8t desktop CPUs during the Bulldozer dark age.

SoCs are good for the segments they target but they're by no means the be-all-end-all of personal computing. The performance of discrete graphics cards simply can't be beaten, and desktop users want modularity and competitive perf/$. Framing the distinction between a highly integrated SoC-based computer and a traditional motherboard+AIB arrangement as a quantitative rather than a qualitative difference is an error.

Both Intel and AMD have already received Apple's wakeup call and have adjusted their strategies. I also think it's unfair to say that nothing has changed until just now. I consider Ryzen 6000 to be an understated milestone in this competition with a big uplift in iGP performance and a focus on efficiency. There's a wide gap to close, sure, but AMD and Intel have certainly not been standing still.

Apple's vertical integration and volume made them uniquely suited to produce products like the Mx line, so it makes sense that they were able to deliver a product like this first.



as soon as someone puts 16GB of VRAM next to an SoC (Nvidia, soon), the gaming pc is dead. These hot 'n slow discrete components are fun to decorate with RGB but they're yesterday for _all_ segments.



The point is that the dGPU will _become_ the PC, and it essentially has been the PC for a while now, as you are essentially proving with your cyberpunk comment.

Nvidia is preparing to SoC-up its RTX cards with ARM cores and then it's toast for your rig.



> AMD has been making SoCs forever. Why does it take a kick in the teeth from Apple for both Intel and AMD suddenly to wake up and give us performance SoCs, 5 years later

What do you mean? AMD has been the market leader in performance SoCs for specialised use cases (like consoles) for many years.



I do think the compilation benchmark breaks down, as seen with the 96-core CPU:
    406.2 Klines/sec 7995WX
3x the cores but only slightly faster (8%).

Also 32-core threadripper machines seem to be in the price range of M2 Ultra machines.

[Edit] I found a 491.5 klines/sec result for the 7985WX



> which is still 400 points or so behind apple silicon for a much higher clock speed and a multiplier larger power draw

Not a fair comparison. If we're on about Geekbench as per the announcement, it's +35%. The 15% is a geomean. It might not be better but definitely not far off Apple.

In a similar manner, except Geekbench the geomean of M3 vs M4 isn't that great either.



And in a way same applies to M3 vs M4 Geekbench scores. A few new instructions were added. Aside from those it's nowhere near the 25% improvement there either.



Surprisingly not that much to be excited about IMO. AMD isn't using TSMC's latest node and the CPUs only officially support DDR5 speeds up to 5600MHz (yes, I know that you can use faster RAM). The CPUs are also using the previous-gen graphics architecture, RDNA2.



> AMD isn't using TSMC's latest node

Staying on an older node might ensure AMD the production capacity they need/want/expect. If they had aimed for the latest 3nm then they'd have to get in line behind Apple and Nvidia. That would be my guess: why aim for 3nm if you can't get fab time and you're still gaining a 15% speed increase.



Given that there are only 2 CUs in the GPU (and fairly low clock speeds), does the architecture matter much? Benchmarks were kinda terrible, and it looks to me that the intent of the built-in GPU is for hardware video encoding or to run it in a home server system, or emergency BIOS and the like. Compared to the desktop CPUs, even the lowest end mobile 8440U has 4 CUs, going up to 12 CUs on the higher end. Or go with Strix Point, which does have an RDNA 3.5 GPU (with 12 or 16 CUs) in it.

I guess you _can_ game on those 2 CU GPUs, but it really doesn't seem to be intended for that.



Oh, there's plenty of good uses for it, but I was specifically wondering why the previous poster cared for it being RDNA 2 instead of RDNA 3 since I don't think it makes a difference outside of gaming, and it's a pretty bad gaming GPU because of the low core count.

So I was curious if there was anything else that RDNA 3/3.5 would offer over RDNA 2 in such a low end configuration.



(Oh, just realized that "better efficiency" could mean that RDNA 3 can do the office work that RDNA 2 does but with lower power usage. That could be true, though I wonder if that would really be significant savings in any way)



Yeah, I'm glad they started including built in gpu so there's something there, but beyond booting to a desktop I wouldn't use this graphics for anything else. But if you're just running a screen and compiling rust, that's all you need. Or in my case, running a home server / NAS.



When building a non-APU ryzen machine for homelab use, I ended up buying the very cheapest graphics card I could find that was compatible, a "GeForce GT 710" that was not a beefy card when it was released in 2014. It's.. fine. After getting the system working I passed it through to a win10 VM and I can play non-FPS windows-only steam games over RDP.

So yeah next time I build a machine I'll appreciate having this built in.



Yeah, I have several GT 710's which also came in a PCIe x1 variant so I could keep the x16 slot free for something better. Glad that's no longer needed - the built-in GPU is a legit good thing.



TBH, CPUs nowadays are mostly good enough for the consumer, even at mid or low tiers.

It's the GPUs that are just getting increasingly inaccessible, price-wise.



Yes - with more and more users moving to laptops and wanting a longer battery life, raw peak performance hasn't moved much in a decade.

A decade ago, Steam's hardware survey said 8GB was the most popular amount of RAM [1] and today, the latest $1600 Macbook Pro comes with.... 8GB of RAM.

In some ways that's been a good thing - it used to be that software got more and more featureful/bloated and you needed a new computer every 3-5 years just to keep up.

[1] https://web.archive.org/web/20140228170316/http://store.stea...



> raw peak performance hasn't moved much in a decade.

In general, CPU clock speeds stagnated about 20 years ago because we hit a power wall.

In 1985, the state of the art was maybe 15-20MHz; in 1995, that was 300-500MHz; in 2005, we hit about 3GHz and we've made incremental progress from there.

It turns out that you can only switch voltages across transistors so many times a second before you melt down the physical chip; reducing voltage and current helps but at the expense of stability (quantum tunneling is only becoming a more significant source of leakage as we continue shrinking process sizes).
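
A one-line sketch of that power wall (the standard CMOS dynamic-power relation; the V \propto f step is only a rough rule of thumb near the top of the voltage/frequency curve):

    P_{\text{dyn}} \approx \alpha\, C\, V^2 f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^3

which is why the last few hundred MHz cost disproportionately more power than the performance they return.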

Most of the advancements over the past 20 years have come from pipelining, increased parallelism, and changes further up the memory hierarchy.

> today, the latest $1600 Macbook Pro comes with.... 8GB of RAM.

That's an unfair comparison. Apple has a history of shipping less RAM with its laptops than comparable PC builds (the Air shipped with 2GB in the early 2010s, eventually climbing up to 8GB by the time the M1 launched).

Further, the latest iteration of the Steam hardware survey shows that 80% of its userbase has at least 16GB of RAM, whereas in 2014 8GB was merely the plurality; not even 40% of users had >= 8GB. A closer comparison point would have been the 4GB mark, which 75% of users met or exceeded.



> That's an unfair comparison.

When I visit retailers' websites, the 8GB product category seems to be the one with the most products on offer. Dell, Asus, Acer, HP and Lenovo are also more than happy to sell you a laptop with 8GB of RAM, today. Although I would agree they shouldn't charge $1600 and call them "Pro".

So 8GB machines are still around, and not just in the throwaway $200 laptop segment.

I would agree with you that the 2024 Steam hardware survey shows a plurality of users with 16GB, whereas the 2014 survey said 8GB, so progress hasn't entirely stopped. But compared to the glory days of Moore's Law, a doubling over 10 years is not much.



> it used to be that software got more and more featureful/bloated and you needed a new computer every 3-5 years just to keep up.

I'm sorry, "used to be" ? 90% of the last decade of hardware advancement was eaten up by shoddy bloated software, where we now have UI lag on the order of seconds, 8GB+ of memory used all the time by god knows what and a few browser tabs and 1 core always peaking in util (again, doing god knows what).



> I'm just saying the rate of increase of bloat is now much reduced.

The rate of increase of bloat is now reduced, because hardware advancements rate of increase is also now reduced.

The bloat takes up all the hardware advancements, so of course they'll just be in line with each other.



I honestly think that given the demands of 4K video specifically using potentially a few gigs of memory just for decoding, 8gb made a world of sense, but little has come out since that really needs all that much memory for the average person.

When the industry moves to lpddr6/ddr6 I wouldn’t be shocked to see an increase to 6gb per module standard although maybe some binned 4gb modules will still be sold.



To be fair, a decade ago gaming PCs came with 2GB to 4GB of vRAM. Today's gaming PCs come with 12GB to 20GB of vRAM. Most games don't demand a lot of system memory, so it makes sense that PC gamers would invest in other components.

You're also comparing Windows x86 gaming desktops from a decade ago with macOS AppleSilicon base-spec laptops today. Steam's recent hardware survey shows 16GB as the most popular amount of RAM [1]. Not the 5x increase we've seen in vRAM, but still substantial.

[1] https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...



Tons of gamers are on 8GB because of mobile GPUs and because the only affordable 12GB GPU Nvidia has ever released is the 3060, which is a desktop GPU. I don't honestly expect 12GB+ to become mainstream until the 6000 series.



According to the Steam hardware survey, about 58% of the current-gen GPUs being used (NVIDIA 4000s and AMD 7000s) have 12GB+ of vRAM. I'd argue it's already mainstream - at least among "PC gamers". Obviously there is still plenty of old hardware out there, but I'm specifically focused on what people are buying new today because that's what OP was looking at.



> The CPUs are also using the previous-gen graphics architecture, RDNA2.

The GPU on these parts is there mostly for being able to boot into BIOS or OS for debugging. Basically when things go wrong and you want to debug what is broken (remove GPU from machine and see if things work)



These are decent GPUs for anything other than heavy gaming. I'm driving two 4k screens with it, and even for some light gaming (such as factorio) it's completely fine.



I'm under the impression that Factorio can run on any GPU capable of producing video output at all. years ago when I played it, it ran perfectly fine on whatever iGPU my 4790K had. 60 FPS/UPS with pretty big bases (although iirc I did disable all video effects like smoke to avoid cluttering the screen)



I agree, my 780M is quite capable in most games. Depending on the resolution & settings, even Cyberpunk 2077 is playable at 60fps. MS Flight Sim, though, hits hard (presumably the memory bandwidth bottleneck).



The GPU in the APU is on a totally different level when compared to the 2 core RDNA 2 one in the “normal” CPUs

For example the 780m you mentioned has 12 cores and newer architecture so is probably something like 10 to 15 times more powerful.



Sort of. The generation is RDNA2, but it's unlike all other RDNA2 chips because the focus is so much on energy efficiency (and the APU has its own codename, van gogh)



Hard disagree on that one. I have been daily driving an RDNA2 graphics unit for 1.5 years now and it's absolutely sufficient. I mostly do office work and occasionally play Minecraft. It's absolutely sufficient for that and I don't see any reason why you'd want to waste money on a dGPU for that kind of load.



Another advantage of having an integrated GPU is you can do a GPU pass-through and let a VM directly and fully use your dedicated GPU.

This could be a thing if you're running native Linux but some games only work on Windows which you run in a VM instead of dual booting.



> The GPU on these parts is there mostly for being able to boot into BIOS or OS for debugging.

That's wildly not true. Transcoding, gaming, multiple displays, etc. They are often used as any other GPU would be used.



> The GPU on these parts is there mostly for being able to boot into BIOS or OS for debugging.

Not at all. I drive a 38" monitor with the iGPU of the 7700X. If you don't game and don't run local AI models it's totally fine.

And... No additional GPU fans.

My 7700X build is so quiet it's nearly silent. I can barely hear its Noctua NH-12S cooler/fan ramping up when under full load, and that's how it should be.



> The CPUs are also using the previous-gen graphics architecture, RDNA2

Faster GPU is reserved for APUs. These graphics are just here for basic support.



Yeah - I've been waiting to see what this release would entail as I kinda want to build a SFF PC. But now that I know what's in it, and since they didn't come out with anything really special chipset-wise, I'll probably just see if I can get some current-gen stuff at discounted prices during the usual summer sales.



It's because x86 chips are no longer leading in the client market. ARM chips are. Specifically, Apple chips. Though Qualcomm has huge potential to leapfrog AMD/Intel chips in a few generations too.



[If you're a laptop user, scroll down the thread for laptop Rust compile times, M3 Pro looks great]

You're misguided.

Apple has excellent Notebook CPUs. Apple has great IPC. But AMD and Intel have easily faster CPUs.

https://opendata.blender.org/benchmarks/query/?compute_type=...

Blender Benchmark

      AMD Ryzen 9 7950X (16 core)         560.8
      Apple M2 Ultra (24 cores)           501.82
      Apple M3 Max (12 cores)             408.27
      Apple M3 Pro                        226.46
      Apple M3                            160.58
It depends on what you're doing.

I'm a software developer using a compiler that 100%s all cores. I like fast multicore.

      Apple Mac Pro, 64gb, M2 Ultra, $7000
      Apple Mac mini, 32gb, M2 Pro, 2TB SSD, $2600
[Edit2] Compare to: 7950x is $500 and a very fast SSD is $400, fast 64gb is $200, very good board is $400 so I get a very fast dev machine for ~$1700 (0,329 p/$ vs. mini 0,077 p/$)

[Edit] Made a c&p mistake, the mini has no ultra.



That seems wrong.

Though Blender may have an optimization for avx512 but not for SME or Neon.

But the vast majority will use GPUs to do rendering for Blender.

Try SPEC or its close consumer counterpart, Geekbench.

As an anecdote, all my Python and Node.js applications run faster on Apple Silicon than Zen4. Even my multithread Go apps seem to run better on Apple Silicon.



Now you're just asserting things unencumbered by even the slightest evidence.

On Passmark Apple CPUs are pretty far down the list.

On Geekbench I gave up after scrolling a few pages.

And "run faster on Apple Silicon than Zen4" means nothing. On the low end you have fairly cheap Ryzen 3 laptop chips, and on the high end you have Threadripper behemoths.



Passmark is a pretty bad CPU benchmark. It generally has poor correlation.

I would stick to SPEC and Geekbench.

Even Cinebench 2024 isn't too bad nowadays though R23 was quite poor in correlation.

In general, not only are Apple Silicon CPUs faster than AMD consumer CPUs, but they're 2-4x more power efficient as well.



The problem with Geekbench is it's trying to average the scores from many different benchmarks, but then if some of them are outliers (e.g. one CPU has hardware acceleration or some other unusual aptitude for that specific workload), it gets an outsized score which is then averaged in and skews the result even if it doesn't generalize.

What you want to do is look at the benchmarks for the thing you're actually using it for.

> they're 2-4x more power efficient as well.

This is generally untrue; people come to this conclusion by comparing mobile CPUs with desktop CPUs. CPU power consumption is non-linear with performance, so a large power budget lets you eke out a tiny bit more margin. For example, compare the 65W 5700X with the 105W 5800X. The 40 extra watts buys you around 2% more single-threaded performance, not because the 5700X has a more efficient design -- they're the exact same CPU with a different power cap. It's because turning up the clock speed a tiny bit uses a lot more power, but desktop CPUs do it anyway, because they don't have any such thing as battery life and people want the extra tiny bit more. Or the CPU simply won't clock any higher and doesn't even hit the rated TDP on single-threaded workloads.

The extra power will buy you a lot more on multi-threaded workloads, because then you get linear performance improvement with more power by adding more cores. But that's where the high core count CPUs will mop the floor with everything else -- while achieving higher performance per watt, because the individual cores are clocked lower and use less power.



  The problem with Geekbench is it's trying to average the scores from many different benchmarks, but then if some of them are outliers (e.g. one CPU has hardware acceleration or some other unusual aptitude for that specific workload), it gets an outsized score which is then averaged in and skews the result even if it doesn't generalize.

Geekbench CPU benchmark does not optimize for accelerators. It optimizes for instruction sets only.


It's not just about coprocessors. If one CPU has a set of SIMD instructions that double performance on that benchmark or more, that creates a large outlier that significantly changes the average.

Apple Silicon also has more memory bandwidth, the primary purpose of which is to feed the GPU; most CPU workloads don't care about that, but if you average in the occasional ones that do, then you get more outliers.

Which is why the thing that matters is how it performs on the thing you actually want to run on it, not how it performs in aggregate on a bunch of other applications you don't use.



The max power on Intel/AMD CPUs is only there to get the CPU "performance crown". As you've said, you spend a large amount of additional power for very minor gains (to be at the top of fancy Youtube review charts).

It looks as if AMD/Intel feel threatened by Snapdragon though - we'll see what AMD Strix / Halo brings for the first meaningful x86 mobile processor in years (or Lunar Lake).



> The max power on Intel/AMD CPUs is only there to get the CPU "performance crown".

It's mostly not. Its real purpose is to improve performance on threaded workloads.

Multi-core CPUs work like this: At the max boost a single core might use, say, 50 watts. So if you have 8 cores and wanted to run them all full out, you'd need a 400 watt power budget, which is a little nuts. It's not even worth it. Because you only have to clock them a little lower, say 4GHz instead of 5, to cut the power consumption more than in half, and then you get a TDP of e.g. 100W. Still not nothing but much more reasonable. You can also cut the clock speed even more and get the power consumption all the way down to 15W, but then you're down to 2GHz on threaded workloads and sacrificing quite a bit of multi-thread performance.

So they're not just trying to eke out a couple of percent, even though that's all you get from single-thread improvement, because a single core was already near or at its limit. Whereas 8 cores at 4GHz will be legitimately twice as fast as the same cores at 2GHz. But they'll also use more than twice as much power, which matters in a laptop but not so much in a desktop.

Of course, the thing that works even better is to have 16 cores or more that are clocked a little lower, which improves performance and performance per watt. The performance per watt of the 96-core Threadrippers are astonishingly good -- even though they're 360W. But that also requires more silicon, so those ones are the expensive ones.



Apple's M3 Max CPU cores peak out at 55w for 16 cores (12p+4e). AMD's 8-core U-series chips peak out at 65-70w on CPU-only workloads and still loses out massively in pretty much every category.

If you downclock that AMD chip, it does get more efficient, but also loses by even larger margins.



Because you're comparing a 16-core CPU to an 8-core CPU on threaded workloads, which as mentioned is where the multi-threaded workloads will favor the one with more cores on both performance and performance per watt. But why not compare it to the Ryzen that also has 16 cores, like the 7945HX3D? Because then the Ryzen is generally faster on threaded workloads, even though the TDP is still 55W -- and even though it's on TSMC 5nm instead of 3nm.



Geekbench has a separate page for each "instruction set".

For Apple you need to go to https://browser.geekbench.com/mac-benchmarks

Then compare numbers by hand I assume.

Though what I would love is compile-time vs. $ (as mentioned, I'm a software developer). The 7950x is $500 and a very fast SSD is $400, fast 64gb is $200, very good board is $400 so I get a very fast dev machine for ~$1700.



I compiled a few previously. Sorry for the formatting:

ASUS ROG Zephyrus G16 (2024)

Processor: Intel Core Ultra 9 185

Memory: 32GB

Cargo Build: 31.85 seconds

Cargo Build --Release: 1 minute 4 seconds

ASUS ROG Zephyrus G14 (2024)

Processor: AMD Ryzen 8945HS / Radeon 780M

Memory: 32GB

Cargo Build: 29.48 seconds

Cargo Build --Release: 34.78 seconds

ASUS ROG Strix Scar 18 (2024)

Processor: Intel Core i9 14900HX

Memory: 64GB

Cargo Build: 21.27 seconds

Cargo Build --Release: 28.69 seconds

Apple MacBook Pro (M3 Pro 11 core)

Processor: M3 Pro 11 core

Cargo Build: 13.70 seconds

Cargo Build --Release: 21.65 seconds

Apple MacBook Pro 16 (M3 Max)

Processor: M3 Max

Cargo Build: 12.70 seconds

Cargo Build --Release: 15.90 seconds

Firefox Mobile build:

M1 Air: 95 seconds

AMD 5900HX: 138 seconds

Source: https://youtu.be/QSPFx9R99-o?si=oG_nuV4oiMxjv4F-&t=505

Javascript builds

Here, Alex compares the M1 Air running Parallels emulating Linux vs native Linux on AMD Zen2 mobile. The M1 is still significantly faster. https://youtu.be/tgS1P5bP7dA?si=Xz2JQmgoYp3IQGCX&t=183

Docker builds

Here, Alex runs Docker ARM64 vs AMD x86 images and the M1 Air built the image 2x faster than an AMD Zen2 mobile. https://youtu.be/sWav0WuNMNs?si=IgxeMoJqpQaZv2nc&t=366

Anyways, Alex has a ton more videos on coding performance between Apple, Intel and AMD.

Lastly, this is not M1 vs Zen2 but it's M2 vs Zen4.

LLVM build test

M2 Max: 377 seconds

Ryzen 9 7940S: 826 seconds



@aurareturn really appreciate the comparison and your effort (upvoted). As I no longer use a laptop (don't need it, too expensive, breaks, no upgrades, but that is just me), browsing through that channel looks like he focuses on laptops.

Would love to see a 7950x/64gb/SSD5 comparison; perhaps he will create one in the future (see https://www.octobench.com/ for the SSD impact on Go compilation; channel bookmarked). But if I still needed to use a laptop, I would probably switch back to Apple (I have an iMac Pro standing on the shelf as decoration; it was my last Apple dev machine).

The $5000 16 Pro looks great as a machine. When still working at eBay, the nice thing was one always got the max specced machine as a developer back in the days - so that would probably be it. Real nice one.

[Edit]

Someone suggested looking at Geekbench Clang, which brought some insights for my desktop usage:

(it looks like top CPUs are more or less the same, ~15% difference)

"Randomly" picking

    M2 Ultra   233.9 Klines/sec
    7950x      230.3 Klines/sec  
    14900K     215.3 Klines/sec  
    M3 Max     196.5 Klines/sec


Ah right; that's confusing! Seems that the AMD and Intel chips are much faster though, consistent with other benchmarks.

Speed vs. $ is of course a different story than pure speed; kinda hard to capture in a number I guess.

Most Go projects compile more than fast enough even on my 7 year old i5, although there are exceptions (mostly crummy hyper-overengineered projects).



I think it depends on what you do. If you write a database in Go, there is no problem with 5min compile time. If you write a web app, 10 sec compile times are already annoying.



Here's GB6: https://browser.geekbench.com/v6/cpu/compare/6339005?baselin...

Note: M3 Max is a 40w CPU maximum, while 7950x is a 230w CPU maximum. The stated 170w max is usually deceptive from AMD.

Source for 7950x power consumption: https://www.anandtech.com/show/17641/lighter-touch-cpu-power....

Note that the M3 Max leads in ST in Cinebench 2024 and 2-3x better in perf/watt. It does lose in MT in Cinebench 2024 but wins in GB6 MT.

Cinebench is usually x86 favored as it favors AVX over NEON as well as having extremely long dependency chains, bottlenecked by caches and partly memory. This is why you get a huge SMT yield from it and why it scales very highly if you throw lots of "weak" cores at it.

This is why Cinebench is a poor CPU benchmark in general as the vast majority of applications do not behave like Cinebench.

Geekbench and SPEC are more predictive of CPU speed.



It the end, what matters is real-world performance and different workloads have different bottlenecks. For people who use Cinema 4D, Cinebench is the most accurate measurement of hardware capabilities they can get. It's very hard to generalize what will matter for the vast majority of people. I find it's best to look at benchmarks for the same applications or similar workloads to what you'll be doing. Single score benchmark like Geekbench are fun and quick way to get some general idea about CPU capabilities, but most of the time they don't match specifics of real-world workloads.

Here's a content creation benchmark (note that for some tasks a GPU is also used):

https://www.pugetsystems.com/labs/articles/mac-vs-pc-for-con...



Cinebench is being used like a general purpose CPU benchmark when, like you said, it should only be used to judge the performance of Cinema 4D. Cinema 4D is a niche software in a niche. Why is a niche in a niche software application being used to judge overall CPU performance? It doesn't make sense.

Meanwhile, Geekbench does run real world workloads using real world libraries. You can look at subtest scores to to see "real world" results.

Pugetsystem benchmarks are pretty good. It shows how Apple SoCs punch above their weight in real world applications over benchmarks.

Regardless, they are comparing desktop machines using as much as 1500 watts vs a laptop that maxes out at 80 watts, and Apple is still competing well. The wins in the PC world are usually due to beefy Nvidia GPUs that applications have historically optimized for.

That's why I originally said ARM is leading AMD - specifically Apple ARM chips.



Some Geekbench 6 workloads and libraries:

- Dijkstra's algorithm: not used by vast majority of applications.

- Google Gumbo: unmaintained since 2016.

- litehtml: not used by any major browser.

- Clang: common on HN, but niche for general population.

- 3D texture encoding: very niche.

- Ray Tracer: a custom ray tracer using Intel Embree lib. That's worse than Cinebench.

- Structure from Motion: generates 3D geometry from multiple 2D images.

It also uses some more commonly used libraries, but there's enough niche stuff in Geekbench that I can't say it's a good representation of a real world workloads.

> Regardless, they are comparing desktop machines using as much as 1500 watts vs a laptop that maxes out at 80 watts and Apple is still competing well. The wins in the PC world are usually due to beefy Nvidia GPUs that are applications have historically optimized for.

They included a laptop, which is also competing rather well with Apple offerings. And it's not PC's fault you can't add a custom GPU to Apple offerings.



> Cinebench R23 was built on Intel Embree. ;)

Yes, and the renderer is the same as in Cinema 4D which is used by many, while custom ray tracer build for Geekbench is not used outside benchmarking.

The blog post you linked just shows that one synthetic benchmark correlates with another synthetic benchmark. Where do SPEC CPU2006/2017 benchmarks guarantee or specify correlation with real-world performance workloads?

Cinebench fits the characteristics of a good benchmark as defined by SPEC [1]. It's not a general benchmark, but a biased one. It's floating-point intensive. It should do a perfect job of evaluating Cinema 4D rendering performance, a good job as a proxy for other floating-point-heavy workloads, but a poor job as a proxy for non-floating-point-heavy workloads. The thing is, most real-world workloads are biased. So no single-score benchmark result can do an outstanding job of predicting CPU performance for everyone (which is what a general CPU benchmark with a single final score aims to do). They are useful, but limited in their predictive abilities.

[1] https://www.spec.org/cpu2017/Docs/overview.html#Q2



Cinebench is fine if you’re looking for how your cpu performs in cinema4d. The problem is users are using it as a general purpose cpu benchmark. It isn’t. It sucks at it.

SPEC is the industry standard for CPU benchmarking.



"But the vast majority will use GPUs to do rendering for Blender."

And the argument is, you can't use Blender to compare CPU performance because of that?

"Even my multithread Go apps seem to run better on Apple Silicon."

As a Go developer, I'd love to hear your story: how much faster does your Apple Silicon machine compile compared to a Zen 4 (e.g. the 7950x)? For example, 100k lines of Go code.

I might switch back to Apple again (used Apple for 20+ years), if it's faster at compilation speed.



M4 is looking pretty interesting. Near 10% IPC uplift, and they bumped the e-core count on the base M4, so we're probably looking at the same 12 p-cores for the M4 Max, but likely going from 4 to 12 e-cores (two of the 6-core complexes).

In multithreaded workloads, 2 of their current e-cores are roughly equivalent to 1 p-core, so that would represent the equivalent of 4 extra p-cores.



> How much faster does your Apple Silicon machine compile compared to a Zen 4 (e.g. the 7950X)?

Good ol' compare-a-$400-piece-of-equipment-with-a-$3000-piece-of-equipment. I wonder which will win. (Unironically, most of the time the $3000 piece of equipment doesn't win.)



On Geekbench (though they are segregated into separate pages, so I'm not sure the comparison is fully valid), the M2 Ultra is behind the top 3 PC processors (2 Intel and 1 AMD) for multi-core, and it is indeed the best at single-core.



Would that be comparing to Windows on Zen 4 or to Linux on Zen 4? On Windows I've noticed that forking performance in particular takes a big hit, which causes many dynamic languages that do things like invoking a runtime binary to be hundreds of times slower on Windows (tried with bash and Python).
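
For anyone who wants to poke at this themselves, here is a minimal sketch of the kind of measurement I mean (POSIX-only, so it covers the Linux side; the 1000-iteration count and the /bin/true target are arbitrary choices of mine). It just times repeated fork+exec of a trivial binary, which is the operation that gets so expensive on Windows:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        const int n = 1000;                      /* number of child processes to spawn */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < n; i++) {
            pid_t pid = fork();
            if (pid < 0) { perror("fork"); return 1; }
            if (pid == 0) {                      /* child: exec a no-op binary */
                execl("/bin/true", "true", (char *)NULL);
                _exit(127);                      /* only reached if exec fails */
            }
            waitpid(pid, NULL, 0);               /* parent: wait for the child to exit */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d fork+exec pairs: %.3f s total, %.3f ms each\n", n, s, s * 1000.0 / n);
        return 0;
    }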



There's something wrong with your M3 Max stuff. I believe it comes in 14 and 16-core variants while the M3 Pro comes in 11 and 12-core variants.

In any case, the M3 Max uses less than 55 W of power in CPU-only workloads, while a desktop 7950X peaked at 332 W according to Guru3D (without an OC).

The fact that the M2 Ultra gets so close while peaking at only around 100 W of CPU power is pretty crazy (the M2 Ultra doesn't even hit 300 W with all CPU and GPU cores maxed out).



> Blender Benchmark

Maybe use a benchmark that actually makes sense for CPUs, rather than something that's always much faster on a GPU (e.g. an M3 Pro, used for Blender the way any sane user would use it, delivers 2.7x the performance of a Ryzen 7950X, not 0.4x).

> Apple Mac mini, 32gb, M2 Ultra, 2TB SSD, $2600

Not a real thing. You meant M2 Pro, because the Max and Ultra chips aren't available in the Mac mini.



Is Blender 100% GPU now? Last time I used it, there were multiple renderers available, and it wasn't a 100% win to switch to GPU. IIRC the CPU did better at ray tracing(?). This was a couple of years ago though, so things may have changed or I might not be recalling correctly.



I think GPU rendering was always faster as long as you had a supported GPU. Now that the Cycles renderer has support for all the major GPU APIs/vendors, the only reasons to render on the CPU are if you don't have a half-decent GPU, or if your scene doesn't fit in your GPU's memory. Neither of those are a concern on Apple systems.

At least on NVIDIA hardware, Blender can use the GPU's raytracing capabilities rather than just the general-purpose GPU compute capabilities. Which means it doesn't take a very expensive GPU at all to outperform high-end CPUs.



I would love to quote a compiler comparison, but I don't know a good and accepted compiler benchmark. What would you use as a compiler benchmark? (Preferably Go, but I assume Rust would be better, as it is much slower, so the differences are bigger)

(corrected my c&p mistake with the mini, thanks)



Thanks, Clang looks good. Now I need to figure out how to sort CPUs/systems by the Clang benchmark; no success so far.

"Randomly" picking

    14900K     215.3 Klines/sec  
    7950x      230.3 Klines/sec  
    M2 Ultra   233.9 Klines/sec
    M3 Max     196.5 Klines/sec


Single thread:

M3 Max: 3898

7950x: 2951

The ST advantage of Apple Silicon is real. 7950x does do better in highly parallel tasks.

To me, Apple Silicon is clearly leading AMD/Intel in client. Hence my original reason for why AMD's announcement isn't "exciting": Apple Silicon is so far ahead in client.

Of course, AMD can crank up the core count via Epyc/Threadripper, and Apple has no answer there. For that, you'd need to look to ARM chips from Ampere/Amazon for a competitor.



Building clang using itself is a reasonable approximation of a compiler benchmark, speaking as someone who spends a depressing fraction of his life doing that over and over for permutations of the source code. That's somewhere in the five-to-ten-minute range on a decent single-socket system.
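
If you want the timing to be repeatable, here is a minimal sketch of the wrapper I'd use (the three-run default and the assumption that the build is driven by a single shell command are mine). It just times each run of whatever command you pass it, so you can point it at a clang self-build, a Chromium build, or a large go build:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s \"<build command>\" [runs]\n", argv[0]);
            return 1;
        }
        int runs = (argc > 2) ? atoi(argv[2]) : 3;   /* default to three timed runs */
        for (int i = 0; i < runs; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (system(argv[1]) != 0) {              /* run the build command via the shell */
                fprintf(stderr, "run %d failed\n", i + 1);
                return 1;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("run %d: %.1f s\n", i + 1, s);    /* wall-clock time of this run */
        }
        return 0;
    }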



Can you point me to a comparison site? I didn't find an M3/M2/7950/... comparison site for Chromium compile times :-(

(Even Phoronix is scarce on this and mostly focuses on laptops - I have no laptop.)



ARM needs more than just proper UEFI support: Microsoft needs to lift the secureboot restrictions on ARM.

x86: Microsoft requires that end-users are allowed to disable secure boot and control which keys are used.

arm: Microsoft requires that end-users are not allowed to disable secure boot

This isn't a hardware issue but simply a policy issue that Microsoft could solve with a stroke of a pen; and since Microsoft is such a behemoth in the laptop space, their policies control the non-Apple market.

source: https://mjg59.dreamwidth.org/23817.html



It's like I keep saying: the first Chinese manufacturer to churn out cheap SBCs with ServerReady support will make a killing as a true Pi killer. Anyone? Anyone? Pine64? Pine64?



We don't really need a Pi killer anymore. They've done a fine job of killing it themselves. Their community has shrunk massively due to low-cost mini PCs being leaps and bounds better value than a Pi now. The two fingers they've put up at the hacker/tinkerer hobbyist market over the last few years, combined with the IPO and shift to B2B, have made it very clear where their priorities lie.

Unless you need the GPIO, there's zero reason to overpay for a Pi 5, for example, when you can pick up decent second-hand mini PCs on eBay for a lower price.

Case in point: a couple of months ago I was able to nab two brand-new, still-in-box Dell OptiPlex 3050s (Core i7-6700T, 4 cores, 2.8 GHz, 16 GB RAM, Win 10 hardware license, 256 GB SSD, with mouse & keyboard) for £55 each delivered. The base 4 GB model Pi 5 comes in at £80-£100 once you add power, storage and a case.

Sure, it's not ARM, but you're not likely to be doing anything that _needs_ ARM.



> Unless you need the GPIO theres zero reason to overpay for a Pi 5

Even then, both USB-to-GPIO adapters and mini PCs with GPIO exist. Unless you want something really small - then there's still the Pi Zero and Arduino.



That's about the only area where the Pi is superior at this point. For the use cases where a Zero is all you need, it's a no-brainer, but many are using them as home servers and such. Even a basic wireless camera feed can be a struggle for the Zero, so its use cases are certainly limited, but its power-to-performance is great for its price.



Arguably that single company is ASML. There are more fabs (e.g. Intel), but AFAIK cutting-edge nodes all use ASML EUV lithography machines?



Intel still fabs their own CPUs; their dedicated Xe graphics are made by TSMC, though.

The Nvidia 30-series was fabbed by Samsung.

So there is some competition in the high-end space, but not much. All of these companies rely on buying lithography machines from ASML.



>Intel still fabs their own CPUs

Isn't Lunar Lake made by TSMC? Supposedly it has comparable efficiency to AMD/Apple/Qualcomm, at the cost of making Intel's fab business even less profitable.



The real problem is, any disruption of Taiwan and China's ability to produce and sell semiconductors right now is going to have wide ranging impacts on the economy.

If you thought prices were high during pandemic shortages, strap in.

It's probably not a coincidence that as soon as the US starts spending billions to onshore semiconductor production, China begins a fresh round of more concrete saber-rattling. (Yes, there's likely other factors too.)



I normally rebuild my workstation every ~4 years (recently more for fun than out of actual need for more processing power), might finally do it again (preferably a recent 8-/12-core Ryzen). My most recent major upgrade was in 2017 (Core i5 3570K -> Ryzen 7 1700X), with a minor fix in 2019 (Ryzen 7 2700, since 1700X was suffering from the random-segfaults-during-parallel-builds issue).



Same. I'm on a 10700K and thinking about an upgrade. I'll wait for the X3D parts to come out (assuming they're doing that this gen; not sure if we've got confirmation), and compare against 15th-gen Intel once it's out in September-ish.



Might be worth considering a Ryzen 9 5900XT (just launched as well) for a drop-in upgrade. I've been running a 5950X since close to launch and am still pretty happy with it.



Interesting unveil; it seems everyone was expecting more significant architectural changes though they still managed to net a decent IPC improvement.

In light of the "very good but not incredible" generation-over-generation improvement, I guess we can now play the "can you get more performance for less dollars buying used last-gen HEDT or Epyc hardware or with the newest Zen 5 releases?" game (NB: not "value for your dollar" but "actually better performance").



The all-core max clock speed costs a lot of watts for that last 10% - close to 50% of total CPU consumption in some cases.

That's why undervolting has become a thing to do (unless you're an Intel CPU marketer) - give up a few percent of your all-core max clock rate and cut your power draw by a lot.



It's weird that CPUs tend to use more power every two years now instead of less. Looks like this time the TDP didn't increase. Yay, I guess. Don't most people think their PCs are fast enough already?

It's as if our planet weren't being destroyed at a frightening speed. We're headed towards a cliff, but instead of braking, we're accelerating.



The people who think PCs are already fast enough don't buy CPUs every year.

A 7950X in Eco mode is ridiculously capable for the power it pulls but that's less of a selling point.



Compared to Intel, AMD seems to bin their chips a lot less. Intel has their -T chips, which would be nifty if Intel weren't so far behind in terms of efficiency.



The TDP increases more slowly than the efficiency, so for the same workload a newer CPU consumes similar or less power. Especially nowadays, most of these chips only get anywhere near the stated TDP when nearly fully maxed out.



You can run 4x32 GB DDR5 on current-gen AMD consumer platforms, but don't expect speeds above 4400 MHz in quad-channel regardless of what the modules are rated for. I'd instead suggest dual-channel 48 GB DIMMs for 96 GB at full speed.



Four DIMMs do not equal quad-channel. I see in AMD's presentation that quad-channel is supported by the chipsets, but I am not aware of a current AMD consumer/HEDT chip with 4 memory channels.



True, but the Threadripper requires the TRX50 platform, so I still don't understand what is operative about the statement that the X870E chipset supports quad channel memory.



Good point. Makes me wonder if they'd consider a Threadripper version that uses that chipset, since the non-Pro Threadrippers have been in a weird spot ever since Ryzen went up to 16 cores. An AM5 Threadripper might not make sense because one of the appeals was the much larger number of PCIe lanes, but I wonder if they would pair the X870E with a new socket for Zen 5 non-Pro Threadrippers.



A Threadripper with 24-32 cores but only 2 channels of memory, that worked in the AM5 socket, would fill a nice hole in the AMD story. I don't know if AM5 has enough power pins to do the job, though. If you want more than 16 cores right now you have to spend $2k on the CPU and motherboard, triple what it costs to get a 16-core Ryzen.



I just had a look, and the comparison website I use says there are 113 motherboards that support 128 GB or more memory for AM5 (in my local market, so the US probably has even more).

128 GB isn't exactly a lot, so it would surprise me if it weren't supported.



I think for a while there, the only way to use 128 GB was to go TR4 or TRX... I kind of stopped looking for a while, but 100+ boards is certainly a nice change.



geizhals.de

A very good website for comparing gadgets/electronics and looking for good prices.

It's for the German market though (and a small number of Austrian shops).



I'm running the same as above. However, do check both the memory and motherboard manufacturers' tested-with lists (QVL, qualified vendor list) before purchasing! Each product should have its own compatibility listing.



Give the server ones a look. 448 GB here (one stick out of eight died and I have not been proactive about replacing it). Epyc comes in some SKUs that make a lot of sense for a desktop, or the seriously high-spec ones from a previous generation are going cheap.



Improving AVX-512 is really good, since this is the data-unit sweet spot: a cache line is 512 bits (64 bytes), the same width as an AVX-512 register.
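
As a rough illustration (a sketch assuming an AVX-512F-capable CPU and a compiler flag like -mavx512f; the integer-sum kernel is just an example I picked), each 512-bit load below consumes exactly one 64-byte cache line:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sums 32-bit integers, loading one 64-byte cache line (= one 512-bit
       ZMM register) per iteration. n must be a multiple of 16 here. */
    static int32_t sum_i32(const int32_t *data, size_t n) {
        __m512i acc = _mm512_setzero_si512();
        for (size_t i = 0; i < n; i += 16) {             /* 16 x int32 = 64 bytes */
            __m512i v = _mm512_loadu_si512((const void *)(data + i));
            acc = _mm512_add_epi32(acc, v);
        }
        return _mm512_reduce_add_epi32(acc);             /* horizontal sum */
    }

    int main(void) {
        /* 64-byte aligned buffer, so every load touches exactly one cache line */
        static _Alignas(64) int32_t data[1024];
        for (int i = 0; i < 1024; i++) data[i] = i;
        printf("%d\n", sum_i32(data, 1024));             /* prints 523776 */
        return 0;
    }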

But I am more interested in the cleanup of the GPU hardware interface (it should be astonishingly simple to program the GPU with its various ring buffers, as is rumored to be the case on the Nvidia side) AND in the squashing of all hardware shader bugs: look at Valve's ACO compiler errata in Mesa - AMD's shader hardware is a bug minefield. Hopefully GFX12 did fix ALL KNOWN SHADER HARDWARE BUGS. (Sorry, ACO is written in that horrible C++, I dunno what went through Valve's head - and no, Rust syntax is as complex as C++, so that's toxic too.)



> TSMC N4

Question: am I understanding this correctly that AMD will be using a node from TSMC that's 2 years old, and in a way it's even older than that?

Because N4 was like an "N5+" (and the current gen is "N3+").

EDIT: why the downvotes for a question?



Yes, AMD is not using the leading-edge node. The older node is cheaper/has better yield and more importantly has much larger supply than N3, which Apple has likely completely devoured.

I am personally very curious how it compares to Intel's 15th gen, which is rumored to be on the Intel 20A process.



I wouldn't be so sure about that. Isn't Apple Silicon performance mostly benchmarked using Geekbench?

AMD claims a +35% IPC improvement in that specific benchmark, due to improvements in the AVX-512 pipeline.



AMD claims a +35% IPC improvement in the AES subtest of Geekbench 6. That's not the entire Geekbench 6 CPU suite. It's deceptive.

The overall GB6 improvement is likely only around 10-15%, because that's how much IPC improved while clock speed stays the same.



How come a 15% generation-over-generation IPC increase is a disappointing result? There might be greener pastures, I agree, but a 15% year-over-year increase in the quality of a product is nothing to be disappointed about. It's good execution, even more so in a mature and competitive sector such as microelectronics.



It's not year over year. It's 2 years.

It's disappointing because M4 is significantly ahead. I would expect Zen to make a bigger leap to catch up.

Also, this small leap opens the door for Intel's Arrow Lake to take the lead.



Actually, from Zen 4 to Zen 5 there is a little more than a year and a half (supposing they do indeed go on sale in July), almost matching the announcement AMD made around the launch of Zen 4 that they would shorten the time between generations from 2 years to 1.5 years.



Of course it's real. But not only is M4 significantly faster in ST, it'll probably be 3-4x more efficient than Zen 5 mobile, just like M3 is against Zen 4 mobile.

This difference in efficiency cannot be explained by node advantage alone, as N3E vs. N4 is probably only around 15-20% more efficient.
