The Evolution of x86 SIMD: From SSE to AVX-512

Original link: https://bgslabs.org/blog/evolution-of-x86-simd/

## The Battle for x86 SIMD Vector Supremacy: A History

The story of x86 SIMD (Single Instruction, Multiple Data) is not just about technology; it also involves marketing, corporate strategy, and engineering compromise. It began with Intel's 1993 gamble on developing the Pentium MMX in Israel, which introduced 64-bit registers but controversially *aliased* them onto the existing floating-point registers to avoid operating-system changes, a trade-off that limited performance. Despite modest initial real-world gains, Intel promoted MMX aggressively, even suing AMD for trademark infringement after AMD described it as "Matrix Math Extensions." AMD then launched 3DNow!, adding floating-point SIMD capability, and Intel countered with SSE (Streaming SIMD Extensions) and then SSE2, escalating the "instruction wars." Later extensions such as SSE3 and SSSE3 were driven by performance demands, with SSSE3 addressing architectural weaknesses. The final escalation was AVX and AVX-512, pushing vectors to 256 and then 512 bits. AVX-512, however, proved problematic: its power draw caused processors to downclock, and it fragmented across Intel's product lines. Criticism, most famously from Linus Torvalds, ultimately led Intel to disable it. AMD initially resisted, later implemented AVX-512 efficiently using a "double-pumping" technique, and is now pushing its own advances. This history reveals the ongoing tension between backward compatibility, market demands, and engineering constraints that shaped the evolution of x86 vector processing.


The story of x86 SIMD is not simply about technology. It's about marketing, corporate politics, engineering compromises, and competitive pressure. This is the behind-the-scenes history of how Intel and AMD battled for vector supremacy, the controversial decisions that defined an architecture, and the personalities who made it happen.


Part I: The Not-So Humble Beginnings (1993-1999)

The MMX Gamble: Intel’s Israel Team Takes a Huge Risk

The story of MMX begins not in Santa Clara, but in Haifa, Israel. In 1993, Intel made an unprecedented decision: they would let their Israel Development Center design and build a mainstream microprocessor, the Pentium MMX, the first time Intel developed a flagship processor outside the United States.

This was a massive gamble. According to Intel’s own technology journal, the development of MMX technology spanned five years and involved over 300 engineers across four Intel sites. At the center of this effort was Uri Weiser, director of the Architecture group at the IDC in Haifa.

Uri Weiser later recalled the struggle with characteristic understatement: "Some people were ready to quit." He was named an Intel Fellow for his work on the MMX architecture, a rare honor that speaks to the significance of what the Israel team accomplished.

Meanwhile, in Haifa, 300 engineers were about to make a decision that would haunt x86 for the next three decades.

The Technical Reason for the Controversial Register Decision

Here is where things get spicy. The most consequential and controversial decision in MMX design was register aliasing. Intel aliased the 8 new MMX registers (MM0-MM7) directly onto the existing x87 floating-point register stack (ST(0)-ST(7)).

Why they did this: To avoid adding new processor state. At the time, operating systems only knew how to save/restore the x87 FPU registers during context switches. Adding 8 entirely new registers would have required OS modifications across Windows, Linux, and every other x86 OS.

This was the 1990s, remember, convincing Microsoft to change Windows was roughly as easy as convincing your cat to enjoy water sports.

The cost: You cannot mix floating-point and MMX instructions in the same routine without risking register corruption. Programmers had to use the EMMS (Empty MMX State) instruction to switch between modes, and even then, there was overhead. Think of it like sharing a closet with your neighbor: sure, it saves space, but good luck finding your socks when they've mysteriously migrated to the other person's side.

The register state mapping can be expressed as:

$$ \forall i \in \{0,\dots,7\}: \text{MM}_i \equiv \text{ST}(i) $$

where $\equiv$ denotes hardware-level aliasing (same physical storage).

Intel’s engineers knew this was a compromise. But they made a calculated bet: most multimedia applications separate data generation (FP) from display (SIMD), so the restriction would rarely matter in practice.

They were mostly right. Mostly…

The “MMX” Naming Controversy

Intel pulled a masterstroke with the MMX name. Officially, MMX is a meaningless initialism, not an acronym at all. Intel trademarked the letters “MMX” specifically to prevent competitors from using them.

The internal debate: Unofficially, the name was derived from either:

  • MultiMedia eXtension
  • Matrix Math eXtension

Intel has never officially confirmed which, because apparently they wanted to preserve the mystique. Or maybe they forgot. Hard to say.

When AMD produced marketing material suggesting MMX stood for "Matrix Math Extensions" (based on internal Intel documents), Intel sued AMD in 1997 with the enthusiasm of a trademark troll at a convention, claiming trademark infringement. AMD argued that "MMX" was a generic term for multimedia extensions.

The settlement: AMD eventually acknowledged MMX as Intel’s trademark and received rights to use the name on their chips. But Intel’s aggressive legal stance sent a message: this was their playground, and competitors would have to find their own identity. (Looking at you, 3DNow!)

The Marketing Hype Backlash

Intel launched MMX with a Super Bowl commercial featuring Jason Alexander, promising revolutionary multimedia capabilities. The hype was enormous. This was 1997, when Super Bowl commercials were still an event and people actually watched them for the ads.

When the Pentium MMX shipped, reviewers found that for non-optimized applications, the real-world performance gain was only 10-20%, mostly from the doubled L1 cache (32KB vs 16KB).

One technology journalist called MMX “90% marketing and 10% technical innovation.” PC Magazine Labs found only modest gains for existing Windows 95 applications.

Intel’s defense: They claimed 50-700% improvements for MMX-optimized software, but the catch was obvious: almost no software was optimized at launch.

Now, to put this into perspective, a textbook example of where MMX would help is a function like this:

void add_i32(int* dest, const int* a, const int* b, int n);

void add_i32(int* dest, const int* a, const int* b, int n)
{
    int i;
    for (i = 0; i < n; ++i) {
        dest[i] = a[i] + b[i];
    }
}

Which in turn should produce a beautiful MMX loop using delicacies like

movq  mm0, [a+i]      ; load two 32-bit ints from a
movq  mm1, [b+i]      ; load two 32-bit ints from b
paddd mm0, mm1        ; two 32-bit adds in a single instruction
movq  [dest+i], mm0   ; store both results

(or even better just unroll the loop, but for the sake of argument I’m omitting that)

but in reality GCC 2.7.2.3 produced this:


.L5:
    movl (%ebx,%eax,4), %edi    # load a[i]
    addl (%ecx,%eax,4), %edi    # add b[i]
    movl %edi, (%esi,%eax,4)    # store dest[i]
    incl %eax
    cmpl %edx, %eax
    jl .L5

Comparing the two is comparing a car to a bicycle. Yes, the scalar version is correct, but it's simply too slow.

There is no "polite" C code in 1997 that nudges GCC 2.7.x into emitting MMX. You can write restrict (not even standardized until C99), but it will not help.

You either write MMX explicitly, or you don’t get MMX at all.
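For readers curious what "writing MMX explicitly" looks like from C, here is a minimal sketch using the MMX intrinsics that GCC only gained in version 3.1, so this is deliberately anachronistic for 1997, when the realistic option was inline assembly. The function name and the pointer-cast loads are my own illustration:

#include <mmintrin.h>  /* MMX intrinsics: GCC 3.1+, not available in 1997 */

/* Hypothetical sketch: the add_i32 loop, two ints per iteration. */
void add_i32_mmx(int* dest, const int* a, const int* b, int n)
{
    int i;
    for (i = 0; i + 2 <= n; i += 2) {
        __m64 va = *(const __m64*)&a[i];           /* movq: load 2 ints  */
        __m64 vb = *(const __m64*)&b[i];           /* movq: load 2 ints  */
        *(__m64*)&dest[i] = _mm_add_pi32(va, vb);  /* paddd + movq store */
    }
    for (; i < n; ++i)   /* scalar tail for odd n */
        dest[i] = a[i] + b[i];
    _mm_empty();         /* EMMS: hand the aliased registers back to x87 */
}

Note the mandatory _mm_empty() at the end, the intrinsic form of the EMMS discipline described earlier.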

See Appendix B for comprehensive line-by-line analysis of this assembly output, including verification of GCC 2.7.2.3 authenticity and explanation of why MMX instructions were not generated.

SSE: Intel's Response to AMD's 3DNow!

While MMX was still proving itself, Intel’s product definition team made a bold proposal: add SIMD floating-point capabilities to the next processor, code-named “Katmai.”

The internal debate: Intel executives were hesitant. MMX hadn’t even shipped yet. Were they betting too heavily on SIMD? Was this just another marketing gimmick?

According to Intel's own account, the meeting was "inconclusive"; executives demanded answers to more questions. Two weeks later, they gave the OK for Katmai (later named Pentium III).

Meanwhile, in Sunnyvale, California, AMD was watching. And plotting.

AMD’s 3DNow!, introduced in the K6-2 in May 1998, was a direct response to MMX’s biggest weakness: no floating-point SIMD. AMD added 21 instructions that could handle single-precision floating-point operations in parallel.

Suddenly, Intel’s fancy new multimedia extension couldn’t actually do the floating-point math that 3D graphics required. Oops :p

When Pentium III (Katmai) shipped in February 1999, it introduced SSE (Streaming SIMD Extensions) with 70 new instructions and 8 entirely new 128-bit registers (XMM0-XMM7).

Intel added new registers, which meant new processor state and required OS modifications (looking at you again, Microsoft). Nevertheless, Intel implemented the 128-bit floating-point units in a "hack" way: a 4-wide SSE instruction gets broken into two 64-bit microinstructions, executed on two separate units.

Intel “sorta” succeeded in adding 128-bit SIMD FP. The implementation was clever, efficient, and space-conscious, but it was a hack that would haunt optimization efforts for years. The word “sorta” appears in technical documentation approximately never, which tells you something about just how much of a hack this was!

It might be worth noting that this split execution persisted for a long time (through the Pentium III and Pentium M eras). Intel didn't get true single-cycle 128-bit width until Core 2 (Conroe) for some operations, and fully only in later generations. AMD actually beat them to true 128-bit hardware execution units with the K8/K10 in some respects.
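To make the new capability concrete, here is a minimal sketch of the 4-wide single-precision add that SSE enabled, written with SSE intrinsics; the function name and the alignment and multiple-of-4 assumptions are mine:

#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical sketch: assumes n is a multiple of 4 and the pointers
   are 16-byte aligned, as _mm_load_ps requires. */
void add_f32_sse(float* dest, const float* a, const float* b, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);              /* movaps load   */
        __m128 vb = _mm_load_ps(&b[i]);              /* movaps load   */
        _mm_store_ps(&dest[i], _mm_add_ps(va, vb));  /* addps + store */
    }
}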


Part II: The SSE Wars (2000-2008)

SSE2 (2000): The Pentium 4’s “Sledgehammer”

Intel’s SSE2 wasn’t driven by a new application breakthrough. It was a defensive move against AMD’s 3DNow! and the looming threat of K7 (Athlon).

Intel's Willamette team in Hillsboro, Oregon was under immense pressure. AMD's K6-2 had demonstrated that SIMD instructions mattered for gaming and 3D graphics. Intel internally called Willamette "Sledgehammer".

The key driver was real-time 3D gaming and DirectX performance. Microsoft had been pushing Intel for better SIMD support since DirectX 7.

SSE2 introduced 144 new instructions, including double-precision FP:

movapd   xmm0, [rax]   ; aligned load of two doubles
addpd    xmm0, xmm1    ; two double-precision adds

mulpd    xmm0, xmm2    ; two double-precision multiplies
sqrtpd   xmm0, xmm3    ; two double-precision square roots

unpckhpd xmm0, xmm1    ; interleave the high doubles of both sources

cvtpd2ps xmm0, xmm1    ; convert two doubles down to singles

SSE3 (2004): Prescott’s Reckoning

SSE3’s official driver was “media encoding improvements,” but the real story is far more troubled.

SSE3 was introduced with Prescott (aka PNI, Prescott New Instructions), the 90nm Pentium 4 revision that would become Intel's biggest nightmare. The 13 new instructions were heavily trimmed due to power concerns.

The new instructions could be used to accelerate 3D workflows and video codecs. As usual, Intel released the hardware first and waited for software to catch up later, with one exception: the Intel C++ 8.0 compiler, which supported SSE3 instructions at launch.

"Although Intel released the SSE3 instructions guidelines for software developers last summer, there are no programs yet… according to Intel, the LDDQU instruction could speed up video compression by 10% if used in data encoding algorithms…"

— Ilya Gavrichenkov (xbitlabs.com, 02.01.2004)

Horizontal operations (operating across elements within a single register) were a new concept:


haddpd   xmm0, xmm1   ; horizontal add: {xmm0[0]+xmm0[1], xmm1[0]+xmm1[1]}

hsubpd   xmm0, xmm1   ; horizontal subtract: {xmm0[0]-xmm0[1], xmm1[0]-xmm1[1]}

movddup  xmm0, xmm1   ; duplicate the low double into both elements

movshdup xmm0, xmm1   ; duplicate the odd-index singles
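As a small illustration of why programmers wanted horizontal operations, here is a sketch of a pairwise sum using the SSE3 intrinsic for HADDPD; before SSE3 this took a shuffle plus an add. The function name is mine:

#include <pmmintrin.h>  /* SSE3 intrinsics (compile with -msse3) */

/* Hypothetical sketch: sum the two doubles in v with one haddpd. */
double sum2(const double v[2])
{
    __m128d x = _mm_loadu_pd(v);    /* x = { v[0], v[1] }           */
    __m128d s = _mm_hadd_pd(x, x);  /* s = { v[0]+v[1], v[0]+v[1] } */
    return _mm_cvtsd_f64(s);        /* extract the low element      */
}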

Intel executives had acknowledged the growing challenges with clock speed scaling as the industry hit what some called a “power wall.” Prescott’s 31-stage pipeline generated so much heat that Intel had to cut SSE3 instruction complexity to reduce power draw. The thermal challenges were significant enough that power efficiency became a primary concern in processor design.

SSSE3 (2006): The Core 2 Rebirth

SSSE3 (Supplemental Streaming SIMD Extensions 3) wasn't planned as a separate extension. It was a set of emergency additions to fix the Core architecture's weaknesses.

When Intel abandoned NetBurst for Core (Conroe/Merom), they discovered their new architecture lacked certain acceleration paths. The 16 new instructions in SSSE3 (including PMULHRSW, PABSB/PABSW/PABSD, and PALIGNR) were specifically designed to address common performance bottlenecks.

SSSE3 was introduced with the Intel Xeon processor 5100 series and the Intel Core 2 processor family. It offers 16 new instructions (32 opcodes, counting both the MMX and XMM register forms) to accelerate processing of SIMD integer data.


pabsb    xmm0, xmm1      ; absolute value of 16 signed bytes
pabsw    xmm0, xmm1      ; absolute value of 8 signed words
pabsd    xmm0, xmm1      ; absolute value of 4 signed dwords

phaddw   xmm0, xmm1      ; horizontal add of adjacent word pairs
phaddd   xmm0, xmm1      ; horizontal add of adjacent dword pairs
phsubw   xmm0, xmm1      ; horizontal subtract of adjacent word pairs
phsubd   xmm0, xmm1      ; horizontal subtract of adjacent dword pairs

pmulhrsw xmm0, xmm1      ; multiply words, round, keep the high 16 bits

palignr  xmm0, xmm1, 3   ; concatenate xmm0:xmm1, shift right 3 bytes

One could think that these are not real ALU instructions but rather the result of a cat walking across an Intel engineer's keyboard and pressing random buttons. If you thought that, you would be right about one of those assumptions.

Those are not purely ALU instructions.

These instructions were added without major changes to the Core microarchitecture. They did not introduce new arithmetic capabilities or execution units; they collapsed common multi-instruction SIMD idioms into single operations that mapped onto the existing ALUs and shuffle units.
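To see that idiom-collapsing in action, consider PALIGNR, which replaces the old load-two-blocks-and-combine dance for reading a misaligned 16-byte window. A sketch with the SSSE3 intrinsic (the function name and offset choice are mine):

#include <tmmintrin.h>  /* SSSE3 intrinsics (compile with -mssse3) */

/* Hypothetical sketch: given two adjacent aligned 16-byte blocks
   lo and hi, extract the 16 bytes starting at byte offset 3. */
__m128i window_at_3(__m128i lo, __m128i hi)
{
    /* palignr: concatenate hi:lo (32 bytes), shift right by 3 bytes */
    return _mm_alignr_epi8(hi, lo, 3);
}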

SSE4 (2007)

SSE4 was split into two parts: SSE4.1 (video/graphics) and SSE4.2 (database/text). This was deliberate; Intel didn't want video acceleration to wait for the database features to ship.

The H.264 video encoding explosion drove SSE4.1. By 2006, YouTube was growing fast, everyday video creation and consumption were devouring CPU cycles, and Intel needed hardware acceleration.

14 new video-oriented instructions were specifically designed for H.264 encoding:

  • MPSADBW - Multi-hypothesis motion estimation (4-byte-block SAD calculations)
  • PHMINPOSUW - Horizontal minimum and its position (used in motion vector selection)
  • DPPS/DPPD - Dot product (floating-point, for video filtering) …


mpsadbw xmm0, xmm1, 0        ; eight overlapping 4-byte SADs in one instruction

phminposuw xmm0, xmm1        ; minimum unsigned word and its position

dpps xmm0, xmm1, 0xFF        ; 4-wide dot product, result broadcast to all lanes

pmaxsb xmm0, xmm1            ; per-byte signed maximum
pminub xmm0, xmm1            ; per-byte unsigned minimum

pextrb eax, xmm1, 5          ; extract byte 5 into a general-purpose register
pinsrd xmm0, eax, 2          ; insert a GPR dword into element 2

In theory, the new instructions significantly accelerated motion estimation workloads.
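Here is what that looks like from C, as a minimal sketch around the SSE4.1 intrinsic for MPSADBW (the function name and block choice are mine):

#include <smmintrin.h>  /* SSE4.1 intrinsics (compile with -msse4.1) */

/* Hypothetical sketch: one mpsadbw computes eight overlapping SADs
   between the 4-byte block cur[0..3] and the ref windows starting
   at offsets 0..7: the inner step of block motion estimation. */
__m128i sad_search(__m128i ref, __m128i cur)
{
    /* imm 0: select cur bytes [0..3] and ref windows from offset 0 */
    return _mm_mpsadbw_epu8(ref, cur, 0);
}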

Penryn showed significant improvements in video encoding over Core 2 at same clock speeds. Intel’s Fall 2007 IDF demo showed x264 encoding performance improvements that were substantial enough to generate significant developer interest in optimizing their code.

SSE4.2 (2008): Nehalem’s Database Revolution

Intel's focus on data center and enterprise workloads wasn't born from an acquisition of an existing database team; it was shaped by two strategic XML acquisitions. In August 2005, Intel acquired Sarvega, an XML networking company. In February 2006, they followed up by acquiring Conformative, an XML processing startup.

These acquisitions could have brought expertise in text processing and XML acceleration into Intel’s Software and Solutions Group. The engineering knowledge from Sarvega and Conformative probably influenced the STTNI (String and Text New Instructions) in SSE4.2, first shipping with Nehalem in 2008.

Four instructions were specifically designed for database and string processing:

  • CRC32 - Hardware-accelerated checksums (for storage/network)
  • POPCNT - Population count (for Bloom filters, compression)
  • PCMPESTRI/PCMPISTRI - String comparison (for text search)


crc32 eax, byte [rax]        ; accumulate CRC-32C over one byte
crc32 eax, ax                ; ...over a 16-bit word
crc32 eax, eax               ; ...over a 32-bit dword

popcnt rax, rbx              ; count the set bits in rbx

pcmpestri xmm0, xmm1, 0x00   ; string compare, explicit lengths, index result

pcmpistri xmm0, xmm1, 0x04   ; string compare, implicit (NUL-terminated) lengths

The CRC32 instruction alone reduced ZFS/Btrfs checksum overhead significantly, making storage operations notably faster.
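The checksum use case is easy to show. Here is a minimal sketch with the SSE4.2 CRC32 intrinsics; note the instruction implements CRC-32C (the Castagnoli polynomial used by iSCSI and Btrfs, not the zlib polynomial). The function name and byte-at-a-time loop are mine; production code would consume 8 bytes per step:

#include <nmmintrin.h>  /* SSE4.2 intrinsics (compile with -msse4.2) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: CRC-32C over a buffer, one byte at a time
   for clarity; real code would use _mm_crc32_u64 on 8-byte chunks. */
uint32_t crc32c_update(uint32_t crc, const uint8_t* p, size_t len)
{
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);  /* crc32 r32, r/m8 */
    return crc;
}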

The new string processing instructions generated considerable discussion in the developer community. One example came from Austin Zhang of Intel, who claimed: "After basic testing with iSCSI and confirmed that the iSCSI head digest routines can be speeded up by 4x - 10x."

Intel initially wanted to call SSE4.2 “SSE5” but AMD had already announced SSE5 (with different 3-operand format). This led to the confusing naming that persists today, because nothing says “clear technical vision” like having two companies use the same numbers for completely different things.


Part III: The Birth of AVX (2008-2011)

March 2008: The Announcement

Intel officially announced AVX (then called "Gesher New Instructions") in March 2008. The codename "Gesher" means "bridge" in Hebrew; the name was later changed to "Sandy Bridge New Instructions" as the microarchitecture name took precedence.

Further details arrived through slides leaked in August 2008, which revealed Intel's roadmap including 8-core CPUs and the new AVX instruction set. Because nothing says "carefully planned announcement" like having your roadmap leaked to Engadget.

Why 256 Bits?

From Intel’s official documentation, three key factors drove the 256-bit decision:

  1. Floating-Point Performance Doubling: The primary goal was to double floating-point throughput for vectorizable workloads. Sandy Bridge’s execution units were specifically reworked to achieve this.

  2. Forward Scalability: As noted in Intel’s AVX introduction documentation: “Intel AVX is designed to support 512 or 1024 bits in the future.” The 256-bit design was explicitly chosen as a stepping stone.

  3. Manufacturing Reality: Moving to 256 bits was achievable on Intel’s 32nm process without excessive die area penalties, while 512 bits would have required more significant architectural changes.

This was Intel essentially saying: “256 bits is just the beginning. Wait until you see what we’ve got planned.” Spoiler: what they had planned was a fragmented nightmare that would make Linus Torvalds do what he did best.

The Three-Operand Non-Destructive Instruction Decision

The shift from destructive two-operand instructions (A = A + B) to non-destructive three-operand instructions (C = A + B) addressed a fundamental compiler and programmer pain point:

Previous SSE instructions (destructive):

addps xmm0, xmm1          ; xmm0 = xmm0 + xmm1 (first source destroyed)

AVX non-destructive:

vaddps xmm0, xmm1, xmm2   ; xmm0 = xmm1 + xmm2 (both sources preserved)

Why this mattered:

  1. Reduced Register Spilling: Compilers no longer needed extra instructions to save/restore values before operations. This was like finally getting a larger desk: you could actually spread out your work instead of constantly shuffling papers.

  2. Better Code Generation: Three-operand form enables more efficient instruction scheduling. (Which is a significant step up from the Itanium disaster.) The compiler could think ahead instead of constantly working around the destructiveness of existing instructions.

  3. Reduced Code Size: Though VEX encoding is more complex, avoiding register copy operations often results in smaller overall code.

AVX removed artificial ISA constraints without abandoning dynamic OoO scheduling.

The “VEX encoding scheme” was introduced specifically to support this three-operand format while maintaining backwards compatibility. Intel basically invented a new instruction format that could still run old code.




vaddps ymm0, ymm1, ymm2       ; 256-bit add, both sources untouched

vaddps ymm1, ymm1, ymm2       ; destructive-style reuse is still allowed

vsqrtss xmm0, xmm1, xmm2      ; scalar sqrt of xmm2's low element; upper bits copied from xmm1

vaddps ymm0, ymm1, [rax+256]  ; memory source operand, no alignment fault under VEX
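From C, the non-destructive form is invisible but its benefit is not: the compiler no longer needs a movaps copy to keep a source alive. A minimal sketch with AVX intrinsics (function name mine):

#include <immintrin.h>  /* AVX intrinsics (compile with -mavx) */

/* Hypothetical sketch: both a and b stay live after the add, and the
   compiler can emit vaddps ymm, ymm, ymm with no extra copies. */
__m256 add8(__m256 a, __m256 b)
{
    return _mm256_add_ps(a, b);  /* vaddps: 8 single-precision adds */
}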

AMD’s Bulldozer Influence

May 2009: AMD announced they would support Intel’s AVX instructions

August 2010: AMD announced Bulldozer microarchitecture details

AMD had developed XOP (eXtended Operations) as their own 128-bit SIMD extension before deciding to support Intel’s AVX instead. This suggests AMD recognized Intel’s direction was gaining industry momentum. Sometimes the best strategy is to stop fighting and join the party.

Intel’s aggressive 256-bit implementation in Sandy Bridge was widely seen as a move to maintain SIMD leadership against AMD’s competing designs. The message was clear: Intel wasn’t going to let AMD dictate the future of x86 SIMD.

Target Workloads

From Intel’s AVX introduction materials:

  1. High-Performance Computing (HPC): Climate modeling, molecular dynamics, quantum chemistry simulations
  2. Media and Entertainment: Video encoding/decoding, image processing, 3D rendering
  3. Scientific Computing: Finite element analysis, computational fluid dynamics, seismic processing
  4. Signal Processing: Radar systems, communications systems, medical imaging

Intel was explicitly targeting the workloads where GPUs were starting to make inroads. The message was clear: you don’t need a graphics card to do vector math. Just buy more Intel chips. (Spoiler: this didn’t entirely work out as planned.)


Part IV: The Road to AVX-512 (2011-2016)

The FMA Controversy: AMD vs. Intel

This was one of x86’s most bitter instruction set battles, the kind of standards fight that makes engineers reach for the antacid:

AMD’s Bulldozer (2011) introduced FMA4 as a 4-operand instruction:

vfmaddpd ymm0, ymm1, ymm2, ymm3   ; ymm0 = ymm1 × ymm2 + ymm3 (separate destination)

Intel’s Haswell (2013) implemented FMA3 as a 3-operand instruction:

vfmadd132pd ymm0, ymm1, ymm2      ; ymm0 = ymm0 × ymm2 + ymm1 (destination doubles as an input)

FMA4 and FMA3 are incompatible extensions with different operand counts and encodings.

AMD’s Piledriver (2012) added FMA3 support while still keeping FMA4.

Bulldozer and its successors supported FMA4, while Haswell and later Intel CPUs supported only FMA3. AMD later dropped FMA4 support starting with its Zen families in favor of FMA3, and FMA4 does not appear in current AMD CPUID-reported feature flags.


vfmadd132ps zmm0, zmm1, zmm2   ; zmm0 = zmm0 × zmm2 + zmm1
vfmadd213ps zmm0, zmm1, zmm2   ; zmm0 = zmm1 × zmm0 + zmm2
vfmadd231ps zmm0, zmm1, zmm2   ; zmm0 = zmm1 × zmm2 + zmm0

vfmaddpd ymm0, ymm1, ymm2, ymm3   ; FMA4: ymm0 = ymm1 × ymm2 + ymm3

The market fragmentation meant developers had to ship CPU-specific code paths or risk crashes. Intel's market dominance won; FMA4 died with Bulldozer's failure. AMD eventually added FMA3 support in later architectures, which is engineering-speak for "we were wrong, Intel won, let's just copy them."
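Those CPU-specific paths typically took the form of runtime dispatch. A minimal sketch using GCC/Clang's CPU-feature builtins (the fallback structure is my own illustration):

#include <stdio.h>

/* Hypothetical sketch: choose a code path at runtime instead of
   faulting with an illegal instruction on the "wrong" vendor's CPU. */
int main(void)
{
    __builtin_cpu_init();  /* populate the runtime feature flags */
    if (__builtin_cpu_supports("fma"))
        puts("use the FMA3 kernel");
    else if (__builtin_cpu_supports("avx"))
        puts("use the plain AVX kernel");
    else
        puts("use the scalar fallback");
    return 0;
}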

On a personal note: I much prefer the FMA4 syntax, because it is fully non-destructive.

The Technical Core of the Dispute

The conflict wasn’t just about operand ordering; it was about register destruction.

Fused multiply-add requires three input values ($A \times B + C$). To store the result in a fourth register ($D$) requires a 4-operand instruction.

  • AMD’s FMA4 introduced a special extension to VEX allowing 4 distinct operands. It was fully non-destructive.

  • Intel’s FMA3 stuck to the standard VEX limit of 3 operands. To make the math work, the destination register must also serve as the third input.


vfmaddpd ymm0, ymm1, ymm2, ymm3   ; FMA4: all three inputs survive

vfmadd231pd ymm0, ymm1, ymm2      ; FMA3: the accumulator in ymm0 is overwritten
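From C intrinsics, mercifully, the 132/213/231 zoo disappears: you write one fused multiply-add and the register allocator picks whichever form spares the value it still needs. A sketch (function name mine):

#include <immintrin.h>  /* FMA3 intrinsics (compile with -mfma) */

/* Hypothetical sketch: a*x + y with a single rounding step; the
   compiler selects vfmadd132/213/231 based on which input dies. */
__m256d fma_axpy(__m256d a, __m256d x, __m256d y)
{
    return _mm256_fmadd_pd(a, x, y);
}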

The Xeon Phi “GPCORE” Connection

The Xeon Phi’s core architecture (codenamed “GPCORE”) was a radical departure from Intel’s mainstream cores. Designed by a separate team working on the Larrabee research project, it featured:

  • Wide but shallow pipelines optimized for throughput over latency
  • 512-bit vector units as the primary execution resource
  • No out-of-order execution in early versions (Knights Corner)

Why 512 Bits? The Xeon Phi Imperative

Intel’s drive to 512-bit vectors wasn’t primarily about mainstream CPUs, it was about Xeon Phi and competing with GPUs in HPC. The Knights Landing (KNL) project, announced at ISC 2014, was the first to implement AVX-512, targeting 3+ TFLOPS of double-precision peak theoretical performance per single node.

These customers represented a small percentage of Intel's revenue but demanded disproportionate engineering investment. Take the "Trinity" supercomputer at the NNSA (National Nuclear Security Administration): a $174 million deal awarded to Cray, featuring Haswell and Knights Landing parts (Note: 13th slide).

This leads me to believe Intel sales teams used these contracts to justify AVX-512 development internally. High-value enterprise customers often get special treatment, even when they represent a tiny fraction of the overall market, population-wise…

I also miss affordable RAM.

Who Demanded 512-Bit SIMD?

  1. National Labs (DOE) - Required for TOP500 supercomputer competitiveness against NVIDIA GPUs
  2. Weather Modeling Agencies (NOAA, ECMWF) - Needed 2x+ vector throughput for atmospheric simulations
  3. Quantitative Finance - HFT firms paying premium for any FP performance edge
  4. Oil & Gas - Seismic processing workloads that were GPU-prohibitive due to data transfer costs

These were the customers who would call Intel and say “we’ll give you $50 million if you add this instruction.” And Intel, being a corporation, would say “yes, absolutely, right away, here’s an entire engineering team.”


Part V: The AVX-512 Nightmare (2016-2026)

“To know your Enemy, you must become your Enemy.”
    – Sun Tzu, The Art of War.

The Power Virus Reality

Travis Downs’ detailed analysis revealed that AVX-512 on Skylake-X caused massive license-based downclocking.

| License Level | Base Frequency | Notes |
|---------------|----------------|-------|
| L0 (non-AVX)  | 3.2 GHz | Standard operation |
| L1 (AVX)      | 2.8 GHz | 12.5% reduction |
| L2 (AVX-512)  | 2.4 GHz | 25% reduction from base |

The thermal/power calculus: 512-bit SIMD units consumed approximately 3x the power of 256-bit units at the same frequency. Intel had to either:

  1. Downclock when 512-bit instructions executed (their choice)
  2. Increase TDP significantly (unacceptable for mainstream)
  3. Disable cores to maintain power budget (theoretical, never implemented)

They chose option 1, which meant your $500 processor would deliberately slow itself down if you even dared to use its most advanced features.

Linus Torvalds’ Famous Rant (July 2020)

Linus Torvalds, the creator of Linux, is not known for holding back. In July 2020, he delivered one of the great tech rants of all time:

“I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!) and takes away cores (because those useless garbage units take up space).”

“I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.”

“I’d much rather see that transistor budget used on other things that are much more relevant. Even if it’s still FP math (in the GPU, rather than AVX512). Or just give me more cores (with good single-thread performance, but without the garbage like AVX512) like AMD did.”

The AVX-512 Fragmentation Problem

AVX-512 became a “family of instruction sets” rather than a single standard:

  • Knights Landing (2016): AVX-512 F, CD, ER, PF (the ER/PF subsets never reached mainstream CPUs)
  • Skylake-X (2017): AVX-512 F, CD, BW, DQ, VL, etc.
  • Cannon Lake (2018): Added AVX-512 IFMA and VBMI
  • Cascade Lake (2019): Added AVX-512 VNNI (AI/ML instructions)
  • Ice Lake (2019): Better frequency scaling, still more subsets (BF16 followed on Cooper Lake)
  • Alder Lake (2022): Disabled entirely due to hybrid architecture conflicts

It got to the point where you couldn’t even tell which AVX-512 features a processor supported without looking at the spec sheet. Intel had essentially created an instruction set that was different on every chip. This is the opposite of standardization.

(mandatory xkcd: "Standards")

Why Alder Lake Killed It

Intel's hybrid architecture had Performance-cores (Golden Cove) and Efficiency-cores (Gracemont). Only P-cores had 512-bit units; E-cores maxed out at 256 bits. This caused:

  • Scheduling nightmares for the OS thread director
  • Power management conflicts between core types
  • Customer confusion over which instructions would work where

Intel’s solution: Fuse it off in silicon to prevent BIOS workarounds, then create AVX10 as a unified replacement. This is what happens when you build a feature so complex that even the company that created it can’t figure out how to make it work across different product lines.

Why AMD Resisted (And How They Finally Won, for now)

AMD’s position (2017-2021):
“We’re not rushing to add features that make Intel’s chips throttle.”

The Zen 4 Breakthrough (2022): When AMD finally added AVX-512 in Ryzen 7000, they did it with a stroke of genius: “double-pumping.” Instead of building massive 512-bit execution units that generated enormous heat, they executed 512-bit instructions using two cycles on their existing 256-bit units.

It was simply logical. Developers got the instruction set support they wanted (VNNI, BFloat16), but the processors didn’t downclock. This approach avoided the “garbage” power penalties that had plagued Intel’s implementation.

Zen 5's Power Play (2024): With Ryzen 9000, AMD finally moved to a true full-width 512-bit datapath. While this doubled raw throughput, it brought the laws of physics back into play: lighting up a 512-bit wire simply generates more heat than a 256-bit one. While it avoided the catastrophic downclocking of Intel's Skylake era, it forced AMD to manage power density much more aggressively than with Zen 4.

Raja Koduri’s Defense (August 2020)

“There are people who are doing real work with AVX-512. It’s not just benchmarks. And it’s not going away.”

Intel's Raja Koduri tried to defend AVX-512 against Torvalds' criticism. The subtext seemed to be: "Linus, you don't understand. National labs and AI researchers actually use this stuff!"

Linus’ response was not diplomatic, but it was memorable.

The 2022 Resolution: Intel Finally Surrenders

In January 2022, the debate reached its inevitable conclusion. Intel disabled AVX-512 on Alder Lake processors, not through a BIOS option, but by fusing it off in silicon. The official rationale was hybrid architecture conflicts: Performance-cores had 512-bit units while Efficiency-cores maxed at 256-bit, creating scheduling nightmares for the OS thread director.

But the subtext was clear: Linus and common sense had won. The "power virus" that downclocked entire processor lines, the transistor budget consumed by features most developers never used or didn't even know the names of, the fragmentation across SKUs: all of it was quietly retired.

As Linus noted in November 2022, neural net inference is one of the few legitimate use cases for AVX-512. For everything else, from video encoding to database operations to general-purpose computing, the costs outweighed the benefits.
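The inference case Torvalds conceded is visible in a single intrinsic. A minimal sketch of AVX-512 VNNI's fused int8 dot product (the function name is mine):

#include <immintrin.h>  /* AVX-512 VNNI (compile with -mavx512vnni -mavx512f) */

/* Hypothetical sketch: vpdpbusd multiplies 64 unsigned-by-signed byte
   pairs and accumulates into 16 int32 lanes; previously this chain
   took pmaddubsw + pmaddwd + paddd. */
__m512i dot_step(__m512i acc, __m512i activations_u8, __m512i weights_s8)
{
    return _mm512_dpbusd_epi32(acc, activations_u8, weights_s8);
}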

And in an age where literally every personal computer has either an integrated GPU or a discrete GPU, CPU SIMD can seem like a strange transitional phase. It definitely has its uses, but it is often applied in contexts where its costs outweigh its benefits.

The resolution wasn’t a technical decision. It was a market decision. Intel’s hybrid architecture demanded coherent vector support across all cores. AVX-512 couldn’t provide that. So it is being slowly removed. Just like Itanium, x87 and the x86 we used to know.

The Fragmentation Spiral

  1. 2013-2016: Intel splits AVX-512 across incompatible implementations
  2. 2017-2021: Different SKUs have different feature subsets (bifurcation strategy)
  3. 2022: Alder Lake fuses off AVX-512 entirely
  4. 2023: Intel announces AVX10 to unify the mess
  5. 2026: Nova Lake with AVX10.2 targets coherent 512-bit support across all cores (confirmed November 2025)

The irony was that AVX-512 was designed to unify Intel's vector strategy. Instead, it became the most fragmented instruction set extension in x86-64 history, requiring multiple replacement specifications to fix the damage. This is the equivalent of creating a problem so complex that you need a new solution just to solve the original solution's problems.


Lessons Learned

  1. Backward compatibility drives architecture: The register aliasing decision haunted MMX for years, but it enabled rapid adoption without OS changes.

  2. Marketing matters as much as engineering: Intel’s aggressive MMX marketing, despite modest real-world gains, established SIMD as essential for consumer processors.

  3. Competition accelerates innovation: AMD’s 3DNow! forced Intel to add FP SIMD capabilities years earlier than planned. The FMA controversy showed how fragmented standards hurt developers.

  4. Compromises become permanent: Intel’s “sorta” 128-bit SSE implementation influenced x86 SIMD architecture for a decade.

  5. Customer requirements can override engineering sanity: AVX-512 was pushed by a small percentage of customers but created massive fragmentation and power issues for everyone.

  6. Fragmentation has costs: AVX-512’s bifurcation across SKUs and eventual disablement in hybrid architectures shows the danger of over-engineering for edge cases.

  7. Sometimes the market decides: Intel won the FMA fight not through technical superiority, but through market dominance. The best instruction set is the one everyone actually uses.


The Legacy

The engineers who built x86 SIMD made decisions that shaped computing for decades, often under intense pressure and uncertainty. Their legacy is in every video encode, 3D render, AI inference, and scientific simulation happening on x86 processors today.

The battle continues with AVX10, but the lessons from MMX through AVX-512 remain: architecture decisions made in conference rooms in Haifa, Santa Clara, and Austin echo through decades of computing. The next chapter is being written now: will AVX10 finally unify Intel's fractured vector strategy, or will history repeat itself?

One thing is certain: somewhere, right now, an engineer is making a decision that will seem brilliant, stupid, or utterly incomprehensible to programmers thirty years from now. That’s the nature of this business. And honestly? That’s what makes it fun.

“I’d much rather see that transistor budget used on other things that are much more relevant.”
    – Linus Torvalds, 2020


Appendix A: x86 SIMD Syntax Reference

This appendix was compiled, to the best of my ability, from the documents linked in the references. If you see any problems here, please don't hesitate to contact me.

A.1 Register Naming Conventions

| Extension | Registers | Width | Naming Scheme |
|-----------|-----------|-------|---------------|
| MMX | MM0-MM7 | 64-bit | MM&lt;n&gt; where n = 0-7 |
| SSE | XMM0-XMM15 | 128-bit | XMM&lt;n&gt; where n = 0-15 |
| AVX | YMM0-YMM15 | 256-bit | YMM&lt;n&gt; (lower 256 bits of ZMM&lt;n&gt;) |
| AVX-512 | ZMM0-ZMM31 | 512-bit | ZMM&lt;n&gt; (full register) |

A.2 Instruction Suffix Encoding

The instruction suffix encodes the data type and operation:

| Suffix | Meaning | Example |
|--------|---------|---------|
| S | Signed integer | PMOVSXBD |
| U | Unsigned integer | PADDUSB |
| B | Byte (8-bit) | PADDB |
| W | Word (16-bit) | PADDW |
| D | Doubleword (32-bit) | PADDD |
| Q | Quadword (64-bit) | PADDQ |
| S | Single-precision FP | ADDPS |
| D | Double-precision FP | ADDPD |

A.3 Assembly Syntax Variations

Intel Syntax (used throughout this document):

vaddps zmm0 {k1}{z}, zmm1, zmm2
vmovups zmmword ptr [rax], zmm3

AT&T Syntax (GNU assembler):

vaddps %zmm2, %zmm1, %zmm0{%k1}{z}
vmovups %zmm3, (%rax)

A.4 EVEX/VEX Encoding Fields

Modern AVX-512 uses EVEX encoding with four modifier bytes:

| Field | Bits | Purpose |
|-------|------|---------|
| pp | 2 | Opcode extension (00 = no extension) |
| mm | 2 | VEX.mmmmm equivalent |
| W | 1 | Operand-size/opcode extension (vector length lives in the separate L'L bits) |
| vvvv | 4 | Non-destructive source register specifier (stored inverted) |
| aaa | 3 | {k}{z} mask register (000 = no mask) |
| B | 1 | Broadcast/round control |
| R | 1 | Register specifier extension |

The encoding can be sketched (simplified) as:

$$ \text{EVEX} = \text{0x62} \,\Vert\, \text{R}\,\text{R}'\,\text{B} \,\Vert\, \text{vvvv} \,\Vert\, \text{aaa} $$

A.5 Intrinsic Type Mappings

| SIMD Type | C/C++ Intrinsic | Width (bits) |
|-----------|-----------------|--------------|
| __m64 | MMX | 64 |
| __m128 | SSE | 128 |
| __m128d | SSE (double) | 128 |
| __m256 | AVX | 256 |
| __m256d | AVX (double) | 256 |
| __m512 | AVX-512 | 512 |
| __m512d | AVX-512 (double) | 512 |

A.6 Common Operation Mnemonics

| Category | Instructions | Description |
|----------|--------------|-------------|
| Arithmetic | PADD*, PSUB*, PMUL*, PMADD* | Integer arithmetic |
| FP Arithmetic | ADDPS, MULPS, DIVPS, SQRTPS | Single-precision FP |
| Compare | PCMPEQ*, PCMPGT*, CMPPS | Equality/greater-than |
| Logical | PAND, POR, PXOR, ANDPS, ORPS | Bitwise operations |
| Shuffle | PSHUFLW, SHUFPS, VPERM* | Lane manipulation |
| Load/Store | MOVAPS, MOVUPD, VBROADCAST* | Memory transfers |
| Convert | CVTDQ2PS, CVTPS2DQ, VCVT* | Type conversion |
| Mask | KAND, KOR, KXNOR, KNOT | Mask register ops |

A.7 Mask Register Operations

AVX-512 introduced dedicated mask registers (k0-k7):


vaddps zmm0, zmm1, zmm2           ; unmasked: all 16 lanes written

vaddps zmm0 {k1}{z}, zmm3, zmm4   ; zeroing mask: lanes with k1[i]=0 become 0

vpaddd zmm5 {k2}, zmm6, zmm7      ; merging mask: lanes with k2[i]=0 keep zmm5's old value

The mask value at position $i$ is computed as:

$$ \text{mask}[i] = \begin{cases} 1 & \text{if } \text{cond}(\text{src1}[i], \text{src2}[i]) \\ 0 & \text{otherwise} \end{cases} $$
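A minimal sketch of that flow in intrinsics: a compare writes a k-register, which then predicates an add. The function name and the greater-than-zero condition are my own illustration:

#include <immintrin.h>  /* AVX-512F intrinsics (compile with -mavx512f) */

/* Hypothetical sketch: add b only into lanes where a > 0; the other
   lanes are zeroed by the {z} semantics of the maskz variant. */
__m512 add_where_positive(__m512 a, __m512 b)
{
    __mmask16 m = _mm512_cmp_ps_mask(a, _mm512_setzero_ps(), _CMP_GT_OQ);
    return _mm512_maskz_add_ps(m, a, b);  /* vaddps zmm {k}{z} */
}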

A.8 Lane Concepts in SIMD

A “lane” is a sub-vector within a wider register:

ZMM31 (512 bits) = 8 lanes of 64 bits each
    |-----|-----|-----|-----|-----|-----|-----|-----|
    |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |

YMM15 (256 bits) = 4 lanes of 64 bits each
    |-----------|-----------|-----------|-----------|
    |     0     |     1     |     2     |     3     |

XMM0  (128 bits) = 2 lanes of 64 bits each
    |-----------------------|-----------------------|
    |           0           |           1           |

Lane-crossing operations require special handling and may incur performance penalties.
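The distinction matters in code: in-lane shuffles are cheap, while crossing a 128-bit lane boundary generally costs extra latency. A sketch of a deliberately lane-crossing permute (function name mine):

#include <immintrin.h>  /* AVX intrinsics (compile with -mavx) */

/* Hypothetical sketch: swap the two 128-bit halves of a YMM register.
   vperm2f128 crosses the lane boundary, which on many microarchitectures
   adds a few cycles of latency versus an in-lane shuffle. */
__m256 swap_128bit_halves(__m256 v)
{
    return _mm256_permute2f128_ps(v, v, 0x01);
}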

Appendix B: Assembly Code Analysis

Experimental Context: The assembly output examined in this appendix was generated by compiling the source C code with GCC 2.7.2.3 on a Debian Potato (2.2) system running under qemu-system-i386 virtualization (QEMU emulator version 9.2.4 (qemu-9.2.4-2.fc42)). The host system was an 11th Gen Intel(R) Core(TM) i7-11370H processor. This experimental setup recreates the 1997-era GCC compilation environment while running on modern hardware.

B.1 Source Code and Compiler Context

The following analysis examines the GCC 2.7.2.3 assembly output for the integer vector addition function discussed in Section I. The source C code was:



void add_i32(int* dest, const int* a, const int* b, int n);

void add_i32(int* dest, const int* a, const int* b, int n)
{
    int i;
    for(i = 0; i < n; ++i) {
        dest[i] = a[i] + b[i];
    }
}

Compiled with: gcc -O2 -S test.c (GCC 2.7.2.3, 1997-era)

Compiler Version Context: GCC 2.7.2.3 was released August 20, 1997, during the Pentium MMX era. This version predates any MMX intrinsic support in GCC. MMX intrinsics first appeared in GCC 3.1 (2002), and auto-vectorization was not added until GCC 4.0 (2005).

B.2 Generated Assembly Analysis

The complete assembly output (assets/out.s) consists of 40 lines. The following analysis provides a line-by-line examination with verification against contemporary documentation.

Header Section (Lines 1-7):

| Line | Assembly | Analysis |
|------|----------|----------|
| 1 | .file "test.c" | Debug info directive, specifies source filename |
| 2 | .version "01.01" | GAS assembler version string |
| 3 | gcc2_compiled.: | VERIFIED: Valid GNU assembly identifier. The trailing dot is part of the symbol name, used by libg++ to identify GCC-compiled objects |
| 4 | .text | Code section directive |
| 5 | .align 4 | 16-byte alignment (2^4 = 16). Correct for Pentium Pro/Pentium II instruction fetch optimization |
| 6 | .globl add_i32 | Exports the symbol globally |
| 7 | .type add_i32,@function | ELF symbol type directive for debug info |

Prologue and Parameter Loading (Lines 8-17):

The function follows the System V i386 ABI with the cdecl calling convention:

add_i32:
    pushl %ebp
    movl %esp,%ebp
    pushl %edi
    pushl %esi
    pushl %ebx
    movl 8(%ebp),%esi
    movl 12(%ebp),%ebx
    movl 16(%ebp),%ecx
    movl 20(%ebp),%edx

Stack Offset Verification:

  • After pushl %ebp; movl %esp,%ebp: 0(%ebp) = saved %ebp, 4(%ebp) = return address
  • 8(%ebp) = first argument (dest)
  • 12(%ebp) = second argument (a)
  • 16(%ebp) = third argument (b)
  • 20(%ebp) = fourth argument (n)

Register Allocation: The compiler saves %edi, %esi, %ebx as callee-saved registers per the ABI. All three are genuinely needed here: %esi and %ebx hold pointer arguments, and %edi serves as the accumulator.

Main Loop (Lines 22-28):

.L5:
    movl (%ebx,%eax,4),%edi
    addl (%ecx,%eax,4),%edi
    movl %edi, (%esi,%eax,4)   
    incl %eax
    cmpl %edx,%eax
    jl .L5                     

Addressing Mode Analysis:

  • Scale factor of 4 correctly represents sizeof(int)
  • Base+index addressing is optimal for array access
  • No memory operands in instructions other than loads/stores

Epilogue (Lines 29-35):

.L3:
    leal -12(%ebp),%esp
    popl %ebx
    popl %esi
    popl %edi
    leave
    ret

B.3 Critical Finding: Why No MMX Instructions?

The claim in Section I—that GCC 2.7.2.3 “failed” to generate MMX code—requires clarification. What does “failed” mean in this context?

GCC 2.7.x had no capability to generate MMX instructions whatsoever. This was not an implementation failure but a gap left by both Intel and the GCC project. Intel released MMX technology in January 1997 with aggressive marketing claims about performance improvements, yet they did not collaborate with the GCC team to ensure compiler support.

GCC MMX Support Timeline:

| Version | Release | MMX Support |
|---------|---------|-------------|
| GCC 2.7.x | 1995-1997 | None |
| GCC 3.0 | 2001 | Broken -mmmx flag (partial backend support; intrinsics incomplete and unstable) |
| GCC 3.1 | 2002 | Initial intrinsics |
| GCC 4.0 | 2005 | Auto-vectorization |

What “Failed” Really Means:

The term “failed” implies Intel expected automatic MMX code generation from existing compilers. However, Intel did not work with the GNU Compiler Collection project to add MMX support. GCC was the primary compiler for Linux, BSD, and many embedded systems in 1997. If Intel wanted their marketing claims about MMX performance to reach everyday developers, they should have:

  1. Provided MMX intrinsic headers and documentation to the GCC team in 1996-1997
  2. Collaborated on machine description updates for MMX instruction selection
  3. Ensured GCC could generate MMX code alongside proprietary compilers like Intel’s ICC

Without this collaboration, the “marketing reality gap” widened. Intel claimed 50-700% improvements for MMX-optimized software, but developers using GCC could not achieve these speedups without writing hand-optimized assembly. The comparison in Section I between GCC output and MMX code is therefore a comparison between what Intel’s hardware could do and what Intel’s failure to work with the dominant open-source compiler allowed developers to achieve.

Evidence from GCC development history: Richard Henderson stated in December 2004: "As mentioned in another thread, we can't generate proper MMX/3DNOW code ourselves. The existing intrinsics expect users to write them manually".

The comparison in Section I is therefore comparing compiler output to what would require hand-written assembly. This is not a “failure” of GCC 2.7.2.3 in the sense of a bug or regression. It is a fundamental limitation of 1996-era compiler technology that Intel could have addressed but chose not to.

B.4 Performance Gap Analysis

Using instruction latency data from Agner Fog's optimization manuals and Intel documentation, we can quantify the performance difference between the generated scalar code and an optimal MMX implementation.

Scalar Implementation (GCC output):

  • Instructions per element: 5
  • CPI (estimated): 1.1
  • Cycles per element: ~5.5

Optimal MMX Implementation:

  • Instructions per 2 elements: 4 (2 × MOVQ loads + 1 × PADDD + 1 × MOVQ store)
  • Instructions per element: 2.5
  • CPI (estimated): 1.0
  • Cycles per element: ~2.5

Performance Comparison (Pentium MMX, 233 MHz):

| Metric | Scalar | MMX | Improvement |
|--------|--------|-----|-------------|
| Cycles/element | 5.5 | 2.5 | 2.2x |
| Elements/sec (10M array) | 38.8M | 93.2M | 2.4x |
| Memory ops per 8 elements | 24 | 12 | 2.0x |
| Branch ops per 8 elements | 8 | 4 | 2.0x |

EMMS Overhead: The EMMS instruction (required after MMX code) costs 2-4 cycles. For loops processing N elements (4 per iteration), overhead is 4/N cycles per element, negligible (0.4%) for N=1000.

B.5 The Productivity Gap

The 2-4x performance gap between hardware capability and compiler output in 1997 represents what we call the productivity gap: the difference between what SIMD hardware could do and what compilers could exploit.

Industry Response:

  1. Intel released MMX Technology Programmer’s Reference Manual (245794, 1997) encouraging manual intrinsics
  2. Developers wrote assembly code directly
  3. GCC eventually added intrinsics (GCC 3.1, 2002) and auto-vectorization (GCC 4.0, 2005)
  4. The fundamental challenge persists: modern compilers still miss 30-50% of vectorization opportunities

18 Feb 2026

I would like to thank bal-e and hailey from the lobste.rs forum for noticing the hallucinations introduced by the AI tools I used to proofread this article. I am sincerely sorry for ever letting this happen. I shouldn't have used them in the first place, and I promise I will never again use any AI tools to proofread or assist with my articles on this blog from here on.

I would also like to thank hoistbypetard for inviting me to lobste.rs

I would also like to thank Peter Kankowski for pointing out the flaw in one of the examples at SSE3.


Kamikura, Masaru. "Intel 45nm Processor Demo." YouTube. https://www.youtube.com/watch?v=TGCt4NyJWTY.

See also: “Intel Buys Into XML Processing With Conformative.” EE Times, February 8, 2006. https://www.eetimes.com/intel-buys-into-xml-processing-with-conformative/. Accessed January 15, 2026.

See also: Mysticial. “Zen4’s AVX512 Teardown.” MersenneForum, September 26, 2022. https://www.mersenneforum.org/node/21615. Accessed January 15, 2026.

See also: “Intel Officially Confirms AVX10.2 and APX Support in Nova Lake.” TechPowerUp, November 13, 2025. https://www.techpowerup.com/342881/intel-officially-confirms-avx10-2-and-apx-support-in-nova-lake/. Accessed January 16, 2026.


Additional Technical References

Intel. "Intel AVX-512 Instructions." June 20, 2017. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-avx-512-instructions.html. Accessed January 16, 2026.

Intel. "Intel Intrinsics Guide." https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. Accessed January 16, 2026.
