The story of x86 SIMD is simply not about technology. It’s about marketing, corporate politics, engineering compromises, competitive pressure. This is the behind-the-scenes history of how Intel and AMD battled for vector supremacy, the controversial decisions that defined an architecture, and the personalities who made it happen.
Part I: The Not-So Humble Beginnings (1993-1999)
The MMX Gamble: Intel’s Israel Team Takes a Huge Risk
The story of MMX begins not in Santa Clara, but in Haifa, Israel. In 1993, Intel made an unprecedented decision: they would let their Israel Development Center design and build a mainstream microprocessor, the Pentium MMX, the first time Intel developed a flagship processor outside the United States.
This was a massive gamble. According to Intel’s own technology journal, the development of MMX technology spanned five years and involved over 300 engineers across four Intel sites. At the center of this effort was Uri Weiser, director of the Architecture group at the IDC in Haifa.
Uri Weiser later recalled the struggle with characteristic understatement: “Some people were ready to quit,” He was named an Intel Fellow for his work on MMX architecture, a rare honor that speaks to the significance of what the Israel team accomplished.
Meanwhile, in Haifa, 300 engineers were about to make a decision that would haunt x86 for the next three decades.
The Technical Reason for the Controversial Register Decision
Here is where things get spicy. The most consequential and controversial decision in MMX design was register aliasing. Intel aliased the 8 new MMX registers (MM0-MM7) directly onto the existing x87 floating-point register stack (ST(0)-ST(7)).
Why they did this: To avoid adding new processor state. At the time, operating systems only knew how to save/restore the x87 FPU registers during context switches. Adding 8 entirely new registers would have required OS modifications across Windows, Linux, and every other x86 OS.
This was the 1990s, remember, convincing Microsoft to change Windows was roughly as easy as convincing your cat to enjoy water sports.
The cost: You cannot mix floating-point and MMX instructions in
the same routine without risking register corruption. Programmers must
use the EMMS (Empty MMX State) instruction to switch between modes,
and even then, there’s overhead. Think of it like sharing a closet
with your neighbor: sure, it saves space, but good luck finding your
socks when they’ve mysteriously migrated to the other person’s side.
The register state mapping can be expressed as:
$$ \forall i \in {0,\dots,7}: \text{MM}_i \equiv \text{ST}(i) $$
where $\equiv$ denotes hardware-level aliasing (same physical storage).
Intel’s engineers knew this was a compromise. But they made a calculated bet: most multimedia applications separate data generation (FP) from display (SIMD), so the restriction would rarely matter in practice.
They were mostly right. Mostly…
The “MMX” Naming Controversy
Intel pulled a masterstroke with the MMX name. Officially, MMX is a meaningless initialism, not an acronym at all. Intel trademarked the letters “MMX” specifically to prevent competitors from using them.
The internal debate: Unofficially, the name was derived from either:
- MultiMedia eXtension
- Matrix Math eXtension
Intel has never officially confirmed which, because apparently they wanted to preserve the mystique. Or maybe they forgot. Hard to say.
When AMD produced marketing material suggesting MMX stood for “Matrix Math Extensions” (based on internal Intel documents), Intel sued AMD in 1997 with the enthusiasm of a copyright troll at a convention, claiming trademark infringement. AMD argued that “MMX” was a generic term for multimedia extensions.
The settlement: AMD eventually acknowledged MMX as Intel’s trademark and received rights to use the name on their chips. But Intel’s aggressive legal stance sent a message: this was their playground, and competitors would have to find their own identity. (Looking at you, 3DNow!)
The Marketing Hype Backlash
Intel launched MMX with a Super Bowl commercial featuring Jason Alexander, promising revolutionary multimedia capabilities. The hype was enormous. This was 1997, when Super Bowl commercials were still an event and people actually watched them for the ads.
When the Pentium MMX shipped, reviewers found that for non-optimized applications, the real-world performance gain was only 10-20%, mostly from the doubled L1 cache (32KB vs 16KB).
One technology journalist called MMX “90% marketing and 10% technical innovation.” PC Magazine Labs found only modest gains for existing Windows 95 applications.
Intel’s defense: They claimed 50-700% improvements for MMX-optimized software, but the catch was obvious: almost no software was optimized at launch.
Now, to put into perspective, a textbook example of where this would help is in a function like this
void
add_i32(int* dest, const int* a, const int* b, int n);
void
add_i32(int* dest, const int* a, const int* b, int n)
{
int i;
for(i=0; i<n;++i) {
dest[i] = a[i] + b[i];
}
}
Which in turn should produce a beautiful MMX register using delicacies like
movq mm0, [a+i]
movq mm1, [b+i]
paddd mm0, mm1
movq [dst+i], mm0
(or even better just unroll the loop, but for the sake of argument I’m omitting that)
but in reality gcc2.7.2.3 produced this thing:
movl (%ebx,%eax,4), %edi
addl (%ecx,%eax,4), %edi
movl %edi, (%esi,%eax,4)
incl %eax
cmpl %edx, %eax
Which is comparing a car to a bicycle. Yes it will be correct but it’s simply too slow.
There is no “polite” C code in 1997 that nudges GCC 2.7.x into MMX.
You can write restrict but it will not work.
You either write MMX explicitly, or you don’t get MMX at all.
See Appendix B for comprehensive line-by-line analysis of this assembly output, including verification of GCC 2.7.2.3 authenticity and explanation of why MMX instructions were not generated.
SSE: Intel’s Response to AMD’s 3DNow
While MMX was still proving itself, Intel’s product definition team made a bold proposal: add SIMD floating-point capabilities to the next processor, code-named “Katmai.”
The internal debate: Intel executives were hesitant. MMX hadn’t even shipped yet. Were they betting too heavily on SIMD? Was this just another marketing gimmick?
According to Intel’s own account, the meeting was “inconclusive.” Executives demanded more questions be answered. Two weeks later, they gave the OK for Katmai (later named Pentium III).
Meanwhile, in Sunnyvale, California, AMD was watching. And plotting.
AMD’s 3DNow!, introduced in the K6-2 in May 1998, was a direct response to MMX’s biggest weakness: no floating-point SIMD. AMD added 21 instructions that could handle single-precision floating-point operations in parallel.
Suddenly, Intel’s fancy new multimedia extension couldn’t actually do the floating-point math that 3D graphics required. Oops :p
When Pentium III (Katmai) shipped in February 1999, it introduced SSE (Streaming SIMD Extensions) with 70 new instructions and 8 entirely new 128-bit registers (XMM0-XMM7).
Intel added new registers, costing an extra processor state and requiring OS modifications (looking at you again, Microsoft). Nevertheless Intel implemented the 128-bit floating-point units in a “hack “way. A 4-way SSE instruction gets broken into two 64-bit microinstructions, executed on two separate units.
Intel “sorta” succeeded in adding 128-bit SIMD FP. The implementation was clever, efficient, and space-conscious, but it was a hack that would haunt optimization efforts for years. The word “sorta” appears in technical documentation approximately never, which tells you something about just how much of a hack this was!
It might be worth noting that this persisted for a long time (Pentium M, Core 2). Intel didn’t get true single-cycle 128-bit width until Core 2 (Conroe) for some ops, and fully in later gens. AMD actually beat them to true 128-bit width in hardware execution units with the K8/K10 in some aspects.
Part II: The SSE Wars (2000-2008)
SSE2 (2000): The Pentium 4’s “Sledgehammer”
Intel’s SSE2 wasn’t driven by a new application breakthrough. It was a defensive move against AMD’s 3DNow! and the looming threat of K7 (Athlon).
Intel’s Willamette team in Hillsboro, Oregon was under immense pressure. AMD’s Athlon K6-2 had demonstrated that SIMD instructions mattered for gaming and 3D graphics. Intel internally called Willamette “Sledgehammer”.
The key driver was real-time 3D gaming and DirectX performance. Microsoft had been pushing Intel for better SIMD support since DirectX 7.
SSE2 introduced 144 new instructions including double-precision FP:
movapd xmm0, [rax]
addpd xmm0, xmm1
mulpd xmm0, xmm2
sqrtpd xmm0, xmm3
unpckhpd xmm0, xmm1
cvtpd2ps xmm0, xmm1SSE3 (2004): Prescott’s Reckoning
SSE3’s official driver was “media encoding improvements,” but the real story is far more troubled.
SSE3 was introduced with Prescott, (aka. PNI, Presscot New Instructions), the 90nm Pentium 4 revision that would become Intel’s biggest nightmare. The 13 new instructions were heavily trimmed due to power concerns.
The new instructions could be used to accelerate 3D workflows and video codecs. Like normally, Intel released the hardware first and waited for software to catch up to later. With one exception. Intel C++ 8.0 compiler, which supported SSE3 Instructions.
Although Intel released the SSE3 instructions guidelines for software developers last summer, there are no programs yet,…
… according to Intel, `LDDQU`` instruction could speed up video compression by 10% if used in data encoding algorithms…
Ilya Gavrichenkov (xbitlabs.com, 02.01.2004)
Horizontal operations (operating within a single register lane) were a new concept:
haddpd xmm0, xmm1
hsubpd xmm0, xmm1
movddup xmm0, xmm1
movshdup xmm0, xmm1
Intel executives had acknowledged the growing challenges with clock speed scaling as the industry hit what some called a “power wall.” Prescott’s 31-stage pipeline generated so much heat that Intel had to cut SSE3 instruction complexity to reduce power draw. The thermal challenges were significant enough that power efficiency became a primary concern in processor design.
SSSE3 (2006): The Core 2 Rebirth
SSSE3 (Supplemental Streaming SIMD Extensions 3) wasn’t planned as a separate extension. It was emergency additions to fix Core architecture’s weaknesses.
When Intel abandoned NetBurst for Core (Conroe/Merom), they discovered their new architecture lacked certain acceleration paths. The 16 new instructions in SSSE3 (including PMULHRSW, PABSB/PABSW/PABSD, and PALIGNR) were specifically designed to address common performance bottlenecks.
SSSE3 was introduced with the Intel Xeon processor 5100 series and Intel Core 2 processor family. SSSE3 offer 32 instructions to accelerate processing of SIMD integer data.
pabsb xmm0, xmm1
pabsw xmm0, xmm1
pabsd xmm0, xmm1
phaddw xmm0, xmm1
phaddd xmm0, xmm1
phsubw xmm0, xmm1
phsubd xmm0, xmm1
pmulhrsw xmm0, xmm1, xmm2
palignr xmm0, xmm1, 3
One could think that these are actually not real ALU instructions but actually a result of a cat walking over on an Intel engineers keyboard and pressed random buttons. If you though about that you’ll be right on one of those assumptions.
Those are not purely ALU instructions.
These instructions were added without changing the Core microarchitecture. They were largely microcode/decoder based additions. These instructions did not introduce new arithmetic capabilities or execution units; they collapsed common multi-instruction SIMD idioms into single operations that mapped onto existing ALUs and shuffle units.
SSE4 (2007)
SSE4 was split into two parts, SSE4.1 (video/graphics) and SSE4.2 (database/text). This was deliberate, Intel didn’t want to wait for database features to ship with video acceleration.
The H.264 video encoding explosion drove SSE4.1. By 2006, YouTube was growing and everyday video creation and consumption were consuming massive CPU resources, and Intel needed hardware acceleration.
14 new video-oriented instructions were specifically designed for H.264 encoding:
- MPSADBW - Multi-hypothesis Motion Estimation (4x4 SAD calculations)
- PHMINPOSUW - Horizontal Minimum Position (used in motion vector selection)
- DP - Dot Product (floating-point, for video filtering) …
mpsadbw xmm0, xmm1, 0
phminposuw xmm0, xmm1
dpps xmm0, xmm1, 0xFF
pmaxsb xmm0, xmm1
pminub xmm0, xmm1
pextrb xmm0, xmm1, 5
pinsrd xmm0, xmm1, 2
In theory new instructions significantly accelerated motion estimation workloads.
Penryn showed significant improvements in video encoding over Core 2 at same clock speeds. Intel’s Fall 2007 IDF demo showed x264 encoding performance improvements that were substantial enough to generate significant developer interest in optimizing their code.
SSE4.2 (2008): Nehalem’s Database Revolution
Intel’s focus on data center and enterprise workloads wasn’t born from an acquisition of an existing database team, it was shaped by two strategic XML acquisitions. In August 2005, Intel acquired Sarvega, an XML networking company. In February 2006, they followed up by acquiring Conformative, an XML processing startup.
These acquisitions could have brought expertise in text processing and XML acceleration into Intel’s Software and Solutions Group. The engineering knowledge from Sarvega and Conformative probably influenced the STTNI (String and Text New Instructions) in SSE4.2, first shipping with Nehalem in 2008.
Four instructions were specifically designed for database and string processing:
- CRC32 - Hardware-accelerated checksums (for storage/network)
- POPCNT - Population count (for Bloom filters, compression)
- PCMPESTRI/PCMPISTRI - String comparison (for text search)
crc32 eax, byte [rax]
crc32 eax, ax
crc32 eax, eax
popcnt rax, rbx
pcmpestri xmm0, xmm1, 0x00
pcmpistri xmm0, xmm1, 0x04
The CRC32 instruction alone reduced ZFS/Btrfs checksum overhead significantly, making storage operations notably faster.
The new string processing instructions generated considerable discussion in the developer community. One example was of Austing Zhang of Intel who claimed “After basic testing with iSCSI and confirmed that the iSCSI head digest routines can be speeded up by 4x - 10x.”
Intel initially wanted to call SSE4.2 “SSE5” but AMD had already announced SSE5 (with different 3-operand format). This led to the confusing naming that persists today, because nothing says “clear technical vision” like having two companies use the same numbers for completely different things.
Part III: The Birth of AVX (2008-2011)
March 2008: The Announcement
Intel officially announced AVX (then called “Gesher New Instructions”) in March 2008. The codename “Gesher” means “bridge” in Hebrew, later changed to “Sandy Bridge New Instructions” as the microarchitecture name took precedence.
The announcement came through leaked slides in August 2008, which revealed Intel’s roadmap including 8-core CPUs and the new AVX instruction set. Because nothing says “carefully planned announcement” like your roadmap getting leaked to Engadget two months early.
Why 256 Bits?
From Intel’s official documentation, three key factors drove the 256-bit decision:
-
Floating-Point Performance Doubling: The primary goal was to double floating-point throughput for vectorizable workloads. Sandy Bridge’s execution units were specifically reworked to achieve this.
-
Forward Scalability: As noted in Intel’s AVX introduction documentation: “Intel AVX is designed to support 512 or 1024 bits in the future.” The 256-bit design was explicitly chosen as a stepping stone.
-
Manufacturing Reality: Moving to 256 bits was achievable on Intel’s 32nm process without excessive die area penalties, while 512 bits would have required more significant architectural changes.
This was Intel essentially saying: “256 bits is just the beginning. Wait until you see what we’ve got planned.” Spoiler: what they had planned was a fragmented nightmare that would make Linus Torvalds do what he did best.
The Three-Operand Non-Destructive Instruction Decision
The shift from destructive two-operand instructions (A = A + B) to non-destructive three-operand instructions (C = A + B) addressed a fundamental compiler and programmer pain point:
Previous SSE instructions (destructive):
addps xmm0, xmm1
AVX non-destructive:
vaddps xmm0, xmm1, xmm2
Why this mattered:
-
Reduced Register Spilling: Compilers no longer needed extra instructions to save/restore values before operations. This was like finally getting a larger desk, you could actually spread out your work instead of constantly shuffling papers.
-
Better Code Generation: Three-operand form enables more efficient instruction scheduling. (Which is a significant step up from the Itanium disaster.) The compiler could think ahead instead of constantly working around the destructiveness of existing instructions.
-
Reduced Code Size: Though VEX encoding is more complex, avoiding register copy operations often results in smaller overall code.
AVX removed artificial ISA constraints without abandoning dynamic OoO scheduling.
The “VEX encoding scheme” was introduced specifically to support this three-operand format while maintaining backwards compatibility. Intel basically invented a new instruction format that could still run old code.
vaddps ymm0, ymm1, ymm2
vaddps ymm1, ymm1, ymm2
vsqrtss xmm0, xmm1, xmm2
vaddps ymm0, ymm1, [rax+256] AMD’s Bulldozer Influence
May 2009: AMD announced they would support Intel’s AVX instructions
August 2010: AMD announced Bulldozer microarchitecture details
AMD had developed XOP (eXtended Operations) as their own 128-bit SIMD extension before deciding to support Intel’s AVX instead. This suggests AMD recognized Intel’s direction was gaining industry momentum. Sometimes the best strategy is to stop fighting and join the party.
Intel’s aggressive 256-bit implementation in Sandy Bridge was widely seen as a move to maintain SIMD leadership against AMD’s competing designs. The message was clear: Intel wasn’t going to let AMD dictate the future of x86 SIMD.
Target Workloads
From Intel’s AVX introduction materials:
- High-Performance Computing (HPC): Climate modeling, molecular dynamics, quantum chemistry simulations
- Media and Entertainment: Video encoding/decoding, image processing, 3D rendering
- Scientific Computing: Finite element analysis, computational fluid dynamics, seismic processing
- Signal Processing: Radar systems, communications systems, medical imaging
Intel was explicitly targeting the workloads where GPUs were starting to make inroads. The message was clear: you don’t need a graphics card to do vector math. Just buy more Intel chips. (Spoiler: this didn’t entirely work out as planned.)
Part IV: The Road to AVX-512 (2011-2016)
The FMA Controversy: AMD vs. Intel
This was one of x86’s most bitter instruction set battles, the kind of standards fight that makes engineers reach for the antacid:
AMD’s Bulldozer (2011) introduced FMA4 as a 4-operand instruction:
vfmaddpd ymm0, ymm1, ymm2, ymm3
Intel’s Haswell (2013) implemented FMA3 as a 3-operand instruction:
vfmadd132pd ymm0, ymm1, ymm2
FMA4 and FMA3 are incompatible extensions with different operand counts and encodings.
AMD’s Piledriver (2012) added FMA3 support while still keeping FMA4.
Bulldozer and its successors supported FMA4, while Haswell and later Intel CPUs supported only FMA3. AMD later dropped FMA4 support starting with its Zen families in favor of FMA3, and FMA4 does not appear in current AMD CPUID-reported feature flags.
vfmadd132ps zmm0, zmm1, zmm2
vfmadd213ps zmm0, zmm1, zmm2
vfmadd231ps zmm0, zmm1, zmm2
vfmaddpd ymm0, ymm1, ymm2, ymm3
The market fragmentation meant developers had to use CPU-specific code paths or risk crashes. Intel’s market dominance won, FMA4 died with Bulldozer’s failure. AMD eventually added FMA3 support in later architectures, which is engineering-speak for “we were wrong, Intel won, let’s just copy them.”
On a personal note. I much prefer the FMA4 syntax, because it was non-destructive.
The Technical Core of the Dispute
The conflict wasn’t just about operand ordering; it was about register destruction.
Fused Multiply-Add requires three input values (A×B+CA×B+C). To store the result in a fourth register (DD) requires a 4-operand instruction.
-
AMD’s FMA4 introduced a special extension to VEX allowing 4 distinct operands. It was fully non-destructive.
-
Intel’s FMA3 stuck to the standard VEX limit of 3 operands. To make the math work, the destination register must also serve as the third input.
vfmaddpd ymm0, ymm1, ymm2, ymm3
vfmadd231pd ymm0, ymm1, ymm2
The Xeon Phi “GPCORE” Connection
The Xeon Phi’s core architecture (codenamed “GPCORE”) was a radical departure from Intel’s mainstream cores. Designed by a separate team working on the Larrabee research project, it featured:
- Wide but shallow pipelines optimized for throughput over latency
- 512-bit vector units as the primary execution resource
- No out-of-order execution in early versions (Knights Corner)
Why 512 Bits? The Xeon Phi Imperative
Intel’s drive to 512-bit vectors wasn’t primarily about mainstream CPUs, it was about Xeon Phi and competing with GPUs in HPC. The Knights Landing (KNL) project, announced at ISC 2014, was the first to implement AVX-512, targeting 3+ TFLOPS of double-precision peak theoretical performance per single node.
These customers represented a small percentage of Intel’s revenue but demanded disproportionate engineering investment. Like the “Trinity” Supercomputer at NNSA (National Nuclear Security Administration). $174 million deal awarded to Cray that will feature Haswell and Knights Landing (Note: 13th slide)
This leads me to believe Intel sales teams used their contracts to justify AVX-512 development internally. High-value enterprise customers often get special treatment, even when they represent a tiny fraction of the overall market population wise…
I also miss affordable RAM.
Who Demanded 512-Bit SIMD?
- National Labs (DOE) - Required for TOP500 supercomputer competitiveness against NVIDIA GPUs
- Weather Modeling Agencies (NOAA, ECMWF) - Needed 2x+ vector throughput for atmospheric simulations
- Quantitative Finance - HFT firms paying premium for any FP performance edge
- Oil & Gas - Seismic processing workloads that were GPU-prohibitive due to data transfer costs
These were the customers who would call Intel and say “we’ll give you $50 million if you add this instruction.” And Intel, being a corporation, would say “yes, absolutely, right away, here’s an entire engineering team.”
Part V: The AVX-512 Nightmare (2016-2026)
“To know your Enemy, you must become your Enemy.”
– Sun Tzu, The Art of War.
The Power Virus Reality
Travis Downs’ detailed analysis revealed that AVX-512 on Skylake-X caused massive license-based downclocking.
| License Level | Base Frequency | Notes |
|---|---|---|
| L0 (Non-AVX) | 3.2 GHz | Standard operation |
| L1 (AVX) | 2.8 GHz | 12.5% reduction |
| L2 (AVX-512) | 2.4 GHz | 25% reduction from base |
The thermal/power calculus: 512-bit SIMD units consumed approximately 3x the power of 256-bit units at the same frequency. Intel had to either:
- Downclock when 512-bit instructions executed (their choice)
- Increase TDP significantly (unacceptable for mainstream)
- Disable cores to maintain power budget (theoretical, never implemented)
They chose option 1, which meant your $500 processor would deliberately slow itself down if you even dared to use its most advanced features.
Linus Torvalds’ Famous Rant (July 2020)
Linus Torvalds, the creator of Linux, is not known for holding back. In July 2020, he delivered one of the great tech rants of all time:
“I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!) and takes away cores (because those useless garbage units take up space).”
“I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.”
“I’d much rather see that transistor budget used on other things that are much more relevant. Even if it’s still FP math (in the GPU, rather than AVX512). Or just give me more cores (with good single-thread performance, but without the garbage like AVX512) like AMD did.”
The AVX-512 Fragmentation Problem
AVX-512 became a “family of instruction sets” rather than a single standard:
- Knights Landing (2016): AVX-512-BW, CD, ER, PF (no main CPU features)
- Skylake-X (2017): AVX-512-F, CD, BW, DQ, VL, etc.
- Cannon Lake (2018): Added AVX-512-VNNI (AI/ML instructions)
- Ice Lake (2019): Better frequency scaling, added BF16
- Alder Lake (2022): Disabled entirely due to hybrid architecture conflicts
It got to the point where you couldn’t even tell which AVX-512 features a processor supported without looking at the spec sheet. Intel had essentially created an instruction set that was different on every chip. This is the opposite of standardization.
Why Alder Lake Killed It
Intel’s hybrid architecture had Performance-cores (Golden Cove) and Efficiency-cores (Gracemont). Only P-cores had 512-bit units, E-cores maxed at 256-bit. This caused:
- Scheduling nightmares for the OS thread director.
- Power management conflicts between core types
- Customer confusion over which instructions would work where
Intel’s solution: Fuse it off in silicon to prevent BIOS workarounds, then create AVX10 as a unified replacement. This is what happens when you build a feature so complex that even the company that created it can’t figure out how to make it work across different product lines.
Why AMD Resisted (And How They Finally Won, for now)
AMD’s position (2017-2021):
“We’re not rushing to add features that make Intel’s chips throttle.”
The Zen 4 Breakthrough (2022): When AMD finally added AVX-512 in Ryzen 7000, they did it with a stroke of genius: “double-pumping.” Instead of building massive 512-bit execution units that generated enormous heat, they executed 512-bit instructions using two cycles on their existing 256-bit units.
It was simply logical. Developers got the instruction set support they wanted (VNNI, BFloat16), but the processors didn’t downclock. This approach avoided the “garbage” power penalties that had plagued Intel’s implementation.
Zen 5’s Power Play (2024): With Ryzen 9000, AMD finally moved to a true full-width 512-bit datapath. While this doubled raw throughput, it brought the laws of physics back into play—lighting up a 512-bit wire simply generates more heat than a 256-bit one. While it avoided the catastrophic downclocking of Intel’s Skylake era, it forced AMD to manage power density much more aggressively than with Zen 4.
Raja Koduri’s Defense (August 2020)
“There are people who are doing real work with AVX-512. It’s not just benchmarks. And it’s not going away.”
Intel’s Raja Koduri (who would later return to Intel after adventures at Apple and Samsung) tried to defend AVX-512 against Torvalds’ criticism. The subtext seemed to be: “Linus, you don’t understand. National labs and AI researchers actually use this stuff!”
Linus’ response was not diplomatic, but it was memorable.
The 2022 Resolution: Intel Finally Surrenders
In January 2022, the debate reached its inevitable conclusion. Intel disabled AVX-512 on Alder Lake processors, not through a BIOS option, but by fusing it off in silicon. The official rationale was hybrid architecture conflicts: Performance-cores had 512-bit units while Efficiency-cores maxed at 256-bit, creating scheduling nightmares for the OS thread director.
But the subtext was clear: Linus & common sense had won. The “power virus” that downclocked entire processors lineage, the transistor budget consumed by features most developers never used or didn’t even know the names of, the fragmentation across SKUs, all of it was quietly retired.
As Linus noted in November 2022, neural net inference is one of the few legitimate use cases for AVX-512. For everything else, from video encoding to database operations to general-purpose computing, the costs outweighed the benefits.
And when in an age where literally every personal computer either has an integrated GPU or an external GPU, CPU SIMD seems like a weird transitional phase. It definitely has it’s uses but is applied in contexts where its costs outweigh benefit.
The resolution wasn’t a technical decision. It was a market decision. Intel’s hybrid architecture demanded coherent vector support across all cores. AVX-512 couldn’t provide that. So it is being slowly removed. Just like Itanium, x87 and the x86 we used to know.
The Fragmentation Spiral
- 2013-2016: Intel splits AVX-512 across incompatible implementations
- 2017-2021: Different SKUs have different feature subsets (bifurcation strategy)
- 2022: Alder Lake fuses off AVX-512 entirely
- 2023: Intel announces AVX10 to unify the mess
- 2026: Nova Lake with AVX10.2 targets coherent 512-bit support across all cores (confirmed November 2025)
The irony was that AVX-512 was designed to unify Intel’s vector strategy. Instead, it became the most fragmented instruction set extension in x86-64 history. Requiring multiple replacement specifications to fix the damage. This is the equivalent of creating a problem so complex that you need to create a new solution just to solve the original solution’s problems.
Lessons Learned
-
Backward compatibility drives architecture: The register aliasing decision haunted MMX for years, but it enabled rapid adoption without OS changes.
-
Marketing matters as much as engineering: Intel’s aggressive MMX marketing, despite modest real-world gains, established SIMD as essential for consumer processors.
-
Competition accelerates innovation: AMD’s 3DNow! forced Intel to add FP SIMD capabilities years earlier than planned. The FMA controversy showed how fragmented standards hurt developers.
-
Compromises become permanent: Intel’s “sorta” 128-bit SSE implementation influenced x86 SIMD architecture for a decade.
-
Customer requirements can override engineering sanity: AVX-512 was pushed by a small percentage of customers but created massive fragmentation and power issues for everyone.
-
Fragmentation has costs: AVX-512’s bifurcation across SKUs and eventual disablement in hybrid architectures shows the danger of over-engineering for edge cases.
-
Sometimes the market decides: AMD won the FMA fight not through technical superiority, but through market dominance. The best instruction set is the one everyone actually uses.
The Legacy
The engineers who built x86 SIMD made decisions that shaped computing for decades, often under intense pressure and uncertainty. Their legacy is in every video encode, 3D render, AI inference, and scientific simulation happening on x86 processors today.
The battle continues with AVX10, but the lessons from MMX through AVX-512 remain: architecture decisions made in conference rooms in Haifa, Santa Clara, and Austin echo through decades of computing. The next chapter is being written now, will AVX10 finally unify Intel’s fractured vector strategy, or will history repeat itself?
One thing is certain: somewhere, right now, an engineer is making a decision that will seem brilliant, stupid, or utterly incomprehensible to programmers thirty years from now. That’s the nature of this business. And honestly? That’s what makes it fun.
“I’d much rather see that transistor budget used on other things that are much more relevant.”
– Linus Torvalds, 2020
Appendix A: x86 SIMD Syntax Reference
This appendix is taken from the documents which can be found here with the best of my ability. If you see any problems here, please don’t hesitate to contact me.
A.1 Register Naming Conventions
| Extension | Registers | Width | Naming Scheme |
|---|---|---|---|
| MMX | MM0-MM7 | 64-bit | MM<n> where n = 0-7 |
| SSE | XMM0-XMM15 | 128-bit | XMM<n> where n = 0-15 |
| AVX | YMM0-YMM15 | 256-bit | YMM<n> (upper 128 of ZMM) |
| AVX-512 | ZMM0-ZMM31 | 512-bit | ZMM<n> (full register) |
A.2 Instruction Suffix Encoding
The instruction suffix encodes the data type and operation:
| Suffix | Meaning | Example |
|---|---|---|
S | Signed integer | PMOVSXBD |
U | Unsigned integer | PADDUSB |
B | Byte (8-bit) | PADDB |
W | Word (16-bit) | PADDW |
D | Doubleword (32-bit) | PADDD |
Q | Quadword (64-bit) | PADDQ |
S | Single-precision FP | ADDPS |
D | Double-precision FP | ADDPD |
A.3 Assembly Syntax Variations
Intel Syntax (used throughout this document):
vaddps zmm0 {k1}{z}, zmm1, zmm2
vmovups zmmword ptr [rax], zmm3
AT&T Syntax (GNU assembler):
vaddps %zmm2, %zmm1, %zmm0{%k1}{z}
vmovups %zmm3, 64(%rax)A.4 EVEX/VEX Encoding Fields
Modern AVX-512 uses EVEX encoding with four modifier bytes:
| Field | Bits | Purpose |
|---|---|---|
pp | 2 | Opcode extension (00 = no extension) |
mm | 2 | VEX.mmmmm equivalent |
W | 1 | Vector width (0 = 128/256, 1 = 512) |
vvvv | 4 | Destination register specifier |
aaa | 3 | {k}{z} mask register (000 = no mask) |
B | 1 | Broadcast/Round control |
R | 1 | Register specifier extension |
The complete encoding follows:
$$ \text{EVEX} = 0x62 ;\Vert; \text{RR}{}^\prime\text{B} ;\Vert; \text{vvvv} ;\Vert; \text{aaa} $$
A.5 Intrinsic Type Mappings
| SIMD Type | C/C++ Intrinsic | Width (bits) |
|---|---|---|
__m64 | MMX | 64 |
__m128 | SSE | 128 |
__m128d | SSE (double) | 128 |
__m256 | AVX | 256 |
__m256d | AVX (double) | 256 |
__m512 | AVX-512 | 512 |
__m512d | AVX-512 (double) | 512 |
A.6 Common Operation Mnemonics
| Category | Instructions | Description |
|---|---|---|
| Arithmetic | PADD*, PSUB*, PMUL*, PMADD* | Integer arithmetic |
| FP Arithmetic | ADDPS, MULPS, DIVPS, SQRTPS | Single-precision FP |
| Compare | PCMPEQ*, PCMPGT*, CMPPS | Equality/greater-than |
| Logical | PAND, POR, PXOR, ANDPS, ORPS | Bitwise operations |
| Shuffle | PSHUFLW, SHUFPS, VPERM* | Lane manipulation |
| Load/Store | MOVAPS, MOVUPD, VBROADCAST* | Memory transfers |
| Convert | CVTDQ2PS, CVTPS2DQ, VCVT* | Type conversion |
| Mask | KAND, KOR, KXNOR, KNOT | Mask register ops |
A.7 Mask Register Operations
AVX-512 introduced dedicated mask registers (k0-k7):
vaddps zmm0, zmm1, zmm2
vaddps zmm0 {k1}{z}, zmm3, zmm4
vpaddd zmm5 {k2}, zmm6, zmm7
The mask value at position $i$ is computed as:
$$ \text{mask}[i] = \begin{cases} 1 & \text{if } \text{cond}(\text{src1}[i], \text{src2}[i]) \ 0 & \text{otherwise} \end{cases} $$
A.8 Lane Concepts in SIMD
A “lane” is a sub-vector within a wider register:
ZMM31 (512 bits) = 8 lanes of 64 bits each
|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
YMM15 (256 bits) = 4 lanes of 64 bits each
|-----------|-----------|-----------|-----------|
| 0 | 1 | 2 | 3 |
XMM0 (128 bits) = 2 lanes of 64 bits each
|-----------------------|-----------------------|
| 0 | 1 |
Lane-crossing operations require special handling and may incur performance penalties.
Appendix B: Assembly Code Analysis
Experimental Context: The assembly output examined in this appendix was generated by compiling the source C code with GCC 2.7.2.3 on a Debian Potato (2.2) system running under qemu-system-i386 virtualization (QEMU emulator version 9.2.4 (qemu-9.2.4-2.fc42)). The host system was an 11th Gen Intel(R) Core(TM) i7-11370H processor. This experimental setup recreates the 1997-era GCC compilation environment while running on modern hardware.
B.1 Source Code and Compiler Context
The following analysis examines the GCC 2.7.2.3 assembly output for the integer vector addition function discussed in Section I. The source C code was:
void add_i32(int* dest, const int* a, const int* b, int n);
void add_i32(int* dest, const int* a, const int* b, int n)
{
int i;
for(i = 0; i < n; ++i) {
dest[i] = a[i] + b[i];
}
}
Compiled with: gcc -O2 -S test.c (GCC 2.7.2.3, 1996-era)
Compiler Version Context: GCC 2.7.2.3 was released August 20, 1997, during the Pentium MMX era. This version predates any MMX intrinsic support in GCC. MMX intrinsics first appeared in GCC 3.1 (2002), and auto-vectorization was not added until GCC 4.0 (2005) .
B.2 Generated Assembly Analysis
The complete assembly output (assets/out.s) consists of 40 lines. The following analysis provides a line-by-line examination with verification against contemporary documentation.
Header Section (Lines 1-7):
| Line | Assembly | Analysis |
|---|---|---|
| 1 | .file "test.c" | Debug info directive, specifies source filename |
| 2 | .version "01.01" | GAS assembler version string |
| 3 | gcc2_compiled.: | VERIFIED: Valid GNU assembly identifier. The trailing dot is part of the symbol name, used by libg++ to identify GCC-compiled objects |
| 4 | .text | Code section directive |
| 5 | .align 4 | 16-byte alignment (2^4 = 16). Correct for Pentium Pro/Pentium II instruction fetch optimization |
| 6 | .globl add_i32 | Exports symbol globally |
| 7 | .type add_i32,@function | ELF symbol type directive for debug info |
Prologue and Parameter Loading (Lines 8-17):
The function follows the System V i386 ABI with cdecl calling convention :
add_i32:
pushl %ebp
movl %esp,%ebp
pushl %edi
pushl %esi
pushl %ebx
movl 8(%ebp),%esi
movl 12(%ebp),%ebx
movl 16(%ebp),%ecx
movl 20(%ebp),%edx
Stack Offset Verification:
- After
pushl %ebp: 4(%ebp) = saved %ebp - 8(%ebp) = first argument (dest)
- 12(%ebp) = second argument (a)
- 16(%ebp) = third argument (b)
- 20(%ebp) = fourth argument (n)
Register Allocation: The compiler saves %edi, %esi, %ebx as callee-saved registers per ABI. This is conservative. Only %edi is truly modified as an accumulator.
Main Loop (Lines 22-28):
.L5:
movl (%ebx,%eax,4),%edi
addl (%ecx,%eax,4),%edi
movl %edi, (%esi,%eax,4)
incl %eax
cmpl %edx,%eax
jl .L5
Addressing Mode Analysis:
- Scale factor of 4 correctly represents sizeof(int)
- Base+index addressing is optimal for array access
- No memory operands in instructions other than loads/stores
Epilogue (Lines 29-35):
.L3:
leal -12(%ebp),%esp
popl %ebx
popl %esi
popl %edi
leave
retB.3 Critical Finding: Why No MMX Instructions?
The claim in Section I—that GCC 2.7.2.3 “failed” to generate MMX code—requires clarification. What does “failed” mean in this context?
GCC 2.7.x had no capability to generate MMX instructions whatsoever . This was not an implementation failure but a fundamental design decision by Intel and the GCC project. Intel released MMX technology in January 1997 with aggressive marketing claims about performance improvements, yet they did not collaborate with the GCC team to ensure compiler support.
GCC MMX Support Timeline:
| Version | Release | MMX Support |
|---|---|---|
| GCC 2.7.x | 1995-1997 | None |
| GCC 3.0 | 2001 | Broken -mmmx flag |
| GCC 3.1 | 2002 | Initial intrinsics |
| GCC 4.0 | 2005 | Auto-vectorization |
(GCC 3.0: -mmx partial backend support; intrinsics incomplete and unstable)
What “Failed” Really Means:
The term “failed” implies Intel expected automatic MMX code generation from existing compilers. However, Intel did not work with the GNU Compiler Collection project to add MMX support. GCC was the primary compiler for Linux, BSD, and many embedded systems in 1997. If Intel wanted their marketing claims about MMX performance to reach everyday developers, they should have:
- Provided MMX intrinsic headers and documentation to the GCC team in 1996-1997
- Collaborated on machine description updates for MMX instruction selection
- Ensured GCC could generate MMX code alongside proprietary compilers like Intel’s ICC
Without this collaboration, the “marketing reality gap” widened. Intel claimed 50-700% improvements for MMX-optimized software, but developers using GCC could not achieve these speedups without writing hand-optimized assembly. The comparison in Section I between GCC output and MMX code is therefore a comparison between what Intel’s hardware could do and what Intel’s failure to work with the dominant open-source compiler allowed developers to achieve.
Evidence from GCC development history: Richard Henderson stated in December 2004: “As mentioned in another thread, we can’t generate proper MMX/3DNOW code ourselves. The existing intrinsics expect users to write them manually” .
The comparison in Section I is therefore comparing compiler output to what would require hand-written assembly. This is not a “failure” of GCC 2.7.2.3 in the sense of a bug or regression. It is a fundamental limitation of 1996-era compiler technology that Intel could have addressed but chose not to.
B.4 Performance Gap Analysis
Using instruction latency data from Agner Fog’s optimization manuals and Intel documentation, we can quantify the performance difference between the generated scalar code and an optimal MMX implementation .
Scalar Implementation (GCC output):
- Instructions per element: 5
- CPI (estimated): 1.1
- Cycles per element: ~5.5
Optimal MMX Implementation:
- Instructions per 2 elements: 4 (2 times MOVQ + 2 times PADDD)
- Instructions per element: 2.5
- CPI (estimated): 1.0
- Cycles per element: ~2.5
Performance Comparison (Pentium MMX, 233 MHz):
| Metric | Scalar | MMX | Improvement |
|---|---|---|---|
| Cycles/element | 5.5 | 2.5 | 2.2x |
| Elements/sec (10M array) | 38.8M | 93.2M | 2.4x |
| Memory ops per 8 elements | 24 | 12 | 2.0x |
| Branch ops per 8 elements | 8 | 4 | 2.0x |
EMMS Overhead: The EMMS instruction (required after MMX code) costs 2-4 cycles. For loops processing N elements (4 per iteration), overhead is 4/N cycles per element, negligible (0.4%) for N=1000.
B.5 The Productivity Gap
The 2-4x performance gap between hardware capability and compiler output in 1997 represents what we call the productivity gap, the difference between what SIMD hardware could do and what compilers could exploit .
Industry Response:
- Intel released MMX Technology Programmer’s Reference Manual (245794, 1997) encouraging manual intrinsics
- Developers wrote assembly code directly
- GCC eventually added intrinsics (GCC 3.1, 2002) and auto-vectorization (GCC 4.0, 2005)
- The fundamental challenge persists: modern compilers still miss 30-50% of vectorization opportunities
18 Feb 2026
I would like to thank to bal-e and hailey from the lobste.rs forum for noticing the AI hallucinations used to proofread this article. I am sincerely sorry for ever letting this happen. I shouldn’t have used them in the first place and I promise I will never use any AI tools to proofread nor help assist my articles from this blog hereon.
I would also like to thank hoistbypetard for inviting me to lobste.rs
I would also like to thank Peter Kankowski for pointing out the flaw in one of the examples at SSE3.
Kamikura, Masaru. “Intel 45nm Processor Demo” https://www.youtube.com/watch?v=TGCt4NyJWTY.
See also: “Intel Buys Into XML Processing With Conformative.” EE Times, February 8, 2006. https://www.eetimes.com/intel-buys-into-xml-processing-with-conformative/. Accessed January 15, 2026.
See also: Mysticial. “Zen4’s AVX512 Teardown.” MersenneForum, September 26, 2022. https://www.mersenneforum.org/node/21615. Accessed January 15, 2026.
See also: “Intel Officially Confirms AVX10.2 and APX Support in Nova Lake.” TechPowerUp, November 13, 2025. https://www.techpowerup.com/342881/intel-officially-confirms-avx10-2-and-apx-support-in-nova-lake/. Accessed January 16, 2026.
Additional Technical References
June 20, 2017. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-avx-512-instructions.html. Accessed January 16, 2026.
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. Accessed January 16, 2026.