AWS Trainium3 Deep Dive – A Potential Challenger Approaching

Original link: https://newsletter.semianalysis.com/p/aws-trainium3-deep-dive-a-potential


Hot on the heels of our 10K word deep dive on TPUs, Amazon launched Trainium3 (Trn3) into general availability and announced Trainium4 (Trn4) at its annual AWS re:Invent. Amazon has had the longest and broadest history of custom silicon in the datacenter. While they were behind in AI for quite some time, they are rapidly becoming competitive. Last year we detailed Amazon’s ramp of its Trainium2 (Trn2) accelerators aimed at internal Bedrock workloads and Anthropic’s training/inference needs.

Since then, through our datacenter model and accelerator model, we detailed the huge ramp that led to our blockbuster call that AWS would accelerate on revenue.

Today, we are publishing our next technical bible on the step-function improvement that is the Trainium3 chip: its microarchitecture, system and rack architecture, scale-up, profilers, software platform, and datacenter ramps. This is the most detailed piece we've written on an accelerator and its hardware/software; on desktop, a table of contents makes it possible to jump to specific sections.

With Trainium3, AWS remains laser-focused on optimizing performance per total cost of ownership (perf per TCO). Their hardware North Star is simple: deliver the fastest time to market at the lowest TCO. Rather than committing to any single architectural design, AWS maximizes operational flexibility. This extends from their work with multiple partners on the custom silicon side to managing their own supply chain and multi-sourcing components across multiple vendors.

On the systems and networking front, AWS is following an “Amazon Basics” approach that optimizes for perf per TCO. Design choices such as whether to use a 12.8T, 25.6T or a 51.2T bandwidth scale-out switch or to select liquid vs air cooling are merely a means to an end to provide the best TCO for the given client and the given datacenter.

For the scale-up network, while Trn2 only supports torus scale-up topologies (up to a 4x4x4 3D torus), Trainium3 adds a unique switched fabric that is somewhat similar to the GB200 NVL36x2 topology with a few key differences. This switched fabric was added because a switched scale-up topology has better absolute performance and perf per TCO for frontier Mixture-of-Experts (MoE) model architectures.

Even for the switches used in this scale-up architecture, AWS has decided to not decide: they will go through three different scale-up switch solutions over the lifecycle of Trainium3. They are starting with a 160-lane, 20-port PCIe switch for fast time to market, given the limited availability today of high lane- and port-count PCIe switches, later moving to 320-lane PCIe switches, and ultimately pivoting to a larger UALink fabric for the best performance.
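For a rough sense of what that first-generation switch implies, here is a minimal sketch of the port math, assuming the 160 lanes are split evenly across the 20 ports and that the switch runs the same PCIe Gen 6 lane rate Trainium3 uses for scale-up (both are our assumptions; AWS has not published the port configuration):

```python
# Hypothetical port math for the first-generation 160-lane, 20-port scale-up
# PCIe switch described above. The even lane split and Gen 6 rate are our
# assumptions, not AWS-published figures.
LANES, PORTS = 160, 20
GEN6_LANE_GBPS = 64                                   # raw per-lane rate, uni-directional

lanes_per_port = LANES // PORTS                       # -> x8 ports under an even split
port_bw_gb_s = lanes_per_port * GEN6_LANE_GBPS / 8    # -> ~64 GB/s uni-directional per port

print(f"x{lanes_per_port} ports, ~{port_bw_gb_s:.0f} GB/s per port (uni-directional)")
```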

On the software front, AWS’s North Star expands and opens their software stack to target the masses, moving beyond just optimizing perf per TCO for internal Bedrock workloads (i.e. DeepSeek/Qwen/etc., which run on a private fork of vLLM v1) and for Anthropic’s training and inference workloads (which run on a custom inference engine and all-custom NKI kernels).

In fact, they are conducting a massive, multi-phase shift in software strategy. Phase 1 is releasing and open sourcing a new native PyTorch backend. They will also be open sourcing the compiler for their kernel language called “NKI” (Neuron Kernel Interface) along with their kernel and communication libraries for matmul and ML ops (analogous to NCCL, cuBLAS, cuDNN, ATen ops). Phase 2 consists of open sourcing their XLA graph compiler and JAX software stack.

By open sourcing most of their software stack, AWS will help broaden adoption and kick-start an open developer ecosystem. We believe the CUDA Moat isn’t constructed by the Nvidia engineers that built the castle, but by the millions of external developers that dig the moat around that castle by contributing to the CUDA ecosystem. AWS has internalized this and is pursuing the exact same strategy.

Trainium3 will only have Day 0 support for Logical NeuronCore (LNC) = 1 or LNC = 2. These are the modes that ultra-advanced, elite L337 kernel engineers at Amazon/Anthropic want, but LNC=8 is what the wider ML research scientist community prefers before widely adopting Trainium. Unfortunately, AWS does not plan on supporting LNC=8 until mid-2026. We will expand much more on what LNC is and why the different modes are critical for research scientist adoption further down.

Trainium3’s go-to-market opens yet another front Jensen must now contend with, in addition to the other two battle theatres: Google’s TPUv7 with its extremely strong perf per TCO, and a resurgent AMD MI450X UALoE72 with potentially strong perf per TCO (especially after the “equity rebate” under which OpenAI gets to own up to 10% of AMD shares).

We still believe Nvidia will stay King of the Jungle as long as they keep accelerating their pace of development and moving at the speed of light. Jensen needs to ACCELERATE even faster than he has over the past 4 months. In the same way that Intel stayed complacent in CPUs while others like AMD and ARM raced ahead, if Nvidia stays complacent they will lose their pole position even more rapidly.

Today, we will discuss the two Trainium3 rack SKUs that support switched scale-up: the Trainium3 NL32x2 Switched and the Trainium3 NL72x2 Switched.

We will start by briefly reviewing the Trn2 architecture and explaining the changes introduced with Trainium3. The first half of the article will focus on the various Trainium3 rack SKUs’ specifications, silicon design, rack architecture, bill of materials (BoM) and power budget before we turn to the scale-up and scale-out network architecture. In the second half of this article, we will focus on discussing the Trainium3 Microarchitecture and expand further on Amazon’s software strategy. We will conclude with a discussion on Amazon and Anthropic’s AI Datacenters before tying everything together with a Total Cost of Ownership (TCO) and Perf per TCO analysis.

In total, there are four different server SKUs between Trainium2 and Trainium3 that the supply chain often refers to by their codenames, which are different than AWS’s branding.

Readers will probably find it confusing to untangle the various generation and rack form factor combinations while switching back and forth between AWS’s branding and the codenames that the ODMs/supply chain use. Our plea to AWS: whoever is in charge of product marketing and naming needs to stop with these confusing names. It would be ideal if they followed Nvidia and AMD with a nomenclature whereby the second half of the product name denotes the scale-up technology and the world size, e.g. the NVL72 in GB200 NVL72 refers to NVLink with a 72-GPU world size.

In the table below we aim to unconfuse our readers with a Rosetta stone for the various naming conventions different groups have been using:

Source: SemiAnalysis, AWS

Trainium3 delivers several notable generation on generation upgrades when it comes to specifications.

OCP MXFP8 FLOPs throughput is doubled, and OCP MXFP4 support is added, but at the same performance as OCP MXFP8. Interestingly, performance for higher-precision number formats like FP16 and FP32 remains the same as for Trn2. In the microarchitecture section, we will describe the implications of these tradeoffs.

The HBM3E is upgraded to 12-high for Trainium3, which brings memory capacity to 144GB per chip. Despite maintaining 4 stacks of HBM3E, AWS has achieved a 70% increase in memory bandwidth by going from the below-average pin speed of 5.7Gbps for Trn2 to 9.6Gbps for Trn3, the highest HBM3E pin speed we’ve seen yet. In fact, the 5.7Gbps pin speed used in Trn2 is more in line with HBM3 speeds, but it is still classified as HBM3E because it uses 24Gb dies to provide 24GB per stack in an 8-layer stack. The speed deficiency was due to using memory supplied by Samsung, whose HBM3E is notably sub-par compared to that of Hynix or Micron. For the HBM used in Trainium3, AWS is switching to Hynix and Micron to achieve much faster speeds. For per-vendor share of HBM by accelerator, use our accelerator model.
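As a sanity check on the ~70% figure, a minimal back-of-the-envelope sketch (assuming the standard 1024-bit interface per HBM3E stack, which the article does not state explicitly) reproduces the uplift from the quoted pin speeds:

```python
# Rough HBM bandwidth estimate from pin speed. Assumes the standard
# 1024-bit-wide interface per HBM3E stack; stack counts and pin speeds are
# the figures quoted above.
def hbm_bandwidth_tb_s(stacks: int, pin_speed_gbps: float, bus_width_bits: int = 1024) -> float:
    """Aggregate bandwidth in TB/s: stacks * pins * (Gbps per pin) / 8 bits / 1000."""
    return stacks * bus_width_bits * pin_speed_gbps / 8 / 1000

trn2 = hbm_bandwidth_tb_s(stacks=4, pin_speed_gbps=5.7)   # ~2.9 TB/s
trn3 = hbm_bandwidth_tb_s(stacks=4, pin_speed_gbps=9.6)   # ~4.9 TB/s
print(f"Trn2 ~{trn2:.1f} TB/s, Trn3 ~{trn3:.1f} TB/s, uplift ~{(trn3 / trn2 - 1):.0%}")
```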

The scale-up bandwidth per Trainium3 chip is doubled vs Trn2 by moving to PCIe Gen 6 which offers 64Gbps per lane (uni-directional) vs the 32Gbps per lane offered by PCIe Gen 5. Trainium3 uses 144 active lanes of PCIe for scale up, which on Gen6 means each Trainium3 supports 1.2 TB/s (uni-directional) of scale-up bandwidth per chip.
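The same arithmetic, using only the lane count and raw per-lane rates quoted above (and ignoring FLIT/encoding overhead), recovers the 1.2 TB/s figure:

```python
# Uni-directional scale-up bandwidth from the PCIe lane count and the raw
# per-lane rates quoted above (encoding/FLIT overhead ignored).
def scaleup_bw_tb_s(lanes: int, lane_rate_gbps: float) -> float:
    return lanes * lane_rate_gbps / 8 / 1000  # Gbps -> GB/s -> TB/s

gen6 = scaleup_bw_tb_s(lanes=144, lane_rate_gbps=64)  # ~1.15 TB/s, i.e. the ~1.2 TB/s quoted
gen5 = scaleup_bw_tb_s(lanes=144, lane_rate_gbps=32)  # ~0.58 TB/s if the same lanes ran Gen 5
print(f"Gen6: ~{gen6:.2f} TB/s, Gen5: ~{gen5:.2f} TB/s")
```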

Scale-out bandwidth support is doubled to a maximum of 400 Gb/s, but most Trainium3 racks produced will stick with the 200Gb/s per XPU scale-out speed that was used for Trn2.

For Trainium4, Amazon will use 8 stacks of HBM4, achieving 4x the memory bandwidth and 2x the capacity compared to Trainium3.

Zooming out to the rack solution level, AWS announced at re:Invent the Trainium3 (Gen1) UltraServer and the Trainium3 (Gen2) UltraServer, which correspond to the Trainium3 NL32x2 Switched and Trainium3 NL72x2 Switched names respectively. The key difference between the Trainium3 NL32x2 Switched and the Trainium3 NL72x2 Switched lies in the scale-up networking topology and rack architecture – this section will cover how topology and architecture differ between the two SKUs and discuss the AI workloads for which each architecture is most suitable and optimized.

Let’s start by going through the physical layout of each server type. The table below displays key specifications for each rack-scale SKU:

While Trainium2 will only be available in the first two rack SKU types – namely the Trn2 NL16 2D Torus and Trn2 NL64 3D Torus servers – Trainium3 will be offered in all four rack SKU types, with the majority of Trainium3 delivered in the Trainium3 NL32x2 Switched SKU in 2026. We expect most Trainium3 to be deployed in the Trainium3 NL32x2 Switched and Trainium3 NL72x2 Switched SKUs during its lifecycle.

Trainium3’s compute moves to the N3P node from the N5 node that is used for Trn2. Trainium3 will be one of the first adopters of N3P along with Vera Rubin and the MI450X’s Active Interposer Die (AID). There have been some issues associated with N3P leakage that need to be fixed, which can push timelines out. We have detailed this and its impact in the accelerator model.

We see TSMC’s N3P as the “HPC dial” on the 3nm platform, a small but meaningful step beyond N3E for extra frequency or lower power without new design rules. Public data suggests N3P keeps N3E’s rules and IP but adds about 5% higher speed at iso-leakage, or 5-10% lower power at iso-frequency, plus roughly 4% more effective density on mixed logic/SRAM/analog designs. This is exactly the type of incremental, low-friction gain hyperscalers want for giant AI ASICs.

Trainium3 is a good example of the kind of product that makes sense to build on this node. Trainium3 is emblematic of why custom accelerators will soak up 3 nm HPC capacity: dense matrix engines, fat SRAM slices, and very long on-die interconnects that benefit from every small reduction in device delay and leakage.

Under the hood, N3P is less a single breakthrough than many stacked Design-Technology Co-Optimization (DTCO) tweaks. N3-generation FinFlex libraries let designers mix wider and narrower fins inside a block, trading off drive strength versus area and leakage at fine granularity. TSMC has also refined liner and barrier processes in the lower metal stack of N3P that reduce line and via resistance compared with earlier 3nm incarnations. Together, these changes claw back enough margin to support higher clocks or lower Vmin on long global paths.

The challenge is that N3P does this while pushing interconnect scaling and patterning almost as far as current EUV tools allow. Minimum metal pitches in the low 20s of nanometers, high aspect ratio vias, and tighter optical shrinks all amplify BEOL variability and RC. Issues like via profile control, under-etch, and dielectric damage become first order timing problems. For TSMC, that means more fragile process windows, more sophisticated in-line monitoring, and heavier use of DTCO feedback loops to keep design rules aligned with what the line can print at scale. We currently see a slower than expected improvement in defect density of N3P, which is causing chip designers to either re-spin for yield or to wait in the queue for process improvements.

Readers who decipher die markings will see that the package shot above is just Trn2, and this is what we’ve used, as its package layout is exactly the same as Trainium3. The package is composed of two CoWoS-R assemblies, rather than one large interposer. The two compute dies interface with each other through the substrate.

Trainium3 will continue to utilize TSMC’s CoWoS-R, the platform that pushes power and latency limits while staying cost competitive. Instead of a full silicon interposer, Trainium3 follows its predecessor Trainium2 in using an organic thin-film interposer with six layers of copper RDLs on polymer, spanning a reticle-scale footprint at much lower cost and with better mechanical compliance than silicon interposers. It still supports fine wiring and micro-bump pitches of a few tens of micrometers between dies and the interposer, which is critical for dense chiplet fabrics and HBM interfaces. Underneath sits a high-layer-count ABF substrate of twenty build-up layers that fans out power and XSR signals to 130 to 150 micrometer C4 bumps at the module boundary, where the MCM connects to the board.

The six-RDL-layer count on CoWoS-R is a deliberate compromise rather than a hard limit. Purely organic interposers are cheap and compliant, but they eventually run out of beachfront when trying to integrate more lanes at 32 gigabits per second or more. IPDs (integrated passive devices) close that gap by dropping small silicon passive components into the organic landscape only where necessary. These thousands of IPDs in each RDL interposer enable sub-micron wiring density, very fine micro-bump pitches, and strong decoupling under the noisiest parts of the chip such as HBM PHY rings and core fabrics.

The chip’s front end is designed by Annapurna with the PCIe SerDes licensed from Synopsys. Alchip does the back end physical design and package design. We believe there may be some interface IP inherited from Marvell-designed Trainium2 in Trainium3, but it’s not meaningful in terms of content. Marvell also has package design at other 3rd party vendors.

Interestingly, there are two tapeouts, with one mask set owned by Alchip (called “Anita”) and another set owned by Annapurna directly (“Mariana”). With the Anita variant, Alchip procures chip components from TSMC directly, while Annapurna procures chip components directly for Mariana. The majority of volumes will be for Mariana. While Alchip had similar levels of design involvement for Mariana, the revenue they will see from Mariana should be lower than for Anita. Amazon and Annapurna are heavily cost focused and drive their suppliers hard. Compared to Broadcom’s ASIC deals, the Trainium projects have much less profit pool available to the chip design partners Alchip and Marvell. When it comes to performance per TCO, Annapurna places greater emphasis on driving down the TCO denominator.

Marvell ends up being the big loser from this. While they designed Trainium2, they lost the design bakeoff with Alchip for this generation. Marvell’s Trainium3 proposal was a chiplet-based design, with the I/O put onto a separate chiplet, instead of on a monolithic die alongside the compute, as is the case with Trainium2 and will be with Trainium3.

Marvell lost this socket due to poor execution for Trainium2. The development timeline took far too long. Marvell had problems designing the RDL interposer for the package as well, and Alchip had to step in to help deliver something workable.

For Trainium4, multiple design houses will be involved across two different tracks based on different scale-up protocols. We first detailed the UALink / NVLink split for Trainium4 in the accelerator model 7 months ago in May. Alchip, just as with Trainium3, leads the backend design for both tracks.

There will likely be a significant timing gap between Nvidia’s VR NVL144 and NVLink Fusion products like Trainium4. For the NVLink Fusion track, timelines may slip even further, as the Fusion chiplet introduces additional integration and validation requirements, and most of Nvidia’s mixed-signal engineers will be focusing their attention on the Nvidia VR NVL144 new product introduction.

While Trainium4 with NVLink Fusion may not arrive soon, we believe AWS secured favorable commercial terms and is unlikely to be paying Nvidia’s typical ~75% gross margins. Nvidia has strong strategic incentives to enable interoperability with Trainium4, since allowing AWS to use NVLink helps preserve Nvidia’s system-level lock-in. As a result, Nvidia is likely to offer more attractive pricing than it would under its standard gross margin structure.

Unlike VR NVL144, which is limited to a fixed 72-package NVLink domain, Trainium4 can extend NVLink scale-up via cross-rack AECs, enabling much larger coherent domains of 144+ packages. NVLink 6 uses 400G bidirectional SerDes, allowing 200G RX and 200G TX simultaneously on the same wire. This 400G BiDi signaling already pushes copper to its practical limits, although some vendors may attempt a half-generation step to 600G BiDi.

Trainium2 NL16 2D Torus and Trainium2 NL64 3D Torus SKUs are branded as the Trainium2 Server and the Trainium2 UltraServer respectively and were announced at re:Invent 2024. We covered both architectures in our Trainium2 deep dive:

To briefly revisit the Trainium2 SKUs: the key difference between the Trainium2 NL16 2D Torus and Trainium2 NL64 3D Torus SKUs is the scale-up world size. While the Trainium2 NL16 2D Torus occupies half a server rack for its entire scale-up world size, with that world containing 16 Trainium2s in a 4x4 mesh 2D torus, the Trainium2 NL64 3D Torus connects four Trainium2 NL16 2D Torus half racks together – taking up two racks in total. The four Trainium2 NL16 2D Torus half-rack servers are connected using AECs to create a scale-up world size of 64 Trainium2s in a 4x4x4 3D torus.
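To make the torus wiring concrete, here is a small illustrative sketch (our own, not AWS code) of how a chip at coordinate (x, y, z) in a 4x4x4 3D torus finds its six scale-up neighbors via wrap-around modular arithmetic:

```python
# Illustrative only: neighbor computation for a 4x4x4 3D torus like the
# 64-chip Trainium2 UltraServer scale-up domain. Each chip at coordinate
# (x, y, z) links to six neighbors, with wrap-around on every axis.
from itertools import product

DIM = 4  # 4x4x4 torus -> 64 chips

def torus_neighbors(x: int, y: int, z: int, dim: int = DIM):
    return [
        ((x + dx) % dim, (y + dy) % dim, (z + dz) % dim)
        for dx, dy, dz in [(1, 0, 0), (-1, 0, 0),
                           (0, 1, 0), (0, -1, 0),
                           (0, 0, 1), (0, 0, -1)]
    ]

# Every chip has exactly 6 distinct neighbors; the corner chip (0, 0, 0)
# wraps around to chips with coordinate 3 on each axis.
print(torus_neighbors(0, 0, 0))
assert all(len(set(torus_neighbors(*c))) == 6 for c in product(range(DIM), repeat=3))
```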
