80386乘法和除法

80386乘法和除法
80386 Multiplication and Division

原始链接: https://nand2mario.github.io/posts/2026/80386_multiplication_and_division/

## 英特尔80386：一场个人电脑革命英特尔80386于1985年发布，是一款关键的处理器，标志着IBM PC向32位计算的过渡，并为当今主导地位的x86架构奠定了基础。它解决了16位8086/286处理器的局限性，将内存访问扩展到4GB，并引入了虚拟内存和保护模式等功能。重要的是，它保持了与现有DOS软件的向后兼容性。除了架构之外，386还显著提高了算术性能。它实现了用于乘法和除法的专用硬件，达到了每周期一位的速度——比其前身有了实质性的飞跃。乘法使用了“加法和移位”算法，利用内部寄存器和微代码来实现8、16和32位操作的灵活性。除法采用非恢复算法，有符号除法需要额外的步骤进行符号校正。虽然是迭代式的，但这种设计在当时是最先进的。现代CPU现在使用专用的乘法器阵列来进行更快地计算，但386的方法仍然证明了高效的硬件利用率，以及FPGA等实现的实用模型。它的影响是不可否认的，为早期版本的Windows、Linux以及无数其他塑造现代计算的应用提供了动力。

黑客新闻新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交登录 80386乘法和除法 (nand2mario.github.io) 12点由 nand2mario 2小时前 | 隐藏 | 过去 | 收藏 | 讨论指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请YC | 联系搜索：

When Intel released the 80386 in October 1985, it marked a watershed moment for personal computing. The 386 was the first 32-bit x86 processor, increasing the register width from 16 to 32 bits and vastly expanding the address space compared to its predecessors. This wasn't just an incremental upgrade—it was the foundation that would carry the PC architecture for decades to come.

The timing was significant. By the mid-1980s, the IBM PC had established x86 as the dominant PC architecture, but the 16-bit 8086/286 processors were hitting their limits. Memory was constrained to 1MB (or 16MB with the 286's limited protected mode). Competing 32-bit architectures like the Motorola 68020 threatened Intel's dominance. The 386 was Intel's answer: full 32-bit computing with backward compatibility for the massive library of existing DOS software.

The 386 introduced important and long-lasting x86 features: a flat 4GB address space, virtual memory with paging, and a protected mode that actually worked. It would go on to run Windows 3.0, Windows 95, early Linux, and countless other operating systems that shaped modern computing.

Faster arithmetic

In addition to its architectural advances, the 386 delivered a major jump in arithmetic performance. On the earlier 8086, multiplication and division were slow — 16-bit multiplication typically required 120–130 cycles, with division taking even longer at over 150 cycles. The 286 significantly improved on this by introducing faster microcode routines and modest hardware enhancements.

The 386 pushed performance further with dedicated hardware that processes multiplication and division at the rate of one bit per cycle, combined with a native 32-bit datapath width. The microcode still orchestrates the operation, but the heavy lifting happens in specialized datapath logic that advances every cycle.

Here are the actual cycle counts from the Intel 386 Programmer's Reference Manual:

Instruction	8-bit	16-bit	32-bit
MUL	9-14	9-22	9-38
IMUL	9-14	9-22	9-38
DIV	14	22	38
IDIV	19	27	43

The ranges for MUL/IMUL reflect an "early-out" optimization—the loop exits early when the remaining multiplier bits are all zeros (or all ones for signed). Division has no early-out, so cycle counts are fixed at roughly width + overhead.

To save silicon, the 386 reuses the main ALU for the per-iteration add/subtract work rather than having a separate multiplier unit. The microcode controls the iteration, while dedicated datapath logic handles the shifting and loop termination. Let's look at how these algorithms work.

Add-and-shift multiplication

The classic multiplication algorithm in processors is the Booth algorithm. However, the 80386 does not use that. Instead, an "add-and-shift" multiplication algorithm is used. This is similar to grade-school long multiplication. The difference is that instead of moving from lower digits to higher, we shift to the right. Here's the data layout:

Multiplication data layout

Three key internal registers participate in multiplication: MULTMP, TMPB, and SIGMA. A notable challenge is that x86 instructions support 8-bit, 16-bit, and 32-bit operands. Consistent with the design philosophy of the 8086, the 386 achieves this flexibility by reusing the same registers and microcode routines for all operand sizes. In most cases, the identical hardware and microcode sequence accommodate different widths seamlessly. The diagram above shows how, for example, the result of multiplying two 16-bit numbers is arranged within a 32-bit product: it occupies the lower half of the SIGMA register and the upper half of TMPB.

Here is the multiplication algorithm in pseudocode:

1: COUNTR = width-1
2: while (true):
3:   if (TMPB[0]) SIGMA <= SIGMA + MULTMP
4:   {SIGMA, TMPB} >>= 1      // arithmetic shift for signed
5:   if (--COUNTR==0) break
6:   if (remaining TMPB bits are all 0 or all 1 for signed) break
7: {SIGMA, TMPB} >>= COUNTR   // compensate for early exit
8: correction for signed multiplication

Shifting to the right rather than the left simplifies the hardware circuits. Line 6 implements the important "early-out" optimization, which means the loop can terminate early if the remaining multiplier bits are all zeros—or all ones, in the case of signed multiplication. When this happens, line 7 adjusts for the early exit by shifting the accumulated result right by the number of remaining COUNTR bits.

Lines 1–7 fully describe unsigned multiplication. To extend this to signed multiplication, only a few tweaks are needed: use arithmetic (not logical) shifts on lines 4 and 7, and, as a final correction in line 8, subtract the multiplicand from the upper product register (SIGMA) if the multiplier was negative. For a deeper dive into the mathematics, see college-level computer organization resources such as this one.

The 80386 multiplication microcode closely mirrors the algorithm described above, and shows both the timing and the likely underlying hardware involved. The microcode routine shown here handles register-based multiplication—both unsigned and signed—and supports all three operand sizes: 8, 16, and 32 bits. Other forms, such as multiplying with a memory operand, are implemented similarly.

Before we examine the code, it’s helpful to quickly review the 80386’s microcode syntax and conventions. While the 8086 used 21-bit micro-operations, the 80386 expanded these to 37 bits, adding fields to control more complex hardware functionality. Moves are written as src->dest, which simply means copying data from one register to another. The alujmp field directs either the ALU (using src and alu_src as inputs) or the microcode control flow, handling everything from arithmetic to jumps to indirect operations (alu_src as the jump target). Pay special attention to the RPT keyword found on the third line of the upcoming listing: this signals the microcode sequencer to repeatedly execute a micro-instruction, decrementing the COUNTR register each time, and continuing until COUNTR reaches zero, i.e. looping for COUNTR+1 iterations.

; MUL/IMUL r
; src     dest    alu_src        alujmp  uop sub busop
DSTREG -> MULTMP  BITS_V         LDCNTR          ; MULTMP=r (multiplicand), COUNTR=width-1
eAX_AL -> TMPB    0              PASS2           ; TMPB=multiplier (AL/AX/EAX)
SIGMA             TMPB           IMUL3   RPT DLY ; hardware mult loop with early-out 
SIGMA                            PASS            ; pass through SIGMA
COUNTR -> TMPD                                   ; save remaining COUNTR
RESULT -> TMPC    TMPD           LDBSR8          ; load shift count: right shift, COUNTR
SIGMA  -> TMPD    TMPC           SHIFT           ; shift {SIGMA,RESULT} to get low result
SIGMA  -> eAX_AL  TMPD           MULFIX          ; write low result, set flags, signed mult correction
SIGMA             TMPD           SHIFT   RNI     ; shift {0,ProdU} to get high result
SIGMA  -> eDX_AH                                 ; write high result

The RESULT register is used by both multiplication and division. For multiplication, it accumulates the lower half of the product as bits shift right out of TMPB during the loop. MULFIX is the correction for signed multiplication on pseudocode line 8.

Other variants

The 386 introduced two new forms of IMUL beyond the original single-operand form:

Two-operand: IMUL reg, r/m - multiplies reg by r/m, stores in reg (single-width result)
Three-operand: IMUL reg, r/m, imm - multiplies r/m by immediate, stores in reg (single-width result)

These variants are interesting because they only produce a single-width result (discarding the upper half), making them faster for common cases where overflow isn't expected. The microcode for these uses a slightly different entry point that skips writing the upper result to EDX/DX/AH.

Division

80386 uses the standard non-restoring division algorithm for division. Here's the data layout:

Division data layout

The dividend is {SIGMA, DIVTMP} (max 64 bits), while the divisor is TMPB (max 32 bits). Each iteration shifts the dividend left by one bit and either adds or subtracts the divisor, building up the quotient in RESULT one bit at a time.

1: do:                               // loop body is DIV7
2:     {SIGMA,DIVTMP} <<= 1;
3:     if (SIGMA < 0) SIGMA += TMPB;
4:     else           SIGMA -= TMPB;
5:     RESULT = (RESULT << 1) | (SIGMA >= 0 ? 1 : 0)
6:     COUNTR--;
7: while (COUNTR > 0)
8: if (SIGMA < 0) SIGMA += TMPB;     // DIV5

Let's look at the division routine (DIV r at F6.6/F7.6) directly.

; DIV r
eAX_AL -> DIVTMP  BITS_V         LDCNTR          ; DIVTMP = lower half of dividend, COUNTR=width-1
eDX_AH                           PASS            ; SIGMA = upper half of dividend
DSTREG -> TMPB                                   ; TMPB = divisor
SIGMA             TMPB            DIV7   RPT DLY ; Loop: dividend={SIGMA,DIVTMP}, divisor=TMPB
SIGMA             TMPB            DIV5           ; Final correction
SIGMA                            PASS            ; Preserve remainder through ALU
RESULT -> eAX_AL                         RNI     ; accumulator = quotient 
SIGMA  -> eDX_AH                                 ; upper-half reg = remainder

DIV7 and DIV5 are both single-cycle micro-operations. DIV7 implements the core of the division loop, corresponding to pseudocode lines 2–5 (excluding the COUNTR decrement). With each iteration, DIV7 updates SIGMA (the remainder) and RESULT (the quotient accumulator). The loop is controlled by the RPT instruction, which keeps the sequencer repeatedly executing DIV7 for COUNTR+1 iterations—there’s no early exit for division. After completing the main loop, DIV5 performs the final correction required by the non-restoring division algorithm (pseudocode line 8).

Signed division (IDIV)

IDIV is more complex than DIV because it must handle signs. The approach is:

Convert dividend and divisor to absolute values
Perform unsigned division
Adjust signs of quotient and remainder

Here's the IDIV microcode:

; IDIV r
-1                BITS_V         ADD             ; COUNTR=width-2
SIGMA  -> COUNTR
eDX_AH                           PASS            ; SIGMA=upper dividend
eAX_AL -> DIVTMP                                 ; DIVTMP=lower dividend
DSTREG -> TMPB                                   ; TMPB=divisor
SIGMA             TMPB           PREDIV          ; |dividend|divisor|, save signs, first iteration
SIGMA             TMPB            DIV7   RPT DLY ; main division loop
SIGMA             TMPB            DIV5           ; non-restoring correction
SIGMA             TMPB           IDIV1           ; correct remainder sign
SIGMA                            PASS
SIGMA  -> TMPB                                   ; save remainder
RESULT                           IDIV2           ; correct quotient sign -> SIGMA
TMPB   -> eDX_AH                         RNI     ; write remainder
SIGMA  -> eAX_AL                                 ; write quotient

The key micro-ops are:

PREDIV: Computes absolute values of dividend and divisor, saves their signs in internal flip-flops, and performs the first division iteration
IDIV1: Corrects the remainder's sign (remainder has same sign as dividend)
IDIV2: Corrects the quotient's sign (negative if operand signs differ)

This explains why IDIV takes 5 more cycles than DIV - the extra cycles handle sign computation and correction.

Additional notes

One of the biggest hurdles in deciphering CPUs is interpreting the role and meaning of each micro-operation and constant. Their interdependence often makes the process both challenging and fascinating. Consider BITS_V: at first glance, seeing it used in LDCNTR and loop logic, you might assume it simply represents the instruction’s bit width—such as 8, 16, or 32—meaning the RPT instruction would run for that number of COUNTR cycles. This approach seems to suffice for MUL and DIV. However, when applied to IDIV and AAM, the microcode repeatedly failed to function as expected. After many hours spent troubleshooting, I finally came across a clue in a seemingly unrelated part of the microcode:

; PUSHAd
ESP               BITS_V         SUB         DLY      0    
SIGMA     INDSTK  -1             ADD             IN=+      
...
SIGMA  -> eSP                                DLY

This finally gave me the hint that BITS_V is width-1 instead of width. Here PUSHA pushes 8 registers to the stack, so SP should be subtracted by 8*2=16 or 8*4=32 bytes. The existence of SIGMA-1 (SIGMA, -1, ADD) after SIGMA=ESP-BITS_V (ESP, BITS_V, SUB) clearly indicates that BITS_V is one less than 16 or 32.

Comparison with modern CPUs

The 386's iterative approach to multiplication and division was state-of-the-art for its time, but modern x86 processors have moved far beyond it:

Era	Processor	32-bit MUL	32-bit DIV
1985	80386	9-38 cycles	38 cycles
1993	Pentium	10 cycles	41 cycles
2000s	Core 2	3-4 cycles	17-41 cycles
2020s	Zen 3/Alder Lake	3-4 cycles	13-19 cycles

Modern CPUs use dedicated multiplier arrays (often Booth-encoded Wallace trees) that can multiply 64-bit numbers in just a few cycles. Division remains slower because it's inherently sequential - each quotient bit depends on the previous remainder. However, modern CPUs use radix-4 or radix-16 division (computing 2-4 bits per cycle) and sophisticated prediction to speed things up.

The 386's "one bit per cycle" approach is elegant in its simplicity and its reuse of the main ALU. For an FPGA implementation, this microcode-driven design is actually quite practical - it minimizes hardware while still achieving reasonable performance.

Follows me on X (@nand2mario) for updates, or use RSS.

Credits: This analysis of the 80386 draws on the microcode disassembly and silicon reverse engineering work of reenigne, gloriouscow, smartest blob, and Ken Shirriff. For a detailed examination of the silicon itself, see Ken Shirriff’s silicon reverse engineering series on the 80386.