Learn Something Old Every Day, Part XVIII: How Does FPU Detection Work?

This post ended up being much longer than originally intended because halfway into writing it, I found that 286 and later CPUs don’t behave the way I had assumed they would…

While investigating a bug related to a program using floating-point math on a 386SX system with no FPU, I started pondering how exactly FPU detection works on 286 and newer CPUs. Although math co-processors became standard some 30 years ago, on old PCs they were an uncommon and expensive add-on, and a 66 MHz 486SX2 was still a usable yet FPU-less processor in the mid-1990s.

The CPU/FPU interface and FPU detection on the 8086/8088 were discussed before. To recap, the 8086/8087 interface is a little odd because it is in fact a generic co-processor interface. The 8086 was launched in 1978; probably sometime in 1979, the Intel 8089 I/O Coprocessor arrived; the 8087 only appeared in 1980.

The ESC instruction (opcode range D8h-DFh) was used for communication with a co-processor on the 8086. While the CPU didn’t exactly execute the instruction, it had to know how to decode it. The ESC instruction used a standard ModR/M byte to indicate an optional memory operand, which the CPU needed to be able to write to or read from the co-processor.
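To make the decoding concrete, here is a minimal C sketch of what a CPU has to recognize in an ESC instruction. The type and function names are invented for illustration; the only things taken as given are the D8h-DFh opcode range and the standard ModR/M layout (mod in bits 7-6, reg in bits 5-3).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type and names, for illustration only. */
typedef struct {
    bool    is_esc;       /* first byte is in the D8h-DFh range */
    bool    has_mem_op;   /* ModR/M mod field is not 3, so a memory operand is encoded */
    uint8_t copro_opcode; /* 6 bits: low 3 bits of the first byte, then the ModR/M reg field */
} esc_info;

esc_info decode_esc(uint8_t opcode, uint8_t modrm)
{
    esc_info info = { false, false, 0 };

    if ((opcode & 0xF8) != 0xD8)   /* D8h..DFh all share the top five bits 11011b */
        return info;

    info.is_esc       = true;
    info.has_mem_op   = (modrm >> 6) != 3;
    info.copro_opcode = (uint8_t)(((opcode & 7) << 3) | ((modrm >> 3) & 7));
    return info;
}
```

For example, the bytes DDh 3Eh (FNSTSW with a 16-bit direct address) decode as an ESC with a memory operand, while DBh E3h (FNINIT) decodes as an ESC with no memory operand.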

If there is no co-processor attached to an 8086, the ESC instructions simply do nothing because the co-processor isn’t there to read or write any data. However, the WAIT instruction designed for synchronization will (in a typical 8088/8086 PC design) hang indefinitely because the missing co-processor acts as if it were permanently busy. For that reason, FPU detection must use the non-waiting FNINIT/FNSTSW sequence (or an equivalent) to avoid hangs on 8086-class machines.

Additional information about what things look like from the 8087’s perspective has been recently published.

As an aside, it should be noted that the IBM PC had a mechanism to report FPU presence (or absence) in the BIOS equipment word (INT 11H). However, this detection relied entirely on a user-settable DIP switch on the PC and PC/XT motherboard. An article in PC Tech Journal in June 1985 (Machine Specifics, by Ted Forgeron) notes that IBM’s own manual gave users incorrect instructions, telling them that the DIP switch needed to be in the ON position to signal FPU presence. In reality, the DIP switch setting to report FPU presence was OFF. As a consequence, the BIOS FPU presence bit could not be trusted in PC and XT systems (on the PC/AT, BIOS detected FPU presence during POST) and software needed to explicitly check for FPU presence to be certain.
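For reference, the BIOS check amounts to testing a single bit. The sketch below assumes the caller has already obtained the equipment word (the value INT 11H returns in AX); the function and macro names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define EQUIP_FPU_BIT 0x0002u  /* bit 1 of the equipment word: math co-processor */

/* Returns what the BIOS claims, which on a PC or PC/XT is only as
 * trustworthy as the DIP switch setting. */
bool equipment_reports_fpu(uint16_t equipment_word)
{
    return (equipment_word & EQUIP_FPU_BIT) != 0;
}
```

Given the DIP switch confusion described above, a result from this check is best treated as a hint rather than as proof either way.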

By the time the 80286 was rolled out in 1982, Intel had effectively given up on a generic co-processor interface (no co-processor other than the 80287 is known). Although Intel’s 286 documentation mentions the ESC instruction here and there, it is not listed in the instruction reference at all (unlike WAIT). ESC is only indirectly documented in the 80287 programming reference. The situation was the same with the 80386/80387 documentation; no ESC in the 386, only FPU instructions in the 387, according to Intel.

Unlike the 8086, the 286 and later had a convenient ability to simplify floating-point instruction emulation (a non-trivial topic on the 8086). The EM bit in the Machine Status Word (MSW, later the low 16 bits of CR0 register), when set, causes the ESC opcode to trigger a Coprocessor Not Available fault (exception 7).

Intel’s idea clearly was that on systems with no FPU, the EM bit should be set. Which is all well and good, except that the firmware or operating system still needed to figure out how to set the EM bit based on FPU presence or absence. The hardware itself offered no aid to detect an FPU; it had to be done in software. And to detect whether an FPU was present, Intel suggested executing an FNINIT/FNSTSW or similar sequence.

So… based purely on Intel’s programming documentation, in order to detect the presence of an FPU, one had to execute FPU instructions—defined for a chip that might or might not be present. How could that possibly work?!

286/287 Interface

Intel’s documentation was a lie, as usual. Although the 80286/80287 interface (and the very similar 80386/80387 interface) was quite different from the original 8086/8087 interface in detail, it was conceptually not that different.

The original 80287 was very closely related to the 8087 and used the same execution unit (NEU, or Numeric Execution Unit) as the 8087. However, the bus interface (BIU, or Bus Interface Unit) was significantly different.

Rather than the 287 FPU snooping on the CPU’s bus and looking for ESC opcodes on its own, it responded to I/O cycles on reserved ports 00F8h, 00FAh, and 00FCh. As with the 8087, the 287 was still connected directly to the 286’s data bus in order to exchange data, but all memory accesses were performed through the 286, not directly by the 287 (as was the case with the 8087). The 8087 could become a bus master and directly access memory; the 287 could only communicate with the CPU and had to ask the processor to perform memory accesses on its behalf.

It is obvious that the 286 (not 287) still had to decode ESC opcodes, regardless of what Intel’s programming documentation says (or rather doesn’t say). When no FPU was present, the I/O cycles generated by the CPU had no effect, and the FPU never asked for any data transfers… except see below.

There was one other significant change brought by the 286. On the 8086/8087, users had to code WAIT instructions before every floating-point instruction (most assemblers did that automatically). That was because the FPU couldn’t respond to the next floating-point instruction while it was still busy with a previous one.

The 286/287 no longer required these explicit WAIT instructions. As Intel put it (page B-2 of the Intel 80286 and 80287 Programmer’s Reference Manual, 1987): the 80286 automatically tests the BUSY line from the 80287 to ensure that the 80287 has completed its previous instruction before executing the next ESC instruction.

That is an interesting fact, because it requires the 286 (not 287!) to understand which instructions require the BUSY line testing and which ones don’t.

Parallelism

The 8086 was designed to allow the CPU to execute in parallel with a co-processor, using the WAIT instruction for synchronization.

First of all, please note that there is some confusion about the WAIT instruction, also sometimes called FWAIT. (F)WAIT is sometimes classified as an FPU instruction; it is not really, as it is executed purely by the CPU. Unlike FPU instructions, (F)WAIT does not communicate with the FPU at all; it only observes the BUSY signal input to the CPU. Of course, this line has been blurred since the 486, when the FPU moved onto the same chip as the CPU.

Why are there two mnemonics for the same instruction? As always, there is a reason. The WAIT instruction is opcode 9Bh and that’s that. FWAIT, however, may be assembled as opcode 9Bh or as the sequence 90h, 9Bh (that is, NOP / WAIT). The two-byte sequence is emitted when producing floating-point code that can be emulated on 8086/8088 systems. Since those systems have no built-in facility for FPU emulation, floating-point instructions as well as WAIT need to be replaced with software interrupts. And because a software interrupt needs at least two bytes, the extra NOP is necessary to leave enough space. (With an FPU emulator, there is no parallelism and no waiting is needed; however, WAIT still needs to not hang the system!)
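The two encodings can be sketched as a tiny emitter. This is only an illustration of the byte sequences named above, not how any particular assembler is implemented; the function name and the emulatable flag are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the encoding of FWAIT into out (room for at least 2 bytes)
 * and returns the number of bytes written.
 * A nonzero emulatable flag selects the padded, patchable form. */
size_t encode_fwait(int emulatable, uint8_t *out)
{
    size_t n = 0;

    if (emulatable)
        out[n++] = 0x90;   /* NOP: padding so a two-byte software interrupt fits */
    out[n++] = 0x9B;       /* WAIT */
    return n;
}
```

With the flag clear the result is the single byte 9Bh; with it set, the result is 90h, 9Bh, which an emulator setup can later overwrite in place.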

One might say that the 8087 was well suited to parallel execution because it was slow. Simple floating-point addition or subtraction took around 100 clock cycles. Division took more than 200. The FYL2X and FYL2XP1 instructions could take around 1,000 cycles.

How useful this parallelism was in practice is another question. Most FPU instructions were closely followed by another FPU instruction and the CPU could not do a whole lot in between. When executing a lengthy FPU instruction, the CPU almost certainly needed the actual result and couldn’t just forge ahead. That said, the CPU was able to handle things like hardware interrupts while the FPU was busy. In a multi-tasking system, the CPU might be able to switch to a different task, as long as that task didn’t use the FPU as well.

On the FPU itself, there were two classes of instructions: math and control (or administrative). The NEU took care of the slow (or very slow) math instructions (FADD, FMUL, FSQRT etc.). The BIU executed the control instructions like FINIT, FLDCW, FSTSW, or FSAVE/FRSTOR.

There was also parallelism between the BIU and NEU within the FPU. For example, the FNSTSW instruction could be executed, at least on the 8087 and presumably 80287, while the NEU was busy–which was reflected in the BSY bit of the FPU Status Word (FSW).
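Once the status word has been stored with FNSTSW, checking whether the NEU was busy at that moment is a one-bit test. The sketch below assumes the word has already been read into a variable; the names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define FSW_BUSY 0x8000u  /* B (busy) flag, bit 15 of the FPU Status Word */

/* True if the stored status word says the NEU was still executing. */
bool fsw_neu_busy(uint16_t fsw)
{
    return (fsw & FSW_BUSY) != 0;
}
```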

In general, the programmer had to explicitly synchronize the CPU and FPU execution by using the (F)WAIT instruction or using the waiting forms of FPU instructions. However, certain control instructions required no explicit synchronization because they already did the work internally. This is how it was described by Intel (80287 Numeric Processor Extension (NPX), 1987, page 2-49):

There are several NPX control instructions where automatic data synchronization is provided; however, the FSTSW/FNSTSW, FSTCW/FNSTCW, FLDCW, FRSTOR, and FLDENV instructions are all guaranteed to finish their execution before the CPU can read or alter the referenced memory locations.

The 80287 provides data synchronization for these instructions by making a request on the Processor Extension Data Channel before the CPU executes its next instruction. Since the NPX data transfers occur before the CPU regains control of the local bus, the CPU cannot change a memory value before the NPX has had a chance to reference it. In the case of the FSTSW AX instruction, the 80286 AX register is explicitly updated before the CPU continues execution of the next instruction.

In other words, for some FPU control instructions, the FPU effectively held the CPU busy during the ESC opcode execution. This ensured that the CPU couldn’t modify any operands the FPU might still read, and at the same time the CPU couldn’t access memory written by the FPU before the FPU was done.

If one thinks about the 8086/8087 architecture, it is obvious that the BIU had to execute in lockstep with the CPU. As a consequence, control instructions could be executed without any waiting because if the CPU was ready to execute the next ESC opcode, the BIU had to be done with any previous ESC opcodes, even though the NEU could still be busy.

This is also why assemblers supported both waiting and non-waiting forms of these instructions. For example FNSTSW (non-waiting form) could start and finish executing while the NEU was busy. While that may have been useful in some cases, if the programmer wanted to read the FPU Status Word (FSW) as it was after completing the previous FPU calculation, FSTSW (the waiting form) had to be used.

Some control instructions internally synchronized between the BIU and NEU. For example, the FNSTENV and FNSAVE instructions could be executed even if the FPU was busy; however, the state would not be saved until the FPU was done (i.e. the NEU was no longer busy).

The FNINIT instruction performs an FPU reset. For that reason, FNINIT could also be executed without waiting and might abort any NEU operation still in progress.

What If There’s No FPU?

Here’s an example of FPU detection logic from Intel’s 287 documentation:

FND_287: 
FNINIT         ; initialize numeric processor.
FSTSW STAT     ; Store status word into location
MOV AX,STAT    ; STAT.
OR AL,AL       ; Zero Flag reflects result of OR.
JZ GOT_287     ; Zero in AL means 80287 is present.
; No 80287 Present
SMSW AX
OR AX,0004H    ; Set EM bit in Machine Status Word
LMSW AX        ; to enable software emulation of
JMP CONTINUE   ; 287.
; 80287 is present in system
GOT_287:
SMSW AX
OR AX,0002H    ; Set MP bit in Machine Status Word
LMSW AX        ; to permit normal 80287 operation
CONTINUE:          ; and off we go

In principle, the FSTSW instruction in the example ought to have been FNSTSW, otherwise the code would likely hang on an 8086/8088 system with no FPU. Then again, the code is obviously written for a 286 (using LMSW/SMSW instructions), so running it on an 8086 wasn’t a concern.

The example also clearly shows how software is responsible for setting the MSW. The hardware can’t do it; software must detect the FPU presence or absence and act accordingly.

The manual includes a curious note about the sample: It assumes that the system hardware causes the data bus to be high if no 80287 is present to drive the data lines during the FSTSW (Store 80287 Status Word) instruction. More about that later.

Intel’s documentation is pretty clear on what happens when an FPU is present. FNINIT resets the FPU, FSTSW stores the status word which will always have a zero value in the low 8 bits.
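The decision made by Intel's sample can be restated in C as a pure function of the stored status word. This is only a restatement of the listing's logic under the same assumption the manual makes (a zero low byte means an 80287 answered); the function name is invented, and the actual SMSW/LMSW instructions are left out.

```c
#include <stdint.h>

#define MSW_MP 0x0002u  /* MP bit, set when an 80287 is present */
#define MSW_EM 0x0004u  /* EM bit, set to trap ESC opcodes for emulation */

/* Returns the Machine Status Word value the sample would load with LMSW. */
uint16_t msw_after_detection(uint16_t msw, uint16_t stored_status)
{
    if ((stored_status & 0x00FFu) == 0)
        return (uint16_t)(msw | MSW_MP);   /* low byte zero: FPU present */
    return (uint16_t)(msw | MSW_EM);       /* anything else: no FPU, emulate */
}
```

Starting from an MSW of FFF0h, a stored status word of 0000h yields FFF2h, and a stored FFFFh yields FFF4h.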

If there’s no FPU however… things get interesting. If one takes Intel’s 286/287 documentation literally, the detection can’t ever work because with no FPU, there are no valid instructions to execute (remember, ESC is not documented as a valid 286 instruction).

Obviously that’s not how it works in reality. The 286 is not entirely different from the 8086 and ESC is still a CPU instruction. The CPU can execute ESC instructions just fine, but if there’s no FPU, ESC is a no-op… but only mostly.

That’s why there’s that note about the data bus having to be driven high. If there’s no FPU to execute F(N)STSW, who would write to memory? On an 8086/8087 system, it is clear that the 8087 handles all writes. No 8087, no memory writes by ESC opcodes. But the 286/287 is different. Unlike the 8087, the 287 does not become a bus master in order to access memory. All memory accesses are performed by the 286 on behalf of the 287. This is obviously required for memory protection to work.

I don’t have a 286 on hand at the moment, but I do have a 386 system (Intel 80386DX-33) with no math co-processor plugged into the socket on the board. I can confirm that the FNSTSW m16 instruction does write to memory even if there is no FPU. On my system, it writes FFFFh. I cannot tell if that is what the CPU writes because there is no FPU, or (much more likely) that is the usual “unused” bus value which typically results when attempting to read from nonexistent memory or I/O ports.

Clearly, ESC opcodes are not just NOPs. The 80386 knows that FNSTSW m16 writes one word to memory, and writes it on behalf of the FPU. If the FPU is not there, the CPU still writes to memory.

Co-processor Segment Overrun

Let’s take a detour to examine one odd aspect of the x86 architecture which evolved with every early CPU generation.

The 286 Case

The 286/287 needed to solve a new problem that didn’t exist on the 8086, namely memory protection. The FPU must not be allowed to access memory past segment limits, just like the CPU is not allowed to (otherwise memory protection would go out of the window).

For every ESC instruction which accesses memory, the 286 knows where the access starts, but clearly not where it ends. Because the 286 does not know how big FPU instruction operands are, it needs the Processor Extension Segment Overrun interrupt, also known as Co-processor Segment Overrun interrupt (number 9). If the starting address is outside of segment limits, the 286 immediately triggers a General Protection Fault (interrupt number 13). But if the memory access is only partially outside of segment boundaries, the 286 won’t find out immediately.

I do not know exactly how it is implemented, but I suspect that the 286 keeps track of the segment base and limit that the most recent FPU instruction was accessing, and it also knows the starting address of the memory access. As the FPU accesses subsequent words of the memory operand, the 286 keeps checking if the access is within segment limits. If it is not, the dreaded Processor Extension Segment Overrun (Interrupt 9) occurs.
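My guess can be written down as a small model. To be clear, this is a speculative sketch of the suspected behavior, not a description of the actual 286 implementation: it assumes operands are transferred one word at a time, that the first word is checked up front, and that the limit is the last valid byte offset in the segment. All names are hypothetical.

```c
#include <stdint.h>

enum { XFER_OK = 0, INT_OVERRUN = 9, INT_GP = 13 };

/* Speculative model: which interrupt, if any, results from an FPU
 * memory operand of the given length (in bytes) starting at start. */
int check_286_fpu_operand(uint16_t start, uint16_t length, uint16_t limit)
{
    uint32_t first = start;
    uint32_t i;

    if (first + 1 > limit)            /* the first word does not fit at all */
        return INT_GP;

    for (i = 2; i < length; i += 2) {
        uint32_t last_byte = first + i + 1;   /* last byte of this word */
        if (last_byte > limit)
            return INT_OVERRUN;       /* discovered only mid-transfer */
    }
    return XFER_OK;
}
```

With a limit of 00FFh, an 8-byte operand at 0100h gives interrupt 13, the same operand at 00FCh gives interrupt 9, and at 00F8h it fits.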

Why dreaded? Because Interrupt 9 is one of the very few non-restartable exceptions. The 286 manual warns that the only FPU instruction which can be safely executed when Interrupt 9 occurs is FNINIT, which implies that the FPU state is lost. Because Interrupt 9 occurs asynchronously, it may even be triggered after a task switch, in the context of a task different from the one that initiated the faulting FPU instruction.

In any case, if Interrupt 9 occurs on a 286, the process which triggered it is effectively beyond salvation.

The 386 Case

On the 386, the Coprocessor Segment Overrun (no longer called Processor Extension Segment Overrun) still exists, but it takes real work to trigger. It only occurs “if the 80386 detects a page or segment violation while transferring the middle portion of a coprocessor operand to the NPX”. Emphasis on “middle”. In other words, the 386 knows exactly how long FPU instruction operands are, but there are edge cases it does not handle.

It is clear that the 386 validates the start and end of an FPU operand (remember, it can be up to 108 bytes long in the case of FSAVE!). There are pathological cases where the operand wraps around the addressing limit such that the starting and ending addresses are both valid, but one or more of the middle addresses is not. This can happen if the segment limit is slightly smaller than the wrap-around limit (e.g. addressing limit is FFFFH and segment limit is FFFDH), or pages are misaligned with respect to the segments such that there is a small “gap” at the start or end of the addressing limit which falls into an invalid page.
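The segment-limit variant of this edge case can be checked with a few lines of C. The sketch assumes a 16-bit addressing limit (offsets wrap at FFFFH, as in the example above) and ignores paging entirely; the function name is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* True only for the pathological case: the first and last bytes of the
 * operand are within the limit, but some byte in between is not. */
bool middle_only_violation(uint16_t start, uint16_t length, uint16_t limit)
{
    uint16_t end = (uint16_t)(start + length - 1);   /* wraps modulo 64K */
    uint16_t i;

    if (length < 3 || start > limit || end > limit)
        return false;            /* caught by the ordinary start/end checks */

    for (i = 1; i + 1 < length; i++) {
        uint16_t off = (uint16_t)(start + i);
        if (off > limit)
            return true;         /* only a middle byte is out of bounds */
    }
    return false;
}
```

With a segment limit of FFFDH, a 10-byte operand at FFFCH starts and ends at valid offsets (FFFCH and 0005H) but passes through FFFEH and FFFFH on the way, which is exactly the situation left for Interrupt 9.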

On the 80386, Interrupt 9 is similarly non-restartable and generally very bad news. However, an operating system can entirely avoid Interrupt 9 caused by page faults, and minimize the likelihood of triggering it by going past segment limits. In addition, because it requires addressing wrap-around, Interrupt 9 will never be triggered on a 386 by normal, reasonably written software.

The 486 Case

In the 80486, Intel simplified the Coprocessor Segment Overrun quite a lot: it no longer exists at all. This implies that the 486 must be capable of fully verifying a memory access before an FPU instruction starts performing its operation. Any protection violations trigger a General Protection Fault or a Page Fault, just like non-FPU instructions.

Clearly, the 486 must understand FPU instructions quite well. Then again, since the FPU is either built-in or entirely absent, that’s not too surprising.

What Does the 386 Know?

It is clear that the 386 knows much more than Intel lets on about 387 instructions.

The following 386 instructions write to memory in the absence of a 387: FSTSW, FSTCW, FSTENV, FSAVE.

The following 386 instructions do not write to memory in the absence of a 387: FIST, FST.

It is rather interesting what the FSTENV and FSAVE instructions do. The FSW/FCW/FTW as well as (in case of FSAVE) the FP registers are stored as all ones—clearly that is data which would come from the FPU, if it were there.

But even without an FPU, FSTENV and FSAVE store the FP instruction and data pointers! In other words, the 386, not the 387, tracks this information. Which, in retrospect, is how it has to be, for two reasons.

One reason is that FSTENV/FSAVE can store the pointers in four different formats—all combinations of 16-bit/32-bit and real/protected mode. While the 287 had the FSETPM instruction, on the 387 it’s a no-op. Yet the 386/387 knows which format to store the information in. If the 386 is in charge, that simplifies things quite a bit.
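As a rough illustration of what the format choice means for the size of the stored image, here is a small sketch. It only covers sizes: the environment is seven words or seven doublewords depending on operand size, and real versus protected mode changes how the pointers are laid out within that space, which is not modeled here. The function names are invented.

```c
#include <stdbool.h>
#include <stddef.h>

enum { FP_REG_BYTES = 10, FP_REG_COUNT = 8 };

/* Size in bytes of the environment stored by FSTENV. */
size_t fstenv_image_size(bool operand_size_32)
{
    return operand_size_32 ? 28 : 14;   /* seven dwords or seven words */
}

/* Size in bytes of the full state stored by FSAVE: the environment
 * followed by eight 80-bit registers. */
size_t fsave_image_size(bool operand_size_32)
{
    return fstenv_image_size(operand_size_32) + FP_REG_BYTES * FP_REG_COUNT;
}
```

The 32-bit FSAVE case works out to the 108 bytes mentioned earlier.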

The other reason is that the 80386 needed to be able to work with the 80287, a stopgap measure necessitated by the fact that the 387 wasn’t available for about two years after the 386 was released. If the 386 tracked the instruction and data pointers, it could work with a 287 which had no clue about 32-bit addressing.

It is clear that what started as a generic co-processor interface on the 8086 turned into a single-purpose FPU interface on the 80386, and to a lesser extent it must have been that way on the 80286 already.

Unsurprisingly, the 386 does even more. For example, attempting to execute an FLD instruction on an invalid address will fault in protected mode, even if no 387 is present. However, executing FST does not fault, presumably because the write never happens.

On the other hand, FNSTSW can trigger faults even with no FPU. That is unsurprising; as shown above, FNSTSW writes to memory regardless of whether an FPU is present or not.

It is clear that the 386 took over some of the responsibilities of the original 8087 BIU. The 386 has significant knowledge of FPU instructions. FPU control instructions are to some extent implemented by the 386, although the 387 still needs to supply or accept numeric data.

What Does a 486SX Know?

The Intel 486SX is a rather odd case for two reasons. It is the last mainstream processor without a built-in FPU, and unlike earlier CPU generations, it cannot have an FPU added (that is not the case with Cyrix 486S, which can work with an external add-on FPU).

Examining an AMD Am486SX2-66 (not known to be distinguishable from Intel parts in software), and later confirming with a genuine Intel 486SX, it is apparent that the 486SX behavior is not very different from a 386. Even though it cannot be equipped with an FPU, the CPU still does a lot of FPU-related work.

Like the 386, the 486SX writes to memory when FSTSW, FSTCW, FSTENV, or FSAVE is executed, tracks FP instruction/data pointers, and validates memory operands. It is very likely that the microcode is not vastly different between the 386 and 486.

Unlike a 386, the 486SX also reports protection faults on the FST instruction. This may be related to the fact that the 486 no longer generates Coprocessor Segment Overrun, which implies that memory accesses must be pre-checked and validation is not postponed until the FPU actually starts accessing memory.

Also unlike a 386, the FIST and FST instructions do write to memory on a 486SX.
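The observations so far can be collected into a small lookup. This merely encodes what I saw on my own 386 and 486SX test systems with no FPU; it is not documented behavior, and other vendors' parts may differ. The enum and function names are made up.

```c
#include <stdbool.h>

typedef enum { CPU_386, CPU_486SX } cpu_kind;
typedef enum {
    INSN_FSTSW, INSN_FSTCW, INSN_FSTENV, INSN_FSAVE, INSN_FIST, INSN_FST
} fp_store;

/* Does the instruction write to memory when no FPU is present?
 * Based on observed behavior only. */
bool writes_memory_without_fpu(cpu_kind cpu, fp_store insn)
{
    switch (insn) {
    case INSN_FSTSW: case INSN_FSTCW: case INSN_FSTENV: case INSN_FSAVE:
        return true;               /* control stores: written on both CPUs */
    case INSN_FIST: case INSN_FST:
        return cpu == CPU_486SX;   /* numeric stores: only the 486SX writes */
    }
    return false;
}
```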

One behavioral difference I found between an AMD Am486SX2-66 and an Intel i486SX (S-spec SX683) is that the former writes FP instruction/data pointers in FSTENV/FSAVE and the latter does not (only writes FFh words). Such differences are not surprising when one wades deep into undocumented behavior.

Other Vendors

My one system with an IBM 486BL2 processor behaves slightly differently. The behavior is generally similar to a 386, but the values written to memory do not have all bits set. On my test system, the high byte of each word was FFh, but the low byte was inconsistent, though never zero. Therefore, one cannot rely on e.g. FNSTSW to always write FFFFh to memory on systems with no FPU.

On the other hand, a Cyrix Cx486S seems to behave much like an AMD Am486SX2-66.

Safe FPU Detection

How to properly detect an FPU then, without running into problems on systems that don’t have one? Here’s one possible approach (16-bit, able to deal with 8086/8088):

check87 proc near
        push    bp
        mov     bp,sp                   ; establish stack frame
        xor     ax,ax                   ; initialize with known value
        push    ax
        fninit                          ; reset FPU
        fnstcw   word ptr [bp-2]        ; save FPU control word
        pop     ax                      ; move FCW into AX
        mov     al,0                    ; assume no FPU
        cmp     ah,3                    ; 00h or FFh if no FPU
        jnz     nox87
        mov     al,1                    ; indicate FPU present
nox87:  mov     ah,0                    ; clear AH
        mov     sp,bp                   ; clean up stack
        pop     bp
        ret
check87 endp

The key points are:

FNINIT (not FINIT) must be used because the FPU may be in an unknown state and a WAIT instruction may hang
Storage for the FPU control word must be initialized with a known value
FNSTCW must be used instead of FSTCW
After FNSTCW, no WAIT is needed for synchronization

On an FPU-less 8088/8086 system, FNSTCW will not write anything to memory, which is why the value on the stack must be initialized. On a 286 and later with no FPU, the FNSTCW instruction writes (usually) FFFFh to memory. If a real FPU is present, the actual FCW is stored and the high byte will be 03 after FNINIT.
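The final comparison in check87 is easy to express on its own. The sketch below takes the word found in memory after FNINIT/FNSTCW and applies the same test as the cmp ah,3 in the listing; the function name is hypothetical, and the inputs in the examples are the values discussed in this post.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the word left in memory looks like a control word stored
 * by a real FPU right after FNINIT (high byte 03h). */
bool fcw_indicates_fpu(uint16_t stored_word)
{
    return (stored_word >> 8) == 0x03;
}
```

A stored 037Fh passes; the pre-initialized 0000h (nothing written on an 8086), FFFFh (the typical floating bus value), and words with a high byte of FFh and an arbitrary low byte (as seen on the IBM 486BL2) all fail.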

Summary

While detecting the presence of an FPU is well understood, detecting its absence is much less obvious. It relies on CPU behavior which is effectively undocumented on 80286 and later processors. While the 8086 had a generic co-processor interface, the 286 and later have significant knowledge of x87 FPU instructions. That includes the 486SX, which cannot be equipped with an FPU. Even when there is no FPU present, FP instructions on the 80286 and later are far from no-ops and may behave in surprising ways.