486Tang – 486 on a credit-card-sized FPGA board

Original link: https://nand2mario.github.io/posts/2025/486tang_486_on_a_credit_card_size_fpga_board/

## 486Tang: Porting a 486 PC to the Sipeed Tang Console

The developer recently released 486Tang v0.1, a port of the ao486 MiSTer PC core to the Sipeed Tang Console 138K FPGA. This is the first time ao486 has run on a non-Altera FPGA. Key adjustments include switching to SDRAM as main memory (with a “double-pumping” technique to compensate for its 16-bit width) and using the SD card for SD-backed IDE, since the Tang lacks a high-speed MCU interface. A boot loader module was implemented to load the necessary PC components from the SD card. Bring-up relied heavily on Verilator for fast whole-system simulation, aided by debug hooks: Bochs BIOS debug output and targeted tracing of subsystems such as the sound card and IDE. Initial performance was modest, around the 25 MHz level, but optimizations (reset-tree fan-out reduction, simplified instruction fetch, and TLB changes) improved it by about 35%, reaching roughly 486SX-20 territory. The project highlights the complexity of the x86 architecture relative to simpler RISC designs, as well as the importance of clock-speed scaling in FPGA development.

## A 486 on an FPGA: A Retrocomputing Project

A developer has successfully implemented a 486 on a credit-card-sized FPGA board ([nand2mario.github.io](https://nand2mario.github.io)). The project uses DDR3 SDRAM for video, chosen because it is easier to pair with a soft CPU than FPM or EDO memory, even though it does not match the 486's original era. The current implementation uses 44% of the FPGA's LUTs and 59% of its BRAM. Discussion centered on memory choices, with some noting the clock-speed constraints that limit DDR3 in retro projects. Others explored alternative 486-compatible hardware, such as Via Eden boards or AMD's Élan/Geode processors, while acknowledging their greater complexity or discontinued status. The project sparked interest in recreating classic hardware with modern processes, and in the potential for small, low-power systems capable of running older games such as DOOM. It also referenced a humorous moment from Valve's Half-Life 2 E3 demo, playfully alluding to the system's capabilities.

Original article

Yesterday I released 486Tang v0.1 on GitHub. It’s a port of the ao486 MiSTer PC core to the Sipeed Tang Console 138K FPGA. I’ve been trying to get an x86 core running on the Tang for a while. As far as I know, this is the first time ao486 has been ported to a non-Altera FPGA. Here’s a short write‑up of the project.

486Tang Architecture

Every FPGA board is a little different. Porting a core means moving pieces around and rewiring things to fit. Here are the major components in 486Tang:

Compared to ao486 on MiSTer, there are a few major differences:

  1. Switching to SDRAM for main memory. The MiSTer core uses DDR3 as main memory. Obviously, at the time of the 80486, DDR didn’t exist, so SDRAM is a natural fit. I also wanted to dedicate DDR3 to the framebuffer; time‑multiplexing it would have been complicated. So SDRAM became the main memory and DDR3 the framebuffer. The SDRAM on Tang is 16‑bit wide while ao486 expects 32‑bit accesses, which would normally mean one 32‑bit word every two cycles. I mitigated this by running the SDRAM logic at 2× the system clock so a 32‑bit word can be read or written every CPU cycle (“double‑pumping” the memory); a sketch of this follows the list.

  2. SD‑backed IDE. On MiSTer, the core forwards IDE requests to the ARM HPS over a fast HPS‑FPGA link; the HPS then accesses a VHD image. Tang doesn’t have a comparable high‑speed MCU‑to‑FPGA interface—only a feeble UART—so I moved disk storage into the SD card and let the FPGA access it directly.

  3. Boot‑loading module. A PC needs several things to boot: BIOS, VGA BIOS, CMOS settings, and IDE IDENTIFY data (512 bytes). Since I didn’t rely on an MCU for disk data, I stored all of these in the first 128 KB of the SD card. A small boot loader module reads them into main memory and IDE, and then releases the CPU when everything is ready.
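
To make the double-pumping in item 1 concrete, here is a minimal sketch of the idea (invented signal names, not the 486Tang source): the SDRAM-side logic runs on a 2× clock and performs two 16-bit transfers per CPU cycle, so a full 32-bit word is available every CPU cycle. A real controller also has to handle SDRAM commands, refresh, bank timing, and CAS latency, all omitted here.

```verilog
// Illustrative sketch only: assemble a 32-bit CPU word from two 16-bit
// SDRAM transfers within one CPU cycle by clocking this logic at 2x the
// CPU clock. Refresh, bank/row management and CAS latency are omitted.
module sdram_double_pump (
    input  wire        clk2x,       // 2x CPU clock
    input  wire        cpu_phase,   // toggles each clk2x tick; marks the half-cycle
    input  wire        rd_req,      // CPU read request, held for one CPU cycle
    input  wire [15:0] sdram_dq,    // 16-bit data bus from SDRAM
    output reg  [31:0] cpu_rdata,   // assembled 32-bit word for the CPU
    output reg         cpu_rvalid
);
    reg [15:0] low_half;

    always @(posedge clk2x) begin
        cpu_rvalid <= 1'b0;
        if (rd_req && !cpu_phase)
            low_half <= sdram_dq;                // first half: capture low 16 bits
        else if (rd_req && cpu_phase) begin
            cpu_rdata  <= {sdram_dq, low_half};  // second half: high 16 bits
            cpu_rvalid <= 1'b1;                  // full word ready this CPU cycle
        end
    end
endmodule
```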

System bring-up with the help of a whole-system simulator

After restructuring the system, the main challenge was bringing it up to a DOS prompt. A 486 PC is complex—CPU and peripherals—more so than the game consoles I’ve worked on. The ao486 CPU alone is >25K lines of Verilog, versus a few K for older cores like M68K. Debugging on hardware was painful: GAO builds took 10+ minutes and there were many more signals to probe. Without a good plan, it would be unmanageable and bugs could take days to isolate—not viable for a hobby project.

My solution was Verilator for subsystem and whole‑system simulation. The codebase is relatively mature, so I skipped per‑module unit tests and focused on simulating subsystems like VGA and a full boot to DOS. Verilator is fast enough to reach a DOS prompt in a few minutes—an order of magnitude better if you factor in the complete waveforms you get in simulation. The trick, then, is surfacing useful progress and error signals. A few simple instrumentation hooks were enough for me:

  1. Bochs BIOS can print debug strings to port 0x8888 in debug builds. I intercept and print these (the yellow messages in the simulator). The same path exists on hardware—the CPU forwards them over UART—so BIOS issues show up immediately without waiting for a GAO build.

  2. Subsystem‑scoped tracing. For Sound Blaster, IDE, etc., I added --sound, --ide flags to trace I/O operations and key state changes. This is much faster than editing Verilog or using GAO. (A sketch of hooks 1 and 2 follows this list.)

  3. Bochs BIOS assembly listings are invaluable. I initially used a manual disassembly—old console habits—without symbols, which was painful. Rebuilding Bochs and using the official listings solved that.
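
For illustration, here is roughly what hooks 1 and 2 can look like in simulation-only RTL (invented signal names, not the project's actual code). Verilator supports $write, $display, and $test$plusargs; this sketch gates the IDE trace with a plusarg, whereas the real project parses flags like --ide in its C++ harness.

```verilog
// Simulation-only instrumentation sketch (invented names).
// Hook 1: print Bochs BIOS debug characters written to I/O port 0x8888.
// Hook 2: trace IDE command-block port writes, gated by a runtime plusarg
//         so tracing can be toggled without recompiling.
module io_trace (
    input wire        clk,
    input wire        io_write,   // one-cycle strobe for an I/O write
    input wire [15:0] io_addr,
    input wire [31:0] io_wdata
);
`ifndef SYNTHESIS
    always @(posedge clk) begin
        if (io_write && io_addr == 16'h8888)
            $write("%c", io_wdata[7:0]);                   // BIOS debug output

        if (io_write && io_addr >= 16'h01F0 && io_addr <= 16'h01F7
            && $test$plusargs("ide"))
            $display("[ide] write port %h = %h", io_addr, io_wdata);
    end
`endif
endmodule
```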

A lot of the bugs were in the new glue I added, as expected. ao486 itself is mature. Still, a few issues only showed up on this toolchain/hardware, mostly due to toolchain behavior differences. In one case a variable meant to be static behaved like an automatic variable and didn’t retain state across invocations, so a CE pulse never occurred. Buried deep, it took a while to find.
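
The general pitfall looks something like this contrived example (not the actual ao486 code): a block-local variable with an initializer that one tool treats as static and another as automatic. Declaring the lifetime explicitly, or moving the variable to module scope, removes the ambiguity.

```verilog
// Contrived illustration of the lifetime pitfall, not the real code.
// If cnt were treated as automatic, it would be re-initialized to 0 on
// every activation of the always block and ce would never pulse; as a
// static it counts across clock cycles as intended.
module ce_gen (
    input  wire clk,
    output reg  ce
);
    always @(posedge clk) begin : divide_by_4
        static reg [1:0] cnt = 2'd0;   // explicit lifetime avoids tool-dependent behavior
        cnt <= cnt + 2'd1;
        ce  <= (cnt == 2'd3);          // one clock-enable pulse every 4 clocks
    end
endmodule
```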

Here’s a simulation session. On the left is the simulated 486 screen; on the right is the simulator terminal output. You can see the green VGA output and yellow debug output, along with other events like INT 15h and video VSYNCs.

Performance optimizations

With simulation help, the core ran on Tang Console—just not fast. The Gowin GW5A isn’t a particularly fast FPGA. Initial benchmarks put it around a 25 MHz 80386.

The main obstacle to clock speed is long combinational paths. When you find a critical path, you either shorten it or pipeline it by inserting registers, and both risk introducing bugs. A solid test suite is essential; I used test386.asm to validate changes.

Here are a few concrete wins:

Reset tree and fan-out reduction. Gowin’s tools didn’t replicate resets aggressively enough (even with “Place → Replicate Resources”). One reset net had >5,000 fan-out, which ballooned delays. Manually replicating the reset and a few other high‑fan-out nets helped a lot.
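
One way to do the manual replication is sketched below (invented names, not the 486Tang source). The keep-style attribute prevents synthesis from merging the copies back into one register; the exact attribute name and syntax vary by toolchain.

```verilog
// Sketch of manual reset replication. Each registered copy of the global
// reset drives one subsystem, splitting a single huge fan-out net into a
// few smaller trees that are easier to route and time.
module reset_fanout (
    input  wire clk,
    input  wire rst_global,
    (* syn_keep = 1 *) output reg rst_cpu,     // drives the CPU core only
    (* syn_keep = 1 *) output reg rst_sdram,   // drives the SDRAM controller only
    (* syn_keep = 1 *) output reg rst_vga      // drives the VGA subsystem only
);
    always @(posedge clk) begin
        rst_cpu   <= rst_global;
        rst_sdram <= rst_global;
        rst_vga   <= rst_global;
    end
endmodule
```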

Instruction fetch optimization. A long combinational chain sat in the decode/fetch interface. In decoder_regs.v, the number of bytes the fetcher may accept was computed using the last decoded instruction’s length:

```verilog
reg [3:0] decoder_count;
assign acceptable_1     = 4'd12 - decoder_count + consume_count;
always @(posedge clk) begin
  ...
  decoder_count <= after_consume_count + accepted;
end
```

Here, 12 is the buffer size, decoder_count is the current occupancy, and consume_count is the length of the outgoing instruction. Reasonable, but computing consume_count (opcode, ModR/M, etc.) sat on the Fmax‑limiting path. Incidentally, this is one of x86's well-known headaches: variable-length instructions complicate decoding; another is the complex addressing modes and “effective address” calculation.

The fix was to drop the dependency on consume_count:

```verilog
assign acceptable_1    = 4'd12 - decoder_count;
```

This may cause the fetcher to “under‑fetch” for one cycle because the outgoing instruction’s space isn’t reclaimed immediately. But decoder_count updates next cycle, reclaiming the space. With a 12‑byte buffer, the CPI impact was negligible and Fmax improved measurably on this board.

TLB optimization. The Translation Lookaside Buffer (TLB) is a small cache that translates virtual to physical addresses. ao486 uses a 32‑entry fully‑associative TLB with a purely combinational read path—zero extra cycles, but a long path on every memory access (code and data).

DOS workloads barely stress the TLB; even many 386 extenders use a flat model. As a first step I converted the TLB to 4‑way set‑associative. That’s simpler and already slightly faster than fully‑associative for these workloads. There’s room to optimize further since the long combinational path rarely helps.
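
For reference, a 4-way set-associative lookup looks roughly like this (a sketch with invented names and widths, not the ao486 RTL). With 8 sets of 4 ways (32 entries total), the lookup path contains only 4 tag comparators instead of 32.

```verilog
// Minimal sketch of a 4-way set-associative TLB lookup. The set index
// comes from the low bits of the virtual page number; fill and
// replacement (done after a page-table walk) are omitted.
module tlb_4way (
    input  wire        clk,
    input  wire [19:0] vpn,     // virtual page number (address bits 31:12)
    output reg         hit,
    output reg  [19:0] ppn      // physical page number on a hit
);
    localparam SETS = 8;
    localparam IDX  = 3;        // log2(SETS)

    reg            valid [0:SETS-1][0:3];
    reg [19-IDX:0] tag   [0:SETS-1][0:3];   // upper VPN bits
    reg [19:0]     dat   [0:SETS-1][0:3];   // cached physical page numbers

    wire [IDX-1:0]  set_idx = vpn[IDX-1:0];
    wire [19-IDX:0] vtag    = vpn[19:IDX];

    integer w;
    always @* begin
        hit = 1'b0;
        ppn = 20'd0;
        for (w = 0; w < 4; w = w + 1)
            if (valid[set_idx][w] && tag[set_idx][w] == vtag) begin
                hit = 1'b1;
                ppn = dat[set_idx][w];
            end
    end
endmodule
```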

A rough v0.1 end‑to‑end result: about +35% per Landmark 6 benchmarks, reaching roughly 486SX‑20 territory.

Reflections

Here are a few reflections after the port:

Clock speed scaling. I appreciate the lure of the megahertz race now. Scaling the whole system clock was the most effective lever—more so than extra caches or deeper pipelines at this stage. Up to ~200–300 MHz, CPU, memory, and I/O can often scale together. After that, memory latency dominates, caches grow deeper, and once clock speeds stop increasing, multiprocessing takes over—the story of the 2000s.

x86 vs. ARM. Working with ao486 deepened my respect for x86’s complexity. John Crawford’s 1990 paper “The i486 CPU: Executing Instructions in One Clock Cycle” is a great read; it argues convincingly against scrapping x86 for a new RISC ISA given the software base (10K+ apps then). Compatibility was the right bet, but the baggage is real. By contrast, last year’s ARM7‑based GBATang felt refreshingly simple: fixed‑length 32‑bit instructions, saner addressing, and competitive performance. You can’t have your cake and eat it.


So there you have it—that’s 486Tang in v0.1. Thanks for reading, and see you next time.
