Z80 Sans – a disassembler in a font (2024)

Original link: https://github.com/nevesnunes/z80-sans

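## Z80 Sans: disassembly as a font

This project presents "Z80 Sans", an unusual font that disassembles Z80 machine code directly from hexadecimal input. It uses two advanced OpenType features, glyph substitution (GSUB) and glyph positioning (GPOS), to turn hexadecimal sequences into readable Z80 instructions. Building the font is an involved process because the Z80 instruction set has many variations (16-bit addresses, operand order, signed offsets). A custom script generates the glyphs and lookup rules, initially using `fontcustom` and `ImageMagick` (which requires specific Ruby and OpenSSL versions managed through RVM). The core logic relies on a recursive descent parser and on contextual chaining inside the font file (in .ttx format, edited directly). The challenges include handling the large number of possible instruction combinations and out-of-order operands. The solution encodes nibbles and displacements with distinct glyphs and uses lookaheads and ligatures to manage the complex cases. Although broadly successful, minor rendering glitches remain in some instructions. The author suggests that future work could benefit from exploring font shapers or from using FontForge's scripting support to modify features.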

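A new project called "Z80 Sans" has caught the attention of the Hacker News community: it embeds a Z80 disassembler in a font file, so the font itself can decode Z80 machine code. Users were impressed by how clever and whimsical the project is, calling it a smart "practical joke" that merges parsing, processing, and rendering into a single step. The discussion highlighted a trend of creatively abusing font formats; earlier examples include fonts containing Tetris, large language models, and even full games such as Zork embedded in PDFs. While fonts *can* use WebAssembly to implement more complex programs, the Z80 Sans approach was praised for its elegance and directness. Some commenters noted that simpler processors such as the 6502 or 8051 would be easier to represent this way. Overall, the project was hailed as a very clever and entertaining achievement.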

Original text

What's your favourite disassembler? Mine's a font:

[Video: 1.mp4]

This font converts sequences of hexadecimal lowercase characters into disassembled Z80 instructions, by making extensive use of OpenType's Glyph Substitution Table (GSUB) and Glyph Positioning Table (GPOS).

If you just want to try it out, a copy is available under ./test/z80-sans.ttf.

Tested on Debian GNU/Linux 12. Note that this Debian version ships with ruby version 3, while fontcustom was written for ruby version 2, and is incompatible with later versions (e.g. syntax errors). A ruby install also requires a compatible OpenSSL version. Therefore, RVM can be used to manage both ruby and a local install of OpenSSL.

apt install imagemagick potrace
pip install fonttools

git submodule update --init --recursive

# fontforge
(
cd ./modules/fontforge/
git checkout 4f4907d9541857b135bd0b361099e778325b4e28
git apply ../../resources/fontforge.diff
mkdir -p build
cd build
cmake -GNinja ..
ninja
ninja install
)

# woff2
(
cd ./modules/woff2/
make clean all
)

# fontcustom
rvm pkg install openssl
rvm install 2.4 --with-openssl-dir=$HOME/.rvm/usr
rvm use 2.4
gem update --system 3.3.22
(
export PATH=$PWD/modules/woff2/build:$PATH
cd ./modules/fontcustom/
git apply ../../resources/fontcustom.diff
gem build fontcustom.gemspec
gem install ./fontcustom-2.0.0.gem
)

cp ./resources/droid-sans-mono.ttf /tmp/base.ttf
./gen.py ./resources/instructions.json

The .ttf font file is copied to ~/.local/share/fonts/, which is used by e.g. LibreOffice.

Compared to other cursed fonts, Z80 Sans has these challenges:

- Multiple characters to render: it would be impractical to manually define all substitution rules character by character, so we can create glyphs that combine multiple literals (e.g. mnemonics like CALL); however, this also ties into the next point...
- Multiple combinations: recall that some Z80 instructions can take 16-bit addresses and registers as operands, which means that a single instruction can have up to 65536 * 7 = 458752 possible combinations;
- Out-of-order operands: e.g. registers and offsets can be encoded into hexadecimal bytes in one order, but disassembled in another order, which complicates backtracking/lookahead rules;
- Little-endian addresses: characters for the least-significant byte need to be rendered after the most-significant byte, even though they come first in the input;
- Signed offsets: all offsets in range 0x80..0xff need to be rendered as a negative two's-complement number;
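To make the last two points concrete, here is a minimal Python sketch of the plain arithmetic the lookup rules have to reproduce. The function names are hypothetical and are not taken from gen.py; the font itself does this with glyph substitutions rather than with code.

```python
def address_from_le(hex_digits: str) -> int:
    """Interpret 4 hex digits, low byte first, as a 16-bit address."""
    lo = int(hex_digits[0:2], 16)
    hi = int(hex_digits[2:4], 16)
    return (hi << 8) | lo


def signed_offset(hex_byte: str) -> int:
    """Interpret one byte as a two's-complement displacement."""
    value = int(hex_byte, 16)
    return value - 0x100 if value >= 0x80 else value


print(hex(address_from_le("3412")))      # 0x1234
print(signed_offset("7f"))               # 127
print(signed_offset("80"))               # -128
print(signed_offset("ff"))               # -1
```

The input `3412` has to be displayed as `1234`, which is why the least-significant byte cannot simply be substituted in the order it is typed.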

All of this invites a programmatic solution. While fontcustom and ImageMagick take care of generating glyphs, it seems that a convenient way to write lookup rules is the .fea format, but I didn't find a way to integrate it with fonttools' .ttx format (which is basically xml). I took the lowest common denominator approach of directly editing the .ttx of Noto Sans Mono (although glyph shapes are computed from Droid Sans Mono, as that's what I started with when patching FontForge).

A recursive descent parser is used to generate all possible glyphs, which helps with evaluating expressions in encodings (e.g. SET b,(IX+o) takes a bit and a displacement, encoded as expression DD CB o C6+8*b). These encodings were then expanded to all possible values that operands can take, before finally associating 1 or more hexadecimal bytes to each disassembly glyph required to render an expanded instruction.
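As an illustration of the expression evaluation step, the following is a small recursive descent evaluator for encodings such as `C6+8*b`. It is a sketch written for this explanation, not the parser from gen.py; it assumes that uppercase tokens are hexadecimal literals, that single lowercase letters are operand variables, and that only `+` and `*` occur.

```python
import re

# Uppercase hex literals, single lowercase variables, and the two operators.
TOKEN = re.compile(r"[0-9A-F]+|[a-z]|[+*]")


def evaluate(expr: str, env: dict) -> int:
    """Evaluate an encoding expression with the usual precedence of * over +."""
    tokens = TOKEN.findall(expr)
    pos = 0

    def factor() -> int:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        # Lowercase token: operand variable; otherwise a hex literal.
        return env[tok] if tok.islower() else int(tok, 16)

    def term() -> int:
        nonlocal pos
        value = factor()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            value *= factor()
        return value

    def total() -> int:
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            value += term()
        return value

    return total()


def expand_set_opcodes() -> dict:
    """Map each bit index b to the last opcode byte of SET b,(IX+o)."""
    return {b: f"{evaluate('C6+8*b', {'b': b}):02x}" for b in range(8)}


print(expand_set_opcodes())
# {0: 'c6', 1: 'ce', 2: 'd6', 3: 'de', 4: 'e6', 5: 'ee', 6: 'f6', 7: 'fe'}
```

Expanding every operand this way yields the concrete byte values that each disassembly glyph is then associated with.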

There are some nice references for OpenType features, but they are written at a high level, or in .fea(?) format:

It's never very clear how to translate them to .ttx, so in the end I just converted all of the Noto Sans family and used the good ol' fashioned bruteforce approach of "learning by example". This is even more fun than it sounds, thanks to plenty of silent failures when converting from .ttx to .ttf, where lookups will not match due to some assumptions not validated by fonttools (e.g. class definitions for contextual chaining substitutions must have at least one coverage glyph with class value="1").
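Since such failures are silent, a small lint pass over the .ttx file before conversion can help. The sketch below uses only the standard library and checks one reading of the constraint above: every class definition inside a chaining substitution should contain at least one entry with class "1". The element names (`ChainContextSubst`, `InputClassDef`, and so on) and the sample document are assumptions modelled on typical .ttx dumps, not taken from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed .ttx fragment used only for illustration.
SAMPLE_TTX = """
<ttFont>
  <GSUB>
    <ChainContextSubst index="0" Format="2">
      <InputClassDef>
        <ClassDef glyph="zero" class="2"/>
      </InputClassDef>
    </ChainContextSubst>
    <ChainContextSubst index="1" Format="2">
      <InputClassDef>
        <ClassDef glyph="one" class="1"/>
      </InputClassDef>
    </ChainContextSubst>
  </GSUB>
</ttFont>
"""

CLASS_DEF_TAGS = ("BacktrackClassDef", "InputClassDef", "LookAheadClassDef")


def class_defs_missing_class_one(ttx_xml: str) -> list:
    """Return (subtable index, tag) pairs whose class definitions lack class 1."""
    root = ET.fromstring(ttx_xml)
    suspicious = []
    for subtable in root.iter("ChainContextSubst"):
        for tag in CLASS_DEF_TAGS:
            for class_def in subtable.findall(tag):
                classes = {e.get("class") for e in class_def.findall("ClassDef")}
                if classes and "1" not in classes:
                    suspicious.append((subtable.get("index"), tag))
    return suspicious


print(class_defs_missing_class_one(SAMPLE_TTX))  # [('0', 'InputClassDef')]
```

A check like this only flags candidates; whether a flagged lookup actually fails to match still has to be confirmed by rendering the font.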

Most challenges were solved with contextual chaining rules. To handle addresses, each nibble in range 0..f was encoded with distinct glyphs, with spacing characters used to create multiple substitutions, one character at a time. Displacements also have additional signed variants. This gives us a total of (4 + 2) * 16 glyphs for numbers. This was already enough to keep the font file under the 65536 glyphs limit.

The worst part was of course out-of-order operands. However, due to the limited number of variations these have in instructions, they could be covered by the same strategy as instructions with ambiguously encoded prefixes, e.g.

["SET b,(IX+o)", "DD CB o C6+8*b"],
["SET b,(IY+o)", "FD CB o C6+8*b"],

Is covered by the same lookup rules as:

["SRA (IX+o)", "DD CB o 2E"],
["SRA (IY+o)", "FD CB o 2E"],
["SRL (IX+o)", "DD CB o 3E"],
["SRL (IY+o)", "FD CB o 3E"],

An interesting property in the Z80 ISA is that bits and registers have up to 8 variations, and these out-of-order cases only involve offsets and one of those specific operands. Therefore, we can encode bits or registers as literals. With sufficient lookaheads, we can match up to the last hexadecimal byte, and create dedicated lookups for each case. The last literals can be reduced by generating a ligature that matches the suffix glyph. The end result was dozens more generated lookups for these cases (which can likely be grouped to reduce this number).
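The effect of those lookaheads can be sketched in ordinary Python: the mnemonic and the bit index are only known once the last byte has been seen, yet they must be printed before the displacement that precedes them in the input. This is an illustrative decoder covering only the six encodings listed above, with a hypothetical function name and an assumed output format; the font achieves the same result with generated lookups and ligatures.

```python
def disassemble_indexed(hex_text: str) -> str:
    """Disassemble 'DD CB o XX' / 'FD CB o XX' for SET, SRA and SRL only."""
    data = bytes.fromhex(hex_text)
    if len(data) != 4 or data[0] not in (0xDD, 0xFD) or data[1] != 0xCB:
        raise ValueError("not a DD CB / FD CB instruction")
    index_reg = "IX" if data[0] == 0xDD else "IY"
    # Third byte: signed displacement (two's complement).
    offset = data[2] - 0x100 if data[2] >= 0x80 else data[2]
    operand = f"({index_reg}{offset:+d})"
    # Fourth byte: looked at last in the input, rendered first in the output.
    opcode = data[3]
    if opcode == 0x2E:
        return f"SRA {operand}"
    if opcode == 0x3E:
        return f"SRL {operand}"
    if (opcode & 0xC7) == 0xC6:
        bit = (opcode - 0xC6) // 8  # inverse of C6+8*b
        return f"SET {bit},{operand}"
    raise ValueError("opcode not covered by this sketch")


print(disassemble_indexed("ddcb05ce"))  # SET 1,(IX+5)
print(disassemble_indexed("fdcbff3e"))  # SRL (IY-1)
```

Because the bit index has only 8 possible values, each one can be handled as a literal with its own lookup, which is what keeps the number of generated rules manageable.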

While all of the original instruction set should be disassembled, some instructions have minor glitches:
- LD (IX+o),r is rendered as LD (IX+o r),;
- SET b,(IX+o) is rendered as SET b,(IX+o));
"CTF quality" code 😅;

FontForge supports scriptable modification of features using commands GenerateFeatureFile() and MergeFeature() (briefly covered in The Terrible Secret of OpenType Glyph Substitution - Ansuz - mskala's home page). I only became aware of this after making the .ttx based implementation, but it could potentially have avoided messing with .ttx files.

For more complex instruction sets, an alternative approach that seems to have fewer constraints is to use font shapers. Some examples:
