I put a real-time 3D shader on the Game Boy Color

Original link: https://blog.otterstack.com/posts/202512-gbshader/

## Real-Time Rendering on the Game Boy Color: A Deep Dive

This project demonstrates real-time image rendering on the Game Boy Color, using clever techniques to overcome the hardware's limits. The developer built a shader in which a light source orbits a rotating 3D object, driven by normal maps. The 3D models were designed in Blender, using cryptomattes for precise color control and a custom-shader pipeline to generate the normal maps. The core of the technique is computing Lambert shading efficiently on the limited Game Boy hardware: converting the dot product into spherical coordinates makes the computation dramatically faster. To optimize further, the code uses fixed-point math with 8-bit fractions and logarithmic lookup tables to work around the Game Boy's lack of multiply and floating-point support. Self-modifying code shaves additional cycles off the critical computation. The developer experimented with AI-assisted code generation (Python scripts, small assembly snippets) and found it useful for minor tasks, but due to accuracy and performance problems ultimately relied on hand-written, optimized assembly for the core shader. The final demo pushes the Game Boy Color to its limits, with roughly 89% of CPU time spent rendering. [Project GitHub](https://github.com/nukep/gbshader) & [YouTube video](https://www.youtube.com/watch?v=SAQXEW3ePwo)


Demonstration

I made a Game Boy Color game that renders images in real time. The player controls an orbiting light and spins an object.

Play it here

Check out the code, download the ROMs

https://github.com/nukep/gbshader

3D Workflow

Early lookdev

Before really diving into this project, I experimented with the look in Blender to see if it would even look good. IMO it did, so I went ahead with it!

I experimented with a "pseudo-dither" on the Blender monkey by adding a small random vector to each normal.

Blender to normal map workflow

tl;dr: Cryptomattes and custom shaders to adjust normal maps

It doesn't really matter what software I used to produce the normal maps. Blender was the path of least resistance for me, so I chose that.

For the teapot, I simply put in a teapot, rotated a camera around it, and exported the normal AOV as a PNG sequence. Pretty straightforward.

For the spinning Game Boy Color, I wanted to ensure that certain colors were solid, so I used cryptomattes in the compositor to identify specific geometry and output hard-coded values in the output.

The geometry in the screen was done by rendering a separate scene, then compositing it in the final render using a cryptomatte for the screen.

The Math

Normal Maps

The above animations are normal map frames, which are used to solve for the value of each pixel.

Normal maps are a core concept of this project. They're already used everywhere in 3D graphics.

And indeed, normal map images are secretly a vector field. The reason normal maps tend to have a blue-ish baseline color is that everyone likes to associate XYZ with RGB, and +Z is the forward vector by convention.

In a typical 3D workflow, a normal map is used to encode the normal vector at any given point on a textured mesh.

Source: Own work (Danny Spencer). Suzanne model (c) Blender Foundation.
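The RGB-to-vector convention can be sketched in a couple of lines of Python. This is an illustration of the convention described above, not code from the project, and `decode_normal` is a made-up name:

```python
def decode_normal(r, g, b):
    """Map an 8-bit RGB normal-map pixel to a normal vector.

    Uses the common n = rgb / 255 * 2 - 1 convention, so the blue-ish
    baseline color (128, 128, 255) decodes to roughly (0, 0, 1):
    the +Z forward vector mentioned above.
    """
    return tuple(c / 255.0 * 2.0 - 1.0 for c in (r, g, b))
```

A pure-blue-ish pixel decodes to the forward vector, which is why flat regions of a normal map look blue.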

Calculating a Lambert shader using dot products

The simplest way to shade a 3D object is using the dot product:

$$v = \mathbf{N} \cdot \mathbf{L}$$

where $\mathbf{N}$ is the normal vector, and $\mathbf{L}$ is the light position when it points towards the origin (or equivalently: the negative light direction).

Expanded out component-wise, this is:

$$v = \mathbf{N}_x\mathbf{L}_x + \mathbf{N}_y \mathbf{L}_y + \mathbf{N}_z\mathbf{L}_z$$

When the light vector is constant for all pixels, it models what most 3D graphics software calls a "distant light", or a "sun light".
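As a concrete reference, the component-wise dot product above fits in a tiny Python function. This is an illustrative sketch, not the project's code; clamping negative values to zero is my assumption (the usual convention for surfaces facing away from the light):

```python
def lambert(normal, light):
    """Lambert shading: dot product of a unit normal and a unit light
    vector (the light vector points from the surface toward the light).

    Negative results mean the surface faces away from the light, so
    they're clamped to 0 here (an assumption; conventions vary).
    """
    nx, ny, nz = normal
    lx, ly, lz = light
    return max(0.0, nx * lx + ny * ly + nz * lz)
```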

Spherical Coordinates

To speed up computation on the Game Boy, I use an alternate version of the dot product, using spherical coordinates.

A spherical coordinate is a point represented by a radius $r$, a primary angle $\theta$ ("theta"), and a secondary angle $\varphi$ ("phi"). This is represented as a tuple: $(r, \theta, \varphi)$

The dot product of two spherical coordinates:

$$(r_1, \theta_1, \varphi_1) \cdot (r_2, \theta_2, \varphi_2) = r_1 r_2 (\sin \theta_1 \sin \theta_2 \cos(\varphi_1 - \varphi_2) + \cos \theta_1 \cos \theta_2)$$

Because all normal vectors are unit length, and the light vector is unit length, we can just assume the radius $r$ is equal to 1. This simplifies to:

$$v = \sin \theta_1 \sin \theta_2 \cos(\varphi_1 - \varphi_2) + \cos \theta_1 \cos \theta_2$$

And using the previous variable names, we get the formula:

$$v = \sin N_\theta \sin L_\theta \cos(N_\varphi - L_\varphi) + \cos N_\theta \cos L_\theta$$
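To sanity-check the identity, here's a small Python sketch (the function names are mine) that converts Cartesian vectors to spherical coordinates and evaluates the spherical-form dot product; for unit vectors it matches the Cartesian dot product:

```python
import math

def to_spherical(x, y, z):
    """Cartesian -> (r, theta, phi), with theta measured from +Z and
    phi the angle in the X-Y plane."""
    r = math.sqrt(x * x + y * y + z * z)
    return r, math.acos(z / r), math.atan2(y, x)

def spherical_dot(theta1, phi1, theta2, phi2):
    """Dot product of two unit vectors given by their spherical angles."""
    return (math.sin(theta1) * math.sin(theta2) * math.cos(phi1 - phi2)
            + math.cos(theta1) * math.cos(theta2))
```

For example, the unit vectors `(0.6, 0, 0.8)` and `(0, 0.6, 0.8)` have a Cartesian dot product of 0.64, and the spherical form agrees.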

Making it work on the Game Boy

Encoding normal maps in the Game Boy ROM

In the ROM, I decided to fix $L_\theta$ to a constant.

This means that we can extract constant coefficients $m$ and $b$ and rewrite the formula:

$$\begin{aligned} m &= \sin N_\theta \sin L_\theta \\ b &= \cos N_\theta \cos L_\theta \\ v &= m \cos(N_\varphi - L_\varphi) + b \end{aligned}$$

The ROM encodes each pixel as a 3-byte tuple of $(N_\varphi, \log(m), b)$

Why $\log(m)$? Well...

The Game Boy has no multiply instruction

Not only does the SM83 CPU not support multiplication, but it also doesn't support floats. That's a real bummer.

We have to get really creative when the entire mathematical foundation of this project involves multiplying non-integer numbers.

What do we do instead? We use logarithms and lookup tables!

Logarithms have this nice property of letting us factor products out of the $\log$. This way, we can add values instead!

$$\begin{aligned} \log_b(x \cdot y) &= \log_b(x) + \log_b(y) \\ x \cdot y &= b^{\log_b(x) + \log_b(y)} \end{aligned}$$

This requires two lookups: a log lookup, and a pow lookup.

In pseudocode, multiplying 0.3 and 0.5 looks like this:

pow = [ ... ]

x = float_to_logspace(0.3)
y = float_to_logspace(0.5)

result = pow[x + y]

One limitation of this is that it's not possible to take the log of a negative number. e.g. $\log(-1)$ has no real solution.

We can overcome this by encoding a "sign" bit in the MSB of the log-space value. When adding two log-space values together, the sign bits are effectively XOR'd (toggled). We just need to ensure the remaining bits don't overflow into the sign bit, which we guarantee by keeping them small enough.

The pow lookup accounts for this bit and returns a positive or negative result based on it.
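Here's a Python sketch of the sign-bit trick (my own illustration, with invented names `log_decode` and `log_multiply`). It assumes the $2^{-k/6}$ magnitude encoding tabulated in the next section, with the sign in bit 7:

```python
SIGN = 0x80  # bit 7 holds the sign of the log-space value

def log_decode(b):
    """Log-space byte -> real value (used here only to check the trick)."""
    sign = -1.0 if b & SIGN else 1.0
    return sign * 2.0 ** (-(b & 0x7F) / 6.0)

def log_multiply(a, b):
    """Multiply two log-space values with a single 8-bit addition.

    Adding the bytes XORs the sign bits (1 + 1 carries out of the byte
    and is discarded), provided the low 7 bits don't carry into bit 7 --
    which is why the magnitudes must be kept small enough."""
    assert (a & 0x7F) + (b & 0x7F) < 0x80, "magnitude carried into sign bit"
    return (a + b) & 0xFF
```

For example, `log_multiply(0x81, 0x02)` is `0x83`: $(-2^{-1/6}) \cdot 2^{-2/6} = -2^{-3/6}$, and two negative inputs cancel to a positive result.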

All scalars and lookups are 8-bit fractions

It's advantageous to restrict numbers to a single byte, for both run-time performance and ROM size. 8-bit fractions are pretty extreme by today's standards, but believe it or not, it works. It's lossy as hell, but it works!

All scalars we're working with are between -1.0 and +1.0.

| Byte | Resolved linear-space value | Resolved log-space value |
| --- | --- | --- |
| 0 | ${0 \over 127} = 0$ | $2^{0} = 1$ |
| 1 | ${1 \over 127} \approx 0.0079$ | $2^{-{1 \over 6}} \approx 0.89$ |
| 2 | ${2 \over 127} \approx 0.0158$ | $2^{-{2 \over 6}} \approx 0.79$ |
| ... | ... | ... |
| 126 | ${126 \over 127} \approx 0.9921$ | $2^{-{126 \over 6}} \approx 0$ |
| 127 | ${127 \over 127} = 1$ | $2^{-{127 \over 6}} \approx 0$ |
| 128 | undefined | $-2^{0} = -1$ |
| 129 | $-{127 \over 127} = -1$ | $-2^{-{1 \over 6}} \approx -0.89$ |
| 130 | $-{126 \over 127} \approx -0.9921$ | $-2^{-{2 \over 6}} \approx -0.79$ |
| ... | ... | ... |
| 254 | $-{2 \over 127} \approx -0.0158$ | $-2^{-{126 \over 6}} \approx -0$ |
| 255 | $-{1 \over 127} \approx -0.0079$ | $-2^{-{127 \over 6}} \approx -0$ |

Addition and multiplication both use... addition!

Consider adding the two bytes: 5 + 10 = 15

  • Addition uses linear-space values: ${5 \over 127} + {10 \over 127} = {15 \over 127}$
  • Multiplication uses log-space values: $2^{-{5 \over 6}} \cdot 2^{-{10 \over 6}} = 2^{-{15 \over 6}}$

Why is the denominator 127 instead of 128? It's because I needed to represent both positive and negative 1. In a two's-complement encoding, signed positive 128 doesn't exist.

You might notice that the log-space values cycle and become negative at byte 128. The log-space values use bit 7 of the byte to encode the "sign" bit. As mentioned in the previous section, this is important for toggling the sign during multiplication.

The log-space values also use $2^{1 \over 6}$ as the base, so a byte with magnitude $k$ decodes to $(2^{1 \over 6})^{-k} = 2^{-{k \over 6}}$.

The lookup tables look like this:

Where:

  • $\text{encode}(y)$ takes a real number and returns an unsigned byte.
  • $\text{decode}(x)$ takes an unsigned byte and returns a real number. And:
  • $\text{encode}(y) = \text{round}(127y) \bmod 256$
  • $\text{decode}(x) = {\text{signedbyte}(x) \over 127}$
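The encode/decode pair translates directly into Python (a sketch mirroring the formulas above; `signedbyte` becomes an explicit two's-complement conversion):

```python
def encode(y):
    """Real number in [-1, 1] -> unsigned byte (two's complement)."""
    return round(127 * y) % 256

def decode(x):
    """Unsigned byte -> real number, via a signed two's-complement view."""
    signed = x - 256 if x >= 128 else x
    return signed / 127
```

As the table shows, `encode(1.0)` gives 127 and `encode(-1.0)` gives 129; byte 128 stays undefined on the linear side.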

Reconstructed functions look like this. The precision error is shown in the jagged "staircase" patterns:

It may look like there's a lot of error, but it's fast and it's passable enough to look alright! ;)

What's with cos_log?

It's basically a combined $\log(\cos x)$. This exists because in practice, cosine is always used with a multiplication.

The core calculation for the shader is:

$$v = m \cos(N_\varphi - L_\varphi) + b$$

And we can rewrite it as:

$$v = \text{pow}(m_{\log} + {\cos}_{\log}(N_\varphi - L_\varphi)) + b$$

This amounts to, per-pixel:

  • 1 subtraction
  • 1 lookup to cos_log
  • 1 addition
  • 1 lookup to pow
  • 1 addition

For a total of, per-pixel:

  • 3 additions/subtractions
  • 2 lookups
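Putting the pieces together, the per-pixel work can be modeled in Python. This is an illustrative reimplementation, not the ROM's code; the table construction, the 256-step byte angle, and the magnitude clamp are my assumptions:

```python
import math

def to_logspace(y):
    """Real in [-1, 1] -> log-space byte: sign in bit 7, magnitude k in
    the low bits, decoding to +/- 2^(-k/6). k is clamped to 63 here so
    two magnitudes can always be added without carrying into the sign
    bit (a simplification for this sketch)."""
    sign = 0x80 if y < 0 else 0x00
    y = abs(y)
    if y < 2 ** (-63 / 6):
        return sign | 63  # effectively zero
    return sign | min(63, max(0, round(-6 * math.log2(y))))

def pow_entry(b):
    """The pow lookup: log-space byte -> linear-space unsigned byte."""
    sign = -1 if b & 0x80 else 1
    return round(127 * sign * 2 ** (-(b & 0x7F) / 6)) % 256

POW = [pow_entry(b) for b in range(256)]
# cos_log: combined log(cos x), indexed by a byte angle (256 steps/turn)
COS_LOG = [to_logspace(math.cos(2 * math.pi * d / 256)) for d in range(256)]

def shade(n_phi, m_log, b, l_phi):
    """Per-pixel shader: 3 additions/subtractions and 2 lookups."""
    d = (n_phi - l_phi) % 256        # 1 subtraction (byte angle)
    s = (m_log + COS_LOG[d]) % 256   # 1 addition: log-space multiply
    v = POW[s]                       # 1 lookup: back to linear space
    return (v + b) % 256             # 1 addition: add b in linear space
```

With $m = 1$, $b = 0$, and both $\varphi$ angles equal, `shade` returns 127 (linear-space 1.0), just as the formula predicts.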

How fast is it?

The procedure processes 15 tiles per frame. It can process more if some of the tile's rows are empty (all 0), but it's guaranteed to process at least 15.

Figure: Mesen's "Event Viewer" window, showing a dot for each iteration (tile row) of the shader's critical loop.

There's some intentional visual tearing as well. The image itself is more than 15 tiles, so the ROM actually switches to rendering different portions of the image for each frame. The tearing is less noticeable because of ghosting on the LCD display, so I thought it was acceptable.

A pixel takes about 130 cycles, and an empty row's pixel takes about 3 cycles.

At one point I had calculated 15 tiles rendering at exactly 123,972 cycles, including the call and branch overhead. This is now an overestimate, because I've since added an optimization for empty rows.

The Game Boy Color's CPU runs up to 8.388608 MHz, or roughly 139,810 T-cycles per frame (1/60 of a second).

$${123972 \over 139810} \approx 89\%$$

Self-modifying code

Figure: A hex representation of the shader subroutine instructions in RAM. The blue digits show a patch to change sub a, 0 into sub a, 8.

The core shader subroutine contains a hot path that processes about 960 pixels per frame. It's really important to make this as fast as possible!

Self-modifying code is a super-effective way to make code fast. But most modern developers don't do this anymore, and there are good reasons: It's difficult, rarely portable, and it's hard to do it right without introducing serious security vulnerabilities. Modern developers are spoiled by an abundance of processing power, super-scalar processors that take optimal paths, and modern JIT (Just-In-Time) runtimes that generate code on the fly. But we're on the Game Boy, baybeee, so we don't have those options.

If you're a developer who uses higher-level languages like Python and JavaScript, the closest equivalent to self-modifying code is eval(). Think about how nervous eval() makes you feel. That's almost exactly how native developers feel about modifying instructions.

On the Game Boy's SM83 processor, it's faster to add and subtract by a hard-coded number than it is to load that number from memory.

i.e. x += 5 is faster than x += variable.

unsigned char Ltheta = 8;

/* Slower: subtract a value loaded from a variable */
v = (*in++) - Ltheta;

/* Faster: subtract a hard-coded constant */
v = (*in++) - 8;

In SM83 assembly, this looks like:

; Slower: 28 cycles
ld a, [Ltheta]   ; 12 cycles: Read variable "Ltheta" from HRAM
ld b, a          ; 4 cycles:  Move value to B register
ld a, [hl+]      ; 8 cycles:  Read from the HL pointer
sub a, b         ; 4 cycles:  A = A - B

; Faster: 16 cycles
ld a, [hl+]      ; 8 cycles: Read from the HL pointer
sub a, 8         ; 8 cycles: A = A - 8

The faster way shaves off 12 cycles. If we're rendering 960 pixels, this saves a total of 11,520 cycles. This doesn't sound like a lot, but it's roughly 10% of the shader's runtime!

So how can we get the faster subtraction if the value we're subtracting with changes?

By modifying the instruction operand!

2A      ld a, [hl+]
D6 08   sub a, 8

An overall failed attempt at using AI

"AI Will Be Writing 90% of Code in 3 to 6 Months"
— Dario Amodei, CEO of Anthropic (March 2025 - 9 months ago as of writing)

95% of this project was made by hand. Large language models struggle to write Game Boy assembly. I don't blame them.

Update: 2026-02-03: I attempted to use AI to try out the process, mostly because 1) the industry won't shut up about AI, and 2) I wanted a grounded opinion of it for novel projects, so I have a concrete and personal reference point when talking about it in the wild. At the end of the day, this is still a hobbyist project, so AI really isn't the point! But still...

I believe in disclosing all attempts or actual uses of generative AI output, because I think it's unethical to deceive people about the process of your work. Not doing so undermines trust, and amounts to disinformation or plagiarism. Disclosure also invites people who have disagreements to engage with the work, which they should be able to. I'm open to feedback, btw.

I'll probably write something about my experiences with AI in the future.

As far as disclosures go, I used AI for:

  1. Python: Reading OpenEXR layers, as part of a conversion script to read normal map data
  2. Python/Blender: Some Python scripts for populating Blender scenes, to demo the process in Blender
  3. SM83 assembly: Snippets for Game Boy Color features like double-speed and VRAM DMA. Unsurprising, because these are likely available somewhere else.

I attempted - and failed - to use AI for:

  1. SM83 assembly: (Unused) Generating an initial revision of the shader code

I'll also choose to disclose what I did NOT use AI for:

  1. Writing this article
  2. The algorithms, lookups, all other SM83 assembly
  3. 3D assets
  4. The soul 🌟 (AI techbros are groaning right now)

I tried to make AI write Game Boy assembly

Just to see what it would do, I fed pseudocode into Claude Sonnet 4 (the industry claims that it's the best AI model for coding in 2025), and got it to generate SM83 assembly:

https://claude.ai/share/846cb7d4-e4a6-40ab-8aaa-6e4c308e3da3

It was an interesting process. To start, I chewed Claude's food and gave it pseudocode, because I had a data format in mind, and I assumed it'd struggle with a higher-level description.

I was skeptical that it would do well, but it did better than I thought. It even produced code that worked when I persisted and guided it enough. However, it wasn't very fast, and it made some initial mistakes by assuming the SM83 processor was the Z80 processor. I attempted to get Claude to optimize it by offering suggestions. It did well initially, but it kept introducing errors until I reached the conversation limit.

After that point, I manually rewrote everything. My final implementation is aggressively optimized and barely has any resemblance to Claude's take.

And it loved telling me how "absolutely right" I always was. 🥺

It was better for small tasks and snippets of code. The tile demo in my video was partially AI scripted. A Game Boy subroutine for copying to VRAM was authored by AI. Few issues there.

An early iteration of the normal map conversion script accepted OpenEXR files. I didn't feel like drudging through a new library, so I asked ChatGPT to convert an OpenEXR file to a numpy array. It did pretty well! However, it also introduced a very subtle bug that I didn't catch for weeks. Once I finally read the code, I realized it was sorting channel names alphabetically (so XYZ sorts as XYZ, but RGB sorts as BGR). It's the sort of error I'd never make myself.

Update: 2026-02-03 - Yeah, so the OpenEXR code could've been done in two lines this whole time. One of the first examples in the official PyPI readme shows how to get a numpy array from an OpenEXR file - exactly what I needed. I could update this snippet for different channels too in theory, but basically it's this. ChatGPT gave me 30 lines to handle edge cases that simply won't happen.

import OpenEXR

with OpenEXR.File("readme.exr") as infile:
    RGB = infile.channels()["RGB"].pixels

At this point, I can't emphasize the word verifiable enough.

This, and other experiences, made me realize how easy it is to let your guard down when using AI like this, even if you're an experienced coder. AI can be helpful, but discretion is very much a required skill. I'm just thankful I never relied on it for installing hallucinated packages.

If you like this, share this post or like and comment on the YouTube video!

https://www.youtube.com/watch?v=SAQXEW3ePwo

(post will be updated once I post on Bluesky)
