The challenges of porting Shufflepuck Cafe to the 8-bit Apple II

Original link: https://www.colino.net/wordpress/archives/2026/02/23/the-challenges-of-porting-shufflepuck-cafe-to-the-8-bits-apple-ii/

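## Porting Shufflepuck Cafe to the Apple II

This article details how the 1989 game *Shufflepuck Cafe* was successfully ported to the 8-bit Apple II platform. The author, who initially had no experience with sprite handling on the Apple II, started with a simpler port of *Glider* to build a foundation. The main challenges were displaying the pseudo-3D table, optimizing performance on a 1MHz processor, and managing limited memory (64KB). The “3D” effect is achieved with a perspective transformation and lookup tables that speed up the calculations. Sprite scaling is handled by pre-rendering several versions of each sprite. Performance was improved by drawing sprites with exclusive-OR operations instead of masks, and by trading memory for speed. Sound was optimized with a “slowdown” technique that reduces sample size while keeping the effects clear. Memory limits were addressed by loading opponent-specific code and assets on demand, using compression, and carefully managing the memory map. A two-player serial mode was also implemented, allowing networked play. Although some features of the original game are missing, the author considers the port a success that captures the essence of *Shufflepuck Cafe* and adds multiplayer. The project's source code and a downloadable game are available online.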


This post originally appeared in the June 2025 issue of Juiced.GS, and has been expanded with more details.

I am very proud to have succeeded in porting Shufflepuck Cafe to the 8-bit Apple II, bringing a very dynamic 1989 game to a 1979 platform without losing playability or detail. In this article, I will share the challenges that made me pause, and how I solved them. If you are interested in the game itself, please head over to the Shufflepuck Cafe for Apple II project page.

1 : Displaying sprites and moving them

The first challenge was that I didn’t even know how to cleanly display a sprite, handle the mouse, etc. I solved this one by stepping down and… starting with a Glider port, as this game is technically much less complicated than Shufflepuck. It took me a month of very late evenings and hyper-focused kid-free weekends, but I then had a much better idea of how to write a game, some foundation to build on, and a cool port of Glider.

In the process, I also learned more things that would prove useful for Shufflepuck Cafe: how to easily play sounds, and how to fit a lot of data both on a 140kB floppy disk and in memory. I already wrote things that could play sounds, but in these previous projects¹, the sound samples came over the serial port. Instead of manually rewriting my player, this time I did things well and wrote a player generator, which, as I figured later during Shufflepuck’s development, made things far easier. I’ll come back to those subjects later.

2 : A 3D table?

As you probably know, Shufflepuck is a “3D” game.

Of course, in reality, Shufflepuck is not a 3D game. The table is a background, and sprites (two pushers and one puck) are displayed over it. The only thing required to make it look 3D is some coordinate transformation, and some scaling.

2a: The coordinates

For the coordinates, viewed from “inside” the code, the table is a 255 pixels wide by 192 pixels high rectangle. I realize now that the rectangle could have been 255 pixels high, but this is not a problem as-is.

So, after lots of theory reading, I figured that what I wanted to do is a one-point perspective transformation.

I drew over the background and figured out the vanishing point and other key points using a temporary layer:

For a given X on the rectangle, the “graphical” X (gx) on the screen will be X multiplied by x_factor% and shifted by x_shift pixels, where the factor and the shift depend on Y: gx = x * x_factor(y) + x_shift(y). Similarly, gy will be Y multiplied by y_factor%: as the puck moves toward the back, its graphical Y is “flattened” compared to the geometric one; otherwise, it would seem to accelerate the further away it is.

The magic values come from the points of interest: V (the vanishing point), F (the back of the table) and M (the front of the table). For a given x,y on the rectangle, we will have:

depth(y) = Fy + My - y
y_divisor(y) = depth(y)/((My-Vy)/Fy)
x_factor(y) = Fy/depth(y)
x_shift(y) = (-Vx * Fy)/depth(y) +Vx + Mx
gx = x * x_factor(y) + x_shift(y)
gy = My - (((My - y) * Fy) / y_divisor(y))
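
As a rough illustration, the formulas above can be transcribed directly into a few lines of Python. This is only a sketch: the coordinates chosen for V, F and M below are made-up placeholders (the real ones come from the background image and are not given here), and `transform_xy` is a floating-point stand-in for the real routine, not the 6502 version.

```python
# Hypothetical screen coordinates for the key points; the real values
# come from the background image and are not given in the text.
VX, VY = 140, 20    # V: vanishing point (assumed)
FY = 60             # F: back of the table (assumed)
MX, MY = 12, 191    # M: front of the table (assumed)

def depth(y):
    return FY + MY - y

def x_factor(y):
    return FY / depth(y)

def x_shift(y):
    return (-VX * FY) / depth(y) + VX + MX

def y_divisor(y):
    return depth(y) / ((MY - VY) / FY)

def transform_xy(x, y):
    """Map table coordinates (x, y) to screen coordinates (gx, gy)."""
    gx = x * x_factor(y) + x_shift(y)
    gy = MY - ((MY - y) * FY) / y_divisor(y)
    return gx, gy
```

With these definitions, a point on the front edge (y = My) keeps its scale and is only shifted by Mx, while points further back are progressively squeezed toward the vanishing point.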

This was of course tested and iterated over and over using a simple SDL proof of concept:

The table with a set of calculated gx,gy every 10 points

Of course, this kind of computation is not something one wants to do in 6502 assembly when speed matters, so I wrote a lookup table generator, which created three tables for me to quickly find gX,gY given X,Y. These three tables required one multiplication and one division (none of which the 6502 can do natively, so they are expensive) to compute gX, as the X factor depends on Y and I couldn’t make a 255*192 lookup table, of course (48kB, haha). I got rid of the division: instead of storing a percentage, I store a “per256tage”, which replaces the division with a single instruction (moving the high byte of the multiplication result to the low byte). I am proud of that _transform_xy function that trades size for speed. It takes 612 bytes (of which 580 are the lookup tables) and executes in 138 cycles.
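
The “per256tage” idea can be sketched as follows. The helper names and the sample factors are assumptions made for illustration, and so is the clamping of a factor of 1.0 to 255; the real table generator may handle that case differently.

```python
def to_per256(factor):
    """Store a 0..1 scale factor as a single byte (factor * 256)."""
    # Assumption: a factor of exactly 1.0 would need 256, which does not
    # fit in a byte, so this sketch clamps it to 255.
    return min(255, round(factor * 256))

def scale(x, per256):
    """Multiply an 8-bit x by an 8-bit factor and keep only the high byte."""
    product = x * per256      # a 16-bit result on the 6502
    return product >> 8       # dividing by 256 is just taking the high byte

# One factor byte per table row Y (hypothetical factors, for illustration).
x_factor_table = [to_per256(60 / (251 - y)) for y in range(192)]
```

Because the divisor is 256, no division routine is needed at all: the multiplication result already holds the answer in its high byte.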

I made the first Shufflepuck commit once I got that right, because it seemed like the absolute basis for this game.

2b: The perspective

Of course, in three dimensions, the closer an object is, the bigger it is. You guessed it: it would be impossible to scale sprites in real time on a 1MHz 6502, so once again, we’ll trade size for speed. There are four sprites for the player’s pusher, six sprites for the puck, and two for the opponent’s pusher. The sprite to use is determined using its Y coordinate.

The different versions of the pushers and puck

Given that each pixel is a bit on the Apple II, all of these sprites are also stored 7 times, each version shifted one pixel to the right, in order to quickly position the sprite on the X axis instead of shifting everything manually. We sure did trade size for speed there: these three elements occupy 7329 bytes of memory.
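
A generator for those pre-shifted copies could look roughly like this. It is a sketch under assumptions: it treats the sprite as monochrome rows of 0/1 pixels, leaves the high (palette) bit of each byte clear, and the function names are invented for the example.

```python
def pack_row(pixels, shift):
    """Pack one row of 0/1 pixels into HGR bytes, shifted right by `shift`.

    Each HGR byte holds 7 pixels in bits 0-6, leftmost pixel in bit 0.
    Bit 7 (the palette bit) is left clear in this sketch.
    """
    width_bytes = (len(pixels) + 6) // 7 + 1   # one spare byte for the shift
    row = [0] * width_bytes
    for i, pixel in enumerate(pixels):
        if pixel:
            pos = i + shift
            row[pos // 7] |= 1 << (pos % 7)
    return row

def preshift(sprite_rows):
    """Return 7 copies of the sprite, one per X modulo 7 offset."""
    return [[pack_row(r, s) for r in sprite_rows] for s in range(7)]
```

At draw time, the copy to use is simply the one indexed by X modulo 7, so no bit shifting happens in the inner loop.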

3 : Draw fast enough

Drawing sprites on the Apple II is a time-consuming process. One has to select the correct version of the sprite via X modulo 7 (via a lookup table); get the first line’s start address (two more lookup tables, thanks to HGR interlacing!), add the X divided by 7 offset (a fourth lookup table!) to get the first byte to update; fetch a background byte to back it up, AND that with the sprite mask, OR this with the sprite data, and store the result on screen; do that for a full line, then iterate to the next line, etc.
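
For readers unfamiliar with the HGR layout, here is a sketch of how such lookup tables could be generated. The row address formula is the standard interleaved layout of HGR page 1; the table names and the split into low/high byte tables are assumptions for the example.

```python
HGR_PAGE1 = 0x2000

def hgr_row_address(y):
    """Start address of screen line y (0-191) in the interleaved HGR layout."""
    return HGR_PAGE1 + (y & 7) * 0x400 + ((y >> 3) & 7) * 0x80 + (y >> 6) * 0x28

# Row start addresses, split into low and high bytes for 8-bit indexing.
row_addr_lo = [hgr_row_address(y) & 0xFF for y in range(192)]
row_addr_hi = [hgr_row_address(y) >> 8 for y in range(192)]

# X lookups over the 280-pixel screen width.
x_mod_7 = [x % 7 for x in range(280)]    # which pre-shifted sprite copy to use
x_div_7 = [x // 7 for x in range(280)]   # byte offset within the line
```

Consecutive screen lines are 1024 bytes apart within a group of eight, which is why the address has to come from a table rather than a simple addition.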

The front pusher’s sprite is rather large at 49×17 pixels, and the puck when in front is also a bit large. Drawing each sprite at each frame amounted to almost 14000 cycles, and that was too much, for two reasons: the first one is that there’s a new frame to draw every ~17ms (or ~17000 cycles – the joy of 1MHz is that a cycle basically equals a microsecond. It’s not exactly that but it’s close enough), and in addition to drawing, one has to actually have enough cycles to run the game logic, too. And the second one: you don’t have 17000 cycles to draw a frame, unless you only draw on the last line of the screen!

This page explains it really well. If you want a flicker-free, clean draw, you have to be faster than the CRT beam. Start drawing right when the beam leaves the bottom-right corner of the screen, and you have 4550 cycles to draw on the first line before it arrives there. After that, you have an extra 65 cycles per line.
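
The budget described above can be written down as a small helper. This is a simplified model using the figures quoted in the text; it treats a whole draw as needing to finish before the beam reaches a given line, which is more conservative than what really happens line by line, and the function names are made up.

```python
CYCLES_PER_LINE = 65
VBLANK_CYCLES = 4550      # head start before the beam reaches line 0

def cycles_before_beam(line):
    """Cycles available, counted from the start of vertical blanking,
    before the beam reaches screen line `line` (0-191)."""
    return VBLANK_CYCLES + CYCLES_PER_LINE * line

def wins_race(draw_cycles, line):
    """True if a draw taking `draw_cycles` finishes before the beam gets to `line`."""
    return draw_cycles <= cycles_before_beam(line)
```

For instance, under this model a 14000-cycle update cannot finish before the beam reaches line 100, since only 4550 + 65 × 100 = 11050 cycles are available by then.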

At 14000 cycles per update, we couldn’t win the race and our pusher flickered when it was at the end of the player’s side of the table.

At first, I solved that by updating only half the screen every frame. When the puck was on the opponent’s side of the table, I drew their pusher and the puck on even frames, or the player’s pusher on odd frames. When the puck was on our side of the table, I either drew the opponent’s pusher, or the player’s plus the puck. This made each frame draw in less than 7000 cycles, but with this technique, I only had a 30 frames-per-second rate.

So, I changed the way I draw the player’s pusher. Instead of masking it with the background (which amounts to 20-25 cycles per byte: LDA screen, STA backup, AND mask, OR sprite, STA screen), I used a well-known old technique for performance: exclusive ORing. Exclusive ORing (LDA screen, EOR sprite, STA screen) has three advantages:

First, it allows us to spare cycles by not needing to save the background (-4 or 5 cycles per byte).
Second, it allows us to spare cycles by not needing to mask (-4 or 5 cycles per byte).
Third, as a bonus, the first two speed gains also translate to a memory usage gain, getting rid of the background save buffer (136 bytes), and the mask (all seven versions of all 4 sizes of the sprite: 3400 bytes!)
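
The property that makes this work is that XORing the same bytes twice restores the original contents. A minimal sketch, with arbitrary byte values and an invented `xor_sprite` helper, assuming the sprite is erased by drawing it a second time at the same place:

```python
def xor_sprite(screen, offset, sprite_bytes):
    """EOR sprite bytes into the screen buffer, in place."""
    for i, b in enumerate(sprite_bytes):
        screen[offset + i] ^= b

background = bytearray([0x55, 0x2A, 0x7F, 0x00])   # arbitrary screen contents
screen = bytearray(background)
sprite = bytes([0x0F, 0x70, 0x3C, 0x01])           # arbitrary sprite data

xor_sprite(screen, 0, sprite)   # draw: the screen now differs from the background
xor_sprite(screen, 0, sprite)   # erase: the background is back, no backup needed
```

The price is visual: wherever the sprite overlaps set background pixels, those pixels are inverted instead of being covered.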

Finally, I linked these assets in the correct order and places so that the biggest pusher and puck sprites would be page-aligned, to avoid the page-crossing penalty and spare three to five extra cycles per byte.

Aesthetically, it was pleasing enough; and, performance-wise, it allowed me to draw the three sprites for every frame in about 11000 cycles, which was enough to win the race against the beam, provided I drew the sprites in the correct order, from top to bottom; something that was needed anyway, for simple geometry reasons, so that the sprite in front would obscure the one behind it.

I kept the “clean” (background-masking) method of drawing the puck and the opponent pusher, as the EOR method was extremely ugly on them.

During all these optimisation phases, I needed to count cycles to have an idea of how much I needed to gain, or how much slack I had. Counting manually over such large functions is of course unfeasible, so I ran all of these tests with MAME’s CPU tracing log on, and used the trace logs with my debugger in Callgrind Profile mode:

In this trace, we can see an average of 11793 cycles per draw (4.7M cycles/402 calls)

While imperfect, this debugger helps me a lot on each project.

4. The size of the sounds

In the original Shufflepuck Cafe, related to playing, there are the same number of sounds as there are moving elements: three. One when a pusher hits the puck, one when the puck hits a wall, and one famous “window crash” sound when a player misses the puck.

But there are variations on these sounds: their pitch is lower on the opponent’s side. I could not have multiple samples to replicate that effect – just the crash sound, an 800ms sample at 8kHz, is 6500 bytes. I couldn’t decently use all of the Apple II memory for sprites and sounds! Instead, I added a feature to my sound player generator – and this is where I got really happy to have it count cycles for me. The idea was that before jumping to the sound player with the sample in the X register, I would set a “slowdown” factor in the Y register. Y=0 means no slowdown. Afterward, each of the sound player’s duty cycles would decrement that counter to 0, effectively wasting cycles, and making the pitch lower:

.proc slow_sound
         tya                       ; 2 cycles - backup Y
         ldy snd_slow              ; 5
:        dey                       ; 7
         bpl :-                    ; 9
         tay                       ; 11 - restore Y
         rts                       ; 17
.endproc

The method has two drawbacks: the first one is that in order to avoid having the sound player code explode in size, I chose to do it in a subroutine, meaning a 12-cycle (JSR + RTS) penalty in addition to the minimal execution length (11 cycles). These 23 cycles that couldn’t be used to drive the speaker meant that the resolution of my sound player went down from 47 to 30 levels (almost 5 bits versus almost 6 bits), making the sound less clear. The second one is that the sound player’s carrier frequency is lower when a sample is slowed down, making it more audible. But this was the only reasonable thing to do, and real-life testing showed that the sound effects were still clear enough.
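
To make the cycle arithmetic concrete, here is a small sketch that adds up the cost of one call to `slow_sound` for a given slowdown factor. It assumes `snd_slow` lives in zero page (a 3-cycle LDY, consistent with the running totals in the listing), that the factor stays between 0 and 127, and that the branch does not cross a page boundary.

```python
JSR_CYCLES = 6
RTS_CYCLES = 6

def slow_sound_cycles(snd_slow):
    """Total cycles spent per call to slow_sound, including JSR and RTS.

    Assumes snd_slow is in zero page (3-cycle LDY), 0 <= snd_slow <= 127,
    and that the branch does not cross a page boundary.
    """
    setup = 2 + 3                     # tya + ldy snd_slow
    loop = (snd_slow + 1) * 2         # one dey per pass through the loop
    loop += snd_slow * 3 + 2          # taken branches, then the final fall-through
    restore = 2                       # tay
    return JSR_CYCLES + setup + loop + restore + RTS_CYCLES
```

With a factor of 0 this gives the 23 cycles mentioned above, and each extra unit of slowdown adds 5 cycles to every duty cycle of the player.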

5. The memory management

As I continued adding opponents and bells and whistles, it was more and more clear that not everything was going to fit at once in the very limiting 64kB of RAM the Apple II has. So far, I had:

The table background
The sprites
The sounds
The opponent’s sprite, with variations when they win or lose a point
The opponent’s algorithm
The opponent’s winning and/or losing sounds

I also wanted to add a splash screen, an intro sound, the Cafe itself for selecting an opponent, the Roster of Champions, and probably more!

To solve that, I reused the same principles I used when developing Glider, but pushed the concept to the maximum: general code resident in memory, and each opponent’s code would come into a shared memory space, in different ca65 segments linked outside of the main binary, and loaded when needed.

I did the same with everything that was not used in-game, and did not even think twice about compression: my Glider experience showed me that not only does it make more data fit on the floppy, it is also faster to read less data from the floppy and spend the necessary cycles to move it where needed and decompress it.

In the end, at different moments of the game, the memory map of Shufflepuck Cafe looks like this:

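| $800 | $C00 | $2000 | $4000 | $6D00 | $7D00 | $BCFF | $BEFF | $D400 |
|---|---|---|---|---|---|---|---|---|
| File I/O buffer | LOW.CODE | Splash screen image | Temp. buffer | General code | DATA, RODATA, ONCE code | Software stack | ProDOS | |
| | | Intro sound | Intro sound | | DATA, RODATA, BSS | | | Lang. card code |
| | | Cafe image | Bar code | | | | | |
| | | Roster image | Bar code | | | | | |
| | | Table image | Skip | | | | | |
| | | | Visine | | | | | |
| | | | … | | | | | |
| | | | Eneg | | | | | |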

The first thing done at startup at $6D00 is to jump into the ONCE segment at $7D00 for early inits, then into the preloader function. This function will initialize the mouse, then load the LOW.CODE file containing the LOWCODE segment and splash screen image. While the image is displayed, the Language Card code is loaded, then – if /RAM is available – the init will, one after the other, load the Cafe image, the Cafe code, and the Table image at $4000, and back them up in uncompressed /RAM files. During this time, everything happens behind the scenes, with a nice splash screen visible. If /RAM is not there, on 64kB machines, we’ll just have to reload the Cafe and the Table from floppy each time we need them. Slower, but it makes Shufflepuck compatible all the way back to the original Apple II with a non-extended language card!

Once everything is preloaded, the preloader will turn on text mode – for cleanliness – and overwrite the splash screen image by loading the Intro sound at $2000 and up to $6D00 – it is a huge sound! Finally, the intro sound will be played, and the Cafe image and code are restored to $2000 and $4000.

When the player selects an opponent, the Cafe image is replaced by the Table image, and the Cafe code is replaced by the opponent’s code and assets.
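
The sizes of the shared regions follow directly from the addresses in the memory map. A quick sketch of that arithmetic (the `region_size` helper is just for illustration):

```python
def region_size(start, end):
    """Size in bytes of the region [start, end)."""
    return end - start

image_area = region_size(0x2000, 0x4000)      # splash, Cafe, Roster or Table image
shared_code = region_size(0x4000, 0x6D00)     # Cafe code or one opponent at a time
intro_sound = region_size(0x2000, 0x6D00)     # intro sound spans both regions
```

This gives 8192 bytes for the image area (one full HGR page), 11520 bytes for the shared code area, and 19712 bytes for the intro sound.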

When 128kB of memory is available, this makes for a rather long load – the preloading takes 17 seconds – but once it is done, switching from the Cafe to a game is rather fast, requiring only the opponent file to be loaded from floppy, which takes about 2 seconds.

The last advantage of this is that the “window crash” lines (the ones that appear when a point is scored), which are large and random, are easier to “clean up”, in a fast enough and less main-memory consuming manner… Unless you only have 64kB of RAM: on the 64kB Apple ][, where /RAM is not available, the preloading will be much faster, but every switch from Cafe to game and vice-versa will require reloading assets from floppy. Also, the table background has to be reloaded from floppy between each point, which is kind of annoying and also flashes raw LZSA1 compressed data on the screen buffer.

6. More sound problems

Playing sampled sound on the Apple II presents two drawbacks: the samples are memory-consuming, and you can’t do anything else, like drawing or calculating, during the time it is played.

6a: The blocking of the playback

This is, surprisingly, not problematic in many circumstances. In Shufflepuck, stopping animations while a sound plays is only impossible during a round, when the puck hits a pusher or a wall. If those samples are too long, the puck’s movement visibly pauses, and this feels really wrong. No silver bullet there: I made these two samples extremely short, less than 20ms, which translates to one dropped frame. One dropped frame is below the visible threshold. (By the way, one frame is dropped every sixth frame on the 60Hz US Apple II models too, otherwise the game would be 20% faster and harder than on the 50Hz EU models). The drawback is that the pusher hit sound is less satisfying than on the original Macintosh game.

For the window crash sound, it is not a problem to draw the crash lines, change the opponent’s sprite to their winning or losing face, then play the sound while everything is frozen on the screen.

And for the rest of the samples, especially those played during win animations, cutting them down and playing them multiple times allows me to alternate between a graphical update and a sound in what I think is a satisfying manner. (Example below, as 6a and 6b are intertwined).

6b. The size of the samples

It happened more than once when ripping sound effects from the original Shufflepuck that I thought “good, it’s short enough, it’s going to fit”, and that I would be wrong. The opponent segment is 11.5 kBytes, and once I put the actual code and various sprites inside, there is often less than one second of sound available. For Biff’s laugh, Eneg’s ear, Bejin’s service, or even DC3’s beeps, this is not enough. The solution here has been to cut the samples short and play them twice or more, or reassemble parts of them in different order, or a mix of both.

Bejin, for example, makes a different sound when she is going to serve to the left or to the right. Analyzing these two sounds from the original Mac version, I have noticed that when she serves to the right, a wavy sound is played twice. When she serves to the left, the start of the sound is different, but it finishes the same way (and as such is a little longer). In my code, that translates as:

choose_direction:
        lda     #END_SERVICE_RIGHT_DX ; Serve to the right, maybe
        sta     serve_direction

        jsr     _rand                 ; Randomize left or right
        and     #1
        beq     :+                    ; if rand is even, go play the
                                      ; serve-right sound

        lda     #END_SERVICE_LEFT_DX  ; Serve left in the end
        sta     serve_direction

        ldy     #0                    ; Play the serve-left prelude
        jsr     _play_serve_g_left

:       ldy     #0                    ; Play the serve-right sound,
        jsr     _play_serve_g_right
        ldy     #0                    ; twice
        jsr     _play_serve_g_right

For Biff, where he laughs four times at the player when he scores, the full sample was much too long too. I kept one of the four laughs, and played it four times. This seemed a bit too mechanical, so the third and fourth ones are marginally slowed down. In between each replay, I update the sprite so his face moves while he laughs:

.proc sound_win
        jsr     win_animation     ; Make Biff's face laugh
        ldy     #0                ; Play a laugh
        jsr     _play_win_h
        jsr     normal_animation  ; Make Biff's face normal
        ldy     #0                ; Play a laugh
        jsr     _play_win_h
        jsr     win_animation     ; Make Biff's face laugh
        ldy     #1                ; Play a slowed-down laugh
        jsr     _play_win_h
        jsr     normal_animation  ; Make Biff's face normal
        ldy     #1                ; Play a slowed-down laugh
        jmp     _play_win_h
.endproc

This is the result:

7. Fitting the hand

With my Shufflepuck Cafe almost ready, there was a single detail missing. An important detail in my opinion: the robotic score-updating hand. It is a very memorable detail from the original game, and I wanted it in my clone to consider it complete.

The Macintosh version’s hand, drawing the last point

But that hand is a moving sprite, which means that it has to be duplicated 7 times to align it to the pixel, and it has to be masked, and it is large, at 84×33 pixels, which means ((84+7)/7)*33*14 bytes of data – 6 kilobytes – plus a bit of code to be able to draw a partial sprite as it comes out of the border of the screen.

My ca65 map file was telling me no: I had only 2034 bytes left in the Apple II’s main memory, and /RAM was not going to save me there.

So I took liberties with the original, and I made the hand come from the scoreboard’s lamp. The sprite became small enough that way, 14×29 pixels translating to 1218 bytes of data. It also allowed me to partially draw the sprite by skipping lines rather than columns, which was easier and more compact. The hand was in.
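
The byte counts quoted in this section can be reproduced with a short helper. The function name and its parameters are invented for the sketch; the formula is the one given above, with 14 being 7 shifted copies times 2 planes (sprite data and mask).

```python
def sprite_bytes(width, height, shifts=7, masked=True):
    """Memory needed for a pre-shifted sprite, in bytes."""
    bytes_per_row = (width + 7) // 7      # room for the widest shifted copy
    planes = 2 if masked else 1           # sprite data plus mask
    return bytes_per_row * height * shifts * planes

original_hand = sprite_bytes(84, 33)      # the Macintosh-style hand
lamp_hand = sprite_bytes(14, 29)          # the smaller hand that shipped
pusher_backup = sprite_bytes(49, 17, shifts=1, masked=False)
```

The results are 6006 bytes for the original hand and 1218 bytes for the smaller one; the last line gives 136 bytes, the size of the background save buffer for the 49×17 front pusher mentioned in section 3.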

8. The serial protocol

Finally, I realized I had all the space I would need to ship serial drivers and make my version of Shufflepuck multiplayer, something that should have happened in 1989: a serial opponent is “just” another opponent, and as such, could use as much as 10kB of memory in the opponent segment.

As you may know, the Apple II is quite slow, and the interval separating two frames is 16.7ms, during which we first have to draw the sprites while racing the CRT beam. It takes ~12ms at worst to draw the three sprites (yes. The Apple II is slow.) so there is not a lot available to do the rest – the game logic, and the serial communication must happen in 4.7ms at most, which roughly translates to 4700 cycles.

Testing serial on emulator proved difficult with brain-hand-window-synchronisation

There was no practical way to make the communication asynchronous using IRQs, as they are expensive (at least 200 cycles), have a tendency to fire at exactly the wrong moment, and complicate the handling of the VBL IRQ, so no buffering was possible either. So, the implementation takes advantage of the fact that the serial chip(s) in Apple II computers (the ACIA 6551 for the 8-bit ones, and the Zilog 8530 for the IIgs) are full-duplex capable.

At each frame, both computers send a single byte, in a non-blocking manner. This byte can either

contain the player’s paddle coordinates, mirrored, or
contain a flag that something happened.

In the first case, which is most of the time, each computer mirrors the X and Y coordinates of their player’s paddle and sends that to the other computer. These coordinates fit in one byte, thanks to the ranges involved and a bit of rounding error: X can be 0-224 (mirrored to 224-0) and Y can be 154-191 (mirrored to 38-1). The byte is then built with X’s five high bits and three bits of Y (shifted right three times): XXXXXYYY. The resulting rounding error is 7 pixels on the X axis and 6 pixels on the Y axis, which is acceptable: from the other side of the table, that’s barely visible.

This byte is pushed to the serial chip data register, and right after that, we check if a byte arrived. If not, on to the next frame! Otherwise, we unpack and update the other player’s paddle coordinates so that it’s redrawn at the new place.
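
The packing and unpacking just described can be sketched like this. The function names are invented, and the exact mirroring arithmetic (224 - X and 192 - Y) is inferred from the ranges given above rather than taken from the source code.

```python
EVENT_FLAG = 0xFF

def pack_paddle(x, y):
    """Mirror the local paddle position and pack it as XXXXXYYY."""
    mx = 224 - x          # X 0-224   -> 224-0
    my = 192 - y          # Y 154-191 -> 38-1
    return (mx & 0xF8) | ((my >> 3) & 0x07)

def unpack_paddle(byte):
    """Recover the approximate mirrored coordinates on the receiving side."""
    return byte & 0xF8, (byte & 0x07) << 3
```

Since the mirrored X never exceeds 224 (binary 11100000), a packed byte always has at least one clear bit among its five X bits, so it can never equal the `EVENT_FLAG` value.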

In the second case, if something happened – that is if the player hit the puck – the computer of that player has to provide the new puck parameters to the other one: the X/Y coordinates of the hit, and the new velocity (delta X and delta Y). The hitter is the source of truth, otherwise rounding errors make the trajectory drift apart quite quickly.

In this case, the “event” flag is sent ($FF, or 11111111, a value that can’t happen when exchanging coordinates because X=224 is 11100000), but after sending it, the computer blocks until it gets the same byte as an acknowledgment.

At this point, I know both computers are at known places in their loop and ready to chat more (this is called a barrier). One status byte follows (‘H’ for hit), an answer is awaited, then the four bytes (X, Y, dX, dY) required are sent using the same technique: putting the value in the serial data register, and waiting for an answer to be read.
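
Seen from the hitter’s side, the sequence could be sketched as below. This is an illustration only: the `link` object with `write` and `read_blocking` methods is hypothetical, `EchoLink` is a stand-in that echoes every byte, and the content of the answers after the initial flag is an assumption, since the text only says that an answer is awaited.

```python
EVENT_FLAG = 0xFF

def send_and_wait(link, value):
    """Write one byte, then block until the peer answers."""
    link.write(value & 0xFF)
    return link.read_blocking()

def send_hit(link, x, y, dx, dy):
    """Announce a puck hit: flag barrier, status byte, then four parameters."""
    if send_and_wait(link, EVENT_FLAG) != EVENT_FLAG:
        return False                      # peer did not acknowledge the barrier
    send_and_wait(link, ord("H"))
    for value in (x, y, dx, dy):
        send_and_wait(link, value)
    return True

class EchoLink:
    """Hypothetical in-memory stand-in for the serial port; echoes every byte."""
    def __init__(self):
        self.sent = []
    def write(self, value):
        self.sent.append(value)
    def read_blocking(self):
        return self.sent[-1]
```

Each byte costs one round trip, which is why this exchange is reserved for rare events and the per-frame paddle update stays non-blocking.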

The same flag-based I/O barrier happens when a player misses the puck and it crashes on their side of the table. That fact is sent over serial, even if the opponent’s game engine may have arrived at the same conclusion at the same time: if for some reason a desync happens, that fixes it.

And that’s all. The serial communication takes about 300 cycles per frame in general, and 600 when the puck is hit, most of those consisting of waiting for a reply.

At first, the communication was synchronous for every message, but as both computers’ vertical blanking is never synchronized unless one is very lucky turning them on at the same nanosecond, this induced frame dropping, which was unsatisfying. The opponent’s paddle coordinates can, after all, be late one frame!

Conclusion

I have taken a lot of pleasure coding this clone of Shufflepuck Cafe. I don’t think I would have succeeded if I had not started my game development adventure with the simpler Glider, and I learned a lot in the process of developing these two games. I am very happy with the result I achieved – even though my clone is not feature-complete compared to the original (pusher configuration and blockers are missing, and the opponent’s pushers are identical whereas some of them have smaller ones in the original), I consider I have captured the essence of Shufflepuck’s ambiance. And, in my opinion, these missing features are really offset by the presence of two-player mode.

I hope you will have as much pleasure playing it, and if you have read this article this far, I guess you’re a nerd and suppose it brought you joy too.

You can download Shufflepuck Cafe for Apple II on the project’s homepage, and look at the source code on the Github page. I tried to comment it profusely, both for the reader and for my future self.