Box64 and RISC-V in 2024: What It Takes to Run The Witcher 3 on RISC-V

Original link: https://box86.org/2024/08/box64-and-risc-v-in-2024/

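The RISC-V backend of Box64, which lets popular x86_64 software such as AAA games run on RISC-V machines, has made major progress over the past year. Notably, The Witcher 3 was recently run on RISC-V through Box64, Wine, and DXVK. Previously, the RV64 DynaRec could only handle lightweight native Linux games such as Stardew Valley and World of Goo, because the rapid implementation of many new x86_64 instructions had left numerous bugs in it and because compatible GPU support was limited. The recent arrival of more capable RISC-V devices with a PCIe slot for a graphics card, such as the 64-core Milk-V Pioneer, has allowed the RV64 DynaRec to advance further. In addition, newer RISC-V chips such as the SpacemiT K1/M1 SoC now offer RVV support, which opens the door to a better gaming experience. Challenges remain, such as the lack of some basic instructions needed for efficient x86 emulation, but getting The Witcher 3 to run shows the potential for better performance through continued development.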

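Compared with x86-64 or ARM, writing software for RISC-V usually requires only minimal changes. The main differences are the number of available registers, the memory model, and certain micro-optimization considerations. On average, RISC-V code is smaller thanks to its compressed instruction extension. The 32 general-purpose registers available on RISC-V also allow more intermediate values to be kept before data has to be stored on the stack, which improves performance in tight loops. Other differences stem from features that are absent from a given base architecture, such as vector extensions and population count instructions. These factors matter during micro-optimization but usually do not affect overall program behavior or performance. In addition, the weak memory model that RISC-V shares with ARM reduces software's reliance on a strong memory model. Executable size may differ slightly, but thanks to its compressed instruction extension, RISC-V's code density is comparable to that of 32-bit ARM and x86. Overall, although RISC-V brings unique challenges and opportunities for hardware and system designers, these are mostly handled by the compiler on behalf of software engineers and require almost no adjustment to common coding practices.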

Original text

It’s been over a year since our last update on the state of the RISC-V backend, and we recently ran The Witcher 3 on a RISC-V PC, which I believe makes it the first AAA game ever to run on a RISC-V machine. So I thought this would be a perfect time to write an update, and here it comes.

The Witcher 3 Running on RISC-V via Box64, Wine, and DXVK.

The Story

A year ago, RV64 DynaRec could only run some relatively “easy-to-run” native Linux games, such as Stardew Valley, World of Goo, etc.

On the one hand, this was because a large number of new x86_64 instructions had been implemented quickly in the RV64 DynaRec, which left many bugs behind. Things won’t work if you don’t implement the x86_64 ISA correctly. But the most important factor was that, at the time, we had no RISC-V device that could take an AMD graphics card, and the IMG integrated GPUs on the VisionFive 2 and LicheePi 4A did not support OpenGL, only OpenGL ES.

We can get a certain level of OpenGL support using gl4es, which allows games like Stardew Valley to run, but it is not enough for other more serious Linux games, as well as all Windows games in general.

So this became a hard barrier for us to test more x86 programs in the wider world, until both ptitSeb and I received the Milk-V Pioneer from Sophgo, which is a 64-core RISC-V PC, and of course, it also has a PCIe slot for a graphics card. Many thanks to Sophgo!

In addition, another core contributor, xctan, found a way to “plug” an AMD graphics card into the VisionFive 2 via the M.2 interface. With that, we were exposed to the wider world, and we’ve since fixed a ton of RV64 DynaRec bugs and added a ton of new x86 instructions. Changes in quantity lead to changes in quality: more and more games started working, and finally we tried running The Witcher 3 for the first time, and it just worked!

That’s the story of running The Witcher 3 on RISC-V.

What is the Current Status of RISC-V DynaRec?

The x86 instruction set is very, very big. According to rough statistics, the ARM64 backend implements more than 1,600 x86 instructions in total, while the RV64 backend implements about 1,000. More than 300 of the instructions in the ARM64 backend are newly supported AVX ones that we haven’t implemented at all on RISC-V. Anyway, we still have some catching up to do.

Also, for SSE instructions, the RV64 backend uses scalar instructions, while the AArch64 backend uses the Neon extension and the LoongArch64 backend uses the LSX extension. So the performance is quite poor compared to the other two backends.
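To give a feel for what scalar emulation of an SSE instruction involves, here is a minimal Python sketch of PADDD (packed 32-bit add) processed one lane at a time. The choice of PADDD and the name `paddd_scalar` are assumptions made for illustration. The sketch models only the instruction's semantics and is not box64's actual code.

```python
LANE_MASK = 0xFFFFFFFF  # each PADDD lane is 32 bits wide

def paddd_scalar(a, b):
    """Add two 128-bit values as four independent 32-bit lanes (wrapping)."""
    result = 0
    for lane in range(4):
        shift = 32 * lane
        x = (a >> shift) & LANE_MASK          # extract lane from a
        y = (b >> shift) & LANE_MASK          # extract lane from b
        # Wrap within the lane so no carry leaks into the next lane.
        result |= ((x + y) & LANE_MASK) << shift
    return result
```

A vector extension can do all four lane additions with a single instruction, whereas a scalar approach has to extract, add, mask, and reinsert each lane separately.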

However, things are not set in stone. RISC-V has a vector extension called the Vector extension. Yeah I know, so I will call it RVV from now on.

There are already some devices on the market that support RVV, such as the Milk-V Pioneer mentioned above. It supports the xtheadvector extension, which is a variant of RVV version 0.7.1 (things are a bit complicated). In addition, the SpacemiT K1/M1 SoC released not long ago supports the ratified RVV 1.0. The Banana Pi F3 and Milk-V Jupiter, both equipped with this SoC, are already available for purchase.

With these devices available, we have recently added basic RVV support to box64 and implemented several common SSE instructions with it. However, this work is still at a very early stage, so it does not help performance for now. But the future is promising, right?

Next, let’s talk about the two dark clouds hanging over the RISC-V backend. These are the areas where I have felt RISC-V to be most lacking for x86 emulation over the past year.

The Most Wanted Instructions for x86 Emulation

At least in the context of x86 emulation, among all 3 architectures we support, RISC-V is the least expressive one. Compared with AArch64 and LoongArch64, RISC-V lacks many convenient instructions, which means that we have to use more instructions to emulate the same behavior, so the translation efficiency will be lower.

Among them, two capabilities are the most critical: picking a range of bits from one register into another, and inserting some bits from one register into a range of another register.

Both LoongArch64 and AArch64 have equivalent instructions, but the RISC-V world has no counterparts for these two instructions, whether in official or vendor extensions. These are not complex instructions that break the RISC philosophy, so it’s a shame they do not exist on RISC-V.

But why is this so important for x86 emulation? Because x86 instructions that operate on part of a register tend to leave the remaining bits unchanged.

For example, for an ADD AH, BL instruction, box64 needs to extract the lowest byte from RBX, add it to the second lowest byte of RAX, and then insert the result back into the second lowest byte of RAX while keeping all other bytes in RAX unchanged.

On LoongArch64, we have BSTRPICK.D to pick the bits, and BSTRINS.D to insert the bits, so the implementation would be:

BSTRPICK.D scratch1, xRAX, 15, 8
BSTRPICK.D scratch2, xRBX, 7, 0
ADD scratch1, scratch1, scratch2
BSTRINS.D xRAX, scratch1, 15, 8

Simple and intuitive, right? And it would be just as simple on ARM64, with the UBFX and BFI opcodes. On RISC-V, however, we have to do this:

# extract the second lowest byte of RAX
SRLI scratch1, xRAX, 8
ANDI scratch1, scratch1, 0xFF
# extract the lowest byte of RBX
ANDI scratch2, xRBX, 0xFF
# do the addition
ADD scratch1, scratch1, scratch2
# fill scratch3 with mask 0xFFFF_FFFF_FFFF_00FF
LUI scratch3, 0xFFFF0
ADDIW scratch3, scratch3, 0xFF
# insert it back
AND xRAX, xRAX, scratch3
ANDI scratch1, scratch1, 0xFF
SLLI scratch1, scratch1, 8
OR xRAX, xRAX, scratch1

So that is a total of 10 instructions for a simple byte add, and this is by no means an isolated case! There are many similar instructions in x86, and implementing them on RISC-V is just as cumbersome.
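As a sanity check on the two sequences above, the following Python sketch models both approaches on 64-bit values and confirms they produce the same RAX. The helper names (`bstrpick`, `bstrins`, `sext32`, and the two `add_ah_bl_*` functions) are made up for this illustration and only loosely follow the LoongArch mnemonics. x86 flags are not modeled, so this is a sketch of the register update only, not of box64's generated code.

```python
MASK64 = (1 << 64) - 1

def bstrpick(src, msb, lsb):
    """Bits [msb:lsb] of src, zero-extended (models BSTRPICK.D / UBFX)."""
    width = msb - lsb + 1
    return (src >> lsb) & ((1 << width) - 1)

def bstrins(dst, src, msb, lsb):
    """dst with bits [msb:lsb] replaced by low bits of src (BSTRINS.D / BFI)."""
    width = msb - lsb + 1
    field = ((1 << width) - 1) << lsb
    return ((dst & ~field) | ((src << lsb) & field)) & MASK64

def sext32(value):
    """Sign-extend the low 32 bits to 64 bits, as LUI and ADDIW do on RV64."""
    value &= 0xFFFFFFFF
    return value | 0xFFFF_FFFF_0000_0000 if value & 0x8000_0000 else value

def add_ah_bl_pick_insert(rax, rbx):
    ah = bstrpick(rax, 15, 8)
    bl = bstrpick(rbx, 7, 0)
    return bstrins(rax, ah + bl, 15, 8)

def add_ah_bl_rv64(rax, rbx):
    s1 = (rax >> 8) & 0xFF        # SRLI, ANDI
    s2 = rbx & 0xFF               # ANDI
    s1 = (s1 + s2) & MASK64       # ADD
    s3 = sext32(0xFFFF0 << 12)    # LUI   -> 0xFFFF_FFFF_FFFF_0000
    s3 = sext32(s3 + 0xFF)        # ADDIW -> 0xFFFF_FFFF_FFFF_00FF
    rax &= s3                     # AND: clear the old AH
    s1 &= 0xFF                    # ANDI: drop the carry out of the byte
    s1 = (s1 << 8) & MASK64       # SLLI
    return rax | s1               # OR
```

For instance, with RAX = 0x1122334455667788 and BL = 0xFF, AH goes from 0x77 to 0x76 (the carry is discarded) and every other byte of RAX is preserved in both versions.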

The Frustration of 16-byte Atomic Instructions

x86 has LOCK-prefixed instructions for lock-free atomic operations, and box64 mainly uses LR/SC sequences to emulate these. LR/SC is short for Load-Reserved / Store-Conditional.

For example, for LOCK ADD [RAX], RCX, we generate the following code:

MARKLOCK:
LR.D scratch1, (xRAX)
ADD scratch2, scratch1, xRCX
SC.D scratch3, scratch2, (xRAX)
BNEZ scratch3, MARKLOCK

If the address in RAX is unaligned, things become a bit more complex, but in general, this works really well.
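The retry behavior of this loop can be modeled with a toy, single-threaded Python sketch. `ToyMemory`, `lock_add`, and the `interfere` hook are hypothetical names invented for this illustration; real LR/SC reservations are tracked by the hardware and can be lost for more reasons than the one simulated here.

```python
MASK64 = (1 << 64) - 1

class ToyMemory:
    """Toy model: one reservation, 64-bit cells keyed by address."""
    def __init__(self):
        self.cells = {}
        self.reservation = None

    def load_reserved(self, addr):            # models LR.D
        self.reservation = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        """Plain store, e.g. from another hart; breaks a reservation on addr."""
        self.cells[addr] = value & MASK64
        if self.reservation == addr:
            self.reservation = None

    def store_conditional(self, addr, value):  # models SC.D
        """Return 0 on success and 1 on failure, like SC.D's result register."""
        if self.reservation != addr:
            return 1
        self.cells[addr] = value & MASK64
        self.reservation = None
        return 0

def lock_add(mem, addr, rcx, interfere=None):
    """Emulate LOCK ADD [addr], rcx; return how many attempts were needed."""
    attempts = 0
    while True:                                    # MARKLOCK:
        attempts += 1
        old = mem.load_reserved(addr)              # LR.D
        if interfere is not None and attempts == 1:
            interfere(mem)                         # simulated competing write
        new = (old + rcx) & MASK64                 # ADD
        if mem.store_conditional(addr, new) == 0:  # SC.D, then BNEZ on failure
            return attempts
```

When nothing interferes, the store succeeds on the first attempt. If another write lands between the load and the store, the store-conditional fails and the loop simply recomputes from the fresh value.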

Except for the LOCK CMPXCHG16B instruction, which compares RDX:RAX with 16 bytes of memory and, if they are equal, stores RCX:RBX to that memory address. While some 16-byte atomic instructions in AArch64 and LoongArch64 can be used to implement this, there are, unfortunately, no counterparts in RISC-V whatsoever.

Therefore, we cannot implement this instruction as perfectly as other architectures, and even more unfortunately, many programs use this instruction, such as Unity games.
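For reference, the semantics that have to be reproduced can be sketched in Python as follows. The global lock used here is purely an assumption for illustration and is not a description of what box64 does. The function name `cmpxchg16b`, the `mem` dictionary of 8-byte words, and the returned tuple are likewise hypothetical.

```python
import threading

# Hypothetical global lock standing in for a missing 16-byte atomic.
_fallback_lock = threading.Lock()

def cmpxchg16b(mem, addr, rax, rdx, rbx, rcx):
    """Model CMPXCHG16B on mem, a dict of 64-bit words keyed by address.

    Returns (zf, rax, rdx): zf is 1 if the exchange happened, and RDX:RAX
    hold the value read from memory when the comparison fails.
    """
    with _fallback_lock:
        lo, hi = mem[addr], mem[addr + 8]
        if lo == rax and hi == rdx:
            # Match: store RCX:RBX (high:low) and set ZF.
            mem[addr], mem[addr + 8] = rbx, rcx
            return 1, rax, rdx
        # Mismatch: load the memory value into RDX:RAX and clear ZF.
        return 0, lo, hi
```

A lock like this only serializes against other code paths that take the same lock. A plain 8-byte store from another thread can still slip in between the two halves, which is one way such a fallback falls short of a true 16-byte atomic.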

The End

In the end, despite all those shortcomings, The Witcher 3 actually runs with box64, at up to 15 fps in-game and at full speed on the main menu! So not that bad for a machine never designed to run AAA games!

The Witcher 3 Menu with DXVK_HUD running on RISC-V