Mark's Magic Multiply

Original link: https://wren.wtf/shower-thoughts/marks-magic-multiply/

## Mark's Magic Multiply: Optimising Floating Point on Embedded Systems

This post details work on speeding up single-precision floating-point multiplication on resource-constrained embedded processors. Rather than relying on a hardware FPU or pure software emulation, the author implements floating-point operations through a custom RISC-V extension (Xh3sfx), aiming for a "firm floating point" solution.

The initial implementation used the standard unpack-multiply-normalise-pack approach, achieving a 16-cycle multiply on a Hazard3 core with a dedicated multiplier. The author was then inspired by a clever trick from Mark Owen that improves performance significantly. Owen's method, designed for the Armv6-M architecture, achieves a correctly-rounded multiply using just *two* 32 × 32-bit multiplications, cleverly exploiting error bounds and minimising bit manipulation. Adapted to RISC-V, it yields a 30-cycle implementation, a notable improvement. The core idea is to compute an approximate product and then correct it using a predictable error bound. The technique highlights that on a short pipeline the multiplication itself is not the bottleneck; efficient bit manipulation and data routing are the keys to optimisation. The author suggests generalising the technique to double-precision multiplication.


This post is about a topic very near and dear to my heart. That’s right: single-precision floating-point multiplication on embedded processors. I’ll start with some background on why I’ve been so invested in this topic recently, walk through the implementations I’ve come up with on my own, and end by dissecting an absolutely ridiculous trick by Mark Owen for floating-point multiplication on 32-bit embedded cores, which was the original inspiration for this post.

⚠️ This post contains floating point. Floating point is known to the State of California to cause confusion and a fear response in mammalian bipeds. The standard recommendation is What Every Computer Scientist Should Know About Floating-Point Arithmetic. The actual IEEE 754-2008 standard is also uncharacteristically concise and readable, provided you ignore the fan fiction about radix != 2. For a more tactile experience try poking ones and zeroes into IEEE 754 Calculator (start with binary16).

Lately I’ve been working on a custom RISC‑V extension called Xh3sfx for accelerating soft floating-point routines. This is a halfway house between having an FPU and not having an FPU, which I feel is an under-explored space. You could call it firm floating point.

When you compile a C program using float variables for a target that lacks floating-point hardware support, the compiler inserts calls to a runtime library like libgcc or compiler-rt to perform the requested operations. This is sometimes called floating point emulation because it fills the role of a hardware FPU, but really it’s just one approach to implementing the floating-point operations specified in IEEE 754.
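For instance, a plain C multiply like the one below is lowered to a library call on a soft-float target. The routine names here are the usual libgcc and Arm EABI ones; a given toolchain may use different symbols.

```c
/* With no FPU (e.g. a soft-float RISC-V or Armv6-M target), the compiler
 * lowers the `*` below to a runtime call such as __mulsf3 (libgcc) or
 * __aeabi_fmul (Arm EABI) instead of emitting a hardware fmul instruction. */
float scale(float x, float y) {
    return x * y;
}
```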

Although Xh3sfx is a custom extension, I’m not signing up to maintain and distribute a forked compiler. It’s easier to just replace the compiler runtime routines with accelerated versions. The new routines use a handful of specialised ALU operations to handle the gritty and ugly parts of floating-point formats, mixed in with regular integer instructions for the actual computation. The runtime libraries have a mostly documented and stable API surface. Adding support to your program just requires linking the acceleration library or adding its source files to your build, which is a reasonable approach for embedded firmware.

For a nominal fee of a few hundred gates, Xh3sfx gives you single-precision addition in 14 cycles and multiplication in 16 cycles, ignoring function call overhead. (It can do other stuff too, these are just examples.) Qualitatively this turns floating point from “oh god why is this so slow” to something that Just Works™ in general applications code and light audio DSP. I originally posted about it on Mastodon here. You can read about the instructions here and see some library routines here.

The default single-precision multiply implementation in the Xh3sfx library has the following steps:

  • Unpack the exponents and significands from the two floating-point inputs.
  • Calculate the exact significand product with a 32 × 32 → 64-bit multiply (mul; mulh).
  • Squash the product back into a 32-bit register in a way that preserves its rounding direction.
  • Normalise the product.
  • Pack the product with the sum of the original exponents to yield the final floating-point result.

Concretely it looks like this (the assembly listing is not reproduced in this extract; it uses Xh3bextm for h3.bextmi and Xh3sfx for the rest):

ALU ops and non-taken forward branches are one cycle each, so this comes out to 16 cycles (ignoring the function call overhead).
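The five steps listed above can be modelled in C. This is an illustrative sketch, not the Xh3sfx routine: it assumes normal, finite inputs and ignores zeros, subnormals, infinities, NaNs, and exponent over/underflow.

```c
#include <stdint.h>

/* C model of the unpack / multiply / squash / normalise / pack flow. */
uint32_t f32_mul_model(uint32_t x, uint32_t y) {
    uint32_t sign = (x ^ y) & 0x80000000u;
    int32_t  e  = (int32_t)((x >> 23) & 0xff)
                + (int32_t)((y >> 23) & 0xff) - 127;
    uint32_t mx = (x & 0x007fffffu) | 0x00800000u;  /* implicit leading 1 */
    uint32_t my = (y & 0x007fffffu) | 0x00800000u;

    uint64_t p = (uint64_t)mx * my;  /* exact 48-bit product (mul; mulh) */

    /* Squash to 32 bits, preserving rounding direction: keep bits 47:16
     * and fold the discarded low bits into a sticky bit. */
    uint32_t q = (uint32_t)(p >> 16) | (((uint32_t)p & 0xffffu) != 0);

    /* Normalise: a product of [1,2) x [1,2) lies in [1,4). */
    if (q & 0x80000000u) e += 1; else q <<= 1;

    /* Round to nearest, ties to even. */
    uint32_t sig = q >> 8;            /* 24-bit significand */
    uint32_t rem = q & 0xffu;         /* discarded bits + sticky */
    if (rem > 0x80u || (rem == 0x80u && (sig & 1u))) sig += 1;
    if (sig >> 24) { sig >>= 1; e += 1; }  /* rounding carried out */

    return sign | ((uint32_t)e << 23) | (sig & 0x007fffffu);
}
```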

This implementation is optimal assuming mul and mulh are equally fast, and assuming you require correctly-rounded results for all inputs. It’s possible to save two cycles if you can tolerate ~0.5 ulp of error; this is left as an exercise for the reader (I always wanted to say that).

Hazard3 has three hardware configurations for multiplication:

  1. Minimal: All divide and multiply instructions execute on a sequential multiply/divide circuit, usually configured for either 1 or 2 bits per cycle, plus a couple of extra cycles for sign correction.
  2. Intermediate: 32 × 32 → 32-bit mul executes on a dedicated fast multiplier; the remaining multiplies mulh/mulhu/mulhsu still execute on the sequential multiply/divide circuit.
  3. Full: All multiplies execute on a dedicated fast 32 × 32 → 64-bit multiplier. Divides still execute on the sequential circuit.

These options are arranged in ascending order of both performance and area cost. The full 32 × 32 → 64-bit multiply is the option that was taped out on RP2350. For a minimal implementation with just the sequential multiply/divide, the mul; mulh implementation discussed earlier is again optimal. Option 2 is still interesting because it’s a well-balanced design: mul is overwhelmingly the most commonly executed instruction in the M extension.

My first attempt at optimising for this configuration was to break the 32 × 32 → 64-bit multiplication into four 16 × 16 → 32-bit multiplications, which each execute in a single cycle using the mul instruction.
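In C, the schoolbook decomposition looks like this (a hypothetical helper shown for clarity, not the library routine; each of the four partial products fits a single mul):

```c
#include <stdint.h>

/* 32x32 -> 64-bit product built from four 16x16 -> 32-bit partial
 * products, mirroring the schoolbook decomposition described above. */
uint64_t mul32x32_schoolbook(uint32_t a, uint32_t b) {
    uint32_t al = a & 0xffffu, ah = a >> 16;
    uint32_t bl = b & 0xffffu, bh = b >> 16;
    uint32_t ll = al * bl;   /* weight 2^0  */
    uint32_t lh = al * bh;   /* weight 2^16 */
    uint32_t hl = ah * bl;   /* weight 2^16 */
    uint32_t hh = ah * bh;   /* weight 2^32 */
    return ((uint64_t)hh << 32) + ((uint64_t)lh << 16)
         + ((uint64_t)hl << 16) + (uint64_t)ll;
}
```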

Karatsuba multiplication is a neat identity that reduces the number of partial product multiplies from four to three. It’s kind of like Strassen’s algorithm but for scalars. Unfortunately it brings some setup and teardown costs, as well as a requirement to handle 33-bit intermediates. I haven’t done the working but the seat of my pants says it’s going to be slower. It’s probably more useful at higher ratios of product size to multiplier size, so you get some compounded savings from the recursion.
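A sketch of the Karatsuba identity under the same 16-bit split (again a hypothetical helper, not anything from the library), showing where the wider intermediate comes from:

```c
#include <stdint.h>

/* Karatsuba: three partial products instead of four. The middle factors
 * (al + ah) and (bl + bh) can each be 17 bits wide, so their product
 * needs up to 34 bits, which is the awkward-intermediate cost noted
 * above; a 64-bit type absorbs it here. */
uint64_t mul32x32_karatsuba(uint32_t a, uint32_t b) {
    uint32_t al = a & 0xffffu, ah = a >> 16;
    uint32_t bl = b & 0xffffu, bh = b >> 16;
    uint64_t lo  = (uint64_t)al * bl;
    uint64_t hi  = (uint64_t)ah * bh;
    uint64_t mid = (uint64_t)(al + ah) * (bl + bh) - lo - hi; /* = al*bh + ah*bl */
    return (hi << 32) + (mid << 16) + lo;
}
```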

I was happy with my implementation until Mark Owen emailed me out of the blue with this trick (Armv6‑M):

 lsls r0,#8     @ x mantissa
 lsls r1,#8       @ y mantissa
 lsrs r0,#9
 lsrs r1,#9

 adds r2,r0,r1    @ for later
 mov r12,r2
 lsrs r2,r0,#7    @ x[22..7] Q16
 lsrs r3,r1,#7    @ y[22..7] Q16
 muls r2,r2,r3    @ result [45..14] Q32: never an overestimate and
                  @ worst case error is 2*(2^7-1)*(2^23-2^7)+(2^7-1)^2
                  @ = 2130690049 < 2^31
 muls r0,r0,r1    @ result [31..0] Q46
 lsrs r2,#18      @ result [45..32] Q14
 bcc 1f
 cmp r0,#0
 bmi 1f
 adds r2,#1       @ fix error in r2
1:
 lsls r3,r0,#9    @ bits off bottom of result
 lsrs r0,#23      @ Q23
 lsls r2,#9
 adds r0,r2       @ cut'n'shut
 add r0,r12       @ implied 1*(x+y) to compensate for no insertion of
                  @ implied 1s
@ result-1 in r3:r0 Q23+32, i.e., in range [0,3)

Mark is the author of the RP2040 ROM float library, and his emails are always hard-wrapped to 80 characters. This function returns correctly-rounded single-precision multiplies with just two 32 × 32 → 32-bit partial products. It also does it with a lot less bit twiddling and general waffling about than my schoolbook multiplication, even with a much more limited instruction set.

The core trick is to compute a 23 × 23 → 46-bit product using two multiplies. This trick doesn’t work for 24-bit inputs, so he leaves out the implicit one, multiplies just the 23-bit fractional parts, then compensates later for the missing one. Starting to work through:

 muls r0,r0,r1    @ result [31..0] Q46

The lower multiply directly gives us the lower 32 bits of the result. This part is exact.

 lsrs r2,r0,#7    @ x[22..7] Q16
 lsrs r3,r1,#7    @ y[22..7] Q16
 muls r2,r2,r3    @ result [45..14] Q32: never an overestimate and worst case error is 2*(2^7-1)*(2^23-2^7)+(2^7-1)^2 = 2130690049 < 2^31

The upper multiply gives bits 45:14 of a number that is close to our result. The precise result for these bits would be (x * y) >> 14 (without intermediate truncation); we’ve instead calculated (x >> 7) * (y >> 7), which is close but misses the contributions from the 7 LSBs on each side.

If we could just glue bits 45:32 from the high multiply onto bits 31:0 from the low multiply, we would have our full 46-bit result. Unfortunately it’s just an approximate result at this point (let’s call it approx[45:14]). Mark calculates that the error e is bounded by −2³¹ < e ≤ 0 (with the lower bound rounded down to a power of two). This bound eliminates all but two possibilities:

  • Result bits 45:31 are already correct.
  • Result bits 45:31 are wrong, but are corrected by adding 2³¹.

This correction doesn’t fix result bits 30:14, but that’s fine: we have those already from the low multiply. The correction logic is here:

 lsrs r2,#18      @ result [45..32] Q14
 bcc 1f
 cmp r0,#0
 bmi 1f
 adds r2,#1       @ fix error in r2
1:

In English, the above code says: “Increment approx[32] if approx[31] is set, AND exact[31] is clear. Discard approx[31:14].” This is equivalent to the carry-out from adding approx[31] ^ exact[31] into approx[31], which is exactly the necessary correction. The 14 LSBs of r2 now contain bits 45:32 of the exact result, so we have our full 46-bit product with two multiplies. The rest of the code shifts and concatenates these bits in a way that is convenient for rounding later, and partially compensates for the missing implicit 1.
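Putting the pieces together, here is a C model of the two-multiply 23 × 23 → 46-bit exact product. This is my paraphrase of the sequence above, not Mark's code: the implicit-1 compensation (the 2^46 and 2^23·(x + y) terms) and the rounding are left out.

```c
#include <stdint.h>

/* Two-multiply 23x23 -> 46-bit exact product. x and y are 23-bit
 * fractions with the implicit leading 1 already stripped. */
uint64_t mul23x23_twomul(uint32_t x, uint32_t y) {
    uint32_t lo = x * y;                   /* exact bits 31:0 of x*y */
    uint32_t approx = (x >> 7) * (y >> 7); /* bits 45:14 of an underestimate */
    uint32_t hi = approx >> 18;            /* its bits 45:32 */
    /* The estimate is short by less than 2^31, so the high bits need at
     * most a single +1: a carry out of bit 31 happened iff the estimate
     * has result bit 31 set (register bit 17) but the exact product has
     * it clear. */
    if (((approx >> 17) & 1u) && !(lo >> 31))
        hi += 1;
    return ((uint64_t)hi << 32) | lo;
}
```

Comparing against a plain 64-bit multiply over the 23-bit input range confirms the correction is exactly the carry logic described above.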

When I asked Mark about it, my question went like this:

“So the error bound you calculated means that, if r[31] differs between the high and low mul, then the correct fix is to increment the high mul at r[31]. Your code applies an increment to r[32] only in the case where r[31] is set in the high mul and clear in the low mul, which would yield a carry into r[32]. Your code ignores the opposite case of r[31] being clear in the high mul and set in the low mul, because this yields no carry into r[32]. Am I on the right track?”

He proffered this explanation, which I think is elegant but not quite the right shape for my brain:

“Yes. The way I think of it is that the bottom 32 bits are guaranteed all correct. The worst case error in taking the top 16 bits of the product of the top halves is 2, so if you could arrange to overlap those two results by two bits instead then you are guaranteed at most one carry into the top part and you can fix everything up.”

It comes out quite different on RISC-V, but it turns out this trick is both usable and profitable.

wren.wtf
