Did any processor implement an integer square root instruction?

Original link: https://retrocomputing.stackexchange.com/questions/29787/did-any-processor-implement-an-integer-square-root-instruction

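The Quake trick was once a popular way to compute square roots, but native instructions mean it is no longer the most efficient method on modern processors. Those instructions clearly outperform the Quake trick and often take only a few clock cycles. The author of an influential paper on the Quake trick acknowledges its limits, in particular that it only ever paid off for floating-point results on certain hardware. By contrast, a table of non-uniformly spaced values, fast index lookup, and subsequent interpolation performs better in both software and hardware implementations. Techniques such as Halley's method, which converges faster than Newton–Raphson, can bring further gains. Modern approaches also include polynomial, Chebyshev series, and Padé approximations, along with other advanced techniques tailored to specific requirements and constraints. Even for integer computation, divide-and-conquer techniques offer significant advantages through table lookups and parallel processing. Ongoing research keeps refining these methods, leaving the older techniques largely obsolete.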

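Arm's Neon extension supports single-instruction, multiple-data (SIMD) computation, allowing large amounts of data to be processed efficiently in parallel. The discussion covers the intricacies of an instruction named FRSQRTE, including steps that halve the result and clamp it. These steps appear to be essential for working effectively with fixed-point numbers, ensuring that values stay within range and do not overflow. Although the instruction looks complicated, it follows standard practice for handling scaled integers. Its main purpose is to produce a reciprocal square root estimate, which is commonly used in digital signal processing (DSP) algorithms and computer graphics. For all its complexity, FRSQRTE is not unique, since reciprocal square root estimation is an important operation in its own right. Although size constraints make it unlikely to execute in a single clock cycle, it illustrates the power and sophistication of modern processors.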

Original text

The claim in another answer that the Quake trick is the most efficient method has not been true for a long time, and was only ever true for low-quality float results on specific hardware. On pretty much every modern chip the native instructions are much, much faster, often just a few clock cycles. (I'm the Chris Lomont who wrote an early, widely cited paper on the Quake trick, providing generalizations and an improvement that seems to have been copied everywhere, despite it being a terrible idea now.)

A much quicker method, one used in hardware (with many more tricks), is to store a table of (not equally spaced) values, use a quick method to pull two values, and interpolate linearly (or better). Then halve the base-2 exponent with a shift, with any odd excess becoming a multiply by the constant sqrt(2), and, if needed, apply one iteration of a method better than Newton–Raphson.
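
As a rough software sketch of that pipeline (not what any particular chip does): the table here is evenly spaced with 64 intervals, which is a simplification of the non-equal spacing mentioned above, and the final refinement is a plain Newton step used as the simplest stand-in. The names `table_sqrt`, `TABLE_BITS` and `TABLE` are made up for the example.

```python
import math

TABLE_BITS = 6                 # assumption: 64 intervals, chosen arbitrarily
N = 1 << TABLE_BITS
# sqrt sampled at N + 1 evenly spaced points covering the mantissa range [0.5, 1.0]
TABLE = [math.sqrt(0.5 + 0.5 * i / N) for i in range(N + 1)]
SQRT2 = math.sqrt(2.0)

def table_sqrt(x, refine=True):
    """Approximate sqrt(x) by table lookup, linear interpolation and exponent halving."""
    if x < 0.0:
        raise ValueError("negative input")
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= m < 1
    t = (m - 0.5) * 2 * N      # position in the table, 0 <= t < N
    i = int(t)
    frac = t - i
    # pull two neighbouring entries and interpolate linearly between them
    y = TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])
    # halve the exponent; an odd leftover becomes a multiply by sqrt(2)
    half, odd = divmod(e, 2)
    y = math.ldexp(y, half)
    if odd:
        y *= SQRT2
    if refine:
        # one Newton step, a stand-in for the "better than Newton" refinement
        y = 0.5 * (y + x / y)
    return y
```

Calling `table_sqrt(x, refine=False)` shows how far the table and interpolation get on their own; a larger table or better interpolation trades memory for fewer refinement steps.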

Methods like Halley's (and many others) converge more quickly than Newton–Raphson, and are often much faster overall, depending on how long the various operations take.
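
To make the convergence difference concrete, here is a small sketch comparing the two updates for f(y) = y^2 - x. The step formulas are the standard ones; the helper `iterations_to_converge`, its tolerance and the starting guess are assumptions picked for the illustration.

```python
def newton_step(y, x):
    """One Newton-Raphson update for sqrt(x): quadratic convergence."""
    return 0.5 * (y + x / y)

def halley_step(y, x):
    """One Halley update for sqrt(x): cubic convergence."""
    return y * (y * y + 3.0 * x) / (3.0 * y * y + x)

def iterations_to_converge(step, x, y0, tol=1e-12, limit=50):
    """Count updates until y*y is within a relative tolerance of x."""
    y = y0
    for n in range(1, limit + 1):
        y = step(y, x)
        if abs(y * y - x) <= tol * x:
            return n, y
    return limit, y

if __name__ == "__main__":
    for name, step in (("Newton", newton_step), ("Halley", halley_step)):
        n, y = iterations_to_converge(step, 2.0, 1.0)
        print(name, n, y)
```

Halley needs fewer updates from the same starting guess, but each update costs more multiplies, so which one wins depends on the relative cost of multiply and divide on the target.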

Approximations for square root on a fixed interval (since all the methods for computers are bounded) are often also faster: polynomial, Chebyshev, Padé and higher-order versions can all be done in software or hardware, depending on what tradeoffs you want.
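
One way to picture the fixed-interval idea is a Chebyshev interpolant of sqrt on [0.5, 1), the range a binary mantissa lands in. This is only a sketch: the interval, the 8 coefficients, and the helper names `cheb_coeffs` and `cheb_eval` are assumptions for illustration, and a production design would more likely use a minimax fit with coefficients baked in ahead of time.

```python
import math

def cheb_coeffs(f, a, b, n):
    """Coefficients c_0..c_{n-1} of the Chebyshev interpolant of f on [a, b]."""
    nodes = [math.cos(math.pi * (k + 0.5) / n) for k in range(n)]
    vals = [f(0.5 * (b - a) * t + 0.5 * (b + a)) for t in nodes]
    coeffs = []
    for j in range(n):
        s = sum(vals[k] * math.cos(math.pi * j * (k + 0.5) / n) for k in range(n))
        coeffs.append(2.0 * s / n)
    coeffs[0] *= 0.5           # fold the usual c_0 / 2 into the stored coefficient
    return coeffs

def cheb_eval(coeffs, a, b, x):
    """Evaluate the Chebyshev series at x using Clenshaw's recurrence."""
    t = (2.0 * x - a - b) / (b - a)      # map [a, b] onto [-1, 1]
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * t * b1 - b2 + c, b1
    return t * b1 - b2 + coeffs[0]

# Computed once up front; evaluation afterwards is only multiplies and adds.
SQRT_COEFFS = cheb_coeffs(math.sqrt, 0.5, 1.0, 8)

if __name__ == "__main__":
    x = 0.7
    print(cheb_eval(SQRT_COEFFS, 0.5, 1.0, x), math.sqrt(x))
```

More coefficients buy more accuracy at the cost of more multiply-adds, which is the kind of tradeoff mentioned above.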

If you only want integers, say up to 2^32, the same trick applies: do it in fixed point, and some not-too-hard analysis lets you bound the tables very quickly. Another simple method used in hardware for integers is divide and conquer: each group of, say, 8 bits maps to a table of 256 fixed-point values, all looked up instantly in parallel, then 3 multiplies (2 in parallel) give the 32-bit value (after a free truncate).
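
The fixed-point version of the table trick might look something like the sketch below for unsigned 32-bit inputs. This is not the parallel divide-and-conquer scheme just described: it uses a single 256-entry table for the initial estimate, two integer Newton steps, and a final correction so the result is exact. The name `isqrt32` and the 4 fractional bits in the table are assumptions, and `math.isqrt` is used only to build the table.

```python
import math

# 256-entry table: floor(16 * sqrt(i)), i.e. sqrt(i) with 4 fractional bits.
SQRT_TABLE = [math.isqrt(i << 8) for i in range(256)]

def isqrt32(n):
    """floor(sqrt(n)) for 0 <= n < 2**32, seeded from a small fixed-point table."""
    if not 0 <= n < (1 << 32):
        raise ValueError("expected an unsigned 32-bit value")
    if n == 0:
        return 0
    # Drop low bits in pairs so the leading 7 or 8 bits index the table.
    shift = max(0, n.bit_length() - 8)
    shift += shift & 1                   # keep the shift even
    top = n >> shift
    # sqrt(top * 2**shift) = sqrt(top) * 2**(shift / 2); then drop the 4 fraction bits
    y = (SQRT_TABLE[top] << (shift // 2)) >> 4
    # Two integer Newton steps take the ~7-bit estimate to within a unit or so.
    for _ in range(2):
        y = (y + n // y) >> 1
    # Final correction makes the result exact regardless of the estimate.
    while y * y > n:
        y -= 1
    while (y + 1) * (y + 1) <= n:
        y += 1
    return y
```

The correction loops are what guarantee exactness here; the analysis mentioned above is what would let a hardware design size the table so that such fix-ups are bounded or unnecessary.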

There's still plenty of research being done on speeding these up (e.g., https://inria.hal.science/hal-03424131), so any technique over 10 years old is most surely obsolete for any metric: speed, power consumption, die size, etc.
