（评论）

（评论）
(comments)

原始链接: https://news.ycombinator.com/item?id=40544716

在计算机游戏的早期，特别是在 20 世纪 90 年代末，由于处理能力有限，计算倒数平方根是计算密集型的。像《Quake 3》这样的游戏需要大量此类计算来求解复杂的光照方程和矢量变换。如今，CPU 可以在不涉及 GPU 的情况下高效地处理这些任务。尽管由于其卓越的架构而导致时钟速度较低，但现代 CPU 在单浮点运算方面的性能优于 GPU。在有关牛顿-拉夫森迭代法求平方根的帖子中发现了错误的计算。简化公式可以使用更少的计算步骤获得准确的结果。尽管早在 1985 年左右，Ardent Corporation（一家生产 CPU 的公司）的 Greg Walsh 就曾声称发明了一种快速反平方根算法，但尚不清楚这是否早于 1985 年晚些时候制定的正式 IEEE 754 浮点标准。声明以及 1985 年之前 Green Hills Software C 编译器中存在的类似快速平方根算法强调了优化数值算法以提高当时软件开发中的计算效率的重要性。

If your computer was built after 1999, it probably supports the SSE instruction set. It contains the _mm_rsqrt_ps instruction, which is faster and will give you four reciprocal square roots at once: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

That being said, the techniques discussed here are not totally irrelevant (yet). There still exists some hardware with fast instructions for float/int conversion, but lacking rsqrt, sqrt, pow, log instructions, which can all be approximated with this nice trick.

Amusingly, (to me at least) there's also an SSE instruction for non-reciprocal square roots but it's so much slower than reciprocal square root that calculating sqrt(x) as x * 1/sqrt(x) is faster assuming you can tolerate the somewhat reduced precision.

I wouldn't be surprised if _mm_rsqrt_ps is actually implemented using the same bit level trick.

Same as Carmack's, we did a single step of Newton's method and it was definitely good enough.

I dunno about Intel and AMD, but ARM and RISC-V use lookup tables for rsqrt. Unlike AMD and Intel, those tables are precisely defined in their respective specs.

I don't recall the coprocessor having either reciprocal or reciprocal square root? I didn't do much Intel until later in my career though, so I might be missing something though.

Both _mm_rcp_ps (rcpps) and _mm_rsqrt_ps (rsqrtps) are only good for about half the bits.

the vast, vast majority of computers do not support the sse instruction set or indeed any of the i386 and amd64 instruction set at all, and the fraction of computers that do support them (other than through emulation) is continually shrinking

gpu instruction sets, arm, risc-v, avr, pic, 8051, fpga... of course, in many cases, these do have a built-in approximate reciprocal square root operation, but probably implemented with this algorithm

gpus reign supreme for games and scientific, and none of them are x86; larrabee failed. and you'd be surprised at the 3-d kinematics and control systems that people do with microcontrollers

To nitpick the article a bit:

> It's important to note that this algorithm is very much of its time. Back when Quake 3 was released in 1999, computing an inverse square root was a slow, expensive process. The game had to compute hundreds or thousands of them per second in order to solve lighting equations, and other 3D vector calculations that rely on normalization. These days, on modern hardware, not only would a calculation like this not take place on the CPU, even if it did, it would be fast due to much more advanced dedicated floating point hardware.

Calculations like this definitely take place on the CPU all the time. It's a common misconception that games and other FLOP-heavy apps want to offload all floating-point operations to the GPU. In fact, it really only makes sense to offload large uniform workloads to the GPU. If you're doing one-off vector normalization--say, as part of the rotation matrix construction needed to make one object face another--then you're going to want to stay on the CPU, because the CPU is faster at that. In fact, the CPU would remain faster at single floating point operations even if you didn't take the GPU transfer time into account--the GPU typically runs at a slower clock rate and relies on parallelism to achieve its high FLOP count.

I wrote a implementation in MMIX:

          % Constants
  FISRCON GREG #5FE6EB50C7B537A9
  THREHAF GREG #3FF8000000000000
          % Save half of the original number
          OR $2,$0,0
          INCH $2,#FFF0
          % Bit level hacking
          SRU $1,$0,1
          SUBU $0,FISRCON,$1
          % First iteration
          FMUL $1,$2,$0
          FMUL $1,$1,$0
          FSUB $1,THREHAF,$1
          FMUL $0,$0,$1
          % Second iteration
          FMUL $1,$2,$0
          FMUL $1,$1,$0
          FSUB $1,THREHAF,$1
          FMUL $0,$0,$1

This implementation makes an assumption that the original number is greater than 2^-1021.

This is a good post explaining a lot of interesting concepts, with a section that has surprisingly bad algebra.

> The exact steps to go from the first form to this one are numerous to say the least, but I've included it in full for completeness.

The algebra included after that line has many unnecessary steps, as well as multiple cancelling sign errors. Most notably the second line to the third line does not distribute the negative sign correctly. Picking up after the second line, I would simply have:

y_n+1 = y_n + (1 - x * y_n^2) / y_n^2 * (y_n^3 / 2)

y_n+1 = y_n + (1 - x * y_n^2) * (y_n / 2)

y_n+1 = y_n + y_n / 2 - x * y_n^3/2

y_n+1 = 3/2 * y_n - x * y_n^3/2

y_n+1 = y_n (3/2 * y_n - x * y_n^2 / 2)

y_n+1 = y_n (1.5 * y_n - 0.5 * x * y_n * y_n)

This is fewer than 1/3 of the included steps (between the second line and the end), and has the advantage of being correct during the intermediate steps. I don't think any of my steps are anything other than self-evident to someone who understands algebra, but I'd be happy to entertain competing perspectives.

It's… probably not worth it.

I just wrote this years ago (go.mod said 1.12) for the fun of it, thought I had it in a Gist/GitHub, and uploaded it yesterday in response to this post.

One thing I remember trying was coding this up in ASM… which makes it worse, because it prevents inlining. But I learned the Go ASM syntax that way.

Time for nitpicks, sorry.

The formulas for the floats have typos. They should read (-1)^S, not -1^S (which always equals -1).

Interpreting the raw bit patterns isn't a piecewise linear approximation of the logarithm. The lines between the data points on the blue graph don't actually exist, it's not possible for a bit to be half set to 1. It's rather a discrete version of the logarithm: the only data points that exist - where the red and blue lines meet - are literally equal to the (scaled, shifted) logarithm.

Other than that, nice post!

It's not a continuous piecewise linear approximation of the logarithm, it's a discrete piecewise linear approximation of the logarithm. You are correct that the blue lines aren't continuous, but incorrect about their interpretation - there are 256 individual points evenly spaced along the that x axis that make up the blue graph, not just a handful at the intersections. (A full graph would have 2^32 options in a piecewise linear pattern, but that isn't what the OP has graphed.)

Given that the article is talking about operations on 32 bit integers and IEEE-754 32-bit floats, I personally think it's fine to leave out "discrete" in the description.

I'm not sure I understand. Imagine a very small 6-bit float, with 1 bit sign, 2 bits exponent, and 3 bits mantissa. The interval [010000, 010111] contains the numbers 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 3.75.

But! These numbers' base two logarithms imply mantissae of

- .0000000 (equal to the float 000)

- .0010101 (not equal to the float 001)

- .0101001 (not equal to the float 010)

- .0111010 (not equal to the float 011)

- .1001010 (not equal to the float 100)

- .1011001 (not equal to the float 101)

- .1100111 (not equal to the float 110)

- .1110100 (not equal to the float 111)

because the floats in the [2,4) interval are linearly spaced, whereas the corresponding logarithms are not. In other words, the floats are a piecewise linear approximation of the logarithm – just as the article says.

You're right about the first part, of course. But don't you just need to truncate?

In any case, my point was more that it's not "piecewise linear". Piecewise linear means that the map is defined on some interval; it's not, it's defined on a discrete set of values. To take your example, the map isn't defined on e.g., 2.3. You can decide to interpolate linearly between the value at 2.25 and the value at 2.5, but that's a decision you make which isn't reflected in the code.

Or said differently: do you consider that any map defined on a discrete subset of R is really a piecewise linear map?

If your point is that it's more of an arithmetic sequence than a line segment, thanks to discretisation, then I agree but I don't have a good word for it other than "discrete piecewise linear".

I mean it's not continuous, but it's still on an interval scale.

Totally not-related to the topic, but I started reading that JAVAhurt PDF and find the typesetting god-awful. Just me? Feels like they imported some TeX package to space out words on a line far, far too far... and inconsistently. Like it was OCR'd from something else and put in extra spaces. Even the monospace sections have weird extra spacing.

Anyway, enough ranting for the day, but I found this really hard to focus on to read - almost felt like a science-kook type manifesto, though I know it is not.

Here’s something new that isn’t mentioned and might be worth adding to the repo. (Edit I’m wrong, it is there, I just didn’t recognize it.) The magic number in the famous code snippet is not the optimal constant. You can do maybe 0.5% better relative error using a different constant. Maybe at the time it was infeasible to search for the absolutely optimal number, but now it’s relatively easy. I also went down this rabbit hole at some point, so I have a Jupyter notebook for finding the optimal magic number for this (1/x^2) and also for (1/x). Anyone wanna know what the optimal magic number is?

Good point, I hadn’t read Lomont’s paper and should have. I read the section in the Wikipedia article talking about it, and did try the constant that it suggests, however it depends on doing extra Newton iterations and I looked at the relative error of the initial guess without Newton. I can see in the paper he found something within 1 bit of what I found. I’m not certain mine’s better, but my python script claims it is.

For no Newton iterations, I thought I found 0x5f37641f had the lowest relative error, and I measure it at 3.4211. Of course I’m not super certain, Lomont’s effort is way more complete than mine. Lomont’s paper mentions 0x5f37642f with a relative error of 3.42128 and Wikipedia and the paper both talk about 0x5f375a86 being best when using Newton iterations.

I was building 3D engines a few years before the days of Quake, and having recently watched videos about optimizing the trig code in Super Mario 64, and these other faster algorithms, it makes me wonder how much more can be squeezed out of old hardware.

I was optimizing my assembler to the nth degree, but optimizing the algorithm is always going to be the real winner.

Something interesting is I've seen this mythologized as evidenced of the "cracked engineer" theory, wherein random engineers will accidentally stumble upon incredibly complex discoveries in the course of their day to day work and expect no fanfare for this, besting the scientists and researchers who take the traditional route but aren't spurred by necessity. People, like xkcd, purported this was either Carmack or some random employee who figured it out in a few hours when stuck on a problem then waltzed onto the next one. In truth, as noted in the document, it has a history that goes back decades in academia, this was simply the most notable time it was implemented.

I think it speaks to an issue I see in the software engineering world where people assume that collaboration is for low-IQ people and all great innovation comes from some super-genius working on their own for long enough, and isn't required to share how they arrived at their findings. I'm sure the mythology of this algorithm is propelled somewhat by the enigmatic character of writing "what the fuck" next to adding the constant, implying a mystical element to its utility that was arrived at without needing to clarify it in some long boring research paper.

But that is what happened [1]. The article, which is probably the most definitive investigation to date, has Greg Walsh claim they invented the fast inverse square root while working at Ardent, which Wikipedia claims was founded in 1985 [2], to meet product performance benchmarks.

The Green Hills Software C compiler had a fast square root algorithm of similar form:

  (x >> 1) + {magic constant}

dating between 1983-1985 if I recall correctly. Also implemented to maximize floating point performance benchmarks, and again not drawn from academia. If my recollection is correct, that is one of the earliest known examples of the general technique and predates even the official IEEE 754 floating point specification which was not formally ratified until 1985 (but the standard was in development since 1977 and already de facto adopted by the time it was formally ratified).

[1] https://www.beyond3d.com/content/articles/15/

[2] https://en.wikipedia.org/wiki/Stardent_Inc.#Ardent_Computer_...

> where people assume that collaboration is for low-IQ people

I've also seen a phenomenon where developers believe their code is so valuable and amazing that they don't share it; it's really ordinary code but useful to novices nonetheless. Some communities I participated in when I was younger (e.g. PS3 jailbreak scene) were very protective of their code for no reason. Executables were obfuscated, nothing was open source, and developers were very hostile or intentionally trolled you when asking questions about their software.

I've met some people like this before, and I think this phenomenon has a lot to do with having one's identity closely bound to the code. Any criticism of the code is interpreted as a personal attack. Conversely, the attitude of "I'm so awesome therefore my code must be super important and really important to protect".

Yeah I've seen this a lot too.

I've also seen the reverse - perhaps more. People need advice but don't want to share their code because they are embarrassed by it. They wrote it thinking "no one will ever see this".

By contrast I sell code, so I know it'll be seen by lots of people, so I tend to spend time making sure style is consistent, things are well named, and so on. But equally, to me, it's just code. Feel free to comment on it - there's always room for improvement.

I think the xkcd comic was more about academia vs product engineering in for profit entities. If anything, it is the collaborative effort of many people working on a product with deadlines that ends up with some of them coming up with really interesting solutions just due to the number of people looking. The lack of sharing between companies does increase the incidence of the outsider effect, at the massive cost of reinventing basic knowledge over and over. The only reason breakthroughs still happen despite that is just the large number of people trying random stuff that probably won't work motivated by unrealistic deadlines. They don't know it probably won't work though...

The point seems more to me that in engineering, breakthroughs worthy of academic repute aren't given much fanfare, though probably a bit of an exaggeration. It's hard to tell to what extent his comics reflect actual sentiments or rather an artifice of some abstraction taken to an absurd level.

Note that you’re actually using 3 distinct magic numbers - not only the single 0x5f3759df, but also 0.5 and 1.5;

So for further accuracy we can instead have:

conv.i = 0x5F1FFFF9 - ( conv.i >> 1 ); conv.f = 0.703952253f ( 2.38924456f - x * conv.f * conv.f ); return conv.f;

[from Wikipedia]

Noob here. Please ELI 5 what this algo used in quake 3 ?.

> Solve lighting equations

Is it the light or laser effect when the gun is fired ?

When you render a pixel on screen (of a diffuse object), to determine how brightly it's lit, you need to compute the angle between the normal at the point on the surface and the light source. Computing this angle requires normalizing one or more vectors, which is done by dividing by the magnitude, or the square root of the sum of the squares of the components.

So the lights in this case are any lights in the scene, typically represented as a point of origin and a brightness.

The origin is really unclear. It's not certain which id employee added it to Quake 3 and where they got it from. John claims it wasn't him. Gary Tarolli, one of the founders of 3dfx, claims to remember tweaking the value of the hex constant to the value known today. But he says he didn't come up with the algorithm either. Maybe somebody else at 3dfx did, maybe he got it from somewhere else.

And at 3dfx he probably learned it from some ex-SGI guy, and that guy learned it doing his PhD at Stanford from someone who worked for ed Ed Catmull and so on. The lore goes deep with stuff like this. There's rarely just one author but many who improve the formula over time.

there's an unrecognized genius out there that realized you can just apply a bit flip to get that out there. That John Carmack hasn't claimed it as his invention when he could have, speaks volumes about his character.

Except he couldn't have even if he wanted to. There's always the risk some former 3dfx or SGI graybeard comes out with some paper binder with the original implementation, and calls you out on your bullshit if you do. When you're such a famous public figure no way you ever risk lying in public. Even if he were to getaway with it there was nothing for him to gain: he already has all the fame and money from honest work, there's no point risking it all to claim some piece of code.

He posts c code and says that it computes the inverse square root, but the code he posted is undefined c, so while it could compute the inverse square root of it's input it could also do a long list of other stuff.

One of the many reasons you should stay away from real world languages when talking about algorithms unless you are an expert in the language.

The code he posted is straight from the Quake 3 source release, a program which has certainly been installed and used on hundreds on millions of real computers. UB pedantry makes no difference to the actual behavior of the program as demonstrated.

It’s a historical artifact being presented in its original form. There’s no reason this code can’t be updated with a more modern bitwise int-float conversion. Do that if you want to use it.

Whether it is bad practice or not, people used to do this all the time, and so the compilers all tend to support type conversion via pointer and type punning via union. A lot of critical code in the real world would break if it suddenly didn’t work.

C++ bit-cast is brand new (C++20) and until now the suggestion in C++ was to use memcpy, and hope and pray it gets elided. That might be correct but boy is it ugly and gross.

If we're talking real-world languages, the code is perfectly well-defined. It's undefined behaviour in Abstract C, the C of the Specification, which to my knowledge nobody has ever implemented, but Quake 3 was not written in Abstract C. It was written in Visual C++ 2003, gcc 2.95, and some other specific C implementation.

The code snippet that the article claims is C:

  int32_t compute_magic(void) {
    double sigma = 0.0450465;
    double expression = 1.5 * pow(2.0, 23.0) * (127.0 - sigma);
    int32_t i = expression;
    return i;
  }

which, as far as I can tell, is perfectly well-defined Abstract C.

Specifically, I think the original code will always do what you expect as long as sizeof(long) == sizeof(float), and the alignment is the same. In reality no compiler is gonna do weird stuff unless your hardware target is weird.

compiler engineers have been disappointing our 'surely no compiler would ever be so perverse as to do x' expectations for 25 years now, and it doesn't seem that they're likely to stop soon

（评论） (comments)

（评论）
(comments)