（评论）

（评论）
(comments)

原始链接: https://news.ycombinator.com/item?id=40250519

一些加密货币在其工作证明机制中利用了素性测试。大约八年前，有效的素性测试实施由于能够有效地开采这些硬币而带来了可观的利润。作者曾经管理过riecoin的挖矿软件，很欣赏素数。蒙哥马利乘法是快速实用模幂实现的基础，但在本文中被忽略了。 Niall Emmart 曾在学术界工作，目前据信在 Nvidia，他创建了一个异常快速的 GPU 大整数库，名为 CGBN。在作者已知的批处理 modexp 实现中，其效率仍然是无与伦比的。通过 Python 的 pow(x, y, m) 函数可以实现家庭使用的费马或米勒-拉宾素性测试。或者，GNU 多精度算术库 (GMP) 提供了方便的 mpz_probab_prime() 函数。没有明确说明在加密货币中使用独特的素性测试背后的具体原因。无论是由于缺乏对密码学主要应用的了解还是有意的选择，这仍然是一个有趣的问题。对于较大的值（如 N = 2^1024），测试从 2 到 N 平方根的所有基数似乎是可行的。运行 1007600 个碱基的测试只需大约 65 分钟。虽然与实时流程相比更长，但确保原始真实性可能会被证明是有益的，而且不会显着影响生产力。

Related, there are a few cryptocurrencies that used things related to finding large primes as part of their proof of work functions. It turns out that ~8 years ago, a really fast primality test implementation could make you a lot of money. (For some period of time I was the author and maintainer of mining software for riecoin. Why, I have no idea, except that I like prime numbers.)

This article omits the number one optimization for fast primality testing: Montgomery multiplication

https://en.m.wikipedia.org/wiki/Montgomery_modular_multiplic...

It forms the basis of fast practical modular exponentiation implementations.

Niall Emmart, then in academia and now I believe at Nvidia, released a really, spectacularly fast GPU bigint library, CGBN: https://github.com/NVlabs/CGBN

It's still the fastest batch modexp implentation that I'm aware of. It's breathtaking, if I can gush in geek for a moment.

(Someday I should write up the story of how that helped me dominate production of a small cryptocurrency for about 5 years. Thanks, Niall - owe you a beer if we ever cross paths!)

Also worth noting that python includes an entirely fine modexp in the three-argument form of pow(x, y, m) --> compute x^y % m

With that, you can very easily implement a fermat or miller-rabin primality test if you want to roll your own, which is quite fun. If you don't, the gmp library provides a very nice mpz_probab_prime() function. Gmp is obviously faster, but it's hard to beat the fun of a two liner fermat test for playing with big primes.

Do you know why said cryptocurrency would use such a bespoke proof-of-work function? I.e., was it just someone ignorant who had only a vague idea that cryptography somehow uses prime numbers and didn’t know when or why, or was there a deeper reason?

There was a really popular trend among cryptocurrencies at some point of creating a new proof of work function as a way to differentiate your coin from the sea of Bitcoin clones. Coin creators would then make unsupported claims about why their new pow was better than Bitcoin (claims such as GPU, resistance, asic resistance, or some kind of social good were common). In the case of the prime number coins (primecoin, riecoin, nexus, probably some others), it was marketing that doing the pow would, in some hand wavy way help "science". Bullshit, mostly, but...

All pow functions are just about proving that you wasted work, so there's nothing really inherently wrong with using something prime number based. The problem is that you don't really want to have mathematically and implementation intricate proof-of-work functions, because then some jerk like me comes along and is able to create a much more optimized miner, keep it private, and make money at the cost of having the other miners feel like the game is unfair. New coins in particular depend on those miners acting as marketers to shill the coin, so if they are grumpy, you lose part of your marketing team.

> This is where things start to get interesting. At first, I found the concept of probabilistic primality tests strange and tried to look for deterministic algorithms that could handle huge numbers. I did find two - APR-CL and ECPP. Both of these are so mathematically complex that I could not make sense of their research papers at all, and there isn't much accessible information about them on the internet for someone like me who is bad at math.

> After taking a look at discussions online, OpenSSL's source code and recommendations by NIST, I realized that almost everyone including RSA uses probabilistic algorithms. The catch is that if implemented properly, these algorithms have an extremely low error rate which is negligible.

For a given maximum number range, it's trivial to make Miller-Rabin actually deterministic. You just choose bases that have been proven to together exclude all pseudoprimes in the given range.

(It doesn't even end up being a long list, Miller-Rabin kicks ass)

>For a given maximum number range, it's trivial to make Miller-Rabin actually deterministic. You just choose bases that have been proven to together exclude all pseudoprimes in the given range.

What are the bases for the range of 1024-bit numbers? I couldn't find an answer online.

Just testing all the bases from 2 through 2 log(N)^2 doesn't actually seem to be too bad for N around 2^1024. It would be a little under 1007600 bases.

I've got a Miller-Rabin implementation in Python 3. Testing 2^1024 + 643, which is the first prime greater than 2^1024 with 65 random bases takes about 0.25 seconds on my Mac Studio. This is with no attempts at optimization or at using multiple cores.

At that rate testing 1007600 bases would take under 65 minutes.

Certainly way longer than you'd want to wait around for something like key generation, but if you didn't actually need your prime for an hour and really wanted to be certain it was a prime its not outrageous.

EDIT: I modified my Miller-Rabin to be the deterministic version, and made it take two parameters: cores and worker. I modified the loop that runs through the bases from 2 to floor(2 log(N)^2) to only actually test when "cores == 1 or base % cores == worker". I made my test program take cores and worker on the command line.

Then I ran this shell script, where prime.py is a program that checks 2^1024 + 643 for primality with the aforementioned deterministic Rabin-Miller:

  ./prime.py 8 0 &
  ./prime.py 8 1 &
  ./prime.py 8 2 &
  ./prime.py 8 3 &
  ./prime.py 8 4 &
  ./prime.py 8 5 &
  ./prime.py 8 6 &
  ./prime.py 8 7 &
  wait

Each of those 8 copies is testing a different 1/8th of the bases that need to be tested. My computer has 8 performance cores and the hope was that MacOS would run each on a different core. That did seem to be the case.

It took 392 seconds for them all to complete, with each correctly reporting that 2^1024 + 643 has passed for all their bases.

> Just testing all the bases from 2 through 2 log(N)^2 doesn't actually seem to be too bad for N around 2^1024. It would be a little under 1007600 bases.

Unfortunately this limit depends on the currently unproven conjecture (a subset of the generalized Riemann hypothesis). You have to check all n bases to be unconditionally correct.

Going up to n is surely excessive. You need a version of the Riemann hypothesis to get any sort of logarithmic bound, but a sqrt(n) bound can be achieved using unconditional Polya-Vinogradov type inequalities. (A successful Miller-Rabin test corresponds to a nontrivial value of a Dirichlet character, and Polya-Vinogradov ensures that the least such value cannot exceed sqrt(n) * ln(n), unconditionally without the need for any Riemann hypothesis.)

A reference for this material is "Explicit bounds for primality testing and related problems" by Eric Bach, Math. Comp. 55 (191), July 1990, pp. 355-380.

Oh, woops. I thought the list went way higher, my bad.

I did read that I think if you do up to 2*ln(n)^2 you're always good, but that's not _that_ short of a list. Probably better off just keeping it probabalistic, negating my original post :(

One line of inline assembler makes the bignum school multiplication trivial: https://github.com/jcalvinowens/toy-rsa/blob/master/bfi.c#L4...

If I could go back in time and change one thing about the C language, I would add some notion of expanding multiplication. It's a shame Rust doesn't have it either. Hardware support is everywhere: hell, the Cortex M0 doesn't do division, but it has an expanding multiply!

This is from a very ugly little toy implementation of RSA I wrote a long time ago: https://github.com/jcalvinowens/toy-rsa

I found I could get away with the Fermat test, because the algorithm doesn't work if the primes aren't actually prime: the Fermat test is fast, and an encrypt/decrypt eliminates the extremely minuscule chance either prime is a fermat liar.

But I don't know if it can be proven there do not exist non-prime P/Q values which produce an RSA keypair which can successfully encrypt/decrypt messages. I'm sure this isn't kosher for a real implementation, but I've never found an answer.

Curiously, C actually has bignums. Now. In C23, they added a _BitInt(N) type (e.g., "_BitInt(1024)" for a 128-byte type).

The compiler support for that is limited, though. To let N be >128 in Clang, -fexperimental-max-bitint-width=N flag can be provided. If N>128 and _BitInt(N) is divided by something, the compiler will just crash, but +, -, * all work as expected.

> But I don't know if it can be proven there do not exist non-prime P/Q values which produce an RSA keypair which can successfully encrypt/decrypt messages. I'm sure this isn't kosher for a real implementation, but I've never found an answer.

If p and q are coprime Carmichael numbers then RSA will still successfully encrypt and decrypt messages, though this will be less secure as p*q will have smaller prime factors and thus be easier to factor.

The original Pretty Good Privacy (PGP) by Philip Zimmermann in 1994 used only a sieve that divided by all known 16-bit primes (this table was produced using the Sieve of Eratosthenes), followed by the Fermat test.

Even so, if you wrap it in a function and make sure to document what's happening, I would argue that a version without asm() is preferable. It gives the compiler more leeway for optimization and is easier to read for someone who isn't well-versed in GCC's weird asm syntax.

See the generated code for these two alternative implementations: https://godbolt.org/z/3vno6G46j

Yes I also noticed the bug in the constraints and came to the same conclusion that changing them yields the same code for both implementations. But I figured that the fact that a function with a single asm instruction has a bug kind of supports my point. :)

As for the codegen on ARM, I don't think your asm implementation is correct there either. As far as I can tell the UMULL instruction only cares about the least significant 32 bits in the source registers, regardless if you're on a 64-bit CPU: https://developer.arm.com/documentation/ddi0602/2024-03/Base...

On 64-bit hosts using the full widening multiplier requires that you use 128 bit integers-- which aren't portable. :( (in particular, MSVC doesn't have it last I checked). It's the obvious thing to do where they exist however.

> If I could go back in time and change one thing about the C language, I would add some notion of expanding multiplication.

Zig makes this relatively easy: it has a @mulWithOverflow intrinsic, which returns an overflow bit with the result, and it has integers up to (u|i)65535. So depending on what you're doing, you can either detect overflow and then upcast, or upcast first and then optionally truncate.

It also has saturating multiply as a separate operator *|, or wrapping with *%, for when those are the semantics you want. Otherwise overflow is safety-checked undefined behavior, which will panic in Debug and ReleaseSafe build modes.

At least on x86, there's no real point in detecting truncation before upcasting, since expanding multiplication yields both the high and low halves in a single instruction. At best, you'd be adding branches that the processor can insert just as well itself. (Unless we're talking about things like 128x128-bit multiplication, which would be a different story; it would likely depend on your expected input distribution.)

As for returning an overflow bit, Rust has had this since forever with its overflowing_OP() methods, and C23 has recently added an header with a bunch of ckd_OP() macros that return an overflow bit.

I want an explicit operation that gives the upper and lower half of the result separately. This is ugly, but like:

  hi, lo = val1 *** val2;

It isn't necessarily intrinsically supported on all platforms: int128 is not part of the original standard. Sometimes long long is the same width as long.

I think the concept generalizes to non-binary and non-twos-complement CPUs easily enough.

How long did this take you? Curious as I did an undergrad research project on multiplying large integers that basically took two semesters. I implemented Karatsuba, Toom-Cook, the complex FFT, a few NTTs, and Schonhage-Strassen. Primes are basically math magic. A Friendly Introduction to Number Theory by Silverman is an amazing math book for those interested.

FYI the link on your page reads 4025051 instead of 40250519.

Nice article! I've also rolled some of my own bigint code recently (for earlier versions of [0]), and I can recall how frustrating it was to translate high-level descriptions in math papers into actual operations.

I do have a small quibble, though:

> At this point, the BigInt is using base-(2^64-1) or base-18446744073709551615 and it only needs 16 "digits" to represent a number that uses 309 digits in base-10!

If you use the full range of a u64, then your number will be in base 2^64, with each word ranging from 0 to 2^64-1, in the same way that base-10 digits range from 0 to 9.

[0] https://github.com/LegionMammal978/bigfoot-sim

note that your last optimization of increasing the number by 2 if it fails rather than generating a new random number actually breaks the security slightly. Primes aren't evenly distributed, so doing this biases you towards primes that are directly after large prime gaps.

Yeah i read about this in my research. It is a tradeoff between execution speed vs randomness of primes, i choose to go with speed assuming that 16 threads all starting from a random number and competing to find the prime would add enough randomness. If someone preferred more randomness in place of speed it's an easy change to replace the +=2 with a rng() call.

IMO you really shouldn't be bottle-necked by random number generation speed. If the kernel calls are what's slowing you down, you could use a strong user space PRNG (e.g. aes based) that is seeded with urandom. The other approach that would be not perfect but a lot better would be to re-randomize the lower 64 or 128 bits each time. That would cut down your rng requirements by a factor of ~10 while keeping much better uniformity since your jumps would be random and much bigger than the gap between primes.

Another option is to do N sequential candidate tests for each fresh random candidate. Where N is something that gets you most of the speed advantage of doing all sequential tests. Maybe like N=16.

Nice write-up!

I don’t think the parallelism really helps with this; you’ll still never find the second of two twin primes for instance.

Are there any easy ways to mitigate this though? What about say adding 2^64 at each step instead of 2?

That's an interesting idea. I don't know enough to say if using a large offset would be enough to mitigate the downsides of +=2, i would have to read more about it. In actual use you can go directly to using random candidates for each test as a few extra seconds to generate 2 primes would be worth the benefits. In hindsight my simplistic rng() can also probably be better optimized (batch load random bits and cache) to make it faster as the slowdown primarily comes from repeated filesystem access.

I guess so long as your increment is constant there are primes you’re never going to find. If a prime happens to be 2^64 from another prime you’ll still never see it with my approach.

Probably the most reasonable thing to do would be to use a Mersenne twister or other PRNG to generate a stream of random bytes. Would be plenty fast and should hopefully have no relevant pattern. No reason you should need real randomness from the OS after the initial seed.

Nice article, and nicely written.

> My code from attempt #3 was effectively using base-255, with each byte acting as a single "digit".

I think the author means base-256, not base-255.

This sends me back.

My first-year college project, a few decades ago, was to implement 4096-bit RSA encryption, the idea of my project-mate and friend (and later valedectorian) who implemented the core math.

I recall how slow prime number generation was in our final implementation, taking ~20 minutes to generate on PA-RISC workstations. My friend, a math nerd, took it upon themselves to continue optimizing the code long after the project was over. I remember them reading papers on prime number detection, and bignum math implementations. For example, one huge improvement came when the code would detect if a number in a component multiplication was 0, then skip the multiplication and give a 0 result!

> The random number returned is OR-ed with 0b1000000000000001 to set its first and last bit to 1. The last bit set to 1 makes it an odd number and the first bit set to 1 ensures that it is a sufficiently large number which covers the entire range of bits I need.

I can understand setting the low bit to 1 since an even number will never be a prime (edit: obviously except 2). But why set the high bit to 1 as well? Admittedly I don't know much about prime numbers or crypto, but it seems to me like this is just giving up a bit of entropy unnecessarily. What am I missing here?

Another factor is if your high bit is always set, and you encode the prime with that bit, your prime always takes the same number of byte to encode.

Variable byte encoding can lead to problems, if you need to exchange the data between different software, unless the specifications are very clear, and well tested. (See problems with RSA based DHE if the server public key has leading zeros)

As the other comments have mentioned, by setting the first bit to one it looses a bit of entropy but ensures that the prime is large enough. Another thing to add is that in RSA two primes are multiplied together. If one of them is 1024 bits the other can be ~200 bits (if i remember correctly) and still reach the required number of entropy bits for the key. So, having both primes be 1024-bit adds a bit of wiggle room too.

You are giving up a bit of entropy, yeah, but you still have 1022, it's probably safer than wondering if a 1020 bit prime is fine even if they asked for a 1024 bit one. Eg we usually don't consider 00042 a 5-digit number.

Technically probably depends on exactly what you're using it for which choice is optimal, but I'd think the one in the article is the safer default.

For the purposes of key generation, however, wouldn't you want the full n bits of entropy? Otherwise the search space for a brute force factorization (haha right) is 2^(n-1) instead of 2^n, or half as many possibilities. The domain of the product is still [0..2^(2n)] so the resulting key is the desired 2^(2n) bits.

I guess another way to pose my question would be: is there an issue with sampling the entire 2^n space that makes us only take the highest 2^(n-1) subset of integers instead when selecting factors for a key?

Out of curiosity, if it is known that the nth bit is set, don't I also have the same risk but in (n-1) bits? Genuinely curious here.

Edit: Ah, nevermind, I see now why I don't have that issue. It's because I can't easily iterate the primes in that domain even though I can iterate the reduced number of bits. Thanks!

> My binary implementation was probably spending almost all of its time waiting to read or write numbers from RAM due to L1 cache misses. I did not know at the time how to actually test this, but I thought let's try a more memory efficient version anyway and see if it improves things.

The text never comes back to this, but the answer is that a handful of 1-2KB numbers fit into L1 just fine, and even if they didn't there's a megabyte or more of L2 that takes about 3ns to access.

> It goes without saying that probably none of this is actually cryptographically secure, but that was never the point anyway.

Hmm. This is just the prime generation, so it avoids most RSA pitfalls, and urandom had better be secure. So if this code works at all then there's not much that could go wrong. RSA has a few issues with weak primes to avoid but I don't know if they're likely enough to matter here.

In the section on Fermat's little theorem there is a small mistake in the first statement of the theorem. You've written: "If p is prime and a is any integer not divisible by p, then the number a^(p-1) is divisible by p" this should read "a^(p-1) - 1 is divisible by p". The second statement of the theorem in terms of modular arithmetic is correct though.

Really nice writeup. For the test, could Fermat's method be used to find candidates and then trial division to weed out the false positives? Given they are rare this should not cost much extra time as the trial division (with some precomputed divisiors) is rarely used.

That was an excellent article. Particularly enjoyed the iterative improvements throughout. It clearly shows the significant performance boost that can be achieved through good engineering and optimization.

One quick thing to add to this: /dev/urandom does not generate "true" random numbers. TRNGs generate 1 bit out per bit of entropy they collect from the environment, while /dev/urandom will not stop generating random bits when it runs out of entropy. That makes it a CSPRNG that is seeded by a TRNG.

For all practical purposes, a CSPRNG seeded by a TRNG is almost as good as a TRNG, but it isn't quite the same.

Linux used to recommend /dev/random which actually was a TRNG (although its entropy collection would sometimes overestimate how much entropy it got, particularly on servers), but it wasn't practical to use as your primary cryptographic RNG because it was very slow.

A HWRNG is not necessarily a TRNG. As you say, a TRNG has one bit of entropy per bit of output. There's no way to prove this property is even possible in this physical universe, since it requires perfect unpredictability. Urandom is a CSPRNG seeded by a HWRNG.

This is true. The Intel HWRNG has actually been thought to be suspect in this regard in the past, although I don't think there's actual data about that.

Urandom also takes entropy from things like mouse movements, inter-onset intervals of key presses, and (on servers) hard drive seek times, so it actually does take in some of its own entropy in addition to that provided by the CPU.

I'm glad you know it, but elsewhere on this site I was just talking to someone who proudly knows very little about programming.

The distinction here is actually very important to a few people (eg the people who run LavaRand at CloudFlare), although it doesn't matter to many.

I love that the code panics when trying to divide by zero. Reminds me I should catch stuff even if it's for my own sake, I generally opt to completely overreact.

It is custom html/css built with my own simple static site generator. The entropy styling is css animations delayed for each letter such that it matches up. Glad that you liked it!

Unrelated to crypto: Trying to select the word "entropy" (with the cool colour & wave effect) grinds Chrome to halt on Android.

If you try to generate a PGP key using GPG on a fresh Linux system, it may flatout refuse to do so claiming there is not sufficient entropy. Or at least it was like that some years ago.

I would recommend keeping one handle open to /dev/urandom rather than opening and closing a new one every single time you generate a new random number.

（评论） (comments)

（评论）
(comments)