Zmij: Faster floating point double-to-string conversion

Original link: https://vitaut.net/posts/2025/faster-dtoa/

A new floating-point-to-decimal conversion method, “zmij”, significantly outperforms existing algorithms. It incorporates lessons learned from earlier implementations such as Dragon4, Grisu, and Schubfach, and prioritizes removing unnecessary operations, following the principle that “no work is less work than some work.” Benchmarks show zmij is about 68% faster than Dragonbox (the previous leader among algorithms with correctness proofs), twice as fast as Schubfach, 3.5 times faster than `std::to_chars`, 6.8 times faster than Google’s double-conversion, and 59 times faster than `sprintf`. Converting a single double now takes roughly 10-20 ns on Apple M1 hardware. Key improvements include fewer candidates to select from, fewer integer multiplications, faster logarithm approximations, fewer conditional branches, and more efficient output handling; optimizations such as 32-bit logarithm approximations and branchless handling of irregular rounding intervals contribute to the speedup. The code is available on GitHub ([https://github.com/vitaut/zmij](https://github.com/vitaut/zmij)), with plans to integrate it into {fmt} and possibly into std::to_string via ISO C++ paper P2587. It currently supports only exponential format; fixed-format support is planned.

A new blog post details a significantly faster algorithm for converting double-precision floating-point numbers to strings. The author, “vitaut”, achieved speedups of up to six times, surpassing even the performance of Google’s Grisu library (created by Grisu’s original author, floitsch). The discussion focuses on improving “dtoa” (double-to-string) algorithms rather than “atod” (string-to-double) conversion; there are many dtoa options but few atod alternatives. Several commenters shared related projects, including the Teju Jaguá Rust crate, and noted the importance of optimizing the final string-formatting stage, not just the core numeric conversion. One commenter observed that the core algorithm had already been heavily optimized and that the remaining bottleneck lies in conditional branches. Overall, the post sparked discussion among developers of numeric and conversion libraries.

Original article

There comes a time in every software engineer’s life when they come up with a new binary-to-decimal floating-point conversion method. I guess my time has come. I just wrote one, mostly over a weekend: https://github.com/vitaut/zmij.

It incorporates lessons learned from implementing Dragon4, Grisu and Schubfach, along with a few new ideas from myself and others. The main guiding principle is Alexandrescu’s “no work is less work than some work,” so a number of improvements come from removing things from Schubfach (conditional branches, computations and even candidate numbers).

Performance

Here is how it performs on dtoa-benchmark:

  • ~68% faster than Dragonbox, the previous leader (at least among algorithms with correctness proofs)
  • ~2 times faster than my Schubfach implementation that closely follows the original paper
  • ~3.5 times faster than std::to_chars from libc++ (Ryu?)
  • ~6.8 times faster than Google’s double-conversion (Grisu3)
  • ~59 times (not percent!) faster than sprintf on macOS (Dragon4?)

Converting a single double takes about 10 to 20 ns on Apple M1.

What are the improvements?

Here is a list of improvements compared to Schubfach:

  • Selection from 1-3 candidates instead of 2-4
  • Fewer integer multiplications in the shorter case
  • Faster logarithm approximations
  • Faster division and modulo
  • Fewer conditional branches
  • More efficient significand and exponent output

Let’s take a look at some of them.

The first small improvement is having a single branch to quickly check for special cases: NaN, infinity, zero or subnormals. There are still additional checks within that path but the common case is more streamlined.
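zmij's exact check is in the repository; one common way to fold all four cases into a single comparison (shown here as a sketch of my own, not necessarily zmij's code) uses the biased exponent field, which is 0 for zero and subnormals and 0x7FF for infinity and NaN:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Returns true for NaN, infinity, zero and subnormals with one comparison:
// those cases have a biased exponent field of 0 (zero, subnormals) or 0x7FF
// (infinity, NaN), while every other double has a field in [1, 0x7FE].
// Unsigned wraparound maps 0 to 0xFFFFFFFF, so both extremes fail one test.
inline bool is_special(double value) {
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  uint32_t exp_field = uint32_t(bits >> 52) & 0x7FF;
  return exp_field - 1 >= 0x7FE;
}
```

The sign bit is masked out by `& 0x7FF`, so negative zero and negative infinity take the same path.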

Another improvement is using faster fixed-point logarithm approximations. Schubfach does the following:

// log10_2_sig = round(log10(2) * 2**log10_2_exp)
constexpr int64_t log10_2_sig = 661'971'961'083;
constexpr int log10_2_exp = 41;

// Computes floor(log10(pow(2, e))) for e <= 5456721.
auto floor_log10_pow2(int e) noexcept -> int {
  return e * log10_2_sig >> log10_2_exp;
}

which uses 64-bit multiplication:

floor_log10_pow2(int):
        movsxd  rcx, edi
        movabs  rax, 661971961083
        imul    rax, rcx
        sar     rax, 41
        ret

However, for the range of inputs (exponents) that actually occurs, we can use a 32-bit approximation:

// log10_2_sig = round(log10(2) * 2**log10_2_exp)
constexpr int log10_2_sig = 315'653;
constexpr int log10_2_exp = 20;

auto floor_log10_pow2(int e) noexcept -> int {
  return e * log10_2_sig >> log10_2_exp;
}

resulting in

floor_log10_pow2(int):
        imul    eax, edi, 315653
        sar     eax, 20
        ret

Dragonbox also uses 32-bit approximations with slightly different constants.
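As a quick sanity check (mine, not from the post), the 32-bit version can be compared against the 64-bit one over every binary exponent a double can produce (roughly -1074 to 1024):

```cpp
#include <cassert>
#include <cstdint>

// Schubfach's 64-bit approximation, documented correct for e <= 5456721.
inline int floor_log10_pow2_64(int e) {
  return int(e * int64_t(661'971'961'083) >> 41);
}

// The 32-bit approximation; a single 32-bit imul and a shift.
inline int floor_log10_pow2_32(int e) {
  return e * 315'653 >> 20;
}
```

Agreement over the double exponent range is a confidence test, not a proof; the error-bound argument for the constants is the real justification.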

Similarly, we can replace some integer divisions with integer multiplications. Compilers already know how to do this, but we can do better when we know that the range of inputs is small:

// Quotient/remainder pair (defined here so the snippet is self-contained).
struct divmod_result { uint32_t quot, rem; };

// Returns {value / 100, value % 100}, correct for values of up to 4 digits.
inline auto divmod100(uint32_t value) noexcept -> divmod_result {
  assert(value < 10'000);
  constexpr int exp = 19;  // 19 is faster or equal to 12 even for 3 digits.
  constexpr int sig = (1 << exp) / 100 + 1;
  uint32_t div = (value * sig) >> exp;  // value / 100
  return {div, value - div * 100};
}

Another optimization and simplification is branchless handling of irregular rounding intervals. I wrote about rounding intervals in my earlier blog post, but for the purposes of this post it is sufficient to know that a rounding interval for a floating-point number is an interval that contains all real numbers that round back to that number. Normally the intervals are symmetric, except when there is a jump in the exponent (the irregular case).

Most algorithms handle irregular intervals via a completely separate path or at least some branching. This is not terrible, because irregular cases are rare for random floating-point numbers. However, it is possible to handle it cheaply and branchlessly, avoiding extra complexity, which is what I did.
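The exact formulation zmij uses is in the repository; to convey the flavor, here is a small sketch of my own (names and details are illustrative, not zmij's code). Instead of branching, the "interval below is half as wide" condition is computed as a 0/1 flag and folded into the boundary arithmetic. Working in units of a quarter-ulp, the lower boundary sits 2 quarter-ulps below 4*sig in the regular case and only 1 in the irregular case:

```cpp
#include <cassert>
#include <cstdint>

// Branchless lower boundary in quarter-ulps:
// regular (irregular == 0): 4*sig - 2;  irregular (== 1): 4*sig - 1.
inline uint64_t lower_bound_q(uint64_t sig, uint64_t irregular) {
  return (sig << 2) - 2 + irregular;
}

// Branchless derivation of the flag for a double (hypothetical helper;
// zmij's condition may differ): the interval below is half as wide exactly
// when the stored significand field is zero and the value is not the
// smallest normal (exponent field > 1), i.e. at a power-of-two boundary.
inline uint64_t irregular_flag(uint64_t bits) {
  uint64_t sig_field = bits & ((uint64_t(1) << 52) - 1);
  uint64_t exp_field = (bits >> 52) & 0x7FF;
  return uint64_t(sig_field == 0) & uint64_t(exp_field > 1);
}
```

Both comparisons compile to flag-setting instructions, so the common and irregular cases share one straight-line path.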

A more interesting improvement comes from Cassio Neri’s talk “Fast Conversion From Floating Point Numbers”. In Schubfach, we look at four candidate numbers. The first two, of which at most one is in the rounding interval, correspond to a larger decimal exponent. The other two, of which at least one is in the rounding interval, correspond to the smaller exponent. Cassio’s insight is that in the first case we can directly construct a single candidate from the upper bound.

This improvement has a nice effect: it allows us to avoid scaling the value itself by a power of 10, because we only need the lower and upper bounds. This saves two 64-bit integer multiplications in the shorter case.

Unfortunately, this does not help in the longer case, but there are improvements to be made there as well. Classic Schubfach first checks whether only one candidate from the second set is in the rounding interval and returns early in that case. We can combine this check with the closeness check. This seems counterintuitive, because we do more work (sorry, Andrei), but it eliminates a poorly predicted conditional branch and also simplifies the code.

So we go from this:

uint64_t dec_sig_over = dec_sig_under + 1;
bool under_in = lower + bin_sig_lsb <= (dec_sig_under << 2);
bool over_in = (dec_sig_over << 2) + bin_sig_lsb <= upper;
if (under_in != over_in) {
  // Only one of dec_sig_under and dec_sig_over is in the rounding interval.
  return write(buffer, under_in ? dec_sig_under : dec_sig_over, dec_exp);
}

// Both dec_sig_under and dec_sig_over are in the interval - pick the closest.
int cmp = scaled_sig - ((dec_sig_under + dec_sig_over) << 1);
bool under_closer = cmp < 0 || (cmp == 0 && (dec_sig_under & 1) == 0);
return write(buffer, under_closer ? dec_sig_under : dec_sig_over, dec_exp);

to this:

// Pick the closest of dec_sig_under and dec_sig_over and check if it's in
// the rounding interval.
int64_t cmp = int64_t(scaled_sig - ((dec_sig_under + dec_sig_over) << 1));
bool under_closer = cmp < 0 || (cmp == 0 && (dec_sig_under & 1) == 0);
bool under_in = (dec_sig_under << 2) >= lower;
write(buffer, (under_closer & under_in) ? dec_sig_under : dec_sig_over,
      dec_exp);

There are also many improvements in significand and exponent output. The simplest one, which has been used for many years in {fmt} and which I learned from Alexandrescu’s talk “Three Optimization Tips for C++”, is using a lookup table to output pairs of decimal digits. This alone halves the number of integer multiplications and is particularly important here, because the significand is often 16–17 digits long.
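As a sketch of the digit-pair technique (my own minimal version, not {fmt}'s or zmij's actual code): a 200-byte table holds all pairs "00" through "99", and the significand is emitted backwards two digits per division:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Builds the table of all two-digit pairs "00".."99" at compile time.
inline constexpr auto make_digit_pairs() {
  std::array<char, 200> t{};
  for (int i = 0; i < 100; ++i) {
    t[i * 2] = char('0' + i / 10);
    t[i * 2 + 1] = char('0' + i % 10);
  }
  return t;
}
inline constexpr auto digit_pairs = make_digit_pairs();

// Writes value backwards, ending just before `end`, two digits at a time.
// Halves the number of divisions versus a digit-at-a-time loop; returns a
// pointer to the first digit.
inline char* format_u64(uint64_t value, char* end) {
  while (value >= 100) {
    end -= 2;
    std::memcpy(end, &digit_pairs[(value % 100) * 2], 2);
    value /= 100;
  }
  if (value >= 10) {
    end -= 2;
    std::memcpy(end, &digit_pairs[value * 2], 2);
  } else {
    *--end = char('0' + value);
  }
  return end;
}
```

With a 17-digit significand this is 8 divisions and 8 two-byte copies instead of 17 divisions and 17 stores.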

Another trick is branchless removal of trailing zeros using another small lookup table, which I believe comes from the excellent Drachennest project by Alexander Bolz. There are ideas for improving this further and potentially getting rid of the lookup table entirely.
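The Drachennest table trick itself is best read in its source; a related branch-reduced approach (used by Dragonbox, shown here as a flavor of the idea rather than zmij's code) tests divisibility by 10 with a modular inverse and a rotate instead of a division:

```cpp
#include <cassert>
#include <cstdint>

// 32-bit rotate right by one (std::rotr needs C++20, so spell it out).
inline uint32_t rotr1(uint32_t x) { return (x >> 1) | (x << 31); }

// Strips trailing decimal zeros from n (n must be nonzero), counting them.
// 0xCCCCCCCD is the modular inverse of 5 mod 2^32: when n is divisible by
// 10, rotr1(n * 0xCCCCCCCD) equals exactly n / 10; when it is not, the
// result provably exceeds UINT32_MAX / 10, so a single compare serves as
// both the divisibility test and the next quotient.
inline uint32_t remove_trailing_zeros(uint32_t n, int& removed) {
  assert(n != 0);
  removed = 0;
  for (;;) {
    uint32_t q = rotr1(n * 0xCCCCCCCDu);
    if (q > UINT32_MAX / 10) return n;
    n = q;
    ++removed;
  }
}
```

Each iteration costs one multiply, one rotate and one compare, with no actual division anywhere.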

Is this a new algorithm?

Does it deserve to be called a new algorithm, or is it just an optimization of Schubfach?

I am not sure, but at the very least it deserves a separate GitHub project =).

Where this can be used

This method, or at least some elements of it, will be used in {fmt}, and it is also a good candidate for JSON serialization in Thrift and elsewhere. If you have other applications that could benefit from faster floating-point formatting, feel free to check it out now, or wait until it is integrated into {fmt}.

Thanks to my ISO C++ paper P2587 “to_string or not to_string”, std::to_string will also be able to use this or a similar method. This will make this standard API both performant and actually useful.

Current limitations

Despite the name, the implementation is not fully polished yet. In particular, it currently supports only exponential, also known as scientific, format, although adding fixed format should be straightforward.

“Fun” fact

My former colleague David Gay wrote an early dtoa implementation back at Bell Labs, and it was widely used for many years.


Last modified on 2025-12-13
