亲爱的，我缩小了 {fmt}：将二进制大小提高到 14k 并放弃 C++ 运行时

亲爱的，我缩小了 {fmt}：将二进制大小提高到 14k 并放弃 C++ 运行时
Honey, I shrunk {fmt}: bringing binary size to 14k and ditching the C++ runtime

原始链接: https://vitaut.net/posts/2024/binary-size/

Fmt 库用 C++ 编写，是一个轻量级格式化库，具有紧凑的二进制占用空间。与 IOStreams、Boost.Format 和tinyformat 等替代方案相比，它每次函数调用生成的代码更小。它的小尺寸归因于跨多层的类型擦除的智能应用，有助于最大限度地减少模板膨胀。格式化参数在函数调用期间通过“类型擦除格式\_args:auto”传递。提供的两个主要格式化函数是“vformat(string\_view fmt, format\_args args)”和“template format(format\_string fmt, T&&... args)” ，将其任务委托给“vformat”。输出迭代和其他输出类型也通过专门的缓冲区 API 进行类型擦除。这些设计选择将模板的使用限制在最低水平，从而缩小了二进制文件的大小并缩短了构建时间。示例展示了 Fmt 生成的代码的简洁性和效率，同时提供完整的运行时类型安全性，在编译时或通过异常处理捕获格式字符串中的错误。与 printf 等传统格式不同，Fmt 可以防止未定义的行为、内存损坏以及由于运行时错误而导致的潜在崩溃。早在 2020 年，Fmt 就进行了优化，将其大小减小到 100kB 以下，并在后续更新中采用了特殊的 DragonBox 算法进行浮点格式化。本文讨论了测试和优化 Fmt 库的过程，探索进一步减小二进制大小的机会。使用特定配置进行了多项实验，包括禁用区域设置支持和数字分隔符、更改浮点格式以及启用“按使用量付费”方法。通过这些修改，作者能够实现大约 14kB 的最终二进制大小，比具有空主函数的典型 C 程序 (~6kB) 小得多。而且，优化后的Fmt库不再依赖于C++运行时。

在 C++ 中，“new”和“delete”操作均可识别对象的类型。这决定了是否必须执行析构函数。当不需要析构函数时，例如在“new int[]”中，将使用“operator new[]”分配内存。不保存任何元数据。生成的内存位置是数组的基地址。当需要析构函数时，例如在“new std::string[]”中，内存分配为“sizeof(size\_t)”加上对象数量乘以“sizeof(T)”字节。这里，项目计数保存在“size\_t”中，内存地址设置为计算值，并且每个元素都被初始化。删除则相反：它检索“size\_t”，通过执行“~T()”取消初始化元素，相应地调整数组地址，并将内存交给“operator delete[]”释放。其他注意事项包括检查 NULL 值、管理内存对齐等。请注意，“operator new[]”不会收到有关调用析构函数或存储元数据的必要性的信息；它只是接收请求的字节总数。使用“placement new[]”时要小心，因为分配的“N\*sizeof(T)”字节块可能会不够。尽管对 C++ 运行时的依赖源自“std::allocator”->“operator new”->“malloc”，但讨论似乎仅集中于删除“operator new”，而不影响与 C++ 运行时的整体连接。由于“operator new”继续实现“std::allocator”，因此保留了抛出“\std::bad\_alloc”的能力，因此出现了进一步的区别。作者建议定义自定义“operator new”和“operator delete”函数，而不是更改 C++ 运行时，或者在编译期间通过模板参数和/或“constexpr/consteval”动态连接到 C++ 运行时。另一种方法是，通过对 C++ 运行时进行硬编码或将其委托给 DCE（死代码消除）来减少代码大小。然而，本文中所做的优化仅适用于 Linux aarch64 ELF 二进制文件，可能与内存容量很小的微控制器没有什么关系。最后，重要的是要承认代码大小的显着减少 - 正如文章中所演示的

原文

The {fmt} formatting library is known for its small binary footprint, often producing code that is several times smaller per function call compared to alternatives like IOStreams, Boost Format, or, somewhat ironically, tinyformat. This is mainly achieved through careful application of type erasure on various levels, which effectively minimizes template bloat.

Formatting arguments are passed via type-erased format_args:

auto vformat(string_view fmt, format_args args) -> std::string;

template <typename... T>
auto format(format_string<T...> fmt, T&&... args) -> std::string {
  return vformat(fmt, fmt::make_format_args(args...));
}

As you can see, format delegates all its work to vformat, which is not a template.

Output iterators and other output types are also type-erased through a specially designed buffer API.

This approach confines template usage to a minimal top-level layer, leading to both a smaller binary size and faster build times.

For example, the following code:

// test.cc
#include <fmt/base.h>

int main() {
  fmt::print("The answer is {}.", 42);
}

compiles to just

.LC0:
        .string "The answer is {}."
main:
        sub     rsp, 24
        mov     eax, 1
        mov     edi, OFFSET FLAT:.LC0
        mov     esi, 17
        mov     rcx, rsp
        mov     rdx, rax
        mov     DWORD PTR [rsp], 42
        call    fmt::v11::vprint(fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>)
        xor     eax, eax
        add     rsp, 24
        ret

godbolt

It is much smaller than the equivalent IOStreams code and comparable to that of printf:

.LC0:
        .string "The answer is %d."
main:
        sub     rsp, 8
        mov     esi, 42
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        call    printf
        xor     eax, eax
        add     rsp, 8
        ret

godbolt

Unlike printf, {fmt} offers full runtime type safety. Errors in format strings can be caught at compile time, and even when the format string is determined at runtime, errors are managed through exceptions, preventing undefined behavior, memory corruption, and potential crashes. Additionally, {fmt} calls are generally more efficient, particularly when using positional arguments, which C varargs are not well-suited for.

Back in 2020, I dedicated some time to optimizing the library size, successfully reducing it to under 100kB (just ~57kB with -Os -flto). A lot has changed since then. Most notably, {fmt} now uses the exceptional Dragonbox algorithm for floating-point formatting, kindly contributed by its author, Junekey Jeon. Let’s explore how these changes have impacted the binary size and see if further reductions are possible.

But why, some say, the binary size? Why choose this as our goal?

There has been considerable interest in using {fmt} on memory-constrained devices, see e.g. #758 and #1226 for just two examples from the distant past. A particularly intriguing use case is retro computing, with people using {fmt} on systems like Amiga (#4054).

We’ll apply the same methodology as in previous work, examining the executable size of a program that uses {fmt}, as this is most relevant to end users. All tests will be conducted on an aarch64 Ubuntu 22.04 system with GCC 11.4.0.

First, let’s establish the baseline: what is the binary size for the latest version of {fmt} (11.0.2)?

$ git checkout 11.0.2
$ g++ -Os -flto -DNDEBUG -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 75K Aug 30 19:24 a.out

The resulting binary size is 75kB (stripped). The positive takeaway is that despite numerous developments over the past four years, the size has not significantly regressed.

Now, let’s explore potential optimizations. One of the first adjustments you might consider is disabling locale support. All the formatting in {fmt} is locale-independent by default (which breaks with the C++’s tradition of having wrong defaults), but it is still available as an opt in via the L format specifier. It can be disabled in a somewhat obscure way via the FMT_STATIC_THOUSANDS_SEPARATOR macro:

$ g++ -Os -flto -DNDEBUG "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 71K Aug 30 19:25 a.out

Disabling locale support reduces the binary size to 71kB.

Next, let’s examine the results using our trusty tool, Bloaty:

$ bloaty -d symbols a.out

    FILE SIZE        VM SIZE
 --------------  --------------
  43.8%  41.1Ki  43.6%  29.0Ki    [121 Others]
   6.4%  6.04Ki   8.1%  5.42Ki    fmt::v11::detail::do_write_float<>()
   5.9%  5.50Ki   7.5%  4.98Ki    fmt::v11::detail::write_int_noinline<>()
   5.7%  5.32Ki   5.8%  3.88Ki    fmt::v11::detail::write<>()
   5.4%  5.02Ki   7.2%  4.81Ki    fmt::v11::detail::parse_replacement_field<>()
   3.9%  3.69Ki   3.7%  2.49Ki    fmt::v11::detail::format_uint<>()
   3.2%  3.00Ki   0.0%       0    [section .symtab]
   2.7%  2.50Ki   0.0%       0    [section .strtab]
   2.3%  2.12Ki   2.9%  1.93Ki    fmt::v11::detail::dragonbox::to_decimal<>()
   2.0%  1.89Ki   2.4%  1.61Ki    fmt::v11::detail::write_int<>()
   2.0%  1.88Ki   0.0%       0    [ELF Section Headers]
   1.9%  1.79Ki   2.5%  1.66Ki    fmt::v11::detail::write_float<>()
   1.9%  1.78Ki   2.7%  1.78Ki    [section .dynstr]
   1.8%  1.72Ki   2.4%  1.62Ki    fmt::v11::detail::format_dragon()
   1.8%  1.68Ki   1.5%    1016    fmt::v11::detail::format_decimal<>()
   1.6%  1.52Ki   2.1%  1.41Ki    fmt::v11::detail::format_float<>()
   1.6%  1.49Ki   0.0%       0    [Unmapped]
   1.5%  1.45Ki   2.2%  1.45Ki    [section .dynsym]
   1.5%  1.45Ki   2.0%  1.31Ki    fmt::v11::detail::write_loc()
   1.5%  1.44Ki   2.2%  1.44Ki    [section .rodata]
   1.5%  1.40Ki   1.1%     764    fmt::v11::detail::do_write_float<>()::{lambda()#2}::operator()()
 100.0%  93.8Ki 100.0%  66.6Ki    TOTAL

Unsurprisingly, a significant portion of the binary size is dedicated to numeric formatting, particularly floating-point numbers. FP formatting also relies on sizable tables, which aren’t shown here. But what if floating-point support isn’t required? {fmt} provides a way to disable it, though the method is somewhat ad hoc and doesn’t extend to other types.

The core issue is that formatting functions need to be aware of all formattable types. Or do they? This is true for printf as defined by the C standard, but not necessarily for {fmt}. {fmt} supports an extension API that allows formatting arbitrary types without knowing the complete set of types in advance. While built-in and string types are handled specially for performance reasons, focusing on binary size might warrant a different approach. By removing this special handling and routing every type through the extension API, you can avoid paying for types you don’t use.

I did an experimental implementation of this idea. With the FMT_BUILTIN_TYPES macro set to 0, only int is handled specially, and all other types go through the general extension API. We still need to know about int for dynamic width and precision, for example

fmt::print("{:{}}\n", "hello", 10); // prints "hello     "

This gives you the “don’t pay for what you don’t use” model, though it comes with a slight increase in per-call binary size. If you do format floating-point numbers or other types, the relevant code will still be included in the build. While it’s possible to make the FP implementation smaller, we won’t delve into that here.

With FMT_BUILTIN_TYPES=0, the binary size in our example reduced to 31kB, representing a substantial improvement:

$ git checkout 377cf20
$ g++ -Os -flto -DNDEBUG \
      "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" -DFMT_BUILTIN_TYPES=0 \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 31K Aug 30 19:37 a.out

However, the updated Bloaty results reveal some lingering locale artifacts, such as digit_grouping:

$ bloaty -d fullsymbols a.out

    FILE SIZE        VM SIZE
 --------------  --------------
  41.8%  18.0Ki  39.7%  11.0Ki    [84 Others]
   6.4%  2.77Ki   0.0%       0    [section .symtab]
   5.3%  2.28Ki   0.0%       0    [section .strtab]
   4.6%  1.99Ki   6.9%  1.90Ki    fmt::v11::detail::format_handler<char>::on_format_specs(int, char const*, char const*)
   4.4%  1.88Ki   0.0%       0    [ELF Section Headers]
   4.1%  1.78Ki   5.8%  1.61Ki    fmt::v11::basic_appender<char> fmt::v11::detail::write_int_noinline<char, fmt::v11::basic_appender<char>, unsigned int>(fmt::v11::basic_appender<char>, fmt::v11::detail::write_int_arg<unsigned int>, fmt::v11::format_specs const&, fmt::v11::detail::locale_ref) (.constprop.0)
   3.7%  1.60Ki   5.8%  1.60Ki    [section .dynstr]
   3.5%  1.50Ki   4.8%  1.34Ki    void fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref) (.constprop.0)
   3.5%  1.49Ki   4.9%  1.35Ki    fmt::v11::basic_appender<char> fmt::v11::detail::write_int<fmt::v11::basic_appender<char>, unsigned __int128, char>(fmt::v11::basic_appender<char>, unsigned __int128, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping<char> const&)
   3.1%  1.31Ki   4.7%  1.31Ki    [section .dynsym]
   3.0%  1.29Ki   4.2%  1.15Ki    fmt::v11::basic_appender<char> fmt::v11::detail::write_int<fmt::v11::basic_appender<char>, unsigned long, char>(fmt::v11::basic_appender<char>, unsigned long, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping<char> const&)

After disabling these artifacts in commits e582d37 and b3ccc2d, and introducing a more user-friendly option to opt out via the FMT_USE_LOCALE macro, the binary size drops to 27kB:

$ git checkout b3ccc2d
$ g++ -Os -flto -DNDEBUG -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 27K Aug 30 19:38 a.out

The library includes several areas where size is traded off for speed. For example, consider this function used to compute the number of decimal digits:

auto do_count_digits(uint32_t n) -> int {
// An optimization by Kendall Willets from https://bit.ly/3uOIQrB.
// This increments the upper 32 bits (log10(T) - 1) when >= T is added.
#  define FMT_INC(T) (((sizeof(#T) - 1ull) << 32) - T)
  static constexpr uint64_t table[] = {
      FMT_INC(0),          FMT_INC(0),          FMT_INC(0),           // 8
      FMT_INC(10),         FMT_INC(10),         FMT_INC(10),          // 64
      FMT_INC(100),        FMT_INC(100),        FMT_INC(100),         // 512
      FMT_INC(1000),       FMT_INC(1000),       FMT_INC(1000),        // 4096
      FMT_INC(10000),      FMT_INC(10000),      FMT_INC(10000),       // 32k
      FMT_INC(100000),     FMT_INC(100000),     FMT_INC(100000),      // 256k
      FMT_INC(1000000),    FMT_INC(1000000),    FMT_INC(1000000),     // 2048k
      FMT_INC(10000000),   FMT_INC(10000000),   FMT_INC(10000000),    // 16M
      FMT_INC(100000000),  FMT_INC(100000000),  FMT_INC(100000000),   // 128M
      FMT_INC(1000000000), FMT_INC(1000000000), FMT_INC(1000000000),  // 1024M
      FMT_INC(1000000000), FMT_INC(1000000000)                        // 4B
  };
  auto inc = table[__builtin_clz(n | 1) ^ 31];
  return static_cast<int>((n + inc) >> 32);
}

The table used here is 256 bytes. There isn’t a one-size-fits-all solution, and changing it unconditionally might negatively impact other use cases. Fortunately, we have a fallback implementation of this function for scenarios where __builtin_clz is unavailable, such as with constexpr:

template <typename T> constexpr auto count_digits_fallback(T n) -> int {
  int count = 1;
  for (;;) {
    // Integer division is slow so do it for a group of four digits instead
    // of for every digit. The idea comes from the talk by Alexandrescu
    // "Three Optimization Tips for C++". See speed-test for a comparison.
    if (n < 10) return count;
    if (n < 100) return count + 1;
    if (n < 1000) return count + 2;
    if (n < 10000) return count + 3;
    n /= 10000u;
    count += 4;
  }
}

All that remains is to provide users with control over when to use the fallback implementation via (you guessed it) another configuration macro, FMT_OPTIMIZE_SIZE:

auto count_digits(uint32_t n) -> int {
#ifdef FMT_BUILTIN_CLZ
  if (!is_constant_evaluated() && !FMT_OPTIMIZE_SIZE) return do_count_digits(n);
#endif
  return count_digits_fallback(n);
}

With this and a few similar adjustments, we reduced the binary size to 23kB:

$ git checkout 8e3da9d
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 23K Aug 30 19:41 a.out

We could likely reduce the binary size even further with additional tweaks, but let’s address the elephant in the room which is, of course, the C++ standard library. What’s the point of optimizing the size when you end up getting a megabyte or two of the C++ runtime?

While {fmt} relies minimally on the standard library, is it possible to remove it completely as a dependency? One obvious problem is exceptions and those can be disabled via FMT_THROW, e.g. by defining it to abort. In general it is not recommended but it might be OK for some use cases especially considering that most errors are caught at compile time.

Let’s try it out and compile with -nodefaultlibs and exceptions disabled:

$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      '-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
      -nodefaultlibs -lc

/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::basic_memory_buffer<char, 500ul, std::allocator<char> >::grow(fmt::v11::detail::buffer<char>&, unsigned long)':
<artificial>:(.text+0xaa8): undefined reference to `std::__throw_bad_alloc()'
/usr/bin/ld: <artificial>:(.text+0xab8): undefined reference to `operator new(unsigned long)'
/usr/bin/ld: <artificial>:(.text+0xaf8): undefined reference to `operator delete(void*, unsigned long)'
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::vprint_buffered(_IO_FILE*, fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>) [clone .constprop.0]':
<artificial>:(.text+0x18c4): undefined reference to `operator delete(void*, unsigned long)'
collect2: error: ld returned 1 exit status

Amazingly, this approach mostly works. The only remaining dependency on the C++ runtime comes from fmt::basic_memory_buffer, which is a small stack-allocated buffer that can grow into dynamic memory if necessary.

fmt::print can write directly into the FILE buffer and generally doesn’t require dynamic allocation. So we could remove the dependency on fmt::basic_memory_buffer from fmt::print. However, since it may be used elsewhere, a better solution is to replace the default allocator with one that uses malloc and free instead of new and delete.

template <typename T> struct allocator {
  using value_type = T;

  T* allocate(size_t n) {
    FMT_ASSERT(n <= max_value<size_t>() / sizeof(T), "");
    T* p = static_cast<T*>(malloc(n * sizeof(T)));
    if (!p) FMT_THROW(std::bad_alloc());
    return p;
  }

  void deallocate(T* p, size_t) { free(p); }
};

This reduces binary size to just 14kB:

$ git checkout c0fab5e
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      '-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
      -nodefaultlibs -lc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 14K Aug 30 19:06 a.out

Considering that a C program with an empty main function is 6kB on this system, {fmt} now adds less than 10kB to the binary.

We can also easily verify that it no longer depends on the C++ runtime:

$ ldd a.out
        linux-vdso.so.1 (0x0000ffffb0738000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb0530000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffffb06ff000)

Hope you found this interesting and happy embedded formatting!

Last modified on 2024-08-30

亲爱的，我缩小了 {fmt}：将二进制大小提高到 14k 并放弃 C++ 运行时 Honey, I shrunk {fmt}: bringing binary size to 14k and ditching the C++ runtime

亲爱的，我缩小了 {fmt}：将二进制大小提高到 14k 并放弃 C++ 运行时
Honey, I shrunk {fmt}: bringing binary size to 14k and ditching the C++ runtime