The {fmt} formatting library is known for its small binary footprint, often producing code that is several times smaller per function call compared to alternatives like IOStreams, Boost Format, or, somewhat ironically, tinyformat. This is mainly achieved through careful application of type erasure on various levels, which effectively minimizes template bloat.
Formatting arguments are passed via type-erased format_args
:
auto vformat(string_view fmt, format_args args) -> std::string;
template <typename... T>
auto format(format_string<T...> fmt, T&&... args) -> std::string {
return vformat(fmt, fmt::make_format_args(args...));
}
As you can see, format
delegates all its work to vformat
, which is not a
template.
Output iterators and other output types are also type-erased through a specially designed buffer API.
This approach confines template usage to a minimal top-level layer, leading to both a smaller binary size and faster build times.
For example, the following code:
// test.cc
#include <fmt/base.h>
int main() {
fmt::print("The answer is {}.", 42);
}
compiles to just
.LC0:
.string "The answer is {}."
main:
sub rsp, 24
mov eax, 1
mov edi, OFFSET FLAT:.LC0
mov esi, 17
mov rcx, rsp
mov rdx, rax
mov DWORD PTR [rsp], 42
call fmt::v11::vprint(fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>)
xor eax, eax
add rsp, 24
ret
It is much smaller than the equivalent IOStreams code and comparable to that
of printf
:
.LC0:
.string "The answer is %d."
main:
sub rsp, 8
mov esi, 42
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
Unlike printf
, {fmt} offers full runtime type safety. Errors in format strings
can be caught at compile time, and even when the format string is determined at
runtime, errors are managed through exceptions, preventing undefined behavior,
memory corruption, and potential crashes. Additionally, {fmt} calls are
generally more efficient, particularly when using positional arguments, which C
varargs are not well-suited for.
Back in 2020, I dedicated some time to optimizing the library size,
successfully reducing it to under 100kB (just ~57kB with -Os -flto
).
A lot has changed since then. Most notably, {fmt} now uses the exceptional
Dragonbox algorithm for floating-point formatting, kindly
contributed by its author, Junekey Jeon. Let’s explore how these changes have
impacted the binary size and see if further reductions are possible.
But why, some say, the binary size? Why choose this as our goal?
There has been considerable interest in using {fmt} on memory-constrained devices, see e.g. #758 and #1226 for just two examples from the distant past. A particularly intriguing use case is retro computing, with people using {fmt} on systems like Amiga (#4054).
We’ll apply the same methodology as in previous work, examining the executable size of a program that uses {fmt}, as this is most relevant to end users. All tests will be conducted on an aarch64 Ubuntu 22.04 system with GCC 11.4.0.
First, let’s establish the baseline: what is the binary size for the latest version of {fmt} (11.0.2)?
$ git checkout 11.0.2
$ g++ -Os -flto -DNDEBUG -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 75K Aug 30 19:24 a.out
The resulting binary size is 75kB (stripped). The positive takeaway is that despite numerous developments over the past four years, the size has not significantly regressed.
Now, let’s explore potential optimizations. One of the first adjustments you
might consider is disabling locale support. All the formatting in {fmt} is
locale-independent by default (which breaks with the C++’s tradition of having
wrong defaults), but it is still available as an opt in via the L
format
specifier. It can be disabled in a somewhat obscure way via the
FMT_STATIC_THOUSANDS_SEPARATOR
macro:
$ g++ -Os -flto -DNDEBUG "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" \
-I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 71K Aug 30 19:25 a.out
Disabling locale support reduces the binary size to 71kB.
Next, let’s examine the results using our trusty tool, Bloaty:
$ bloaty -d symbols a.out
FILE SIZE VM SIZE
-------------- --------------
43.8% 41.1Ki 43.6% 29.0Ki [121 Others]
6.4% 6.04Ki 8.1% 5.42Ki fmt::v11::detail::do_write_float<>()
5.9% 5.50Ki 7.5% 4.98Ki fmt::v11::detail::write_int_noinline<>()
5.7% 5.32Ki 5.8% 3.88Ki fmt::v11::detail::write<>()
5.4% 5.02Ki 7.2% 4.81Ki fmt::v11::detail::parse_replacement_field<>()
3.9% 3.69Ki 3.7% 2.49Ki fmt::v11::detail::format_uint<>()
3.2% 3.00Ki 0.0% 0 [section .symtab]
2.7% 2.50Ki 0.0% 0 [section .strtab]
2.3% 2.12Ki 2.9% 1.93Ki fmt::v11::detail::dragonbox::to_decimal<>()
2.0% 1.89Ki 2.4% 1.61Ki fmt::v11::detail::write_int<>()
2.0% 1.88Ki 0.0% 0 [ELF Section Headers]
1.9% 1.79Ki 2.5% 1.66Ki fmt::v11::detail::write_float<>()
1.9% 1.78Ki 2.7% 1.78Ki [section .dynstr]
1.8% 1.72Ki 2.4% 1.62Ki fmt::v11::detail::format_dragon()
1.8% 1.68Ki 1.5% 1016 fmt::v11::detail::format_decimal<>()
1.6% 1.52Ki 2.1% 1.41Ki fmt::v11::detail::format_float<>()
1.6% 1.49Ki 0.0% 0 [Unmapped]
1.5% 1.45Ki 2.2% 1.45Ki [section .dynsym]
1.5% 1.45Ki 2.0% 1.31Ki fmt::v11::detail::write_loc()
1.5% 1.44Ki 2.2% 1.44Ki [section .rodata]
1.5% 1.40Ki 1.1% 764 fmt::v11::detail::do_write_float<>()::{lambda()#2}::operator()()
100.0% 93.8Ki 100.0% 66.6Ki TOTAL
Unsurprisingly, a significant portion of the binary size is dedicated to numeric formatting, particularly floating-point numbers. FP formatting also relies on sizable tables, which aren’t shown here. But what if floating-point support isn’t required? {fmt} provides a way to disable it, though the method is somewhat ad hoc and doesn’t extend to other types.
The core issue is that formatting functions need to be aware of all formattable
types. Or do they? This is true for printf
as defined by the C standard, but
not necessarily for {fmt}. {fmt} supports an extension API that allows
formatting arbitrary types without knowing the complete set of types in advance.
While built-in and string types are handled specially for performance reasons,
focusing on binary size might warrant a different approach. By removing this
special handling and routing every type through the extension API, you can avoid
paying for types you don’t use.
I did an experimental implementation of this idea. With the
FMT_BUILTIN_TYPES
macro set to 0, only int
is handled specially, and all
other types go through the general extension API. We still need to know about
int
for dynamic width and precision, for example
fmt::print("{:{}}\n", "hello", 10); // prints "hello "
This gives you the “don’t pay for what you don’t use” model, though it comes with a slight increase in per-call binary size. If you do format floating-point numbers or other types, the relevant code will still be included in the build. While it’s possible to make the FP implementation smaller, we won’t delve into that here.
With FMT_BUILTIN_TYPES=0
, the binary size in our example reduced to 31kB,
representing a substantial improvement:
$ git checkout 377cf20
$ g++ -Os -flto -DNDEBUG \
"-DFMT_STATIC_THOUSANDS_SEPARATOR=','" -DFMT_BUILTIN_TYPES=0 \
-I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 31K Aug 30 19:37 a.out
However, the updated Bloaty results reveal some lingering locale artifacts,
such as digit_grouping
:
$ bloaty -d fullsymbols a.out
FILE SIZE VM SIZE
-------------- --------------
41.8% 18.0Ki 39.7% 11.0Ki [84 Others]
6.4% 2.77Ki 0.0% 0 [section .symtab]
5.3% 2.28Ki 0.0% 0 [section .strtab]
4.6% 1.99Ki 6.9% 1.90Ki fmt::v11::detail::format_handler<char>::on_format_specs(int, char const*, char const*)
4.4% 1.88Ki 0.0% 0 [ELF Section Headers]
4.1% 1.78Ki 5.8% 1.61Ki fmt::v11::basic_appender<char> fmt::v11::detail::write_int_noinline<char, fmt::v11::basic_appender<char>, unsigned int>(fmt::v11::basic_appender<char>, fmt::v11::detail::write_int_arg<unsigned int>, fmt::v11::format_specs const&, fmt::v11::detail::locale_ref) (.constprop.0)
3.7% 1.60Ki 5.8% 1.60Ki [section .dynstr]
3.5% 1.50Ki 4.8% 1.34Ki void fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref) (.constprop.0)
3.5% 1.49Ki 4.9% 1.35Ki fmt::v11::basic_appender<char> fmt::v11::detail::write_int<fmt::v11::basic_appender<char>, unsigned __int128, char>(fmt::v11::basic_appender<char>, unsigned __int128, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping<char> const&)
3.1% 1.31Ki 4.7% 1.31Ki [section .dynsym]
3.0% 1.29Ki 4.2% 1.15Ki fmt::v11::basic_appender<char> fmt::v11::detail::write_int<fmt::v11::basic_appender<char>, unsigned long, char>(fmt::v11::basic_appender<char>, unsigned long, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping<char> const&)
After disabling these artifacts in commits e582d37 and
b3ccc2d, and introducing a more user-friendly option to opt out via
the FMT_USE_LOCALE
macro, the binary size drops to 27kB:
$ git checkout b3ccc2d
$ g++ -Os -flto -DNDEBUG -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 \
-I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 27K Aug 30 19:38 a.out
The library includes several areas where size is traded off for speed. For example, consider this function used to compute the number of decimal digits:
auto do_count_digits(uint32_t n) -> int {
// An optimization by Kendall Willets from https://bit.ly/3uOIQrB.
// This increments the upper 32 bits (log10(T) - 1) when >= T is added.
# define FMT_INC(T) (((sizeof(#T) - 1ull) << 32) - T)
static constexpr uint64_t table[] = {
FMT_INC(0), FMT_INC(0), FMT_INC(0), // 8
FMT_INC(10), FMT_INC(10), FMT_INC(10), // 64
FMT_INC(100), FMT_INC(100), FMT_INC(100), // 512
FMT_INC(1000), FMT_INC(1000), FMT_INC(1000), // 4096
FMT_INC(10000), FMT_INC(10000), FMT_INC(10000), // 32k
FMT_INC(100000), FMT_INC(100000), FMT_INC(100000), // 256k
FMT_INC(1000000), FMT_INC(1000000), FMT_INC(1000000), // 2048k
FMT_INC(10000000), FMT_INC(10000000), FMT_INC(10000000), // 16M
FMT_INC(100000000), FMT_INC(100000000), FMT_INC(100000000), // 128M
FMT_INC(1000000000), FMT_INC(1000000000), FMT_INC(1000000000), // 1024M
FMT_INC(1000000000), FMT_INC(1000000000) // 4B
};
auto inc = table[__builtin_clz(n | 1) ^ 31];
return static_cast<int>((n + inc) >> 32);
}
The table used here is 256 bytes. There isn’t a one-size-fits-all solution,
and changing it unconditionally might negatively impact other use cases.
Fortunately, we have a fallback implementation of this function for scenarios
where __builtin_clz
is unavailable, such as with constexpr
:
template <typename T> constexpr auto count_digits_fallback(T n) -> int {
int count = 1;
for (;;) {
// Integer division is slow so do it for a group of four digits instead
// of for every digit. The idea comes from the talk by Alexandrescu
// "Three Optimization Tips for C++". See speed-test for a comparison.
if (n < 10) return count;
if (n < 100) return count + 1;
if (n < 1000) return count + 2;
if (n < 10000) return count + 3;
n /= 10000u;
count += 4;
}
}
All that remains is to provide users with control over when to use the fallback
implementation via (you guessed it) another configuration macro,
FMT_OPTIMIZE_SIZE
:
auto count_digits(uint32_t n) -> int {
#ifdef FMT_BUILTIN_CLZ
if (!is_constant_evaluated() && !FMT_OPTIMIZE_SIZE) return do_count_digits(n);
#endif
return count_digits_fallback(n);
}
With this and a few similar adjustments, we reduced the binary size to 23kB:
$ git checkout 8e3da9d
$ g++ -Os -flto -DNDEBUG -I include \
-DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 23K Aug 30 19:41 a.out
We could likely reduce the binary size even further with additional tweaks, but let’s address the elephant in the room which is, of course, the C++ standard library. What’s the point of optimizing the size when you end up getting a megabyte or two of the C++ runtime?
While {fmt} relies minimally on the standard library, is it possible to
remove it completely as a dependency? One obvious problem is exceptions and
those can be disabled via FMT_THROW
, e.g. by defining it to abort
.
In general it is not recommended but it might be OK for some use cases
especially considering that most errors are caught at compile time.
Let’s try it out and compile with -nodefaultlibs
and exceptions disabled:
$ g++ -Os -flto -DNDEBUG -I include \
-DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
'-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
-nodefaultlibs -lc
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::basic_memory_buffer<char, 500ul, std::allocator<char> >::grow(fmt::v11::detail::buffer<char>&, unsigned long)':
<artificial>:(.text+0xaa8): undefined reference to `std::__throw_bad_alloc()'
/usr/bin/ld: <artificial>:(.text+0xab8): undefined reference to `operator new(unsigned long)'
/usr/bin/ld: <artificial>:(.text+0xaf8): undefined reference to `operator delete(void*, unsigned long)'
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::vprint_buffered(_IO_FILE*, fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>) [clone .constprop.0]':
<artificial>:(.text+0x18c4): undefined reference to `operator delete(void*, unsigned long)'
collect2: error: ld returned 1 exit status
Amazingly, this approach mostly works. The only remaining dependency on the C++
runtime comes from fmt::basic_memory_buffer
, which is a small stack-allocated
buffer that can grow into dynamic memory if necessary.
fmt::print
can write directly into the FILE
buffer and generally
doesn’t require dynamic allocation. So we could remove the dependency on
fmt::basic_memory_buffer
from fmt::print
. However, since it may be used
elsewhere, a better solution is to replace the default allocator with one that
uses malloc
and free
instead of new
and delete
.
template <typename T> struct allocator {
using value_type = T;
T* allocate(size_t n) {
FMT_ASSERT(n <= max_value<size_t>() / sizeof(T), "");
T* p = static_cast<T*>(malloc(n * sizeof(T)));
if (!p) FMT_THROW(std::bad_alloc());
return p;
}
void deallocate(T* p, size_t) { free(p); }
};
This reduces binary size to just 14kB:
$ git checkout c0fab5e
$ g++ -Os -flto -DNDEBUG -I include \
-DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
'-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
-nodefaultlibs -lc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 14K Aug 30 19:06 a.out
Considering that a C program with an empty main
function is 6kB on this
system, {fmt} now adds less than 10kB to the binary.
We can also easily verify that it no longer depends on the C++ runtime:
$ ldd a.out
linux-vdso.so.1 (0x0000ffffb0738000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb0530000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb06ff000)
Hope you found this interesting and happy embedded formatting!
Last modified on 2024-08-30