We build a minimal u128 as two u64 limbs and implement arithmetic using carry, borrow, and multiply intrinsics that map directly to x64 instructions.
The generated code is on par with builtin __uint128_t for addition, subtraction, multiplication, and comparison.
This is unsigned-only, x64-focused, and intentionally narrow in scope.
The result is a solid foundation for exact, fixed-width arithmetic with a focus on good codegen and predictability, not abstraction.
Full code and compiler output: https://godbolt.org/z/K6dn3s91Y
Scope
We take the smallest reasonable definition of a 128-bit integer, two 64-bit words, and turn it into a usable arithmetic type whose generated code is indistinguishable from a builtin __uint128_t.
This post is explicitly not about dynamically-sized big integer arithmetic. It is about being explicit with range bounds and letting the compiler emit the exact instructions we want. The scope is deliberately limited: unsigned arithmetic, fixed width, modern x64, with Clang and GCC as the primary targets and notes for MSVC where it differs.
Why fixed-width big integers
In many domains, especially geometry and numerics, we do not need arbitrary precision. We need enough precision to be exact for known bounds, and we need the cost to be predictable.
Dynamic big integer libraries solve a different problem. They are flexible and general, but they pay for that generality in memory traffic, branches, and indirection. If your values fit into a fixed number of bits and you know that ahead of time, fixed-width arithmetic is usually the better trade. (In fact, our high-performance exact mesh booleans are completely built on this: Exact Arithmetic in Solidean)
A 128-bit integer is the gateway drug to fixed-width arithmetic. It is the smallest width that is no longer builtin, while still mapping cleanly to the underlying hardware. Once the carry and multiply patterns are explicit at 128 bits, extending them to 192 or 256 bits is straightforward. In production, we use 256-bit integers in our hot paths and go up to 564 bits for certain edge cases.
Representation
We represent a 128-bit unsigned integer as two 64-bit limbs.
You can literally think of this as writing the u128 as a 2-digit number in base \(2^{64}\).
Because it's unsigned, we don't need to think about two's complement.
#include <cstdint>
#include <immintrin.h> // _addcarry_u64, _subborrow_u64, _mulx_u64
// this is the type used in the intrinsics' signatures
// (and on Linux, uint64_t is unsigned long, not unsigned long long...)
using u64 = unsigned long long;
struct u128
{
    u64 low = 0;
    u64 high = 0;
};
Addition, with carry
We start off easy.
Addition is simply done using long addition on our base \(2^{64}\) digits.
The intrinsic _addcarry_u64 corresponds to the x64 instruction adc and is exactly what we need:
Given two u64 summands and an input carry (0 or 1), we get the u64 result (via a slightly cumbersome output parameter) and a new carry.
u128 operator+(u128 a, u128 b)
{
    u128 r;
    unsigned char c = _addcarry_u64(0, a.low, b.low, &r.low);
    (void)_addcarry_u64(c, a.high, b.high, &r.high);
    return r;
}
The generated assembly is exactly what you would write by hand.
operator+(u128, u128):
        mov rax, rdi
        add rax, rdx
        adc rsi, rcx
        mov rdx, rsi
        ret
The moves are just calling convention noise.
The core is an add (because the first addition has no input carry) followed by an adc.
This is identical to what the compiler emits for __uint128_t addition.
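For reference, this is the kind of builtin function the listings are compared against on Clang and GCC (a trivial sketch; __uint128_t is a compiler extension, and add_builtin is just an illustrative name, not part of the post's code):
__uint128_t add_builtin(__uint128_t a, __uint128_t b)
{
    return a + b;
}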
A small but important point is that intrinsics are preferable to inline assembly here. They act like specialized IR operations. The compiler understands their semantics, can schedule around them, and can still optimize aggressively. Inline assembly is more of a black box: it is much easier to get wrong, and it can inhibit optimizations.
Subtraction: same story, inverted
Subtraction mirrors addition almost perfectly.
Instead of a carry, we track a borrow.
On x64, this is sbb, subtract with borrow.
The corresponding intrinsic is _subborrow_u64.
u128 operator-(u128 a, u128 b)
{
    u128 r;
    unsigned char c = _subborrow_u64(0, a.low, b.low, &r.low);
    (void)_subborrow_u64(c, a.high, b.high, &r.high);
    return r;
}
And again, the assembly is exactly what we want.
operator-(u128, u128):
        mov rax, rdi
        sub rax, rdx
        sbb rsi, rcx
        mov rdx, rsi
        ret
At this point, addition and subtraction are basically solved.
There is no hidden cost and no abstraction penalty.
It's also easy to see how this scales to larger integer types with one extra instruction per u64 "digit".
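As a minimal sketch of that scaling, here is addition for a hypothetical u256 built from four u64 limbs (the name and limb layout are illustrative, not the production type mentioned earlier): one add followed by a chain of adc, one per extra digit.
struct u256
{
    u64 limbs[4] = {};
};

u256 operator+(u256 a, u256 b)
{
    // long addition in base 2^64: propagate the carry through all four limbs
    u256 r;
    unsigned char c = _addcarry_u64(0, a.limbs[0], b.limbs[0], &r.limbs[0]);
    c = _addcarry_u64(c, a.limbs[1], b.limbs[1], &r.limbs[1]);
    c = _addcarry_u64(c, a.limbs[2], b.limbs[2], &r.limbs[2]);
    (void)_addcarry_u64(c, a.limbs[3], b.limbs[3], &r.limbs[3]);
    return r;
}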
Multiplication: regrouping our u64 digits
Multiplication is where things get more interesting.
A 128-bit by 128-bit multiply produces a 256-bit result, but we only want the lower 128 bits for our result.
It's really the same story as u64 * u64, which in principle produces a 128-bit result, but you usually only use the lower u64.
Speaking of which, all modern 64-bit architectures give you access to fast u64 * u64 -> u128 instructions.
With BMI2, this is exposed as _mulx_u64.
On MSVC, the equivalent is _umul128.
This is our building block for large multiplication.
You can derive the code by writing the u128 * u128 product as 2-digit long multiplication, (u64, u64) * (u64, u64), and then looking carefully at which partial products sum into which digit.
That's how I do it on paper, but here we can also choose an algebraic route. Write the numbers as: $$ (a.\text{low} + 2^{64} \cdot a.\text{high}) \cdot (b.\text{low} + 2^{64} \cdot b.\text{high}) $$ Expanding this gives four terms. Of those, only three contribute to the low 128 bits:
- \(a_\text{low} \cdot b_\text{low}\) contributes both low and high parts
- \(a_\text{low} \cdot b_\text{high}\) contributes to bits 64..127
- \(a_\text{high} \cdot b_\text{low}\) contributes to bits 64..127
- \(a_\text{high} \cdot b_\text{high}\) contributes only to bits 128 and above and can be discarded
This works for larger integers as well, though summing up intermediate terms can produce carries for "higher digits".
u128 operator*(u128 a, u128 b)
{
    // we want the low 128 bits of:
    //   (a.low + 2^64 a.high) * (b.low + 2^64 b.high)
    //
    //   r.low  = lo64(a.low * b.low)
    //   r.high = hi64(a.low * b.low)
    //          + lo64(a.low * b.high)
    //          + lo64(a.high * b.low)   (mod 2^64)
    //
    // NOTE (MSVC): for multiply, you can use _umul128(a, b, &hi) instead of _mulx_u64.
    // Clang/GCC: _mulx_u64 is BMI2 and needs -mbmi2.
    u128 r;
    u64 p0_hi;
    r.low = _mulx_u64(a.low, b.low, &p0_hi);

    // cross terms: only the low 64 bits contribute to r.high
    u64 t1_hi;
    u64 t1_lo = _mulx_u64(a.low, b.high, &t1_hi);
    u64 t2_hi;
    u64 t2_lo = _mulx_u64(a.high, b.low, &t2_hi);

    // simply adding is sufficient: carries would land in bit 128 and are discarded
    r.high = p0_hi + t1_lo + t2_lo;
    return r;
}
Note that we do not need carry handling here. Any carry out of bit 127 would land in bit 128, which we are discarding anyway.
The compiler output reflects this reasoning.
operator*(u128, u128):
        mulx r8, rax, rdi
        imul rcx, rdi
        imul rdx, rsi
        add rdx, rcx
        add rdx, r8
        ret
The compiler chooses slightly different instructions and registers for the builtin __uint128_t multiplication, but the result is otherwise equivalent.
Equality
Now an easy one.
u128 equality is simple structural equality.
bool operator==(u128 a, u128 b)
{
    return a.low == b.low && a.high == b.high;
}
The generated assembly is worth a quick look.
operator==(u128, u128):
        xor rdi, rdx
        xor rsi, rcx
        or rsi, rdi
        sete al
        ret
Instead of branching (as you might expect from the short-circuiting of &&), the compiler XORs the corresponding limbs.
XOR produces zero if and only if the inputs are equal.
ORing the results combines the checks.
If the final value is zero, both limbs were equal.
This pattern continues for larger integers, though we might see branching due to short-circuiting at some point.
(We could of course just use the XOR approach in operator== but it is distinctly less readable.)
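For illustration, the branch-free form the compiler derives could be written out explicitly like this (a sketch; equal_xor is just an illustrative name, and the readable version above stays the one in use):
bool equal_xor(u128 a, u128 b)
{
    // xor is zero iff the limbs are equal; or-ing combines both checks
    return ((a.low ^ b.low) | (a.high ^ b.high)) == 0;
}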
Comparison: borrow beats branching
The straightforward way to compare two 128-bit integers is to compare the high parts first and then the low parts.
bool operator<(u128 a, u128 b)
{
    if (a.high != b.high) return a.high < b.high;
    return a.low < b.low;
}
This is correct, but the codegen is not great:
operator<(u128, u128):
        xor r8d, r8d
        cmp rdi, rdx
        setb r8b
        xor eax, eax
        cmp rsi, rcx
        setb al
        cmove eax, r8d
        ret
Works, but heavier than necessary. We can do better by leaning on the hardware borrow flag.
Unsigned comparison a < b is equivalent to checking whether a - b produces a borrow:
bool operator<(u128 a, u128 b)
{
    u64 dont_care;
    // compute borrow from (a.low - b.low). If a.low < b.low => borrow = 1.
    unsigned char borrow = _subborrow_u64(0, a.low, b.low, &dont_care);
    // now subtract highs with that borrow
    // final borrow tells us if a < b in 128-bit unsigned.
    borrow = _subborrow_u64(borrow, a.high, b.high, &dont_care);
    return borrow != 0;
}
The resulting assembly is minimal:
operator<(u128, u128):
        cmp rdi, rdx
        sbb rsi, rcx
        setb al
        ret
One compare, one subtract with borrow, and a flag check. This is exactly the kind of codegen we want in hot code.
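The remaining orderings can be derived from operator< without any new intrinsics. A minimal sketch (these helpers are not part of the original code):
bool operator>(u128 a, u128 b)  { return b < a; }
bool operator<=(u128 a, u128 b) { return !(b < a); }
bool operator>=(u128 a, u128 b) { return !(a < b); }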
A small use site
To make sure everything composes properly, let's build a slightly larger function.
u128 demo_u128(u128 a, u128 b)
{
    u128 x = a + b;
    u128 y = a * b;
    return x < y
        ? y - x
        : x - y;
}
All operators inline cleanly:
demo_u128(u128, u128):
        mov r8, rdi
        add r8, rdx
        mov r9, rsi
        adc r9, rcx
        mulx rax, r10, rdi
        imul rcx, rdi
        imul rdx, rsi
        add rdx, rcx
        add rdx, rax
        mov rax, r8
        sub rax, r10
        mov rcx, r9
        sbb rcx, rdx
        jae .LBB13_2
        sub r10, r8
        sbb rdx, r9
        mov rax, r10
        mov rcx, rdx
.LBB13_2:
        mov rdx, rcx
        ret
There is a branch for the ternary operator, while the builtin version uses conditional moves instead. Which is better depends on the data patterns and could go either way. I consider this basically as good as it gets.
Platform notes
The examples shown are for x64 with Clang or GCC.
On MSVC, _addcarry_u64 and _subborrow_u64 work the same way.
For multiplication, _umul128 replaces _mulx_u64.
On AArch64, the same approach applies using adds and adcs instructions for addition, subs and sbcs for subtraction, and mul + umulh for the low + high half of a 64-bit multiply.
The patterns carry over directly, even though the intrinsics differ slightly (and multiplication is split into two parts).
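As a sketch of the MSVC variant of the multiply from above (mul_msvc is an illustrative name; _umul128 comes from <intrin.h>, and the rest of the code is unchanged):
u128 mul_msvc(u128 a, u128 b)
{
    u128 r;
    u64 p0_hi;
    r.low = _umul128(a.low, b.low, &p0_hi);
    // cross terms: only the low 64 bits contribute to r.high
    u64 hi_ignored;
    u64 t1_lo = _umul128(a.low, b.high, &hi_ignored);
    u64 t2_lo = _umul128(a.high, b.low, &hi_ignored);
    r.high = p0_hi + t1_lo + t2_lo;
    return r;
}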
Outlook
This u128 is the easiest large integer you can write.
Our goal is the best possible performance, so we made sure the codegen is essentially identical to that of the builtin __uint128_t.
From here, the same patterns extend naturally to signed variants, widening multiplies such as u128 times u128 to u256, and chains of fixed-width integers like i192 or i256. More importantly, the same reasoning applies when you design predicate-specific arithmetic that only computes what is actually needed.
A lot of our performance really boils down to these types:
No BigInteger tax, no floating-point predicates (which need a few kB of stack space for the worst case).
Just a lot of straight-line integer code and some light static branching.
Addendum 2026-01-24
Some notes based on reader feedback.
On PowerPC: I don't know much about PowerPC or the availability of intrinsics, but the instructions you need are definitely there. For add/sub with carry, there's addc / adde for addition and subfc / subfe for subtraction with borrow. For wide multiplication, mulhdu gives you the high half of a 64-bit multiply (and mulhd for signed).
On GCC codegen with intrinsics: Someone pointed out that GCC doesn't always handle the intrinsic-based addition optimally. It sometimes moves the result through the stack before setting it in the final registers. This is related to writing directly to the result struct. If you write to "real" scalars first, the codegen is optimal again. A weird one.
On division: There is no neat codegen for division. Even the builtin delegates to a library call. The naive but practical approach is binary long division, which finishes in up to 128 steps. Either branchless with fixed runtime or with a loop that searches for the next set bit. Either way it's a bit of work. Our exact predicates are always formulated in a division-free way simply because division would be expensive.
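For illustration, here is a minimal sketch of that binary long division (restoring division, one bit per iteration, reusing operator- and operator< from above; divmod and divmod_result are illustrative names, and den is assumed to be non-zero):
struct divmod_result
{
    u128 quot;
    u128 rem;
};

divmod_result divmod(u128 num, u128 den)
{
    divmod_result r; // quot and rem start at zero
    for (int i = 127; i >= 0; --i)
    {
        // rem = (rem << 1) | bit i of num
        r.rem.high = (r.rem.high << 1) | (r.rem.low >> 63);
        r.rem.low = (r.rem.low << 1)
                  | ((i >= 64 ? num.high >> (i - 64) : num.low >> i) & 1);
        // if the shifted remainder reaches the divisor, subtract and set quotient bit i
        if (!(r.rem < den))
        {
            r.rem = r.rem - den;
            if (i >= 64)
                r.quot.high |= u64(1) << (i - 64);
            else
                r.quot.low |= u64(1) << i;
        }
    }
    return r;
}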
On _BitInt(N):
The upcoming _BitInt(N) type does work, but with caveats.
For N = 128, you get the normal codegen.
For N > 128, GCC calls into a function where the bit size is a runtime parameter.
So yes, they would work, but performance will be subpar for larger widths.
Clang generates properly inlined assembly.