Building Your Own Efficient uint128 in C++

Original link: https://solidean.com/blog/2026/building-your-own-u128/

The post implements a minimal 128-bit unsigned integer (`u128`) type as two 64-bit words, aiming for performance on par with the builtin `__uint128_t` type on x64. The focus is on *fixed-width* arithmetic, exact precision for known bounds, rather than arbitrary-precision big-integer libraries. The implementation relies on x64 intrinsics (`_addcarry_u64`, `_subborrow_u64`, `_mulx_u64`), which map directly to efficient CPU instructions (adc, sbb, mulx). Addition and subtraction are straightforward, using the carry and borrow flags respectively. Multiplication breaks the operation into u64 multiplies and sums the relevant partial products, discarding any overflow beyond the 128-bit limit. Comparison is optimized with subtract-with-borrow to avoid branches. The generated code is identical to the assembly emitted for the builtin 128-bit integer, prioritizing predictability and good code generation over abstraction. Although currently limited to unsigned arithmetic and x64 (with notes on MSVC and AArch64), the approach extends easily to larger fixed-width integers (e.g. 256-bit) and to signed variants. The post lays out a foundation for high-performance, exact arithmetic, which is essential in domains such as geometry and numerics where predictable cost matters.


Original Article
TL;DR:

We build a minimal u128 as two u64 limbs and implement arithmetic using carry, borrow, and multiply intrinsics that map directly to x64 instructions. The generated code is on par with builtin __uint128_t for addition, subtraction, multiplication, and comparison. This is unsigned-only, x64-focused, and intentionally narrow in scope. The result is a solid foundation for exact, fixed-width arithmetic with a focus on good codegen and predictability, not abstraction.

Full code and compiler output: https://godbolt.org/z/K6dn3s91Y

Scope

We take the smallest reasonable definition of a 128-bit integer, two 64-bit words, and turn it into a usable arithmetic type whose generated code is indistinguishable from a builtin __uint128_t.

This post is explicitly not about dynamically-sized big integer arithmetic. It is about being explicit with range bounds and letting the compiler emit the exact instructions we want. The scope is deliberately limited: unsigned arithmetic, fixed width, modern x64, with Clang and GCC as the primary targets and notes for MSVC where it differs.

Why fixed-width big integers

In many domains, especially geometry and numerics, we do not need arbitrary precision. We need enough precision to be exact for known bounds, and we need the cost to be predictable.

Dynamic big integer libraries solve a different problem. They are flexible and general, but they pay for that generality in memory traffic, branches, and indirection. If your values fit into a fixed number of bits and you know that ahead of time, fixed-width arithmetic is usually the better trade. (In fact, our high-performance exact mesh booleans are completely built on this: Exact Arithmetic in Solidean)

A 128-bit integer is the gateway drug to fixed-width arithmetic. It is the smallest width that is no longer builtin, while still mapping cleanly to the underlying hardware. Once the carry and multiply patterns are explicit at 128 bits, extending them to 192 or 256 bits is straightforward. In production, we use 256-bit integers in our hot paths and go up to 564 bits for certain edge cases.

Representation

We represent a 128-bit unsigned integer as two 64-bit limbs. You can literally think of this as writing the u128 as a 2-digit number in base \(2^{64}\). Because it's unsigned, we don't need to think about two's complement.

#include <cstdint>
#include <immintrin.h> // _addcarry_u64, _subborrow_u64, _mulx_u64

// this is the type used in the intrinsics signature
// (and uint64_t is unsigned long, not unsigned long long...)
using u64 = unsigned long long;

struct u128
{
    u64 low = 0;
    u64 high = 0;
};

Addition, with carry

We start off easy. Addition is simply done using long addition on our base \(2^{64}\) digits. The intrinsic _addcarry_u64 corresponds to the x64 instruction adc and is exactly what we need: Given two u64 summands and an input carry (0 or 1), we get the u64 result (via a slightly cumbersome output parameter) and a new carry.

u128 operator+(u128 a, u128 b) 
{
    u128 r;
    unsigned char c = _addcarry_u64(0, a.low,  b.low,  &r.low);
    (void)_addcarry_u64(c, a.high, b.high, &r.high);
    return r;
}

The generated assembly is exactly what you would write by hand.

operator+(u128, u128):
        mov     rax, rdi
        add     rax, rdx
        adc     rsi, rcx
        mov     rdx, rsi
        ret

The moves are just calling convention noise. The core is an add (because the first addition has no input carry) followed by an adc. This is identical to what the compiler emits for __uint128_t addition.

A small but important point is that intrinsics are preferable to inline assembly here. They act like specialized IR operations. The compiler understands their semantics, can schedule around them, and can still optimize aggressively. Inline assembly is more of a black box and much easier to get wrong or inhibit optimizations.

Subtraction: same story, inverted

Subtraction mirrors addition almost perfectly. Instead of a carry, we track a borrow. On x64, this is sbb, subtract with borrow. The corresponding intrinsic is _subborrow_u64.

u128 operator-(u128 a, u128 b) 
{
    u128 r;
    unsigned char c = _subborrow_u64(0, a.low,  b.low,  &r.low);
    (void)_subborrow_u64(c, a.high, b.high, &r.high);
    return r;
}

And again, the assembly is exactly what we want.

operator-(u128, u128):
        mov     rax, rdi
        sub     rax, rdx
        sbb     rsi, rcx
        mov     rdx, rsi
        ret

At this point, addition and subtraction are basically solved. There is no hidden cost and no abstraction penalty. It's also easy to see how this scales to larger integer types with one extra instruction per u64 "digit".
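
To make the scaling concrete, here is a sketch of what four-limb addition could look like with the same intrinsic. The u256 struct is my own illustration and not part of the post; the pattern is simply the carry chain extended by two more limbs.

struct u256
{
    u64 limbs[4] = {}; // limbs[0] is the least significant "digit"
};

u256 operator+(u256 a, u256 b)
{
    u256 r;
    // one add followed by three adcs: one instruction per limb
    unsigned char c = _addcarry_u64(0, a.limbs[0], b.limbs[0], &r.limbs[0]);
    c = _addcarry_u64(c, a.limbs[1], b.limbs[1], &r.limbs[1]);
    c = _addcarry_u64(c, a.limbs[2], b.limbs[2], &r.limbs[2]);
    (void)_addcarry_u64(c, a.limbs[3], b.limbs[3], &r.limbs[3]);
    return r;
}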

Multiplication: regrouping our u64 digits

Multiplication is where things get more interesting. A 128-bit by 128-bit multiply produces a 256-bit result, but we only want the lower 128 bits. The same is true of u64 * u64, which in theory produces a 128-bit result, but you usually only use the lower u64. Speaking of which, all modern 64-bit architectures give you access to fast u64 * u64 -> u128 instructions. With BMI2, this is exposed as _mulx_u64. On MSVC, the equivalent is _umul128. This is our building block for large multiplication.

You can derive the code by writing the u128 * u128 as 2-digit long multiplication, (u64, u64) * (u64, u64), and then looking carefully at which partial products sum into which digit.

That's how I do it on paper, but here we can also choose an algebraic route. Write the numbers as: $$ (a.\text{low} + 2^{64} \cdot a.\text{high}) \cdot (b.\text{low} + 2^{64} \cdot b.\text{high}) $$ Expanding this gives four terms. Of those, only three contribute to the low 128 bits:

  • \(a_\text{low} \cdot b_\text{low}\) contributes both low and high parts
  • \(a_\text{low} \cdot b_\text{high}\) contributes to bits 64..127
  • \(a_\text{high} \cdot b_\text{low}\) contributes to bits 64..127
  • \(a_\text{high} \cdot b_\text{high}\) contributes only to bits 128 and above and can be discarded

This works for larger integers as well, though summing up intermediate terms can produce carries for "higher digits".

u128 operator*(u128 a, u128 b) 
{
    // we want the low 128 bits of:
    // (a.low + 2^64 a.high) * (b.low + 2^64 b.high)
    //
    // r.low  = lo64(a.low * b.low)
    // r.high = hi64(a.low * b.low)
    //        + lo64(a.low * b.high)
    //        + lo64(a.high * b.low)        (mod 2^64)

    // NOTE (MSVC): for multiply, you can use _umul128(a, b, &hi) instead of _mulx_u64.
    //              Clang/GCC: _mulx_u64 is BMI2 and needs -mbmi2.

    u128 r;

    u64 p0_hi;
    r.low = _mulx_u64(a.low, b.low, &p0_hi);

    // cross terms: only the low 64 bits contribute to r.high
    u64 t1_hi;
    u64 t1_lo = _mulx_u64(a.low,  b.high, &t1_hi);

    u64 t2_hi;
    u64 t2_lo = _mulx_u64(a.high, b.low,  &t2_hi);

    // simply add is sufficient: carries would land in bit 128 and are discarded
    r.high = p0_hi + t1_lo + t2_lo;

    return r;
}

Note that we do not need carry handling there. Any carry out of bit 127 would land in bit 128, which we are discarding anyway.

The compiler output reflects this reasoning.

operator*(u128, u128):
        mulx    r8, rax, rdi
        imul    rcx, rdi
        imul    rdx, rsi
        add     rdx, rcx
        add     rdx, r8
        ret

The compiler chooses slightly different instructions and registers for the builtin __uint128_t multiplication, but the result is otherwise equivalent.

Equality

Now an easy one. u128 equality is simple structural equality.

bool operator==(u128 a, u128 b) 
{
    return a.low == b.low && a.high == b.high;
}

The generated assembly is worth a quick look.

operator==(u128, u128):
        xor     rdi, rdx
        xor     rsi, rcx
        or      rsi, rdi
        sete    al
        ret

Instead of branching (as you might expect from the short-circuiting of &&), the compiler XORs the corresponding limbs. XOR produces zero if and only if the inputs are equal. ORing the results combines the checks. If the final value is zero, both limbs were equal.

This pattern continues for larger integers, though we might see branching due to short-circuiting at some point. (We could of course just use the XOR approach in operator== but it is distinctly less readable.)
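
For reference, a branchless formulation of that XOR approach could look like the sketch below; the helper name is mine, not from the post.

bool equals_branchless(u128 a, u128 b)
{
    // zero iff both limbs match, mirroring the compiler's xor/or/sete pattern
    return ((a.low ^ b.low) | (a.high ^ b.high)) == 0;
}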

Comparison: borrow beats branching

The straightforward way to compare two 128-bit integers is to compare the high parts first and then the low parts.

bool operator<(u128 a, u128 b)
{
    if (a.high != b.high) return a.high < b.high;
    return a.low < b.low;
}

This is correct, but the codegen is not great:

operator<(u128, u128):
        xor     r8d, r8d
        cmp     rdi, rdx
        setb    r8b
        xor     eax, eax
        cmp     rsi, rcx
        setb    al
        cmove   eax, r8d
        ret

Works, but heavier than necessary. We can do better by leaning on the hardware borrow flag.

Unsigned comparison a < b is equivalent to checking whether a - b produces a borrow:

bool operator<(u128 a, u128 b) 
{
    u64 dont_care;

    // compute borrow from (a.low - b.low). If a.low < b.low => borrow = 1.
    unsigned char borrow = _subborrow_u64(0, a.low, b.low, &dont_care);

    // now subtract highs with that borrow
    // final borrow tells us if a < b in 128-bit unsigned.
    borrow = _subborrow_u64(borrow, a.high, b.high, &dont_care);

    return borrow != 0;
}

The resulting assembly is minimal:

operator<(u128, u128):
        cmp     rdi, rdx
        sbb     rsi, rcx
        setb    al
        ret

One compare, one subtract with borrow, and a flag check. This is exactly the kind of codegen we want in hot code.

A small use site

To make sure everything composes properly, let's build a slightly larger function.

u128 demo_u128(u128 a, u128 b) 
{
    u128 x = a + b;
    u128 y = a * b;
    return x < y
            ? y - x
            : x - y;
}

All operators inline cleanly:

demo_u128(u128, u128):
        mov     r8, rdi
        add     r8, rdx
        mov     r9, rsi
        adc     r9, rcx
        mulx    rax, r10, rdi
        imul    rcx, rdi
        imul    rdx, rsi
        add     rdx, rcx
        add     rdx, rax
        mov     rax, r8
        sub     rax, r10
        mov     rcx, r9
        sbb     rcx, rdx
        jae     .LBB13_2
        sub     r10, r8
        sbb     rdx, r9
        mov     rax, r10
        mov     rcx, rdx
.LBB13_2:
        mov     rdx, rcx
        ret

There is a branch for the ternary operator, while the builtin version uses conditional moves instead. Which is better depends on the data patterns and could go either way. I consider this basically as good as it gets.

Platform notes

The examples shown are for x64 with Clang or GCC.

On MSVC, _addcarry_u64 and _subborrow_u64 work the same way. For multiplication, _umul128 replaces _mulx_u64.
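
As a hedged sketch of that substitution (assuming <intrin.h> and the u64/u128 definitions above; the helper name is mine and has not been verified against MSVC here), the multiply body would become:

#include <intrin.h> // _umul128 on MSVC x64

u128 mul_msvc(u128 a, u128 b)
{
    u128 r;

    u64 p0_hi;
    r.low = _umul128(a.low, b.low, &p0_hi);

    // cross terms: only their low halves contribute to r.high
    u64 ignored;
    u64 t1_lo = _umul128(a.low,  b.high, &ignored);
    u64 t2_lo = _umul128(a.high, b.low,  &ignored);

    r.high = p0_hi + t1_lo + t2_lo;
    return r;
}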

On AArch64, the same approach applies using adds and adcs instructions for addition, subs and sbcs for subtraction, and mul + umulh for the low + high half of a 64-bit multiply. The patterns carry over directly, even though the intrinsics differ slightly (and multiplication is split into two parts).
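
One portable option on Clang/GCC is to express the carry chain with __builtin_add_overflow and let the compiler pick the flag-setting instructions. This is a sketch under my own assumptions (the helper name is hypothetical, and it only covers addition), not code from the post.

u128 add_portable(u128 a, u128 b)
{
    u128 r;
    // carry out of the low limb; the compiler can lower this to a flag-based add pair
    unsigned long long carry = __builtin_add_overflow(a.low, b.low, &r.low);
    r.high = a.high + b.high + carry; // any carry out of bit 127 is discarded
    return r;
}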

Outlook

This u128 is the easiest large integer you can write. Our goal is the best possible performance, so we made sure the codegen is essentially identical to that of the builtin __uint128_t.

From here, the same patterns extend naturally to signed variants, widening multiplies such as u128 times u128 to u256, and chains of fixed-width integers like i192 or i256. More importantly, the same reasoning applies when you design predicate-specific arithmetic that only computes what is actually needed.

A lot of our performance really boils down to these types: no BigInteger tax, no floating-point predicates that need a few kB of stack space for the worst case. Just a lot of straight-line integer code and some light static branching.

Addendum 2026-01-24

Some notes based on reader feedback.

On PowerPC: I don't know much about PowerPC or the availability of intrinsics, but the instructions you need are definitely there. For add/sub with carry, there's addc / adde for addition and subfc / subfe for subtraction with borrow. For wide multiplication, mulhdu gives you the high half of a 64-bit multiply (and mulhd for signed).

On GCC codegen with intrinsics: Someone pointed out that GCC doesn't always handle the intrinsic-based addition optimally. It sometimes moves the result through the stack before setting it in the final registers. This is related to writing directly to the result struct. If you write to "real" scalars first, the codegen is optimal again. A weird one.
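
A sketch of that workaround, writing to local scalars before constructing the result (my reconstruction of the described fix, not the reader's exact code):

u128 add_via_scalars(u128 a, u128 b)
{
    u64 lo, hi;
    unsigned char c = _addcarry_u64(0, a.low,  b.low,  &lo);
    (void)_addcarry_u64(c, a.high, b.high, &hi);
    return u128{lo, hi};
}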

On division: There is no neat codegen for division. Even the builtin delegates to a library call. The naive but practical approach is binary long division, which finishes in up to 128 steps. Either branchless with fixed runtime or with a loop that searches for the next set bit. Either way it's a bit of work. Our exact predicates are always formulated in a division-free way simply because division would be expensive.
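
For completeness, a naive restoring binary long division could look like the sketch below. It returns only the quotient, assumes a non-zero divisor, and reuses operator- from above; the helper names are mine, and this is an illustration of the "up to 128 steps" idea rather than code from the post.

u128 shl1(u128 x) // shift left by one bit
{
    u128 r;
    r.high = (x.high << 1) | (x.low >> 63);
    r.low  = x.low << 1;
    return r;
}

bool ge(u128 a, u128 b) // a >= b, i.e. no final borrow in a - b
{
    u64 dont_care;
    unsigned char borrow = _subborrow_u64(0, a.low, b.low, &dont_care);
    borrow = _subborrow_u64(borrow, a.high, b.high, &dont_care);
    return borrow == 0;
}

u128 div_u128(u128 n, u128 d) // quotient of n / d, d must be non-zero
{
    u128 q, r;
    for (int i = 127; i >= 0; --i)
    {
        // bring down bit i of the dividend: r = (r << 1) | bit
        r = shl1(r);
        r.low |= (i >= 64) ? ((n.high >> (i - 64)) & 1) : ((n.low >> i) & 1);

        // shift the quotient and set its low bit if the divisor fits
        q = shl1(q);
        if (ge(r, d))
        {
            r = r - d;
            q.low |= 1;
        }
    }
    return q;
}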

On _BitInt(N): The upcoming _BitInt(N) type does work, but with caveats. For N = 128, you get the normal codegen. For N > 128, GCC calls into a function where the bit size is a runtime parameter. So yes, these types would work, but performance will be subpar for larger widths. Clang generates properly inlined assembly.
