Fast Sorting, Branchless by Design

Original link: https://00f.net/2026/02/17/sorting-without-leaking-secrets/

## Security and performance: constant-time sorting

Traditional sorting algorithms such as quicksort and mergesort prioritize speed, but their data-dependent execution time can create a security hole: a timing side channel that lets an attacker infer information about the data being sorted. This matters for post-quantum cryptography, where sorting is a core operation.

The solution lies in **sorting networks**: fixed circuits of compare-and-swap operations whose execution order does not depend on the input data. Although they look less efficient (O(n log²n) versus quicksort's O(n log n)), they execute in constant time, which is essential for security. Even bubble sort fits the definition of a sorting network thanks to its fixed comparison pattern.

Efficient sorting networks such as **bitonic sort** exploit parallelism and can perform surprisingly well on GPUs. Recent implementations such as **djbsort** go further by using SIMD vectorization and carefully engineered code, reaching speeds that compete with or exceed standard sorting algorithms. djbsort also includes formal verification and handles floating-point numbers efficiently through bit-level transforms.

Ultimately, sorting networks trade a small amount of performance for significant security benefits, making them an ideal choice for sensitive applications and a strong alternative to traditional sorting methods.

Hacker News discussion (9 points, posted by jedisct1):

jstrieb: Wow, this is packed with interesting information. Thanks for writing it up! (And thanks for all your other great open-source work!) Are there plans to upstream this into the Zig standard library? It seems useful beyond cryptography packages, since the benchmarks at the end show it is generally faster than std pdqsort. I checked the issue trackers on Codeberg and GitHub and didn't see any mention of djbsort or NTRU Prime, which leads me to believe there are (currently) no (official) upstreaming plans.

ozgrakkurt (in reply): In my experience, blocksort in the stdlib can also be faster than pdqsort.

Original article

Sorting is one of the most studied problems in computer science. Every language ships a built-in sort, and for most applications, picking the right one is a matter of performance. Quicksort, mergesort, pdqsort. They all produce the correct output. The main question is how fast they get there.

But there’s a class of applications where correctness and speed aren’t enough. When the data being sorted is sensitive, the sort itself can become a security vulnerability. Not because it produces the wrong result, but because an attacker who observes execution time can learn something about the data.

This is not theoretical. Post-quantum cryptosystems like Classic McEliece and NTRU Prime use sorting as a core subroutine, and if the sort leaks timing information, the entire cryptosystem is compromised.

There is, however, a family of algorithms that eliminates this problem entirely, and when implemented well, can even outperform traditional sorts.


Why sorting leaks information

Consider quicksort.

It picks a pivot, partitions the array around it, and recurses. Which pivot gets chosen, how the partitioning plays out, and how deep the recursion goes all depend on the values in the array. Different inputs produce different branch patterns, different memory access sequences, and different execution times.

An attacker who can measure execution time, even remotely over a network, can use those timing differences to deduce information about what’s being sorted. This is a timing side-channel attack.

The problem isn’t specific to quicksort. Any sort whose control flow depends on the data is vulnerable. Mergesort’s merge step branches on comparisons. Insertion sort’s inner loop length depends on how far each element needs to move. Even pdqsort, which combines several strategies, makes data-dependent decisions at every level.

What we need is a sort where the sequence of operations is completely fixed: determined only by the array length, never by the values. The same comparisons, the same memory accesses, the same number of instructions, regardless of whether the array contains [1, 2, 3] or [3, 1, 2].


Sorting networks

A sorting network does exactly this. It’s a fixed circuit of compare-and-swap operations wired together in a specific pattern. Each comparator takes two values and outputs them in order: the smaller one goes to one wire, the larger one to the other. The entire network is determined at construction time, before any data is seen.

The simplest sorting network sorts two elements. It’s a single comparator: take two values, output (min, max). Done. For three elements, you need three comparators. For four, you need five. The pattern grows, but the key property remains: the comparisons don’t depend on the data. The same wires get compared in the same order every time.

This is what makes sorting networks attractive for cryptography. The comparison pattern is data-oblivious: an attacker watching the execution sees the exact same sequence of operations regardless of what’s being sorted.
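
To make this concrete, here is a minimal sketch (in Zig, not taken from any particular library) of a single comparator and the fixed three-comparator network described above:

// One comparator: the minimum goes to the first wire, the maximum to the second.
fn compareAndSwap(a: *i32, b: *i32) void {
    const lo = @min(a.*, b.*);
    const hi = @max(a.*, b.*);
    a.* = lo;
    b.* = hi;
}

// A fixed three-comparator network: the same comparisons run in the
// same order no matter what values come in.
fn sortThree(v: *[3]i32) void {
    compareAndSwap(&v[0], &v[1]);
    compareAndSwap(&v[1], &v[2]);
    compareAndSwap(&v[0], &v[1]);
}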


Bubble sort as a sorting network

It turns out that bubble sort, often dismissed as the worst sorting algorithm, is already a sorting network.

Think about what bubble sort actually does. On each pass, it compares adjacent pairs from left to right: elements 0 and 1, then 1 and 2, then 2 and 3, and so on. If a pair is out of order, it swaps them. After n-1 passes over an array of n elements, the array is sorted.

The comparison pattern is always the same. It doesn’t matter what values are in the array: bubble sort always compares the same pairs in the same order. It’s a fixed schedule of compare-and-swap operations. That’s exactly the definition of a sorting network.
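
As a sketch (reusing the compareAndSwap comparator from above), bubble sort written as a fixed schedule looks like this; note that both loop bounds depend only on the length, never on the values:

fn bubbleSortNetwork(items: []i32) void {
    const n = items.len;
    if (n < 2) return;
    var pass: usize = 0;
    while (pass < n - 1) : (pass += 1) {
        // Every pass compares the same adjacent pairs, in the same order.
        var i: usize = 0;
        while (i + 1 < n) : (i += 1) {
            compareAndSwap(&items[i], &items[i + 1]);
        }
    }
}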

The problem is efficiency. Bubble sort needs O(n) passes, each with O(n) comparisons, for a total of O(n²) comparators. For tiny arrays that fit in L1/L2 caches, bubble sort can be surprisingly fine in practice. But for large arrays, it’s far too slow.


Odd-even transposition sort

A natural improvement is to parallelize. In odd-even transposition sort, instead of sweeping left-to-right, you alternate between two phases:

  • Phase A: compare and swap pairs (0,1), (2,3), (4,5), ...
  • Phase B: compare and swap pairs (1,2), (3,4), (5,6), ...

Within each phase, all the comparisons are independent: they operate on disjoint pairs of elements, so they can all happen simultaneously. And after n phases, the array is sorted.
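
A rough sketch of the whole algorithm (again reusing compareAndSwap): n phases of disjoint adjacent comparisons, with the starting index alternating between 0 and 1.

fn oddEvenTranspositionSort(items: []i32) void {
    const n = items.len;
    var phase: usize = 0;
    while (phase < n) : (phase += 1) {
        // Phase A starts at index 0, phase B at index 1; the pairs within
        // a phase are disjoint and could all run in parallel.
        var i: usize = phase % 2;
        while (i + 1 < n) : (i += 2) {
            compareAndSwap(&items[i], &items[i + 1]);
        }
    }
}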

On a single processor, this is no faster than bubble sort. But it demonstrates something important about sorting networks: they expose parallelism. All comparisons within a phase are independent, which is a property that data-dependent sorts like quicksort fundamentally cannot offer, because each comparison depends on the results of previous ones.

Still, odd-even transposition sort has O(n) depth (sequential phases) and O(n²) total comparators. We can do much better.


Bitonic sequences

The key insight that leads to efficient sorting networks comes from a special kind of sequence: a bitonic sequence.

A bitonic sequence is one that first increases and then decreases like [1, 4, 7, 5, 2]. It can also be a cyclic rotation of such a sequence, so [5, 2, 1, 4, 7] is also bitonic (it falls, then rises, which is the same shape rotated).

Why do we care? Because a bitonic sequence of length n can be sorted by a fixed network with only O(log n) depth.

The trick is called a bitonic merger, and it works like this: compare each element in the first half with the corresponding element in the second half (elements at distance n/2). This single layer of comparators splits the bitonic sequence into two smaller bitonic sequences, one in each half. Recurse on each half, halving the distance each time. After log n layers, the sequence is sorted.
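
Here is a sketch of that recursion in Zig, assuming the length is a power of two and the input is bitonic (it mirrors the description above, not any particular library's code):

fn bitonicMerge(items: []i32) void {
    const n = items.len;
    if (n < 2) return;
    const half = n / 2;
    // One layer: compare element i with element i + n/2.
    var i: usize = 0;
    while (i < half) : (i += 1) {
        compareAndSwap(&items[i], &items[i + half]);
    }
    // Each half is now bitonic; recurse with half the distance.
    bitonicMerge(items[0..half]);
    bitonicMerge(items[half..]);
}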

A bitonic merger is a small, efficient, fixed-structure circuit. It is the building block for sorting any array, not just bitonic ones.


Batcher’s bitonic sort

In 1968, Ken Batcher showed how to use the bitonic merger to build a full sorting network. The construction is recursive:

  1. Split the array in half.
  2. Recursively sort the first half in ascending order and the second half in descending order.
  3. The result is a bitonic sequence. Merge it with the bitonic merger.

For clarity, assume n is a power of two. Practical implementations handle arbitrary lengths with padding or tailored networks.
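
A sketch of the full construction under that power-of-two assumption. A direction flag is threaded through so the two halves can be sorted in opposite orders; the flag depends only on the position in the network, never on the data. Calling bitonicSort(data, true) sorts the whole array ascending.

fn bitonicSort(items: []i32, ascending: bool) void {
    const n = items.len;
    if (n < 2) return;
    const half = n / 2;
    bitonicSort(items[0..half], true); // first half ascending
    bitonicSort(items[half..], false); // second half descending
    bitonicMergeDir(items, ascending); // the result is bitonic: merge it
}

fn bitonicMergeDir(items: []i32, ascending: bool) void {
    const n = items.len;
    if (n < 2) return;
    const half = n / 2;
    var i: usize = 0;
    while (i < half) : (i += 1) {
        const lo = @min(items[i], items[i + half]);
        const hi = @max(items[i], items[i + half]);
        items[i] = if (ascending) lo else hi;
        items[i + half] = if (ascending) hi else lo;
    }
    bitonicMergeDir(items[0..half], ascending);
    bitonicMergeDir(items[half..], ascending);
}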

Here’s a concrete example with 8 elements.

After recursively sorting the first half ascending and the second half descending (which itself applies the same construction on smaller subarrays), we get:

[1, 3, 6, 7,  8, 5, 4, 2]
 ascending    descending

This is a bitonic sequence: it rises then falls. Now we apply the bitonic merger. The rule is simple: for each pair (i, i+d), keep the minimum at position i and the maximum at position i+d. Halve d after each layer:

    positions: [0, 1, 2, 3, 4, 5, 6, 7]

        input: [1, 3, 6, 7, 8, 5, 4, 2]

d = 4   pairs: (0,4), (1,5), (2,6), (3,7)
          out: [1, 3, 4, 2, 8, 5, 6, 7]

d = 2   pairs: (0,2), (1,3), (4,6), (5,7)
          out: [1, 2, 4, 3, 6, 5, 8, 7]

d = 1   pairs: (0,1), (2,3), (4,5), (6,7)
          out: [1, 2, 3, 4, 5, 6, 7, 8]

Three layers, four comparisons each, and the array is sorted. The comparison schedule was the same regardless of what values were in the array: only the array length determined which pairs got compared.

How many comparisons does this take? Each level of recursion halves the array, so there are O(log n) levels. At each level, the mergers collectively cover all n elements across O(log n) layers. That’s O(n log n) comparisons per level, and O(log n) levels, for a total of O(n log²n) comparators.

This is more work than quicksort’s O(n log n). But the comparison pattern is completely fixed, which is exactly the property we need for constant-time sorting. And O(n log²n) is close enough to O(n log n) to be practical. The extra log factor is a modest price for side-channel resistance. More importantly, layers are made of nothing but simple, disjoint operations that can run in parallel.

This structure is also a natural fit for GPUs. GPUs execute threads in lockstep groups, and when threads within a group take different branches, performance collapses. Data-dependent sorts like quicksort suffer from this. Sorting networks avoid it entirely: every thread executes the same compare-and-swap regardless of the data. Bitonic sort has been one of the standard GPU sorting algorithms for this reason, and is included in NVIDIA’s CUDA samples as a textbook example.

Note that there exists a theoretically optimal O(n log n) sorting network, the AKS network, but its constants are so enormous that it’s unusable in practice. As Knuth put it, Batcher’s method is better unless n exceeds the total memory capacity of all computers on earth.


djbsort

Daniel J. Bernstein (“djb”) released djbsort in 2018 and keeps updating it. The goal is a sort that is simultaneously fast, constant-time, and verifiable. This implementation uses SIMD vectorization for speed, data-oblivious execution for security, and includes formal verification tooling that operates on the compiled binary.

djbsort implements Batcher’s sorting network with the wire ordering rearranged to be SIMD-friendly.

Since all comparisons within a layer operate on disjoint pairs, the CPU can exploit this independence in two ways: SIMD instructions process multiple pairs per instruction, and out-of-order execution overlaps independent scalar operations. The sorting network’s structure exposes both forms of parallelism naturally.

The core primitive is the compare-and-swap, and djbsort provides two versions.

The fast path uses hardware min/max instructions. On modern CPUs, packed SIMD min and max operate on entire vectors of elements at once and are inherently branchless. One instruction compares 4, 8, or 16 pairs simultaneously with no data-dependent control flow.
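
In Zig, the same branchless comparator can be written over whole vectors. A sketch with four i32 lanes: the @Vector type lets @min and @max lower to packed min/max instructions where the target supports them.

// Compare-and-swap four independent pairs at once, with no branches.
fn vecCompareAndSwap(a: *@Vector(4, i32), b: *@Vector(4, i32)) void {
    const lo = @min(a.*, b.*);
    const hi = @max(a.*, b.*);
    a.* = lo;
    b.* = hi;
}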

The fallback path uses pure arithmetic. It subtracts the two values to get a difference, extracts the sign bit via a right shift, negates it to produce either an all-ones or an all-zeros mask, then XOR-swaps the values conditioned on that mask. As long as the compiler doesn't recognize the pattern and rewrite it, the result has no branches, no data-dependent memory accesses, and only operations that are expected to run in constant time on the target architecture. The same operations execute regardless of whether the pair was already in order.
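
A sketch of that fallback for 32-bit integers, widened to 64 bits for the subtraction so it cannot overflow (djbsort's actual code differs in the details):

fn arithmeticCompareAndSwap(a: *i32, b: *i32) void {
    // A negative difference means the pair is out of order (for ascending).
    const diff: i64 = @as(i64, b.*) - @as(i64, a.*);
    const sign: u1 = @truncate(@as(u64, @bitCast(diff)) >> 63);
    // 0 -% 1 wraps to all-ones, 0 -% 0 stays zero.
    const mask: u32 = @as(u32, 0) -% @as(u32, sign);
    // XOR swap gated by the mask: the same instructions run either way.
    const x = (@as(u32, @bitCast(a.*)) ^ @as(u32, @bitCast(b.*))) & mask;
    a.* = @bitCast(@as(u32, @bitCast(a.*)) ^ x);
    b.* = @bitCast(@as(u32, @bitCast(b.*)) ^ x);
}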


A Zig implementation

I implemented djbsort in Zig. The library exposes two functions.

sort is the fast path for native numeric types: integers and floats of any width:

const djbsort = @import("djbsort");

var data = [_]i32{ 42, -7, 13, 0, -99 };
djbsort.sort(i32, .asc, &data);
// data is now { -99, -7, 0, 13, 42 }

sortWith handles arbitrary types, including structs. It follows the same interface as std.sort.pdq. You provide a comparison function and an optional context:

const Point = struct { x: i32, y: i32 };

var points = [_]Point{ .{ .x = 3, .y = 1 }, .{ .x = 1, .y = 2 } };
djbsort.sortWith(Point, &points, {}, struct {
    fn lessThan(_: void, a: Point, b: Point) bool {
        return a.x < b.x;
    }
}.lessThan);

Both functions use the same Batcher sorting network topology. They differ in how the compare-and-swap is implemented.

The native path has two levels of optimization. On modern CPUs with SIMD support, it processes elements in vector-wide chunks using packed min/max. For the leftover elements that don’t fill a full vector, it falls back to a scalar compare-and-swap.

The scalar path is gated by a comptime flag called has_ct_minmax. When side-channel mitigations are disabled (side_channels_mitigations == .none), it’s always true. In that specific case, we don’t need constant-time code, so @min/@max are fine on any architecture. But when mitigations are enabled (and by default, they are), it’s only true on x86/x86_64 and aarch64, where @min and @max have been verified to lower to branchless instructions (cmov, csel). On other architectures, the code falls back to the arithmetic sign-bit extraction and XOR swap:

const diff = if (order == .asc) b_int - a_int else a_int - b_int;
const sign_bit: u1 = @truncate(@as(UWInt, @bitCast(diff)) >> @intCast(bits));
var mask_word = @as(usize, 0) -% @as(usize, sign_bit);

The subtraction produces a negative result when the pair is out of order. The sign bit is extracted via a right shift, then spread to a full-width mask via 0 -% sign_bit (wrapping subtraction: 0 minus 1 wraps to all-ones, 0 minus 0 stays zero). The mask drives an XOR swap of the raw bytes.

The generic path (sortWith) behaves differently depending on std.options.side_channels_mitigations. When mitigations are disabled, it uses a plain conditional swap: a branch and std.mem.swap. Fast, but not necessarily constant-time. When mitigations are enabled, it uses ctCondSwap, which XOR-masks the raw byte representation in machine-word-sized chunks. Even with mitigations enabled, sortWith is only timing-safe if the caller-supplied comparison function is itself constant-time. A data-dependent lessThan would leak through the comparator, not the swap.
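
The idea behind a constant-time conditional swap over raw bytes can be sketched like this (a hypothetical helper, byte-wise rather than word-wise for brevity; not the library's actual ctCondSwap):

fn ctCondSwapBytes(a: []u8, b: []u8, do_swap: bool) void {
    // Spread the boolean into an all-ones or all-zeros byte mask.
    const mask: u8 = @as(u8, 0) -% @as(u8, @intFromBool(do_swap));
    // XOR-mask every byte pair; the memory access pattern never changes.
    for (a, b) |*x, *y| {
        const t = (x.* ^ y.*) & mask;
        x.* ^= t;
        y.* ^= t;
    }
}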

Both paths use the same optimization barrier to prevent the compiler from undoing the constant-time code:

mask_word = asm volatile (""
    : [mask] "=r" (-> usize),
    : [_] "0" (mask_word),
);

This is a no-op at the hardware level. It generates no instructions. But it tells the compiler that mask_word might have been modified, which prevents LLVM from reasoning about its value and converting the branchless XOR swap back into a conditional branch. This is more efficient than std.mem.doNotOptimizeAway(&mask), which includes a memory clobber.


Float ordering

IEEE 754 floats don’t have a total order. NaN is unordered. It’s not less than, equal to, or greater than anything, including itself. And -0.0 compares equal to +0.0, even though they have different bit representations.

For a sorting network, we need a total order. The sort function imposes one: -NaN < -inf < ... < -0.0 < +0.0 < ... < +inf < +NaN.

The solution is djbsort’s “useint” technique: don’t sort the floats at all. Instead, reinterpret their bits as signed integers, apply a transform that makes integer comparison match the desired float ordering, sort using the integer sorting network, and transform back.

The transform is a function called floatSortKey. Positive floats already sort correctly as integers: larger floats have larger bit values. Negative floats sort in the wrong direction: -1.0 has a larger bit pattern than -2.0, but should come after it. The fix is to arithmetic-right-shift the sign bit to fill the entire word (all-ones for negative, all-zeros for positive), AND with maxInt to clear the sign bit, and XOR. This flips the magnitude bits of negative values, reversing their order while leaving positive values unchanged:

fn floatSortKey(comptime SInt: type, s: SInt) SInt {
    const mask = s >> (@bitSizeOf(SInt) - 1);
    return s ^ (mask & std.math.maxInt(SInt));
}

The transform is self-inverse: applying it twice gives back the original value. So the whole float sort is just a thin wrapper around the integer sort:

if (@typeInfo(T) == .float) {
    const SInt = std.meta.Int(.signed, @bitSizeOf(T));
    const int_items: [*]SInt = @ptrCast(items.ptr);
    const int_slice = int_items[0..n];
    applyFloatSortKey(SInt, int_slice);  // O(n) transform, vectorized
    sort(SInt, order, int_slice);        // reuse the integer sort
    applyFloatSortKey(SInt, int_slice);  // O(n) reverse (self-inverse)
    return;
}

The pre/post transform is itself vectorized, processing elements in SIMD-width chunks. The cost is two linear passes over the array, negligible next to the O(n log²n) sorting work. In benchmarks, float sorting runs within a few percent of integer sorting speed.


Sorting networks trade a modest increase in total work, O(n log²n) instead of O(n log n), for a property that mainstream sorts like quicksort and pdqsort cannot offer: a fixed, data-independent comparison schedule. djbsort makes this practical through SIMD vectorization and the instruction-level parallelism inherent in sorting networks.

On Apple Silicon, the SIMD path (sort) runs 2x to 5x faster than the standard library’s pdqsort across all types and sizes tested, from 16 to 1M elements. For i32 at 1M elements, djbsort is still about 3x faster.

On AMD Zen4, the speedups are even larger at small and medium sizes, up to 5.5x at 16K elements. At 1M elements the O(n log²n) work catches up, but djbsort still wins by about 1.3x.

Floats sort at the same speed as their integer counterparts on both architectures, within a few percent. The generic path (sortWith), which cannot use SIMD, beats pdq up to around 65K elements, then gradually loses as the extra log factor dominates.

For a data-oblivious sort, being faster than a mainstream data-dependent sort across most practical sizes is a welcome bonus. And for the cryptographic use cases that motivate djbsort, arrays tend to be well within that range.

The code is on GitHub.
