无分支快速排序,比 std::sort 和 pdqsort 更快,提供 C 和 C++ API。
Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

原始链接: https://tiki.li/blog/blqsort

**blqsort** 是一款高性能、无分支(branchless)的快速排序实现,旨在通过减少 CPU 分支预测错误来超越标准库的排序算法。该算法通过利用辅助缓冲区和用于小数据集的排序网络,消除了内循环中的条件分支,从而显著缩短了在现代硬件上的执行时间。 在 Apple M1 和 AMD Ryzen 系统上的基准测试表明,blqsort 的性能始终优于 `std::sort` 和 `pdqsort`。为了保持稳健性,它采用了中位数的中位数(median-of-medians)作为枢轴策略,在分区不平衡时回退至堆排序,并针对非平凡、高代价的数据类型提供了专门的变体。 该库专为在 C 和 C++ 环境中易于使用而设计,提供与 `std::sort` API 相似的仅头文件(header-only)实现。它支持单线程和多线程执行,并且比典型的基于 SIMD 的排序库能更灵活地处理自定义数据结构。对于寻求优化性能的开发者而言,blqsort 提供了一种直接替换方案:它通过增加数据拷贝量,换取了在现代深度流水线 CPU 上更快的处理速度。

最近的一场 Hacker News 讨论聚焦于一种新的“无分支快速排序”(blqs)实现,该实现声称性能优于 C++ 标准库中的排序函数(`std::sort` 和 `pdqsort`)。 讨论帖中包含了 `pdqsort` 创建者 Orson Peters 的见解,他分享了该新库与 `ipnsort` 及 `driftsort` 的性能基准测试对比——后两者是目前已集成到 Rust 标准库中的高性能排序算法。虽然 `blqs` 在处理随机数据时表现出色,但基准测试显示,在处理部分排序的数据集时,`driftsort` 的表现明显优于其他方法。 评论者们对基准测试的细节进行了探讨,指出该代码库缺乏与向量化排序网络(如英特尔的实现)的对比,且过度依赖随机输入测试,这可能无法代表真实的生产使用场景。尽管存在这些批评,社区仍称赞了 `blqs` 实现的简洁与优雅,并指出与那些复杂冗长的排序算法相比,其极简设计更易于理解和维护。
相关文章

原文
Branchless Quicksort

Fast Branchless Quicksort using Sorting-Networks with C and C++ Interface

Performance results naturally depend on the underlying hardware. The following benchmarks show the execution times for sorting 50 million doubles using different sorting implementations. The measurements were taken on an Apple M1 system using Clang and on an AMD Ryzen 3 Linux system using GCC, both compiled with the -O3 option.

Implementation Apple M1 AMD Ryzen
std::sort 1.33s 5.56s
pdqsort 1.33s 2.81s
blqsort (single threaded) 0.97s 2.06s

For a fair comparison, the single-threaded version of blqs was used here. On an M1, the threaded versions are another factor of 3 to 4 faster. In terms of runtime, the C++ versions differ only very little from the C version.

blqsort

Full source code is included on Github.There are four implementations of blqsort here, each provided as a single header file.

Branchless programming

On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:

for (int i = 0; i < 1000; i++) {
    small_numbers[smlen] = numbers[i];
    smlen += (numbers[i] < 500);
}
is much faster than the conventional version with a conditional branch:
for (int i = 0; i < 1000; i++) {
    if (numbers[i] < 500) {
        small_numbers[smlen] = numbers[i];
        smlen += 1;
    }
}
This paper by Edelkamp and Weiß shows how partitioning performance in Quicksort can be improved by avoiding conditional branches.

The strategy of using an auxiliary buffer for branchless partitioning is inspired by fluxsort. The “auxiliary buffer” here means a 1024‑element stack array, not heap memory.

First, 1024 elements are copied from one side into an auxiliary buffer to make room for subsequent operations. Then, we alternately copy a 1024-element block to the left and right in a branchless manner. The left pointer is only incremented if the element is smaller than the pivot, otherwise, the right pointer is incremented - branchless, of course.

This involves more more than double the necessary copy operations. For data types that are cheap to copy, however, this is much less expensive than the branch mispredictions that would otherwise occur.

Pivot strategy, bad input and sorting network

To avoid the O(n²) runtime caused by bad input data, the program can group identical elements together and switch to heapsort for that specific part if it detects a big imbalance during partitioning. The program also checks if a partition is already sorted.

For larger parts, it uses a median-of-medians strategy to find a good pivot. In addition, critical partitioning loops are explicitly unrolled.

For 2 to 12 elements, the algorithm uses custom sorting networks. This approach requires a separate code path for each size but sorts small subsets with very few swaps using a branchless sort‑2 primitive. Source for sorting networks

C++

For types with higher copy costs that are not is_trivially_copyable (such as strings), the buffer-based branchless approach becomes less efficient. In such cases, a BlockQuicksort (Edelkamp and Weiß) variant is used instead. This processes only the element indices in a branchless manner and then moves the actual data with fewer swaps. Some ideas are from pdqsort.

Usage

You only need to include blqs.h, and it can be used just as easily as std::sort.

#include "blqs.h"

double data[SIZE];

blqs::sort(data, data + SIZE);
For the C++ multithreaded variant, which uses C++ threads, include blqs_thr.h instead of blqs.h. The function call remains the same.

C

In C, the code specialized for the data type is generated using #define.

Usage

#define BLQS_CMP(a, b) ((a) < (b))
#define BLQS_TYPE double
#include "blqsort.h"

double data[SIZE];

blqsort(data, SIZE);
For the C multithreaded variant, which uses POSIX threads, include blqsort_thr.h instead of blqsort.h. The function call remains the same here as well.

Sorting Custom Data Structures

In practice, we often need to sort custom data structures. This is where SIMD libraries like Google Highway - while very fast for simple numbers - become difficult to use.

Using std::sort or blqsort gives you much more flexibility.

C++

#include "blqs.h"

struct entry {
    int id;
    int value;

    bool operator<(const entry& other) const {
        return id < other.id;
    }
};

struct entry data[SIZE];

blqs::sort(data, data + SIZE);

C

struct entry {
    int id;
    int value;
};
#define BLQS_CMP(a, b) (((a).id) < ((b).id))
#define BLQS_TYPE struct entry
#include "blqsort.h"

struct entry data[SIZE];

blqsort(data, SIZE);
Execution times for sorting 50 million of these structs.
Implementation Apple M1 AMD Ryzen
std::sort 3.46s 4.75s
pdqsort 3.46s 4.72s
blqsort 0.96s 2.20s

Links

When ‘if’ slows you down, avoid it.

Interactive sorting demo


[email protected]

联系我们 contact @ memedata.com