Sorting Performance Rabbit Hole

Original link: https://nibblestew.blogspot.com/2026/04/sorting-performance-rabbit-hole.html

## Pystd Sorting Performance: The Pursuit of Speed

Recent work focused on optimizing the sorting algorithms in the Pystd library to beat stdlibc++'s performance. Stable sort was a quick win: a few small tweaks produced a roughly 5% speedup, bringing the runtime down to 0.86 seconds.

Unstable sort proved far more challenging. Despite many attempted optimizations, including imitating stdlibc++'s element-movement technique (using `memmove` and a shift-based insertion scheme), experimenting with Shell sort and radix sort, and improving pivot selection, Pystd consistently lagged by 5-10%.

The breakthrough came unexpectedly from adjusting the threshold at which the introsort algorithm switches from quicksort to insertion sort. Raising this limit from 16 to 32 (with a brief test at 64) yielded the single largest performance gain.

In the end, Pystd narrowly edged out stdlibc++ at 0.754 seconds versus 0.755 seconds, hinting at room for further optimization. The margin is tiny, but it marks a milestone in the pursuit of faster sorting.


## Original Article

In an earlier blog post we found out that Pystd's simple sorting algorithm implementations were 5-10% slower than their stdlibc++ counterparts. The obvious follow up nerd snipe is to ask "can we make the Pystd implementation faster than stdlibc++?"

For all tests below the data set used was 10 million consecutive 64 bit integers shuffled in a random order. The order was the same for all algorithms.
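The benchmark setup described above can be sketched as follows. This is an illustrative reconstruction, not the post's actual harness: the function name, seed, and use of `std::mt19937_64` are assumptions; only the shape of the data (consecutive 64-bit integers, shuffled once into a fixed order) comes from the post.

```cpp
// Sketch of the benchmark data set: n consecutive 64-bit integers,
// shuffled with a fixed seed so every algorithm sees the same order.
// (Illustrative; names and seed are not from the Pystd source.)
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

std::vector<int64_t> make_dataset(std::size_t n) {
    std::vector<int64_t> data(n);
    std::iota(data.begin(), data.end(), 0);  // 0, 1, 2, ..., n-1
    std::mt19937_64 rng(42);                 // fixed seed: identical order for all runs
    std::shuffle(data.begin(), data.end(), rng);
    return data;
}
```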

It turns out that the answer for stable sorting is "yes, surprisingly easily". I made a few obvious tweaks (whose details I don't even remember any more) and got the runtime down to 0.86 seconds. This is approximately 5% faster than std::stable_sort. Done. Onwards to unstable sort.

This one was not, as they say, a picnic. I suspect that stdlib developers have spent more time optimizing std::sort than std::stable_sort simply because it is used a lot more.

After all the improvements I could think of were done, Pystd's implementation was consistently 5-10% slower. At this point I started cheating and examined how stdlibc++'s implementation worked to see if there were any optimization ideas to steal. Indeed there were, but they did not help.

Pystd's insertion sort moves elements by pairwise swaps. Stdlibc++ does it by moving the last item to a temporary, shifting the array elements onwards and then moving the stored item to its final location. I implemented that. It made things slower.
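The two inner-loop strategies described above can be sketched like this. This is an illustrative reconstruction of the general techniques, not the actual Pystd or stdlibc++ code:

```cpp
// Two insertion-sort inner loops: pairwise swaps vs. temporary-and-shift.
// (Illustrative sketch, not the actual Pystd/stdlibc++ implementations.)
#include <cstddef>
#include <utility>
#include <vector>

// Pairwise-swap variant: bubble each new element leftwards one swap at a time.
template <typename T>
void insertion_sort_swaps(std::vector<T>& a) {
    for (std::size_t i = 1; i < a.size(); ++i)
        for (std::size_t j = i; j > 0 && a[j] < a[j - 1]; --j)
            std::swap(a[j], a[j - 1]);
}

// Shift variant: hold the new element in a temporary, shift the larger
// elements right in one pass, then drop the temporary into its final slot.
template <typename T>
void insertion_sort_shift(std::vector<T>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        T tmp = std::move(a[i]);
        std::size_t j = i;
        for (; j > 0 && tmp < a[j - 1]; --j)
            a[j] = std::move(a[j - 1]);
        a[j] = std::move(tmp);
    }
}
```

The shift variant does roughly one move per element instead of three per swap, which is why it is often expected to be faster; as the post notes, that expectation did not hold here.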

Stdlibc++'s moves use memmove instead of copying (at least according to code comments). I implemented that. It made things slower.

Then I implemented shell sort to see if it made things faster. It didn't. It made them a lot slower. So did radix sort.

Then I reworked the way pivot selection is done and realized that if you do it in a specific way, some elements move to their correct partitions as a side effect of median selection. I implemented that and it did not make things faster. It did not make them slower, either, but the end result should be more resistant against bad pivot selection so I left it in.
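One common scheme consistent with the idea above is median-of-three selection done by sorting the three sample elements in place: the middle element becomes the pivot, and the two endpoints end up on their correct sides of it as a side effect. This is an illustrative sketch of that general technique, not Pystd's actual code:

```cpp
// Median-of-three pivot selection that partially partitions as a side effect.
// After the call, a[lo] <= a[mid] <= a[hi], so the two endpoints are already
// on the correct side of the pivot. (Illustrative, not Pystd's actual code.)
#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
std::size_t median_of_three(std::vector<T>& a, std::size_t lo, std::size_t hi) {
    std::size_t mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo]) std::swap(a[mid], a[lo]);
    if (a[hi]  < a[lo]) std::swap(a[hi],  a[lo]);
    if (a[hi]  < a[mid]) std::swap(a[hi], a[mid]);
    return mid;  // index of the pivot (the median of the three samples)
}
```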

At some point the implementation grew a bug which only appeared with very large data sets. For debugging purposes I reduced the limit where introsort switches from quicksort to insertion sort from 16 to 8. I got the bug fixed, but the change made sorting a lot slower. As it should.

But this raises a question, namely would increasing the limit from 16 to 32 make things faster? It turns out that it did. A lot. Out of all perf improvements I implemented, this was the one that yielded the biggest improvement by a fairly wide margin. Going to 64 elements made it even faster, but that made other algorithms using insertion sort slower, so 32 it is. For now at least.
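To make the role of this threshold concrete, here is a minimal introsort skeleton. Only the cutoff value 32 comes from the post; everything else (the naive middle pivot, the heapsort fallback via `std::partial_sort`, the final insertion-sort pass) is a generic textbook structure, not Pystd's implementation:

```cpp
// Minimal introsort skeleton showing where the small-partition cutoff sits.
// (Illustrative sketch; only the cutoff value 32 comes from the post.)
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kInsertionSortLimit = 32;  // was 16; raising it gave the biggest win

void introsort_impl(std::vector<long>& a, std::size_t lo, std::size_t hi, int depth) {
    while (hi - lo > kInsertionSortLimit) {  // leave small partitions for insertion sort
        if (depth-- == 0) {
            // Recursion too deep: fall back to heapsort on this range.
            std::partial_sort(a.begin() + lo, a.begin() + hi, a.begin() + hi);
            return;
        }
        long pivot = a[lo + (hi - lo) / 2];  // naive middle pivot, for brevity
        // Three-way partition: [lo,m1) < pivot, [m1,m2) == pivot, [m2,hi) > pivot.
        std::size_t m1 = std::partition(a.begin() + lo, a.begin() + hi,
                                        [pivot](long x) { return x < pivot; }) - a.begin();
        std::size_t m2 = std::partition(a.begin() + m1, a.begin() + hi,
                                        [pivot](long x) { return !(pivot < x); }) - a.begin();
        introsort_impl(a, lo, m1, depth);
        lo = m2;  // loop on the right half instead of recursing
    }
}

void introsort(std::vector<long>& a) {
    introsort_impl(a, 0, a.size(),
                   a.empty() ? 0 : 2 * (int)std::log2((double)a.size()));
    // Quicksort left every small partition unsorted; one insertion-sort
    // pass over the nearly-sorted array finishes the job cheaply.
    for (std::size_t i = 1; i < a.size(); ++i)
        for (std::size_t j = i; j > 0 && a[j] < a[j - 1]; --j)
            std::swap(a[j], a[j - 1]);
}
```

A larger `kInsertionSortLimit` means quicksort stops recursing sooner and the final insertion-sort pass handles longer runs; the trade-off the post describes is finding the value where insertion sort's cache-friendly behavior on small runs outweighs its quadratic cost.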

After a few final tweaks I managed to finally beat stdlibc++. By how much you ask? Pystd's best observed time was 0.754 seconds while stdlibc++'s was 0.755 seconds. And it happened only once. But that's enough for me.
