You can beat the binary search

Original link: https://lemire.me/blog/2026/04/27/you-can-beat-the-binary-search/

## Summary: the SIMD Quad search algorithm

The article details the development of the "SIMD Quad" algorithm, designed to efficiently locate values in sorted arrays of 16-bit unsigned integers. Traditional binary search works well, but the author aims to do better by exploiting modern processor features, specifically data-parallel (SIMD) instructions. The SIMD Quad algorithm combines a quaternary interpolation search with SIMD instructions: it divides the array into 16-element blocks, uses interpolation to quickly identify the block likely to contain the target value, and then uses SIMD instructions to compare the target against all 16 elements of that block at once. Benchmarks on Intel and Apple processors show that SIMD Quad consistently outperforms binary search. The advantage is most pronounced on the Intel system with a cold cache, and the Apple system also gains in the cold-cache case. The quaternary component proved beneficial on Intel, improving memory-level parallelism. The author concludes that adapting algorithms to modern hardware parallelism can yield significant gains over textbook methods. The source code is publicly available for further study.

## Original article

We sometimes have to look for a value in a sorted array. The simplest algorithm consists in just going through the values one by one, until we encounter the value, or exhaust the array. We sometimes call this algorithm a linear search. In C++, you can get the desired effect with the std::find function.

For large arrays, you can do better with a binary search. Binary search is a classic algorithm that efficiently locates a target value in a sorted array by repeatedly dividing the search interval in half. Starting with the entire array, it compares the target to the middle element: if the target is smaller, it discards the upper half; if larger, it discards the lower half. This process continues until the target is found or the interval is empty. It is much faster than linear search for large datasets. In C++, this is implemented by the std::binary_search function, which returns a boolean indicating whether the value is present.

The popular Roaring Bitmap format uses arrays of 16-bit integers of size ranging from 1 to 4096. We sometimes have to check whether a value is present. We use a binary search.

I wanted a faster approach. I had two insights.

  1. Virtually all processors today have data parallel instructions (sometimes called SIMD) that can check several values at once. Both 64-bit ARM and x64 processors (Intel/AMD) always support comparing eight 16-bit integers with a target value using a single instruction. This suggests that you should not bother going down in the binary search to blocks that are smaller than eight elements. And you may also want to cheaply compare sixteen elements or more.
  2. The binary search checks one value at a time. However, recent processors can load and check more than one value at once: they have excellent memory-level parallelism. This suggests that instead of a binary search, we might want to try a quaternary search: instead of splitting the array in halves, we split it in quarters. The net result might be a few more instructions, but the instruction count is likely not the limiting factor.

Thus, I created something I call the SIMD Quad algorithm. It is an efficient search algorithm for sorted arrays of 16-bit unsigned integers, combining a quaternary interpolation search with SIMD (Single Instruction, Multiple Data). The algorithm divides the array into fixed-size blocks of 16 elements (except maybe for the last block) and uses the last element of each block as interpolation keys to quickly narrow down the search to a single block, then employs SIMD instructions to check all 16 elements in that block simultaneously.

The core idea is to perform a hierarchical search: first, use interpolation search on a coarser level (block boundaries) to find the likely block containing the target value, then switch to SIMD for fine-grained parallel checking within the block. This hybrid approach leverages the strengths of both algorithmic optimization (interpolation search reduces comparisons logarithmically) and hardware acceleration (SIMD checks multiple elements at once).

  1. Initial Check: If the array has fewer than 16 elements, perform a simple linear search through all elements.
  2. Block Division: Divide the array into blocks of 16 consecutive elements. For an array of size cardinality, there are num_blocks = cardinality / 16 full blocks.
  3. Quaternary Interpolation Search: Use the last element of each block (at positions 16-1, 32-1, etc.) as keys for interpolation. The search performs a quaternary (base-4) interpolation to find the block where the target pos is likely located. This involves comparing the target against quarter-points of the current search range and adjusting the base accordingly.
  4. Block Selection: After narrowing down, select the appropriate block index lo based on the interpolation results.
  5. SIMD Check: If a valid block is found, load the 16 elements into SIMD registers (using NEON on ARM or SSE2 on x64) and perform parallel equality comparisons with the target value. If any match is found, return true.
  6. Remainder Check: For any elements not in full blocks (remainder), perform a linear search.

How does it do? I wrote a benchmark. The benchmark works as follows. For each array size from 2 to 4096 elements, it generates 100,000 sorted arrays of 16-bit unsigned integers. For each size, it performs 10 million membership queries in “cold” mode (each query searches a different array, simulating cache misses) and 10 million queries in “warm” mode (queries are grouped by array, with each array being searched 100 times consecutively, simulating cache hits). The benchmark measures the average time per query for three algorithms: linear search (std::find), binary search (std::binary_search), and the new SIMD Quad algorithm.

I use two systems: an Apple M4 with Apple LLVM and an Intel Emerald Rapids processor with GCC.

Firstly, let us compare the linear search with the binary search.

Intel/GCC:

Apple/LLVM:

The result is clear. The binary search beats the linear search as soon as the arrays get large. That is to be expected.

On a cold cache, the linear search is relatively worse. That is to be expected because it accesses more data, causing more cache faults.

We have established that the binary search is the net winner over the linear search. Let us now compare with the SIMD Quad algorithm.

Intel/GCC:

Apple/LLVM:

The results differ markedly between the Intel and Apple platforms. On the Intel platform, the SIMD Quad is more than twice as fast as the binary search on a warm cache, with lesser benefits on a cold cache. On the Apple platform, the reverse is true: it is with a cold cache that the SIMD Quad is more than twice as fast, while the benefits on a warm cache are more marginal.

But the important point is that, in all instances, SIMD Quad is faster than the binary search.

The SIMD component of the algorithm is rather straightforward: we use specialized instructions that save work. So it is easy to see why it might make things faster. There are few instructions, fewer branches.

But what about the ‘quad’ part? Does it matter? To find out, I tried a binary version of the same algorithm: it keeps the same SIMD optimization, but drops the quaternary interpolation search in favor of a standard binary search.

Intel/GCC:

Apple/LLVM:

To put it in simple terms, the quad approach has little effect on the Apple platform, but it is a decent optimization on the Intel platform for large arrays in the cold case. The quaternary search better exploits the memory-level parallelism on my Intel server.

My source code is available.

Conclusion. What my results suggest is that while a textbook binary search is a decent algorithm, you can do better in ways that matter. Standard algorithms were often not designed for computers that have so much parallelism. The SIMD Quad algorithm tries to leverage both the memory-level and data parallelism. Further, I suspect that we can do even better than my algorithm. Let us get creative!

Further reading: Faster intersections between sorted arrays with shotgun

For reference, here is the full implementation, with the headers it needs added:

```cpp
#include <cstdint>
#ifdef __ARM_NEON
#include <arm_neon.h>
#else
#include <emmintrin.h> // SSE2
#endif

bool simd_quad(const uint16_t *carr, int32_t cardinality,
               uint16_t pos) {
    constexpr int32_t gap = 16;
    // Tiny arrays: plain linear search.
    if (cardinality < gap) {
        for (int32_t j = 0; j < cardinality; j++) {
            if (carr[j] == pos) return true;
        }
        return false;
    }
    int32_t num_blocks = cardinality / gap;
    int32_t base = 0;
    int32_t n = num_blocks;
    // Quaternary narrowing over the last element of each block.
    while (n > 3) {
        int32_t quarter = n >> 2;

        int32_t k1 = carr[(base + quarter + 1) * gap - 1];
        int32_t k2 = carr[(base + 2 * quarter + 1) * gap - 1];
        int32_t k3 = carr[(base + 3 * quarter + 1) * gap - 1];

        int32_t c1 = (k1 < pos);
        int32_t c2 = (k2 < pos);
        int32_t c3 = (k3 < pos);

        base += (c1 + c2 + c3) * quarter;
        n -= 3 * quarter;
    }
    // Finish with ordinary binary steps on the remaining blocks.
    while (n > 1) {
        int32_t half = n >> 1;
        base = (carr[(base + half + 1) * gap - 1] < pos)
                 ? base + half : base;
        n -= half;
    }
    int32_t lo = (carr[(base + 1) * gap - 1] < pos)
                 ? base + 1 : base;

    if (lo < num_blocks) {
        // Check all 16 elements of the selected block at once.
        const uint16_t *blk = carr + lo * gap;
#ifdef __ARM_NEON
        uint16x8_t needle = vdupq_n_u16(pos);
        uint16x8_t v0 = vld1q_u16(blk);
        uint16x8_t v1 = vld1q_u16(blk + 8);
        uint16x8_t hit = vorrq_u16(vceqq_u16(v0, needle),
                                   vceqq_u16(v1, needle));
        return vmaxvq_u16(hit) != 0;
#else
        __m128i needle = _mm_set1_epi16((short)pos);
        __m128i v0 = _mm_loadu_si128((const __m128i *)blk);
        __m128i v1 = _mm_loadu_si128((const __m128i *)(blk + 8));
        __m128i hit = _mm_or_si128(_mm_cmpeq_epi16(v0, needle),
                                   _mm_cmpeq_epi16(v1, needle));
        return _mm_movemask_epi8(hit) != 0;
#endif
    }

    // Leftover elements past the last full block.
    for (int32_t j = num_blocks * gap; j < cardinality; j++) {
        uint16_t v = carr[j];
        if (v >= pos) return (v == pos);
    }
    return false;
}
```
