So, you want to chunk really fast?

Original link: https://minha.sh/posts/so,-you-want-to-chunk-really-fast

## memchunk: a fast text chunking library for RAG

The author built memchunk, a high-performance text chunking library for retrieval-augmented generation (RAG) pipelines. Development grew out of the need for speed while benchmarking on large datasets such as Wikipedia. Traditional chunking approaches often split text naively by character count, which breaks sentence structure and degrades retrieval quality. A better approach is delimiter-based chunking: splitting at semantic boundaries such as periods and newlines.

memchunk builds on the `memchr` crate for optimized byte search. For 1-3 delimiters it uses `memchr`'s SIMD instructions (AVX2/SSE2) for extremely fast searching. With 4 or more delimiters it switches to a lookup table for O(1) per-byte checks. A key optimization is searching in *reverse*: moving backwards from the desired chunk size, the first delimiter found is the split point, minimizing the number of operations.

Benchmarks show memchunk reaching 164 GB/s of throughput, far ahead of other Rust chunking libraries (36x faster than kiru, and thousands of times faster than Python alternatives such as LangChain). Python and WebAssembly (WASM) bindings are provided and return zero-copy views to minimize overhead. memchunk prioritizes speed through strategic algorithm selection and efficient memory management.

## memchunk: fast text chunking for LLM pipelines

The maintainers of the Chonkie chunking library built **memchunk**, a new library that reaches text chunking speeds of up to **1 TB/s**. The work was driven by the need for real-time research agents that continuously process data for updates.

The core innovation uses SIMD instructions to quickly locate delimiters (\n, ., ?, !, ;) in text. One key optimization maps bytes by their low 4 bits, which allows a single shuffle instruction instead of a lookup table when the delimiters are unique within that bit range.

While some commenters questioned the practical benefit given the cost of embedding generation, the author argues that chunking speed becomes critical when processing terabyte-scale datasets and real-time workflows. They favor large chunk sizes with overlap to maximize retrieval performance, accepting trade-offs in "perfect" splitting.

**memchunk** is now integrated into Chonkie as the `FastChunker` and can be installed via pip. The project aims for "good enough" chunking quality, prioritizing speed and embedding quality over exact splits.

Original article

we’ve been working on chonkie, a chunking library for RAG pipelines, and at some point we started benchmarking on wikipedia-scale datasets.

that’s when things started feeling… slow.

not unbearably slow, but slow enough that we started wondering: what’s the theoretical limit here? how fast can text chunking actually get if we throw out all the abstractions and go straight to the metal?

this post is about that rabbit hole, and how we ended up building memchunk.

what even is chunking?

if you’re building anything with LLMs and retrieval, you’ve probably dealt with this: you have a massive pile of text, and you need to split it into smaller pieces that fit into embedding models or context windows.

the naive approach is to split every N characters. but that’s dumb — you end up cutting sentences in half, and your retrieval quality tanks.

the smart approach is to split at semantic boundaries: periods, newlines, question marks. stuff that actually indicates “this thought is complete.”

"Hello world. How are you?"
     → ["Hello world.", " How are you?"]

why delimiters are enough

there are fancy chunking strategies out there — sentence splitters, recursive chunkers, semantic chunkers that use embeddings. but for most use cases, the key thing is just not cutting sentences in half.

token-based and character-based chunkers don’t care where sentences end. they just split at N tokens or N bytes. that means you get chunks like:

"The quick brown fox jumps over the la"
"zy dog."

the embedding for that first chunk is incomplete. it’s a sentence fragment.

delimiter-based chunking avoids this. if you split on . and ? and \n, your chunks end at natural boundaries. you don’t need NLP pipelines or embedding models to find good split points — just byte search.
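for instance, finding a boundary really is just a byte scan (a tiny illustration, not memchunk code):

// conceptually: a chunk boundary is just the position of a delimiter byte,
// found by a plain byte scan (no tokenizer or embedding model involved)
let text = b"Hello world. How are you?";
let boundary = text.iter().position(|&b| b == b'.' || b == b'?' || b == b'\n');
assert_eq!(boundary, Some(11)); // the "." right after "world"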

simple concept. but doing it fast? that’s where things get interesting.

enter memchr

the memchr crate by Andrew Gallant is the foundation here. it’s a byte search library with multiple layers of optimization.

the fallback: SWAR

even without SIMD instructions, memchr is fast. the generic implementation uses SWAR (SIMD Within A Register) — a clever trick that processes 8 bytes at a time using regular 64-bit operations.

the core idea: to find a byte in a chunk, XOR the chunk with the needle splatted across all 8 positions. any matching byte becomes zero. then use bit manipulation to detect if any byte is zero:

fn has_zero_byte(x: usize) -> bool {
    const LO: usize = 0x0101010101010101;
    const HI: usize = 0x8080808080808080;
    (x.wrapping_sub(LO) & !x & HI) != 0
}
 
// needle = 0x2E (period), splatted = 0x2E2E2E2E2E2E2E2E
// chunk XOR splatted -> any matching byte becomes 0x00
// has_zero_byte detects if any lane is zero

the has_zero_byte trick works because subtracting 1 from a zero byte causes a borrow that propagates to the high bit. no branches, no loops over individual bytes.
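putting the pieces together, a minimal SWAR search over one 8-byte chunk might look like this (an illustrative sketch, not memchr's actual implementation):

// minimal sketch: find the first occurrence of `needle` within an 8-byte chunk
fn find_in_chunk(chunk: [u8; 8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101010101010101;
    const HI: u64 = 0x8080808080808080;
    // load little-endian so byte 0 of the chunk is the least significant byte
    let word = u64::from_le_bytes(chunk);
    // splat the needle across all 8 lanes and XOR: matching bytes become 0x00
    let x = word ^ (LO * (needle as u64));
    // the has_zero_byte trick: mark the high bit of every zero byte
    let found = x.wrapping_sub(LO) & !x & HI;
    if found == 0 {
        None
    } else {
        // lowest marked byte = first match in memory order
        Some((found.trailing_zeros() / 8) as usize)
    }
}

// find_in_chunk(*b"hello. w", b'.') == Some(5)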

the fast path: AVX2/SSE2

on x86_64 with SIMD support, memchr upgrades to vector instructions. AVX2 processes 32 bytes at once:

// broadcast the needle to all 32 lanes
let needle_vec = _mm256_set1_epi8(needle as i8);
// load 32 bytes from the haystack (unaligned load)
let chunk = _mm256_loadu_si256(haystack.as_ptr() as *const __m256i);
// compare all 32 bytes simultaneously; matching lanes become 0xFF
let matches = _mm256_cmpeq_epi8(chunk, needle_vec);
// extract the high bit of each lane into a 32-bit mask
let mask = _mm256_movemask_epi8(matches);

if mask is non-zero, at least one byte matched. trailing_zeros() gives the position.
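in code, that last step is just a bit scan over the mask (illustrative):

// the movemask packs one bit per byte lane into a 32-bit integer
if mask != 0 {
    // index of the first matching byte within this 32-byte chunk
    let pos_in_chunk = (mask as u32).trailing_zeros() as usize;
}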

memchr also has a hybrid strategy: it stores both SSE2 (16-byte) and AVX2 (32-byte) searchers, using SSE2 for small haystacks and AVX2 for larger ones.

the API

memchr provides three variants:

  • memchr(needle, haystack) — find one byte
  • memchr2(n1, n2, haystack) — find either of two bytes
  • memchr3(n1, n2, n3, haystack) — find any of three bytes

for two or three needles, it runs multiple comparisons and ORs the results together. but notice: it stops at three.
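a quick usage sketch of the three variants (each returns the byte offset of the first match):

use memchr::{memchr, memchr2, memchr3};

let haystack = b"hello world. how are you?";
assert_eq!(memchr(b'.', haystack), Some(11));              // one needle
assert_eq!(memchr2(b'.', b'?', haystack), Some(11));       // either of two
assert_eq!(memchr3(b'.', b'?', b'!', haystack), Some(11)); // any of three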

the three-character limit

why only three? from memchr’s source code:

only one, two and three bytes are supported because three bytes is about the point where one sees diminishing returns. beyond this point and it’s probably (but not necessarily) better to just use a simple [bool; 256] array or similar.

each additional needle requires another comparison and OR operation. with 3 needles you’re doing 3 broadcasts, 3 comparisons, and 2 ORs per 32-byte chunk. a 4th needle adds another ~33% overhead, and the gains from SIMD start looking less impressive compared to simpler approaches.

the memchr authors explicitly chose not to implement memchr4 or beyond — it’s not a limitation, it’s a design decision.

so what happens when you want to split on \n, ., ?, !, and ;? that’s 5 delimiters. memchr can’t help directly.

the fallback: lookup tables

when you have more than 3 delimiters, we fall back to a classic technique: the 256-entry lookup table.

fn build_table(delimiters: &[u8]) -> [bool; 256] {
    let mut table = [false; 256];
    for &byte in delimiters {
        table[byte as usize] = true;
    }
    table
}
 
fn is_delimiter(byte: u8, table: &[bool; 256]) -> bool {
    table[byte as usize]  // single array lookup, no branching
}

this is O(1) per byte — just an array index. no loops, no comparisons, no branching. modern CPUs love this because it’s completely predictable.

is it as fast as SIMD? no. but it’s still really fast, and it handles arbitrary delimiter sets.

why search backwards?

there’s one more trick: we search backwards from chunk_size, not forwards from 0.

if you search forward, you need to scan through the entire window to find where to split. you’d find a delimiter at byte 50, but you can’t stop there — there might be a better split point closer to your target size. so you keep searching, tracking the last delimiter you saw, until you finally cross the chunk boundary. that’s potentially thousands of matches and index updates.

backward search? one lookup. start at chunk_size, search backward, and the first delimiter you hit is your split point. done.

target_size = 100
text: "Hello. World. This is a test. Here's more text..."

forward search:
  - find "." at 6, save it, keep going
  - find "." at 13, save it, keep going
  - find "." at 30, save it, keep going
  - ... eventually cross boundary, use last saved position
  → multiple searches, lots of bookkeeping

backward search:
  - start at position 100, search backward
  - find "." at 30, done
  → one search, no bookkeeping

fewer operations means less overhead. we use memrchr (reverse memchr) instead of memchr for exactly this reason.
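in memchr terms, the whole split decision for one window is a single call (a minimal sketch, assuming the delimiters are ., ? and \n):

// minimal sketch: choose a split point by searching backward from the target size
fn split_point(text: &[u8], target_size: usize) -> usize {
    let window = &text[..target_size.min(text.len())];
    match memchr::memrchr3(b'.', b'?', b'\n', window) {
        Some(pos) => pos + 1, // cut just after the delimiter
        None => window.len(), // no delimiter in the window: hard cut
    }
}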

putting it together

memchunk automatically picks the right strategy:

fn find_last_delimiter(window: &[u8], delimiters: &[u8], table: Option<&[bool; 256]>) -> Option<usize> {
    if let Some(t) = table {
        // 4+ delimiters: use lookup table
        window.iter().rposition(|&b| t[b as usize])
    } else {
        // 1-3 delimiters: use SIMD-accelerated memchr
        match delimiters.len() {
            1 => memchr::memrchr(delimiters[0], window),
            2 => memchr::memrchr2(delimiters[0], delimiters[1], window),
            3 => memchr::memrchr3(delimiters[0], delimiters[1], delimiters[2], window),
            _ => None,
        }
    }
}

the table is only built once, lazily, on the first iteration. after that, it’s pure speed.
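for completeness, here's roughly what the driver loop looks like once strategy selection and backward search are combined (a simplified sketch reusing build_table and find_last_delimiter from above, not memchunk's exact code):

// simplified sketch: walk the text, cut each chunk at the last delimiter before
// chunk_size, and return (start, end) byte offsets instead of copied strings.
// assumes chunk_size > 0.
fn chunk_offsets(text: &[u8], delimiters: &[u8], chunk_size: usize) -> Vec<(usize, usize)> {
    // only pay for the lookup table when memchr can't handle the delimiter count
    let table: Option<[bool; 256]> =
        if delimiters.len() > 3 { Some(build_table(delimiters)) } else { None };

    let mut offsets = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let end = (start + chunk_size).min(text.len());
        let window = &text[start..end];
        let split = match find_last_delimiter(window, delimiters, table.as_ref()) {
            // cut just after the delimiter (but never on the final, partial window)
            Some(pos) if end < text.len() => start + pos + 1,
            // no delimiter found, or we've reached the end: hard cut at the window end
            _ => end,
        };
        offsets.push((start, split));
        start = split;
    }
    offsets
}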

benchmarks

we ran this against the other chunking libraries we could find, both rust and python:

| library       | throughput  | vs memchunk    |
|---------------|-------------|----------------|
| memchunk      | 164 GB/s    | -              |
| kiru          | 4.5 GB/s    | 36x slower     |
| langchain     | 0.35 GB/s   | 469x slower    |
| semchunk      | 0.013 GB/s  | 12,615x slower |
| llama-index   | 0.0035 GB/s | 46,857x slower |
| text-splitter | 0.0017 GB/s | 96,471x slower |

the exact throughput varies with file size and delimiter count — larger files and fewer delimiters favor SIMD more. but even in pessimistic cases (small chunks, many delimiters), we’re still in the hundreds of GB/s range.

for reference, that’s chunking english wikipedia (~20GB) in roughly 120 milliseconds.

python and wasm bindings

since chonkie is a python library, we needed python bindings. we also added wasm for browser/node use cases.

python:

from memchunk import chunk

text = "Hello world. How are you?"
for slice in chunk(text, size=4096, delimiters=".?"):
    print(bytes(slice))

javascript:

import { chunk } from "memchunk"

const text = new TextEncoder().encode("Hello world. How are you?")
for (const slice of chunk(text, { size: 4096, delimiters: ".?" })) {
  console.log(new TextDecoder().decode(slice))
}

both return zero-copy views (memoryview in python, subarray in js), so you get most of the performance even across FFI boundaries.

wrapping up

the techniques here aren’t novel — SIMD search and lookup tables are textbook stuff. the win comes from knowing when to use which:

  1. 1-3 delimiters → memchr’s SIMD search
  2. 4+ delimiters → 256-entry lookup table
  3. minimize allocations → return offsets, let the caller slice

memchunk is on crates.io, pypi, and npm if you want to try it out.
