Bloom filters are good for search that does not scale

Original link: https://notpeerreviewed.com/blog/bloom-filters/

## Bloom Filters for Full-Text Search: A Scalability Study

This article explores using bloom filters to build a space-efficient full-text search index, even for large document collections. The core idea exploits the small size of bloom filters, which could allow static websites to ship a client-side index, but it runs into trouble when scaled beyond a small number of documents.

Initial attempts to improve query performance by sorting or tree-structuring the filters prove ineffective, because natural language is high-dimensional and inherently overlapping. A more promising approach is an "inverted index of bloom filters": a tree that maps dictionary words to document filters, mimicking a traditional inverted index but with a potentially smaller footprint.

Analysis reveals a key limitation, however. While bloom filters efficiently compress a dictionary *into* a filter, they share no information *between* filters. Each document's filter must re-encode all of its words, so space consumption grows with the number of documents. Beyond roughly 7,200 documents, a standard inverted index is more space-efficient.

Ultimately, bloom filters excel at compressing a large dictionary into a *small number* of filters. The advantage shrinks as the number of filters grows, underscoring the importance of considering how data structures interact when choosing one.


Original article

A great blog post from 2013 describes using bloom filters to build a space-efficient full text search index for small numbers of documents. The algorithm is simple: Per document, create a bloom filter of all its words. To query, simply check each document's bloom filter for the query terms.
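The per-document scheme can be sketched in a few lines of Python. The filter size and hash count below are illustrative assumptions, not the 2013 post's exact parameters; a real implementation would tune them to the expected number of words per document.

```python
import hashlib

M = 1024  # bits per filter (illustrative)
K = 3     # hash functions per word (illustrative)

def positions(word: str) -> list[int]:
    """Derive K bit positions for a word from a SHA-256 digest."""
    digest = hashlib.sha256(word.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

def make_filter(text: str) -> int:
    """Encode every word of a document into one integer bitmask."""
    bits = 0
    for word in text.lower().split():
        for p in positions(word):
            bits |= 1 << p
    return bits

def matches(filt: int, term: str) -> bool:
    """True if the filter may contain the term (false positives are possible)."""
    return all(filt >> p & 1 for p in positions(term))

def search(filters: dict[str, int], term: str) -> list[str]:
    """O(number-of-documents): check every document's filter."""
    return [doc for doc, f in filters.items() if matches(f, term.lower())]
```

Because bloom filters have no false negatives, every document that actually contains the term is guaranteed to be returned; the occasional false positive is the price for the compact encoding.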

With a query time complexity of O(number-of-documents), we can forget about using this on big corpuses, right? In this blog post I propose a way of scaling the technique to large document corpuses (e.g. the web) and discuss why that is a bad idea.

Fun fact: There is a nice implementation of this exact algorithm that is still used in the wild. But let's get into it.

Why even try this?

The bloom filter index's big selling point is its small size. It allows static websites with dozens of pages to ship a full text search index to the client that is no bigger than a small image. An equivalent inverted index, the traditional textbook approach to keyword-based full text search, would be several times bigger.

But index size is not only relevant on small blog websites. If we could scale this technique to larger document corpuses and achieve similar space savings, that would be huge!

The main thing that seems to stand in our way is query performance. Instead of always checking every document's bloom filter, we will try to construct an index that only checks a small subset of filters, but still finds all matching documents.

Some Ideas that don't work at all

Look, I brainstormed a bunch of ideas for how to improve the bloom filter based index. I will quickly go over two of them, because identifying and discarding ideas that will not work is an important part of science and engineering.

Sort the filters

If we sort the filters by some metric, for example by the most to least significant bits, then we can use a binary search algorithm or something like that, right? - Wrong.

We can construct a simple counter example to show that this does not work. Here the query is matched by the first and the last filter in a sorted list of filters.

[Diagram: the query matches the first and the last filter in a sorted list]
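The counterexample is easy to reproduce in code. Representing each filter as an integer bitmask and sorting the filters numerically, a single-bit query matches the first and the last filter but not the one in between, so no binary search can narrow down the candidates:

```python
# Three filters sorted by numeric value; a query's bit pattern must be
# a subset of a filter's bits for that filter to match.
filters = sorted([0b0001, 0b0110, 0b1001])  # -> [1, 6, 9]
query = 0b0001  # the single bit set by the query term's hash

matching = [i for i, f in enumerate(filters) if f & query == query]
assert matching == [0, 2]  # first and last match; the middle does not
```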

Tree of filters

Plain sorting does not work, but what if we structure our big set of filters into a tree? Imagine it sort of like this, but much bigger.

[Diagram: a small tree of bloom filters]

At each branch node, we construct an aggregate filter that encodes all documents reachable from that branch. Aggregate filters are constructed by a simple bitwise OR of the child filters, like this.

[Diagram: an aggregate filter formed by the bitwise OR of its child filters]

When we get a query, we first check it against the top-level branch filters. If a filter does not match, e.g. the filter for documents 6-10, we can discard that entire branch of the tree for this query.
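A minimal sketch of such a tree, again with filters as integer bitmasks (the node layout here is a hypothetical choice for illustration): branch nodes hold the bitwise OR of their children's filters, and a query prunes any branch whose aggregate does not match.

```python
def aggregate(filters):
    """A branch filter is the bitwise OR of its children's filters."""
    acc = 0
    for f in filters:
        acc |= f
    return acc

def search_tree(node, query):
    """Collect document ids from leaves whose filter may match the query."""
    filt, payload = node
    if filt & query != query:
        return []            # prune this entire branch
    if isinstance(payload, int):
        return [payload]     # leaf: payload is a document id
    results = []
    for child in payload:    # branch: payload is a list of child nodes
        results += search_tree(child, query)
    return results

# Tiny example: four single-word documents under two branches.
leaves = [(0b0001, 0), (0b0010, 1), (0b0100, 2), (0b1000, 3)]
left = (aggregate(f for f, _ in leaves[:2]), leaves[:2])
right = (aggregate(f for f, _ in leaves[2:]), leaves[2:])
root = (left[0] | right[0], [left, right])
```

Here `search_tree(root, 0b0100)` descends only into the right branch and returns `[2]`; the left branch is discarded after a single filter check.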

Ideally, we would like to search as few branches of the tree as possible to improve performance. How many branches we need to search depends heavily on the partitioning of the documents. Intuitively, we can think of it like this: branch A should contain all documents with the words "dog", "cat", "bird", and branch B the documents with "car", "bus", "plane". The fewer branches each word appears in, the better.

Here comes the problem: What if there is a document that says "I took my cat on the bus today"?
This breaks our assumption above, and suddenly for the query "bus" we need to search both branches. You can imagine this happening for almost every word in the dictionary across all branches, because language is complex and lets us say so many different things in many different contexts.

Or in other words: Text documents are high-dimensional.
I recommend reading about the curse of dimensionality for an intuition of the issues this implies. Here it means that it is basically impossible to cluster text documents into disjoint subsets without significant overlap. For our search index, that means that even with this tree, we would still need to search almost every document for every query.

Inverted Index of Bloom Filters

The problem with our previous tree-based idea is that there is so much overlap between text documents. But I know one book that contains every word exactly once: the dictionary. We can construct a search tree over the entire dictionary, again based on bloom filters. Each leaf represents a set of words, and at each leaf we keep a list of pointers to the filter of every document that contains one of those words.

[Diagram: a simplified search tree over the English dictionary]

Maybe not so incidentally, this looks a lot like an inverted index. And it works! For any query term, we can walk the tree to the leaf that contains the term and then match only against the filters at that leaf. Instead of a hash table, as in the inverted index, our index uses a tree for the dictionary, but fundamentally it does a similar thing.
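A self-contained toy version of this idea, with the dictionary "tree" flattened to one bucket per first letter purely for brevity (the real structure would be a proper search tree over bloom-filtered word sets):

```python
import hashlib

def positions(word, m=256, k=3):
    """k bit positions per word, derived from a SHA-256 digest."""
    d = hashlib.sha256(word.encode()).digest()
    return {int.from_bytes(d[4 * i:4 * i + 4], "big") % m for i in range(k)}

def build(docs):
    # One bloom filter (as a set of bit positions) per document, plus a
    # flattened "dictionary tree": bucket -> documents containing a word
    # that falls into that bucket.
    filters, buckets = {}, {}
    for name, text in docs.items():
        bits = set()
        for word in text.lower().split():
            bits |= positions(word)
            buckets.setdefault(word[0], set()).add(name)
        filters[name] = bits
    return filters, buckets

def query(filters, buckets, term):
    term = term.lower()
    # Walk to the "leaf" for the term, then match only the filters there.
    candidates = buckets.get(term[0], ())
    return sorted(n for n in candidates if positions(term) <= filters[n])
```

Instead of checking every document, a query now touches only the filters listed at the leaf for its term, just like an inverted index touches only one posting list.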

The big difference is that the tree can be smaller than the hash table. Remember, size is the main reason to attempt this at all. Not only is there no empty space in a tree, but we also encode all the words in our bloom filters instead of storing them outright. Modern bloom filters (actually called Xor filters) require about ten bits per element [1], much less than the 8 bits per character required to store a full word.

As an aside, bloom filters are indeed already used in full text search for large-ish datasets, but in the form of skip indexes. In a skip index, a bloom filter is used to quickly check whether a large chunk of data contains a value (e.g. a word) at all. That way, a database can avoid reading chunks of data that do not contain any records for a given query. Until very recently this technique was used by the ClickHouse OLAP system for full text search [2]. It was superseded by a proper inverted index in 2025.

Why all of this is still a bad idea

We did it! We have a working idea for a bloom filter based search index that works for large document corpuses. The query time complexity is not as good as for an inverted index, but it is logarithmic with the number of documents. That is good enough if you ask me. So why do I write that it is still a bad idea?

Let's think again about what allows the bloom filter based index to be small. Instead of storing the entire dictionary in our index, we use bloom filters that require about ten bits per word. Ten bits per word. Ten bits per every word. Not per unique word. Every word in our document corpus (except duplicates within the same document).

To make our math exceedingly simple, let's say the English dictionary has about 500,000 unique words and every document contains 1,000 distinct words. At ten bits per entry, each document's filter is about 1.25 kB. Assume that words are on average ten characters long; then the dictionary requires 5 MB for the text alone. Add an assumed 4 MB for the inverted index's hash table and we get a lower bound of 9 MB for the inverted index.

Both indexes require similar amounts of space for document ids and pointers, so relative to the inverted index, our bloom filter index grows by 1.25 kB per document. Divide 9 MB by 1.25 kB and you find that at only 7,200 documents the inverted index becomes more space efficient than the bloom filter index. Of course the real numbers will differ and we are ignoring some details here, but the trend stays the same.
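The back-of-the-envelope arithmetic above, spelled out (all inputs are the rough assumptions from the text, including the assumed 4 MB of hash table overhead):

```python
DICT_WORDS = 500_000    # unique words in the dictionary
CHARS_PER_WORD = 10     # average word length, at one byte per character
WORDS_PER_DOC = 1_000   # distinct words per document
BITS_PER_ENTRY = 10     # bits per element in a modern (xor) filter

filter_bytes = WORDS_PER_DOC * BITS_PER_ENTRY // 8  # 1250 bytes = 1.25 kB
dict_text_bytes = DICT_WORDS * CHARS_PER_WORD       # 5 MB of dictionary text
hash_table_bytes = 4_000_000                        # assumed table overhead
inverted_index_floor = dict_text_bytes + hash_table_bytes  # 9 MB lower bound

# Per document, the bloom filter index grows by one filter; break-even is
# where that cumulative growth equals the inverted index's fixed cost.
break_even_docs = inverted_index_floor // filter_bytes
assert break_even_docs == 7200
```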

What is going on here is that while an inverted index must store every word in the dictionary exactly once, sharing the space when a word is reused, bloom filters do not share space amongst each other. Every document's bloom filter must encode all words in the document from scratch. If a word is contained in thousands of documents, that requires much more space than simply storing the word in plain text.

Conclusions

When you have a small number of documents relative to the size of your dictionary, bloom filters can indeed achieve a much smaller full text search index than is possible traditionally.

Bloom filters are space efficient when compressing a large dictionary into a small number of filters. As more filters share the same dictionary, this efficiency decreases. Intuitively this is because bloom filters cannot share information amongst each other. Each filter must encode its entire dictionary from scratch. An inverted index does not do this. It only stores the dictionary once and shares it for all documents, so it gets more space efficient with the number of documents.

More generally, there is no synergy between bloom filters. Each filter on its own is efficient, but as a whole system, a different approach might be more efficient. We can transfer this insight to other problem domains as well. For example, imagine a content moderation system on a social media platform that allows blocking individual accounts. If we have one global blocklist on our platform, a bloom filter can be an efficient (though maybe not ideal) implementation of this. But allow every user to create their own blocklist and a different design will be more scalable.
