Full-Text Search with DuckDB

Original link: https://peterdohertys.website/blog-posts/full-text-search-w-duckdb.html

## DuckDB Full-Text Search: A Quick Tour

This post explores DuckDB's full-text search (FTS) capabilities, following an earlier introduction to DuckDB basics. While basic text queries have their limits, FTS enables more comprehensive and configurable searches using algorithms such as Okapi BM25. DuckDB's FTS extension provides stemming (reducing words to their roots), stop-word removal, and accent stripping, and queries can be tuned with the term-frequency (k₁) and length-normalization (b) parameters. It is less feature-rich than solutions like Elasticsearch or Postgres, but it is a solid foundation. The author demonstrates FTS by processing a collection of emails, importing them into DuckDB, and creating an FTS index. He notes limitations, such as the lack of the term highlighting available in Postgres, and provides a Python script for pre-processing .eml files. The post details how parameters can be used to exclude transactional emails and to adjust scoring by term frequency and document length. Ultimately, DuckDB's FTS is well suited to exploratory data analysis thanks to its ease of setup and speed. More sophisticated solutions may fit advanced use cases, but DuckDB offers a capable and accessible entry point to full-text search. The author plans to explore vector search in a future post.


Published on: 4/29/2026

Overview

This is a follow-up to my first post about DuckDB: A Dab of DuckDB. If you’re new to DuckDB, you may want to start there.

The basic DuckDB workflow of making a data source quickly and easily discoverable is incredibly powerful … but there are limits. Some use cases, like searching the contents of historical publications or a tranche of emails, would be constrained by basic text queries. As mentioned in my first post, I’m interested in exploring some of the more powerful DuckDB features, and in this post I’ll be focused on full-text search (FTS). I have a decent amount of experience using other FTS solutions, like Elasticsearch and Postgres (both the built-in options and extensions like pgvector and pg_search). So, in this post I will take you through a quick tour of the current state of FTS in DuckDB.

An abbreviated FTS primer

A full FTS tutorial is outside the scope of this post; if you’re interested in learning more, the Postgres docs are a worthwhile read.

FTS allows for more comprehensive and configurable queries than what can be achieved using SQL operators like =, ilike or regexen. Query scores can also be tuned using algorithms, like Okapi BM25, which is what DuckDB offers.

Index options:

- Stemming: reduces words to a common root and handles some forms of inflection (walk, walks, walked, walking, etc.) but there are gaps for unconventional forms (e.g. mice and mouse)
- Stop words: removal of common "stop words" like "the", "and" and "of" whose presence may skew results
- Strip accents: normalize "á", "ä", and "a"
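
These normalization steps can be sketched in a few lines of Python. This is a hand-rolled toy (the stop-word list and suffix stripping are stand-ins I made up for illustration; a real engine uses a Snowball-style stemmer), but it shows what each option does to incoming text:

```python
import unicodedata

STOP_WORDS = {"the", "and", "of", "a", "to"}  # tiny illustrative list


def strip_accents(word):
    # Decompose characters, then drop the combining accent marks ("á" -> "a").
    return "".join(
        c for c in unicodedata.normalize("NFKD", word)
        if not unicodedata.combining(c)
    )


def crude_stem(word):
    # Toy suffix stripping; real engines use Snowball-style stemmers.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(text):
    tokens = [strip_accents(t.lower()) for t in text.split()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize("The cafés opened and the walking began"))
# -> ['cafe', 'open', 'walk', 'began']
```

Both the documents being indexed and the query terms pass through the same pipeline, which is why "walking" in a query can match "walked" in a document.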

Query functions:

- Okapi BM25 parameters

    - k₁: term frequency (are more instances meaningful?)
    - b: length normalization (is a longer document more meaningful?)
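
Both parameters appear directly in the per-term BM25 formula. Here is a minimal sketch in Python (with idf fixed to 1 for simplicity; a real implementation weights each term by its inverse document frequency):

```python
def bm25_term_score(tf, doc_len, avg_len, k1=1.2, b=0.75, idf=1.0):
    """One term's BM25 contribution: k1 controls how quickly repeated
    occurrences saturate, b controls how strongly document length
    normalizes the score."""
    length_norm = 1 - b + b * (doc_len / avg_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# k1: with a low k1, 1 vs. 8 occurrences score nearly the same;
# with a high k1, repetition keeps raising the score.
for k1 in (0.3, 3.0):
    once = bm25_term_score(tf=1, doc_len=100, avg_len=100, k1=k1)
    often = bm25_term_score(tf=8, doc_len=100, avg_len=100, k1=k1)
    print(f"k1={k1}: tf=1 -> {once:.2f}, tf=8 -> {often:.2f}")

# b: with b=0 a long document is never penalized; with b=1 the same
# term count scores lower in a document four times the average length.
print(bm25_term_score(tf=2, doc_len=400, avg_len=100, b=0.0))
print(bm25_term_score(tf=2, doc_len=400, avg_len=100, b=1.0))
```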

The features mentioned above are all present in DuckDB’s FTS extension but this only begins to scratch the surface of what is possible with FTS – especially when using some of the more fully-featured engines. The DuckDB feature set is a good start, though, and I’m sure functionality will be added over time and/or new extensions will be released. I imagine functions for phrase queries, vectors and support for pluggable synonym dictionaries are features contributors are already thinking about.

One example of where the current DuckDB feature-set fell short during my experiments was its lack of any way to highlight where query terms matched in the source data. Postgres offers ts_headline to address this but I found myself having to search query results for matches using tmux (a string of keystrokes I can never remember …) or by piping query results to grep and that was a bit frustrating.
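
In the meantime, a small Python helper can fake a ts_headline-style snippet over query results. This is my own quick sketch, not part of the DuckDB extension; note that it does plain substring matching, so it won't catch stemmed variants unless they share the query term as a prefix:

```python
import re


def headline(text, terms, window=40):
    # Find the first occurrence of any query term and return it with
    # surrounding context, wrapping every match in the snippet in <b> tags.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    m = pattern.search(text)
    if m is None:
        return None
    start = max(m.start() - window, 0)
    end = min(m.end() + window, len(text))
    snippet = text[start:end]
    return pattern.sub(lambda s: f"<b>{s.group(0)}</b>", snippet)


body = "We should talk about the budget before the board talks to the press."
print(headline(body, ["talk"]))
```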

While working on this post, I also came across the Snowball project, which is “a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it”. My understanding is that it’s the basis for stemming in most databases and client libraries. The Python snowballstemmer library can be used to quickly debug unexpected stemming issues, such as why certain word forms aren’t being matched.


# stemmer.py
# Usage: uv run stemmer.py
# /// script
# requires-python = "==3.13"
# dependencies = [
# "snowballstemmer==3.0.1",
# ]
# ///

from snowballstemmer import stemmer

en = stemmer("english")

print(en.stemWord("run"))     # -> run
print(en.stemWord("running")) # -> run

print(en.stemWord("mouse"))   # -> mous
print(en.stemWord("mice"))    # -> mice

Setting Up

Full-text search isn’t something DuckDB provides out-of-the-box but it is readily available via the Full-Text Search Extension.

Assuming you’ve already got DuckDB installed, installing the fts extension is as simple as starting a new session and running:

INSTALL fts;
LOAD fts;

Digging In

Let’s assume you have a multi-GB trove of emails and you want to search their contents to see what politicians, business leaders and celebrities are talking with each other about. In my corpus, there are 13,010 emails with the .eml extension which have a smattering of mime-types. DuckDB can’t import these natively, so we’ll have to do some pre-processing before we can create our database, index it and start querying. You should be able to find a tranche of emails like this one on archive.org, ddosecrets.org or even justice.gov but that’s left as an exercise for the reader. If you want to follow along, any collection of .emls should suffice.

Pre-Processing Files

I’m going to use Python to do the processing of the raw files. YMMV, but when it comes to Python tooling, I don’t think any solution comes close to being as simple and efficient as uv and that’s what I’ll be using. (I’ve given a few lightning talks on uv and really need to dedicate a blog post to it.)

The pre-processing workflow is admittedly quick and dirty and will readily discard emails which can’t be cleanly parsed:

- load the email file
- try to parse out its body contents
- parse out useful headers and other metadata to help flag marketing and transactional emails
- if successful, dump JSON to a file

# preprocess-emails.py
# Usage: uv run preprocess-emails.py
# /// script
# requires-python = "==3.13"
# dependencies = [
# "beautifulsoup4==4.14.3",
# ]
# ///

import email
import json
from bs4 import BeautifulSoup
from email import policy
from pathlib import Path


def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "head", "title", "meta"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)


def extract_body(msg):
    try:
        for kind in ("plain", "html"):
            part = msg.get_body(preferencelist=(kind,))
            if part is None:
                continue
            try:
                content = part.get_content()
            except Exception:
                continue
            if not (content and content.strip()):
                continue
            return html_to_text(content) if kind == "html" else content
        return None
    except Exception as e:
        print(f"Couldn't parse body: {e}")
        return None


for path in Path(".").glob("*.eml"):
    try:
        with open(path, "rb") as f:
            msg = email.message_from_binary_file(f, policy=policy.default)
        body = extract_body(msg)

        if not body:
            print(f"no body found for {path}")
            continue

        row = {
            "body": body,
            "date": str(msg["date"]),
            "file": path.name,
            "from": str(msg["from"]),
            "subject": str(msg["subject"]),
            "to": str(msg["to"]),
            "list_unsubscribe": str(msg.get("List-Unsubscribe", "")),
            "list_id": str(msg.get("List-Id", "")),
            "precedence": str(msg.get("Precedence", "")),
            "auto_submitted": str(msg.get("Auto-Submitted", "")),
            "x_mailer": str(msg.get("X-Mailer", "")),
            "return_path": str(msg.get("Return-Path", "")),
        }

        with open(f"{path}.json", "w") as f:
            f.write(json.dumps(row))

    except Exception as e:
        print(f"error parsing {path}: {e}")

Regularly Scheduled Programming

Import JSON and Populate DB

CREATE TABLE emails AS SELECT * FROM read_json('*.eml.json');

I love that DuckDB shows the progress of the import and how much RAM is being used.

Create and populate ID column

There might be a way to do this during import, and you could certainly insert IDs from the pre-processing loop, but I’d forgotten to do so; I’ll leave this sequence for posterity.

ALTER TABLE emails ADD COLUMN id INTEGER;
UPDATE emails SET id = rowid;

Create FTS Index

You can index one or more columns and there are optional parameters to control the stemmer, stop words, etc. See the docs here.

PRAGMA create_fts_index('emails', 'id', 'subject', 'body');

Start Digging!

There are various parameters which can be used to refine or broaden your queries as needed. See the docs here.

Basic query using default parameters and an attempt at excluding transactional emails and mailing lists:

SELECT id, body, fts_main_emails.match_bm25(id, 'talk') AS score
FROM emails
WHERE list_unsubscribe = ''
    AND precedence NOT IN ('bulk', 'list', 'junk')
    AND score IS NOT NULL
ORDER BY score DESC;
-- yields results for: "Talking", "talks", "talked", etc.

Use conjunctive parameter to require that all terms match:

SELECT id, body, fts_main_emails.match_bm25(id, 'detective trial', conjunctive := 1) AS score
FROM emails
WHERE list_unsubscribe = ''
    AND precedence NOT IN ('bulk', 'list', 'junk')
    AND score IS NOT NULL
ORDER BY score DESC;
-- yields only results matching *both* search terms

Use the Okapi k₁ and b parameters to weight term frequency and document length:

b:

-- b = 0: document length is effectively ignored
SELECT subject, "from", length(body) AS body_len,
       fts_main_emails.match_bm25(id, 'delivery', b := 0.0) AS score
FROM emails
WHERE score IS NOT NULL
ORDER BY score DESC
LIMIT 1;

┌────────────────────────────────────────┬─────────────────────────────────────────┬──────────┐
│                subject                 │                  from                   │ body_len │
│                varchar                 │                 varchar                 │  int64   │
├────────────────────────────────────────┼─────────────────────────────────────────┼──────────┤
│ Our Best Deal of the Year: Save 50% fo │ The New York Times <[email protected] │     4121 │
│ r 26 Weeks on a Times Subscription. Sa │ rktimes.com>                            │          │
│ le Ends 12/2                           │                                         │          │
└────────────────────────────────────────┴─────────────────────────────────────────┴──────────┘


-- b = 1: long documents (e.g. newsletter and articles) are penalized
SELECT subject, "from", length(body) AS body_len,
       fts_main_emails.match_bm25(id, 'delivery', b := 1.0) AS score
FROM emails
WHERE score IS NOT NULL
ORDER BY score DESC
LIMIT 1;

┌────────────────────────┬───────────────────────────────────────┬──────────┐
│        subject         │                 from                  │ body_len │
│        varchar         │                varchar                │  int64   │
├────────────────────────┼───────────────────────────────────────┼──────────┤
│  DELIVERY for Member   │ Sapply <[email protected]> │      197 │
└────────────────────────┴───────────────────────────────────────┴──────────┘

k₁:

I wasn’t able to find a good example of how k₁ would be useful in my existing corpus, so I created two synthetic emails/rows which demonstrate how the k₁ parameter weights term frequency. One email mentions “budget” once and the other mentions “budget” repeatedly, as it’s the focus of the email. When k₁ is low, the scores are close (they actually match in this case, but that wouldn’t be true across a real data set); when k₁ is high, the email which is actually about “the budget” is scored higher.

SELECT file, subject,
    length(regexp_extract_all(lower(body), 'budget')) AS instances,
    fts_main_emails.match_bm25(id, 'budget', k := 0.3) AS k_low,
    fts_main_emails.match_bm25(id, 'budget', k := 3.0) AS k_high
FROM emails
WHERE file LIKE 'demo-%'
ORDER BY k_high DESC;

┌──────────────────────────┬───────────┬────────────────────┬────────────────────┐
│           file           │ instances │       k_low        │       k_high       │
│         varchar          │   int64   │       double       │       double       │
├──────────────────────────┼───────────┼────────────────────┼────────────────────┤
│ demo-budget-focused.eml  │         8 │ 1.5647278150907307 │  5.693932504469362 │
│ demo-passing-mention.eml │         1 │ 1.5647278150907307 │ 3.2223555634031062 │
└──────────────────────────┴───────────┴────────────────────┴────────────────────┘

Using a low k₁, the scoring strategy is effectively, “did this term appear at all?”, while with a high k₁, repeated instances affect the ranking.

Cleaning Up

The index can be deleted using:

PRAGMA drop_fts_index('emails');

Summary

DuckDB’s FTS is not as feature-complete as that of Postgres or Elasticsearch. However, it’s still quite powerful and likely good enough for most exploratory use cases. If you determine that you need a more complex solution, it should be easy enough to dump your DuckDB database and import it into Postgres or Elasticsearch. The ease and speed with which it can be spun up against (almost) any data source makes it very compelling and something I will reach for when I start using DuckDB for Real Work.

I would like to continue this series on DuckDB and my next post may investigate the current state of vector search. Stay tuned.
