SQLite 中的短路相关子查询
Short-Circuiting Correlated Subqueries in SQLite

原始链接: https://emschwartz.me/short-circuiting-correlated-subqueries-in-sqlite/

## 使用标量子查询优化 Scour 查询 Scour 是一款每月摄取数百万条目的内容搜索引擎,最近实施了域名排除列表和付费墙内容过滤。在开发过程中,发现了一个重要的 SQL 优化:在不需要时,使用*不相关标量子查询*来绕过较慢的*相关子查询*。 核心排序查询最初使用相关子查询来过滤用户阻止列表中指定的域名。虽然对于*使用*排除规则的用户来说效率很高,但这为大多数未使用的用户增加了开销。解决方案是添加一个初步的不相关子查询,以检查用户*是否*有任何排除规则。如果没有,则完全跳过更昂贵的逐行域名检查,利用 SQLite 的短路评估。 基准测试显示,对于没有排除规则的用户,性能提高了 17%,而对于有排除规则的用户,开销可以忽略不计。这种方法也应用于付费墙内容过滤,在处理每个条目之前检查功能是否启用。作者还比较了 `NOT EXISTS` 与 `NOT IN` 和 `LEFT JOIN` 方法,发现带有短路功能的 `NOT EXISTS` 提供了最佳的整体性能。 这强调了在优化 SQL 查询时考虑常见用例的重要性——一个小小的改变可以显著提高大多数用户的性能。

黑客新闻 新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交 登录 SQLite 中短路相关的子查询 (emschwartz.me) 15 分,由 emschwartz 13 小时前发布 | 隐藏 | 过去 | 收藏 | 讨论 指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系 搜索:
相关文章

原文

I recently added domain exclusion lists and paywalled content filtering to Scour. This blog post describes a small but useful SQL(ite) query optimization I came across between the first and final drafts of these features: using an uncorrelated scalar subquery to skip a correlated subquery (if you don't know what that means, I'll explain it below).

Scour searches noisy sources for content related to users' interests. At the time of writing, it ingests between 1 and 3 million pieces of content from over 15,000 sources each month. For better and for worse, Scour does ranking on the fly, so the performance of the ranking database query directly translates to page load time.

The Ranking SQL Query

The main SQL query Scour uses for ranking applies a number of filters and streams the item embeddings through the application code for scoring.

Scour uses brute force search rather than a vector database, which works well enough for now because of three factors:

  1. Scour uses SQLite, so the data is colocated with the application code.
  2. It uses binary-quantized vector embeddings with Hamming Distance comparisons, which only take ~5 nanoseconds each.
  3. We care most about recent posts so we can significantly narrow the search set by publish date.

A simplified version of the query looks something like:

SELECT *
FROM items i
WHERE i.lang IN (SELECT lang FROM user_languages WHERE user_id = ?1)
AND i.published BETWEEN ?2 AND ?3
AND ...(more filters)...

The query plan shows that this makes good use of indexes:

QUERY PLAN
   |--SEARCH i USING INDEX idx_items_lang_published (lang=? AND published>? AND published<?)
   `--LIST SUBQUERY 1
      `--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)

Domain Filters Using Correlated Subqueries

To add user-specified domain blocklists, I created the user_excluded_domains table and added this filter clause to the main ranking query:

AND NOT EXISTS (
    SELECT 1
    FROM user_excluded_domains ued
    WHERE user_id = ?1
    AND ued.domain = i.domain
)

The domain exclusion table uses (user_id, domain) as a primary key, so the lookup is efficient. However, this lookup is done for every row returned from the first part of the query. This is a correlated subquery:

QUERY PLAN
   |--SEARCH i USING INDEX idx_items_lang_published (lang=? AND published>? AND published<?)
   |--LIST SUBQUERY 1
   |  `--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)
   `--LIST SUBQUERY 2
      `--SEARCH user_excluded_domains USING COVERING INDEX sqlite_autoindex_user_excluded_domains_1 (user_id=?)

A problem with the way we just added this feature is that most users don't exclude any domains, but we've added a check that is run for every row anyway.

To speed up the queries for users who aren't using the feature, we could first check the user's settings and then dynamically build the query. But we don't have to, because we can accomplish the same effect within one static query.

We can change our domain exclusion filter to first check whether the user has any excluded domains:

AND (
    NOT EXISTS (
        SELECT 1
        FROM user_excluded_domains
        WHERE user_id = ?1
    )
    OR NOT EXISTS (
           SELECT 1
           FROM user_excluded_domains ued
           WHERE user_id = ?1
           AND ued.domain = i.domain
    )
)

Since the OR short-circuits, if the first NOT EXISTS returns true (when the user has no excluded domains), SQLite never evaluates the correlated subquery at all.

The first NOT EXISTS clause does not reference any column in items, so SQLite can evaluate it once and reuse the boolean result for all of the rows. This "uncorrelated scalar subquery" is extremely cheap to evaluate and, when it returns true, lets us short-circuit and skip the more expensive correlated subquery that checks each item's domain against the exclusion list.

Here is the query plan for this updated query. Note how the second subquery says SCALAR SUBQUERY, whereas the third one is a CORRELATED SCALAR SUBQUERY. The latter is the per-row check, but it can be skipped by the second subquery.

QUERY PLAN
   |--SEARCH i USING INDEX idx_items_lang_published (lang=? AND published>? AND published<?)
   |--LIST SUBQUERY 1
   |  `--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)
   |--SCALAR SUBQUERY 2
   |  `--SEARCH user_excluded_domains USING COVERING INDEX sqlite_autoindex_user_excluded_domains_1 (user_id=?)
   `--CORRELATED SCALAR SUBQUERY 3
      `--SEARCH ued USING COVERING INDEX sqlite_autoindex_user_excluded_domains_1 (user_id=? AND domain=?)

Benchmarking

To test the performance of each of these queries, I replaced the SELECT * with SELECT COUNT(*) and used a simple bash script to invoke the sqlite3 binary 100 times for each query on my laptop. Starting up the sqlite3 process each time adds overhead, but we're comparing relative differences.

At the time of this benchmark, the last week had 235,975 items, 144,229 of which were in English. The two example users I tested this for below only look for English content.

User Without Excluded Domains

This test represents most users, who have not configured any excluded domains:

Approach Min (ms) Max (ms) Avg (ms) Stddev (ms) Diff (ms) Diff (%)
Baseline (no filter) 67 91 72.7 4.7
Correlated Subquery 80 108 85.2 5.5 +12.5 +17.1%
With Short-Circuit 69 91 72.7 3.8 +0 +0%

This shows that the short-circuit query adds practically no overhead for users without excluded domains, whereas the correlated subquery alone makes queries 17% slower for these users.

User with Excluded Domains

This test uses an example user that has excluded content from 2 domains:

Approach Min (ms) Max (ms) Avg (ms) Stddev (ms) Diff (ms) Diff (%)
Baseline (no filter) 68 99 76.2 7.6
Correlated Subquery 84 112 90.5 6.8 +14.3 +18.7%
With Short-Circuit 82 109 88.5 8.1 +12.3 +16.1%

In this case, we do need to check each row against the domain filter. But this shows that the short-circuit still adds no overhead on top of the query.

Conclusion

When using SQL subqueries to filter down result sets, it's worth thinking about whether each subquery is really needed for most users or most queries. If the check is needed most of the time, this approach won't help. However if the per-row check isn't always needed, using an uncorrelated scalar subquery to short-circuit a condition can dramatically speed up the average case with practically zero overhead.

This is extra important because the slow-down from each additional subquery compounds. In this blog post, I described and benchmarked a single additional filter. But this is only one of multiple subquery filters.

Earlier, I also mentioned that users had asked for a way to filter out paywalled content. This works similarly to filtering out content from excluded domains. Some users opt-in to hiding paywalled content. For those users, we check if each item is paywalled. If so, we check if it comes from a site the user has specifically allowed paywalled content from (because they have a subscription). I used the same uncorrelated subquery approach to first check if the feature is enabled for the user and, only then, does SQLite need to check each row.

Concretely, the paywalled content filter subquery looks like:

AND (
    (
        SELECT COALESCE(hide_paywalled_content, 0) = 0
        FROM users
        WHERE user_id = ?1
    ) -- note these parentheses are needed so SQLite doesn't mistakenly think this query is correlated with `items`
    OR COALESCE(i.is_paywalled, 0) = 0
    OR i.domain IN (
        SELECT domain
        FROM user_paywall_allowed_domains
        WHERE user_id = ?1
    )
)

In short, a trivial uncorrelated scalar subquery can help us short-circuit and avoid a more expensive per-row check when we don't need it.

Appendix: NOT EXISTS vs NOT IN vs LEFT JOIN

There are multiple ways to exclude rows from an SQL query.

Here are the results from the same benchmark I ran above, but with two other ways of checking for whether an item comes from an excluded domain.

The NULL-safe NOT IN version of the query uses the subquery:

...
AND (
    i.domain IS NULL
    OR i.domain NOT IN (
        SELECT domain
        FROM user_excluded_domains
        WHERE user_id = ?1
    )
)

The LEFT JOIN variation joins items with user_excluded_domains and then checks for NULL:

SELECT *
FROM items i
LEFT JOIN user_excluded_domains ued on ued.user_id = ?1 AND ued.domain = i.domain
WHERE i.lang IN (SELECT lang FROM user_languages WHERE user_id = ?1)
AND i.published BETWEEN ?2 AND ?3
AND ued.domain IS NULL

And here are the full benchmarks:

User Without Excluded Domains

Approach Min (ms) Max (ms) Avg (ms) Stddev (ms) Diff (ms) Diff (%)
Baseline (no filter) 67 91 72.7 4.7
NOT EXISTS (no short-circuit) 80 108 85.2 5.5 +12.5 +17.1%
NOT EXISTS + short-circuit 69 91 72.7 3.8 +0 +0%
NULL-safe NOT IN (no short-circuit) 75 111 79.5 7.1 +6.8 +9.3%
NULL-safe NOT IN + short-circuit 69 103 74.8 6.6 +2.1 +2.8%
LEFT JOIN (no short-circuit) 74 100 79.1 5.1 +6.4 +8.8%
LEFT JOIN + short-circuit 76 103 84.4 7.4 +11.7 +16.0%

For users without excluded domains, we can see that the NOT EXISTS query using the short-circuit wins and adds no overhead.

User With Excluded Domains

Approach Min (ms) Max (ms) Avg (ms) Stddev (ms) Diff (ms) Diff (%)
Baseline (no filter) 68 99 76.2 7.6
NOT EXISTS (no short-circuit) 84 112 90.5 6.8 +14.3 +18.7%
NOT EXISTS + short-circuit 82 109 88.5 8.1 +12.3 +16.1%
NULL-safe NOT IN (no short-circuit) 83 112 89.7 8.4 +13.5 +17.7%
NULL-safe NOT IN + short-circuit 84 112 91.3 8.2 +15.1 +19.8%
LEFT JOIN (no short-circuit) 81 107 86.3 6.7 +10.1 +13.2%
LEFT JOIN + short-circuit 82 126 89.8 7.7 +13.6 +17.8%

For users who do have excluded domains, the LEFT JOIN is faster than the NOT EXISTS version. However, this version raises the exact problem this whole blog post is designed to address. Since joins happen no matter what, we cannot use the short-circuit to avoid the overhead for users without excluded domains. At least for now, this is why I've gone with the NOT EXISTS subquery using the short-circuit.


Discuss on Hacker News, Lobsters, r/programming, r/sqlite.


#potofhoney #scour

联系我们 contact @ memedata.com