How to Improve a Perfect Join Algorithm

Original link: https://remy.wang/blog/ya-fast.html

## Improving Yannakakis's Optimal Join Algorithm

Yannakakis's algorithm offers theoretically optimal performance for acyclic joins, but in practice it is usually 2-3x slower than a conventional hash join. Recent research focuses on closing this gap. A key problem is the overhead of the semijoin passes, which can be wasted work even when the input relations already join cleanly.

One improvement uses **Bloom filters** to represent semijoin results compactly; compared with hash tables they greatly speed up lookups and insertions, especially when the filter fits in the CPU cache.

Another enhancement, **aggregate pushdown**, streamlines queries that ask for aggregates (such as `MAX` or `AVG`) by performing the aggregation during the hash-table build, effectively folding multiple passes into a single operation.

Further optimizations include deferring result expansion during the join (producing a nested representation of the matches) and performing **on-the-fly semijoins** inside the loop nest, deleting mismatched tuples as they are encountered. This achieves linear time complexity by "piggybacking" the semijoins and avoiding materialization of large intermediate results.

These advances, detailed in forthcoming papers (ICDT 2026, CIDR 2026), aim to make the theoretically optimal join algorithm practical and to encourage database vendors to consider implementing it.

Hacker News: 12 points, posted by remywang 23 hours ago.

Original article
4 Ways to Improve A Perfect Join Algorithm

Based on Database Theory in Action: Yannakakis's Algorithm to appear at ICDT 2026

Last time I covered one of my favorite results from database theory: Yannakakis's algorithm for acyclic joins. The algorithm is very simple, yet achieves instance-optimality on a large class of queries, meaning it's basically perfect as far as asymptotic complexity goes. So why doesn't every database implement it? The main reason is that the algorithm is actually not very fast in practice. To be exact, it's usually 2-3x slower than hash join. Let's look at an example to understand this. Consider the natural join:

Q = R_1(x_1, x_2) \bowtie R_2(x_2, x_3) \bowtie R_3(x_3, x_4) \bowtie R_4(x_4, x_5)

Here I'm using R_1(x_1, x_2) to imply the schema of R_1 is \{x_1, x_2\}. In SQL the query looks like this:
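The SQL itself appears to have been lost in extraction; the following is a reconstruction from the schema above (table and column names flattened from the subscript notation, joining on the shared columns):

```sql
-- Natural join of the four relations on their shared columns
SELECT *
FROM R1
  JOIN R2 USING (x2)
  JOIN R3 USING (x3)
  JOIN R4 USING (x4);
```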

(If Bloom filters are new to you, their Wikipedia page is worth a read - they're actually really simple and elegant!) All you need to know for now is that a Bloom filter can store a very large set very compactly: a filter for an in-memory DB table can usually fit in the CPU cache, drastically accelerating both insertions and lookups. For Yannakakis's algorithm, we can store the semijoin result in a Bloom filter instead of a hash table:
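Here's a minimal sketch of the idea in Python (my own illustration, not the post's implementation; the toy Bloom filter uses two hash functions over a small bit array):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k=2 hash functions over an m-bit array."""
    def __init__(self, m=1 << 16):
        self.m = m
        self.bits = bytearray(m // 8)

    def _hashes(self, key):
        # Derive two independent hash positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return (h1 % self.m, h2 % self.m)

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, key):
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._hashes(key))

def semijoin(left, right, key_left, key_right):
    """left ⋉ right: keep left tuples whose join key appears in right.
    A Bloom filter never drops a real match; it may only let through
    a few false positives, which a later pass can still eliminate."""
    bf = BloomFilter()
    for t in right:
        bf.add(t[key_right])
    return [t for t in left if t[key_left] in bf]

# R1(x1, x2) ⋉ R2(x2, x3): drop R1 tuples with no matching x2 in R2.
R1 = [(1, 10), (2, 20), (3, 30)]
R2 = [(10, 100), (30, 300)]
print(semijoin(R1, R2, key_left=1, key_right=0))
```

Because the filter is a fixed-size bit array rather than a pointer-chasing hash table, the build and probe loops are cache-friendly, which is where the practical speedup comes from.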

Bloom filters have been shipping in production database systems for several years, and bitmap filters have been in SQL Server for even longer. It's just that people are only recently realizing that these filters pair really well with the theoretical guarantees of Yannakakis's algorithm.

Aggregate pushdown

While Bloom filters accelerate Yannakakis's algorithm "horizontally" at the level of each semijoin operation, the next idea works "vertically" by replacing semijoins altogether. The key insight is that, in most cases, we don't actually need the millions of tuples coming out of the join, but rather some aggregate information over them. For example, we might want to know the best rated movie in some category, or the average rating of some director. In SQL, this means instead of SELECT *, we usually GROUP BY a small number of columns and output some aggregates per group. Using our slightly contrived example, we'll be querying:
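To make the mechanism concrete, here is a sketch in Python (my own illustration, using a hypothetical query shape: group by one column, take the MAX of another over a two-relation join). The aggregate is folded into the hash-table build, so the joined tuples are never materialized:

```python
from collections import defaultdict

def max_per_group_pushdown(R, S):
    """Roughly: SELECT x1, MAX(x5) FROM R JOIN S USING (x2) GROUP BY x1,
    with the MAX pushed down into the hash-table build on S."""
    # Build phase: instead of keeping a list of matching S tuples per key,
    # keep only the running MAX of x5 -- the aggregate replaces the bucket.
    build = {}
    for x2, x5 in S:
        if x2 not in build or x5 > build[x2]:
            build[x2] = x5
    # Probe phase: combine the partial aggregates per group; the join
    # result itself is never expanded.
    out = defaultdict(lambda: float("-inf"))
    for x1, x2 in R:
        if x2 in build:
            out[x1] = max(out[x1], build[x2])
    return dict(out)

R = [("a", 1), ("a", 2), ("b", 2)]
S = [(1, 10), (2, 99), (2, 7)]
print(max_per_group_pushdown(R, S))  # {'a': 99, 'b': 99}
```

Note how a key with many matching S tuples costs the same as a key with one: the aggregate collapses the bucket, which is exactly why multiple passes fold into a single operation.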

The predicate transfer work implements some pretty neat SIMD acceleration for Bloom filters; the aggregate pushdown paper precisely defines when the optimization applies; our nested Yannakakis paper defines an entire new nested relational algebra and describes a vectorized implementation; and for the last approach (which we call TreeTracker Join), our paper introduces an additional twist to jump over loops when a variable used in a hash probe was introduced in an outer loop.

If this post whets your appetite, check out our very short survey paper for pointers to the literature and start digging from there. There's also an exciting new paper to appear at CIDR 2026 on how SQL Server "accidentally" implemented Yannakakis's algorithm, which hadn't come out yet when we wrote our survey. I hope you now see that optimal join algorithms are not that mysterious after all, and you should demand that your DB vendor start considering them! Or better yet, if you are a DB maintainer, consider implementing one of the many approaches here, and talk to us researchers so we can better help you.
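The on-the-fly semijoin idea can be sketched as follows (my own Python, not the paper's code, and without the loop-jumping twist): a hash-probe loop nest over the path join where a failed probe removes the tuple that caused it, so the same dead end is never explored twice:

```python
from collections import defaultdict

def path_join_with_otf_semijoin(R1, R2, R3):
    """R1(x1,x2) ⋈ R2(x2,x3) ⋈ R3(x3,x4) as a nested loop of hash probes.
    On a probe miss, the offending tuple is deleted from its bucket, so
    each mismatched tuple is visited at most once (the "piggybacked"
    semijoin that yields linear time)."""
    H2 = defaultdict(list)            # x2 -> matching R2 tuples
    for t in R2:
        H2[t[0]].append(t)
    H3 = defaultdict(list)            # x3 -> matching R3 tuples
    for t in R3:
        H3[t[0]].append(t)

    out = []
    for x1, x2 in R1:
        if x2 not in H2:
            continue                  # dead end already known: skip in O(1)
        alive = []
        for _, x3 in H2[x2]:
            if x3 not in H3:
                continue              # probe miss: drop this R2 tuple
            alive.append((x2, x3))
            for _, x4 in H3[x3]:
                out.append((x1, x2, x3, x4))
        if alive:
            H2[x2] = alive            # keep only the R2 tuples that matched
        else:
            del H2[x2]                # whole bucket dead: future probes fail fast
    return out
```

For example, with `R1 = [(1, "a"), (2, "a")]`, the R2 tuple `("a", "q")` that has no continuation in R3 is pruned during the first outer iteration, so the second iteration never revisits it.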
