PostgreSQL 优化器的十年改进
Ten years of improvements in PostgreSQL's optimizer

原始链接: https://rmarcus.info/blog/2024/04/12/pg-over-time.html

自 2014 年以来,我花了十年时间通过先进的开源查询优化器 PostgreSQL 来完善我在查询优化方面的专业知识。 为了评估其在此期间的进展,我对 PostgreSQL 版本 8 至 16 进行了连接顺序基准 (JOB) 测试。测量了每个版本的第 90 个百分位查询延迟。 所有版本都安装在运行 Arch Linux 的 Docker 容器中,并使用 GCC 13.2 进行编译。 为了隔离查询优化器性能,我分配了充足的共享内存 (8 GB),保持 IO 和索引等其他因素不变。 通过在记录中值之前多次执行查询来预热缓存,我评估了每个版本的查询优化器的效率。 我的发现表明 PostgreSQL 的后期性能有了显着的进步,在版本 8 和 16 之间,尾部延迟减少了近一半。如果线性考虑,这种改进平均每个主要版本大约 15%。 然而,重要的是要记住,性能提升不仅仅归功于查询优化器;它还取决于查询优化器。 并行处理和 JIT 编译等增强功能也做出了重大贡献。 对于那些感兴趣的人来说,采用回归技术的详细查询分布分析可以进一步深入了解每个 PostgreSQL 更新的影响。 最后,PostgreSQL 的不断发展的性质意味着持续的比较研究是必要的; 因此,研究人员必须考虑旧技术与当代版本的关系,而不是绝对百分比增益。 对于实际应用,请考虑迁移到 PostgreSQL 16,以获得工作负载尾部延迟的潜在巨大改进。

PostgreSQL 是一种流行的开源关系数据库管理系统,为改进其优化过程提供了大量机会。 以下是作者强调的三个关键领域: 1. 数据准确性:PostgreSQL 需要精确的成本数据才能实现最佳查询执行。 改进的领域包括收集准确的系统调用延迟数据和增强外键统计数据。 不同系统的系统调用延迟差异很大,但 PostgreSQL 目前缺乏内置机制来衡量这一重要数据。 此外,虽然沿着外键连接理想情况下应该产生出色的查询计划,但这并不总是发生。 为了应对这些挑战,作者建议自动测量系统调用成本并创建基于外键的跨表相关性估计。 2. 实时适应:目前,PostgreSQL 采用静态方法来进行查询规划。 然而,通过延迟或替代场景规划进行实时适应将大大有利于复杂且资源密集型的查询。 通过在执行的每个阶段考虑可用数据,后续阶段可以获得有价值的见解,从而获得卓越的整体查询结果。 3. 先进技术和机器学习集成:尽管机器学习在查询优化方面具有实质性进步的潜力,但现有的实现并没有给作者留下深刻的印象。 相反,专注于构建更强大的成本模型并利用机器学习进行成本发现和估算可以提供更有希望的结果。 采用任意非嵌套结构或精心绘制草图方法等先进技术可能会进一步增强 PostgreSQL 的功能。 然而,新颖的优化策略的集成必须在复杂性和可用性之间取得平衡。
相关文章

原文

12 Apr 2024

As a query optimization researcher, I’ve spent the last 10 years of my life playing with, learning from, and building on top of the most sophisticated open source query optimizer out there, PostgreSQL. I recently wondered how much PostgreSQL had improved over the decade since I started working on databases. While changelogs and opinion pieces were plentiful, I couldn’t find any strong empirical comparisons, so I decided to run the join order benchmark (JOB) on PostgreSQL 8 through 16. I recorded the 90th percentile query latency for each database version.

Graph of P90 tail latency for different PostgreSQL versions. Speed improves drastically over time.

I built each version of PostgreSQL using GCC 13.2 inside a Docker container with Arch Linux. Since I wanted to measure the quality of the query optimizer, and not index/IO performance, I set shared_buffers to 8GB (large enough to hold the entire database). I also set work_mem to 8MB for all versions. Each query is executed once to warm the cache, then the median latency of 5 additional runs in recorded.

Overall, PostgreSQL’s tail performance has improved drastically, although versions 13 through 16 have been mostly stable. Comparing version 8 to version 16, PostgreSQL’s optimizer has dropped tail latency by nearly half in the last 10 years!

We can also investigate the entire query distribution (note the log scale):

Box plots of query latency for each major version. There is a slight slope downwards.

We can use regression analysis to (1) confirm that the downward slope in latency is significant, and (2) quantify how much improvement is brought by each version of PostgreSQL. If we regress the PostgreSQL major version number against query latency, we see that each new major version of PostgreSQL brings, on average, a 15% performance improvement on the Join Order Benchmark (). However, a linear model is arguably a poor measure of the change ().

Of course, not all of these improvements are attributable to the query optimizer. Improvements to the execution engine – from parallel workers to just-in-time (JIT) compilation – also play a role. It would be interesting to investigate how each query plan in JOB has changed over the year… maybe next time!

Quantifying the improvement aside:

  • Upgrade your database! Going from PostgreSQL 8 to 16 has the potential to massively improve your workload’s tail latency.
  • Researchers should note that PostgreSQL is a bit of a moving target. Learned query optimization research has compared with different versions of PostgreSQL over time (e.g., Neo and Bao compare with version 11, whereas newer work compare with version 14, 15, or 16.) So just because an older technique improves on PostgreSQL by 30%, and a newer technique only improves on PostgreSQL by 25%, the newer technique may be comparing against a stronger PostgreSQL.

You can also check out the raw data for yourself.

联系我们 contact @ memedata.com